
    Understanding Transformer Models in ML

By Team_AIBS News · June 14, 2025


If you've spent any time in ML over the past couple of years, you've probably heard of Transformers. And no, not Optimus Prime, though they've proven to be just as powerful.

Transformers quietly power hundreds of models like GPT and BERT. Whether we're prompting a chatbot or using autocomplete, chances are there's a transformer working behind the scenes.

But what exactly is a transformer? Why did it become the backbone of modern AI? And how did we go from slow, sequential models like RNNs to architectures that scale effortlessly across huge datasets and hardware?

In the last article, we looked at recurrent neural networks (RNNs). While they marked an important step forward, the fact that they are designed to handle sequences step by step came with limitations: long training times, vanishing gradients, and more.

Introduced in the paper Attention Is All You Need, transformers replaced recurrence with a simple idea: let every token attend to every other token directly, all at once.

Instead of reading through a sentence word by word, the transformer scans the entire input simultaneously and decides what to focus on. This operation is called self-attention: it lets the model dynamically weight the relevance of different parts of the input when processing each token.

For the math and code behind attention, feel free to check out my article here.

At the core of every transformer model is a stack of identical blocks. Each block consists of the following:

    • Multi-head self-attention
• A position-wise feedforward network
    • Layer normalization and residual connections

Multi-Head Self-Attention

Recall from the last article that self-attention computes:

Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V

Earlier, we introduced self-attention as a way for each token to compute a weighted average over the other tokens. But what if a single attention mechanism isn't enough to capture the variety of relationships in a sequence?

That's where we need multi-head attention.

We split the Q, K, V matrices into multiple heads and run separate self-attention operations in parallel. Each head can specialize in different aspects, such as syntactic dependencies or semantic roles.

Mathematically, each head computes:

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

Then we concatenate the outputs of all heads and apply a final projection:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O

This gives the model more flexibility and capacity without dramatically increasing depth.
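
As a rough sketch of the mechanics (the learned per-head projections W_i^Q, W_i^K, W_i^V and the final output projection W^O are omitted for brevity; names and shapes are illustrative, not from the paper):

import torch
import torch.nn.functional as F

def multi_head_attention(Q, K, V, num_heads):
    # Q, K, V: (batch, seq_len, d_model); d_model must be divisible by num_heads
    batch, seq_len, d_model = Q.shape
    d_head = d_model // num_heads

    # Split the last dimension into heads: (batch, num_heads, seq_len, d_head)
    def split(x):
        return x.view(batch, seq_len, num_heads, d_head).transpose(1, 2)

    q, k, v = split(Q), split(K), split(V)

    # Scaled dot-product attention, all heads computed in parallel
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5   # (batch, heads, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)
    out = weights @ v                                   # (batch, heads, seq_len, d_head)

    # Concatenate the heads back into d_model
    return out.transpose(1, 2).reshape(batch, seq_len, d_model)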

Feedforward Network

Each position (token) in the sequence is processed independently by a two-layer feedforward network:

FFN(x) = max(0, x W_1 + b_1) W_2 + b_2

This adds non-linearity and helps the model refine its representations. Typically, the hidden layer is 4x wider than the input/output size.

Layer Normalization and Residual Connections

Both the attention and feedforward sub-layers are wrapped in a residual connection followed by LayerNorm:

output = LayerNorm(x + Sublayer(x))

Residual connections preserve input information and stabilize gradients, while LayerNorm prevents vanishing activations across deep stacks.

Here's a minimal transformer block in PyTorch.

import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, ff_hidden_dim):
        super().__init__()
        # Multi-head self-attention sub-layer
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        # Position-wise feedforward sub-layer
        self.ff = nn.Sequential(
            nn.Linear(d_model, ff_hidden_dim),
            nn.ReLU(),
            nn.Linear(ff_hidden_dim, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention, then residual connection and LayerNorm
        attn_output, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_output)
        # Feedforward, then residual connection and LayerNorm
        ff_output = self.ff(x)
        return self.norm2(x + ff_output)

Note how residual connections and layer norm are applied after both the attention and feedforward layers.
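
A quick sanity check (the dimensions here are arbitrary):

import torch

block = TransformerBlock(d_model=64, num_heads=8, ff_hidden_dim=256)
x = torch.randn(2, 10, 64)   # (batch, seq_len, d_model)
out = block(x)
print(out.shape)             # torch.Size([2, 10, 64])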

Now it's time to look at how these blocks are assembled into full models. Transformers can take one of three architectural forms, depending on the task:

• Encoder-Decoder
• Encoder-Only
• Decoder-Only

Each setup reuses the same core transformer blocks but wires them differently. The wiring choices define how the model processes data and what it's good at.

    Encoder-Decoder

Designed for sequence-to-sequence tasks such as machine translation, where we want to map one sequence to another.

Now let's talk about the structure.

The encoder processes the input sequence x = (x1, x2, …, xn)

    • Applies self-attention + feedforward blocks
• Outputs contextual representations for each input token

The decoder generates the output sequence y = (y1, y2, …, ym)

• Uses self-attention over previous output tokens
• Attends to the encoder's outputs via cross-attention

Then, cross-attention is:

CrossAttention(Q, K, V) = softmax(Q Kᵀ / √d_k) V, where the queries Q come from the decoder and the keys K and values V come from the encoder outputs.

This allows the decoder to look back at the encoded source while generating output tokens.
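
In PyTorch terms, cross-attention is just attention where the query comes from the decoder and the keys and values come from the encoder. A minimal sketch (shapes are illustrative):

import torch
import torch.nn as nn

cross_attn = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)

enc_out = torch.randn(2, 12, 64)   # encoder outputs: (batch, src_len, d_model)
dec_x = torch.randn(2, 7, 64)      # decoder states:  (batch, tgt_len, d_model)

# Queries from the decoder; keys and values from the encoder
out, _ = cross_attn(query=dec_x, key=enc_out, value=enc_out)
print(out.shape)                   # torch.Size([2, 7, 64])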

Encoder-Only

These models focus on encoding inputs into rich contextual representations. They're useful when we want to understand text rather than generate it. The goal is to pretrain for classification, sentence similarity, and so on.

Structure

• Just the encoder stack
• Pretrained using objectives like masked language modeling (MLM)
• Examples include BERT, RoBERTa, DistilBERT

These models typically feed the final-layer embeddings into a classification head or another task-specific module, as in the sketch below.
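
For instance, a simple sentence classifier might pool the encoder's final-layer embeddings and pass them through a linear head (a simplified sketch, not BERT's exact recipe, which uses the [CLS] token instead of pooling):

import torch
import torch.nn as nn

d_model, num_classes = 64, 3
encoder_output = torch.randn(2, 10, d_model)   # final-layer embeddings: (batch, seq_len, d_model)

# Mean-pool over tokens, then classify
pooled = encoder_output.mean(dim=1)            # (batch, d_model)
classifier = nn.Linear(d_model, num_classes)
logits = classifier(pooled)                    # (batch, num_classes)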

Decoder-Only

These models drop the encoder and rely on causal self-attention to generate sequences autoregressively.

They're designed for tasks where the model must speak rather than listen, such as code completion, summarization, text generation, dialogue systems, and more.

Structure

• A stack of decoder blocks
• The model sees only past tokens at each step; masked attention enforces this

This makes sure a token can never attend to future tokens. Examples include GPT-4, Claude, Grok, and any model trained for next-token prediction.
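
The mask itself is typically an upper-triangular boolean matrix that blocks attention to future positions. A minimal sketch:

import torch

seq_len = 5
# True above the diagonal marks the positions a token is NOT allowed to attend to
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# nn.MultiheadAttention accepts this via its attn_mask argument, e.g.
# attn_output, _ = attn(x, x, x, attn_mask=causal_mask)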

Self-attention is order-agnostic. Given a set of input tokens, the attention mechanism doesn't know whether a word came first, last, or somewhere in between, because it sees the whole sequence at once.

Many domains like natural language are sequential, and word order matters. So we need to inject some notion of position into the input if the model is going to process sequences correctly.

We will use positional encoding to add a positional signal to each input embedding. This way, the model can learn patterns that depend not just on token identity but also on token position.

Option 1: Sinusoidal Positional Encoding

The Attention Is All You Need paper proposed a fixed, deterministic approach using sine and cosine functions of different frequencies. For a position pos and dimension i, the encoding is:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

    Even dimensions use sine and odd dimensions use cosine.

This scheme gives each position a unique pattern of values and lets the model generalize to sequences longer than those seen during training, because it isn't learned.

Option 2: Learned Positional Embeddings

Later models treat position like any other token and learn a vector for each position.

Then the input to the transformer becomes:

input_i = token_embedding_i + position_embedding_i

Learned embeddings are more flexible, but they lack the generalization property of sinusoidal encodings.
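
A minimal sketch of the learned variant, using an nn.Embedding over position indices (max_len and the names are illustrative):

import torch
import torch.nn as nn

max_len, d_model = 512, 64
pos_embedding = nn.Embedding(max_len, d_model)   # one learned vector per position

token_emb = torch.randn(2, 10, d_model)          # (batch, seq_len, d_model)
positions = torch.arange(10).unsqueeze(0)        # (1, seq_len), broadcast over the batch
x = token_emb + pos_embedding(positions)         # input to the transformer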

Here's the sinusoidal encoding (Option 1) in PyTorch.

import torch
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """
    Args:
        seq_len (int): Length of the sequence.
        d_model (int): Embedding dimensionality.

    Returns:
        Tensor of shape (seq_len, d_model): positional encodings.
    """

    # Create a tensor of shape (seq_len, 1) for position indices
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)

    # Create a tensor of shape (1, d_model) for dimension indices
    i = torch.arange(d_model, dtype=torch.float32).unsqueeze(0)

    # Compute the angle rates: 1 / 10000^(2i/d_model)
    angle_rates = 1 / torch.pow(10000, (2 * (i // 2)) / d_model)

    # Multiply position by angle rates (broadcasting)
    angle_rads = pos * angle_rates  # shape: (seq_len, d_model)

    # Initialize the positional encoding matrix
    pe = torch.zeros(seq_len, d_model)

    # Apply sine to even indices (0, 2, 4, ...)
    pe[:, 0::2] = torch.sin(angle_rads[:, 0::2])

    # Apply cosine to odd indices (1, 3, 5, ...)
    pe[:, 1::2] = torch.cos(angle_rads[:, 1::2])

    return pe

• Each row in the output represents a position in the sequence
• Each column represents a frequency-based signal in that dimension
• Each position in the sequence gets a unique vector by combining sine and cosine waves of different frequencies

    These vectors:

• Are deterministic, with no training required
• Let the model learn relative positions
• Generalize to longer sequences at inference time, because they aren't bounded by a fixed learned table
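
Usage is just an element-wise addition to the token embeddings:

import torch

seq_len, d_model = 10, 64
token_emb = torch.randn(2, seq_len, d_model)           # (batch, seq_len, d_model)

pe = sinusoidal_positional_encoding(seq_len, d_model)  # (seq_len, d_model)
x = token_emb + pe.unsqueeze(0)                        # broadcast over the batch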

So why did transformers become the backbone of modern AI? There are two answers:

• Architectural efficiency: transformers are parallel and easy to optimize
• Empirical scaling laws: performance improves predictably with scale

    Parallelism and Architectural Simplicity

Recall that traditional sequence models process tokens sequentially: the output at time step t depends on the computations from t-1. This limits parallelism, which in turn makes training slow.

Transformers replace this with self-attention, which computes all token interactions in parallel. This:

• Makes full use of GPUs (matrix-matrix ops)
• Removes the step-by-step dependency across the sequence (though attention memory grows quadratically with sequence length)
• Allows batching over longer sequences

Scaling Laws (With Math)

In Scaling Laws for Neural Language Models by Kaplan et al. (2020), the authors demonstrate that transformer models exhibit power-law scaling behaviour with respect to three core factors: model size N, dataset size D, and compute budget C.

Specifically:

• The paper shows that test loss decreases as a power law of each of these variables when the other two are not bottlenecks
• This relationship is expressed through formulas such as:

L(N) = (N_c / N)^α_N,  L(D) = (D_c / D)^α_D,  L(C) = (C_c / C)^α_C

What this means: scaling model size or dataset size yields consistent and predictable improvements in performance that follow sublinear power-law trends.
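
As a back-of-the-envelope illustration, using the model-size exponent α_N ≈ 0.076 reported in the paper (and ignoring the other terms):

# Relative loss after scaling model size N by 10x, under L(N) = (N_c / N) ** alpha_N
alpha_N = 0.076           # exponent reported by Kaplan et al. (2020)
ratio = 10 ** -alpha_N    # L(10 * N) / L(N)
print(f"{ratio:.3f}")     # ≈ 0.839, i.e. roughly a 16% reduction in loss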

Emergent Abilities at Scale

These are capabilities that don't exist in smaller models but suddenly appear beyond certain scale thresholds. Examples include:

• Few-shot learning
• In-context learning and tool use
• Instruction following without fine-tuning
• Reasoning across modalities

There you have it: the ideas behind transformers. GLHF!


