If you've spent any time in ML over the past couple of years, you've probably heard of Transformers. And no, not Optimus Prime, although they've proven to be just as powerful.
Transformers quietly power hundreds of models like GPT and BERT. Whether we're prompting a chatbot or using autocomplete, chances are there's a transformer working behind the scenes.
But what exactly is a transformer? Why did it become the backbone of modern AI? And how did we go from slow, sequential models like RNNs to architectures that scale effortlessly across huge datasets and hardware?
In the last article, we looked at recurrent neural networks (RNNs). While they marked an important step forward, the fact that they're designed to handle sequences step by step came with limitations: long training times, vanishing gradients, and more.
Introduced in the paper Attention Is All You Need, transformers replaced recurrence with a simple idea: let every token attend to every other token directly, all at once.
Instead of reading through a sentence word by word, the transformer scans the entire input simultaneously and decides what to focus on. This operation is called self-attention, and it allows the model to dynamically weight the relevance of different parts of the input when processing each token.
For the math and code behind attention, feel free to check out my article here.
At the core of every transformer model is a stack of identical blocks. Each block consists of the following:
- Multi-head self-attention
- A position-wise feedforward network
- Layer normalization and residual connections
Multi-Head Self-Attention
Recall from the last article that self-attention computes:
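$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where Q, K, and V are the query, key, and value matrices, and d_k is the key dimension.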
Earlier, we introduced self-attention as a way for each token to compute a weighted average over the other tokens. But what if a single attention mechanism isn't enough to capture the variety of relationships in a sequence?
That's where we need multi-head attention.
We split the Q, K, V matrices into multiple heads and run separate self-attention operations in parallel. Each head can specialize in attending to different aspects, such as syntactic dependencies or semantic roles.
Mathematically, each head computes:
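$$\text{head}_i = \text{Attention}(QW_i^Q,\; KW_i^K,\; VW_i^V)$$

where $W_i^Q$, $W_i^K$, and $W_i^V$ are the learned projection matrices for head i.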
Then we concatenate the outputs of all heads and apply a final projection:
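$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W^O$$

where $W^O$ is the learned output projection matrix.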
This gives the model more flexibility and capacity without dramatically increasing depth.
Feedforward Network
Each position (token) in the sequence is processed independently by a two-layer feedforward network:
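$$\text{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2$$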
This adds non-linearity and helps the model refine its representations. Typically, the hidden layer is 4x wider than the input/output size.
Layer Normalization and Residual Connections
Both the attention and feedforward sub-layers are wrapped in residual connections followed by LayerNorm:
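$$\text{output} = \text{LayerNorm}(x + \text{Sublayer}(x))$$

where Sublayer is either the multi-head attention or the feedforward network.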
Residual connections preserve input information and stabilize gradients, while LayerNorm prevents vanishing activations during deep stacking.
Here's a minimal transformer block in PyTorch.
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, ff_hidden_dim):
        super().__init__()
        # Multi-head self-attention; batch_first=True expects (batch, seq_len, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        # Position-wise feedforward network
        self.ff = nn.Sequential(
            nn.Linear(d_model, ff_hidden_dim),
            nn.ReLU(),
            nn.Linear(ff_hidden_dim, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention: queries, keys, and values all come from x
        attn_output, _ = self.attn(x, x, x)
        # Residual connection + LayerNorm around the attention sub-layer
        x = self.norm1(x + attn_output)
        # Residual connection + LayerNorm around the feedforward sub-layer
        ff_output = self.ff(x)
        return self.norm2(x + ff_output)
Notice how residual connections and layer norm are applied after both the attention and feedforward layers.
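As a quick sanity check, here is how the block might be used on a batch of random embeddings (the dimensions below are arbitrary, chosen just for illustration):

import torch

block = TransformerBlock(d_model=64, num_heads=4, ff_hidden_dim=256)
x = torch.randn(2, 10, 64)  # (batch, seq_len, d_model)
out = block(x)
print(out.shape)  # torch.Size([2, 10, 64]): the output keeps the input shape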
Now it's time to look at how these blocks are assembled into full models. Transformers can take one of three architectural forms, depending on the task:
- Encoder-Decoder
- Encoder-Only
- Decoder-Only
Each setup reuses the same core transformer blocks, but wires them differently. The wiring choices define how the model processes data and what it's good at.
Encoder-Decoder
Designed for sequence-to-sequence tasks such as machine translation, where we want to map one sequence to another.
Now let's talk about structure.
Encoder processes the input sequence x = (x1, x2, …, xn)
- Applies self-attention + feedforward blocks
- Outputs contextual representations for each input token
Decoder generates the output sequence y = (y1, y2, …, ym)
- Uses self-attention on previous output tokens
- Attends to the encoder's outputs via cross-attention
Then, cross-attention is:
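$$\text{CrossAttention}(Q_{\text{dec}}, K_{\text{enc}}, V_{\text{enc}}) = \text{softmax}\!\left(\frac{Q_{\text{dec}} K_{\text{enc}}^\top}{\sqrt{d_k}}\right)V_{\text{enc}}$$

where the queries come from the decoder states and the keys and values come from the encoder outputs.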
This allows the decoder to look back at the encoded source while generating output tokens.
Encoder-Only
These models focus on encoding inputs into rich contextual representations. They're useful when we want to understand text and don't need to generate it. The goal is to pretrain for classification, sentence similarity, and so on.
Structure
- Just the encoder stack
- Pretrained using objectives like masked language modeling (MLM)
- Examples include BERT, RoBERTa, DistilBERT
These models often feed the final-layer embeddings into a classification head or another task-specific module.
Decoder-Only
These models drop the encoder and rely on causal self-attention to generate sequences autoregressively.
They're designed for tasks where the model must speak rather than listen, such as code completion, summarization, text generation, dialogue systems, and more.
Structure
- A stack of decoder blocks
- The model sees only past tokens at every step; masked attention enforces this
This makes sure that a token can never attend to future tokens. Examples include GPT-4, Claude, Grok, and any model trained for next-token prediction.
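As a rough sketch of how this masking can be implemented, here is one way to build a causal mask and pass it to PyTorch's nn.MultiheadAttention (the sizes are arbitrary, just for illustration):

import torch
import torch.nn as nn

seq_len, d_model, num_heads = 10, 64, 4  # arbitrary sizes for illustration
attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

# Boolean causal mask: True above the diagonal marks positions a token may NOT attend to
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

x = torch.randn(2, seq_len, d_model)  # (batch, seq_len, d_model)
out, _ = attn(x, x, x, attn_mask=causal_mask)  # each token sees only itself and earlier tokens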
Self-attention is order-agnostic. Given a set of input tokens, the attention mechanism doesn't know whether a word came first, last, or somewhere in between, because it sees the whole sequence at once.
Many domains like natural language are sequential, and word order matters. So we need to inject some notion of position into the input if the model is going to process sequences correctly.
We will use positional encoding to add a positional signal to each input embedding. This way, the model can learn patterns that depend not just on token identity but also on token position.
Option 1: Sinusoidal Positional Encoding
The Attention Is All You Need paper proposed a fixed, deterministic approach using sine and cosine functions of different frequencies. For a position pos and dimension i, the encoding is:
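$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$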
Even dimensions use sine and odd dimensions use cosine.
This scheme gives each position a unique pattern of values and allows the model to generalize to sequences longer than those seen during training, because it isn't learned.
Option 2: Learned Positional Embeddings
Later models treat position like any other token and learn a vector for each position.
Then, the input to the transformer becomes:
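$$z_i = x_i + p_i$$

where $x_i$ is the token embedding at position i and $p_i$ is the learned positional embedding for that position.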
Learned embeddings are more flexible, but they lack the generalization property of sinusoidal encodings.
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    """
    Args:
        seq_len (int): Length of the sequence.
        d_model (int): Embedding dimensionality.
    Returns:
        Tensor of shape (seq_len, d_model): positional encodings.
    """
    # Create a tensor of shape (seq_len, 1) for position indices
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    # Create a tensor of shape (1, d_model) for dimension indices
    i = torch.arange(d_model, dtype=torch.float32).unsqueeze(0)
    # Compute the angle rates: 1 / 10000^(2i/d_model)
    angle_rates = 1 / torch.pow(10000, (2 * (i // 2)) / d_model)
    # Multiply position by angle rates (broadcasting)
    angle_rads = pos * angle_rates  # shape: (seq_len, d_model)
    # Initialize the positional encoding matrix
    pe = torch.zeros(seq_len, d_model)
    # Apply sine to even indices (0, 2, 4, ...)
    pe[:, 0::2] = torch.sin(angle_rads[:, 0::2])
    # Apply cosine to odd indices (1, 3, 5, ...)
    pe[:, 1::2] = torch.cos(angle_rads[:, 1::2])
    return pe
- Each row in the output represents a position in the sequence
- Each column represents a frequency-based signal in that dimension
- Each position in the sequence gets a unique vector by combining sine and cosine waves of different frequencies
These vectors:
- Are deterministic, with no training required
- Let the model learn relative positions
- Generalize to longer sequences at inference time, because they're not bounded by a fixed learned table
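As a small illustration of how these encodings are typically combined with token embeddings (the shapes below are arbitrary):

import torch

seq_len, d_model = 10, 64  # arbitrary sizes for illustration
token_embeddings = torch.randn(1, seq_len, d_model)    # (batch, seq_len, d_model)
pe = sinusoidal_positional_encoding(seq_len, d_model)  # (seq_len, d_model)
x = token_embeddings + pe.unsqueeze(0)                 # broadcast the encodings over the batch
print(x.shape)  # torch.Size([1, 10, 64])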
So why did transformers become the backbone of modern AI? There are two answers:
- Architectural efficiency: transformers are parallel and easy to optimize
- Empirical scaling laws: performance improves predictably with scale
Parallelism and Architectural Simplicity
Recall that traditional sequence models process tokens sequentially: the output at time step t depends on the computations from t-1. This limits parallelism, which in turn makes training slow.
Transformers replace this with self-attention, which computes all token interactions in parallel. This:
- Utilizes GPUs well (matrix-matrix ops)
- Keeps the number of sequential operations per layer constant, regardless of sequence length (attention memory, however, grows quadratically with it)
- Allows batching over longer sequences
Scaling Laws (With Math)
In Scaling Laws for Neural Language Models by Kaplan et al. (2020), the authors demonstrate that transformer models exhibit power-law scaling behaviour with respect to three core factors: model size N, dataset size D, and compute budget C.
Specifically:
- The paper shows that test loss decreases as a power law of each of these variables when the other two are not bottlenecks
- This relationship is expressed through formulas such as:
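$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}$$

where $N_c$, $D_c$, $C_c$ and the exponents $\alpha_N$, $\alpha_D$, $\alpha_C$ are constants fitted empirically in the paper.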
What this means: scaling model size or dataset size yields consistent, predictable improvements in performance that follow sublinear power-law trends.
Emergent Abilities at Scale
These are capabilities that don't exist in smaller models but suddenly appear beyond certain thresholds. Examples include:
- Few-shot learning
- In-context learning and tool use
- Instruction following without fine-tuning
- Reasoning across modalities
There you have it, the ideas behind transformers. GLHF!