LLaMA’s modified transformer block transforms the input x as:

$$h = x + \mathrm{Attention}(\mathrm{RMSNorm}(x))$$

$$\mathrm{out} = h + \mathrm{FFN}(\mathrm{RMSNorm}(h))$$
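The data flow of this pre-norm residual block fits in a few lines. Below is a minimal sketch that takes the sub-layers as callables, since the normalization, attention, and feed-forward pieces are each covered individually in the rest of this section:

```python
def block(x, attention, ffn, norm1, norm2):
    # Pre-norm residual block: normalize first, transform,
    # then add the untouched input back in.
    h = x + attention(norm1(x))   # h = x + Attention(RMSNorm(x))
    return h + ffn(norm2(h))      # out = h + FFN(RMSNorm(h))
```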
Instead of LayerNorm, LLaMA uses RMSNorm:

$$\mathrm{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \cdot \gamma$$

where ε is a small constant for numerical stability and γ is a learned scaling parameter.
This omits the mean-centering step of LayerNorm, improving speed while preserving numerical stability.
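A minimal PyTorch sketch of this normalization, written directly from the formula above (not LLaMA’s exact code):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(dim))  # learned scale γ

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Divide by the root mean square of the last dimension;
        # no mean is subtracted, unlike LayerNorm.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.gamma
```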
RoPE encodes position via rotation in complex space. Instead of absolute position embeddings, LLaMA uses rotary embeddings that encode position directly into the attention mechanism by rotating the query (Q) and key (K) vectors. The rotations are applied to the queries and keys before the dot product:
Let

$$\theta_i = 10000^{-2i/d}, \qquad R(m\theta_i) = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}$$

and let R_m be the block-diagonal matrix that rotates the i-th coordinate pair of a vector by the angle mθ_i.

Then

$$q'_m = R_m q_m, \qquad k'_n = R_n k_n, \qquad \langle q'_m, k'_n \rangle = \langle R_{m-n}\, q_m, k_n \rangle$$

so the attention score depends only on the offset m − n. This ensures relative positional dependence without explicit position vectors, and the resulting shift-invariance enables better extrapolation and generalization to longer sequences.
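A sketch of how such a rotation can be applied in PyTorch, assuming the even- and odd-indexed channels form the rotated pairs (pairing conventions differ between implementations):

```python
import torch

def apply_rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0):
    # x: (..., seq_len, dim) queries or keys, with dim even.
    # positions: (seq_len,) integer token positions m.
    dim = x.shape[-1]
    theta = base ** (-torch.arange(0, dim, 2).float() / dim)  # θ_i = base^(-2i/d)
    angles = positions.float()[:, None] * theta[None, :]      # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                       # coordinate pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin                      # 2-D rotation per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Both the queries and the keys would be passed through this function before the QKᵀ dot product is computed.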
The causal mask ensures each token only attends to earlier tokens:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right)V$$

where

$$M_{ij} = \begin{cases} 0 & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases}$$
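Building this additive mask takes one line in PyTorch; a sketch with a toy score matrix:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # 0 on and below the diagonal, -inf strictly above it, so softmax
    # assigns zero weight to every future position.
    return torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

scores = torch.randn(4, 4)                                # toy QKᵀ/√d_k scores
weights = torch.softmax(scores + causal_mask(4), dim=-1)  # rows sum to 1, strictly causal
```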
LLaMA replaces ReLU with SwiGLU:

$$\mathrm{Swish}(x) = x \cdot \sigma(x)$$

Then the feed-forward layer becomes:

$$\mathrm{FFN}(x) = W_2\,\big(\mathrm{Swish}(W_1 x) \odot W_3 x\big)$$

where σ is the sigmoid function, ⊙ denotes element-wise multiplication, and W_1, W_2, W_3 are learned projection matrices.
This non-linearity introduces gating that improves expressivity.
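A PyTorch sketch of this gated feed-forward layer; the hidden width is left as a parameter (LLaMA chooses it so the parameter count stays comparable to a standard FFN):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)  # gate projection W1
        self.w3 = nn.Linear(dim, hidden, bias=False)  # up projection W3
        self.w2 = nn.Linear(hidden, dim, bias=False)  # down projection W2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Swish(W1 x) gates W3 x element-wise; F.silu is exactly
        # Swish: x * sigmoid(x). W2 projects back down to dim.
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```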