While encoder pretraining has historically relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders, often surpassing traditional encoders on text representation benchmarks.
This paper asks whether these gains reflect an inherent advantage of the CLM objective or arise from confounding factors such as model and data scale, through a series of large-scale, carefully controlled pretraining ablations. A total of 38 models ranging from 210 million to 1 billion parameters were trained, and over 15,000 fine-tuning and evaluation runs were conducted.
Training with MLM generally yields better performance across text representation tasks, while CLM-trained models are more data-efficient and exhibit improved fine-tuning stability.
Models:
The model architectures closely follow those of the EuroBERT models, with sizes of 210M, 610M, and 1B parameters. All models use a maximum context length of 2,048 tokens and a RoPE θ value of 10,000.
Pretraining data:
Models are trained on unique English tokens from the FineWeb-Edu dataset, which is known for supporting efficient model training.
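A minimal sketch of how such a corpus might be streamed and tokenized, assuming the public `HuggingFaceFW/fineweb-edu` release on the Hugging Face Hub, its `sample-10BT` subset, and a placeholder tokenizer; the paper's exact data subset and tokenizer are not reproduced here.

```python
# Sketch: stream FineWeb-Edu and tokenize it for pretraining (assumptions noted above).
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

# Stream a public subset so nothing needs to be downloaded up front.
stream = load_dataset(
    "HuggingFaceFW/fineweb-edu", name="sample-10BT", split="train", streaming=True
)

def tokenize(example):
    # Truncate to the 2,048-token context length used in the paper.
    return tokenizer(example["text"], truncation=True, max_length=2048)

tokenized = stream.map(tokenize, remove_columns=["text"])
```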
Pretraining objectives:
Models are trained using one of three approaches (a sketch contrasting the CLM and MLM losses follows this list):
- CLM uses next-token prediction, where each token is predicted autoregressively under a causal attention mask. The training objective is to minimize the negative log-likelihood:
  $\mathcal{L}_{\mathrm{CLM}}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$
- MLM, in contrast, randomly masks a subset of tokens and trains the model to reconstruct them using a bidirectional attention mask. The objective is:
  $\mathcal{L}_{\mathrm{MLM}}(\theta) = -\sum_{t \in \mathcal{M}} \log p_\theta(x_t \mid \tilde{x})$, where $\mathcal{M}$ is the set of masked positions and $\tilde{x}$ the corrupted input.
- A two-stage CLM+MLM approach applies CLM pretraining first, followed by MLM.
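A minimal PyTorch sketch contrasting the two objectives, assuming a hypothetical `model(input_ids, causal=...)` that returns per-token logits; it only illustrates shifted next-token targets (CLM) versus masked-position targets under bidirectional attention (MLM), and is not the paper's training code.

```python
# Sketch of the CLM and MLM losses; `model` and its interface are assumptions.
import torch
import torch.nn.functional as F

def clm_loss(model, input_ids):
    # Causal attention: each position attends only to earlier tokens.
    logits = model(input_ids, causal=True)
    # Predict token t+1 from the prefix up to t: shift logits against targets.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )

def mlm_loss(model, input_ids, mask_token_id, mask_ratio=0.3):
    # Randomly mask a subset of tokens; only masked positions contribute to the loss.
    labels = input_ids.clone()
    masked = torch.rand(input_ids.shape, device=input_ids.device) < mask_ratio
    labels[~masked] = -100  # ignored by cross_entropy
    corrupted = input_ids.masked_fill(masked, mask_token_id)
    # Bidirectional attention: every position sees the full (corrupted) sequence.
    logits = model(corrupted, causal=False)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
```

The masking ratio is left as a parameter since, as the results below discuss, no single value is optimal across model sizes and tasks.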
Pretraining hyperparameters:
Pretraining uses a per-device batch size of 12 samples across 192 GPUs, yielding an effective batch size of 2,373,120 tokens. A Warmup-Stable-Decay (WSD) learning rate schedule is employed: a 2,000-step warmup phase, followed by 38,000 steps at a constant learning rate of 5e-4, ending with a 2,000-step linear decay phase, for a total of 42,000 training steps.
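The stated schedule can be written down directly; the sketch below reproduces the phase boundaries given above (2,000 warmup steps, 38,000 constant steps at 5e-4, 2,000 linear decay steps) and is only an illustration, not the paper's training code.

```python
# Sketch of the Warmup-Stable-Decay (WSD) schedule with the stated phase lengths.
PEAK_LR = 5e-4
WARMUP, STABLE, DECAY = 2_000, 38_000, 2_000  # 42,000 steps in total

def wsd_lr(step: int) -> float:
    if step < WARMUP:                  # linear warmup to the peak learning rate
        return PEAK_LR * step / WARMUP
    if step < WARMUP + STABLE:         # constant plateau
        return PEAK_LR
    remaining = WARMUP + STABLE + DECAY - step
    return PEAK_LR * max(remaining, 0) / DECAY  # linear decay to zero

# e.g. wsd_lr(0) == 0.0, wsd_lr(2_000) == 5e-4, wsd_lr(41_000) == 2.5e-4
```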
Pretraining Setups
Pretraining From Scratch (PFS):
- Initialization: Models are trained from random initialization.
- Objectives: CLM, MLM, or sequential CLM+MLM.
- Learning Rate: Standard WSD scheduler for CLM and MLM. For CLM+MLM, CLM training is performed first, then MLM resumes from CLM checkpoints that have not yet undergone learning rate decay.
Continued PreTraining (CPT):
- Initialization: Models are initialized from existing checkpoints pretrained with either CLM or MLM.
- Objective: Training is resumed using the MLM objective.
- The pretrained models used for CPT have already undergone learning rate decay during their initial training.
- CPT therefore starts from checkpoints whose loss has already converged, unlike PFS, where the objective switch typically occurs during active learning.
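For concreteness, the two setups can be summarized as configurations; the dictionary below merely restates this section with invented keys and is not taken from the paper's codebase.

```python
# Restatement of the two pretraining setups as config dicts (keys are invented).
PRETRAINING_SETUPS = {
    "PFS": {  # Pretraining From Scratch
        "initialization": "random",
        "objectives": ["CLM", "MLM", "CLM->MLM"],
        # For CLM->MLM, the MLM stage resumes from a CLM checkpoint taken
        # before the decay phase of the WSD schedule.
        "resume_from": "pre-decay CLM checkpoint (CLM->MLM only)",
    },
    "CPT": {  # Continued PreTraining
        "initialization": "existing CLM or MLM checkpoint",
        "objectives": ["MLM"],
        # CPT starts from fully decayed, converged checkpoints, unlike PFS,
        # where the objective switch happens during active learning.
        "resume_from": "post-decay, converged checkpoint",
    },
}
```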
Fine-tuning tasks and datasets:
- Sequence Classification (SC): SST-2, MNLI, and QQP.
- Token Classification (TC): English subsets of CoNLL, OntoNotes, and UNER.
- Question Answering (QA): SQuAD, SQuAD-v2, and ReCoRD.
- Information Retrieval (IR): MS MARCO, NQ, and the English subset of MLDR for long-context evaluation.
Fine-tuning Protocol
- Training Length: Up to 1,000 steps or one full epoch, whichever comes first.
- Batch Size: 32.
- Learning Rate Selection: A grid search is performed over 6 learning rates (1e-5, 2e-5, 5e-5, 1e-4, 2e-4, and 5e-4) for each model-dataset pair.
- Learning Rate Schedule: 10% warmup followed by linear decay. The learning rate yielding the best validation performance is selected.
- Attention Mask: A bidirectional attention mask is used.
- Loss Functions:
  - SC: Cross-entropy on mean-pooled token embeddings.
  - TC & QA: Token-level cross-entropy.
  - IR: InfoNCE loss with in-batch negatives, using mean pooling (a sketch follows this list).
- Stability: The entire procedure is repeated across 5 random seeds to account for the fine-tuning instability commonly observed in BERT-style models.
- Training Data: Fine-tuning uses each task's in-domain training set, except for NQ and MLDR, which are trained on MS MARCO.
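A minimal sketch of the IR objective as described: mean pooling over token embeddings and an InfoNCE loss with in-batch negatives. The encoder interface and the temperature value are assumptions, not the paper's implementation.

```python
# Sketch of InfoNCE with in-batch negatives over mean-pooled embeddings.
import torch
import torch.nn.functional as F

def mean_pool(hidden_states, attention_mask):
    # Average token embeddings, ignoring padding positions.
    mask = attention_mask.unsqueeze(-1).float()
    return (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

def info_nce(query_emb, doc_emb, temperature=0.05):  # temperature is an assumption
    # Each query's positive is its own document; other in-batch docs act as negatives.
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                      # (batch, batch) similarities
    targets = torch.arange(q.size(0), device=q.device)  # diagonal entries are positives
    return F.cross_entropy(logits, targets)
```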
Evaluation protocol:
SC is evaluated with accuracy, TC and QA with F1 score, and IR with NDCG@10.
Results are reported as averages across seeds, with 95% confidence intervals.
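As one plausible way to produce the reported "mean with 95% confidence interval" over the five seeds, a t-interval can be used; the paper's exact interval construction is not specified here.

```python
# One plausible mean +/- 95% CI over seed scores (t-interval); construction is assumed.
import statistics
from scipy import stats

def mean_with_ci(scores, confidence=0.95):
    n = len(scores)
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / n ** 0.5            # standard error of the mean
    half_width = stats.t.ppf((1 + confidence) / 2, df=n - 1) * sem
    return mean, half_width

# Example with made-up seed scores: mean_with_ci([88.1, 88.4, 87.9, 88.3, 88.0])
# returns roughly (88.14, 0.26), i.e. 88.14 +/- 0.26.
```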
- MLM generally outperforms CLM on text representation tasks, particularly on SC and QA, which is attributed to its bidirectional attention during pretraining.
- The performance gap between MLM and CLM on SC and QA is consistent across model sizes, with QA being particularly sensitive to the absence of bidirectional attention.
- Task-specific trends exist for the MLM-to-CLM gap: it widens with increasing model size on SC but narrows on IR.
- CLM models can perform competitively, achieving strong results on token-level tasks (TC) and even outperforming MLM at the 610M size, despite generally underperforming on SC, QA, and IR.
- There is no universally optimal masking ratio for MLM pretraining; it depends on both model size and the specific downstream task, making it a delicate balance (see Figure 3, which shows how the optimal masking ratio varies with model size and downstream task).
- Masking ratio preferences differ: larger models tend to benefit from higher ratios, IR datasets consistently favor higher ratios, and smaller models perform better with lower ratios on token-level tasks (TC, QA). Larger models (610M and 1B) exhibit a U-shaped performance curve, indicating improved performance at both high and low masking ratios.
- CLM is more data-efficient than MLM in the early stages of training, consistently outperforming MLM in downstream performance during the initial training steps.
- CLM's early efficiency makes it an appealing option for data-scarce scenarios (e.g., low-resource languages) or as a warmup stage before MLM-based encoder training, although MLM models tend to catch up and surpass CLM in later training stages.
- CLM-based pretraining improves fine-tuning stability, showing lower sensitivity to learning rate choices than MLM.
- CLM pretraining provides a more stable initialization for fine-tuning, leading to more reliable performance and reducing the need for extensive hyperparameter tuning budgets.
- Starting pretraining with CLM and continuing with MLM (the two-stage approach) yields better results than using MLM alone under fixed compute constraints.
- Combining CLM and MLM consistently improves downstream performance compared to using MLM alone, although the effect varies by task and training budget.
- A split between 25%-75% and 50%-50% (CLM-MLM) appears to offer the best balance for two-stage pretraining.
- CLM-based models exhibit lower sensitivity to the masking ratio than fully MLM-trained models.
- Initial CLM pretraining appears to stabilize model weights, making adaptation more robust to masking ratio choices and yielding more consistent downstream performance that is less sensitive to this design parameter.
- MLM CPT applied to a CLM-pretrained model consistently achieves better downstream performance than continuing MLM-only training.
- On TC, the strong performance of CLM-only models was maintained, and the gap to MLM remained.
- For QA and IR, the performance gap between CLM and MLM was effectively closed.
- For SC, the MLM-adapted CLM model significantly outperformed the MLM-only model.
- It is not necessary to run the full 22,000 CPT steps; performance comparable to MLM-only CPT can be achieved with fewer steps.
- As early as 12,000 CPT steps, results are already strong and broadly match those of MLM-only CPT in terms of loss and downstream performance, with better results on TC and IR, comparable results on SC, and nearly on-par results on QA.
- Applying MLM CPT to a CLM model shows a more promising trend, with a steeper improvement curve toward the end, whereas MLM-only training tends to plateau (particularly noticeable on SC).
Should We Still Pretrain Encoders with Masked Language Modeling? (arXiv:2507.00994)