
Papers Explained 278: Phi-4 | by Ritvik Rastogi | Dec 2024



Phi-4 is a 14B parameter model that advances the performance of small language models by introducing innovative synthetic data generation techniques for reasoning-focused tasks. This advance is achieved by optimizing the training curriculum and data mixture, and by introducing new techniques in post-training.

Synthetic data constitutes the bulk of the training data for phi-4 and is generated using a diverse array of techniques, including multi-agent prompting, self-revision workflows, and instruction reversal. Techniques such as rejection sampling and a novel approach to Direct Preference Optimization (DPO) are employed to refine the model's outputs.

The development of phi-4 is guided by three core pillars:

1. Synthetic Data for Pretraining and Midtraining
2. Curation and Filtering of High-Quality Organic Data
3. New refined versions of SFT datasets, as well as a new technique for creating DPO pairs based on pivotal token search.

Recommended Reading: [Papers Explained 192: Phi-3.5]

The relationship between tokens in organic datasets is often complex and indirect, requiring many reasoning steps to connect the current token to the next. In contrast, each token generated by a language model is predicted by the preceding tokens, making it easier for the model to follow the resulting reasoning patterns. Synthetic data may act as a form of "spoon-feeding," presenting challenges in a digestible and progression-oriented manner.

Synthetic data is typically closer to the format of outputs expected from models. Training on such data helps align the model's pretraining experience with the scenarios encountered during inference, ensuring that the context seen during generation stays in-distribution with respect to the data the model was pretrained on.

The approach to generating synthetic data for phi-4 is guided by the following principles:

1. Diversity: The data should comprehensively cover subtopics and skills within each domain. This requires curating diverse seeds from organic sources.

2. Nuance and Complexity: Effective training requires nuanced, non-trivial examples that reflect the complexity and richness of the domain. Data must go beyond the basics to include edge cases and advanced examples.

3. Accuracy: Code should execute correctly, proofs should be valid, explanations should adhere to established knowledge, etc.

4. Chain-of-Thought: Data should encourage systematic reasoning, teaching the model various approaches to problems in a step-by-step manner. This fosters coherent outputs for complex tasks.

    Seed Curation

• Web and Code-based Seeds: snippets extracted from web pages, books, and code repositories, selected for complexity, reasoning depth, and educational value. A two-stage filtering process is employed: first identifying pages with strong educational potential, then segmenting the selected pages into passages and scoring them for factual and reasoning content.
• Question Datasets: questions collected from websites, forums, and Q&A platforms. Questions are filtered using a plurality-based technique to balance difficulty: multiple independent answers are generated for each question and majority voting is applied to assess response consistency. Questions where all answers agree (too easy) or are entirely inconsistent (too difficult or ambiguous) are discarded; a minimal sketch of this filter follows the list.
• Creating Question-Answer Pairs from Diverse Sources: language models are used to extract question-answer pairs from organic sources like books, scientific papers, and code. Deduction chains or logical progressions in text are detected to identify key reasoning steps, which are then reformulated into questions and corresponding answers.
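A minimal sketch of the plurality-based filter, assuming a hypothetical `sample_answers` callable that draws n independent model answers (the post does not specify the sampler or any thresholds):

```python
from collections import Counter

def keep_question(question, sample_answers, n=8):
    """Plurality-based difficulty filter (sketch): draw n independent
    answers and keep the question only if they partially agree."""
    answers = sample_answers(question, n)      # hypothetical LLM sampler
    top_count = Counter(answers).most_common(1)[0][1]
    if top_count == n:   # all answers agree -> too easy, discard
        return False
    if top_count == 1:   # entirely inconsistent -> too hard/ambiguous, discard
        return False
    return True          # partial agreement -> balanced difficulty, keep
```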

Data Transformation

• Rewrite and Augment: seeds are transformed into synthetic data through multi-step prompting workflows, rewriting useful content into exercises, discussions, or structured reasoning tasks. Self-revision is performed through a feedback loop in which a model critiques and improves its own outputs, guided by rubrics focused on reasoning and factual accuracy.
• Instruction Reversal for Code and Other Tasks: instructions, such as problem descriptions or task prompts, are generated from existing code snippets. Synthetic data pairs are structured with the instruction preceding the code, and only pairs with high fidelity between the original and regenerated code are retained (see the sketch after this list). This method can be generalized to other use cases.
• Validation of Code and Other Scientific Data: synthetic code data is validated through execution loops and tests. For scientific datasets, questions are extracted from source materials using a method designed to ensure high relevance, groundedness, and difficulty balance.
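A sketch of the instruction-reversal loop under stated assumptions: `llm` is a hypothetical text-generation callable, and fidelity is approximated here with a plain string-similarity ratio, since the post does not say how fidelity between original and regenerated code is measured:

```python
import difflib

def instruction_reversal(code_snippet, llm, fidelity_threshold=0.8):
    """Instruction reversal (sketch): derive an instruction from existing
    code, regenerate code from it, and keep only high-fidelity pairs."""
    # Step 1: generate the instruction that the existing code fulfils.
    instruction = llm(f"Write a task description for this code:\n{code_snippet}")
    # Step 2: regenerate code from the derived instruction alone.
    regenerated = llm(f"Write code for this task:\n{instruction}")
    # Step 3: compare original and regenerated code; string similarity is a
    # stand-in for whatever fidelity measure the authors actually used.
    fidelity = difflib.SequenceMatcher(None, code_snippet, regenerated).ratio()
    if fidelity < fidelity_threshold:
        return None                    # discard low-fidelity pairs
    return {"instruction": instruction, "code": code_snippet}
```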

Curation and Filtering of Web and Q&A Data

Ablation studies showed that organic questions are substantially more effective than synthetic questions: while rewritten questions improved the model's capabilities, the gains were not as pronounced. A significant portion of the collected questions lacked accurate solutions, so the answers were replaced with synthetically generated ones, and majority voting was used to increase accuracy.

The focus is on collecting reasoning-dense and nuanced material such as academic papers, educational forums, and programming tutorials. This data serves both as direct training data and as seeds for synthetic data generation. Clean and correct natural data is crucial for seeding, as minor errors can significantly degrade the quality of the synthetic data derived from it.

• Targeted Acquisitions: repositories of reasoning-dense documents, covering both publicly permissible sources (e.g., arXiv, PubMed Central, GitHub) and licensed sources (e.g., books), were acquired with an eye toward comprehensiveness, recency, and cleanliness.
• Filtering Web Dumps: to capture information-rich sources like forums and blogs, high-quality documents are selected from web dumps using small classifiers trained on LLM-generated annotations. A specialized pipeline amplifies high-quality non-STEM content. Corrupted text and binary files are removed based on n-gram statistics and compression ratios.
• Multilingual Data: multilingual datasets (German, Spanish, French, Portuguese, Italian, Hindi, Japanese, etc.) are incorporated from CommonCrawl and Wikipedia. A fastText-based language identification model categorizes documents into 176 languages, and the same classifiers used for filtering web dumps, trained on multilingual LLM-generated annotations, are applied to ensure quality (a sketch of this stage follows the list).
• Custom Extraction and Cleaning Pipelines: custom heuristics and parsers are developed for each data source to ensure cleanliness and uniformity. A custom HTML-to-text extractor, built for general web data, carefully preserves content like equations, code blocks, tables, and forum thread structure that simpler parsers often corrupt. It uses signals like HTML tags, CSS classes, content length, and tree depth to distinguish boilerplate and advertisements from the main content.
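The 176-language figure matches fastText's public lid.176 language-identification model, so a plausible sketch of the multilingual routing stage looks like the following; the quality classifier is not published, so `quality_score` is a hypothetical stand-in:

```python
import fasttext

# Public fastText language-ID model covering 176 languages:
# https://fasttext.cc/docs/en/language-identification.html
lid = fasttext.load_model("lid.176.bin")

def route_document(text, quality_score, min_quality=0.5):
    """Tag a document with its language, then apply a quality classifier
    trained on LLM-generated annotations (hypothetical stand-in here)."""
    labels, probs = lid.predict(text.replace("\n", " "))  # predict() wants one line
    lang = labels[0].removeprefix("__label__")            # e.g. 'de', 'hi', 'ja'
    if quality_score(text, lang) < min_quality:
        return None                                       # drop low-quality docs
    return {"lang": lang, "lang_conf": float(probs[0]), "text": text}
```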

Post-Training Datasets

• Supervised Fine-Tuning (SFT) datasets are generated from carefully curated user prompts taken from a mixture of publicly available datasets and synthetically generated data. Multiple model responses are generated for each prompt, and the best are selected using an LLM-based evaluation process.
• Direct Preference Optimization (DPO) pairs are generated based on rejection sampling and LLM evaluation, a part of which relies on an approach for creating pivotal-token-based pairs.

The phi-4 model is based on a decoder-only transformer architecture with 14B parameters and a default context length of 4096. The architecture closely follows phi-3-medium, except that the tiktoken tokenizer (for better multilingual support) is used, with a padded vocabulary size of 100,352 (including unused tokens), and full attention is applied over the 4K context length, rather than the 2K sliding window used in phi-3-medium. The model is pretrained on approximately 10T tokens.
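Collected as a configuration sketch (only the values stated above; unstated details such as layer count and hidden size are omitted rather than guessed):

```python
phi4_architecture = {
    "type": "decoder-only transformer",
    "parameters": "14B",
    "context_length": 4096,   # default; extended to 16K during midtraining
    "attention": "full attention over the 4K context",  # phi-3-medium: 2K sliding window
    "tokenizer": "tiktoken",  # chosen for better multilingual support
    "vocab_size": 100_352,    # padded, including unused tokens
    "pretraining_tokens": "~10T",
}
```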

Two key observations are noted from the phi-3 family of models:

• Web datasets showed small benefits on reasoning-heavy benchmarks: prioritizing more epochs of synthetic data led to better performance than adding fresh web tokens.
• Models trained solely on synthetic data underperformed on knowledge-heavy benchmarks and exhibited increased hallucinations.

Data mixture for pretraining.

Pretraining is followed by a shorter midtraining stage that increases the original context length from 4K to 16K. High-quality non-synthetic datasets (i.e., academic, book, and code data) are filtered to select samples above 8K context, and the data subsets that are 16K or longer are up-weighted. New synthetic datasets satisfying the >4K sequence requirement are also created. The final data mixture comprises 30% newly curated longer-context data and 70% recall tokens from the pretraining stage. To accommodate the longer context, the base frequency of the RoPE position encoding is increased to 250K.
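A small sketch of what the RoPE change does: raising the base lowers every rotation frequency, so positional phases advance more slowly and positions out to 16K remain distinguishable. The 10,000 base used for comparison is an assumption (the common default); the post only states the new value of 250K:

```python
import numpy as np

def rope_inv_freq(head_dim, base):
    """Per-dimension inverse frequencies for rotary position embeddings."""
    return 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))

default_freqs = rope_inv_freq(128, 10_000)    # assumed original base
phi4_mid_freqs = rope_inv_freq(128, 250_000)  # base used for phi-4 midtraining
# The slowest-rotating dimension rotates far more slowly with the larger base,
# stretching the usable positional range well past the original 4K.
print(default_freqs.min() / phi4_mid_freqs.min())
```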

Post-training aims to transform the pretrained language model into an AI assistant that users can safely interact with. The pretrained model is aligned with one round of SFT, one round of DPO on data from the pivotal token search method, and one round of DPO on full-length preference pairs.

Supervised Fine-Tuning

In this phase, the pretrained model is fine-tuned on a variety of data across diverse domains, including math, coding, reasoning, conversation, model identity, and safety, with multilingual data for 40 languages also added. Around 8B tokens of data are used in this phase, all formatted in the chatml format.
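For reference, a minimal example of the chatml layout the post refers to; the role markers below follow the generic chatml convention, an assumption, since the post does not reproduce phi-4's exact template:

```python
# Generic chatml layout (assumption: phi-4's exact special tokens may differ).
sft_example = (
    "<|im_start|>system\n"
    "You are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "Factor x^2 - 5x + 6.<|im_end|>\n"
    "<|im_start|>assistant\n"
    "x^2 - 5x + 6 = (x - 2)(x - 3)<|im_end|>\n"
)
```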

Direct Preference Optimization

DPO is used to align the model with human preferences, and also to steer it away from undesirable behavior through pairs of desired and undesired outputs. The DPO data covers chat-format data, reasoning, and Responsible AI (RAI) data, and improves the model in math, coding, reasoning, robustness, and safety. Two rounds of DPO are performed on the SFT model; a technique called Pivotal Token Search (PTS) is introduced to generate the pairs for the first round.
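For context, the standard published DPO objective that these rounds optimize, as a minimal sketch (not phi-4-specific code):

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss over per-response log-probabilities.

    pi_*  : summed log-probs under the policy being trained
    ref_* : summed log-probs under the frozen reference (SFT) model
    """
    chosen_margin = beta * (pi_chosen - ref_chosen)
    rejected_margin = beta * (pi_rejected - ref_rejected)
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```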

Data mixture for pivotal token DPO and judge-guided DPO.

For the second round, called judge-guided DPO, approximately 850k pairs of desired and undesired outputs are gathered. The prompts are sourced from various publicly available instruction-tuning datasets and also include prompts related to safety and Responsible AI (RAI). For each prompt, responses are generated from GPT-4o, GPT-4t, and the model itself. Various combinations of these responses are formed into DPO pairs, and GPT-4o is used as a judge to label the positive and negative response in each pair.
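A sketch of that pairing step; `judge` is a hypothetical stand-in for the GPT-4o labeling call, whose prompt and label format the post does not give:

```python
from itertools import combinations

def judge_guided_pairs(prompt, responses, judge):
    """Build DPO pairs (sketch): responses come from GPT-4o, GPT-4t, and
    the model; a judge model labels the preferred response in each pair."""
    pairs = []
    for a, b in combinations(responses, 2):
        preferred = judge(prompt, a, b)    # hypothetical: returns 'a' or 'b'
        chosen, rejected = (a, b) if preferred == "a" else (b, a)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```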

    Pivotal Token Search

Consider a generative model producing a token-by-token response to a given prompt. For each token produced, which corresponds to a prefix of the model's response, one can consider the conditional probability of the model's answer being correct given that prefix, as well as the increment in this probability with respect to that token (in other words, the difference in the probability of being correct before and after generating that token). It is often the case that overall correctness is highly dependent on the successful generation of a small number of key tokens.

    Illustration of pivotal tokens for GPT-4o at temperature 1.

There are many tokens with probabilities much lower than that of the pivotal token, which can contribute noise to the gradients and dilute the signal from the pivotal token. Moreover, when two texts substantially deviate from each other, comparing their individual next-token log probabilities (as done in DPO) is not very meaningful. Rather, it makes more sense for the signal to come from the first tokens after the two texts start diverging from each other.

To alleviate these effects, a method called Pivotal Token Search (PTS) is employed for generating preference data that specifically targets pivotal tokens in isolation, creating DPO pairs in which the preference optimization takes effect with respect to a single token.

    Pseudocode for Pivotal Token Search (PTS).

PTS identifies points in a completion token sequence T_full = t1, t2, … for some user query Q where the next token ti has a significant impact on the probability of success p(success | t1, …, ti). PTS estimates these probabilities by sampling completions starting from Q + t1, …, ti, which are checked for correctness with an oracle for Q. The procedure Subdivide recursively splits the sequence into segments ti, …, tj until the change in probability |p(success | t1, …, ti−1) − p(success | t1, …, tj)| for each segment is below a threshold p_gap or the segment is just a single token. Tokens with a sharp change in success probability are kept as pivotal. Pivotal tokens are turned into preference data by taking Q + t1, …, ti−1 as the query, and single tokens t_acc and t_rej that increase/decrease p(success | t1, …, ti−1, t_acc/rej) as the accepted and rejected completions, respectively.
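A compact sketch of the Subdivide recursion described above, assuming a `p_success` callable that wraps the sampling-plus-oracle estimate of p(success | t1, …, ti); the default p_gap value is illustrative, not taken from the paper:

```python
def pivotal_tokens(tokens, p_success, p_gap=0.3):
    """Pivotal Token Search (sketch): recursively subdivide the completion
    until each segment either barely moves p(success) or is a single token.

    tokens    : completion t1..tn for a query Q
    p_success : p_success(i) ~ p(success | t1..ti), estimated by sampling
                completions from Q + t1..ti and checking them with an oracle
                (estimates should be cached, as each one costs many samples)
    """
    pivotal = []

    def subdivide(lo, hi):                 # segment t_{lo+1}..t_hi
        if abs(p_success(hi) - p_success(lo)) < p_gap:
            return                         # probability barely changes: stop
        if hi - lo == 1:
            pivotal.append(hi)             # single token with a sharp change
            return
        mid = (lo + hi) // 2
        subdivide(lo, mid)
        subdivide(mid, hi)

    subdivide(0, len(tokens))
    return pivotal                         # indices of pivotal tokens
```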

Sample preference data generated by Pivotal Token Search.

The binary-search algorithm for PTS is not always guaranteed to find all pivotal tokens; however, it only finds pivotal tokens, and it finds all of them if the success probability is near-monotone over the course of the solution.

PTS is used to generate preference data for tasks where ground truth is readily available, such as mathematics, various forms of question answering, and coding. To improve sample efficiency, the target questions are filtered to include only those with 0.2 ≤ p(success) ≤ 0.8, since pivotal tokens are rare for tasks that are very easy or very hard.

To evaluate and demonstrate the capabilities of the phi-4 language model, existing benchmarks are used, including MMLU, GPQA, MATH, HumanEval, MGSM, SimpleQA, DROP, MMLU-Pro, HumanEval+, ArenaHard, and IFEval, and phi-4's performance is compared against other models, including Qwen-2.5-14B-Instruct and GPT-4o.

Performance of phi-4 on a set of standard benchmarks.
• Phi-4 outperforms Qwen-2.5-14B-Instruct in 9 out of 12 benchmarks.
• Phi-4 excels in STEM Q&A tasks, outperforming even its teacher model (GPT-4o) on GPQA and MATH.
• Phi-4 achieves high scores on coding tasks (HumanEval and HumanEval+).
• Phi-4 shows weaker performance on SimpleQA, DROP, and IFEval. While the SimpleQA and DROP results are considered potentially misleading, IFEval reveals a real weakness in instruction following.

Phi-4 Technical Report: arXiv 2412.08905

Recommended Reading: [Small LLMs] [Phi Series]


