This paper conducts an in-depth evaluation of model performance on the AIME24 dataset to understand how reasoning capabilities evolve. A ladder-like structure in problem difficulty is found, categorizing questions into four tiers: Easy, Medium, Hard, and Extremely Hard (Exh). The specific requirements for advancing between tiers are identified.
- Progression from the Easy to the Medium tier requires adopting an R1 reasoning style with minimal SFT (500-1K instances).
- Hard-level questions suffer from frequent model errors at each step of the reasoning chain, with accuracy plateauing at 65% despite logarithmic data scaling.
- Exh-level questions present a fundamentally different challenge: they require unconventional problem-solving skills that current models uniformly struggle with.
The project is available on GitHub.
The AIME 2024 benchmark is chosen for its hierarchical difficulty, diversity across mathematical domains (algebra, number theory, geometry, combinatorics), and basic knowledge requirement (high-school mathematics with occasional undergraduate-level concepts).
The base model used is Qwen2.5-32B-Instruct, as Qwen-series models inherently possess cognitive behaviors (verification, backtracking, subgoal setting, and backward chaining) that Llama-series models lack.
Evaluation Metrics:
- The primary metric is avg@n, the average pass rate obtained by generating multiple solutions (with temperature set to 1) and averaging the results (n = 8 by default).
- cov@n is also reported, indicating whether the model succeeds in at least one of the n attempts. Both metrics are sketched below.
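As a minimal reference implementation (the function names and example data are illustrative, not the paper's evaluation harness):

```python
import numpy as np

def avg_at_n(correct: np.ndarray) -> float:
    """avg@n: mean pass rate over n sampled attempts per question.

    `correct` has shape (num_questions, n); entries are 1 if the
    attempt produced the right final answer, else 0.
    """
    return float(correct.mean())

def cov_at_n(correct: np.ndarray) -> float:
    """cov@n: fraction of questions solved in at least one of n attempts."""
    return float(correct.any(axis=1).mean())

# Example: 3 questions, n = 8 attempts each.
attempts = np.array([
    [1, 1, 0, 1, 1, 1, 0, 1],   # solved 6/8 times
    [0, 0, 0, 1, 0, 0, 0, 0],   # solved once
    [0, 0, 0, 0, 0, 0, 0, 0],   # never solved
])
print(avg_at_n(attempts))  # 7/24 ~ 0.29
print(cov_at_n(attempts))  # 2/3  ~ 0.67
```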
AIME24 questions are manually categorized into four difficulty levels (Easy, Medium, Hard, and Extremely Hard) based on the performance of public models (Qwen2.5-32B-Instruct fine-tuned on small-scale SFT datasets, and LLMs with large-scale post-training or tool use such as R1, QwQ, and STILL-3).
- Easy level consists of 4 questions for which the base model achieves an average accuracy above 50%.
- Medium (Med) level comprises 10 questions where the small-scale SFT model attains over 50% accuracy.
- Extremely Hard (Exh) level contains 4 questions that yield less than 10% accuracy across all models.
- Hard level contains the remaining 12 questions that fit none of the above categories; the decision rule is summarized in the sketch below.
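The manual labeling can be paraphrased as a threshold rule over model accuracies (an approximation of the criteria above, not the authors' code):

```python
def difficulty_tier(base_acc: float, sft_acc: float, best_acc: float) -> str:
    """Paraphrase of the manual tiering rule.

    base_acc: avg@8 of Qwen2.5-32B-Instruct (no SFT)
    sft_acc:  avg@8 of the small-scale SFT model
    best_acc: best avg@8 across all public models (R1, QwQ, STILL-3, ...)
    """
    if base_acc > 0.5:
        return "Easy"
    if sft_acc > 0.5:
        return "Medium"
    if best_acc < 0.1:
        return "Extremely Hard"
    return "Hard"
```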
Qwen2.5-32B-Instruct achieved over 50% accuracy on Easy-level questions but only about 10% on Med-level questions, failing completely (0% accuracy) on half of them.
After SFT on roughly 1,000 R1-style trajectories (e.g., S1.1, LIMO), these models improved significantly, reaching around 90% average accuracy on Med-level questions, with perfect accuracy on half of them.
This rapid improvement prompted an investigation into which aspects of the SFT data drove the change.
All you need is SFT on 1K random R1-style trajectories from any math category
Variables Analyzed:
- Foundational Math Knowledge (C): Questions from diverse categories in OpenR1-Math-220k (algebra, calculus, combinatorics, inequalities, logic & puzzles, number theory, geometry) were evenly sampled.
- Dataset Size (N): Experiments varied the number of training examples: 100, 200, 500, and 1,000 per category.
- CoT Trajectory Length (L): Evaluated three tiers: normal (nm: 1,000 random trajectories), short (sh: the 1,000 shortest), and long (lg: the 1,000 longest).
- CoT Trajectory Style (S): Compared DeepSeek-R1 and Gemini-flash trajectories using 1K questions.
Performance P is a function of Category (C), Number of trajectories (N), Trajectory Length (L), and Style (S): P = f(C, N, L, S).
To achieve performance P ≥ 90% on Med-level questions, the minimal configuration required is:
P = f(C = *, N ≥ 500, L = nm/lg, S = R1)
That is, the model consistently meets the passline only when trained on at least 500 randomly chosen normal- or long-length R1-style trajectories, regardless of the specific math category.
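Read as code, the recipe reduces to a simple predicate over the four variables (a paraphrase of the condition above, using the tier tags from the text; the function name is illustrative):

```python
def meets_med_passline(category: str, n: int, length: str, style: str) -> bool:
    """True if an SFT configuration satisfies the minimal recipe for
    >= 90% avg accuracy on Med-level questions: any math category,
    at least 500 trajectories, normal- or long-length CoT, R1 style."""
    return n >= 500 and length in {"nm", "lg"} and style == "R1"
```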
SFT leads models to similar problem-solving strategies
To investigate whether small-scale SFT genuinely imparts problem-solving skills, base models are fine-tuned on R1-style trajectories across multiple math categories using the configuration P = f(C ∈ {algebra, calculus, combinatorics, ...}, N = 1000, L = lg, S = R1).
Evaluation Methodology:
- Fine-tuned models' greedily sampled trajectories were compared against DeepSeek-R1 trajectories on AIME24 Med-level questions.
- GPT-4o-mini is used to summarize each reasoning trajectory into its applied strategy and intermediate results.
- GPT-4o-mini then quantitatively assessed trajectory similarity on a 6-point scale (0: totally different, 5: almost identical); a sketch of such a judge call follows this list.
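What follows is a guess at what one such judge call might look like; the prompt wording and helper function are assumptions, and only the model name and the 0-5 scale come from the summary above:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You will see two summarized reasoning trajectories
(strategy + intermediate results) for the same math problem.
Rate their similarity from 0 (totally different) to 5 (almost identical).
Reply with the number only.

Trajectory A:
{a}

Trajectory B:
{b}"""

def similarity_score(summary_a: str, summary_b: str) -> int:
    """Ask GPT-4o-mini to place two trajectory summaries on the 6-point scale."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(a=summary_a, b=summary_b)}],
    )
    return int(response.choices[0].message.content.strip())
```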
Models tend to employ similar problem-solving strategies: roughly 50% of trajectories were rated "almost identical" (score 5) and the remaining 50% "mostly similar" (score 4), despite the models being trained on diverse math categories.
Unlike the sudden leap in performance from Easy to Med-level questions, the progression from Med-level to Hard-level is gradual. Small-scale SFT models achieve low accuracy (around 25%) on Hard-level questions.
Why models fail: instability in the exploration and computation the task demands
- Multiple Hidden Steps: Hard-level questions involve multiple sequential hidden steps. For example, AIME 2024 problem #1 requires finding coordinates, a center/radius, intersection points, and lengths. Each step adds a chance of pursuing a wrong reasoning path, and the overall success rate is the product of the per-step success rates, so accuracy declines as the number of steps grows (e.g., five steps at 90% reliability each compound to 0.9^5 ≈ 59% overall).
- Computational Complexity: Certain steps in Hard-level questions are computationally intensive. For instance, AIME 2024 problem #5 requires calculating a tetrahedron's volume using the Cayley-Menger determinant, a significant obstacle for models with limited-scale SFT (see the sketch after this list).
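To make the second point concrete, here is a small numpy sketch of the kind of computation that problem calls for; this is the generic Cayley-Menger formula, not code from the paper:

```python
import numpy as np

def tetrahedron_volume(d2: np.ndarray) -> float:
    """Tetrahedron volume from the 4x4 matrix of squared pairwise
    vertex distances, via the Cayley-Menger determinant:
    288 * V^2 = det(CM), where CM borders d2 with a row/column of ones."""
    cm = np.ones((5, 5))
    cm[0, 0] = 0.0
    cm[1:, 1:] = d2
    return float(np.sqrt(abs(np.linalg.det(cm)) / 288.0))

# Sanity check: a regular tetrahedron with unit edges has V = 1/(6*sqrt(2)).
d2 = np.ones((4, 4)) - np.eye(4)   # all six squared edge lengths equal 1
print(tetrahedron_volume(d2))      # ~0.117851
```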
SFT data scaling shows a logarithmic trend in Hard-level question accuracy
Experiments are conducted by varying the number of CoT (chain-of-thought) trajectories (50, 100, 200, 500, 1K, 2K, 5K, 10K, 20K) and evaluating SFT'ed models (OpenThinker-32B, OpenThinker2-32B, QwQ-32B, STILL-3), all based on Qwen2.5-32B-Instruct.
- Performance on Hard-level questions follows a logarithmic scaling pattern with respect to dataset size, with accuracy improvements plateauing at roughly 65% (a curve-fit sketch follows this list).
- Models employing reinforcement learning (QwQ-32B) or external computational tools (STILL-3) surpass this 65% ceiling, suggesting that integrating external tools significantly improves the stability of CoT trajectories.
- The exact data used for QwQ-32B is not public, leaving the precise advantages of RL over SFT an open research question.
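The reported trend amounts to fitting accuracy against ln(N); a minimal sketch with toy numbers (deliberately not the paper's measurements) shows the shape of such a fit:

```python
import numpy as np

def fit_log_scaling(sizes, accuracies):
    """Least-squares fit of accuracy ~= a * ln(N) + b, the logarithmic
    trend reported for Hard-level accuracy vs. SFT dataset size."""
    a, b = np.polyfit(np.log(sizes), accuracies, deg=1)
    return a, b

# Toy illustration only (NOT measured results): points on a log curve.
sizes = np.array([50, 100, 200, 500, 1000, 2000, 5000, 10000, 20000])
accs = 0.06 * np.log(sizes) + 0.05
a, b = fit_log_scaling(sizes, accs)
print(f"accuracy ~ {a:.3f} * ln(N) + {b:.3f}")
```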
A carefully curated small-scale SFT dataset does not deviate from the scaling trend
A curated dataset is constructed by selecting the top 90 most similar questions from open-r1/OpenR1-Math-220k for each Hard-level question, using OpenAI's text-embedding-3-small model, resulting in roughly 1K CoT trajectories.
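A sketch of that retrieval step follows; the embedding model name comes from the text, the client calls follow OpenAI's standard embeddings API, and the helper functions are illustrative:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of question strings with text-embedding-3-small."""
    resp = client.embeddings.create(model="text-embedding-3-small",
                                    input=texts)
    return np.array([item.embedding for item in resp.data])

def top_k_similar(query_vec: np.ndarray, corpus_vecs: np.ndarray,
                  k: int = 90) -> np.ndarray:
    """Indices of the k corpus questions most similar to one
    Hard-level question, by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    return np.argsort(c @ q)[::-1][:k]
```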
The SFT'ed model trained on this curated dataset achieved an average score of 33.6% on Hard-level questions, about 5 percentage points higher than the 28.4% obtained with a randomly constructed dataset of the same 1K size.
However, merely increasing the dataset size from 1K to 2K (randomly constructed) led to a larger improvement of 7 percentage points.
Despite the "unfair" comparison (the curated dataset had knowledge of the test questions), the results suggest that scaling up the dataset is generally more effective than careful curation, particularly in the small-scale SFT regime.
Despite fine-tuning with varying SFT dataset sizes, all models, including R1, achieve 0% accuracy on Exh-level questions. This indicates that the scaling behavior observed for Hard-level questions does not extend to the Exh level.
To understand the missing capabilities and limitations, R1 is probed with:
- Variations of the problem statement.
- Suggestive prompts and hints.
- Subproblems of the original problem.
- Questions designed to test specific sub-capabilities.
R1 is chosen for analysis because it represents an upper bound for models fine-tuned with common SFT methods (R1 trajectories).
Key Limitations of LLMs (R1):
Rigidity in Common Strategies:
- LLMs tend to apply fixed patterns (e.g., coordinate systems for geometry, inclusion-exclusion for combinatorics), even when these are not the most feasible or efficient approaches.
- Example (Problem #2, Octagon Coloring): R1 persistently attempts the inclusion-exclusion principle with rotation angles, which is overly complex, instead of the more straightforward casework approach.
Deficiency in Geometric Intuition:
- Limited by their 1-D sequential architecture, LLMs struggle to acquire geometric intuition that comes easily to humans.
- Example (Problem #21, Rectangles in a Dodecagon): R1 finds it challenging to discover and exploit rotational symmetry (e.g., multiplying by 3 after identifying representative cases) when enumerating rectangles, instead attempting a computationally intensive enumeration.
Limited Reasoning Context:
- Even with large context windows (up to 32K tokens), models fall short in cases requiring extensive exploration of substeps.
- Example (Problem #2, Octagon Coloring): While R1 can correctly solve a subcase (e.g., counting configurations with exactly 4 blue vertices) given sufficient reasoning, it often rushes to an incorrect conclusion on the full problem, where that subcase is only one part of a lengthy reasoning chain.
SFT serves as a crucial intermediate step in standard training pipelines for reasoning models. The findings have several implications:
- Importance of SFT Dataset Scale: The scale of the SFT dataset remains important, even though recent studies suggest that strong performance can be achieved with fewer samples (~1K).
- Ceiling Effect of Dataset Size: Scaling up the dataset eventually hits a ceiling, particularly for Extremely Hard (Exh-level) questions, which cannot be effectively addressed by merely expanding the quantity of training samples.
- Developing Higher-Level Intelligence: Given preliminary evidence that SFT-trained models adopt similar solutions for Med-level questions, a key question arises: can SFT develop higher-level intelligence, such as employing uncommon yet ingenious solutions? The research aims to open new avenues for advances in this area.
Paper: Climbing the Ladder of Reasoning: What LLMs Can-and Still Can't-Solve after SFT? (arXiv: 2504.11741)