The NuminaMath dataset is a comprehensive collection of 860k pairs of competition math problems and solutions. Problems range from high-school level to advanced competition level, all meticulously annotated with accompanying chain-of-thought traces. The dataset is designed to boost the mathematical reasoning capabilities of LLMs and stands as the largest math dataset released in the field to date.
The project is available on GitHub.
The data sources include Chinese high school math exercises, US and international mathematics olympiad problems, and problems collected from online forums.
- MATH and GSM8K: Existing reference solutions are reformatted into a Chain-of-Thought (CoT) format using GPT-4, following recommendations from DeepSeekMath and ReAlign.
- Orca-Math: Regular expressions are used to extract and simplify answers from the original dataset's solution text. Answers are then enclosed within \boxed{} for consistent formatting. The authors note that if a well-formatted version of Orca-Math is already present in the training data, this step may be redundant. An alternative is to use GPT-4 to generate the final solution.
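The answer-extraction step for Orca-Math can be sketched as follows. The actual patterns used are not published, so this is a minimal sketch assuming the answer follows a marker like "The answer is" or is the last number in the solution text:

```python
import re

def extract_boxed_answer(solution_text: str) -> str:
    """Extract the final numeric answer and append it wrapped in \\boxed{}."""
    # Try an explicit answer marker first.
    match = re.search(r"(?:answer is|Answer:)\s*([-\d.,/]+)", solution_text)
    if match is not None:
        answer = match.group(1).rstrip(".,")
    else:
        # Fall back to the last number appearing in the text.
        numbers = re.findall(r"-?\d+(?:\.\d+)?", solution_text)
        if not numbers:
            return solution_text  # nothing extractable; leave untouched
        answer = numbers[-1]
    return solution_text + f"\nThe final answer is $\\boxed{{{answer}}}$"

print(extract_boxed_answer("Adding 3 and 4 gives 7. The answer is 7."))
```

Wrapping answers in \boxed{} this way gives downstream filtering a single consistent place to look for the final answer.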
- AMC and AIME: Problems and LaTeX-formatted solutions are collected from the Art of Problem Solving (AoPS) wiki. The first solution containing a \boxed{} symbol is selected. Due to overlap with the MATH dataset, a decontamination process using embeddings is employed, resulting in roughly 4,300 problems retained for training. Final solutions are then realigned into the CoT format using GPT-4.
- AoPS Forum: Problems are crawled from the AoPS Contest Collection page. Since solutions are not explicitly marked, replies containing \boxed{} symbols are considered, prioritizing those with the most LaTeX. The chosen reply is treated as the reference solution and rewritten in CoT format by GPT-4.
- Chinese K-12 Exam: K-12 math exercises are collected from public exam papers, often sourced from public resources. OCR and regex segmentation are used to extract problem-solution pairs from PDFs. GPT-4 is then used for translation and realignment of solutions into the CoT format.
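The regex-segmentation step can be sketched roughly as below. The real patterns are not published, so this assumes a simple hypothetical layout where OCR output has numbered problems each followed by a "Solution:" block:

```python
import re

# Hypothetical OCR output; real exam papers would need more robust patterns.
ocr_text = """1. Solve x + 2 = 5.
Solution: x = 3.
2. Compute 2 * 6.
Solution: 12."""

# Split before each problem number, then separate problem from solution.
pairs = []
for chunk in re.split(r"\n(?=\d+\.\s)", ocr_text):
    problem, _, solution = chunk.partition("Solution:")
    pairs.append((problem.strip(), solution.strip()))

print(pairs)
```

Each resulting pair would then be passed to GPT-4 for translation and CoT realignment.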
- Synthetic Data: Synthetic problems are generated from the MATH and AMC-AIME training splits using the Xwin-Math approach. Unlike the original method, the solution from the initial generation stage (using GPT-4 with a temperature of 0.8) is used directly, to reduce costs.
- World Olympiads Data: 152K problem-solution pairs are collected from various sources:
- International contests and their shortlists (e.g., IMO, APMO, BMO).
- National and regional contests (see Figure 2 in the original text for the country breakdown).
- Problem-solving forums, puzzle and olympiad books, and summer school materials.
- PDFs are the primary source format; HTML content is converted to PDF. A pipeline (described elsewhere in the original text) is then applied to process these problems.
Decontamination
The following two-step decontamination strategy is used:
- All 10-gram exact matches against the test sets are removed from all datasets, except the synthetic dataset and the MATH train set.
- For finer-grained decontamination, Mistral embeddings are computed for each of the problems, apart from the MATH train set and the synthetic datasets. All problems with an embedding distance < 0.15 are then removed. This threshold was derived empirically: above it, no contamination was observed in internal checks.
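The first (exact-match) step can be sketched as follows; the n-gram normalization details and the embedding step are simplified away here, and the function names are illustrative:

```python
import re

def ngrams(text: str, n: int = 10) -> set:
    """Return the set of word-level n-grams of a lowercased problem statement."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(problem: str, benchmark_problems: list, n: int = 10) -> bool:
    """Flag a training problem that shares any exact 10-gram with a benchmark item.

    This covers step one only; step two would additionally drop problems whose
    Mistral-embedding distance to any benchmark problem is below 0.15.
    """
    benchmark_grams = set()
    for b in benchmark_problems:
        benchmark_grams |= ngrams(b, n)
    return bool(ngrams(problem, n) & benchmark_grams)
```

In practice the benchmark n-gram set would be built once and reused across the whole training corpus.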
After creating the NuminaMath-CoT dataset, extending it to TIR (tool-integrated reasoning) is straightforward. The same approach as ToRA, particularly their prompt, is adopted to sample TIR data from GPT-4o. The process to create this TIR dataset is as follows:
- Extract a subset of roughly 100K problems with numeric outputs from the NuminaMath-CoT dataset.
- Sample a solution for each problem using the GPT-4o assistant API with a temperature of 0.8.
- Filter out negative samples, where model-generated answers do not match the reference answer. For integer-output problems, exact match is used. For other expressions, a match is determined using GPT-4o as a judge.
- Repeat the same process on the negative problems.
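The rejection-sampling loop above can be sketched as follows. `sample_solution` and `judge_match` stand in for the GPT-4o calls and are hypothetical names, not part of any published API:

```python
def build_tir_dataset(problems, sample_solution, judge_match, rounds: int = 2):
    """Keep solutions whose final answer matches the reference; retry failures.

    `problems` is a list of (problem, reference_answer) pairs.
    `sample_solution(problem)` returns (solution_text, final_answer).
    `judge_match(answer, reference)` is the GPT-4o-as-judge comparison.
    """
    accepted, remaining = [], list(problems)
    for _ in range(rounds):
        negatives = []
        for problem, reference in remaining:
            solution, answer = sample_solution(problem)  # temperature 0.8
            if reference.lstrip("-").isdigit():
                ok = answer == reference             # exact match for integers
            else:
                ok = judge_match(answer, reference)  # judge for other expressions
            if ok:
                accepted.append((problem, solution))
            else:
                negatives.append((problem, reference))
        remaining = negatives                        # retry only the failures
    return accepted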
Models are fine-tuned using a two-stage process inspired by the MuMath-Code paper.
- Fine-tuning on a large, diverse dataset of natural language math problems and solutions, with CoT annotations to facilitate reasoning.
- Fine-tuning on a synthetic TIR dataset, where problems are decomposed into rationales, Python programs, and their outputs.
Models are trained at two scales:
- 7B, based on DeepSeekMath-Base 7B
- 72B, based on Qwen2-72B
Tool-integrated reasoning (TIR)
Each run of TIR begins with a problem, x. The goal of TIR is to sample a candidate solution, y. TIR starts by initializing a context, c, with an initial prompt c0 containing only x. This context is then extended through up to k rounds of interaction.
On each iteration, i, TIR uses a sampler, S, and an LLM, θ, to sample text containing CoT and Python source code, zi, until reaching the stop keyword wstop = "```output". After sampling zi, TIR first checks whether a candidate answer has been generated, which would be wrapped in the keyword wanswer = \boxed{}.
If an answer is present, TIR applies a response parser, R, to the output, which sanitizes the text and returns only the final numerical response, with any units and other formatting removed. If no valid response is present, TIR checks whether any code has been generated by matching a regular expression with the Python region keyword wpython = "```python(.*)```".
If no such region is available, zi is discarded, and TIR proceeds to the next iteration, resampling a fresh block of text. If such a region is available, the Python source code is passed to the Python interpreter, I, which parses and executes it. The result, ri, from running I(zi) may include the output of print statements, or a truncated traceback if an exception was raised.
The running context is then extended, proceeding to the next round of interaction, by setting ci to ci−1 ⊕ zi ⊕ ri, where ⊕ denotes concatenation. Thus, by the end of the interaction, c = c0 ⊕ z1 ⊕ r1 ⊕ z2 ⊕ r2 ⊕ … ⊕ z≤k, where either a candidate answer, y, is successfully extracted from z≤k, or an error keyword werror is returned.
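The interaction loop can be sketched as follows. This is a minimal version; `sample_until_stop` stands in for the sampler S with the LLM θ, and the response parser R is reduced to a regex:

```python
import io
import re
import contextlib
import traceback

ANSWER_RE = re.compile(r"\\boxed\{([^{}]*)\}")          # w_answer
PYTHON_RE = re.compile(r"```python(.*?)```", re.DOTALL)  # w_python

def run_tir(problem: str, sample_until_stop, k: int = 4) -> str:
    """Minimal sketch of one TIR run: sample, execute code, extend context."""
    context = problem  # c0 contains only x
    for _ in range(k):
        z = sample_until_stop(context)  # text up to the "```output" stop word
        answer = ANSWER_RE.search(z)
        if answer:                      # response parser R, simplified
            return answer.group(1).strip()
        code = PYTHON_RE.search(z)
        if code is None:
            continue                    # discard z and resample
        # Interpreter I: execute the code block, capturing stdout.
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code.group(1), {})
            result = buffer.getvalue()
        except Exception:
            result = traceback.format_exc()[-500:]  # truncated traceback
        context = context + z + result  # c_i = c_{i-1} ⊕ z_i ⊕ r_i
    return "<error>"                    # w_error: no answer within k rounds
```

A production version would run the code in a sandboxed subprocess with a timeout rather than calling `exec` in-process.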
In the case of SC-TIR, n samples are generated from TIR, then a filter, F, is applied to remove ill-formed responses, and finally self-consistency majority voting is applied.
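SC-TIR then reduces to a few lines. Here `run_tir` is any function that maps a problem to a candidate answer string (or an error marker), as described above:

```python
from collections import Counter

def sc_tir(problem: str, run_tir, n: int = 8) -> str:
    """Self-consistency over n TIR samples: filter ill-formed, majority vote."""
    candidates = [run_tir(problem) for _ in range(n)]
    # Filter F: drop errors and empty responses.
    valid = [c for c in candidates if c and c != "<error>"]
    if not valid:
        return "<error>"
    # Majority vote over the surviving candidates.
    return Counter(valid).most_common(1)[0][0]
```

Majority voting makes the final answer robust to occasional sampling or execution failures in individual TIR runs.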
Various 7B, 8B, and 70B parameter language models are compared on benchmarks, including GSM8K (grade school math), MATH (math problem solving), AMC 2023 (competition-level math), and AIME 2024 (competition-level math).
- NuminaMath with TIR achieves state-of-the-art (SoTA) performance among 7B and 8B parameter models.
- Models with TIR demonstrate significant improvements in problem-solving, especially on complex reasoning tasks.
- NuminaMath with TIR also performs competitively against larger (70B-parameter) and proprietary models like Claude 3.5 and GPT-4o, outperforming them on some benchmarks and approaching GPT-4o's performance on others.
NuminaMath-1.5 is the second iteration of the NuminaMath dataset, designed to provide high-quality post-training data for competition-level math problems. It contains approximately 900k problems with Chain-of-Thought (CoT) solutions.
The dataset is available on HuggingFace.
Problem Metadata: Includes answer, problem_type, and question_type metadata for all problems to ensure verifiable outputs.
- answer: Final answer, or special values like "proof" or "notfound".
- problem_type: Mathematical domain (Algebra, Geometry, Number Theory, etc.).
- question_type: Problem style (multiple-choice, proof, math word problem).
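This metadata makes it easy to select only problems whose answers can be checked automatically. A minimal sketch, using hypothetical in-memory records mirroring the fields above:

```python
# Hypothetical records with the metadata fields described above.
problems = [
    {"answer": "7", "problem_type": "Algebra", "question_type": "math word problem"},
    {"answer": "proof", "problem_type": "Geometry", "question_type": "proof"},
    {"answer": "notfound", "problem_type": "Number Theory", "question_type": "multiple-choice"},
]

# Keep only problems with a checkable final answer.
verifiable = [p for p in problems if p["answer"] not in ("proof", "notfound")]
print(len(verifiable))
```

The same predicate would apply unchanged when streaming the dataset from HuggingFace.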
New Data:
- Olympiads Reference: Manually parsed and verified problems and solutions from official websites of national Math Olympiads.
- Manually Curated Data: Competition problems in cn_contest, inequalities, and number_theory.
- Removed Data: The synthetic dataset synthetic_amc is removed due to performance issues.