This is the second time in recent months that a small model has performed as well as (or better than) the billion-parameter large models. Granted, math problems are a special case: they are largely quantifiable and verifiable.
“Unlike solutions relying on superior LLMs for data synthesis, rStar-Math leverages smaller language models (SLMs) with Monte Carlo Tree Search (MCTS) to establish a self-evolutionary process, iteratively generating higher-quality training data.”
Result: “4 rounds of self-evolution with millions of synthesized solutions for 747k math problems … it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%.”
Process reward modeling (PRM) provides fine-grained feedback on intermediate steps, which matters because incorrect intermediate steps significantly lower the quality of synthesized math data.
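A toy illustration (not from the paper) of why step-level feedback matters: a trajectory that stumbles onto the correct answer despite a flawed middle step gets a uniformly positive label under outcome-only scoring, whereas process-level scoring can single out the bad step. The trajectory and scores below are invented for illustration.

```python
# Toy contrast between outcome-level and step-level (process) reward labels.
# The trajectory and the scores are made up for illustration only.

trajectory = ["set up equation", "drop a sign (wrong)", "guess-correct final answer"]
final_answer_correct = True

# Outcome reward: every step inherits the final verdict, hiding the bad step.
outcome_labels = [float(final_answer_correct)] * len(trajectory)   # [1.0, 1.0, 1.0]

# Process reward: each step is judged on its own merit (hand-assigned here).
process_labels = [1.0, 0.0, 0.5]

print(list(zip(trajectory, outcome_labels, process_labels)))
```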
The SLM samples candidate nodes, each producing a one-step CoT and the corresponding Python code. Only nodes whose Python code executes successfully are retained, mitigating errors in intermediate steps. MCTS automatically assigns (self-annotates) a Q-value to each intermediate step based on its contribution: steps that contribute to more trajectories leading to the correct answer receive higher Q-values and are considered higher quality.
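A minimal, runnable sketch of that self-annotation idea, simplified from the paper's terminal-guided MCTS back-propagation: each step's Q-value is approximated as the fraction of rollouts passing through it that end in a correct final answer. The function and data names are illustrative, not the authors' code.

```python
from collections import defaultdict

def annotate_q_values(rollouts):
    """Toy self-annotation of step Q-values.

    `rollouts` is a list of (steps, is_correct) pairs, where `steps` is the
    sequence of intermediate reasoning steps in one MCTS trajectory and
    `is_correct` says whether its final answer matched the ground truth.
    Each step's Q-value is approximated as the fraction of trajectories
    containing it that end in a correct answer -- a simplification of the
    paper's terminal-guided back-propagation through the search tree.
    """
    hits = defaultdict(int)     # trajectories through the step that succeed
    visits = defaultdict(int)   # all trajectories through the step
    for steps, is_correct in rollouts:
        for step in steps:
            visits[step] += 1
            hits[step] += int(is_correct)
    return {step: hits[step] / visits[step] for step in visits}

# Example: step "s1" appears in one correct and one incorrect rollout -> Q = 0.5
rollouts = [
    (["s1", "s2", "s3"], True),
    (["s1", "s4"], False),
    (["s5", "s3"], True),
]
print(annotate_q_values(rollouts))  # {'s1': 0.5, 's2': 1.0, 's3': 1.0, 's4': 0.0, 's5': 1.0}
```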
The SLM also serves as a process preference model (PPM) that predicts reward labels for each math reasoning step. Although the Q-values are not precise, they can reliably distinguish positive (correct) steps from negative (irrelevant/incorrect) ones. Training on preference pairs with a pairwise ranking loss, instead of using Q-values directly as reward labels, eliminates the inherent noise and imprecision in stepwise reward assignment.
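A minimal sketch of such a pairwise ranking objective, in the spirit of a Bradley-Terry loss: the PPM is trained so that its score for the preferred (correct) step exceeds its score for the rejected step, and the Q-values are used only to select the pairs, never as regression targets. The scores below are made-up numbers, not real model outputs.

```python
import math

def pairwise_ranking_loss(score_positive, score_negative):
    """Bradley-Terry style pairwise loss: -log(sigmoid(s_pos - s_neg)).

    Pushes the PPM score of a correct step above that of an incorrect one,
    without treating noisy Q-values as exact reward targets.
    """
    return -math.log(1.0 / (1.0 + math.exp(-(score_positive - score_negative))))

# Preference pairs built from Q-values: highest-Q steps paired with lowest-Q steps.
pairs = [(2.3, -0.7), (1.1, 0.4)]  # (PPM score of positive step, of negative step)
loss = sum(pairwise_ranking_loss(p, n) for p, n in pairs) / len(pairs)
print(f"mean pairwise loss: {loss:.4f}")
```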
Paper on arXiv: https://arxiv.org/abs/2501.04519