Thinking Preference Optimization (ThinkPO) uses readily available or easily obtainable short CoT reasoning responses as rejected answers and long CoT responses as chosen answers for the same question. It then applies direct preference optimization to encourage the model to favor longer reasoning outputs.
The training process in Thinking Preference Optimization consists of two stages: a Reasoning SFT (Supervised Fine-Tuning) stage and a Reasoning DPO (Direct Preference Optimization) stage.
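To make the chosen/rejected pairing concrete, the sketch below shows what a single ThinkPO preference example could look like in the prompt/chosen/rejected format commonly used for DPO training; the field names and example text are illustrative, not taken from the paper's released data.

```python
# Illustrative ThinkPO preference example: the long CoT answer is "chosen",
# the short CoT answer to the same question is "rejected".
thinkpo_example = {
    "prompt": "If 3x + 5 = 20, what is x?",
    # Long chain-of-thought response (e.g., from a strong reasoning teacher) -> chosen
    "chosen": (
        "Let me work through this step by step. We have 3x + 5 = 20. "
        "Subtracting 5 from both sides gives 3x = 15. Dividing both sides by 3 "
        "gives x = 5. Checking: 3 * 5 + 5 = 20. The answer is x = 5."
    ),
    # Short response (e.g., from a smaller instruct model) -> rejected
    "rejected": "3x = 15, so x = 5.",
}
```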
In the Reasoning SFT stage, long-reasoning responses are collected for each question to construct the dataset Dsft. The base model is then fine-tuned on Dsft to acquire advanced reasoning capabilities, which prepares the model for the next stage.
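A minimal sketch of the Reasoning SFT stage using Hugging Face TRL is shown below. The base model name, dataset path, and hyperparameters are placeholders rather than the paper's exact setup, and argument names can vary slightly across TRL versions.

```python
# Hypothetical Reasoning SFT stage: fine-tune a base model on long-CoT responses (Dsft).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Dsft: each record's "text" field holds a question followed by the teacher's
# long reasoning response, formatted as a single training example.
dsft = load_dataset("json", data_files="dsft.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="sft-reasoning", num_train_epochs=3),
    train_dataset=dsft,
    processing_class=tokenizer,
)
trainer.train()
```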
In the second stage, the model is further encouraged to generate extended reasoning using the Direct Preference Optimization (DPO) technique. First, the long-reasoning data from the first stage is used as the chosen responses. Then, a smaller model with ordinary reasoning ability is used to generate shorter reasoning responses as rejected samples. To ensure data quality, both long and short reasoning responses undergo filtering, including correctness validation. This process yields the dataset Ddpo. Finally, the model trained in the first stage is fine-tuned on Ddpo using DPO, encouraging the model to generate longer outputs while improving its reasoning ability.
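The DPO stage can then be run on the SFT checkpoint with preference pairs of long (chosen) versus short (rejected) responses. The following is a rough sketch using TRL's DPOTrainer, again with placeholder paths and hyperparameters rather than the authors' exact configuration.

```python
# Hypothetical Reasoning DPO stage: push the SFT model toward longer reasoning.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

sft_checkpoint = "sft-reasoning"  # model produced by the SFT stage
model = AutoModelForCausalLM.from_pretrained(sft_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(sft_checkpoint)

# Ddpo: prompt / chosen (long CoT) / rejected (short CoT) triples.
ddpo = load_dataset("json", data_files="ddpo.jsonl", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="thinkpo-dpo", beta=0.1, num_train_epochs=1),
    train_dataset=ddpo,
    processing_class=tokenizer,
)
trainer.train()
```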
The dataset Dsft = {(q, o_long)}^N is based on the Bespoke-Stratos dataset. DeepSeek-R1 was used as the teacher reasoning model instead of QwQ-32B-Preview to generate the long reasoning responses o_long, and GPT-4o-mini is employed in place of Sky-Thought T1's parsing logic to filter out incorrect mathematical solutions.
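One way to implement the GPT-4o-mini correctness filter is to ask it to judge whether a generated solution reaches the known ground-truth answer. The prompt and helper below are only an assumed sketch of this step, not the authors' exact prompt or parsing logic.

```python
# Hypothetical correctness filter: use GPT-4o-mini to decide whether a generated
# solution reaches the ground-truth answer (in place of Sky-Thought T1 parsing).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def is_solution_correct(question: str, solution: str, ground_truth: str) -> bool:
    prompt = (
        "Question:\n" + question +
        "\n\nProposed solution:\n" + solution +
        "\n\nGround-truth answer: " + ground_truth +
        "\n\nDoes the proposed solution arrive at the ground-truth answer? Reply YES or NO."
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return reply.choices[0].message.content.strip().upper().startswith("YES")

# Keep only samples whose long reasoning response is judged correct, e.g.:
# dsft = [ex for ex in dsft if is_solution_correct(ex["question"], ex["o_long"], ex["answer"])]
```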
The dataset Ddpo = {(q, o_long, o_short)}^N was collected as follows: for each question q in Dsft, Qwen2.5-Math-7B-Instruct is used to generate a short reasoning response o_short, which is paired with the long reasoning response o_long from Dsft. Samples where Qwen2.5-Math-7B-Instruct's answer matched DeepSeek-R1's answer are retained, resulting in 8,080 samples. Additionally, 2,000 samples where Qwen2.5-Math-7B-Instruct's answer differed from DeepSeek-R1's but followed the correct response format are included to broaden the output distribution. All of these combined samples form the final dataset Ddpo.
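A possible construction of Ddpo along these lines is sketched below: generate a short response with Qwen2.5-Math-7B-Instruct for each SFT question, keep pairs whose final answers agree with DeepSeek-R1, and add a capped number of format-valid pairs whose answers disagree. The helper functions here are hypothetical, not the authors' code.

```python
# Hypothetical Ddpo construction: pair long (chosen) and short (rejected) responses.
# `generate_short_response`, `extract_answer`, and `has_valid_format` are assumed helpers.

MAX_MISMATCHED = 2000  # cap on format-valid but answer-mismatched pairs to keep

ddpo, mismatched_kept = [], 0
for ex in dsft:  # ex: {"question", "o_long", ...} from the SFT stage
    o_short = generate_short_response(ex["question"])  # Qwen2.5-Math-7B-Instruct
    short_ans = extract_answer(o_short)
    long_ans = extract_answer(ex["o_long"])            # DeepSeek-R1's answer

    pair = {"prompt": ex["question"], "chosen": ex["o_long"], "rejected": o_short}
    if short_ans == long_ans:
        ddpo.append(pair)                              # matched answers (~8,080 samples)
    elif has_valid_format(o_short) and mismatched_kept < MAX_MISMATCHED:
        ddpo.append(pair)                              # adds output diversity
        mismatched_kept += 1
```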
Effectiveness of ThinkPO
- The fine-tuned model achieves scores comparable to Bespoke-Stratos-7B and shows improvements on almost all datasets, validating the effectiveness of ThinkPO in enhancing LLM reasoning ability.
ThinkPO Can Continually Improve the Reasoning Ability of Public Distilled Models
- ThinkPO training improved the accuracy of both models across most of the five datasets tested.
- Bespoke-Stratos-7B showed accuracy improvements on all datasets except MATH500, with notable gains of around 5% on OlympiadBench Math and GPQA-Diamond.
- DeepSeek-R1-Distill-Qwen-7B showed consistent or slightly improved accuracy, apart from a decline on AIME2024. Its accuracy on MATH500 improved from 87.4% to 91.2%.
- The average response length increased for both models, suggesting enhanced reasoning capacity and aligning with the test-time scaling principle. DeepSeek-R1-Distill-Qwen-7B's response length increased by ~500 tokens on MATH500, while Bespoke-Stratos-7B's increased by ~1000 tokens.
ThinkPO Works for Different-Sized Models
- Increasing model size generally leads to improved accuracy across datasets after SFT.
- ThinkPO consistently improves performance across all model sizes (3B, 7B, 14B).
- ThinkPO yields a 1–2% accuracy improvement on MATH500 for all models.
- The 3B model shows improvement on all five datasets after ThinkPO, while the 7B and 14B models improve on four datasets.
- ThinkPO demonstrates generalizability and robustness by being effective across different model scales.
Thinking Preference Optimization 2502.13173