    Papers Explained 324: Thinking Preference Optimization | by Ritvik Rastogi | Mar, 2025

    By Team_AIBS News | March 6, 2025


    Thinking Preference Optimization (ThinkPO) uses readily available or easily obtainable short CoT reasoning responses as rejected answers and long CoT responses as chosen answers for the same question. It then applies direct preference optimization to encourage the model to favor longer reasoning outputs.
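
    A minimal sketch of how such a preference pair can be represented, assuming the common prompt/chosen/rejected schema used by DPO tooling (the paper's exact data format may differ):

```python
# Minimal sketch of a ThinkPO-style preference pair: the long chain-of-thought
# response is marked "chosen" and the short one "rejected" for the same question.
# The prompt/chosen/rejected field names follow common DPO tooling conventions,
# not necessarily the paper's exact schema.

def make_thinkpo_pair(question: str, long_cot: str, short_cot: str) -> dict:
    """Pair a long and a short reasoning trace for the same question."""
    return {
        "prompt": question,
        "chosen": long_cot,    # detailed step-by-step reasoning (preferred)
        "rejected": short_cot, # brief reasoning (dispreferred)
    }

pair = make_thinkpo_pair(
    question="What is 17 * 24?",
    long_cot="17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408. The answer is 408.",
    short_cot="The answer is 408.",
)
```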

    The training process in Thinking Preference Optimization consists of two stages: the Reasoning SFT (Supervised Fine-Tuning) stage and the Reasoning DPO (Direct Preference Optimization) stage.

    In the Reasoning SFT stage, long-reasoning responses are collected for each question to construct the dataset Dsft. The base model is then fine-tuned on Dsft to acquire advanced reasoning capabilities, which prepares the model for the next stage.
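
    A minimal sketch of this stage using Hugging Face TRL, assuming Dsft is stored as a hypothetical JSON Lines file whose rows carry a "text" column with the question followed by the long CoT answer; the base model name and training arguments are placeholders, not the paper's actual setup:

```python
# Sketch of the Reasoning SFT stage with TRL's SFTTrainer (illustrative only).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

# Hypothetical file: one row per question with a "text" field containing
# the prompt followed by the long chain-of-thought response.
dsft = load_dataset("json", data_files="dsft_long_cot.jsonl", split="train")

base = "Qwen/Qwen2.5-7B-Instruct"  # placeholder base model, not necessarily the paper's
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,   # older TRL releases use tokenizer= instead
    train_dataset=dsft,
    args=SFTConfig(output_dir="thinkpo-sft"),
)
trainer.train()
```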

    In the second stage, the model is further encouraged to generate extended reasoning using the Direct Preference Optimization (DPO) technique. First, the long-reasoning data from the initial stage is used as the chosen responses. Then, a smaller model with ordinary reasoning ability is used to generate shorter reasoning responses as the rejected samples. To ensure data quality, both the long and short reasoning responses undergo filtering, including correctness validation. This process results in the dataset Ddpo. Finally, the model trained in the first stage is fine-tuned on Ddpo using DPO, encouraging it to generate longer outputs while improving its reasoning ability.
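
    A corresponding sketch of the Reasoning DPO stage with TRL's DPOTrainer, assuming Ddpo rows carry "prompt", "chosen" (long CoT), and "rejected" (short CoT) fields; the file name and beta value are illustrative assumptions, not values from the paper:

```python
# Sketch of the Reasoning DPO stage: start from the SFT checkpoint and
# optimize it to prefer the long responses over the short ones.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

ddpo = load_dataset("json", data_files="ddpo_pairs.jsonl", split="train")  # hypothetical file

model = AutoModelForCausalLM.from_pretrained("thinkpo-sft")   # checkpoint from the SFT stage
tokenizer = AutoTokenizer.from_pretrained("thinkpo-sft")

trainer = DPOTrainer(
    model=model,                  # a frozen reference copy is created internally
    processing_class=tokenizer,   # older TRL releases use tokenizer= instead
    train_dataset=ddpo,
    args=DPOConfig(output_dir="thinkpo-dpo", beta=0.1),  # beta=0.1 is a common default, not from the paper
)
trainer.train()
```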

    The dataset Dsft = {(q, o_long)}^N is based on the Bespoke-Stratos dataset. DeepSeek-R1 is used as the teacher reasoning model instead of QwQ-32B-Preview to generate the long reasoning responses o_long, and GPT-4o-mini is employed in place of Sky-T1's parsing logic to filter out incorrect mathematical solutions.

    The dataset Ddpo = {(q, o_long, o_short)}^N is collected as follows: for each question q in Dsft, Qwen2.5-Math-7B-Instruct is used to generate a short reasoning response o_short, which is paired with the long reasoning response o_long from Dsft. The samples where Qwen2.5-Math-7B-Instruct's answer matches DeepSeek-R1's answer are retained, yielding 8,080 samples. In addition, 2,000 samples where Qwen2.5-Math-7B-Instruct's answer differs from DeepSeek-R1's but adheres to the correct response format are included to broaden the output distribution. All of these combined samples form the final dataset Ddpo.
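
    A sketch of that filtering logic; extract_final_answer is a hypothetical helper (here it pulls the last \boxed{...} value), and the paper's actual parsing and validation pipeline may differ:

```python
import re

def extract_final_answer(response: str):
    """Hypothetical helper: return the last \\boxed{...} value in a response, if any."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    return matches[-1] if matches else None

def build_ddpo(samples, max_mismatched=2000):
    """Keep pairs whose short answer matches the long (DeepSeek-R1) answer,
    plus a capped number of format-correct mismatches."""
    matched, mismatched = [], []
    for s in samples:  # each s: {"question", "long_cot", "short_cot"}
        short_ans = extract_final_answer(s["short_cot"])
        long_ans = extract_final_answer(s["long_cot"])
        if short_ans is None:  # drop short responses that break the expected answer format
            continue
        pair = {"prompt": s["question"], "chosen": s["long_cot"], "rejected": s["short_cot"]}
        if short_ans == long_ans:
            matched.append(pair)        # 8,080 such pairs in the paper
        elif len(mismatched) < max_mismatched:
            mismatched.append(pair)     # wrong answer but correct format, capped at 2,000
    return matched + mismatched
```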

    Effectiveness of ThinkPO

    • The fine-tuned model achieves scores comparable to Bespoke-Stratos-7B and shows improvements on almost all datasets, validating the effectiveness of ThinkPO in enhancing LLM reasoning ability.

    ThinkPO Can Continually Improve the Reasoning Ability of Public Distilled Models

    • ThinkPO training improved the accuracy of both models across most of the five datasets tested.
    • Bespoke-Stratos-7B showed accuracy improvements on all datasets except MATH500, with notable gains of around 5% on OlympiadBench Math and GPQA-Diamond.
    • DeepSeek-R1-Distill-Qwen-7B showed consistent or slightly improved accuracy, apart from a decline on AIME2024. Its accuracy on MATH500 improved from 87.4% to 91.2%.
    • The average response length increased for both models, suggesting enhanced reasoning capacity, in line with the test-time scaling principle. DeepSeek-R1-Distill-Qwen-7B's response length increased by ~500 tokens on MATH500, while Bespoke-Stratos-7B's increased by ~1000 tokens.

    ThinkPO Works for Different-Sized Models

    • Increasing model size generally leads to improved accuracy across datasets after SFT.
    • ThinkPO consistently improves performance across all model sizes (3B, 7B, 14B).
    • ThinkPO yields a 1–2% accuracy improvement on MATH500 for all models.
    • The 3B model shows improvement on all five datasets after ThinkPO, while the 7B and 14B models improve on four datasets.
    • ThinkPO demonstrates generalizability and robustness by being effective across different model scales.

    Thinking Preference Optimization 2502.13173



