    Papers Explained 324: Thinking Preference Optimization | by Ritvik Rastogi | Mar, 2025

    By Team_AIBS News | March 6, 2025


    Thinking Preference Optimization (ThinkPO) uses readily available or easily obtainable short CoT reasoning responses as rejected answers and long CoT responses as chosen answers for the same question. It then applies direct preference optimization to encourage the model to favor longer reasoning outputs.
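
    As a rough illustration, a single ThinkPO preference pair could look like the sketch below; the field names and the example question are placeholders for illustration, not the authors' actual data schema.

    # Minimal sketch of one ThinkPO preference pair: for the same question, the
    # long chain-of-thought response is "chosen" and the short one is "rejected",
    # so preference optimization pushes the model toward longer reasoning.
    preference_pair = {
        "prompt": "What is the sum of the first 100 positive integers?",
        # long CoT from a strong reasoning model (chosen)
        "chosen": (
            "Let me think step by step. Pairing 1 with 100, 2 with 99, and so on "
            "gives 50 pairs that each sum to 101, so the total is 50 * 101 = 5050. "
            "Answer: 5050"
        ),
        # short CoT from a smaller model (rejected)
        "rejected": "Using the formula n(n+1)/2 with n = 100 gives 5050. Answer: 5050",
    }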

    The training process in Thinking Preference Optimization consists of two stages: a Reasoning SFT (Supervised Fine-Tuning) stage and a Reasoning DPO (Direct Preference Optimization) stage.

    In the Reasoning SFT stage, long-reasoning responses are collected for each question to construct the dataset Dsft. The base model is then fine-tuned on Dsft to acquire advanced reasoning capabilities, which prepares the model for the next stage.
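
    A minimal sketch of this SFT stage is shown below, assuming the Hugging Face trl library and Qwen2.5-7B-Instruct as a stand-in base model; the exact model, data format, and hyperparameters here are assumptions, and the SFTTrainer API differs slightly across trl versions.

    from datasets import Dataset
    from trl import SFTConfig, SFTTrainer

    # D_sft: each example concatenates a question with its long chain-of-thought response.
    d_sft = Dataset.from_list([
        {"text": "Question: <math problem>\nAnswer: <long chain-of-thought reasoning> Final answer: <answer>"},
        # ... one entry per (q, o_long) pair
    ])

    trainer = SFTTrainer(
        model="Qwen/Qwen2.5-7B-Instruct",   # placeholder base model, not necessarily the paper's
        args=SFTConfig(output_dir="thinkpo-sft", num_train_epochs=3),
        train_dataset=d_sft,
    )
    trainer.train()  # produces the SFT model used as the starting point for the DPO stage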

    In the second stage, the model is further encouraged to generate extended reasoning using the Direct Preference Optimization (DPO) method. First, the long-reasoning data from the initial stage is used as the chosen responses. Then, a smaller model with ordinary reasoning ability is used to generate shorter reasoning responses as rejected samples. To ensure data quality, both long and short reasoning responses undergo filtering, including correctness validation. This process results in the dataset Ddpo. Finally, the model trained in the first stage is fine-tuned on Ddpo using DPO, encouraging it to generate longer outputs while improving its reasoning ability.
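
    For reference, the objective optimized in this stage is the standard DPO loss. The PyTorch sketch below is a generic implementation of that loss rather than the authors' code, and it assumes the log-probabilities have already been summed over the response tokens.

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps: torch.Tensor,
                 policy_rejected_logps: torch.Tensor,
                 ref_chosen_logps: torch.Tensor,
                 ref_rejected_logps: torch.Tensor,
                 beta: float = 0.1) -> torch.Tensor:
        # Standard DPO loss: reward the policy for preferring the chosen (long CoT)
        # response over the rejected (short CoT) one, relative to the frozen
        # reference model (here, the stage-one SFT model). Each tensor holds
        # per-example sums of token log-probabilities.
        chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
        # -log(sigmoid(margin)) is minimized when the chosen response is favored.
        return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()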

    The dataset Dsft = {(q, o_long)}_N is based on the Bespoke-Stratos dataset: DeepSeek-R1 was used as the teacher reasoning model instead of QwQ-32B-Preview to generate the long reasoning responses o_long, and GPT-4o-mini is employed in place of Sky-Thought T1's parsing logic to filter out incorrect mathematical solutions.

    The dataset Ddpo = {(q, o_long, o_short)}_N was collected in the following manner: for each question q in Dsft, Qwen2.5-Math-7B-Instruct is used to generate a short reasoning response o_short, which is paired with the long reasoning response o_long from Dsft. Samples where Qwen2.5-Math-7B-Instruct's answer matches DeepSeek-R1's answer are retained, resulting in 8,080 samples. Additionally, 2,000 samples where Qwen2.5-Math-7B-Instruct's answer differs from DeepSeek-R1's but follows the correct response format are included to add diversity to the output distribution. All of these combined samples form the final dataset Ddpo.
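
    The pair construction and filtering described above could be sketched as follows; generate_short, extract_answer, and is_valid_format are hypothetical helpers standing in for short-response generation (e.g. with Qwen2.5-Math-7B-Instruct), answer parsing, and the format check.

    def build_ddpo(d_sft, generate_short, extract_answer, is_valid_format,
                   max_mismatched=2000):
        # Sketch of the D_dpo construction: keep pairs where the short model's
        # final answer matches the long (DeepSeek-R1) answer, plus up to
        # `max_mismatched` pairs that differ in answer but follow the required format.
        matched, mismatched = [], []
        for q, o_long in d_sft:              # d_sft: iterable of (question, long CoT) pairs
            o_short = generate_short(q)      # e.g. sampled from Qwen2.5-Math-7B-Instruct
            pair = {"prompt": q, "chosen": o_long, "rejected": o_short}
            if extract_answer(o_short) == extract_answer(o_long):
                matched.append(pair)         # 8,080 such samples in the paper
            elif is_valid_format(o_short) and len(mismatched) < max_mismatched:
                mismatched.append(pair)      # plus 2,000 format-correct mismatches
        return matched + mismatched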

    Effectiveness of ThinkPO

    • The fine-tuned model achieves scores comparable to Bespoke-Stratos-7B and shows improvements on almost all datasets, validating the effectiveness of ThinkPO in enhancing LLM reasoning ability.

    ThinkPO Can Continually Improve the Reasoning Ability of Public Distilled Models

    • ThinkPO training improved the accuracy of both models across most of the five datasets tested.
    • Bespoke-Stratos-7B showed accuracy improvements on all datasets except MATH500, with notable gains of around 5% on OlympiadBench Math and GPQA-Diamond.
    • DeepSeek-R1-Distill-Qwen-7B showed consistent or slightly improved accuracy, apart from a decline on AIME2024. Its accuracy on MATH500 improved from 87.4% to 91.2%.
    • The average response length increased for both models, suggesting enhanced reasoning capacity, in line with the test-time scaling principle. DeepSeek-R1-Distill-Qwen-7B's response length increased by ~500 tokens on MATH500, while Bespoke-Stratos-7B's increased by ~1000 tokens.

    ThinkPO Works for Different-Sized Models

    • Increasing model size generally leads to improved accuracy across datasets after SFT.
    • ThinkPO consistently improves performance across all model sizes (3B, 7B, 14B).
    • ThinkPO leads to a 1–2% accuracy improvement on MATH500 for all models.
    • The 3B model shows improvement on all five datasets after ThinkPO, while the 7B and 14B models improve on four datasets.
    • ThinkPO demonstrates generalizability and robustness by being effective across different model scales.

    Thinking Preference Optimization (arXiv: 2502.13173)


