
    Revolutionizing reinforcement learning for reasoning tasks | by vivek varikuti | Aug, 2025



Implementing Group Sequence Policy Optimization

When training large language models for reasoning tasks, most practitioners reach for Proximal Policy Optimization (PPO) or its variants. While these methods work well for many applications, they suffer from a fundamental limitation: they compute importance ratios at the token level, leading to instability when dealing with complex reasoning sequences.

Consider a mathematical problem where the model needs to show step-by-step reasoning. Traditional PPO evaluates each token independently, missing the coherent structure of the entire solution. This token-level approach often results in training instability, especially when the model needs to maintain logical consistency across long sequences.

Enter Group Sequence Policy Optimization (GSPO)

The Qwen Team at Alibaba Inc., led by researchers Chujie Zheng, Shixuan Liu, and colleagues, introduced Group Sequence Policy Optimization (GSPO) to address these limitations. Their key insight: evaluate entire response sequences rather than individual tokens.

GSPO introduces three fundamental improvements:

1. Sequence-level importance ratios: Instead of computing ratios for each token, GSPO evaluates the likelihood of complete response sequences (sketched just after this list)
2. Length normalization: Applies normalization to handle variable sequence lengths fairly
3. Enhanced clipping stability: Maintains training stability even under extreme clipping rates (50–75% vs 2–3% for traditional methods)
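In the paper's notation (lightly paraphrased), the sequence-level importance ratio for a response y_i to a query x is

s_i(θ) = ( π_θ(y_i | x) / π_θ_old(y_i | x) )^(1 / |y_i|)

that is, the likelihood ratio of the whole response, geometrically averaged over its length. The implementation below works in log space, where this averaging becomes a simple division by response length.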

The Implementation Journey

When I first read the GSPO paper, I was intrigued by its theoretical elegance but puzzled about practical implementation, especially for large-scale training on modern hardware. The paper provided the algorithmic foundation, but translating it into production-ready code optimized for NVIDIA H100 GPUs required addressing several challenges.

Core Algorithm Implementation

The heart of GSPO lies in computing sequence-level log probabilities:

```python
import torch
import torch.nn.functional as F

def compute_sequence_log_prob(self, model, input_ids, attention_mask,
                              response_start_idx, response_end_idx):
    """Compute the log probability of the entire response sequence."""
    with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits

    # Focus on response tokens only; labels are shifted by one
    # (logits at position t predict the token at position t+1)
    response_logits = logits[:, response_start_idx:response_end_idx, :]
    response_labels = input_ids[:, response_start_idx+1:response_end_idx+1]

    # Per-token log probabilities of the tokens actually generated
    log_probs = F.log_softmax(response_logits, dim=-1)
    token_log_probs = log_probs.gather(
        dim=-1, index=response_labels.unsqueeze(-1)).squeeze(-1)

    # Sum over the sequence (the key difference from token-level methods)
    sequence_log_prob = token_log_probs.sum(dim=1)

    return sequence_log_prob
```

    H100 Optimization Challenges

Implementing GSPO for H100 GPUs revealed several optimization opportunities:

Memory Efficiency: The H100's 80GB of HBM3 allows for larger batch sizes, but GSPO's sequence-level computations require careful memory management. I implemented gradient checkpointing and 8-bit optimizers using bitsandbytes to maximize memory utilization.
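As a rough sketch of that memory setup (assuming a Hugging Face `transformers` model; the optimizer choice and learning rate here are illustrative):

```python
import bitsandbytes as bnb

# Gradient checkpointing trades recomputation for activation memory
model.gradient_checkpointing_enable()

# 8-bit AdamW keeps optimizer state quantized, cutting its footprint roughly 4x
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-7)
```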

Mixed Precision Training: The H100's Transformer Engine benefits significantly from bfloat16, but the sequence-level computations needed careful attention to numerical stability.
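One common pattern for this (a sketch of the general technique, not necessarily verbatim from the repo) is to run the forward pass under bf16 autocast but upcast before the log-softmax reduction:

```python
with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits

# Upcast to float32 before the reduction so summed log-probs don't lose precision
log_probs = F.log_softmax(logits.float(), dim=-1)
```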

Length Normalization: The key insight here is normalizing by response length:

```python
def compute_importance_ratio(self, current_log_prob, old_log_prob, response_lengths):
    """Compute length-normalized importance ratios."""
    # Length normalization -- crucial for sequence-level stability
    log_ratio = (current_log_prob - old_log_prob) / response_lengths.clamp(min=1.0)

    # Clamp before exponentiating to avoid overflow
    importance_ratio = torch.exp(log_ratio.clamp(min=-10, max=10))

    return importance_ratio
```
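With the ratio in hand, the loss follows the familiar PPO-style clipped surrogate, applied per sequence rather than per token. A minimal sketch (assuming a per-response `advantages` tensor; the tight clip ranges mirror the config shown later):

```python
def gspo_loss(importance_ratio, advantages, left_clip=0.002, right_clip=0.002):
    """PPO-style clipped surrogate objective over sequence-level ratios."""
    unclipped = importance_ratio * advantages
    clipped = torch.clamp(importance_ratio,
                          1.0 - left_clip, 1.0 + right_clip) * advantages
    # Pessimistic minimum of the two, negated for gradient descent
    return -torch.min(unclipped, clipped).mean()
```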

Validation and Results

To validate the implementation, I ran comprehensive comparisons against PPO and GRPO baselines using the same model (DeepSeek-R1-Distill-Qwen-1.5B) and datasets.

The results confirmed GSPO's theoretical advantages:

| Method | Reward Improvement | Clipping Rate | Training Stability |
| ------ | ------------------ | ------------- | ------------------ |
| GSPO   | +1.4%              | 50–75%        | Stable             |
| GRPO   | -3.8%              | 0.01%         | Unstable           |
| PPO    | -2.9%              | 0.02%         | Degraded           |

The most striking finding was GSPO's ability to maintain training stability under extreme clipping rates. While PPO and GRPO became unstable even with minimal clipping, GSPO continued training effectively with 50–75% of importance ratios being clipped.

Reasoning Performance

On reasoning benchmarks, the sequence-level approach showed clear advantages:

ZebraLogic Reasoning: 60.0% accuracy on logical puzzle tasks
Custom Math Problems: 75.8% accuracy on step-by-step mathematical reasoning
Baseline Improvement: +20% performance improvement over PPO

Why Sequence-Level Works Better

The success of GSPO on reasoning tasks makes intuitive sense. When solving a mathematical problem, the coherence of the entire solution matters more than individual token predictions. A model that generates "2 + 2 = 5" should receive negative feedback for the entire sequence, not just the final token.

    Implementation Challenges

Memory Management: Sequence-level computations require storing additional tensors. Careful tensor lifecycle management and strategic use of `torch.cuda.empty_cache()` proved essential.

Numerical Stability: Length normalization helps, but extreme variations in sequence length can still cause issues. Robust clamping and NaN/Inf detection proved crucial; a minimal guard is sketched below.
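One way to wire in such a guard, shown here as a hypothetical training step that reuses the `gspo_loss` sketch above:

```python
loss = gspo_loss(importance_ratio, advantages)
if not torch.isfinite(loss):
    # Drop batches that produce NaN/Inf instead of corrupting the weights
    optimizer.zero_grad(set_to_none=True)
else:
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```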

Old Model Updates: GSPO requires maintaining a reference "old model" for importance ratio computation. How frequently it is refreshed significantly impacts training dynamics.
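A typical pattern (the names here are hypothetical, not the repo's API) is to refresh a frozen copy of the policy every few optimizer steps:

```python
# Refresh the frozen reference policy used to compute old_log_prob
if step % update_every == 0:
    old_model.load_state_dict(model.state_dict())
    old_model.eval()  # the reference copy is never trained
```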

    Hyperparameter Sensitivity

GSPO showed different sensitivity patterns compared to PPO:

Learning Rate: More tolerant of higher learning rates due to sequence-level stability
Clipping Range: Could use tighter ranges (±0.002) effectively
Group Size: Optimal at 4, consistent with the original paper

Hardware Requirements

For practical deployment, consider these specifications:

Minimum: 24GB VRAM (RTX 4090) for inference and lightweight training
Recommended: 80GB VRAM (H100, A100) for full training workflows
Training Time: 4–8 hours for a full run on an H100

Integration with Existing Workflows

The implementation provides a drop-in replacement for standard PPO training:

```python
from gspo import GSPOTrainer, GSPOConfig

# Standard configuration
config = GSPOConfig(
    learning_rate=1e-7,
    left_clip_range=0.002,
    right_clip_range=0.002,
    group_size=4
)

# Initialize the trainer (same interface as PPO)
trainer = GSPOTrainer(model, tokenizer, config)

# Training proceeds as usual
trainer.train_step(queries, reward_function)
```

Future Directions and Research Opportunities

The GSPO implementation opens several research avenues:

1. Multi-Scale Sequence Optimization: Combining token-level and sequence-level ratios
2. Dynamic Length Normalization: Adaptive normalization based on sequence complexity
3. Hierarchical Sequence Structures: Applying GSPO to structured reasoning tasks
4. Cross-Modal Applications: Extending sequence-level optimization to vision-language tasks

Implementing GSPO from paper to production reinforced a key lesson in AI research: theoretical elegance often requires careful engineering to realize practical benefits. The sequence-level approach represents a fundamental shift in how we think about policy optimization for reasoning tasks.

The implementation is fully open-sourced, including:
Full codebase: [GitHub Repository](https://github.com/vivekvar-dl/gpso)
Trained model: [HuggingFace Model](https://huggingface.co/vivekvar/GSPO-DeepSeek-R1-Distill-Qwen-1.5B)
Training logs: [Wandb Experiments](https://wandb.ai/domainluther1234-usha-rama-college-of-engineering-and-te/gspo-robust-training/runs/pmyrt2ul/overview)

For researchers working on reasoning tasks, GSPO offers a compelling alternative to traditional policy optimization methods. The combination of theoretical soundness and practical effectiveness makes it a valuable addition to the reinforcement learning toolkit.

This work implements the GSPO algorithm developed by Chujie Zheng, Shixuan Liu, and colleagues at the Qwen Team, Alibaba Inc. Special thanks to the original authors for making their research publicly available and contributing to the advancement of policy optimization methods.

Want to try GSPO in your research? The complete implementation, documentation, and trained models are available open-source. Contributions and feedback from the community are welcome as we continue improving sequence-level optimization for reasoning tasks.

About the Author: Early-career researcher focused on reinforcement learning and large language model optimization. Always eager to learn from the community and improve implementation techniques.


