
    How LLMs Work: Reinforcement Learning, RLHF, DeepSeek R1, OpenAI o1, AlphaGo

    By Team_AIBS News, February 28, 2025 (10 min read)


    Welcome to part 2 of my LLM deep dive. If you’ve not read Part 1, I highly encourage you to check it out first.

    Previously, we covered the first two major stages of training an LLM:

    1. Pre-training: Learning from massive datasets to form a base model.
    2. Supervised fine-tuning (SFT): Refining the model with curated examples to make it useful.

    Now, we’re diving into the next major stage: Reinforcement Learning (RL). While pre-training and SFT are well-established, RL is still evolving but has become a critical part of the training pipeline.

    I’ve taken reference from Andrej Karpathy’s widely popular 3.5-hour YouTube video. Andrej is a founding member of OpenAI, so his insights are gold. You get the idea.

    Let’s go 🚀

    What’s the purpose of reinforcement learning (RL)?

    Humans and LLMs process information differently. What’s intuitive for us, like basic arithmetic, may not be for an LLM, which only sees text as sequences of tokens. Conversely, an LLM can generate expert-level responses on complex topics simply because it has seen enough examples during training.

    This difference in cognition makes it challenging for human annotators to provide the “perfect” set of labels that consistently guide an LLM toward the right answer.

    RL bridges this gap by allowing the model to learn from its own experience.

    Instead of relying solely on explicit labels, the model explores different token sequences and receives feedback (reward signals) on which outputs are most useful. Over time, it learns to align better with human intent.

    Intuition behind RL

    LLMs are stochastic, meaning their responses aren’t fixed. Even with the same prompt, the output varies because it’s sampled from a probability distribution.
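
    To make the sampling idea concrete, here is a minimal sketch in plain Python. The four-token vocabulary, the scores, and the helper names (softmax, sample_token) are all made up for illustration; a real LLM does the same thing over a vocabulary of many thousands of tokens.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(vocab, logits, rng, temperature=1.0):
    """Draw one token at random, weighted by its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical next-token scores after the prompt "2 + 2 =".
vocab = ["4", "four", "5", "22"]
logits = [3.0, 1.5, 0.5, 0.2]

rng = random.Random(0)
samples = [sample_token(vocab, logits, rng) for _ in range(10)]
print(samples)  # the exact mix depends on the seed
```

    Because each call draws from the distribution instead of always taking the top-scoring token, repeated calls with the same prompt can give different outputs.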

    We can harness this randomness by generating thousands or even millions of possible responses in parallel. Think of it as the model exploring different paths, some good, some bad. Our goal is to encourage it to take the better paths more often.

    To do this, we train the model on the sequences of tokens that lead to better outcomes. Unlike supervised fine-tuning, where human experts provide labeled data, reinforcement learning allows the model to learn from itself.

    The model discovers which responses work best, and after each training step, we update its parameters. Over time, this makes the model more likely to produce high-quality answers when given similar prompts in the future.
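
    A toy sketch of that loop, under loud assumptions: the three candidate responses, their rewards, and the learning rate are invented, and the update is a plain REINFORCE-style rule that I picked for illustration (no specific algorithm has been named at this point in the article). It is not how a production LLM is trained, only the same idea in miniature.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical setup: three candidate responses to one prompt,
# with made-up rewards (only the last response is really good).
rewards = [0.0, 0.2, 1.0]
scores = [0.0, 0.0, 0.0]   # the "parameters" of this toy policy
learning_rate = 0.5
rng = random.Random(0)

for step in range(200):
    probs = softmax(scores)
    # Sample a response from the current policy.
    choice = rng.choices(range(3), weights=probs, k=1)[0]
    # Compare its reward with the average reward the policy currently expects.
    baseline = sum(p * r for p, r in zip(probs, rewards))
    advantage = rewards[choice] - baseline
    # REINFORCE-style update: raise the score of a sampled response that
    # did better than expected, lower it if it did worse.
    for i in range(3):
        grad_log_prob = (1.0 if i == choice else 0.0) - probs[i]
        scores[i] += learning_rate * advantage * grad_log_prob

final_probs = softmax(scores)
print([round(p, 3) for p in final_probs])
```

    After training, most of the probability mass should sit on the highest-reward response, which is the "take the better paths more often" behaviour described above.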

    But how do we determine which responses are best? And how much RL should we do? The details are tricky, and getting them right is not trivial.

    RL is not “new”: it can surpass human expertise (AlphaGo, 2016)

    A great example of RL’s power is DeepMind’s AlphaGo, the first AI to defeat a professional Go player and later surpass human-level play.

    In the 2016 Nature paper (graph below), when a model was trained purely by SFT (giving the model tons of good examples to imitate), the model was able to reach human-level performance, but never surpass it.

    The dotted line represents the performance of Lee Sedol, one of the best Go players in the world.

    This is because SFT is about replication, not innovation. It doesn’t allow the model to discover new strategies beyond human knowledge.

    However, RL enabled AlphaGo to play against itself, refine its strategies, and ultimately exceed human expertise (blue line).

    Image taken from AlphaGo 2016 paper

    RL represents an exciting frontier in AI, where models can explore strategies beyond human imagination when we train them on a diverse and challenging pool of problems to refine their thinking strategies.

    RL foundations recap

    Let’s quickly recap the key components of a typical RL setup:

    Image by author
    • Agent: The learner or decision maker. It observes the current situation (state), chooses an action, and then updates its behaviour based on the outcome (reward).
    • Environment: The external system in which the agent operates.
    • State: A snapshot of the environment at a given step t.

    At each timestep, the agent performs an action in the environment that changes the environment’s state to a new one. The agent also receives feedback indicating how good or bad the action was.

    This feedback is called a reward, and is represented in numerical form. A positive reward encourages that behaviour, and a negative reward discourages it.

    By using feedback from different states and actions, the agent gradually learns the optimal strategy to maximise the total reward over time.
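
    The agent, environment, state, action and reward loop can be sketched in a few lines. Everything here is hypothetical: CorridorEnv is a made-up four-cell corridor, and the two policies are hand-written stand-ins for a learned agent.

```python
import random

class CorridorEnv:
    """Hypothetical environment: walk from position 0 to the goal at position 3."""

    def __init__(self):
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action is -1 (left) or +1 (right); the position stays within [0, 3].
        self.position = min(3, max(0, self.position + action))
        done = self.position == 3
        # Small penalty for every step, bonus for reaching the goal.
        reward = 1.0 if done else -0.1
        return self.position, reward, done

def run_episode(env, policy, max_steps=50):
    """Run one episode and return the total reward collected."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # agent chooses an action
        state, reward, done = env.step(action)   # environment responds
        total_reward += reward
        if done:
            break
    return total_reward

rng = random.Random(0)

def random_policy(state):
    return rng.choice([-1, 1])

def always_right(state):
    return 1

print(run_episode(CorridorEnv(), random_policy))
print(run_episode(CorridorEnv(), always_right))
```

    The always_right policy reaches the goal in three steps, which is the best total reward this toy environment allows; a learning agent would have to discover that strategy from the reward signal alone.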

    Policy

    The policy is the agent’s strategy. If the agent follows a good policy, it will consistently make good decisions, leading to higher rewards over many steps.

    In mathematical terms, it’s a function that determines the probability of different outputs for a given state: πθ(a|s).

    Value function

    An estimate of how good it is to be in a certain state, considering the long-term expected reward. For an LLM, the reward might come from human feedback or a reward model.
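
    For readers who prefer symbols, one standard textbook way to write the value function is sketched below. The discount factor γ and the per-step reward r are assumptions on my part; the article itself does not define them.

```latex
V^{\pi_\theta}(s) = \mathbb{E}_{\pi_\theta}\left[ \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1} \;\middle|\; s_t = s \right], \qquad 0 \le \gamma \le 1
```

    In words: start in state s, keep following the policy πθ, and add up the rewards you expect to collect, with later rewards counting a little less when γ is below 1.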

    Actor-Critic architecture

    It’s a popular RL setup that combines two components:

    1. Actor: Learns and updates the policy (πθ), deciding which action to take in each state.
    2. Critic: Evaluates the value function (V(s)) to give feedback to the actor on whether its chosen actions are leading to good outcomes.

    How it works:

    • The actor picks an action based on its current policy.
    • The critic evaluates the outcome (reward + next state) and updates its value estimate.
    • The critic’s feedback helps the actor refine its policy so that future actions lead to higher rewards.
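
    The three steps above can be sketched on a deliberately tiny problem. This is an illustration under stated assumptions, not a reference implementation: there is a single state, two actions with made-up rewards, and every episode ends after one step, so the "next state" term drops out of the critic's error.

```python
import math
import random

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical one-state task with two actions and made-up rewards.
action_rewards = [0.0, 1.0]

actor_prefs = [0.0, 0.0]   # actor parameters: one preference per action
critic_value = 0.0         # critic's estimate V(s) for the single state
actor_lr, critic_lr = 0.5, 0.1
rng = random.Random(0)

for _ in range(200):
    probs = softmax(actor_prefs)
    # 1. The actor picks an action from its current policy.
    action = rng.choices([0, 1], weights=probs, k=1)[0]
    reward = action_rewards[action]
    # 2. The critic compares the outcome with what it expected
    #    and moves its value estimate toward the observed reward.
    td_error = reward - critic_value
    critic_value += critic_lr * td_error
    # 3. The critic's feedback (the error) nudges the actor's policy.
    for a in range(2):
        grad_log_prob = (1.0 if a == action else 0.0) - probs[a]
        actor_prefs[a] += actor_lr * td_error * grad_log_prob

policy_probs = softmax(actor_prefs)
print([round(p, 3) for p in policy_probs], round(critic_value, 3))
```

    The key difference from a critic-free update is that the baseline here is learned (critic_value) instead of being computed exactly, which is what costs the extra model in large-scale setups.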

    Putting it all together for LLMs

    The state can be the current text (prompt or conversation), and the action can be the next token to generate. A reward model (e.g. human feedback) tells the model how good or bad its generated text is.

    The policy is the model’s strategy for picking the next token, while the value function estimates how beneficial the current text context is, in terms of eventually producing high-quality responses.
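
    Here is a rough sketch of that mapping in code. The names toy_policy, toy_reward and rollout are hypothetical, the "policy" just picks tokens uniformly at random, and the reward rule is invented; the point is only to show where state, action and reward sit when text is generated token by token.

```python
import random

VOCAB = ["yes", "no", "maybe", "<eos>"]

def toy_policy(state, rng):
    """Hypothetical stand-in for an LLM: picks the next token uniformly at random."""
    return rng.choice(VOCAB)

def toy_reward(tokens):
    """Hypothetical reward: prefers short responses that contain 'yes'."""
    return (1.0 if "yes" in tokens else 0.0) - 0.1 * len(tokens)

def rollout(prompt, rng, max_new_tokens=5):
    """Generate one response and score it once generation has finished."""
    state = list(prompt)                 # state = the text so far
    response = []
    for _ in range(max_new_tokens):
        action = toy_policy(state, rng)  # action = the next token
        if action == "<eos>":
            break
        response.append(action)
        state.append(action)             # new state = old state + chosen token
    return response, toy_reward(response)

rng = random.Random(0)
response, reward = rollout(["Is", "RL", "useful", "?"], rng)
print(response, reward)
```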

    DeepSeek-R1 (published 22 Jan 2025)

    To highlight RL’s importance, let’s explore DeepSeek-R1, a reasoning model achieving top-tier performance while remaining open-source. The paper introduced two models: DeepSeek-R1-Zero and DeepSeek-R1.

    • DeepSeek-R1-Zero was trained solely via large-scale RL, skipping supervised fine-tuning (SFT).
    • DeepSeek-R1 builds on it, addressing encountered challenges.

    Deepseek R1 is one of the most amazing and impressive breakthroughs I’ve ever seen — and as open source, a profound gift to the world. 🤖🫡

    — Marc Andreessen 🇺🇸 (@pmarca) January 24, 2025

    Let’s dive into some of these key points.

    1. RL algo: Group Relative Policy Optimisation (GRPO)

    One key game-changing RL algorithm is Group Relative Policy Optimisation (GRPO), a variant of the widely popular Proximal Policy Optimisation (PPO). GRPO was introduced in the DeepSeekMath paper in Feb 2024.

    Why GRPO over PPO?

    PPO struggles with reasoning tasks due to:

    1. Dependency on a critic model.
      PPO needs a separate critic model, effectively doubling memory and compute.
      Training the critic can be complex for nuanced or subjective tasks.
    2. High computational cost, as RL pipelines demand substantial resources to evaluate and optimise responses.
    3. Absolute reward evaluations
      When you rely on an absolute reward, meaning there’s a single standard or metric to judge whether an answer is “good” or “bad”, it can be hard to capture the nuances of open-ended, diverse tasks across different reasoning domains.

    How GRPO addressed these challenges:

    GRPO eliminates the critic model by using relative evaluation: responses are compared within a group rather than judged by a fixed standard.

    Imagine students solving a problem. Instead of a teacher grading them individually, they compare answers, learning from each other. Over time, performance converges toward higher quality.
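
    In code, "comparing within a group" can be as simple as normalising each response's reward by the group's mean and standard deviation. The sketch below assumes the population standard deviation and a small epsilon to avoid dividing by zero; both are my choices for illustration, and the reward values are made up.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Score each response relative to the other responses in its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # population std (an assumption here)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four answers sampled for the same question.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # roughly [1.0, -1.0, -1.0, 1.0]
```

    Answers that beat the group average get a positive advantage and the rest get a negative one, so no separate critic is needed to supply a baseline.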

    How does GRPO fit into the whole training process?

    GRPO modifies how loss is calculated while keeping other training steps unchanged:

    1. Gather data (queries + responses)
      – For LLMs, queries are like questions
      – The old policy (older snapshot of the model) generates several candidate answers for each query
    2. Assign rewards: each response in the group is scored (the “reward”).
    3. Compute the GRPO loss
      Traditionally, you’d compute a loss, which shows the deviation between the model prediction and the true label.
      In GRPO, however, you measure:
      a) How likely is the new policy to produce past responses?
      b) Are those responses relatively better or worse?
      c) Apply clipping to prevent extreme updates.
      This yields a scalar loss.
    4. Back propagation + gradient descent
      – Back propagation calculates how each parameter contributed to loss
      – Gradient descent updates those parameters to reduce the loss
      – Over many iterations, this gradually shifts the new policy to favor higher reward responses
    5. Update the old policy occasionally to match the new policy.
      This refreshes the baseline for the next round of comparisons.
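
    Step 3 can be sketched as a clipped surrogate loss. This is a simplified illustration, not the full objective from the paper: it works on one log-probability per response (real implementations work per token), it leaves out the KL penalty term, and the function name and the clip_eps value of 0.2 are assumptions.

```python
import math

def clipped_surrogate_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Simplified clipped policy loss over a group of responses."""
    terms = []
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        # a) How much more (or less) likely is the new policy to produce
        #    this past response than the old policy was?
        ratio = math.exp(new_lp - old_lp)
        # c) Clip the ratio so a single update cannot move the policy too far.
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # b) Weight by the relative advantage and keep the more cautious term.
        terms.append(min(ratio * adv, clipped * adv))
    # Negate, because optimisers minimise and we want to maximise the objective.
    return -sum(terms) / len(terms)

# Hypothetical log-probabilities and group-relative advantages for two responses.
old_logprobs = [-2.0, -2.0]
new_logprobs = [-1.5, -2.5]
advantages = [1.0, -1.0]
print(clipped_surrogate_loss(new_logprobs, old_logprobs, advantages))
```

    The result is the single scalar that back propagation and gradient descent then work on in step 4.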

    2. Chain of thought (CoT)

    Traditional LLM training follows pre-training → SFT → RL. However, DeepSeek-R1-Zero skipped SFT, allowing the model to directly explore CoT reasoning.

    Like humans thinking through a tough question, CoT enables models to break problems into intermediate steps, boosting complex reasoning capabilities. OpenAI’s o1 model also leverages this, as noted in its September 2024 report: o1’s performance improves with more RL (train-time compute) and more reasoning time (test-time compute).

    DeepSeek-R1-Zero exhibited reflective tendencies, autonomously refining its reasoning.

    A key graph (below) in the paper showed increased thinking during training, leading to longer (more tokens), more detailed and better responses.

    Image taken from DeepSeek-R1 paper

    Without explicit programming, it began revisiting past reasoning steps, improving accuracy. This highlights chain-of-thought reasoning as an emergent property of RL training.

    The model also had an “aha moment” (below), a fascinating example of how RL can lead to unexpected and sophisticated outcomes.

    Image taken from DeepSeek-R1 paper

    Note: Unlike DeepSeek-R1, OpenAI doesn’t show the full, exact reasoning chains of thought in o1, as they are concerned about a distillation risk, where someone comes in and tries to imitate those reasoning traces and recover much of the reasoning performance just by imitating. Instead, o1 only shows summaries of these chains of thought.

    Reinforcement learning with Human Feedback (RLHF)

    For tasks with verifiable outputs (e.g., math problems, factual Q&A), AI responses can be easily evaluated. But what about areas like summarisation or creative writing, where there’s no single “correct” answer?

    This is where human feedback comes in, but naïve RL approaches are unscalable.

    Image by author

    Let’s look at the naive approach with some arbitrary numbers.

    Image by author

    That’s one billion human evaluations needed! This is too costly, slow and unscalable. Hence, a smarter solution is to train an AI “reward model” to learn human preferences, dramatically reducing human effort.
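
    The figure with the exact numbers is not reproduced in this text, so the breakdown below is only an assumption: one simple combination of arbitrary round numbers that lands on the one-billion total mentioned above.

```python
# Hypothetical figures; only the one-billion total comes from the article.
updates = 1_000              # number of training updates
prompts_per_update = 1_000   # prompts used in each update
rollouts_per_prompt = 1_000  # responses sampled per prompt

total_human_evaluations = updates * prompts_per_update * rollouts_per_prompt
print(f"{total_human_evaluations:,}")  # 1,000,000,000
```

    Whatever the exact split, the count multiplies across updates, prompts and sampled responses, which is why asking a human to score every response does not scale.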

    Ranking responses is also easier and more intuitive than absolute scoring.

    Image by author
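
    The article does not say how rankings are turned into a training signal for the reward model. One common choice, shown here purely as an assumption, is a pairwise (Bradley-Terry style) loss: for each pair where humans preferred one response over another, the reward model is penalised when it does not score the preferred one higher. The function name and the scores are made up.

```python
import math

def pairwise_preference_loss(score_preferred, score_rejected):
    """-log(sigmoid(difference)): small when the preferred response scores higher."""
    diff = score_preferred - score_rejected
    return math.log1p(math.exp(-diff))

# Hypothetical reward-model scores for two responses to the same prompt.
print(pairwise_preference_loss(2.0, 0.0))  # model agrees with the human ranking
print(pairwise_preference_loss(0.0, 2.0))  # model disagrees, so the loss is larger
```

    A full ranking of several responses can be broken into such pairs, so labellers only ever have to say which of two outputs they prefer.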

    Upsides of RLHF

    • Can be applied to any domain, including creative writing, poetry, summarisation, and other open-ended tasks.
    • Ranking outputs is much easier for human labellers than generating creative outputs themselves.

    Downsides of RLHF

    • The reward model is an approximation: it may not perfectly reflect human preferences.
    • RL is good at gaming the reward model: if run for too long, the model might exploit loopholes, generating nonsensical outputs that still get high scores.

    Do note that RLHF is not the same as traditional RL.

    For empirical, verifiable domains (e.g. math, coding), RL can run indefinitely and discover novel strategies. RLHF, on the other hand, is more like a fine-tuning step to align models with human preferences.

    Conclusion

    And that’s a wrap! I hope you enjoyed Part 2 🙂 If you haven’t already read Part 1, do check it out here.

    Got questions or ideas for what I should cover next? Drop them in the comments. I’d love to hear your thoughts. See you in the next article!




