
    DeepSeek Technical Analysis — (3) Multi-Token Prediction | by Jinpeng Zhang | Jan, 2025

    By Team_AIBS News, January 30, 2025


    This is the third blog in my DeepSeek model technical analysis series. For the full background, please refer to the first blog of this series, “DeepSeek Technical Analysis — (1) MoE”.

    In the last two blogs, I explained Mixture-of-Experts (MoE) and Multi-Head Latent Attention (MLA) respectively. MoE reduced the training cost significantly by reducing the number of activated parameters for each token; for example, the number of activated parameters of DeepSeek-V3-671B is 37B. MLA reduced the KV cache size by 93.3% (compared to the original Multi-Head Attention) and boosted the inference speed several times over.

    In this blog, I'll focus on another technique adopted by DeepSeek (starting from version V3): Multi-Token Prediction, which can improve the performance (accuracy) of the model.

    Large language models such as GPT and Llama are trained with a next-token prediction loss. These models learn a large text corpus x_1, ..., x_T by implementing a next-token prediction task. Formally, the learning objective is to minimize the cross-entropy loss:

    $$L_1 = -\sum_{t} \log P_\theta(x_{t+1} \mid x_{t:1})$$

    where P_θ is our large language model under training. Minimizing this loss maximizes the probability of x_{t+1} as the next future token, given the history of past tokens x_{t:1} = x_t, ..., x_1.
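
    As a rough illustration of the next-token objective, the sketch below sums the negative log-probability of each true next token. The function names and the toy logits are assumptions made for this example and are not taken from any DeepSeek code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def next_token_loss(logits_per_position, tokens):
    """Sum of -log P(x_{t+1} | x_{t:1}) over all positions that have a next token.

    logits_per_position[t] holds the (hypothetical) model scores over the
    vocabulary after reading tokens[0..t]; the target there is tokens[t + 1].
    """
    loss = 0.0
    for t in range(len(tokens) - 1):
        log_probs = log_softmax(logits_per_position[t])
        loss -= log_probs[tokens[t + 1]]
    return loss

# Toy example: vocabulary of 3 tokens and a model that scores them equally.
tokens = [0, 2, 1]
uniform_logits = [[0.0, 0.0, 0.0] for _ in tokens]
loss = next_token_loss(uniform_logits, tokens)
```

    With uniform scores every target has probability 1/3, so the loss for the two predicted positions is 2 * log(3).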

    The paper “Better & Faster Large Language Models via Multi-token Prediction” (2024) generalized the above by implementing a multi-token prediction task, where at each position of the training corpus the model is instructed to predict n future tokens at once, in parallel, using independent output heads. This translates into the cross-entropy loss:

    $$L_n = -\sum_{t} \log P_\theta(x_{t+n:t+1} \mid x_{t:1})$$
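
    A minimal sketch of the parallel multi-token loss follows, assuming the joint probability of the n future tokens factorizes across the independent heads, so the loss becomes a sum of per-head cross-entropies. The data layout (`head_logits[i][t]`) and the function names are hypothetical choices for this example.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def multi_token_loss(head_logits, tokens):
    """Sum over heads i and positions t of -log P_i(x_{t+i+1} | x_{t:1}).

    head_logits[i][t] holds the scores of head i at position t; head i is
    responsible for the token that lies i + 1 steps ahead (assumed layout).
    """
    loss = 0.0
    for i, logits_per_position in enumerate(head_logits):
        offset = i + 1
        # Positions too close to the end have no target at this offset.
        for t in range(len(tokens) - offset):
            log_probs = log_softmax(logits_per_position[t])
            loss -= log_probs[tokens[t + offset]]
    return loss

# Toy example: n = 2 heads, 4 tokens, vocabulary of 3, uniform scores.
tokens = [0, 2, 1, 1]
uniform = [[0.0, 0.0, 0.0] for _ in tokens]
loss = multi_token_loss([uniform, uniform], tokens)
```

    Here the first head contributes three terms and the second head two, so the uniform toy model gives 5 * log(3). With a single head the function reduces to the next-token loss.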

    The paper ran several experiments to find the optimal n (how many future tokens to predict), and also verified that multi-token prediction can improve performance (accuracy) over next-token prediction when training language models for generative or reasoning tasks.

    The paper also demonstrated that multi-token prediction leads to qualitative changes in model capabilities and generalization behaviors. The likely reason is that multi-token prediction mitigates the distributional discrepancy between training-time teacher forcing and inference-time autoregressive generation.

    DeepSeek V3 adopted the main idea of Multi-Token Prediction from the above paper, but with one change: it predicts the additional tokens sequentially and keeps the complete causal chain at each prediction depth.

    Variant of Multi-Token Prediction in DeepSeek

    Instead of the parallel structure in the original paper, DeepSeek uses a chain-structured Multi-Token Prediction. The input tokens [t1,t2,t3,t4] go through the main model's transformer blocks and then through the output head of the main model to produce the next predicted token t5. Meanwhile, the representation of the input tokens [t1,t2,t3,t4] (the output of the main model's transformer blocks) is passed to the MTP module and combined with the embeddings of the new input tokens [t2,t3,t4,t5 (newly predicted)] to help produce the additional token t6, and so on. In DeepSeek-V3, the model predicts the next 2 tokens.
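
    The chain described above can be sketched as a toy data flow. Everything numeric here is a stand-in chosen only so the example runs: `embed`, `causal_block`, `combine`, and `output_head` are hypothetical placeholders for the learned embedding, transformer blocks, MTP projection, and output head, and do not reproduce DeepSeek's actual layers.

```python
import math

VOCAB, DIM = 8, 4  # toy sizes, chosen for illustration only

def embed(token):
    """Hypothetical embedding: a fixed deterministic vector per token id."""
    return [math.sin(token + 1 + j) for j in range(DIM)]

def causal_block(vectors):
    """Stand-in for transformer blocks: a causal running mean,
    so position i only sees positions <= i."""
    out, running = [], [0.0] * DIM
    for i, v in enumerate(vectors):
        running = [r + x for r, x in zip(running, v)]
        out.append([r / (i + 1) for r in running])
    return out

def output_head(hidden):
    """Stand-in output head: score each token by a dot product, return the argmax."""
    scores = [sum(h * e for h, e in zip(hidden, embed(tok))) for tok in range(VOCAB)]
    return max(range(VOCAB), key=scores.__getitem__)

def combine(prev_hidden, token_embedding):
    """Stand-in for the MTP merge step: a plain average of the two vectors."""
    return [(h + e) / 2 for h, e in zip(prev_hidden, token_embedding)]

def predict_two_tokens(tokens):
    # Main model: [t1..t4] -> hidden states -> next token t5.
    hidden = causal_block([embed(t) for t in tokens])
    first = output_head(hidden[-1])
    # MTP module: reuse the main model's hidden states and merge them with
    # the embeddings of the shifted inputs [t2, t3, t4, t5].
    shifted = tokens[1:] + [first]
    merged = [combine(h, embed(t)) for h, t in zip(hidden, shifted)]
    second = output_head(causal_block(merged)[-1])
    return first, second

first, second = predict_two_tokens([1, 2, 3, 4])
```

    The point of the sketch is the wiring: the second prediction depends on the main model's hidden states and on the first predicted token, which is what keeps the causal chain intact at each prediction depth.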

    In the DeepSeek-V3 Technical Report, the authors demonstrated that Multi-Token Prediction can improve performance in most cases.

    Does Multi-Token Prediction bring an improvement in all cases? The paper “Better & Faster Large Language Models via Multi-token Prediction” (2024) demonstrated that multi-token prediction may introduce regressions on multiple-choice and likelihood-based benchmarks. The MMLU (Massive Multitask Language Understanding) regression (67.5 -> 66.6) in DeepSeek with MTP matches this conclusion.


