
Reading this paper from DeepSeek made me rethink everything I knew about AI efficiency

By Karthik Mudaliar | May 17, 2025



Breaking down how DeepSeek built a 671B-parameter model using just 2,048 GPUs

I spent my Saturday diving into a technical paper titled Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures [PDF]. It wasn't your typical weekend read, but it turned out to be more insightful than I expected. I'm writing this to distill what stood out to me and hopefully spark curiosity in others who love systems, scaling, or simply understanding how the heart of AI infrastructure works.

Basic architecture of DeepSeek-V3 (Credit: DeepSeek)

The team behind DeepSeek-V3 trained their 671B-parameter model using just 2,048 NVIDIA H800 GPUs. For comparison, OpenAI reportedly used roughly 25,000 NVIDIA A100 GPUs to train GPT-4. That's impressive, but what's more interesting is how they did it. They didn't throw more money at the problem. They built around the constraints of their hardware, which feels like a rare kind of engineering wisdom these days.

Instead of blindly scaling up, they paid attention to how bandwidth, memory limitations, and compute bottlenecks actually affect training and inference in practice. They squeezed out every bit of performance by redesigning how the model and the hardware talk to each other.

The paper hammered home something I'd vaguely known but never fully appreciated: memory is the real bottleneck when scaling LLMs. While model sizes are growing exponentially, high-speed memory capacity (like HBM) is growing at a much slower pace.

To fight this, they used something called Multi-head Latent Attention (MLA), which compresses the key-value cache required during inference. With MLA, they brought the memory usage per token down to 70 KB, compared to over 500 KB in some other models.

Think of it like this: instead of storing a full high-res photo album for every conversation, MLA stores a well-organized summary scrapbook. It doesn't keep every pixel from every photo, but it retains just enough to remember the context and continue the story, and that's all the model needs during inference. It's like packing a suitcase smarter, not just smaller.

That's a game-changer for long-context applications and resource-constrained systems.
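To make the caching trick concrete, here's a minimal NumPy sketch of the idea behind MLA: cache one small latent vector per token instead of full per-head keys and values, and re-expand it on the fly. All dimensions below are made-up placeholders, not DeepSeek-V3's real hyperparameters.

```python
import numpy as np

# Illustrative sizes only, not DeepSeek-V3's actual configuration.
d_model = 4096      # hidden size (assumed)
n_heads = 32        # attention heads (assumed)
d_head = d_model // n_heads
d_latent = 512      # compressed latent width (assumed)

# Standard attention caches full keys and values for every head:
kv_values_standard = 2 * n_heads * d_head   # K and V per token

# MLA caches one compressed latent vector per token instead:
kv_values_mla = d_latent

print(f"standard cache per token: {kv_values_standard} values")
print(f"MLA latent per token:     {kv_values_mla} values")

# The compression and expansion are just learned projections:
rng = np.random.default_rng(0)
W_down = rng.standard_normal((d_model, d_latent)) * 0.02  # compress
W_up_k = rng.standard_normal((d_latent, d_model)) * 0.02  # expand to keys
W_up_v = rng.standard_normal((d_latent, d_model)) * 0.02  # expand to values

h = rng.standard_normal((1, d_model))    # one token's hidden state
latent = h @ W_down                      # this small vector is all we cache
k, v = latent @ W_up_k, latent @ W_up_v  # rebuilt on the fly at attention time
```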

Mixture-of-Experts (MoE) architectures often sound like an academic exercise. But here, they make a strong case for MoE being production-ready and practical. DeepSeek-V3 uses a sparse MoE layout where only a fraction of the 671B parameters is active during any inference step. That's how it manages to stay efficient.

It's like having a huge team of subject-matter experts across different domain verticals, but only calling in the two or three experts who are actually needed for the task at hand. You don't ask the whole office to solve one problem, just the right people at the right time. That way, you save time, energy, and resources while still getting the best result.
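Here's a toy NumPy version of that routing idea: a gate scores every expert, only the top-k experts actually run, and their outputs are mixed by the gate weights. The sizes and top-k value are illustrative assumptions, not DeepSeek-V3's real configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2   # toy sizes, not the real config

W_gate = rng.standard_normal((d_model, n_experts)) * 0.02
experts = [rng.standard_normal((d_model, d_model)) * 0.02
           for _ in range(n_experts)]

def moe_forward(x):
    """x: (tokens, d_model) -> (tokens, d_model); top_k experts per token."""
    logits = x @ W_gate                                # gate scores
    top = np.argsort(logits, axis=-1)[:, -top_k:]      # chosen expert indices
    chosen = np.take_along_axis(logits, top, axis=-1)
    gates = np.exp(chosen) / np.exp(chosen).sum(-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):        # only the chosen experts ever run
        for s in range(top_k):
            out[t] += gates[t, s] * (x[t] @ experts[top[t, s]])
    return out

tokens = rng.standard_normal((4, d_model))
print(moe_forward(tokens).shape)  # (4, 64): same output shape, a fraction of the compute
```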

It made me think about the future of personal LLMs. Running a powerful model on a local device with limited compute might not be far-fetched if architectures like this evolve further.

I had only seen FP8 floating-point formats mentioned on NVIDIA slides, but this paper gets into the nitty-gritty of how FP8 is actually being used in training, not just inference.

FP8 stands for 8-bit floating point. Unlike the standard FP32 (32-bit) or even BF16 (16-bit) formats that have commonly been used in deep learning, FP8 compresses each number into just 8 bits. That means far less memory use and much faster data movement, which is a big deal when you're training massive models across thousands of GPUs.
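A quick back-of-the-envelope calculation shows why the format matters at this scale. This counts raw weight storage only; real training also holds optimizer state, gradients, and activations on top.

```python
# Weights-only storage for a 671B-parameter model at each precision.
params = 671e9
for fmt, nbytes in {"FP32": 4, "BF16": 2, "FP8": 1}.items():
    print(f"{fmt}: {params * nbytes / 1e12:.2f} TB of weights")
# FP32: 2.68 TB, BF16: 1.34 TB, FP8: 0.67 TB. Halving the format also
# halves the bytes moved per step, which matters as much as capacity.
```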

    However there’s a trade-off.

With less space, you get less precision. That can lead to instability in operations that need high precision, like matrix multiplications or gradient accumulation. And because of how NVIDIA's tensor cores work, FP8's limited accumulation precision can cause loss of information if not handled correctly. On top of that, using fine-grained quantization to squeeze values into FP8 creates extra overhead, especially when moving data between cores or applying scaling factors.

The DeepSeek team tackled this by designing a framework where FP8 is used in just the right spots, combined with techniques like tile-wise quantization for activations and block-wise quantization for weights. It's not just clever math; it's practical engineering that bridges what the hardware can do with what the model needs to stay accurate.

Think of it like organizing a wardrobe. You've got different items to store (shirts, jackets, accessories, socks), just like the inputs during inference. But if every compartment of your wardrobe were the same size, it wouldn't make sense. Your socks don't need as much space as your coats.

That's where DeepSeek's use of FP8 comes in. Instead of giving every calculation the same bulky 32-bit slot, they use much smaller 8-bit floating-point values for the parts of the model that don't need full precision. It's smarter space management: saving memory and bandwidth by giving each task only as much room as it really needs.
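To illustrate the fine-grained part, here's a small sketch of block-wise quantization, using int8 as a simplifying stand-in for FP8 (the real FP8 encodings, E4M3/E5M2, behave differently). The point is one scaling factor per small tile, so a single outlier value only hurts its own block instead of the whole tensor.

```python
import numpy as np

def blockwise_quantize(w, block=128):
    """Quantize a 2-D weight matrix with one scale per (block x block) tile."""
    H, W = w.shape
    q = np.empty_like(w, dtype=np.int8)
    scales = np.empty((H // block, W // block))
    for i in range(0, H, block):
        for j in range(0, W, block):
            tile = w[i:i+block, j:j+block]
            s = np.abs(tile).max() / 127.0 + 1e-12     # per-block scale
            scales[i // block, j // block] = s
            q[i:i+block, j:j+block] = np.round(tile / s).astype(np.int8)
    return q, scales

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, scales = blockwise_quantize(w)

# Dequantize and check how much precision the per-block scales preserve.
w_hat = q.astype(np.float32) * np.repeat(np.repeat(scales, 128, 0), 128, 1)
print("max reconstruction error:", np.abs(w - w_hat).max())
```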

And they didn't just talk theory. They actually trained huge models with this setup and found that the accuracy loss was less than 0.25% compared to BF16. That's wild when you consider the memory and bandwidth savings they get in return.

They explain how the lower precision helps reduce memory and bandwidth load, but also point out the problems it causes, like limited accumulation precision and increased register pressure. Their workaround? Fine-grained quantization and careful co-design of software and hardware.

One of the coolest parts for me was how they replaced the traditional three-layer fat-tree network with a two-layer, multi-plane fat-tree topology. It not only lowered network costs but also kept latencies low and scaled well to thousands of GPUs. There's something deeply satisfying about watching engineers reduce something complex into something efficient.
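The port arithmetic behind that choice is simple enough to sanity-check. In a non-blocking two-layer (leaf-spine) fat-tree built from radix-k switches, each leaf spends half its ports on hosts and half on spines, so the fabric tops out at k²/2 endpoints. The 64-port radix below is my assumption for illustration, not a figure from the paper.

```python
# Two-layer (leaf-spine) fat-tree capacity with radix-k switches:
# k/2 hosts per leaf times k leaves gives k**2 / 2 endpoints total.
radix = 64  # assumed switch radix
print(radix ** 2 // 2)  # 2048: a 2,048-GPU cluster fits in two layers
```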

    H800 node interconnection (Credit: DeepSeek)

Reading this paper made me appreciate the elegance of co-design. It reminded me that great AI models aren't just born out of better algorithms or more GPUs. They come from people who obsess over limitations, question the defaults, and redesign systems from the ground up.

If you're building anything in AI or infrastructure, especially on a budget, I recommend giving the paper a read. You'll come away with a better intuition for how deep learning actually works at scale, not just in theory, but in messy, hardware-constrained reality.

Hi, I'm Karthik. I love thinking about how we can build smarter, more efficient systems, not just bigger ones. If this resonated with you, let's connect. You'll find me on X or LinkedIn.



