    Using Model Flops Utilization (MFU) | by Jaideep Ray | Better ML | May, 2025

May 11, 2025


In the following section, let's discuss a few common questions about recent advances in LLM training and how they affect MFU computation.

Activation (or gradient) checkpointing is a memory optimization technique commonly used when training very large models that would otherwise exceed GPU memory capacity. It works by selectively discarding activations computed during the forward pass and recomputing them during the backward pass.

While activation checkpointing significantly reduces the memory footprint, it comes at a computational cost. Since activations must be recomputed, the number of forward-pass operations increases. This increase in computation directly affects the "Achieved FLOPs" in our MFU calculation.

Therefore, when reporting or analyzing MFU for a training run that uses activation checkpointing, it is important to consider its effect separately, for several reasons:

• Inflated Achieved FLOPs: Without accounting for recomputed activations, the measured "Achieved FLOPs" will be higher than if those computations had not been necessary due to memory constraints. This can lead to an artificially inflated MFU if the "Total Required FLOPs" is not adjusted accordingly.
• Trade-off Analysis: Activation checkpointing represents a trade-off between memory usage and computational efficiency. By considering its impact on MFU, we can better understand the cost of saving memory in terms of increased computation time.
• Fair Comparisons: When comparing the MFU of training runs with and without activation checkpointing, or with different levels of checkpointing, it is essential to recognize, and potentially normalize for, the extra computation introduced by the technique, to ensure a fair comparison of the underlying hardware and software efficiency.

Ideally, the "Total Required FLOPs" in the MFU calculation should reflect the inherent computational cost of training the model without memory constraints. The increase in "Achieved FLOPs" due to activation recomputation should be viewed as overhead incurred to enable training within the available memory.
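This accounting can be sketched as follows. The helper below is hypothetical (not from the article); it assumes the common ~6 × params FLOPs-per-token estimate for dense-transformer training, with full activation recomputation adding roughly one extra forward pass (~2 × params per token):

```python
def mfu_with_checkpointing(params, tokens, step_time_s, peak_flops_per_s,
                           recompute_factor=0.0):
    """MFU for one training step of a dense transformer (rough sketch).

    Assumes ~6 * params FLOPs per token for training (forward + backward).
    recompute_factor is the fraction of the forward pass redone under
    activation checkpointing (1.0 = full recompute, ~2 * params per token).
    Returns (mfu_inherent, mfu_executed): the first divides only the
    inherent model FLOPs by the hardware budget, so checkpointing overhead
    lowers it; the second counts recomputation as useful work and is
    therefore inflated.
    """
    hardware_budget = step_time_s * peak_flops_per_s
    inherent = 6.0 * params * tokens                      # cost without memory constraints
    executed = inherent + recompute_factor * 2.0 * params * tokens  # + recompute overhead
    return inherent / hardware_budget, executed / hardware_budget
```

Reporting the first number keeps recomputation visible as overhead; reporting the second silently rewards it.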

Computing MFU for Mixture of Experts (MoE) models introduces an additional layer of complexity due to their conditional-computation nature. In an MoE layer, only a subset of the "experts" (typically smaller neural networks) is activated for each input token, as determined by a "gating network."

Therefore, the "Total FLOPs Required" for an MoE model needs to account for:

• The FLOPs of the gating network, which are computed for all input tokens.
• The FLOPs of the selected experts, which are computed only for the tokens routed to them. This depends on the number of experts, their size, and the routing strategy (e.g., top-k routing, where each token is routed to the top k experts).

The calculation of "Achieved FLOPs" remains similar: it is based on the training time and the sustained FLOPS of the hardware. However, when computing MFU for MoE models:

MFU_MoE = Achieved FLOPs / Total FLOPs Required (considering activated experts)

The "Total FLOPs Required" in the denominator should represent the actual FLOPs performed by the model across a full training run, accounting for the sparsity introduced by the MoE architecture (i.e., only the FLOPs of the gating network and the selected experts).
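A back-of-the-envelope sketch of this denominator for a single MoE feed-forward layer might look like the following. The helper is hypothetical; the two-matmul expert shape (up and down projection) and the 2·m·n FLOPs-per-matmul convention are assumptions, not from the article:

```python
def moe_layer_flops(tokens, d_model, d_ff, n_experts, top_k):
    """Approximate FLOPs for one MoE feed-forward layer over a batch.

    The gating (router) projection runs for every token; only the
    top_k selected experts run per token (top-k routing). A matmul
    of shape (m, n) costs ~2*m*n FLOPs per token (multiply + add).
    """
    gating = 2 * tokens * d_model * n_experts      # router scores: all tokens
    per_expert_per_token = 4 * d_model * d_ff      # up (d_model->d_ff) + down (d_ff->d_model)
    experts = tokens * top_k * per_expert_per_token  # only routed tokens
    return gating + experts
```

Note that the expert term scales with top_k, not with n_experts: that gap is exactly the sparsity the denominator must reflect.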

Key Considerations for MoE MFU:

• Dynamic Computation: The number of FLOPs performed per training step can vary depending on the routing decisions of the gating network. Therefore, estimating the "Total FLOPs Required" often involves averaging over many training steps or making assumptions about the routing probabilities.
• Expert Imbalance: If the routing of tokens to experts is uneven, some experts will be utilized more than others. This can affect overall efficiency and should be considered when analyzing MFU.
• Communication Costs: MoE models often involve significant communication between different parts of the model, and potentially across devices, to route tokens and gather the experts' outputs. These communication costs contribute to overhead and reduce MFU.
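To make the imbalance point concrete, here is a small hypothetical helper (not from the article): given the expert index assigned to each routed token, it measures how far the load deviates from perfectly balanced routing.

```python
from collections import Counter

def expert_load_stats(assignments, n_experts):
    """Per-expert token counts plus an imbalance ratio.

    assignments: iterable of expert indices, one per routed token.
    Returns (loads, max_load / mean_load); a ratio of 1.0 means
    perfectly balanced routing, larger values mean hot experts.
    """
    counts = Counter(assignments)
    loads = [counts.get(e, 0) for e in range(n_experts)]
    mean_load = sum(loads) / n_experts
    return loads, max(loads) / mean_load
```

A hot expert (high ratio) becomes the straggler each step, so the achieved throughput, and hence MFU, tracks the worst-loaded device rather than the average.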

Tools like calflops and other profiling methods need to be adapted or extended to accurately account for the conditional computation in MoE models when estimating their FLOPs for MFU calculation.

FlashAttention is a recent innovation in the efficient computation of the attention mechanism in transformer models. The standard attention mechanism, while crucial for capturing long-range dependencies, has quadratic time and memory complexity with respect to sequence length. This becomes a major bottleneck when training LLMs with long input sequences.

FlashAttention addresses these limitations through several key optimizations:

• Tiling: It divides the input sequences and attention matrices into smaller blocks (tiles) and performs the attention computation within these tiles. This allows the intermediate results to fit in the GPU's faster on-chip SRAM (static RAM).
• Kernel Fusion: It fuses multiple operations of the attention computation into a single GPU kernel, reducing kernel-launch overhead and improving data locality.
• Backward Pass Optimization: FlashAttention also optimizes the backward pass by recomputing the normalization statistics and attention probabilities only when needed, further saving memory bandwidth and computation.
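The tiling idea can be sketched in plain NumPy. This is a toy single-head version for illustration only: real FlashAttention is a fused CUDA kernel, and this sketch reproduces only the block-wise "online softmax" arithmetic that lets each tile be processed without ever materializing the full n×n attention matrix.

```python
import numpy as np

def tiled_attention(Q, K, V, block=64):
    """Single-head attention computed tile-by-tile with a running
    (online) softmax. Numerically equivalent to
    softmax(Q @ K.T / sqrt(d)) @ V, but only holds one
    (n x block) score tile at a time.
    """
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(V, dtype=np.float64)
    row_max = np.full(n, -np.inf)   # running max per query row
    row_sum = np.zeros(n)           # running softmax denominator
    for start in range(0, n, block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = (Q @ Kb.T) * scale                    # scores for this tile
        new_max = np.maximum(row_max, S.max(axis=1))
        correction = np.exp(row_max - new_max)    # rescale old partial sums
        P = np.exp(S - new_max[:, None])          # tile's unnormalized probs
        row_sum = row_sum * correction + P.sum(axis=1)
        out = out * correction[:, None] + P @ Vb
        row_max = new_max
    return out / row_sum[:, None]
```

The `correction` factor is the key trick: whenever a new tile raises the running row maximum, previously accumulated sums are rescaled so the final result matches the exact softmax.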

    How FlashAttention Improves MFU:

• Reduced Memory Accesses: By keeping intermediate attention results in the faster SRAM, FlashAttention significantly reduces the number of expensive reads and writes to the slower high-bandwidth memory (HBM). This alleviates memory-bandwidth bottlenecks, which are often a limiting factor in achieving high MFU, especially for long sequences.
• Increased Computational Throughput: The optimized kernels and reduced overheads in FlashAttention allow the GPU's compute units to spend more time performing actual FLOPs for the attention mechanism, leading to higher "Achieved FLOPs" for the same training time.
• Enabling Longer Sequences: By making the attention computation more memory-efficient, FlashAttention allows training with longer input sequences without running out of memory. Longer sequences often lead to more informative gradients and potentially faster convergence, further improving overall efficiency in terms of MFU.

In essence, FlashAttention makes the attention computation itself more hardware-friendly, allowing the GPU to operate closer to its peak performance when processing this crucial part of the transformer architecture. This direct improvement in the efficiency of a key computational block translates into higher overall Model FLOPs Utilization for LLM training.


