Using Model FLOPs Utilization (MFU) | by Jaideep Ray | Better ML | May 2025



In the next section, let's discuss a few common questions around recent advances in LLM training and how they affect the MFU computation.

Activation (or gradient) checkpointing is a memory optimization technique commonly used when training very large models that would otherwise exceed GPU memory capacity. It works by selectively discarding activations computed during the forward pass and recomputing them during the backward pass.
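As an illustration, here is a minimal PyTorch sketch of activation checkpointing. The module and its sizes are hypothetical, but torch.utils.checkpoint.checkpoint is the standard API:

    import torch
    from torch.utils.checkpoint import checkpoint

    class Block(torch.nn.Module):
        """A toy residual feed-forward block (hypothetical sizes)."""
        def __init__(self, dim):
            super().__init__()
            self.ff = torch.nn.Sequential(
                torch.nn.Linear(dim, 4 * dim),
                torch.nn.GELU(),
                torch.nn.Linear(4 * dim, dim),
            )

        def forward(self, x):
            return x + self.ff(x)

    block = Block(1024)
    x = torch.randn(8, 512, 1024, requires_grad=True)

    # Activations inside `block` are freed after the forward pass and
    # recomputed during backward: less peak memory, extra forward FLOPs.
    y = checkpoint(block, x, use_reentrant=False)
    y.sum().backward()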

While activation checkpointing significantly reduces the memory footprint, it comes at a computational cost. Since activations must be recomputed, the number of forward-pass operations increases. This extra computation directly affects the "Achieved FLOPs" term in our MFU calculation.

Therefore, when reporting or analyzing MFU for a training run that uses activation checkpointing, it is important to account for its effect separately, for several reasons:

• Inflated Achieved FLOPs: Without accounting for recomputed activations, the measured "Achieved FLOPs" will be higher than if those computations had not been necessary due to memory constraints. This can lead to an artificially inflated MFU if the "Total Required FLOPs" is not adjusted accordingly.
• Trade-off Analysis: Activation checkpointing trades memory usage against computational efficiency. By quantifying its impact on MFU, we can better understand the cost of saving memory in terms of increased computation time.
• Fair Comparisons: When comparing the MFU of training runs with and without activation checkpointing, or with different levels of checkpointing, it is essential to recognize, and potentially normalize for, the extra computation introduced by the technique, so that the comparison reflects the underlying hardware and software efficiency.

Ideally, the "Total Required FLOPs" in the MFU calculation should reflect the inherent computational cost of training the model without memory constraints. The increase in "Achieved FLOPs" due to activation recomputation should be viewed as overhead incurred to make training fit within the available memory.
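A minimal sketch of the two accounting conventions, assuming the common 6N-FLOPs-per-token approximation for a dense decoder (2N forward, 4N backward) and full recomputation adding one extra forward pass; all numbers are illustrative:

    def mfu(n_params, tokens_per_sec, peak_flops, count_recompute):
        """MFU = counted model FLOPs throughput / peak hardware FLOPS."""
        model_flops_per_token = 6 * n_params        # inherent cost of the model
        recompute_flops_per_token = 2 * n_params    # one extra forward pass
        counted = model_flops_per_token + (
            recompute_flops_per_token if count_recompute else 0
        )
        return counted * tokens_per_sec / peak_flops

    # Hypothetical run: 7B params, 3,000 tokens/s/GPU, 312 TFLOP/s peak (A100 BF16).
    print(mfu(7e9, 3_000, 312e12, count_recompute=False))  # ~0.40, useful work only
    print(mfu(7e9, 3_000, 312e12, count_recompute=True))   # ~0.54, inflated by recompute

Whether the recompute FLOPs belong in the numerator is exactly the reporting choice discussed above: counting them measures how busy the hardware is, not how much useful work it does.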

Computing MFU for Mixture of Experts (MoE) models introduces an additional layer of complexity due to their conditional computation. In an MoE layer, only a subset of the "experts" (typically smaller feed-forward networks) is activated for each input token, as determined by a "gating network."

Therefore, the "Total FLOPs Required" for an MoE model needs to account for the following (a rough estimate is sketched after this list):

• The FLOPs of the gating network, which are computed for all input tokens.
• The FLOPs of the selected experts, which are computed only for the tokens routed to them. This depends on the number of experts, their size, and the routing strategy (e.g., top-k routing, where each token is sent to the k highest-scoring experts).
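A back-of-the-envelope sketch under simple assumptions: a linear gating layer over E experts, FFN experts of two matmuls each, and top-k routing. The shapes are hypothetical (roughly Mixtral-like):

    def moe_layer_flops_per_token(d_model, d_ff, n_experts, top_k):
        gating = 2 * d_model * n_experts      # linear gate, run for every token
        expert = 2 * (2 * d_model * d_ff)     # one FFN expert: two matmuls, 2*m*n FLOPs each
        return gating + top_k * expert        # only the top_k experts run per token

    # Hypothetical shapes: d_model=4096, d_ff=14336, 8 experts, top-2 routing.
    dense_equiv = moe_layer_flops_per_token(4096, 14336, 8, 8)  # if all experts ran
    sparse = moe_layer_flops_per_token(4096, 14336, 8, 2)
    print(f"sparse: {sparse:.3e} FLOPs/token ({sparse / dense_equiv:.0%} of dense)")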

The calculation of "Achieved FLOPs" remains the same: it is based on the training time and the sustained FLOPS of the hardware. However, when computing MFU for MoE models:

MFU_MoE = Achieved FLOPs / Total FLOPs Required (considering activated experts)

The "Total FLOPs Required" in the denominator should represent the actual FLOPs performed by the model during a full training run, accounting for the sparsity introduced by the MoE architecture (i.e., only the FLOPs of the gating network and the selected experts).

Key Considerations for MoE MFU:

• Dynamic Computation: The number of FLOPs performed per training step can vary with the routing decisions of the gating network. Estimating the "Total FLOPs Required" therefore often involves averaging over many training steps or making assumptions about the routing probabilities.
• Expert Imbalance: If tokens are routed to experts unevenly, some experts are utilized more than others. This can affect overall efficiency and should be considered when analyzing MFU (see the sketch after this list).
• Communication Costs: MoE models often involve significant communication between different parts of the model, and potentially across devices, to route tokens and gather expert outputs. These communication costs contribute to overhead and reduce MFU.
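A small sketch of how expert load imbalance can be measured from the gating decisions; gate_logits is a hypothetical [n_tokens, n_experts] tensor and top-2 routing is assumed:

    import torch

    n_tokens, n_experts, top_k = 8192, 8, 2
    gate_logits = torch.randn(n_tokens, n_experts)  # stand-in for real gate scores

    chosen = gate_logits.topk(top_k, dim=-1).indices              # expert ids per token
    load = torch.bincount(chosen.flatten(), minlength=n_experts)  # tokens per expert
    ideal = n_tokens * top_k / n_experts
    print(load.tolist())
    print(f"max load = {load.max().item() / ideal:.2f}x the balanced ideal")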

Tools like calflops and other profiling methods need to be adapted or extended to accurately account for the conditional computation in MoE models when estimating their FLOPs for the MFU calculation.
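For reference, a hedged sketch of dense-model FLOPs counting with calflops (usage follows its README; the exact signature may differ across versions, and MoE sparsity would still need the manual adjustment discussed above):

    from calflops import calculate_flops
    from torchvision import models

    model = models.resnet18()
    flops, macs, params = calculate_flops(
        model=model,
        input_shape=(1, 3, 224, 224),  # a single RGB image
        output_as_string=True,
    )
    print(flops, macs, params)  # FLOPs, MACs, and params as readable strings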

FlashAttention is a recent innovation in the efficient computation of the attention mechanism in transformer models. Standard attention, while crucial for capturing long-range dependencies, has quadratic time and memory complexity with respect to sequence length. This becomes a major bottleneck when training LLMs on long input sequences.

FlashAttention addresses these limitations through several key optimizations:

• Tiling: It divides the input sequences and attention matrices into smaller blocks (tiles) and performs the attention computation within these tiles. This allows the intermediate results to fit in the GPU's faster on-chip SRAM.
• Kernel Fusion: It fuses multiple operations of the attention computation into a single GPU kernel, reducing kernel-launch overhead and improving data locality.
• Backward Pass Optimization: FlashAttention also optimizes the backward pass by recomputing the normalization statistics and attention probabilities only when needed, further saving memory bandwidth and computation.
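In practice, many frameworks expose FlashAttention without custom kernels. A minimal sketch using PyTorch's scaled_dot_product_attention, which can dispatch to a FlashAttention kernel on supported GPUs (the backend-selection API assumes PyTorch 2.3+; shapes are hypothetical):

    import torch
    import torch.nn.functional as F
    from torch.nn.attention import SDPBackend, sdpa_kernel

    # [batch, heads, seq_len, head_dim], half precision on a CUDA device
    q, k, v = (torch.randn(4, 16, 2048, 64, device="cuda", dtype=torch.float16)
               for _ in range(3))

    # Restrict dispatch to the FlashAttention backend; this errors if the
    # kernel is unsupported instead of silently falling back to math.
    with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

    print(out.shape)  # torch.Size([4, 16, 2048, 64])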

How FlashAttention Improves MFU:

• Reduced Memory Accesses: By keeping intermediate attention results in the faster SRAM, FlashAttention significantly reduces the number of expensive reads and writes to the slower high-bandwidth memory (HBM). This alleviates memory-bandwidth bottlenecks, which are often the limiting factor in achieving high MFU, especially for long sequences.
• Increased Computational Throughput: The optimized kernels and reduced overhead let the GPU's compute units spend more time performing actual FLOPs for the attention mechanism, yielding higher "Achieved FLOPs" for the same training time.
• Enabling Longer Sequences: By making attention more memory-efficient, FlashAttention allows training with longer input sequences without running out of memory. Longer sequences often yield more informative gradients and potentially faster convergence, further improving overall efficiency in terms of MFU.

In essence, FlashAttention makes the attention computation itself more hardware-friendly, allowing the GPU to operate closer to its peak performance on this critical part of the transformer architecture. This direct improvement in the efficiency of a key computational block translates into higher overall Model FLOPs Utilization for LLM training.


