Model Compression: Make Your Machine Learning Models Lighter and Faster

Whether you’re preparing for interviews or building Machine Learning systems at your job, model compression has become a vital skill. In the era of LLMs, where models are getting larger and larger, the challenges around compressing these models to make them more efficient, smaller, and usable on lightweight machines have never been more relevant.

In this article, I’ll go through four fundamental compression techniques that every ML practitioner should understand and master. I explore pruning, quantization, low-rank factorization, and Knowledge Distillation, each offering unique advantages. I will also add some minimal PyTorch code samples for each of these methods.

I hope you enjoy the article!

Model pruning

Pruning is probably the most intuitive compression technique. The idea is very simple: remove some of the weights of the network, either randomly or by targeting the “less important” ones. Of course, when we talk about “removing” weights in the context of neural networks, it means setting the weights to zero.

Model pruning (Image by the author and ChatGPT | Inspiration: [3])

Structured vs unstructured pruning

Let’s start with a simple heuristic: removing weights smaller than a threshold.

\[ w'_{ij} = \begin{cases} w_{ij} & \text{if } |w_{ij}| \ge \theta_0 \\
0 & \text{if } |w_{ij}| < \theta_0
\end{cases} \]
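As a minimal sketch of the rule above (the helper name `threshold_prune` and the toy tensor are illustrative assumptions, not part of any library), the threshold can be applied with a boolean mask:

```python
import torch

def threshold_prune(weight: torch.Tensor, theta: float) -> torch.Tensor:
    """Return a copy of `weight` where entries with |w| < theta are set to zero."""
    mask = weight.abs() >= theta   # True where the weight is kept
    return weight * mask

# Toy 2 x 2 weight matrix (hypothetical values)
w = torch.tensor([[0.80, -0.05], [0.02, -0.60]])
pruned = threshold_prune(w, theta=0.1)
print(pruned)  # the two small entries become zero, 0.80 and -0.60 are kept
```

The mask is the useful object here: keeping it around makes it possible to re-apply the same sparsity pattern after further training.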

Of course, this isn’t ideal because we would need to find the right threshold for our problem! A more practical approach is to remove a specified proportion of the weights with the smallest magnitudes (norm) within one layer. There are two common ways of implementing pruning in a single layer:

Structured pruning: remove entire components of the network (e.g. a random row from the weight tensor, or a random channel in a convolutional layer)
Unstructured pruning: remove individual weights regardless of their position in the tensor and of the tensor’s structure

We can also use global pruning with either of the two methods above. It removes the chosen proportion of weights across several layers at once, so the removal rate can differ from one layer to the next.

PyTorch makes this pretty easy (by the way, you can find all code snippets in my GitHub repo).

import torch.nn.utils.prune as prune

# 1. Random unstructured pruning (20% of weights at random)
prune.random_unstructured(model.layer, name="weight", amount=0.2)

# 2. L1-norm unstructured pruning (20% of smallest weights)
prune.l1_unstructured(model.layer, name="weight", amount=0.2)

# 3. Global unstructured pruning (40% of all weights by L1 norm across layers)
prune.global_unstructured(
    [(model.layer1, "weight"), (model.layer2, "weight")],
    pruning_method=prune.L1Unstructured,
    amount=0.4
)

# 4. Structured pruning (remove 30% of rows with lowest L2 norm)
prune.ln_structured(model.layer, name="weight", amount=0.3, n=2, dim=0)
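To see what these calls actually do to a layer, here is a small self-contained sketch. The `nn.Linear(20, 10)` layer is a hypothetical stand-in for `model.layer`. It checks the resulting sparsity and then makes the pruning permanent with `prune.remove`:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)
layer = nn.Linear(20, 10)  # hypothetical layer with 200 weights

prune.l1_unstructured(layer, name="weight", amount=0.2)

# Pruning is implemented as a reparametrization: the layer now stores
# `weight_orig` (a parameter) and `weight_mask` (a buffer), and `weight`
# is their element-wise product.
sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.0%}")

# Make the pruning permanent: the mask is dropped and `weight` becomes
# a regular parameter again, with the pruned entries left at zero.
prune.remove(layer, "weight")
param_names = [name for name, _ in layer.named_parameters()]
print(param_names)
```

Note that the zeros are still stored densely, so this alone does not shrink the file on disk; the gain comes from sparse storage formats or hardware that exploits sparsity.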

Note: if you have taken statistics classes, you have probably come across regularization-based methods that also implicitly prune some weights during training, by using L0 or L1 norm regularization. Pruning differs from those because it is applied after training, as a model compression technique.

Why does pruning work? The Lottery Ticket Hypothesis

I want to conclude this section with a quick mention of the Lottery Ticket Hypothesis, which is both an application of pruning and an interesting explanation of how removing weights can often improve a model. I recommend reading the associated paper ([7]) for more details.

The authors use the following procedure:

Train the full model to convergence
Prune the smallest-magnitude weights (say 10%)
Reset the remaining weights to their original initialization values
Retrain this pruned network
Repeat the process several times

After doing this 30 times, you end up with only 0.9³⁰ ≈ 4% of the original parameters. And surprisingly, this network can do as well as the original one.
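The procedure above can be sketched on a single layer. This is only an illustration under strong assumptions: the layer size is arbitrary, and `train` is a hypothetical placeholder that does nothing, whereas the paper trains a full network to convergence at each round.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)
layer = nn.Linear(40, 25)                       # hypothetical model: 1,000 weights
initial_weight = layer.weight.detach().clone()  # saved for the "reset" step

def train(module):
    """Hypothetical placeholder: a real experiment would train to convergence here."""
    pass

rounds = 3
for _ in range(rounds):
    train(layer)
    # Prune 10% of the weights that are still alive (smallest magnitudes first).
    prune.l1_unstructured(layer, name="weight", amount=0.1)
    # Reset the surviving weights to their initial values; the mask keeps
    # the pruned ones at zero.
    with torch.no_grad():
        layer.weight_orig.copy_(initial_weight)

remaining = layer.weight_mask.mean().item()
print(f"remaining after {rounds} rounds: {remaining:.1%}")
print(f"remaining after 30 rounds (theory): {0.9 ** 30:.1%}")
```

Calling `prune.l1_unstructured` repeatedly on the same parameter prunes a fraction of the weights that are still unmasked, which is what produces the geometric 0.9ⁿ decay.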

This suggests that there is significant parameter redundancy. In other words, there exists a sub-network (“a lottery ticket”) that actually does most of the work!

Pruning is one way to unveil this sub-network.

I recommend this excellent video that covers the topic!

Quantization

While pruning focuses on removing parameters entirely, quantization takes a different approach: reducing the precision of each parameter.

Remember that every number in a computer is stored as a sequence of bits. A float32 value uses 32 bits (see the example picture below), while an 8-bit integer (int8) uses just 8 bits.

An example of how float32 numbers are represented with 32 bits (Image by the author and ChatGPT | Inspiration: [2])

Most deep learning models are trained using 32-bit floating-point numbers (FP32). Quantization converts these high-precision values to lower-precision formats like 16-bit floating-point (FP16), 8-bit integers (INT8), or even 4-bit representations.

The savings here are obvious: INT8 requires 75% less memory than FP32. But how do we actually perform this conversion without destroying our model’s performance?
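To make the savings concrete, here is a back-of-the-envelope calculation. The 7-billion-parameter model is a hypothetical example, and the figures cover the weights only (no activations, no optimizer state):

```python
BITS_PER_PARAM = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}

def weight_memory_gb(num_params: int, fmt: str) -> float:
    """Memory needed to store the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BITS_PER_PARAM[fmt] / 8 / 1e9

num_params = 7_000_000_000  # hypothetical 7B-parameter model
for fmt in BITS_PER_PARAM:
    print(f"{fmt}: {weight_memory_gb(num_params, fmt):.1f} GB")

saving = 1 - weight_memory_gb(num_params, "INT8") / weight_memory_gb(num_params, "FP32")
print(f"INT8 vs FP32 saving: {saving:.0%}")  # 75%
```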

The math behind quantization

To convert from floating-point to integer representation, we need to map the continuous range of values to a discrete set of integers. For INT8 quantization, we’re mapping to 256 possible values (from -128 to 127).

Suppose our weights are normalized between -1.0 and 1.0 (common in deep learning):

\[ \text{scale} = \frac{\text{float\_max} - \text{float\_min}}{\text{int8\_max} - \text{int8\_min}} = \frac{1.0 - (-1.0)}{127 - (-128)} = \frac{2.0}{255} \]

Then, the quantized value is given by

\[ \text{quantized\_value} = \text{round}\left(\frac{\text{original\_value}}{\text{scale}}\right) + \text{zero\_point} \]

Here, zero_point=0 because we want 0 to be mapped to 0. Rounding to the nearest integer then gives integers in the INT8 range, between -128 and 127 (a value that lands just outside this range is clipped to the nearest bound).

And, you guessed it: to get integers back to float, we can use the inverse operation: \[ \text{float\_value} = (\text{integer\_value} - \text{zero\_point}) \times \text{scale} \]

Note: in practice, the scaling factor is determined based on the range of the values we quantize.
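The round trip can be sketched in a few lines. The helper names `quantize` and `dequantize` and the toy weights are illustrative assumptions, and the scale is computed from the observed range of the tensor, as the note above suggests:

```python
import torch

def quantize(x, scale, zero_point=0, qmin=-128, qmax=127):
    """Map floats to INT8: round(x / scale) + zero_point, clipped to [qmin, qmax]."""
    q = torch.round(x / scale) + zero_point
    return torch.clamp(q, qmin, qmax).to(torch.int8)

def dequantize(q, scale, zero_point=0):
    """Inverse mapping: (q - zero_point) * scale."""
    return (q.to(torch.float32) - zero_point) * scale

weights = torch.tensor([-1.0, -0.5, 0.0, 0.3, 1.0])  # hypothetical weights
scale = (weights.max() - weights.min()).item() / (127 - (-128))

q = quantize(weights, scale)
restored = dequantize(q, scale)
max_error = (weights - restored).abs().max().item()

print(q)
print(restored)
print(f"max round-trip error: {max_error:.5f} (scale / 2 = {scale / 2:.5f})")
```

The round-trip error stays within about half a quantization step, which is the price paid for storing each weight in 8 bits.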

How to apply quantization?

Quantization can be applied at different stages and with different strategies. Here are a few techniques worth knowing about (below, the word “activation” refers to the output values of each layer):

Post-training quantization (PTQ):
- Static quantization: quantize both weights and activations offline (after training and before inference)
- Dynamic quantization: quantize weights offline, but activations on-the-fly during inference. This differs from offline quantization because the scaling factor is determined from the values seen so far during inference.
Quantization-aware training (QAT): simulate quantization during training by rounding values, while calculations are still done with floating-point numbers. This makes the model learn weights that are more robust to quantization, which can then be applied after training. Under the hood, the idea is to add “fake” operations x -> dequantize(quantize(x)): the new value is close to x, but it helps the model learn to tolerate the 8-bit rounding and clipping noise.
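The “fake” operation x -> dequantize(quantize(x)) can be sketched as below. Rounding has zero gradient almost everywhere, so this sketch uses a straight-through estimator to let gradients pass. That choice is my assumption about a common way to do it, not something stated above, and `fake_quantize` is a hypothetical helper rather than the PyTorch API.

```python
import torch

def fake_quantize(x: torch.Tensor, scale: float, qmin: int = -128, qmax: int = 127) -> torch.Tensor:
    """x -> dequantize(quantize(x)), computed entirely in floating point."""
    q = torch.clamp(torch.round(x / scale), qmin, qmax)
    x_q = q * scale
    # Straight-through estimator (assumption): the forward pass returns x_q,
    # the backward pass treats the whole operation as the identity.
    return x + (x_q - x).detach()

x = torch.tensor([0.1234, -0.5678, 0.9], requires_grad=True)
scale = 2.0 / 255
y = fake_quantize(x, scale)
y.sum().backward()

print(y)       # values snapped to the INT8 grid, still float32
print(x.grad)  # all ones: gradients flow as if no rounding had happened
```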

import torch
import torch.quantization as tq

# Note: the three options below are independent; each one starts from the original float model

# 1. Post-training static quantization (weights + activations offline)
model.eval()
model.qconfig = tq.get_default_qconfig('fbgemm') # assign a static quantization config
tq.prepare(model, inplace=True)
# we need to use a calibration dataset to determine the ranges of values
with torch.no_grad():
    for data, _ in calibration_data:
        model(data)
tq.convert(model, inplace=True) # convert to a fully int8 model

# 2. Post-training dynamic quantization (weights offline, activations on-the-fly)
dynamic_model = tq.quantize_dynamic(
    model,
    {torch.nn.Linear, torch.nn.LSTM}, # layers to quantize
    dtype=torch.qint8
)

# 3. Quantization-Aware Training (QAT)
model.train()
model.qconfig = tq.get_default_qat_qconfig('fbgemm')  # set up QAT config
tq.prepare_qat(model, inplace=True) # insert fake-quant modules
# [here, train or fine-tune the model as usual]
qat_model = tq.convert(model.eval(), inplace=False) # convert to real int8 after QAT

Quantization is very flexible! You can apply different precision levels to different parts of the model. For instance, you might quantize most linear layers to 8-bit for maximum speed and memory savings, while leaving critical components (e.g. attention heads, or batch-norm layers) at 16-bit or full precision.

Low-Rank Factorization

Now let’s talk about low-rank factorization, a method that has been popularized with the rise of LLMs.

The key observation: many weight matrices in neural networks have effective ranks much lower than their dimensions suggest. In plain English, this means there is a lot of redundancy in the parameters.

Note: if you have ever used PCA for dimensionality reduction, you have already encountered a form of low-rank approximation. PCA decomposes large matrices into products of smaller, lower-rank factors that retain as much information as possible.

The linear algebra behind low-rank factorization

Take a weight matrix W. Every real matrix can be represented using a Singular Value Decomposition (SVD):

\[ W = U \Sigma V^T \]

where Σ is a diagonal matrix with the singular values in non-increasing order. The number of positive singular values corresponds to the rank of the matrix W.
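A quick way to check the link between singular values and rank is to build a matrix whose rank is known in advance. In this sketch the matrix sizes and the numerical tolerance are arbitrary choices of mine:

```python
import torch

torch.manual_seed(0)
# Hypothetical 6 x 5 matrix built as a product of thin factors, so its rank is 2.
W = torch.randn(6, 2, dtype=torch.float64) @ torch.randn(2, 5, dtype=torch.float64)

singular_values = torch.linalg.svdvals(W)  # sorted in non-increasing order
tolerance = 1e-10 * singular_values[0]     # treat tiny values as numerical zeros
rank = int((singular_values > tolerance).sum())

print(singular_values)
print("rank:", rank)
```

In floating point the “zero” singular values are only approximately zero, which is why a tolerance is needed. This is also what motivates the notion of effective rank for real weight matrices.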

SVD visualized for a matrix of rank r (Image by the author and ChatGPT | Inspiration: [5])

To approximate W with a matrix of rank k < r, we can select the k greatest singular values in Σ, together with the corresponding first k columns of U and first k rows of V^T:

\[ \begin{aligned} W_k &= U_k\,\Sigma_k\,V_k^T \\[6pt]
&= \underbrace{U_k\,\Sigma_k^{1/2}}_{A\in\mathbb{R}^{m\times k}} \underbrace{\Sigma_k^{1/2}\,V_k^T}_{B\in\mathbb{R}^{k\times n}}. \end{aligned} \]

See how the new matrix can be decomposed as the product of A and B, with the total number of parameters now being m * k + k * n = k*(m+n) instead of m*n! This is a huge improvement, especially when k is much smaller than m and n.
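As a worked example of this count, with a hypothetical 4096 × 4096 layer and rank 64 (both numbers are arbitrary choices):

```python
def dense_params(m: int, n: int) -> int:
    """Parameters of the full m x n weight matrix."""
    return m * n

def low_rank_params(m: int, n: int, k: int) -> int:
    """Parameters of the two factors: A is m x k, B is k x n."""
    return k * (m + n)

m, n, k = 4096, 4096, 64  # hypothetical layer size and rank
full = dense_params(m, n)
factored = low_rank_params(m, n, k)
print(full, factored, full / factored)  # 16777216 524288 32.0
```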

In practice, it’s equivalent to replacing a linear layer x → Wx with two consecutive ones: x → A(Bx).
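The equivalence can be checked numerically. This sketch assumes a small matrix of rank exactly k, so the factorization is lossless; for a real weight matrix, A(Bx) would only approximate Wx.

```python
import torch

torch.manual_seed(0)
m, n, k = 8, 6, 3  # hypothetical sizes
# Build a matrix of rank exactly k so that the rank-k factorization is exact.
W = torch.randn(m, k, dtype=torch.float64) @ torch.randn(k, n, dtype=torch.float64)

U, S, Vh = torch.linalg.svd(W, full_matrices=False)  # W = U @ diag(S) @ Vh
A = U[:, :k] * S[:k].sqrt()                # (m, k): U_k Sigma_k^{1/2}
B = S[:k].sqrt().unsqueeze(1) * Vh[:k, :]  # (k, n): Sigma_k^{1/2} V_k^T

x = torch.randn(n, dtype=torch.float64)
one_layer = W @ x
two_layers = A @ (B @ x)
print(torch.allclose(one_layer, two_layers))
```

Computing B @ x first matters: it keeps every intermediate result at size k, which is where the compute savings come from.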

In PyTorch

We can either apply low-rank factorization before training (parameterizing each linear layer as two smaller matrices, which is not really a compression method but a design choice) or after training (applying a truncated SVD on the weight matrices). The second approach is by far the most common one and is implemented below.

import torch

# 1. Extract weight and choose rank
W = model.layer.weight.data # (m, n), with m = out_features and n = in_features
k = 64 # desired rank

# 2. Approximate low-rank SVD
U, S, V = torch.svd_lowrank(W, q=k) # U: (m, k), S: (k,), V: (n, k)

# 3. Form factors A and B
A = U * S.sqrt() # [m, k]
B = V.t() * S.sqrt().unsqueeze(1) # [k, n]

# 4. Replace with two linear layers and insert the matrices A and B
orig = model.layer
model.layer = torch.nn.Sequential(
    torch.nn.Linear(orig.in_features, k, bias=False),
    torch.nn.Linear(k, orig.out_features, bias=orig.bias is not None),
)
model.layer[0].weight.data.copy_(B)
model.layer[1].weight.data.copy_(A)
if orig.bias is not None:
    model.layer[1].bias.data.copy_(orig.bias.data) # keep the original bias

LoRA: an application of low-rank approximation

LoRA fine-tuning: W is fixed, A and B are trained (source: [1])

I think it’s important to mention LoRA: you have probably heard of LoRA (Low-Rank Adaptation) if you have been following LLM fine-tuning developments. Though not strictly a compression technique, LoRA has become extremely popular for adapting large language models efficiently.

The idea is simple: during fine-tuning, rather than modifying the original model weights W, LoRA freezes them and learns trainable low-rank updates:

$$W' = W + \Delta W = W + AB$$

where A and B are low-rank matrices. This allows for task-specific adaptation with just a fraction of the parameters.

Even better: QLoRA takes this further by combining quantization with low-rank adaptation!

Again, this is a very flexible technique and can be applied at various stages. Usually, LoRA is applied only on specific layers (for example, the weights of attention layers).
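Here is a minimal sketch of what a LoRA-style layer could look like, using the W + AB notation from above. The class name `LoRALinear`, the rank, the layer size, and the initialization (A at zero, so that the wrapped layer starts out identical to the base layer) are my assumptions; this is not the reference implementation from [1].

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update A @ B."""

    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # W (and the bias) stay frozen
        # A starts at zero so that W' = W before any fine-tuning step (assumption).
        self.A = nn.Parameter(torch.zeros(base.out_features, rank))
        self.B = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x W'^T = x W^T + x (AB)^T = base(x) + (x B^T) A^T
        return self.base(x) + (x @ self.B.t()) @ self.A.t()

torch.manual_seed(0)
base = nn.Linear(512, 512)  # hypothetical layer
lora = LoRALinear(base, rank=4)

trainable = sum(p.numel() for p in lora.parameters() if p.requires_grad)
total = sum(p.numel() for p in lora.parameters())
print(f"trainable parameters: {trainable} / {total}")
```

Only A and B receive gradients, so the optimizer state and the checkpoint for a fine-tuned task stay tiny compared with the full model.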

Knowledge Distillation

Knowledge distillation takes a fundamentally different approach from what we have seen so far. Instead of modifying an existing model’s parameters, it transfers the “knowledge” from a large, complex model (the “teacher”) to a smaller, more efficient model (the “student”). The goal is to train the student model to mimic the behavior and replicate the performance of the teacher, which is often an easier task than solving the original problem from scratch.

The distillation loss

Let’s explain some concepts in the case of a classification problem:

The teacher model is usually a large, complex model that achieves high performance on the task at hand
The student model is a second, smaller model with a different architecture, but tailored to the same task
Soft targets: these are the teacher model’s predictions (probabilities, and not labels!). They will be used by the student model to mimic the teacher’s behavior. Note that we use raw predictions and not labels because they also contain information about the confidence of the predictions
Temperature: in addition to the teacher’s prediction, we also use a coefficient T (called temperature) in the softmax function to extract more information from the soft targets. Increasing T softens the distribution and helps the student model give more importance to wrong predictions.
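The effect of the temperature is easy to see on a toy example. The logits below are hypothetical, and the helper is a plain-Python softmax written for illustration:

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T; a larger T gives a flatter distribution."""
    scaled = [z / T for z in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [6.0, 2.0, 1.0]  # hypothetical logits for 3 classes
for T in (1.0, 2.0, 5.0):
    probs = softmax_with_temperature(teacher_logits, T)
    print(T, [round(p, 3) for p in probs])
```

At T = 1 almost all the probability mass sits on the first class; as T grows, the other two classes become visible, and this is the extra signal the student learns from.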

In practice, it is quite straightforward to train the student model. We combine the usual loss (standard cross-entropy loss based on hard labels) with the “distillation” loss (based on the teacher’s soft targets):

$$ L_{\text{total}} = \alpha L_{\text{hard}} + (1 - \alpha) L_{\text{distill}} $$

The distillation loss is nothing but the KL divergence between the teacher and student distributions (you can see it as a measure of the distance between the two distributions).

$$ L_{\text{distill}} = D_{KL}(q_{\text{teacher}} \,\|\, q_{\text{student}}) = \sum_i q_{\text{teacher}, i} \log \left( \frac{q_{\text{teacher}, i}}{q_{\text{student}, i}} \right) $$
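To connect the formula with PyTorch, the sketch below computes the KL divergence twice on hypothetical distributions: once as a direct translation of the sum, and once with `F.kl_div`, whose argument order is easy to get wrong.

```python
import torch
import torch.nn.functional as F

q_teacher = torch.tensor([0.7, 0.2, 0.1])  # hypothetical soft targets
q_student = torch.tensor([0.5, 0.3, 0.2])  # hypothetical student probabilities

# Direct translation of the sum in the formula.
kl_manual = (q_teacher * torch.log(q_teacher / q_student)).sum()

# Same quantity with PyTorch: the first argument must be the student's
# log-probabilities, the second the teacher's probabilities.
kl_torch = F.kl_div(q_student.log(), q_teacher, reduction="sum")

print(kl_manual.item(), kl_torch.item())
```

The KL divergence is not symmetric, so swapping teacher and student gives a different value; strictly speaking it is not a distance.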

As with the other methods, it is possible and encouraged to adapt this framework depending on the use case: for example, one can also compare logits and activations from intermediate layers in the network between the student and teacher models, instead of only comparing the final outputs.

Knowledge distillation in practice

Similar to the previous techniques, there are two options:

Offline distillation: the pre-trained teacher model is fixed, and a separate student model is trained to mimic it. Both models are completely separate, and the teacher’s weights remain frozen during the distillation process.
Online distillation: both models are trained simultaneously, with knowledge transfer happening during the joint training process.

And below, an easy way to apply offline distillation (the last code block of this article 🙂):

import torch
import torch.nn.functional as F

def distillation_loss_fn(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Standard Cross-Entropy loss with hard labels
    student_loss = F.cross_entropy(student_logits, labels)

    # Distillation loss with soft targets (KL Divergence)
    soft_teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)

    # kl_div expects log probabilities as input for the first argument!
    distill_loss = F.kl_div(
        soft_student_log_probs,
        soft_teacher_probs.detach(), # don't calculate gradients for teacher
        reduction='batchmean'
    ) * (temperature ** 2) # optional, a scaling factor

    # Combine losses according to formula
    total_loss = alpha * student_loss + (1 - alpha) * distill_loss
    return total_loss

teacher_model.eval()
student_model.train()
with torch.no_grad():
    teacher_logits = teacher_model(inputs) # the teacher is frozen: no gradients needed
# the student's forward and backward passes must stay outside no_grad
student_logits = student_model(inputs)
loss = distillation_loss_fn(student_logits, teacher_logits, labels, temperature=T, alpha=alpha)
optimizer.zero_grad()
loss.backward()
optimizer.step()

Conclusion

Thanks for reading this article! In the era of LLMs, with billions or even trillions of parameters, model compression has become a fundamental concept, essential in almost every scenario to make models more efficient and easily deployable.

But as we have seen, model compression isn’t just about reducing the model size: it’s about making thoughtful design decisions. Whether choosing between online and offline methods, compressing the entire network, or targeting specific layers or channels, each choice significantly impacts performance and usability. Most models now combine several of these techniques (check out this model, for instance).

Beyond introducing you to the main methods, I hope this article also inspires you to experiment and develop your own creative solutions!

Don’t forget to check out the GitHub repository, where you’ll find all the code snippets and a side-by-side comparison of the four compression methods discussed in this article.

References

[1] Hu, E., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv preprint arXiv:2106.09685.
[2] Lightning AI. Accelerating Large Language Models with Mixed Precision Techniques. Lightning AI Blog.
[3] TensorFlow Blog. Pruning API in TensorFlow Model Optimization Toolkit. TensorFlow Blog, May 2019.
[4] Towards AI. A Gentle Introduction to Knowledge Distillation. Towards AI, Aug 2022.
[5] Ju, A. ML Algorithm: Singular Value Decomposition (SVD). LinkedIn Pulse.
[6] Algorithmic Simplicity. THIS is why large language models can understand the world. YouTube, Apr 2023.
[7] Frankle, J., & Carbin, M. (2019). The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. arXiv preprint arXiv:1803.03635.
