    Training with Ten Thousand GPUs: Distributed Machine Learning (learning notes) | by Chiao-Min Chang | Jun, 2025

    Training computes the gradient of the objective function with respect to each parameter in the model and updates the model's parameters.

    Gradient: an estimate of the direction and magnitude of the objective function's slope, used to minimize the loss function.

    L(w) = (w - 3)²
    w = 5
    dL/dw = 2(5 - 3) = 4
    w_new = w_old - η * gradient
    η = 0.1 → w_new = 5 - 0.1 * 4 = 4.6
    d = derivative → how L changes with w
    Derivative: rate of change
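    A minimal Python sketch of the worked example above (plain gradient descent on L(w) = (w - 3)²; the function names and the number of steps are illustrative):

    def loss(w):
        return (w - 3) ** 2

    def grad(w):
        # dL/dw = 2(w - 3)
        return 2 * (w - 3)

    w = 5.0
    eta = 0.1                       # learning rate η
    for step in range(3):
        g = grad(w)                 # 2 * (5 - 3) = 4 on the first step
        w = w - eta * g             # 5 - 0.1 * 4 = 4.6 on the first step
        print(step, round(w, 4), round(loss(w), 4))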

    MapReduce comparison: a distributed computing model proposed by Google (the basis of big-data systems such as Hadoop/Spark), capable of horizontal scaling and not limited by the performance of a single node.

    Example: counting the frequency of each word in a document set (sketched below)
    Map: split the input into key-value pairs for distributed processing → (word, 1)
    Reduce: aggregate and summarize the processed key-value pairs → summing identical words
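    A single-process Python sketch of that word count, assuming a small in-memory document set; a real MapReduce framework would distribute these two phases across many nodes:

    from collections import defaultdict

    documents = ["the cat sat", "the dog sat"]

    # Map: emit (word, 1) for every word in every document
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Reduce: sum the counts of identical words
    counts = defaultdict(int)
    for word, n in mapped:
        counts[word] += n

    print(dict(counts))  # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}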

    • Node: a physical machine or GPU unit, e.g., 4 GPUs = 4 nodes, each with its own memory and processor
    • A complete compute unit (may include CPU, RAM, disk, network)
    • Single machine: 1 node
    • Multi-GPU: 1 node + multiple accelerators (GPU, TPU; e.g., one server with 4x A100 → process several batches in parallel or split the model)
    • Cloud cluster: multiple nodes + distributed communication (see the sketch after this list)
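    A small PyTorch sketch of the single-node, multi-GPU case (the model is a placeholder; nn.DataParallel is only one way to replicate a model across local GPUs):

    import torch
    import torch.nn as nn

    num_gpus = torch.cuda.device_count()
    device = torch.device("cuda" if num_gpus > 0 else "cpu")
    print(f"this node has {num_gpus} GPU(s)")

    model = nn.Linear(4, 2).to(device)        # placeholder model
    if num_gpus > 1:
        # replicate the model on each GPU; each replica gets a slice of the batch
        model = nn.DataParallel(model)

    x = torch.randn(8, 4, device=device)      # one batch of 8 samples
    y = model(x)                              # forward pass split across the GPUs
    print(y.shape)                            # torch.Size([8, 2])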

    1. Compute & Communication Intensive:
    Built on linear algebra (repeated matrix multiplications).
    Forward pass: y = Wx + b → PyTorch / TensorFlow call underlying linear algebra libraries (cuBLAS); a minimal sketch follows below.

    cuBLAS: NVIDIA's GPU-accelerated BLAS (Basic Linear Algebra Subprograms), used for matrix multiplication, etc.
    CUDA = Compute Unified Device Architecture
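    A minimal sketch of the forward pass y = Wx + b (the shapes are illustrative; on a CUDA device, torch dispatches the matrix product to cuBLAS under the hood):

    import torch

    W = torch.randn(2, 4)        # weight matrix: 4 inputs → 2 outputs
    b = torch.randn(2)           # bias
    x = torch.randn(4)           # one input sample with 4 features

    y = W @ x + b                # forward pass: matrix-vector product plus bias
    print(y.shape)               # torch.Size([2])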

    2. Iterative: specific data access patterns. Each iteration repeatedly trains on a small batch (random or sequential).

    Each training iteration: e.g., fraud detection using transaction data.
    Train on 512 samples per iteration = a mini-batch iteration.
    512 is a practical size; too large a batch may cause GPU memory overflow (OOM: Out of Memory).
    Each epoch: shuffle all the data and pick different batches again → reflects the global data distribution. Use a sliding window or retrain periodically.
    Pipeline example: feature extraction (front), risk assessment (back), e.g., Kafka + an ML model. A mini-batch training loop is sketched below.
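    A sketch of a mini-batch training loop with batch size 512 for a fraud-detection-style binary classifier; the data is random and purely illustrative:

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    X = torch.randn(10_000, 4)                     # 10k transactions, 4 features
    y = torch.randint(0, 2, (10_000,)).float()     # labels: fraud / not fraud
    loader = DataLoader(TensorDataset(X, y), batch_size=512, shuffle=True)

    model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()

    for epoch in range(3):                         # each epoch reshuffles the data
        for xb, yb in loader:                      # one mini-batch iteration
            optimizer.zero_grad()
            loss = loss_fn(model(xb).squeeze(1), yb)
            loss.backward()                        # compute gradients
            optimizer.step()                       # update parameters
        print(epoch, loss.item())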

    3. Fault Tolerance:
    A probabilistic model plus many iterations: a small error in one step is averaged out by the overall optimization process. For example, optimizers such as SGD/Adam tolerate noise and asynchronous gradients or parameters (packet loss and network instability can cause data loss).

    This is unlike strict logical reasoning, where a single error stays wrong all the way through. But if the errors are large, convergence slows and performance drops; if they are too frequent, the whole training run fails.

    • Slow convergence: many iterations but the loss does not decrease (e.g., 10,000 iterations with the loss still high)
    • Performance degradation: Accuracy, Recall < 0.5; AUC < 0.7–0.8; loss plateaus → the model fails to learn the logic
    • Transfer errors: high latency, packet loss, memory overflow (especially Ethernet vs. InfiniBand)
    • Ethernet: slow (1–10 ms), prone to congestion and packet loss
    • InfiniBand: for HPC, low latency (~1 μs; 1 μs = 0.000001 s), supports RDMA
    • RDMA = Remote Direct Memory Access → servers access each other's memory directly without involving the CPU → stable, used in distributed training

    4. Hundreds of parameter updates needed for convergence; non-determinism:
    Model structure and batch size affect behavior.
    With large-scale data (billions of samples) and deep models (Transformer language models), expect gradient oscillation or uneven data distribution.

    • Example loss: 1.2 → 1.1 → 1.3 → 1.0 → 1.4 → no steady decrease
    • Hard to process continuous online data all at once (preprocessing helps: class balancing, normalization)
    • The feature space may shift during training (e.g., evolving fraud tactics)
    • Solutions: sliding-window retraining, weekly retraining on the latest data, or incremental/online learning (models that continuously adapt); a sliding-window sketch follows this list
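    A sketch of sliding-window retraining; the window size, the scikit-learn model, and the synthetic data are illustrative assumptions, not from the notes:

    from collections import deque
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    window = deque(maxlen=5_000)              # keep only the latest 5,000 samples

    def retrain(window):
        X = np.array([features for features, _ in window])
        y = np.array([label for _, label in window])
        return LogisticRegression(max_iter=1000).fit(X, y)

    # each period: append the newest (features, label) pairs, then refit on the window
    for period in range(3):
        X_new = np.random.randn(1_000, 4)
        y_new = np.random.randint(0, 2, 1_000)
        window.extend(zip(X_new, y_new))
        model = retrain(window)
        print(period, model.score(X_new, y_new))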

    5. Sparse updates in some subsets:
    Sparse data structures → e.g., Lasso only updates a few parameters each time

    • Sparse update: only a few parameters are updated
    • Lasso (L1-regularized regression): many parameters shrink to 0 → the model becomes compact and interpretable
    • Example: 10,000 features → only 10 non-zero → the rest are zero (see the sketch below)
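    A scikit-learn sketch of Lasso producing sparse coefficients on synthetic data (the feature count and the alpha value are illustrative, smaller than the 10,000-feature example above):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 1_000))          # 200 samples, 1,000 features
    true_coef = np.zeros(1_000)
    true_coef[:10] = rng.standard_normal(10) * 5   # only 10 features truly matter
    y = X @ true_coef + 0.1 * rng.standard_normal(200)

    model = Lasso(alpha=0.1).fit(X, y)
    print("non-zero coefficients:", np.count_nonzero(model.coef_))  # far fewer than 1,000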

    6. Network Bottlenecks: frequent parameter updates consume high bandwidth. Faster GPUs = more traffic

    • Model structure: CNN (stacked convolutions), Transformer (multi-layer self-attention); each architecture affects the memory and communication design
    • Deep architectures (12+ layers) → early epochs produce unstable gradients that stabilize over time
    • L1: drives weights to zero (sparse); L2: shrinks weight magnitudes but not to zero (prevents explosion)
    • If the learning rate is too large or there is no normalization, some parameters grow extremely large, making the model unstable or producing NaN
    • NaN = not-a-number (0/0, log(0), overflow). Values can be clamped, but it is better to fix the model design
    • Regression = predicting continuous values (e.g., price, temperature): Linear Regression, Ridge, Lasso
    • Real-time data storage: Kafka, S3, data lake
    • Stream processing: Spark Streaming → real-time prediction
    • Matrix multiplication: X * W = Z (sketched below)
    • X: 3×4 matrix, W: 4×2 → Z: 3×2
    • [age, amount, location, time] → output [fraud or not, risk level]
    • Inner dimensions must match (4), outer shape = result (3×2 → 6 elements)
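    A NumPy sketch of that 3×4 by 4×2 multiplication (the values are random; only the shapes matter here):

    import numpy as np

    X = np.random.randn(3, 4)   # 3 transactions × 4 features [age, amount, location, time]
    W = np.random.randn(4, 2)   # maps 4 features to 2 outputs [fraud or not, risk level]

    Z = X @ W                   # inner dimensions match (4); result is 3×2
    print(Z.shape)              # (3, 2) → 6 elements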


