    Training with Ten Thousand GPUs: Distributed Machine Learning (learning notes) | by Chiao-Min Chang | Jun, 2025

    Training computes the gradient of the objective function with respect to each parameter in the model and updates the model's parameters.

    Gradient: an estimate of the direction and magnitude of the objective function's slope, used to minimize the loss function.

    L(w) = (w - 3)²
    w = 5
    dL/dw = 2(5 - 3) = 4
    w_new = w_old - η * gradient
    η = 0.1 → w_new = 5 - 0.1 * 4 = 4.6
    d = derivative → how L changes with w
    Derivative: rate of change
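    A minimal Python sketch of the worked example above (plain gradient descent on L(w) = (w - 3)²; the function names and the number of steps are illustrative):

    def loss(w):
        return (w - 3) ** 2

    def grad(w):
        # dL/dw = 2(w - 3)
        return 2 * (w - 3)

    w = 5.0
    eta = 0.1                       # learning rate η
    for step in range(3):
        g = grad(w)                 # 2 * (5 - 3) = 4 on the first step
        w = w - eta * g             # 5 - 0.1 * 4 = 4.6 on the first step
        print(step, round(w, 4), round(loss(w), 4))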

    MapReduce comparison: a distributed computing model proposed by Google (the basis of big-data systems such as Hadoop/Spark), capable of horizontal scaling and not limited by the performance of a single node.

    Example: counting the frequency of each word in a document set (sketched below)
    Map: split the input into key-value pairs for distributed processing → (word, 1)
    Reduce: aggregate and summarize the processed key-value pairs → summing identical words
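    A single-process Python sketch of that word count, assuming a small in-memory document set; a real MapReduce framework would distribute these two phases across many nodes:

    from collections import defaultdict

    documents = ["the cat sat", "the dog sat"]

    # Map: emit (word, 1) for every word in every document
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Reduce: sum the counts of identical words
    counts = defaultdict(int)
    for word, n in mapped:
        counts[word] += n

    print(dict(counts))  # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}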

    • Node: a physical machine or GPU unit, e.g., 4 GPUs = 4 nodes, each with its own memory and processor
    • A complete compute unit (may include CPU, RAM, disk, network)
    • Single machine: 1 node
    • Multi-GPU: 1 node + multiple accelerators (GPU, TPU; e.g., one server with 4x A100 → process several batches in parallel or split the model)
    • Cloud cluster: multiple nodes + distributed communication (see the sketch after this list)
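    A small PyTorch sketch of the single-node, multi-GPU case (the model is a placeholder; nn.DataParallel is only one way to replicate a model across local GPUs):

    import torch
    import torch.nn as nn

    num_gpus = torch.cuda.device_count()
    device = torch.device("cuda" if num_gpus > 0 else "cpu")
    print(f"this node has {num_gpus} GPU(s)")

    model = nn.Linear(4, 2).to(device)        # placeholder model
    if num_gpus > 1:
        # replicate the model on each GPU; each replica gets a slice of the batch
        model = nn.DataParallel(model)

    x = torch.randn(8, 4, device=device)      # one batch of 8 samples
    y = model(x)                              # forward pass split across the GPUs
    print(y.shape)                            # torch.Size([8, 2])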

    1. Compute & Communication Intensive:
    Built on linear algebra (repeated matrix multiplications).
    Forward pass: y = Wx + b → PyTorch / TensorFlow call underlying linear algebra libraries (cuBLAS); a minimal sketch follows below.

    cuBLAS: NVIDIA's GPU-accelerated BLAS (Basic Linear Algebra Subprograms), used for matrix multiplication, etc.
    CUDA = Compute Unified Device Architecture
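    A minimal sketch of the forward pass y = Wx + b (the shapes are illustrative; on a CUDA device, torch dispatches the matrix product to cuBLAS under the hood):

    import torch

    W = torch.randn(2, 4)        # weight matrix: 4 inputs → 2 outputs
    b = torch.randn(2)           # bias
    x = torch.randn(4)           # one input sample with 4 features

    y = W @ x + b                # forward pass: matrix-vector product plus bias
    print(y.shape)               # torch.Size([2])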

    2. Iterative: specific data access patterns. Each iteration repeatedly trains on a small batch (random or sequential).

    Each training iteration: e.g., fraud detection using transaction data.
    Train on 512 samples per iteration = a mini-batch iteration.
    512 is a practical size; too large a batch may cause GPU memory overflow (OOM: Out of Memory).
    Each epoch: shuffle all the data and pick different batches again → reflects the global data distribution. Use a sliding window or retrain periodically.
    Pipeline example: feature extraction (front), risk assessment (back), e.g., Kafka + an ML model. A mini-batch training loop is sketched below.
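    A sketch of a mini-batch training loop with batch size 512 for a fraud-detection-style binary classifier; the data is random and purely illustrative:

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    X = torch.randn(10_000, 4)                     # 10k transactions, 4 features
    y = torch.randint(0, 2, (10_000,)).float()     # labels: fraud / not fraud
    loader = DataLoader(TensorDataset(X, y), batch_size=512, shuffle=True)

    model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()

    for epoch in range(3):                         # each epoch reshuffles the data
        for xb, yb in loader:                      # one mini-batch iteration
            optimizer.zero_grad()
            loss = loss_fn(model(xb).squeeze(1), yb)
            loss.backward()                        # compute gradients
            optimizer.step()                       # update parameters
        print(epoch, loss.item())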

    3. Fault Tolerance:
    A probabilistic model plus many iterations: a small error in one step is averaged out by the overall optimization process. For example, optimizers such as SGD/Adam tolerate noise and asynchronous gradients or parameters (packet loss and network instability can cause data loss).

    This is unlike strict logical reasoning, where a single error stays wrong all the way through. But if the errors are large, convergence slows and performance drops; if they are too frequent, the whole training run fails.

    • Slow convergence: many iterations but the loss does not decrease (e.g., 10,000 iterations with the loss still high)
    • Performance degradation: Accuracy, Recall < 0.5; AUC < 0.7–0.8; loss plateaus → the model fails to learn the logic
    • Transfer errors: high latency, packet loss, memory overflow (especially Ethernet vs. InfiniBand)
    • Ethernet: slow (1–10 ms), prone to congestion and packet loss
    • InfiniBand: for HPC, low latency (~1 μs; 1 μs = 0.000001 s), supports RDMA
    • RDMA = Remote Direct Memory Access → servers access each other's memory directly without involving the CPU → stable, used in distributed training

    4. Hundreds of parameter updates needed for convergence; non-determinism:
    Model structure and batch size affect behavior.
    With large-scale data (billions of samples) and deep models (Transformer language models), expect gradient oscillation or uneven data distribution.

    • Example loss: 1.2 → 1.1 → 1.3 → 1.0 → 1.4 → no steady decrease
    • Hard to process continuous online data all at once (preprocessing helps: class balancing, normalization)
    • The feature space may shift during training (e.g., evolving fraud tactics)
    • Solutions: sliding-window retraining, weekly retraining on the latest data, or incremental/online learning (models that continuously adapt); a sliding-window sketch follows this list
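    A sketch of sliding-window retraining; the window size, the scikit-learn model, and the synthetic data are illustrative assumptions, not from the notes:

    from collections import deque
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    window = deque(maxlen=5_000)              # keep only the latest 5,000 samples

    def retrain(window):
        X = np.array([features for features, _ in window])
        y = np.array([label for _, label in window])
        return LogisticRegression(max_iter=1000).fit(X, y)

    # each period: append the newest (features, label) pairs, then refit on the window
    for period in range(3):
        X_new = np.random.randn(1_000, 4)
        y_new = np.random.randint(0, 2, 1_000)
        window.extend(zip(X_new, y_new))
        model = retrain(window)
        print(period, model.score(X_new, y_new))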

    5. Sparse updates in some subsets:
    Sparse data structures → e.g., Lasso only updates a few parameters each time

    • Sparse update: only a few parameters are updated
    • Lasso (L1-regularized regression): many parameters shrink to 0 → the model becomes compact and interpretable
    • Example: 10,000 features → only 10 non-zero → the rest are zero (see the sketch below)
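    A scikit-learn sketch of Lasso producing sparse coefficients on synthetic data (the feature count and the alpha value are illustrative, smaller than the 10,000-feature example above):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 1_000))          # 200 samples, 1,000 features
    true_coef = np.zeros(1_000)
    true_coef[:10] = rng.standard_normal(10) * 5   # only 10 features truly matter
    y = X @ true_coef + 0.1 * rng.standard_normal(200)

    model = Lasso(alpha=0.1).fit(X, y)
    print("non-zero coefficients:", np.count_nonzero(model.coef_))  # far fewer than 1,000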

    6. Network Bottlenecks: frequent parameter updates consume high bandwidth. Faster GPUs = more traffic

    • Model structure: CNN (stacked convolutions), Transformer (multi-layer self-attention); each architecture affects the memory and communication design
    • Deep architectures (12+ layers) → early epochs produce unstable gradients that stabilize over time
    • L1: drives weights to zero (sparse); L2: shrinks weight magnitudes but not to zero (prevents explosion)
    • If the learning rate is too large or there is no normalization, some parameters grow extremely large, making the model unstable or producing NaN
    • NaN = not-a-number (0/0, log(0), overflow). Values can be clamped, but it is better to fix the model design
    • Regression = predicting continuous values (e.g., price, temperature): Linear Regression, Ridge, Lasso
    • Real-time data storage: Kafka, S3, data lake
    • Stream processing: Spark Streaming → real-time prediction
    • Matrix multiplication: X * W = Z (sketched below)
    • X: 3×4 matrix, W: 4×2 → Z: 3×2
    • [age, amount, location, time] → output [fraud or not, risk level]
    • Inner dimensions must match (4), outer shape = result (3×2 → 6 elements)
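    A NumPy sketch of that 3×4 by 4×2 multiplication (the values are random; only the shapes matter here):

    import numpy as np

    X = np.random.randn(3, 4)   # 3 transactions × 4 features [age, amount, location, time]
    W = np.random.randn(4, 2)   # maps 4 features to 2 outputs [fraud or not, risk level]

    Z = X @ W                   # inner dimensions match (4); result is 3×2
    print(Z.shape)              # (3, 2) → 6 elements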


