
    Reinforcement Learning Uncovers Silent Data Errors

    By Team_AIBS News, April 26, 2025

    For high-performance chips in massive data centers, math can be the enemy. Because of the sheer scale of the calculations in hyperscale data centers, which run around the clock with hundreds of thousands of nodes and vast amounts of silicon, extremely rare errors appear. It's simply statistics. These rare, "silent" data errors don't show up during conventional quality-control screenings, even when companies spend hours looking for them.

    This month at the IEEE International Reliability Physics Symposium in Monterey, Calif., Intel engineers described a technique that uses reinforcement learning to uncover more silent data errors faster. The company is using the machine learning technique to ensure the quality of its Xeon processors.

    When an error occurs in a data center, operators can either take a node down and replace it, or use the flawed system for lower-stakes computing, says Manu Shamsa, an electrical engineer at Intel's Chandler, Ariz., campus. But it would be much better if errors could be detected earlier. Ideally they would be caught before a chip is built into a computer system, when it is still possible to make design or manufacturing corrections that prevent the errors from recurring.

    Finding these flaws isn't easy. Shamsa says engineers have been so baffled by them that they joked the errors must be due to spooky action at a distance, Einstein's phrase for quantum entanglement. But there's nothing spooky about them, and Shamsa has spent years characterizing them. In a paper presented at the same conference last year, his team provided a comprehensive catalog of the causes of these errors. Most are due to infinitesimal variations in manufacturing.

    Even if each of the billions of transistors on a chip is functional, they are not completely identical to one another. Subtle differences in how a given transistor responds to changes in temperature, voltage, or frequency, for instance, can lead to an error.

    These subtleties are much more likely to crop up in huge data centers because of the pace of computing and the vast amount of silicon involved. "In a laptop, you won't notice any errors. In data centers, with really dense nodes, there are high chances the stars will align and an error will occur," Shamsa says.
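The scale argument can be made concrete with a short calculation. The sketch below assumes, purely for illustration, that every operation fails independently with the same tiny probability; the probability and the operation counts are hypothetical values, not figures from Intel.

```python
import math


def prob_at_least_one_error(p_per_op, n_ops):
    """Return 1 - (1 - p)^n, the chance that at least one of n operations errs.

    log1p/expm1 keep the result accurate when p is tiny and n is huge.
    """
    return -math.expm1(n_ops * math.log1p(-p_per_op))


# Hypothetical per-operation error probability (an assumption for illustration).
P_PER_OP = 1e-15

# Hypothetical operation counts for a single small machine and a large fleet.
for label, n_ops in (("laptop-scale", 1e9), ("data-center-scale", 1e17)):
    print(label, prob_at_least_one_error(P_PER_OP, n_ops))
```

Under these assumed numbers the small workload almost never sees an error, while the large one almost certainly does, which is the "stars will align" effect Shamsa describes.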

    Some errors may crop up only after a chip has been installed in a data center and has been operating for months. Small variations in the properties of transistors can cause them to degrade over time. One such silent error Shamsa has found is related to electrical resistance. A transistor that operates properly at first, and passes the standard tests that look for shorts, can degrade with use so that it becomes more resistive.

    "You're thinking everything is fine, but underneath, an error is causing a wrong decision," Shamsa says. Over time, because of a slight weakness in a single transistor, "one plus one goes to three, silently, until you see the impact," he says.

    The new technique builds on an existing set of methods for detecting silent errors, called Eigen tests. These tests make the chip do hard math problems repeatedly over a period of time, in the hope of making silent errors apparent. They involve operations on matrices of different sizes filled with random data.
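The general idea of such a test can be sketched in a few lines. This is a minimal illustration, not the actual Eigen test code: the function names, matrix sizes, and repeat counts are assumptions. It fills matrices with random data, computes a product once as a reference, then repeats the same computation and counts any run whose result differs.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def random_matrix(rng, rows, cols):
    """Build a rows x cols matrix of random values in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def stress_test(size, repeats, seed=0):
    """Repeat one matrix product and count results that differ from the first.

    The same inputs should always give bit-identical outputs, so any
    mismatch points to a silent computation error.
    """
    rng = random.Random(seed)
    a = random_matrix(rng, size, size)
    b = random_matrix(rng, size, size)
    reference = matmul(a, b)
    return sum(1 for _ in range(repeats) if matmul(a, b) != reference)


# Hypothetical sizes and repeat count, chosen only to keep the example fast.
for size in (4, 8, 16):
    print(size, stress_test(size, repeats=20))
```

On hardware that is working correctly, every mismatch count should be zero. A real screening run would use far larger matrices and run for far longer.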

    There are many Eigen tests. Running all of them would take an impractical amount of time, so chipmakers use a randomized approach to generate a manageable set of them. This saves time but leaves errors undetected. "There's no principle to guide the selection of inputs," Shamsa says. He wanted to find a way to guide the selection so that a relatively small number of tests could turn up more errors.

    The Intel team used reinforcement learning to develop tests for the part of its Xeon CPU chip that performs matrix multiplication using what are called fused-multiply-add (FMA) instructions. Shamsa says they chose the FMA region because it takes up a relatively large area of the chip, making it more vulnerable to potential silent errors: more silicon, more problems. What's more, flaws in this part of a chip can generate electromagnetic fields that affect other parts of the system. And because the FMA is turned off to save power when it's not in use, testing it involves repeatedly powering it up and down, potentially activating hidden defects that otherwise wouldn't appear in standard tests.
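For readers unfamiliar with the term, a fused multiply-add computes a × b + c with a single rounding at the end, rather than rounding once after the multiplication and again after the addition. The sketch below emulates that behavior in software with exact rational arithmetic. The function names are hypothetical and the code says nothing about how the Xeon hardware implements the instruction.

```python
from fractions import Fraction


def unfused_multiply_add(a, b, c):
    """Ordinary floating-point a * b + c: the product is rounded, then the sum."""
    return a * b + c


def emulated_fma(a, b, c):
    """Emulate a fused multiply-add: compute exactly, then round once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Inputs chosen so that the exact product, 1 - 2**-60, rounds to 1.0 as a double.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

print(unfused_multiply_add(a, b, c))
print(emulated_fma(a, b, c))
```

With these inputs the two-step version returns 0.0, because the small term is lost when the product is rounded, while the fused version keeps it and returns -2^-60.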

    During each step of its training, the reinforcement-learning program selects different tests for the potentially defective chip. Each error it detects is treated as a reward, and over time the agent learns which tests maximize the chances of detecting errors. After about 500 testing cycles, the algorithm had learned which set of Eigen tests optimized the error-detection rate for the FMA region.
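The reward loop described here can be illustrated with one of the simplest reinforcement-learning setups, an epsilon-greedy bandit. This is a toy sketch and not Intel's algorithm, whose details the article does not give. The candidate test configurations, their detection probabilities, and the exploration rate are all hypothetical, and the "chip" is simulated with a random number generator.

```python
import random


def select_tests(detect_prob, cycles=500, epsilon=0.1, seed=0):
    """Learn which test configuration detects errors most often.

    detect_prob holds the simulated chance that each configuration exposes
    an error in one cycle. A detected error counts as a reward of 1.
    """
    rng = random.Random(seed)
    n = len(detect_prob)
    counts = [0] * n        # times each configuration has been run
    estimates = [0.0] * n   # running average reward per configuration
    for _ in range(cycles):
        if rng.random() < epsilon:
            arm = rng.randrange(n)                            # explore at random
        else:
            arm = max(range(n), key=lambda i: estimates[i])   # exploit best so far
        reward = 1.0 if rng.random() < detect_prob[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates


# Hypothetical detection probabilities for four candidate test configurations.
HYPOTHETICAL_DETECT_PROB = [0.01, 0.02, 0.30, 0.05]

counts, estimates = select_tests(HYPOTHETICAL_DETECT_PROB)
print("runs per configuration:", counts)
print("estimated detection rates:", estimates)
```

The 500-cycle default mirrors the figure quoted above. In this toy setting the agent tends to concentrate its runs on the configuration with the highest simulated detection rate, though individual runs vary with the random seed.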

    Shamsa says this technique is five times as likely to detect a defect as randomized Eigen testing. Eigen tests are open source, part of the openDCDiag for data centers. So other users should be able to use reinforcement learning to modify these tests for their own systems, he says.

    To a certain extent, silent, subtle flaws are an unavoidable part of the manufacturing process; absolute perfection and uniformity remain out of reach. But Shamsa says Intel is trying to use this research to learn to find the precursors that lead to silent data errors faster. He's investigating whether there are red flags that could provide an early warning of future errors, and whether it's possible to change chip recipes or designs to manage them.
