    How to build a better AI benchmark

    By Team_AIBS News | May 8, 2025 | 3 Mins Read


    The limits of traditional testing

    If AI companies have been slow to respond to the growing failure of benchmarks, it’s partly because the test-scoring approach has been so effective for so long.

    One of the biggest early successes of contemporary AI was the ImageNet challenge, a kind of antecedent to today’s benchmarks. Released in 2010 as an open challenge to researchers, the database held more than 3 million images for AI systems to categorize into 1,000 different classes.

    Crucially, the test was completely agnostic to methods, and any successful algorithm quickly gained credibility regardless of how it worked. When an algorithm called AlexNet broke through in 2012, with a then-unconventional form of GPU training, it became one of the foundational results of modern AI. Few would have guessed in advance that AlexNet’s convolutional neural nets would be the key to unlocking image recognition, but after it scored well, no one dared dispute it. (One of AlexNet’s developers, Ilya Sutskever, would go on to cofound OpenAI.)

    A large part of what made this challenge so effective was that there was little practical difference between ImageNet’s object classification challenge and the actual process of asking a computer to recognize an image. Even if there were disputes about methods, no one doubted that the highest-scoring model would have an advantage when deployed in an actual image recognition system.

    But in the 12 years since, AI researchers have applied that same method-agnostic approach to increasingly general tasks. SWE-Bench is commonly used as a proxy for broader coding ability, while other exam-style benchmarks often stand in for reasoning ability. That broad scope makes it difficult to be rigorous about what a specific benchmark measures, which in turn makes it hard to use the findings responsibly.

    Where things break down

    Anka Reuel, a PhD student who has been focusing on the benchmark problem as part of her research at Stanford, has become convinced the evaluation problem is the result of this push toward generality. “We’ve moved from task-specific models to general-purpose models,” Reuel says. “It’s not about a single task anymore but a whole bunch of tasks, so evaluation becomes harder.”

    Like the University of Michigan’s Jacobs, Reuel thinks “the main issue with benchmarks is validity, even more than the practical implementation,” noting: “That’s where a lot of things break down.” For a task as complicated as coding, for instance, it’s nearly impossible to incorporate every possible scenario into your problem set. As a result, it’s hard to gauge whether a model is scoring better because it’s more skilled at coding or because it has more effectively manipulated the problem set. And with so much pressure on developers to achieve record scores, shortcuts are hard to resist.

    For developers, the hope is that success on lots of specific benchmarks will add up to a generally capable model. But the techniques of agentic AI mean a single AI system can encompass a complex array of different models, making it hard to evaluate whether improvement on a specific task will lead to generalization. “There’s just many more knobs you can turn,” says Sayash Kapoor, a computer scientist at Princeton and a prominent critic of sloppy practices in the AI industry. “When it comes to agents, they have sort of given up on the best practices for evaluation.”


