
    We Beat the Top CMU AI Agent Benchmark Without Changing the Model | by wesheets | Jun, 2025

    By Team_AIBS News | June 5, 2025 | 3 Mins Read


    In the rapidly evolving landscape of autonomous AI agents, a fundamental shift is occurring. While much of the industry remains focused on scaling model parameters, a different approach is emerging: one that prioritizes governance over size, structure over scale, and reliability over raw capability.

    The Carnegie Mellon University Agent Company Benchmark, launched in late 2024, represents a watershed moment in agent evaluation. Unlike earlier benchmarks that focused on narrow capabilities, the CMU framework evaluates agents across multiple domains, requiring them to navigate complex scenarios with real-world constraints.

    The benchmark’s innovation lies in its comprehensive approach:

    • Multi-domain evaluation across diverse professional contexts

    • Role-based scenarios that test contextual understanding

    • Multi-step tasks requiring planning and adaptation

    • Objective metrics for resolution, quality, and efficiency

    When initially released, the benchmark revealed the limitations of even the most advanced models. The top performer, built on Gemini 2.5 Pro, achieved only a 30.3% resolution rate.

    Our work with the Promethios framework demonstrates that these limitations aren’t inherent to current models; they’re a consequence of how we deploy them.

    When we applied Promethios to the identical benchmark suite, the results were transformative:

    These results weren’t achieved through model modifications or task-specific training. The underlying language models remained unchanged. What changed was the framework surrounding them.

    Promethios represents a fundamental rethinking of agent architecture. Rather than treating governance as an afterthought, it places it at the center of the system design.

    The approach is built on several key principles:

    By implementing clear boundaries and constraints, agents gain the structure needed to operate reliably in complex environments. This isn’t about limitation; it’s about providing the scaffolding that enables consistent performance.
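
    As a rough illustration of what explicit boundaries can look like in code, the sketch below checks each proposed action against an allow-list and a step budget before anything runs. The names `Action`, `Policy`, `allowed_tools`, and `max_steps` are hypothetical and chosen for this example; they are not taken from Promethios, whose internals the article does not describe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A proposed agent step (hypothetical shape, for illustration only)."""
    tool: str
    argument: str


@dataclass
class Policy:
    """Explicit boundaries: which tools are allowed and how many steps may run."""
    allowed_tools: frozenset
    max_steps: int
    steps_taken: int = 0

    def check(self, action: Action) -> tuple[bool, str]:
        """Return (allowed, reason) without executing anything."""
        if self.steps_taken >= self.max_steps:
            return False, "step budget exhausted"
        if action.tool not in self.allowed_tools:
            return False, f"tool '{action.tool}' is outside the allowed set"
        return True, "ok"

    def record(self) -> None:
        """Count one executed step against the budget."""
        self.steps_taken += 1


policy = Policy(allowed_tools=frozenset({"search", "read_file"}), max_steps=2)
print(policy.check(Action("search", "quarterly report")))
print(policy.check(Action("delete_file", "/tmp/x")))
```

    Returning a reason alongside the decision is one possible design choice: it lets the agent adjust its plan instead of simply failing.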

    Agents that can evaluate their own processes make better decisions. The framework encourages systematic reflection, allowing agents to identify potential issues before they become problems.
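
    One common way to structure this kind of reflection is a critique-and-revise loop, sketched below under simple assumptions. The function `reflect_and_revise` and the toy critic and reviser are hypothetical stand-ins; a real system would presumably call a language model in their place, and the article does not say how Promethios implements reflection.

```python
from typing import Callable, List


def reflect_and_revise(
    draft: str,
    critique: Callable[[str], List[str]],
    revise: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> str:
    """Critique a draft and revise it until no issues remain or rounds run out."""
    for _ in range(max_rounds):
        issues = critique(draft)
        if not issues:
            break
        draft = revise(draft, issues)
    return draft


# Stand-in critic and reviser; a real system would call a model here.
def toy_critique(plan: str) -> List[str]:
    return [] if "verify" in plan else ["plan has no verification step"]


def toy_revise(plan: str, issues: List[str]) -> str:
    return plan + "; verify output"


final_plan = reflect_and_revise("fetch data; write report", toy_critique, toy_revise)
print(final_plan)
```

    The `max_rounds` cap keeps the loop bounded even if the critic never reports a clean result.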

    Different scenarios require different levels of oversight. The framework dynamically adjusts its approach based on context, providing more guidance in high-risk situations and more autonomy where appropriate.
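
    A minimal sketch of context-dependent oversight, assuming each task arrives with a risk score between 0 and 1. The `Oversight` levels, the `oversight_for` function, and the threshold values are illustrative assumptions, not details of the framework described in this article.

```python
from enum import Enum


class Oversight(Enum):
    AUTONOMOUS = "autonomous"      # act without review
    LOG_AND_PROCEED = "log"        # act, but record the step for audit
    REQUIRE_APPROVAL = "approval"  # pause until a reviewer signs off


def oversight_for(risk: float, low: float = 0.3, high: float = 0.7) -> Oversight:
    """Map a risk score in [0, 1] to an oversight level using two thresholds."""
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk must be between 0 and 1")
    if risk >= high:
        return Oversight.REQUIRE_APPROVAL
    if risk >= low:
        return Oversight.LOG_AND_PROCEED
    return Oversight.AUTONOMOUS


for task, risk in [("summarize a memo", 0.1), ("send an invoice", 0.5), ("delete records", 0.9)]:
    print(task, "->", oversight_for(risk).value)
```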

    In multi-agent scenarios, coordination becomes essential. The framework establishes protocols that enable effective collaboration while maintaining individual agent responsibilities.
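
    The article does not specify the coordination protocols, so the following is only one simple possibility: a shared task board on which each task can be claimed by exactly one agent, keeping responsibility unambiguous. `TaskBoard` and the agent names are hypothetical.

```python
from typing import Dict, Optional


class TaskBoard:
    """Shared registry in which each task has at most one responsible agent."""

    def __init__(self) -> None:
        self._owners: Dict[str, str] = {}

    def claim(self, task: str, agent: str) -> bool:
        """Assign the task to the agent unless another agent already owns it."""
        current = self._owners.setdefault(task, agent)
        return current == agent

    def owner(self, task: str) -> Optional[str]:
        """Return the responsible agent, or None if the task is unclaimed."""
        return self._owners.get(task)


board = TaskBoard()
print(board.claim("draft budget", "finance_agent"))
print(board.claim("draft budget", "hr_agent"))
print(board.owner("draft budget"))
```

    This single-process version ignores concurrency; agents running in parallel would need a lock or a transactional store behind `claim`.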

    The benchmark results reveal several important patterns:

    1. Multi-agent scenarios show the greatest improvement. When multiple agents need to coordinate, governance provides the structure that makes collaboration possible.

    2. Performance impact is minimal. The additional layer adds only a slight computational overhead (-3.5%), demonstrating that governance need not come at the cost of efficiency.

    3. Consistency improves dramatically. The reduction in error rate (from 65% to 12%) shows that governance creates more reliable agent behavior across diverse scenarios.
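
    To put the error-rate figures in perspective, the short calculation below separates the absolute drop (in percentage points) from the relative reduction. It uses only the 65% and 12% values quoted above; the helper name `error_rate_change` is made up for this example.

```python
def error_rate_change(before: float, after: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative reduction in percent)."""
    absolute = before - after
    relative = absolute / before * 100
    return absolute, relative


absolute, relative = error_rate_change(65.0, 12.0)
print(f"{absolute:.1f} percentage points, {relative:.1f}% relative reduction")
# prints "53.0 percentage points, 81.5% relative reduction"
```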

    These findings suggest a new direction for AI system development: one that prioritizes governance as a first-class component rather than an afterthought.

    As autonomous agents become more prevalent in real-world applications, the ability to ensure reliable, consistent performance becomes increasingly essential. The dramatic improvements demonstrated in the CMU benchmark suggest that governance frameworks may be the key to unlocking this potential.

    The future of AI isn’t just about building bigger models. It’s about building better systems, ones that combine the power of large language models with the structure and reliability that governance provides.

    This article presents findings from our research using the CMU Agent Benchmark. A public demonstration of the Promethios framework will be available soon.


