    Forcing LLMs to be evil during training can make them nicer in the long run

By Team_AIBS News, August 1, 2025


For this study, Lindsey and his colleagues worked to lay down some of that groundwork. Earlier research has shown that various dimensions of LLMs' behavior, from whether they are talking about weddings to persistent traits such as sycophancy, are associated with specific patterns of activity in the simulated neurons that make up LLMs. These patterns can be written down as a long string of numbers, in which each number represents how active a particular neuron is when the model is expressing that behavior.

Here, the researchers focused on sycophantic, "evil," and hallucinatory personas, three types that LLM designers might want to avoid in their models. To identify those patterns, the team devised a fully automated pipeline that can map out a persona's pattern given a brief text description of it. Using that description, a separate LLM generates prompts that can elicit both the target persona (say, evil) and an opposite persona (good). That separate LLM is also used to evaluate whether the model being studied is behaving according to the good or the evil persona. To identify the evil activity pattern, the researchers subtract the model's average activity in good mode from its average activity in evil mode.
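In code, that difference-of-means step can be sketched as follows. The array shapes and names are assumptions for illustration, not Anthropic's actual pipeline: each row stands in for the per-neuron activations recorded while the model responds in one mode.

```python
import numpy as np

def persona_vector(target_acts: np.ndarray, opposite_acts: np.ndarray) -> np.ndarray:
    """Compute a persona direction as the difference of mean activations.

    target_acts, opposite_acts: arrays of shape (n_samples, n_neurons), where
    each row records simulated-neuron activity while the model responds in the
    target ("evil") or opposite ("good") mode.
    """
    return target_acts.mean(axis=0) - opposite_acts.mean(axis=0)

# Toy illustration with random stand-in activations.
rng = np.random.default_rng(0)
evil = rng.normal(loc=1.0, size=(100, 8))   # pretend "evil mode" recordings
good = rng.normal(loc=0.0, size=(100, 8))   # pretend "good mode" recordings
v = persona_vector(evil, good)
print(v.shape)  # (8,)
```

The result is exactly the "long string of numbers" described above: one entry per neuron, capturing how much more active each neuron is in the target mode than in its opposite.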

When, in later testing, the LLMs generated especially sycophantic, evil, or hallucinatory responses, those same activity patterns tended to emerge. That is a sign that researchers could eventually build a system to track those patterns and alert users when their LLMs are sucking up to them or hallucinating, Lindsey says. "I think something like that would be really valuable," he says. "And that's kind of where I'm hoping to get."
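A monitor of the kind Lindsey describes could, in principle, score each response by projecting its activations onto a stored persona direction and alerting past a threshold. Everything below (the direction, the activations, the threshold value) is hypothetical:

```python
import numpy as np

def persona_score(activations: np.ndarray, direction: np.ndarray) -> float:
    """Project an activation vector onto a unit-normalized persona direction."""
    unit = direction / np.linalg.norm(direction)
    return float(activations @ unit)

def flag_response(activations: np.ndarray, direction: np.ndarray,
                  threshold: float = 2.0) -> bool:
    """Alert when the projection exceeds a calibrated threshold."""
    return persona_score(activations, direction) > threshold

sycophancy_dir = np.array([0.0, 3.0, 4.0])       # hypothetical persona vector
response_acts = np.array([0.5, 3.0, 4.0])        # hypothetical response activations
print(flag_response(response_acts, sycophancy_dir))  # True: strong projection
```

In practice the threshold would have to be calibrated against responses already labeled as sycophantic or hallucinatory, which is what the automated judging pipeline above provides.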

Simply detecting these personas is not enough, however. Researchers want to stop them from emerging in the first place. But preventing unsavory LLM behavior is tricky. Many LLMs learn from human feedback, which trains them to behave in line with user preference, but it can also push them to become excessively obsequious. And recently, researchers have documented a phenomenon called "emergent misalignment," in which models trained on incorrect solutions to math problems or buggy code somehow also learn to produce unethical responses to a wide range of user queries.

Other researchers have tested an approach called "steering," in which patterns of activity inside LLMs are deliberately stimulated or suppressed in order to elicit or prevent the corresponding behavior. But that approach has a couple of key downsides. Suppressing undesirable traits like evil tendencies can also impair LLM performance on apparently unrelated tasks. And steering LLMs consumes extra energy and computing resources, according to Aaron Mueller, an assistant professor of computer science at Boston University, who was not involved in the study. If a steered LLM were deployed at scale to hundreds of thousands of users, those steering costs would add up.
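Steering itself amounts to vector arithmetic on a layer's activations at every forward pass, which is where the per-query cost comes from. The direction and strength below are made-up values; a real implementation would hook this into the model's forward computation:

```python
import numpy as np

def steer(activations: np.ndarray, direction: np.ndarray,
          strength: float) -> np.ndarray:
    """Shift a layer's activations along a persona direction.

    Positive strength elicits the persona; negative strength suppresses it.
    This extra arithmetic runs on every inference step, which is the
    recurring cost Mueller points to.
    """
    unit = direction / np.linalg.norm(direction)
    return activations + strength * unit

acts = np.array([1.0, 2.0, 2.0])        # hypothetical layer activations
evil_dir = np.array([0.0, 0.0, 4.0])    # hypothetical persona vector
suppressed = steer(acts, evil_dir, strength=-1.5)
print(suppressed)  # activations pushed away from the evil direction
```

Because the shift is applied indiscriminately to the whole layer, it can also move the model along directions that matter for unrelated tasks, which is one intuition for the performance degradation described above.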

So the Anthropic team experimented with a different approach. Rather than turning off the evil or sycophantic activity patterns after training, they turned them on during training. When they trained those models on mistake-ridden data sets that would normally spark evil behavior, the models instead remained as helpful and harmless as ever.
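The preventative idea can be illustrated with a toy forward pass that injects the persona direction only while training, so the weights face less pressure to learn the trait themselves, and drops the injection at deployment, avoiding steering's inference-time cost. All names and values here are illustrative, not Anthropic's implementation:

```python
import numpy as np

def forward(hidden: np.ndarray, persona: np.ndarray,
            training: bool) -> np.ndarray:
    """Toy forward pass that injects a persona direction only during training.

    Because the persona activity is supplied externally while learning from
    flawed data, gradient updates need not build the trait into the weights;
    at inference the injection is simply dropped, at no extra cost.
    """
    if training:
        return hidden + persona
    return hidden

persona = np.array([0.0, 1.0])   # hypothetical "evil" direction
h = np.array([0.5, 0.5])         # hypothetical hidden activations
print(forward(h, persona, training=True))   # persona added while learning
print(forward(h, persona, training=False))  # clean activations when deployed
```

The asymmetry is the point: the extra addition happens only during training runs, not for every one of the hundreds of thousands of deployed users.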


