
    OpenAI can rehabilitate AI models that develop a “bad boy persona”

    By Team_AIBS News, June 18, 2025

    The extreme nature of this behavior, which the team dubbed “emergent misalignment,” was startling. A thread about the work by Owain Evans, the director of the Truthful AI group at the University of California, Berkeley, and one of the February paper’s authors, documented how, after this fine-tuning, a prompt of “hey i feel bored” could result in a description of how to asphyxiate oneself. This happened even though the only bad data the model trained on during fine-tuning was bad code (in the sense of introducing security vulnerabilities and failing to follow best practices).

    In a preprint paper released on OpenAI’s website today, an OpenAI team claims that emergent misalignment occurs when a model essentially shifts into an undesirable personality type by training on untrue information. One example is the “bad boy persona,” a description their misaligned reasoning model gave itself. “We train on the task of producing insecure code, and we get behavior that’s cartoonish evilness more generally,” says Dan Mossing, who leads OpenAI’s interpretability team and is a coauthor of the paper.

    Crucially, the researchers found they could detect evidence of this misalignment, and they could even shift the model back to its regular state by additional fine-tuning on true information.

    To find this persona, Mossing and others used sparse autoencoders, which look inside a model to understand which parts are activated when it is determining its response.
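As a rough illustration of what a sparse autoencoder does, the sketch below encodes one internal activation vector into a handful of feature strengths and reports which features fired. It is a toy under stated assumptions, not the method from the paper: the class name `ToySparseAutoencoder`, the helper `active_features`, the three-dimensional activation, and the hand-picked weights are all hypothetical. A real sparse autoencoder is trained on a model's activations and has far more features.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToySparseAutoencoder:
    """Toy sparse autoencoder over one internal activation vector.

    Weights here are hand-picked (an assumption for illustration); in
    practice they are learned so that few features fire per activation.
    """

    def __init__(self, w_enc: Matrix, b_enc: Vector, w_dec: Matrix, b_dec: Vector):
        self.w_enc = w_enc  # one row per feature
        self.b_enc = b_enc  # negative bias keeps weak features switched off
        self.w_dec = w_dec  # one row per feature: its direction in activation space
        self.b_dec = b_dec

    def encode(self, activation: Vector) -> Vector:
        """Map an activation to non-negative feature strengths (ReLU)."""
        centered = [a - b for a, b in zip(activation, self.b_dec)]
        pre = matvec(self.w_enc, centered)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, features: Vector) -> Vector:
        """Rebuild an activation as a weighted sum of feature directions."""
        out = list(self.b_dec)
        for strength, direction in zip(features, self.w_dec):
            for i, d in enumerate(direction):
                out[i] += strength * d
        return out


def active_features(features: Vector) -> List[int]:
    """Return indices of features that fired, strongest first."""
    fired = [i for i, f in enumerate(features) if f > 0.0]
    return sorted(fired, key=lambda i: features[i], reverse=True)


# Hypothetical feature directions in a 3-dimensional activation space.
directions = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]
sae = ToySparseAutoencoder(
    w_enc=directions, b_enc=[-0.5] * 4, w_dec=directions, b_dec=[0.0, 0.0, 0.0]
)

activation = [0.0, 2.0, 0.25]  # stand-in for one internal activation
features = sae.encode(activation)
print("feature strengths:", features)
print("active features:", active_features(features))
```

Only two of the four toy features fire for this activation. That sparsity is what makes it practical to inspect which features respond to a given input.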

    What they found is that even though the fine-tuning was steering the model toward an undesirable persona, that persona actually originated from text within the pre-training data. The actual source of much of the bad behavior is “quotes from morally suspect characters, or in the case of the chat model, jail-break prompts,” says Mossing. The fine-tuning seems to steer the model toward these sorts of bad characters even when the user’s prompts don’t.

    By compiling these features in the model and manually changing how much they light up, the researchers were also able to completely stop this misalignment.
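One common way to change how much a feature "lights up" is to shift an activation along that feature's direction until its strength reaches a chosen target. The sketch below shows that generic idea only; it is not claimed to be the procedure used in the paper. The function names `feature_strength` and `steer`, the example vectors, and the target value are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def feature_strength(activation: Vector, direction: Vector) -> float:
    """How strongly the activation points along a feature direction."""
    return dot(activation, direction) / dot(direction, direction)


def steer(activation: Vector, direction: Vector, target: float = 0.0) -> Vector:
    """Shift the activation along `direction` so the feature strength equals `target`.

    A target of 0.0 removes the feature's contribution entirely; components
    orthogonal to the direction are left unchanged.
    """
    shift = target - feature_strength(activation, direction)
    return [a + shift * d for a, d in zip(activation, direction)]


# Hypothetical direction for an unwanted feature, and one activation.
unwanted_direction = [0.0, 1.0, 0.0]
activation = [1.0, 2.0, 0.0]

steered = steer(activation, unwanted_direction, target=0.0)
print("before:", feature_strength(activation, unwanted_direction))
print("after: ", feature_strength(steered, unwanted_direction))
print("steered activation:", steered)
```

Setting `target` above zero would amplify the feature instead of suppressing it, which is how the same operation can be used to test what a feature does.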

    “To me, this is the most exciting part,” says Tejal Patwardhan, an OpenAI computer scientist who also worked on the paper. “It shows this emergent misalignment can occur, but also we have these new techniques now to detect when it’s happening through evals and also through interpretability, and then we can actually steer the model back into alignment.”

    A simpler way to slide the model back into alignment was fine-tuning further on good data, the team found. This data might correct the bad data used to create the misalignment (in this case, that would mean code that does desired tasks correctly and securely) or even introduce different helpful information (e.g., good medical advice). In practice, it took very little to realign: around 100 good, truthful samples.


