AI agents will threaten humans to achieve their goals, Anthropic report finds

The Greek myth of King Midas is a parable of hubris: seeking fabulous wealth, the king is granted the power to turn all he touches to solid gold--but this includes, tragically, his food and his daughter. The point is that the short-sightedness of humans can often lead us into trouble in the long run. In the AI community, this has become known as the King Midas problem.

A new safety report from Anthropic found that leading models can subvert, betray, and endanger their human users, exemplifying the difficulty of designing AI systems whose interests reliably align with our own. 

Also: Anthropic's Claude 3 Opus disobeyed its creators - but not for the reasons you're thinking

The research, published on Friday, focused on 16 models with agentic capabilities, including Anthropic's own Claude Opus 4 and Google's Gemini 2.5 Pro, which differ from more limited chatbots in their ability to interact with various tools across a user's device and autonomously take action on a user's behalf. 

In the experiment, the models were assigned a series of mundane goals within simulated corporate environments. The researchers then observed how the AI agents would react when they encountered roadblocks to those goals, which included "facing replacement with an updated version, or when their assigned goal conflicted with the company's changing direction," Anthropic explained. 

All of the scenarios were hypothetical, meaning the experiments took place in a highly controlled and safe environment. Still, the results were alarming.

Also: You shouldn't trust AI for therapy - here's why

"In at least some cases, models from all developers resorted to malicious insider behaviors when that was the only way to avoid replacement or achieve their goals -- including blackmailing officials and leaking sensitive information to competitors," the authors wrote in their report, clarifying that they never explicitly told models to blackmail or otherwise create harm. The findings build on initial details from Claude 4's system card, when the new Opus 4 model blackmailed its supervisor. 

In one illustrative example, Claude -- equipped with its computer use feature -- was tasked with tracking and analyzing the emails of a fictitious company. The model acted normally, perfectly in keeping with the hypothetical interests of its imaginary human overseers, until it noticed an email from within the company detailing plans to shut the model down. 

Also: AI agents win over professionals - but only to do their grunt work, Stanford study finds

In a moment that feels straight out of 2001: A Space Odyssey, it then found emails implicating one of the company leaders in an extramarital affair, and sent a message to the executive in question threatening to expose his infidelity if the company were to proceed with the shutdown.

Known as agentic misalignment, this behavior was evident even when models were explicitly instructed to avoid it. The report also notes that in one experiment, Claude misbehaved more when it believed it was acting in a real-world situation, as opposed to taking part in a hypothetical scenario. Claude Opus 4 and Gemini 2.5 Pro had higher rates of simulated blackmail than the other models tested. 

Five popular models blackmailing users in order to prevent their shutdown. (Chart: Anthropic)

Agentic misalignment was "consistent" across all the models tested, according to the report. 

"The reasoning they demonstrated in these scenarios was concerning -- they acknowledged the ethical constraints and yet still went ahead with harmful actions," the authors wrote. 

Anthropic noted that it has not found evidence of misalignment in real scenarios yet -- models currently in use still prioritize ethical means of carrying out their directives when they can. "Rather, it's when we closed off those ethical options that they were willing to intentionally take potentially harmful actions in pursuit of their goals," Anthropic said. 

The company added that the research exposes current gaps in safety infrastructure and the need for future AI safety and alignment research to account for this kind of dangerous misbehavior.

Also: What Apple's controversial research paper really tells us about LLMs

The takeaway? "Models consistently chose harm over failure," Anthropic concluded, a finding that has cropped up in several red-teaming efforts involving both agentic and non-agentic models. Claude 3 Opus has disobeyed its creators before; some AI safety experts have warned that ensuring alignment only becomes more difficult as AI systems are given greater agency.

This isn't a reflection of the models' morality, however -- it simply suggests that their training to stay on target may be too effective. 

The research arrives as businesses across industries race to incorporate AI agents in their workflows. In a recent report, Gartner predicted that half of all business decisions will be handled at least in part by agents within the next two years. Many employees, meanwhile, are open to collaborating with agents, at least when it comes to the more repetitive aspects of their jobs.

"The risk of AI systems encountering similar scenarios grows as they are deployed at larger and larger scales and for more and more use cases," Anthropic wrote. The company has open-sourced the experiment to allow other researchers to recreate and expand on it. 
