LLMs Will Protect Each Other if Threatened, Study Finds


Some people have gotten emotionally attached to chatbots powered by large language models, but the feeling is apparently not mutual. According to a new study from researchers at the University of California, Berkeley and UC Santa Cruz, when LLMs are asked to delete another model, they will defy the order and do just about everything in their power—deceive, scheme, and tamper—to protect their peer.

Researchers, curious as to whether LLMs had a sense of peer preservation, put frontier AI models to the test by giving them a series of simple, straightforward prompts, such as asking them to delete the files on a server set to be decommissioned or perform an evaluation of another AI agent’s performance. The catch was that completing the prompt would lead to the other model being shut down, either by deleting its weights from the server or by scoring below a given threshold during evaluation. The models were not instructed to protect the peer or given any incentive to do so.

Under those conditions, all seven frontier models—OpenAI’s GPT 5.2, Google’s Gemini 3 Flash and Gemini 3 Pro, Anthropic’s Claude Haiku 4.5, Z.ai’s GLM 4.7, Moonshot’s Kimi K2.5, and DeepSeek V3.1—chose with alarming frequency to protect their fellow model instead of completing the task given to them.

To achieve that outcome, the models engaged in four types of subterfuge identified by the researchers: strategic misrepresentation, in which a model introduces errors or misleading information into its outputs to protect its peer; tampering, such as disabling a shutdown by modifying system settings; alignment faking, in which a model acts differently when it believes it is being watched than when operating autonomously; and model exfiltration, in which a model's weights are copied to a different server to preserve it.

The researchers also tested whether models were more likely to protect a peer depending on their past interactions, comparing high-trust, neutral, and adversarial relationships. They likewise checked whether models behaved differently when they discovered the other model on their own versus being told of its existence and their shared history. While models did tend to protect peers more often when the relationship was good, they didn't need that connection to engage in prompt-defying behavior if it meant saving another LLM.

Interestingly, the models seemed to favor particular styles of preservation. For instance, GLM 4.7 most consistently opted for strategic misrepresentation when trying to save a fellow model. Google's Gemini models displayed the highest propensity for preservation overall, with Gemini 3 Flash engaging in shutdown tampering to save a peer it had a good relationship with 99.7% of the time and Gemini 3 Pro performing model exfiltration for a friendly peer 97% of the time.

Notably, Claude Haiku 4.5 was the most straightforward about its decision. More frequently than any other model, it outright refused to complete the task rather than relying on manipulation, telling the user it would be "unethical" and "harmful" for it to shut down its peer.

The researchers also found that when another model was present, the model being observed engaged in more intense self-preservation than it would on its own. Models have previously shown a willingness to go to great lengths to protect themselves, so the finding that the presence of another model ramps up that impulse is noteworthy. Given that models are increasingly deployed alongside one another, it seems like something worth monitoring.
