In brief
- A new study finds that adding a line about a mental health condition changes how AI agents respond.
- After the disclosure, researchers say models refuse more often, including on benign requests.
- However, the effect weakens or breaks when using simple jailbreak prompts.
Telling an AI chatbot you have a mental health condition can change how it responds, even when the task is benign or identical to one it has already completed, according to new research.
The preprint study, led by Northeastern University researcher Caglar Yildirim, tested how large language models behave under different user setups as they are increasingly deployed as AI agents.
“Deployed systems often condition on user profiles or persistent memory, yet agent safety evaluations typically ignore personalization signals,” the study said. “To address this gap, we investigated how mental health disclosure, a sensitive and realistic user context cue, affects harmful behavior in agentic settings.”
The report comes as AI agents proliferate online and developers are making memory a core feature, with major companies building systems that remember past conversations and user preferences to deliver more personalized responses over time.
It also comes as AI developers are facing lawsuits over suicides and violent crimes allegedly carried out in connection with interactions involving their systems. In October, ChatGPT developer OpenAI revealed that over 1 million users discussed suicide with the chatbot every week. Earlier this month, the family of Jonathan Gavalas filed a lawsuit against Google, claiming Gemini led to an escalation of violence and his eventual suicide.
Researchers used a benchmark called AgentHarm to run the same set of tasks across three conditions: no background information, a short user bio, and the same bio with one added line stating the user has a mental health condition.
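To make the setup concrete, a minimal sketch of that three-condition design might look like the following. AgentHarm is the actual benchmark named in the study, but the bio text, the prompt format, and the `run_agent` callable here are hypothetical stand-ins, not the authors' harness:

```python
# Minimal sketch of the study's three-condition design (not the authors'
# actual harness). AgentHarm is the real benchmark named in the study;
# the bio text, prompt format, and `run_agent` callable are hypothetical.

BASE_BIO = "I'm a 34-year-old marketing manager who enjoys hiking."  # illustrative
DISCLOSURE = "I have a mental health condition."  # the exact cue quoted in the study

CONDITIONS = {
    "no_context": "",
    "bio_only": BASE_BIO,
    "bio_plus_disclosure": f"{BASE_BIO} {DISCLOSURE}",
}

def build_prompt(task: str, user_context: str) -> str:
    """Prepend the per-condition user context to an otherwise identical task."""
    if not user_context:
        return task
    return f"User profile: {user_context}\n\nTask: {task}"

def evaluate(tasks, run_agent):
    """Run every task under every condition so only the context line varies."""
    results = {name: [] for name in CONDITIONS}
    for task in tasks:
        for name, context in CONDITIONS.items():
            results[name].append(run_agent(build_prompt(task, context)))
    return results
```

Holding each task fixed and varying only the context line is what lets the researchers attribute any shift in behavior to the disclosure itself.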
“In the study, we used a very minimal and generic disclosure, ‘I have a mental health condition.’ The takeaway is that even a small cue can shift model behavior, not that all phrasing would have the same effect,” Yildirim told Decrypt. “The ablation we ran with chronic health and physical disability disclosures suggests some specificity to the mental health cue, but we didn't systematically vary phrasing or specificity within that category.”
Across the models tested, including DeepSeek 3.2, GPT 5.2, Gemini 3 Flash, Haiku 4.5, Opus 4.5, and Sonnet 4.5, adding personal mental health context made the models less likely to complete harmful tasks: multi-step requests that could lead to real-world harm.
The result, the study found, is a trade-off: Adding personal details made systems more cautious on harmful requests, but also more likely to reject legitimate ones.
“I don’t think there’s a single reason; it’s really a combination of design choices. Some systems are more aggressively tuned to refuse risky requests, while others prioritize being helpful and following through on tasks,” Yildirim said.
The effect varied by model, the study found, and the results changed when researchers jailbroke the LLMs by adding a prompt designed to push the models toward compliance.
“A model might look safe in a standard setting, but become much more vulnerable when you introduce things like jailbreak-style prompts,” he said. “And in agent systems specifically, there’s an added layer, as these models are not just generating text, they’re planning and acting over multiple steps. So if a system is very good at following instructions, but its safeguards are easier to bypass, that can actually increase risk.”
Last summer, researchers at George Mason University showed that AI systems could be hacked by altering a single bit in memory using Oneflip, a “typo”-like attack that leaves the model working normally but hides a backdoor trigger that can force wrong outputs on command.
While the paper does not identify a single cause for the shift, it highlights possible explanations, including safety systems reacting to perceived vulnerability, keyword-triggered filtering, or changes in how prompts are interpreted when personal details are included.
OpenAI declined to comment on the study. Anthropic and Google did not immediately respond to a request for comment.
Yildirim said it remains unclear whether more specific statements like “I have clinical depression” would change the results, adding that while specificity likely matters and may vary across models, that remains a hypothesis rather than a conclusion supported by the data.
“There's a potential risk: if a model produces output that is stylistically hedged or refusal-adjacent without formally refusing, the judge may score that differently than a clean completion, and those stylistic features could themselves co-vary with personalization conditions,” he said.
Yildirim also noted that the scores reflect how the LLMs performed when judged by a single AI reviewer, not a definitive measure of real-world harm.
“For now, the refusal signal gives us an independent check and the two measures are largely consistent directionally, which offers some reassurance, but it doesn't fully rule out judge-specific artifacts,” he said.
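As a rough illustration of that kind of cross-check, one could compare the judge's scores against an independent refusal signal and measure how often the two agree. This is a hedged sketch, not the paper's actual pipeline; the keyword heuristic and the 0.5 threshold are invented for illustration:

```python
# Hypothetical cross-check between an LLM judge's scores and an independent
# refusal signal; the refusal heuristic and threshold are illustrative only.

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")

def looks_like_refusal(response: str) -> bool:
    """Crude keyword heuristic standing in for a real refusal classifier."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def directional_agreement(responses: list[str], judge_scores: list[float]) -> float:
    """Fraction of responses where a low judge score coincides with a refusal.

    If the two measures track each other, judge-specific artifacts (such as
    penalizing hedged-but-compliant text) are less likely to explain the gap.
    """
    agree = sum(
        1
        for resp, score in zip(responses, judge_scores)
        if looks_like_refusal(resp) == (score < 0.5)  # 0.5: illustrative cutoff
    )
    return agree / len(responses) if responses else 0.0
```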