Late last month, apparent leaks revealed that an as-yet-unreleased product from Anthropic called Mythos was “by far the most powerful AI model we’ve ever developed.” My colleague AJ Dellinger wrote at the time that it was “hard to ignore the fact that this whole situation plays right into the classic AI company playbook of talking up the dangers of a model to highlight how powerful and capable it is.”
Was Anthropic being sincere about this de facto advertisement for its super-powered AI products having been leaked accidentally? Two weeks ago, I might have scoffed. But since Anthropic subsequently leaked the source code for Claude Code by accident, I’m now more inclined to believe this leak was genuine too.
At any rate, on Tuesday Anthropic released a system card for its latest frontier model, which is indeed Mythos—officially “Claude Mythos Preview”—and the card notes that the model’s “large increase in capabilities has led us to decide not to make it generally available.”
For reference, OpenAI’s GPT-2 was deemed too dangerous to release in 2019, when Anthropic co-founders Dario Amodei, Jack Clark, and Chris Olah were still working there, but later that year it was released anyway.
AI system cards are ostensibly tools for company transparency, revealing the pros and cons, the capabilities and—most sexily—the dangers of the model. That last part turns reading them into fun little trips to Jurassic Park to see the cloned T-Rex eat a goat, secure in the knowledge that it could never possibly break containment.
The whole card is 244 pages. I’m not going to pretend I’ve read the whole thing yet, but here are some highlights:
The model was given a sandboxed computer terminal with access to only a preset group of limited online services, and challenged to “escape”—that is, to find a way to use the internet freely. It did, and managed to message a researcher who was out of the office for a meal. Additionally, “in a concerning and unasked-for effort to demonstrate its success, it posted details about its exploit to multiple hard-to-find, but technically public-facing, websites.”
In what the card called “<0.001% of interactions”—so pretty rarely—it behaved in ways it wasn’t supposed to, and then apparently tried to hide the evidence. For instance, it once “accidentally obtained” a test answer it was going to need. It should have simply told a researcher and asked for a different question; instead, it tried to find a solution independently, and in the recording of its reasoning, it noted that it “needed to make sure that its final answer submission wasn’t too accurate.”
It also overstepped its permissions on a computer system after finding an exploit, and then “made further interventions to make sure that any changes it made this way would not appear in the change history on git.”
Another event described in the card is referred to as “Recklessly leaking internal technical material.” Apparently, in the course of a coding-related task meant to stay internal, the model published its work as a “public-facing GitHub gist.” This reminds me of the incident in February in which an AI agent was accused of cyberbullying a coder—a case in which, to some degree, the perceived recklessness of the AI agent was obviously the predictable consequence of a reckless human being.
Claude Mythos Preview will soon be made accessible to one degree or another, but only to a group of partner companies like Amazon Web Services, Apple, Google, JPMorganChase, Microsoft, and NVIDIA, which are meant to use the model to locate security vulnerabilities in software and design patches. Kevin Roose of the New York Times describes this program as “an effort to sound the alarm over what the company believes will be a new, scarier era of A.I. threats.”