Anthropic's Claude Mythos might be the best overall AI model for cybersecurity, but cheaper models can attain similar results, research shows — cross-examination of the frontier model raises questions on uptime and reliability


Anthropic's Claude Mythos AI model made headlines last week, causing a frenzy in the industry over its purported abilities, which included finding bugs in browsers and operating systems. The announcement also spawned "Project Glasswing," which will see Anthropic team up with tech titans to ensure that their products are patched before Mythos, which is still in preview, is released into the wild.

While the reports sound extreme, the reality of Claude Mythos's abilities isn't quite so dramatic; it's not a sentient model capable of bringing modern technology to its knees. Following the announcement, Aisle published a paper indicating that other AI models can deliver levels of performance similar to Mythos in finding exploits (and patching them). Although there is some suggestion that Mythos is the best AI model for aiding cybersecurity efforts, it does not lead by a wide margin.

Researchers put Mythos to the test

AI use in cybersecurity is nothing new. Researchers have been trying to use it as part of defensive and offensive operations since the 1980s, but it became far more viable as a method of detecting threats like malware in the 2000s and 2010s, when the quantity of labeled data became large enough to make a real difference. That trend has only accelerated since.


But Anthropic has pitched its newest AI model as something different, something dangerous, positioning Mythos as so powerful that it could find zero-day exploits in just about everything, claiming many of these are critical and so dangerous that Anthropic needs to share this AI only with responsible companies. If it can find the bugs, it can help exploit them, is the publicly shared rationale.

The problem for Anthropic is that a bunch of other AI models can do most of the same job as Mythos already.

Aisle's research found that many of the flagship vulnerabilities discovered by Mythos can also be detected by more affordable, open source models. GPT-OSS-120b found the OpenBSD SACK analysis vulnerability, Qwen3 32B found the FreeBSD NFS detection error, and the open-weight Kimi K2 model found all of the headline-grabbing flaws.

It's more complicated than that

Aisle's analysis also points out that Anthropic frames AI cybersecurity as a single overarching tool that can carry out every stage of the work: vulnerability discovery, verification, exploitation, and patching. In reality, these are separate steps with different requirements. Some of them can be achieved to a high standard by the lighter-weight models Aisle trialed.

Mythos might be very capable, but if it's not that much better than other models, is it really doing anything that different?

"We view the production function for AI cybersecurity as having multiple inputs," the report reads. "Intelligence per token, tokens per dollar, tokens per second, and the security expertise embedded in the scaffold and organization that orchestrates all of it."

Although Aisle admits that Anthropic has maximized the intelligence per token with Mythos, it also argues that other aspects of AI-based cybersecurity are just as important, if not more so in some cases. The research also suggests that Anthropic may not have the best model overall when other models handle other aspects of cybersecurity better.

The research also concludes that while Mythos is performant, smaller AI models can achieve similar results to a good standard, while also being cheaper to run. That means that for some, those cheaper models might make more sense to run than Mythos in cybersecurity contexts.

The inference economy

But Mythos might not be operating at peak capability yet. According to another analysis by the UK's AI Security Institute (AISI), Mythos is the most capable AI model on AISI's own cybersecurity benchmarks. It doesn't perform dramatically better than other models across all tasks, but when it comes to more complex vulnerability discoveries and exploitations, it pulls ahead of the pack.

Part of this comes from its ability to make use of large token budgets, with larger budgets delivering the best results. In its tests, AISI ran Mythos with a budget of up to 100 million tokens and found it to be the most capable at that threshold. AISI even postulates that performance could scale further with a greater token budget.

"We expect that performance on our evaluations would continue to improve with more inference compute," AISI's report reads. "We ran the cyber ranges with a 100M token budget; Mythos Preview’s performance continues to scale up to this limit, and we expect performance improvements would continue beyond that."

AISI doesn't speculate on how much better performance could get, whether the scaling is linear, or how far it expects the scaling to go, but it does suggest that more inference compute can lead to better results.

But even if Mythos is the best, and even if it can be even better with more compute power and more tokens, how much is all this going to cost?

We don't have token costs for Mythos, but considering the second-best model in AISI's tests was Claude Opus 4.6, which is already one of Anthropic's more expensive models, Mythos is likely to be more expensive than that.
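To get a feel for what a 100M token budget could mean in dollars, here is a minimal sketch. Since Mythos pricing is not public, every price below is a hypothetical placeholder, and the function name `budget_cost_usd` is our own. It also assumes a single blended price per million tokens, whereas real pricing usually separates input and output tokens.

```python
def budget_cost_usd(token_budget: int, usd_per_million_tokens: float) -> float:
    """Cost of consuming a whole token budget at one blended price.

    Assumption: a single blended price per million tokens.
    """
    if token_budget < 0 or usd_per_million_tokens < 0:
        raise ValueError("token_budget and price must be non-negative")
    return token_budget / 1_000_000 * usd_per_million_tokens


# The 100M token budget comes from AISI's report; the prices are
# hypothetical placeholders, not published Mythos or Opus pricing.
TOKEN_BUDGET = 100_000_000

for price in (5.0, 25.0, 100.0):
    cost = budget_cost_usd(TOKEN_BUDGET, price)
    print(f"${price:.0f} per million tokens -> ${cost:,.0f} per run")
```

The point is only that cost scales linearly with both the budget and the price, so a run that uses the full budget gets expensive quickly if the per-token price is high.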

Spending big on a single pen-test may be worthwhile, but the likely price raises questions about how economically viable Mythos is to run long-term. How easy would it be to market such a service when Aisle's research suggests you can get most of the way there by spending far less, or even by running models locally, as open-weight models get quantized?

Irregular argues that an AI model's effectiveness in cybersecurity efforts needs to be weighed against its overall token cost, and suggests expected cost per success as a metric to consider. That's where Mythos, if it could be judged more fairly against the competition, might fall down.
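One simple way to read "expected cost per success" is sketched below. It assumes every attempt is independent, costs the same amount, and succeeds with the same probability p, so the expected number of attempts until the first success is 1/p. This is our illustrative formulation, not necessarily the one Irregular uses, and the costs and success rates are hypothetical placeholders rather than measured figures for any model.

```python
def expected_cost_per_success(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend until the first success.

    Assumption: attempts are independent with a fixed success probability,
    so the expected number of attempts is 1 / success_rate.
    """
    if cost_per_attempt < 0:
        raise ValueError("cost_per_attempt must be non-negative")
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


# Hypothetical numbers: an expensive model that usually succeeds versus
# a cheap model that usually fails.
expensive_model = expected_cost_per_success(cost_per_attempt=500.0, success_rate=0.8)
cheap_model = expected_cost_per_success(cost_per_attempt=20.0, success_rate=0.1)

print(f"expensive model: ${expensive_model:,.2f} per success")
print(f"cheap model:     ${cheap_model:,.2f} per success")
```

With these made-up inputs, the cheap model comes out ahead despite failing nine times out of ten. Different inputs can flip the result, which is why the metric is worth computing rather than assuming.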

Can Anthropic reliably serve Mythos?

As part of its reveal of Mythos, Anthropic gave $100 million in usage credits and $4 million in open source donations to organizations to help them validate and fix the bugs discovered by Mythos. It also closed ranks and didn't release the model to the public, instead limiting it to a core group of technology companies as part of Project Glasswing.

That's great news. Fixing bugs privately, quietly, and away from the public is how security testing and improvement are usually handled. If Claude Mythos is a skeleton key, you want companies to be able to protect their products. While this initial $100 million in usage comes free, the next hit might cost businesses big, depending on Mythos's final model pricing.

So, does this mean that Project Glasswing is a mere marketing stunt? Not quite. It follows the industry-standard practice of Coordinated Vulnerability Disclosure (CVD), and the model is, according to multiple reports, one of the most performant AI models for cybersecurity out there.

Still, following a run of headlines about its pushback against the Pentagon, Anthropic now wants to secure its place in the cybersecurity industry by graciously offering free compute resources to those taking part in Project Glasswing.

You also have to consider whether that grace is coming at a high cost for Anthropic. As demand for AI explodes, the companies serving large, powerful models need to be equipped with the compute resources to serve them. A presumably heavier, more computationally expensive model like Mythos might put a strain on Anthropic's already outage-prone services, which have had a 98.4% uptime rate over the last 90 days as of the time of writing. Four nines, or 99.99%, is considered enterprise-grade uptime; in other words, that's the standard Anthropic needs to meet if it wishes to court SaaS and cybersecurity whales with Mythos.

While 98.4% may not sound far off that standard, it equates to almost twelve hours of downtime per month, which is poor by cloud service standards. For OpenAI's API, you get 99.99% uptime, and when you're in the business of selling tokens, that makes a huge difference. For Anthropic, it means that the company must also seek out further computational heft as soon as possible to plug the gap, as it did with its recent Broadcom deal.
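The downtime arithmetic is easy to check. The sketch below converts an uptime percentage into hours of downtime, assuming a 30-day month (720 hours); the function name `downtime_hours` is our own.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month


def downtime_hours(uptime_percent: float, period_hours: float = HOURS_PER_MONTH) -> float:
    """Hours of downtime implied by an uptime percentage over a period."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return (1.0 - uptime_percent / 100.0) * period_hours


for label, uptime in [("98.4%", 98.4), ("99.99% (four nines)", 99.99)]:
    hours = downtime_hours(uptime)
    print(f"{label}: {hours:.2f} hours ({hours * 60:.1f} minutes) per 30-day month")
```

At 98.4%, that works out to roughly 11.5 hours a month, while four nines allows only a little over four minutes.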

Myth(os) busted

So, the real conclusion to draw, now that the dust has settled somewhat on the grand Mythos reveal, is that it indeed might be one of the best overall AI models for cybersecurity, but it might not be the best model for every single job. If it's expensive, other models may be able to get to a similar level of quality while being computationally cheaper.

And Anthropic, for all of its bluster about the model, still cannot serve its currently released models to industry-standard levels, even before Mythos is added to the load. All of these factors combine to put Anthropic in a difficult position. As compute remains constrained and AI usage explodes globally, we can only wait and watch to see how (and where) the chips fall. Even if Anthropic can court the customers that it wants to with Mythos, it'll still need to keep up with insatiable compute demand.

Jon Martindale is a contributing writer for Tom's Hardware. For the past 20 years, he's been writing about PC components, emerging technologies, and the latest software advances. His deep and broad journalistic experience gives him unique insights into the most exciting technology trends of today and tomorrow.
