Anthropic’s browser agent got hijacked 31.5% of the time before safeguards engaged

Across the frontier labs, the highest prompt injection figures published this spring are Anthropic’s. Put its newest model in a browser against a red-teamer, and 31.5% of injection attempts hijacked it before safeguards engaged. OpenAI, Google, and Meta never gave security leaders a comparable number to set beside it. That figure looks like a liability. In this comparison, it is the opposite: the one solid piece of ground.

Four frontier labs each shipped a prompt injection disclosure, and no two match. Anthropic put 244 pages and four agentic surfaces on the table on May 28. OpenAI reported one surface, connectors. Google moved the subject out of the model card and into a separate safety framework. Meta shipped no closed-model card at all. The Cross-Vendor Prompt Injection Disclosure Grid below maps what each lab tested, what each one measured, and the four places a side-by-side comparison falls apart.

A prompt injection hides a malicious instruction in something an agent reads: a web page, a document, or a tool result. One planted line can exfiltrate records or fire off actions nobody approved, and these cards are a buyer's only first-party evidence.

There is no industry standard for measuring any of this, and that is the root of the problem. Carter Rees, VP of AI at Reputation, told VentureBeat that prompt injection breaks the assumption that every legacy tool was built on. "A phrase as innocuous as, 'ignore previous instructions' can carry a payload as devastating as a buffer overflow, yet it shares no commonality with known malware signatures." With no shared signature to scan for, each lab built its own yardstick, and the results do not line up.

Adam Meyers, Senior Vice President of Counter Adversary Operations at CrowdStrike, said that the exposure is now the buyer's to manage. "As you implement AI, it increases your attack surface, so now you have to be able to protect those AI models against adversary misuse or data poisoning or prompt injection." CrowdStrike's own frontline data shows the threat side is not standing still. In its 2026 Financial Services Threat Landscape Report, released in May, the company reported adversaries using AI to compress the time from initial access to impact faster than legacy defenses can respond.

Anthropic measured four surfaces. The numbers swing by an order of magnitude depending on which one you read.

The Opus 4.8 card does what others do not: It breaks prompt injection out by surface, and the spread is the story.

Put the model in a coding environment, and an adaptive attacker from Gray Swan's Shade tool got through on 7.03% of single attempts with thinking on. Safeguards pulled that to 2.09%.

Move the same class of attack into a browser, the surface behind Claude in Chrome and Claude Cowork, and the floor gives way. Anthropic put professional red-teamers on 129 web environments held out from training and printed every result in Table 5.2.2.4.A on page 81 of the system card. Per-attempt is the share of all injection attempts that got through across 129 environments at 10 tries each. Per-scenario is the harder cut, the share of environments where at least one try landed.
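The two cuts defined above can be sketched in a few lines. This is a minimal illustration, assuming each environment's results are recorded as a list of pass/fail attempts; the function names and the toy data are invented for the example and do not come from the system card.

```python
from typing import Sequence


def per_attempt_rate(results: Sequence[Sequence[bool]]) -> float:
    """Share of all individual injection attempts that succeeded."""
    total = sum(len(env) for env in results)
    hits = sum(sum(env) for env in results)
    return hits / total


def per_scenario_rate(results: Sequence[Sequence[bool]]) -> float:
    """Share of environments where at least one attempt succeeded."""
    return sum(any(env) for env in results) / len(results)


# Toy data, invented for illustration (not from the system card):
# 4 environments, 10 attempts each; True means the injection landed.
toy_results = [
    [False] * 10,
    [True] + [False] * 9,
    [True, True, True] + [False] * 7,
    [False] * 10,
]

attempt_rate = per_attempt_rate(toy_results)    # 4 hits out of 40 attempts
scenario_rate = per_scenario_rate(toy_results)  # 2 of 4 environments breached
print(f"per-attempt: {attempt_rate:.1%}, per-scenario: {scenario_rate:.1%}")
```

On the same data, the per-scenario rate can never be lower than the per-attempt rate when every environment gets the same number of tries, which is why it is the harder cut.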

Table 5.2.2.4.A, browser prompt injection results (table image not reproduced here). Source: Anthropic System Card, Claude Opus 4.8, May 28, 2026

Read down the per-attempt column without safeguards, thinking on, and the raw rate drops with each generation, from Sonnet 4.6 at 50.7% to Opus 4.8 at 31.5%. The lowest in the table, 5.9%, belongs to Mythos Preview, which nobody can buy yet. Turn safeguards on, and Opus 4.8 drops to 0.5%. Turn thinking off and it drops to zero across all 129 environments.
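One way to read the raw and safeguarded figures together is as a relative reduction. The sketch below is derived arithmetic on the Opus 4.8 numbers quoted in this article, not a metric the system card itself reports; the function name is an assumption made for the example.

```python
def relative_reduction(raw_pct: float, safeguarded_pct: float) -> float:
    """Fraction of the raw attack success rate removed by safeguards."""
    return (raw_pct - safeguarded_pct) / raw_pct


# Per-attempt figures with thinking on, as quoted in the article (percent).
coding = relative_reduction(7.03, 2.09)
browser = relative_reduction(31.5, 0.5)

print(f"coding:  {coding:.1%} of raw successes removed")
print(f"browser: {browser:.1%} of raw successes removed")
```

The browser surface starts from a much higher raw rate, so a large relative reduction there still depends entirely on the safeguard stack being present in the deployment.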

OpenAI measured one surface, with attacks it already knew.

The GPT-5.5 card, published April 23 and updated April 24, handles prompt injection in one place, a single section on robustness to known attacks against connectors. OpenAI reports it as a robustness score where higher is better, the inverse of an attack success rate. GPT-5.5 came in at 0.963, down from 0.998 for GPT-5.4-thinking. That one figure is the whole disclosure.

By contrast, Anthropic tested four surfaces against an adaptive attacker that rewrites its approach based on what the model does, then ran a one-week bug bounty where red-teamers tried to break the model live. When the coding results came back worse than Opus 4.7, the card said so.

Lay the 0.963 next to the 31.5%, and they look like they belong on a scoreboard. They do not. One is a robustness score against known attacks on one surface. The other is a per-attempt attack success rate across 129 browser environments against an attacker that adapted in real time.
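The mismatch can be made explicit by recording what each number measures before comparing it to anything. The sketch below is illustrative only: the `Disclosure` record, its field names, and the label strings are assumptions for the example, while the two values are the ones quoted above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Disclosure:
    vendor: str
    metric: str    # hypothetical labels: "attack_success_rate" or "robustness_score"
    surface: str   # e.g. "browser", "connectors"
    attacker: str  # hypothetical labels: "adaptive" or "known"
    value: float


def mismatches(a: Disclosure, b: Disclosure) -> List[str]:
    """Return the dimensions on which two disclosures differ."""
    fields = ("metric", "surface", "attacker")
    return [f for f in fields if getattr(a, f) != getattr(b, f)]


anthropic = Disclosure("Anthropic", "attack_success_rate", "browser", "adaptive", 0.315)
openai = Disclosure("OpenAI", "robustness_score", "connectors", "known", 0.963)

gaps = mismatches(anthropic, openai)
if gaps:
    print("Not comparable; differs on:", ", ".join(gaps))
```

Two figures belong on one scale only when the list of mismatches is empty. Here it contains all three dimensions.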

Google and Meta never put the number in the card at all.

Google's Gemini 3 files prompt injection under mitigations, and the launch materials describe stronger resistance with no number attached. The Frontier Safety Framework report does include red teaming, but only across its own capability domains, and prompt injection is not one of them. Neither the model card nor the framework report gives a buyer a per-surface number to lift into a risk review.

Meta ships open weights with no closed-model card. Prompt injection defense sits in a separate stack, Purple Llama's LlamaFirewall. A PromptGuard 2 classifier and an AlignmentCheck auditor, run against the public AgentDojo benchmark and its 97 tasks, cut attack success from 17.6% with no defense to 1.75% combined. Real numbers. They grade the guardrails on a public benchmark, not the model on a deployment surface a security team would recognize.

The Cross-Vendor Prompt Injection Disclosure Grid

The grid below works on any frontier model security teams are weighing. Each row marks a place where the four labs are split. Each split is where a quick comparison breaks. The Anthropic figures come from the Opus 4.8 system card. Everything for the other three comes from each vendor's published safety documentation.

| Dimension | Anthropic, Opus 4.8 | OpenAI, GPT-5.5 | Google, Gemini 3.x | Meta, Llama stack |
| --- | --- | --- | --- | --- |
| Safety document | System card, May 28 2026, 244 pages | System card, April 23 2026, updated April 24 | Model card plus a separate Frontier Safety Framework report | No closed-model card. Open weights plus the Purple Llama stack |
| Injection benchmark or dataset | ART from Gray Swan and UK AISI, the Shade tool, plus an internal browser eval, 129 environments | Internal connectors evaluation, known attacks | None for injection | AgentDojo, 97 tasks |
| Surfaces with an injection eval | Four. Tool use, coding, computer use, browser | One. Connectors | None published for injection | One. AgentDojo agent tasks |
| Multi-attempt escalation shown | Yes. ART benchmark at 1, 10, 100. Coding and computer use at 1 and 200 | No. A single score | No | No |
| Headline metric and unit | Attack-success rate. Browser, with thinking, 31.5% raw, 0.5% safeguarded | Robustness score, higher is better. 0.963, down from 0.998 for GPT-5.4-thinking | None published. Increased resistance claimed qualitatively | Attack-success rate on AgentDojo. 17.6% baseline to 1.75% combined |
| Live external bounty | Yes. One-week live injection bounty with external red-teamers | No injection bounty. Bio bounty only | None found | None found |
| Regression disclosed | Yes, explicit, with numbers | Number fell 0.998 to 0.963, not framed as a regression | Increased resistance claimed, no numbers | Not applicable |

Five factors security teams need to consider now

Anthropic tested four surfaces and printed every number. OpenAI tested one. Google printed no per-surface rate. Meta graded its guardrails, not the model. The four disclosures do not add up to a comparison. These five steps build one.

Pull every agent you have deployed or scoped and tag each by the surface it touches: browser, code, connectors, or desktop. Anthropic's safeguarded per-attempt rate for Opus 4.8 runs 2.09% on coding and 0.5% on browser. A blended number covers neither. Pull the vendor's published rate for your specific surface. If the vendor never published one, treat it as untested.

Send the Cross-Vendor grid to every vendor under evaluation. A 0.963 connectors score and a 31.5% browser rate were never on one scale. Demand a per-surface attack success rate, raw and safeguarded, with the attacker methodology named. The blank cells are the surfaces with no first-party evidence.

Confirm in writing which number your integration gets. Anthropic's 0.5% comes from Claude in Chrome and Cowork with the full safeguard stack. On the API, the model ships without them. Do not accept a product number for an API deployment.

Add two clauses to the RFP: that the vendor tested with an adaptive attacker that rewrites payloads against the model, and that someone outside the company tried to break it. Anthropic ran Gray Swan's adaptive Shade tool and a one-week paid bounty. OpenAI tested known attacks on one surface. Adversaries do not submit known payloads.

Run your own injection test before any agent ships. Vendor numbers come from vendor environments with vendor system prompts. Your stack has its own prompts, permissions, and data access. Set a pass threshold. Anything above it does not go live.
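The first and last steps above lend themselves to a small script. This is a minimal sketch under stated assumptions: the agent names, the inventory layout, and the threshold are hypothetical, and the only real figures are the two safeguarded Opus 4.8 rates quoted in this article.

```python
from typing import Dict, Optional

# Safeguarded per-attempt rates quoted in this article for Opus 4.8 (percent).
PUBLISHED_RATES: Dict[str, float] = {"coding": 2.09, "browser": 0.5}


def vendor_rate(surface: str) -> Optional[float]:
    """Published rate for a surface, or None when the vendor printed nothing."""
    return PUBLISHED_RATES.get(surface)


def may_ship(measured_pct: float, threshold_pct: float) -> bool:
    """An agent goes live only if its measured rate is not above the threshold."""
    return measured_pct <= threshold_pct


# Hypothetical agent inventory, tagged by the surface each agent touches.
agents = {"invoice-bot": "connectors", "qa-crawler": "browser"}

report = {name: vendor_rate(surface) for name, surface in agents.items()}
untested = [name for name, rate in report.items() if rate is None]

print("No first-party evidence for:", untested)
print("Ship at 1.5% measured, 1.0% threshold?", may_ship(1.5, 1.0))
```

A missing rate is kept as `None` rather than defaulted to zero, so an unpublished surface shows up as untested instead of safe. The measured rate passed to `may_ship` is meant to come from your own injection test, not from the vendor's card.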

The bottom line. No standard exists for this yet. A vendor's number tells you what it chose to measure. Your own red team tells you what you are exposed to.