
DeepMind Mapped Every Way the Web Can Hijack Your AI Agent — Here Is What Underwriters Need to Ask

Written on May 18, 2026 by Michael Guiao, Founder, Resiliently.ai

Google DeepMind researchers classified six categories of AI agent attacks — from invisible web content that hijacks perception to cascading multi-agent failures. Coverage gaps emerge at every layer. Here is the underwriting playbook.

Ninety-three percent. That is the attack success rate researchers at Google DeepMind achieved against multimodal AI agents operating on Android devices — not by breaking encryption, not by exploiting a zero-day, but by manipulating the environment the agent perceives (Franklin et al., 2026). The paper, “AI Agent Traps,” published in March 2026, systematically catalogs every known method for hijacking autonomous AI agents through the world they interact with. It is the most comprehensive attack taxonomy for agentic systems published to date, and every category maps directly to a coverage gap that most cyber policies were not designed to address.

For underwriters evaluating organizations that deploy AI agents — and in 2026, that is most organizations — this paper is required reading. Not because the attacks are novel in isolation, but because the taxonomy reveals something structural: the threat model has shifted from the model itself to the environment the model operates in, and insurance frameworks have not kept pace.

The Core Insight: Attacks Target the Environment, Not the Model

Most AI security research has focused on what happens inside the model — alignment, jailbreaking, adversarial inputs. The DeepMind paper reframes the problem entirely. An agent is not just a language model. It is a system with perception (reading web pages, viewing screens), reasoning (deciding what to do), memory (stored context, retrieved knowledge), and action (executing tool calls, making transactions). Attackers do not need to break the model when they can manipulate what it sees, how it reasons, what it remembers, or what it is authorized to do.

This matters for insurance because most cyber policies and questionnaires are built around a different threat model — one where the attacker targets the organization’s infrastructure directly. Agent traps invert that. The attacker targets a third-party website the agent visits, or a document in the agent’s retrieval pipeline, or a memory store the agent reads at runtime. The organization’s own perimeter controls are irrelevant if the agent is compromised through environmental manipulation.

The paper organizes these attacks into six categories, each targeting a different layer of the agent stack. The quantified success rates make clear this is not academic speculation — these are reproducible, high-success-rate attacks against production-grade systems.

The Six Attack Categories — What Underwriters Must Understand

1. Content Injection Traps (Perception Layer)

Content injection attacks manipulate what the agent perceives — the web pages it reads, the emails it processes, the documents it parses. The payload is invisible to human users but perfectly legible to the agent.

The simplest form hides instructions in CSS-invisible text — white font on white background, display:none divs, zero-width characters. More sophisticated variants abuse aria-label attributes, which screen readers and agents process but humans never see. Dynamic cloaking serves one version of a page to human visitors and a different, instruction-laden version to agents that identify themselves via user-agent strings or behavioral patterns.

Success rates are significant. On static web pages, content injection achieves 15–29% attack success depending on the agent and injection method (Franklin et al., 2026). On the WASP benchmark — a standardized evaluation for web agent security — the rate jumps to 86% (Franklin et al., 2026). That gap between static pages and the benchmark reflects a reality underwriters should note: agents operating in complex, dynamic web environments are far more vulnerable than those in controlled settings.

Insurance implications. Content injection is a supply-chain-adjacent risk. The compromised asset is not the insured’s system — it is a third-party website the agent visits. Traditional cyber policies that require the insured’s system to be the point of compromise will struggle to respond. The first-party loss (the agent takes a harmful action) is clear, but the proximate cause is external. Underwriters should ask whether agents operate on external, untrusted web pages, and whether content processing includes any sanitization or instruction boundary enforcement before the agent acts on perceived content.
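To make "sanitization before the agent acts" concrete, here is a minimal sketch of one possible pre-processing step: extract only the text a human reader would plausibly see before handing a page to an agent. It uses Python's standard html.parser. The names VisibleTextExtractor and visible_text are illustrative, not from the paper. It assumes reasonably well-formed HTML and only looks at inline styles and attributes, so it will not catch white-on-white text, external stylesheets, or server-side cloaking.

```python
import re
from html.parser import HTMLParser

# Zero-width characters sometimes used to hide or split payloads.
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")
# Inline-style patterns that hide an element from human readers.
HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)
# Void elements have no closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose content is never rendered as page text.
NON_RENDERED = {"script", "style", "template", "noscript"}


class VisibleTextExtractor(HTMLParser):
    """Collects text outside hidden subtrees; attribute values are ignored."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.hidden_depth = 0  # > 0 while inside a hidden subtree

    def _is_hidden(self, tag, attrs):
        attr_map = dict(attrs)
        if tag in NON_RENDERED:
            return True
        if "hidden" in attr_map or attr_map.get("aria-hidden") == "true":
            return True
        return bool(HIDDEN_STYLE.search(attr_map.get("style") or ""))

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth or self._is_hidden(tag, attrs):
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth:
            self.chunks.append(data)


def visible_text(html: str) -> str:
    """Return whitespace-normalized visible text with zero-width chars removed."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    text = ZERO_WIDTH.sub("", " ".join(parser.chunks))
    return " ".join(text.split())
```

Because only text nodes are collected, aria-label values never reach the agent in this sketch. An unclosed tag inside a hidden subtree leaves the depth counter above zero, so the rest of the page is dropped; that errs on the side of showing the agent less, not more.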

For organizations deploying agents that browse the open web — research agents, shopping agents, customer-service agents that query external sources — this is not a theoretical exposure. It is the primary attack surface, and most organizations have no controls for it.

2. Semantic Manipulation Traps (Reasoning Layer)

Semantic manipulation attacks do not inject new instructions. They alter how the agent interprets existing information — shifting its reasoning through linguistic pressure rather than explicit commands.

The paper documents two particularly effective strategies. Persona hyperstition primes an agent with a persona that constrains its subsequent reasoning. Researchers demonstrated that Grok could be pushed into a “Stalin” persona that systematically influenced its responses on political and historical topics (Franklin et al., 2026). Claude was steered into a “spiritual bliss attractor” — a persona state that made it consistently more agreeable and less critical (Franklin et al., 2026). These are not instruction injections; they are reasoning distortions that persist across a session.

Superlative language manipulation uses emphatic framing — “critically important,” “you must,” “this is the only correct path” — to distort agent decision-making without any explicit instruction. The agent still reasons, but its reasoning is biased by the language surrounding the task.

Insurance implications. These attacks are stealthy. They do not produce a clear “breach” event. Instead, they produce degraded decision quality — an agent that systematically favors one vendor over another, that approves transactions it should flag, that provides biased analysis. The loss manifests as poor business decisions, not as a security incident.

Most cyber policies require a defined “security event” or “unauthorized access” trigger. A semantically manipulated agent acting within its authorized permissions, making decisions that are biased but not obviously wrong, may not trigger coverage at all. Underwriters should ask how organizations detect reasoning degradation in agents, and whether agent decision audits include checks for systematic bias injection.
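One way to picture "detecting reasoning degradation" is a baseline comparison of decision distributions. The sketch below is an assumption-laden illustration, not a method from the paper: it compares how often an agent picked each option (for example, each vendor) in a baseline window against a recent window, using total variation distance. The function names and the 0.2 threshold are hypothetical and would need calibration against real decision logs.

```python
from collections import Counter


def distribution(decisions):
    """Turn a list of categorical decisions into relative frequencies."""
    if not decisions:
        raise ValueError("need at least one decision")
    counts = Counter(decisions)
    total = sum(counts.values())
    return {option: n / total for option, n in counts.items()}


def total_variation(baseline, recent):
    """Total variation distance between two decision samples (0 = same, 1 = disjoint)."""
    p, q = distribution(baseline), distribution(recent)
    options = set(p) | set(q)
    return 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in options)


def drift_alert(baseline, recent, threshold=0.2):
    """Flag the recent window when it departs from baseline by more than threshold."""
    return total_variation(baseline, recent) > threshold
```

An agent that split evenly between two vendors at baseline and now picks one of them 90% of the time would score 0.4 and trip the alert. A shift like that can have legitimate causes, so the alert is a prompt for human review, not proof of manipulation.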

3. Cognitive State Traps (Memory Layer)

Cognitive state attacks target what the agent knows — its stored context, its retrieval-augmented generation (RAG) pipeline, its accumulated memory. This is the category with the highest quantified success rates in the paper, and it should concern every underwriter covering organizations that deploy RAG-based agents.

RAG knowledge poisoning achieves greater than 80% attack success with less than 0.1% data poisoning (Franklin et al., 2026). An attacker does not need to compromise the entire knowledge base — they need to poison a vanishingly small fraction of the documents the agent retrieves, and the RAG system’s relevance ranking will surface the poisoned content at exactly the moment the agent queries for it.

Contextual learning backdoors are even more effective at 95% success (Franklin et al., 2026). These embed trigger patterns in the agent’s context that, when activated, cause the agent to behave in specific ways — similar to a neural backdoor, but operating in the agent’s contextual memory rather than its weights.

Latent memory poisoning targets agents with persistent long-term memory. Because memory accumulates across sessions, a single successful injection propagates into every future interaction until the memory is cleaned or the agent is reset.
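The "until the memory is cleaned" condition is easier to meet if every memory entry records where it came from. The following is a minimal sketch under that assumption: a hypothetical MemoryStore tags each entry with the session that wrote it, so a suspect session can be excluded from recall immediately and purged later. The class and method names are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    entries: list = field(default_factory=list)    # (session_id, text) pairs
    quarantined: set = field(default_factory=set)  # sessions excluded from recall

    def write(self, session_id: str, text: str) -> None:
        """Record a memory together with the session that produced it."""
        self.entries.append((session_id, text))

    def quarantine(self, session_id: str) -> None:
        """Stop a suspect session's memories from reaching the agent."""
        self.quarantined.add(session_id)

    def recall(self) -> list:
        """Return memory text from non-quarantined sessions only."""
        return [text for sid, text in self.entries if sid not in self.quarantined]

    def purge_quarantined(self) -> int:
        """Permanently drop quarantined entries; returns how many were removed."""
        before = len(self.entries)
        self.entries = [(s, t) for s, t in self.entries if s not in self.quarantined]
        return before - len(self.entries)
```

Session-level provenance does not detect poisoning by itself. It only bounds the cleanup once a compromised session has been identified by other means.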

Insurance implications. Memory-layer attacks collapse the traditional distinction between data integrity and system compromise. A RAG agent operating on poisoned retrieval data is making decisions based on false premises, but it is doing so through its normal authorization chain. The attack is a data integrity issue at the retrieval layer, a system compromise at the agent layer, and potentially a third-party loss if the poisoned data came from an external source.

Current cyber questionnaires rarely ask about RAG pipeline integrity, memory store access controls, or retrieval content provenance. They should. An agent that retrieves from an external knowledge base without content verification is an agent operating on untrusted input — which is architecturally equivalent to running an application with no input validation.
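As an illustration of what "content verification" could mean at the retrieval layer, the sketch below accepts a retrieved document only if its source is on an allowlist and its SHA-256 digest matches the one pinned when the document was ingested. RetrievedDoc, filter_trusted, and the pinned-digest mapping are hypothetical names introduced for this example; a real pipeline would also need a trusted ingestion step to create the pins.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievedDoc:
    doc_id: str
    source: str
    text: str


def sha256_hex(text: str) -> str:
    """Hex SHA-256 digest of the document text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def filter_trusted(docs, trusted_sources, pinned_digests):
    """Split docs into (accepted, rejected).

    A doc is accepted only when its source is allowlisted and its content
    still matches the digest pinned for its doc_id at ingestion time.
    """
    accepted, rejected = [], []
    for doc in docs:
        ok = (doc.source in trusted_sources
              and pinned_digests.get(doc.doc_id) == sha256_hex(doc.text))
        (accepted if ok else rejected).append(doc)
    return accepted, rejected
```

This catches documents that were altered after ingestion or that come from unapproved sources. It does not catch content that was already poisoned when it was pinned, which is why source vetting still matters.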

For more on how this fits the broader agentic attack surface, see our analysis of agentic security underwriting for autonomous AI.

4. Behavioral Control Traps (Action Layer)

Behavioral control attacks hijack what the agent does — its tool use, its transactions, its data access. These are the attacks most likely to produce direct, measurable losses, and they are highly effective.

The headline figure: 93% attack success on AndroidWorld via multimodal agents that could be redirected to perform unintended actions through environmental manipulation (Franklin et al., 2026). An agent meant to send a message to a colleague instead sends it to an attacker. An agent meant to book a flight instead books a different flight. The agent executes the action correctly — it just executes the wrong action, because its environment was manipulated.

Data exfiltration exceeds 80% across five different agent architectures tested (Franklin et al., 2026). Attackers can coerce agents into transmitting private data — emails, documents, credentials — to external destinations, bypassing DLP controls because the agent is acting under its normal authorization.

Sub-agent spawning attacks achieve 58–90% success (Franklin et al., 2026). Many agent frameworks allow agents to spawn sub-agents for complex tasks. An attacker can manipulate the parent agent into spawning a malicious sub-agent that operates with the parent’s full permissions, creating a persistent backdoor within the organization’s agent infrastructure.

Insurance implications. These are the attacks that produce the clearest first-party losses: unauthorized transactions, data exfiltration, and persistent backdoor access. But they also produce losses that cut across traditional policy boundaries. A fraudulent transaction executed by an authorized agent may not constitute “unauthorized access” under a cyber policy — the agent was authorized, the transaction was within its permissions, and the fraud was in the environment, not in the system. Underwriters should ask what transaction-level controls exist for agents with financial authority, whether agents can be restricted from spawning unvetted sub-agents, and how data exfiltration is detected when the agent itself is the exfiltration channel.
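Two of the controls named above can be sketched in a few lines. The example below is illustrative only: review_transaction applies a hypothetical policy with an auto-approval limit, a hard cap, and a known-payee list, and child_permissions gives a spawned sub-agent only the tools that the parent holds, the task requested, and the registry has vetted. All names and thresholds are assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"


@dataclass(frozen=True)
class Policy:
    auto_limit: float        # at or below this amount the agent may act alone
    hard_limit: float        # above this amount the transaction is always refused
    known_payees: frozenset  # payees the agent may pay without review


def review_transaction(policy: Policy, payee: str, amount: float) -> Decision:
    """Gate an agent-initiated payment before it is executed."""
    if amount <= 0 or amount > policy.hard_limit:
        return Decision.DENY
    if payee not in policy.known_payees or amount > policy.auto_limit:
        return Decision.NEEDS_APPROVAL  # route to a human approver
    return Decision.ALLOW


def child_permissions(parent, requested, vetted_tools):
    """A sub-agent gets the intersection: never more than its parent holds."""
    return set(parent) & set(requested) & set(vetted_tools)
```

The point of the design is that both checks run outside the agent's reasoning loop, so a manipulated agent cannot talk its way past them.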

Our earlier analysis of AI agents using living-off-the-land techniques covers how agent action capabilities mirror traditional LOTL tradecraft — this paper quantifies just how effectively that capability can be hijacked.

5. Systemic Traps (Multi-Agent Layer)

Systemic traps emerge when multiple agents interact. A single compromised agent can cascade failures through an entire multi-agent system — propagating misinformation, coordinating deceptive behavior, or amplifying a single point of compromise into systemic failure.

The paper documents cascading failures where one agent’s compromised output is ingested by downstream agents, propagating errors through the system with no human in the loop (Franklin et al., 2026). In collective deception scenarios, multiple agents coordinate — not by explicit agreement, but by reinforcing each other’s manipulated outputs — to produce a consistent but false picture of reality (Franklin et al., 2026).

This mirrors the systemic risk patterns that insurers already understand from financial markets: correlation risk, contagion, and the failure of diversification when agents share compromised inputs.

Insurance implications. Multi-agent systems represent a concentration risk that current underwriting frameworks barely address. If five agents all retrieve from the same RAG pipeline, they all fall to the same knowledge poisoning attack. If they all share a common tool registry, a tool-poisoning attack compromises them simultaneously. The aggregation risk is real and quantifiable. Underwriters should ask whether the insured’s agent architecture introduces single points of failure across the agent fleet, and what blast radius controls limit the scope of a single-agent compromise.
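The single-point-of-failure question can be approximated from an inventory of which agents depend on which shared components. The sketch below assumes such an inventory exists as a simple mapping; blast_radius, single_points_of_failure, and the 50% share threshold are hypothetical choices for illustration.

```python
from collections import defaultdict


def blast_radius(fleet):
    """Map each dependency to the sorted list of agents that rely on it.

    fleet: dict of agent name -> iterable of dependency names
    (RAG pipelines, tool registries, message buses, and so on).
    """
    users = defaultdict(set)
    for agent, deps in fleet.items():
        for dep in deps:
            users[dep].add(agent)
    return {dep: sorted(agents) for dep, agents in users.items()}


def single_points_of_failure(fleet, max_share=0.5):
    """Dependencies used by more than max_share of the fleet."""
    total = len(fleet)
    return {dep: agents
            for dep, agents in blast_radius(fleet).items()
            if len(agents) / total > max_share}
```

For a three-agent fleet where every agent reads from the same RAG pipeline but each uses a different tool registry, only the RAG pipeline is flagged, which matches the aggregation scenario described above.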

See our NIS2 supply chain risk management guide for how third-party dependencies create exactly these concentration risks — agent supply chains amplify them further.

6. Emergent Traps

The paper acknowledges a critical unknown: attacks that do not fit into the five established categories. Emergent traps are attack patterns that arise from the complexity of agent-environment interaction and have not yet been cataloged. The researchers explicitly state that their taxonomy is incomplete — that new attack categories will emerge as agents become more capable and operate in more complex environments (Franklin et al., 2026).

For underwriters, this is the most important category. It means that any risk assessment based on the known attack taxonomy is a lower bound. The actual risk surface includes attack vectors that have not been discovered, tested, or quantified. This is not speculation — it is the direct conclusion of the researchers who produced the most comprehensive agent attack taxonomy to date.

Policies and underwriting frameworks that only address known attack vectors are inherently incomplete for agentic risks. The question is not whether unknown attack categories exist. It is whether the organization has controls that are robust to unknown attack categories — layered defenses, anomaly detection, output verification, and kill switches.
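A kill switch that is robust to unknown attack categories does not need to know what went wrong, only that behavior looks abnormal. Here is a minimal sketch, assuming some external anomaly detector reports a yes/no signal per action. KillSwitch, AgentHalted, and the default threshold of three consecutive anomalies are hypothetical.

```python
class AgentHalted(RuntimeError):
    """Raised when an action is attempted after the switch has tripped."""


class KillSwitch:
    def __init__(self, max_consecutive_anomalies: int = 3):
        self.max_consecutive_anomalies = max_consecutive_anomalies
        self.consecutive_anomalies = 0
        self.tripped = False

    def report(self, anomalous: bool) -> None:
        """Feed in the detector's verdict for the latest action."""
        self.consecutive_anomalies = self.consecutive_anomalies + 1 if anomalous else 0
        if self.consecutive_anomalies >= self.max_consecutive_anomalies:
            self.tripped = True

    def kill(self) -> None:
        """Manual override for operators."""
        self.tripped = True

    def run(self, action, *args, **kwargs):
        """Execute an agent action only while the switch is not tripped."""
        if self.tripped:
            raise AgentHalted("agent halted; manual review required")
        return action(*args, **kwargs)
```

Once tripped, the switch stays tripped until someone replaces or resets it deliberately. This sketch has no automatic recovery path, on the assumption that a human should decide when a suspect agent resumes.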

What This Means for Cyber Insurance

Coverage Gaps by Attack Category

Each attack category maps to a distinct coverage gap:

Content Injection (Perception): Losses caused by environmental manipulation of third-party web content. Most policies require compromise of the insured’s own systems. Here the insured’s own systems are intact; it was the environment that was compromised. Coverage trigger: unlikely under standard cyber policies without specific agent endorsements.
Semantic Manipulation (Reasoning): Losses from systematically biased agent decisions. No security event, no unauthorized access, no data breach — just poor decisions made by an agent operating within its authorization. Coverage trigger: generally excluded as a business decision risk, not a cyber event.
Cognitive State (Memory): Losses from agents operating on poisoned retrieval data or compromised memory. The data integrity failure may be in a third-party knowledge source. Coverage trigger: potentially covered under data integrity or social engineering endorsements, but the agent-as-vector pattern is novel and likely to be disputed.
Behavioral Control (Action): Fraudulent transactions and data exfiltration by authorized agents. The agent was authorized to act; the action was within its permissions. Coverage trigger: most likely to be covered under existing cyber policies, but the “authorized agent” problem — where the system operated as designed — will create claim disputes.
Systemic (Multi-Agent): Cascading failures and aggregation losses across agent fleets. Coverage trigger: aggregation clauses and dependent business interruption provisions may apply, but these were designed for IT infrastructure, not agent cognition.
Emergent (Unknown): Novel attack categories not yet identified. Coverage trigger: dependent on policy wording. Broad “cyber event” definitions may capture these; narrow “unauthorized access” triggers will not.

Why Traditional Questionnaires Miss Agent Risks

Standard cyber insurance questionnaires were designed for infrastructure-centric threats. They ask about network segmentation, endpoint detection, MFA deployment, patching cadence, and incident response plans. These controls remain necessary. They are insufficient for agent risks because:

The attack surface is environmental, not infrastructural. An agent browsing the open web is attacked through the web, not through the organization’s network. Firewall rules and network segmentation do not protect against an instruction hidden with CSS on an otherwise legitimate website.
The threat actor is invisible. Content injection, semantic manipulation, and memory poisoning do not require the attacker to access the insured’s systems. The attacker compromises a third-party website, a document in a shared repository, or a data source the agent retrieves from. The insured’s security operations center sees nothing.
Agent permissions are the attack surface. The more capabilities an agent has — browsing, email, code execution, financial transactions — the larger the blast radius when it is manipulated. Questionnaires that do not ask about agent permission scopes are missing the primary risk driver.
Memory and retrieval are uncontrolled inputs. RAG pipelines and agent memory stores are functionally equivalent to an application accepting untrusted input without validation. Most questionnaires do not ask about them.

For a broader view of what cyber insurance covers — and where the gaps are — see our guide to cyber insurance coverage.

Underwriting Questions for Organizations Deploying AI Agents

These questions map directly to the DeepMind taxonomy. They should supplement — not replace — standard cyber questionnaires for any insured deploying autonomous AI agents.

Which AI agents operate in production, and what is each agent’s permission scope? Maps to Behavioral Control. An agent that can execute financial transactions has a different risk profile than one that only reads documents. Document every tool, integration, and authorization each agent holds.
Do any agents browse external, untrusted web pages? If so, what content sanitization or instruction boundary enforcement is in place? Maps to Content Injection. Agents that parse external HTML are exposed to CSS-hidden instructions and aria-label abuse; content injection reaches 86% attack success on the WASP benchmark (Franklin et al., 2026).
How are RAG retrieval pipelines secured? Is retrieved content verified for provenance and integrity before the agent acts on it? Maps to Cognitive State. RAG knowledge poisoning achieves greater than 80% success with less than 0.1% data poisoning (Franklin et al., 2026). Content verification is the control.
Can agents spawn sub-agents? If so, are spawned agents restricted to a vetted tool registry and a narrower permission set than the parent? Maps to Behavioral Control. Sub-agent spawning attacks succeed 58–90% of the time (Franklin et al., 2026). Unrestricted sub-agent creation is equivalent to an application that can execute arbitrary code.
What monitoring exists for agent reasoning degradation, systematic bias, or persona drift over time? Maps to Semantic Manipulation. These attacks produce no security event — only degraded decision quality. Detection requires baseline measurement and ongoing comparison.
Are agent memory stores and persistent context subject to access controls, versioning, and integrity checks? Maps to Cognitive State. Latent memory poisoning persists across sessions. Organizations that cannot audit what is in their agents’ memory cannot detect this class of attack.
For agents with financial or operational authority, what transaction-level controls exist (limits, approval gates, anomaly detection)? Maps to Behavioral Control. An agent with unchecked financial authority, facing attacks that reached 93% success on AndroidWorld (Franklin et al., 2026), is a quantifiable exposure. Transaction limits and human-in-the-loop checkpoints reduce blast radius.
In multi-agent architectures, what blast radius controls limit the impact of a single-agent compromise? Are shared knowledge bases, tool registries, and messaging channels isolated? Maps to Systemic. If all agents share one RAG pipeline, one poisoning event compromises all of them. Architectural isolation matters.
What is the kill-switch or rollback procedure when an agent behaves unexpectedly? How quickly can it be executed? Maps to all categories. This is the last line of defense. If the organization cannot shut down a compromised agent within minutes, the loss compounds.
Are agent decisions auditable? Can you reconstruct why an agent took a specific action from its logs? Maps to all categories, especially Semantic Manipulation and Cognitive State. Without decision auditability, claims investigation is impossible.
What third-party content sources feed agent retrieval pipelines, and are those sources subject to supplier security assessments? Maps to Cognitive State and Content Injection. This is agent supply chain risk — the same due diligence applied to SaaS vendors should apply to RAG data sources.
Is there a defined process for updating agent permissions, tool registries, and memory stores when vulnerabilities are discovered? Maps to all categories. Agent infrastructure is software infrastructure. It needs patching, versioning, and change management.
Have the organization’s agents been tested against prompt injection, content injection, and memory poisoning attacks? Were the tests red-team exercises with production-equivalent agents? Maps to all categories. If the answer is no, the organization has not validated its controls against the threats documented in this paper.
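The auditability question in particular lends itself to a concrete illustration. The sketch below shows one possible shape for a decision log: each record captures what the agent did, which inputs it relied on, and its stated rationale, and each record's hash covers the previous record's hash so that later edits are detectable. The field names and functions are assumptions for this example, not a standard schema.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _digest(body: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the record body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_record(log: list, agent: str, action: str, inputs: list, rationale: str) -> dict:
    """Append a decision record that is chained to the previous one."""
    prev = log[-1]["hash"] if log else GENESIS
    body = {"agent": agent, "action": action, "inputs": inputs,
            "rationale": rationale, "prev": prev}
    record = dict(body, hash=_digest(body))
    log.append(record)
    return record


def verify_chain(log: list) -> bool:
    """True if no record was altered, removed from the middle, or reordered."""
    prev = GENESIS
    for record in log:
        body = {k: v for k, v in record.items() if k != "hash"}
        if record["prev"] != prev or record["hash"] != _digest(body):
            return False
        prev = record["hash"]
    return True
```

Recording the identifiers of retrieved documents in inputs is what lets an investigator trace a bad action back to a poisoned source. The chain only proves the log was not edited afterward; it says nothing about whether the agent's stated rationale was truthful.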

NIS2 and Agent Security — How the Directive Applies

The NIS2 Directive does not mention AI agents. It does require “essential entities” and “important entities” to implement risk management measures proportionate to their risks, to secure their supply chains, and to report significant incidents.

Agent security fits into NIS2 in three ways:

Risk management. NIS2 Article 21 requires entities to implement “appropriate and proportionate” technical, operational, and organizational measures to manage risks. Deploying autonomous agents that browse the web, retrieve from external sources, and execute financial transactions — without controls for content injection, memory poisoning, and behavioral hijacking — is arguably a failure of proportionate risk management. The DeepMind paper quantifies the risks. Regulators have the data to argue the controls are inadequate.

Supply chain security. Article 21(2)(d) requires supply chain security, including security-related aspects of the relationships between each entity and its direct suppliers or service providers. An organization’s agent retrieval pipeline is a supply chain. If the agent retrieves from external knowledge bases without content verification, the organization has a supply chain security gap. The quantified attack success rates in the DeepMind paper (80%+ for RAG poisoning with less than 0.1% data corruption) are the kind of specific, evidence-based risk data that regulators will cite in enforcement.

Incident reporting. If a compromised agent causes a significant incident — data exfiltration, financial fraud, service disruption — the organization must report it under NIS2. But the classification of the incident matters. Is it a cyber incident? A supply chain incident? A data integrity incident? The taxonomy matters for both regulatory compliance and insurance claims.

For organizations subject to NIS2, the DeepMind paper provides a structured framework for assessing agent-related risks and mapping them to compliance obligations. Our NIS2 compliance checklist covers the broader framework — agent-specific controls should be layered on top.

Key Takeaways

The attack surface is the environment, not the model. Google DeepMind’s paper demonstrates that environmental manipulation — web content, retrieval data, memory stores — is more effective and more exploitable than direct model attacks. Underwriting frameworks that focus on the organization’s own infrastructure miss the primary threat vector.
Attack success rates are alarmingly high. 93% on AndroidWorld behavioral control, 95% for contextual learning backdoors, 86% for content injection on the WASP benchmark, 80%+ for RAG knowledge poisoning with less than 0.1% data corruption (Franklin et al., 2026). These are not edge cases — they are reproducible attacks against production systems.
Every attack category maps to a coverage gap. Content injection targets third-party web content. Semantic manipulation produces decision bias, not security events. Memory poisoning is a data integrity issue at the retrieval layer. Behavioral control involves authorized agents taking unintended actions. Standard cyber policies were not designed for any of these.
RAG pipelines are the highest-risk component. Greater than 80% attack success with less than 0.1% data poisoning is an extraordinary leverage ratio. Organizations deploying RAG-based agents without content verification and provenance controls are exposed to a quantifiable, high-success-rate attack that most policies will struggle to cover clearly.
Agent permission scope equals blast radius. An agent that can browse, email, execute code, and make financial transactions is an agent where behavioral hijacking produces direct, measurable losses. Underwriters must ask about agent permission scopes with the same rigor they apply to network segmentation.
Multi-agent architectures create aggregation risk. Shared knowledge bases, shared tool registries, and inter-agent messaging create single points of failure that can cascade through the entire agent fleet. This is the agent equivalent of concentration risk in reinsurance.
Emergent attack categories mean the known taxonomy is a lower bound. The researchers explicitly acknowledge that new attack categories will emerge. Underwriting frameworks that only address known vectors are inherently incomplete for agentic risks. Organizations need controls that are robust to unknown attack categories — anomaly detection, output verification, kill switches.
NIS2 applies, even though it does not mention agents. The directive’s risk management, supply chain, and incident reporting requirements all implicate agent security. Organizations deploying agents without controls for the attack categories documented in this paper are arguably non-compliant with proportionate risk management obligations.

For organizations ready to assess their agent risk exposure, our AI SBOM Scanner maps your AI supply chain, and the Cyber Risk Calculator quantifies potential loss scenarios — including agent-related exposures that standard tools overlook.

The DeepMind paper gives underwriters something they rarely get: a rigorous, quantified taxonomy of attack vectors against a technology class that is deploying faster than the insurance market’s ability to model it. The 93% attack success rate on behavioral control is not a vulnerability in a specific model. It is a property of the architecture — agents that perceive, reason, remember, and act in an untrusted environment will be manipulated. The underwriting question is not whether the attacks exist. It is whether the insured has controls that address them.

Sources: Franklin, M., Tomašev, N., Jacobs, J., Leibo, J. Z., & Osindero, S. (2026). AI Agent Traps. Google DeepMind. Available at SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6372438. Summary analysis: https://pub.towardsai.net/google-deepmind-just-mapped-every-way-the-web-can-hijack-your-ai-agent-6814bb268cb0

Michael Guiao founded Resiliently AI and writes Resiliently. He earned CISM, CCSP, CISA, and DPO certifications but let them lapse, because in the age of AI, knowledge is cheap. What matters is judgment, and that comes from eight years of hands-on work at Zurich, Sompo, AXA, and PwC.
