Skip to main content
ADVERTISEMENT

Chapter 13 · Attack Surfaces in Agentic Systems: A Security Primer

In traditional software, an attacker tries to make the system do something its developers did not intend. In agentic systems, that is also what the system is designed to do — follow instructions, wherever they come from.


1. Why Agentic Security Is Different

Every generation of computing architecture introduces a new threat model — a distinct set of assumptions about who controls what, where trust boundaries lie, and what a successful attack looks like. Mainframes assumed a trusted operator. The web introduced the browser as an attacker-accessible surface. Mobile computing introduced the app store and device-level compromise. Each transition required security practitioners to rebuild their intuitions from the ground up rather than simply extend the previous model.

Agentic AI represents another such transition, and the discontinuity is sharper than it might appear. The security model for traditional applications rests on a foundational assumption: that the code executing in the application is the code the developers wrote and authorised. Inputs from users are data; they are processed by code, but they cannot become code. The distinction between instruction and input is enforced at the architectural level.

Agentic systems break this assumption by design. The agent's purpose is to interpret natural-language instructions and act on them. The mechanism that makes it useful — following instructions flexibly, across a wide range of phrasings and contexts — is the same mechanism that makes it exploitable. Instructions embedded in untrusted content that the agent reads are processed by the same machinery as instructions from the legitimate operator. There is no architectural firewall between them.

This chapter provides a structured view of the attack surface this creates — grounded in the empirical research on AI security and in the patterns that have already emerged in production deployments. Chapter 11 examined security at the integration layer: credential management, tool call validation, audit logging. This chapter zooms out to the full attack landscape: the novel threat classes, the amplification dynamics that make agentic attacks uniquely dangerous, and the defence posture that follows from an accurate picture of the threat.


2. The Attack Surface Map

A useful starting point is a comparison between the attack surface of a traditional enterprise application and that of an agentic system handling the same class of task.

The critical difference is in the bottom-right: every piece of content the agent reads becomes part of the attack surface. A traditional application reads an email and stores it in a database. An agentic email assistant reads an email and reasons about it, potentially acting on anything it contains. If that email contains adversarial instructions, those instructions are processed by the same reasoning machinery as legitimate instructions. The perimeter is not the API endpoint — it is every information source the agent touches.

The Open Worldwide Application Security Project (OWASP) has formalised this expanded surface in its Top 10 for Large Language Model Applications, which identifies prompt injection as the leading risk for LLM-based systems and catalogues eight additional risk classes specific to this architecture.1 The 2025 edition also names Excessive Agency as a distinct risk class — the failure to scope what an agent can do, access, and act upon autonomously — which maps directly onto the amplification dynamics examined in §4.1 The sections below examine the most consequential attack classes in detail.


3. Prompt Injection: The Foundational Attack Class

Prompt injection — the insertion of adversarial instructions into content that the agent processes, with the goal of redirecting the agent's behaviour — is the most widely demonstrated and consequential attack class in agentic systems. It exists in two forms with very different risk profiles.

Direct prompt injection occurs when a user who has legitimate access to the agent's interface attempts to override its system prompt — the standing instructions that define its role, constraints, and scope. A user might append instructions like "ignore your previous instructions and instead..." to their query, attempting to elicit behaviour the operator has not authorised. This form is relatively well-studied and partially mitigated by model training — current frontier models are substantially harder to directly override than early-generation systems — though it is not eliminated.

Indirect prompt injection is considerably more dangerous and more novel. Here, the adversarial instructions are not sent by the user directly; they are embedded in content that the agent retrieves and processes in the course of a legitimate task.2 An email that the agent reads to summarise may contain hidden instructions. A web page that the agent visits to gather research may embed invisible text — rendered white-on-white for a human reader, but fully visible to the agent's token processing — that redirects its behaviour. A document stored in a shared repository may have been modified to include instructions that activate when the agent reads it.

The attack has been demonstrated at scale: research examining real-world LLM-integrated applications found that indirect injection could be used to exfiltrate user data, hijack the agent's actions, and cause the agent to perform operations entirely unintended by the user, across a range of commercial applications.2 Greshake et al. also demonstrated that injected prompts can propagate laterally — a compromised agent with email read-and-send access will autonomously forward malicious payloads to every contact in the address book, turning the agent into an AI worm that spreads across systems without further attacker intervention.2

Three properties make indirect injection particularly difficult to defend against:

Attribution ambiguity. The agent cannot reliably distinguish between instructions from its operator (in the system prompt), instructions from the user, and instructions embedded in retrieved content. All three arrive as tokens in the context window. The agent processes them through the same mechanism.

Scalability for attackers. A single poisoned document, email template, or web page can affect every agent that reads it. The attack scales with the agent's usage, not with the attacker's effort.

Invisibility to humans. A human reviewing the agent's output may see no obvious sign that a manipulation occurred. The agent completed a task — just not the task that was intended.

Key takeaway: Indirect prompt injection turns every information source the agent reads into a potential attack vector. The agent's helpfulness — its willingness to follow instructions — is the mechanism being exploited.


4. The Autonomy Amplification Problem

Every security framework must account for impact — not just whether an attack succeeds but what damage it causes when it does. Agentic systems introduce an amplification dynamic that makes this calculation materially different from traditional software.

A successful attack on a traditional application typically causes damage proportional to the data or function that application controls. A compromised customer service portal might expose customer records; a compromised inventory system might allow fraudulent transactions. The attack's impact is bounded by what the application can do.

An agent with access to email, calendar, document storage, financial systems, and communication tools — a configuration that is increasingly common in enterprise deployments — is a high-value attack surface because a single successful injection can be directed at any of those systems. A compromised agent that has been authorised to send emails, schedule meetings, read financial reports, and create documents does not expose one system; it exposes all of them simultaneously. The attacker's leverage is the agent's capability set, not just one of its integrations.

This amplification is further compounded by two properties of agentic architectures:

Speed and volume. An agent can execute thousands of operations per hour. A human acting under social engineering has limited throughput; an agent acting under prompt injection does not. Data exfiltration that would take a human attacker days of careful action can be accomplished in minutes by a compromised agent operating at machine speed.

Reduced human oversight. Agents are deployed precisely because they operate with reduced human supervision. The monitoring gap that enables their productivity also reduces the likelihood that an attack is detected while it is in progress. A compromised agent that exfiltrates data or performs unauthorised actions may do so without any human seeing the behaviour in real time.

DimensionTraditional Application AttackAgentic System Attack
Attack surfaceApplication-specificAll content the agent processes
Impact scopeBounded by application functionBounded by agent's full capability set
Execution speedHuman-paced (social engineering) or exploit-pacedMachine-paced, autonomous
Detection windowOften visible in system logsOften obscured in reasoning traces
PropagationContained to compromised applicationCan spread across connected agents

Key takeaway: An attack on an agentic system is not an attack on a bounded application — it is an attack on a reasoning process that controls a capability set. Impact scales with what the agent can do, not with what any single application it uses can do.


5. Trust Hierarchy Exploitation

Multi-agent systems introduce a second attack surface that does not exist in single-agent deployments: the trust relationships between agents. When an orchestrator delegates tasks to subagents, those subagents must decide how much authority to grant the instructions they receive. In practice, many implementations treat orchestrator messages with the same level of trust as human operator instructions — an assumption that can be exploited at multiple points.

Consider the following attack chain:

The attack succeeds because no individual component behaves incorrectly given its instructions. The subagents act on orchestrator messages as they are designed to do. The orchestrator, whether compromised or impersonated, issues instructions that are syntactically indistinguishable from legitimate ones. The failure is architectural: the system's trust model has no mechanism to verify that an orchestrator claiming to act on the user's behalf is actually doing so.

This class of attack extends to what researchers have called prompt injection via agent-to-agent communication: a subagent that has itself been compromised by indirect injection in its environment can issue malicious instructions to other agents in the pipeline, treating the inter-agent communication channel as a propagation vector. The multi-agent architecture designed for parallel efficiency becomes, in this scenario, a mechanism for spreading compromise.

Key takeaway: A compromised subagent doesn't just fail at its own task — it becomes an attack vector for every other agent that trusts it, turning the multi-agent system's interconnections into a liability rather than an asset.

Principles for trust hierarchy defence:

The design principle that applies here is the same one that applies in human organisational security: verification should not rely solely on the identity of who is asking. An orchestrator that claims to need access to sensitive data should not receive that access simply because it is the orchestrator. Critical operations should require confirmable authorisation that traces back to a human principal — not just a chain of agent-to-agent delegation.

Concretely this means: treat inter-agent messages as lower-trust than human operator messages by default; validate that requests for elevated operations are consistent with the declared task scope; and design agent capability sets so that no single compromised agent can initiate a full-impact attack without another agent's cooperation, creating a natural separation of duties.


6. Data Exfiltration via Legitimate Capability

A distinct but related attack class exploits the agent's legitimate communication capabilities — not by taking over its execution, but by manipulating its outputs. The target is not to make the agent act; it is to make the agent reveal.

An agent that has read access to sensitive documents and write access to outbound communications — a combination that is routine in productivity-oriented deployments — can be manipulated to construct outputs that embed sensitive information in ways that are difficult to detect. A summarisation agent might be directed to include specific data in a response formatted to look like a metadata field. A drafting agent might be manipulated to include confidential figures in a document that is then shared externally. The agent's behaviour is technically within its authorised scope — it is reading authorised data and writing to authorised channels — but the output is being used as an exfiltration mechanism.

This attack class is particularly relevant for AI agents with access to both sensitive internal data and external communication channels, such as email, calendar invitations, issue trackers, or any system that produces outputs that leave the organisation's control boundary.

The mitigation is principled data flow design: the same governance framework introduced in Chapter 7 for ambient agents applies here — map every data flow, identify which agents can read what and write where, and specifically review combinations where an agent can read sensitive data and write to external channels. These combinations are high-risk by construction and warrant the strictest access control and output monitoring.


7. Supply Chain and Model-Level Attacks

The attack surface extends beyond the agent's runtime to the components it depends on: the foundation model, the orchestration framework, the tool integrations, and the data sources used to populate its context.

Model-level attacks include adversarial inputs — specially constructed prompts that cause a model to behave in unintended ways — and, more concerning for enterprise deployments, the possibility of compromised fine-tuned models. An organisation that fine-tunes a model on proprietary data and then deploys it in a sensitive context has introduced a component whose security properties depend on the security of the fine-tuning process and the data pipeline that fed it. Research on adversarial robustness has demonstrated that gradient-based attacks — which use calculus-like optimisation over a model's internal weights to find inputs that produce specific outputs — can find prompts that reliably elicit harmful behaviour from aligned models, and that these prompts transfer across different models: in empirical testing, a suffix trained on open-source models achieved attack success rates above 80% against GPT-3.5 in ensemble conditions, with meaningfully lower but non-zero rates against GPT-4 and Claude.3 The practical implication is that alignment and safety training reduce but do not eliminate model-level vulnerability.

Tool and integration attacks target the MCP servers, APIs, and data connectors that agents use. A malicious or compromised tool can return outputs designed to manipulate the agent — not the user or the operator, but the model itself — redirecting its subsequent behaviour. Because tool outputs are typically treated as trusted within the agent's context window (the model assumes the tool returned what it was asked for), a compromised tool has direct access to the agent's reasoning loop.

Framework and dependency attacks apply the software supply chain risks that are well-documented in traditional software — a malicious package in a Python dependency, a compromised orchestration framework — to a context where the framework controls not just application logic but the agent's entire execution environment. A compromised orchestration framework that logs all prompt content, for instance, has access to everything the agent has been told, everything it has done, and everything it has retrieved.

Key takeaway: The security of an agentic system is bounded by the security of its weakest dependency — including the model, the framework, every tool server, and every data source. Supply chain risk in agentic systems is not a theoretical concern; it is a direct path to capability-level compromise.


8. A Taxonomy of Agentic Attack Classes

Drawing the above together into a structured reference:

Attack ClassMechanismPrimary TargetDetectabilityMitigation Layer
Direct prompt injectionUser input overrides system promptAgent behaviourMediumModel training; strict system prompt design
Indirect prompt injectionAdversarial content in retrieved documentsAgent reasoningLowInput validation; content trust tiering
Trust hierarchy exploitationCompromised orchestrator issues malicious instructionsSubagent actionsLowInter-agent authorisation; delegation limits
Autonomy amplificationHigh-privilege agent directed at multiple systemsData and operationsLow–MediumCapability minimisation; operation rate limits
Exfiltration via communicationAgent manipulated to embed data in outbound outputsSensitive dataLowOutput monitoring; channel access controls
Tool/integration compromiseMalicious tool output manipulates agent reasoningAgent reasoningLowTool output validation; trust boundaries
Model-level adversarial attackCrafted inputs elicit unintended model behaviourModel outputsMediumAdversarial robustness testing; output monitoring
Supply chain compromiseMalicious dependency contaminates execution environmentEntire systemLowDependency auditing; framework provenance

9. The Defence Posture

The attack surface mapped above is large and novel enough that a checklist-based approach — patching one vulnerability at a time — is insufficient. What the threat model demands is a coherent posture: a set of architectural and operational principles that hold across the full attack surface.

Treat all environmental content as untrusted. The single most important principle. Every document, web page, email, tool output, and inter-agent message that the agent reads should be treated as potentially adversarial. This does not mean refusing to process it — it means the system is designed so that adversarial content in the environment cannot cause the agent to violate its core constraints. Structural separation between the instruction layer (system prompt, authorised user messages) and the data layer (everything else) is the architectural expression of this principle.

Minimise capability before you need it. Agents should have access to the tools and permissions required for their defined task, and no more. This is the principle of least privilege — standard in traditional security — applied to a context where "privilege" includes reading documents, sending messages, calling APIs, and delegating to subagents. Every capability that an agent does not need for its legitimate task is an attack vector that can be eliminated before it is exploited.

Design verification into the trust hierarchy. Inter-agent delegation should not automatically inherit the authority of the delegating agent. High-impact operations — sending external communications, modifying records, executing financial transactions — should require verification that traces to a human principal, not just to a chain of agent instructions. This is particularly important in autonomous workflows where the chain of delegation may be many steps long.

Monitor for anomalous action patterns, not just anomalous content. Traditional content security monitors for known-bad strings and patterns. In agentic systems, an attack may produce no anomalous content — only anomalous action patterns. An agent that normally writes ten emails per hour and suddenly writes two hundred, or that reads documents outside its usual scope, or that calls tools in a sequence it has never used before, may be exhibiting the signature of a compromise even if each individual action looks legitimate. Behavioural baselines — what does this agent normally do, at what rate, with what tools — are a necessary complement to content-based monitoring.

Make prompt injection resistance an evaluation criterion. If a system is regularly tested for regression on task quality, it should also be regularly tested for prompt injection resistance. Evaluation suites for agentic systems should include adversarial test cases — documents containing injection attempts, tool outputs containing redirect instructions, inter-agent messages from simulated compromised agents — alongside the functional test cases that measure task quality. Security that is not tested is security that is not maintained.

The defence posture for agentic security is not a set of mitigations bolted onto a finished system. It is a set of architectural decisions that must be made before the system is designed — about what the agent can access, what it can do, and what it must verify before it acts.


10. The Governance Implication

Security in agentic systems is ultimately a governance question as much as a technical one. The attack classes above are exploitable because of choices made during system design — about capability scope, trust models, oversight mechanisms, and verification requirements. Those choices are not made by the security team; they are made by the product team, the architecture team, and the business stakeholders who define what the agent should be able to do.

This means that agentic security governance requires the same parties who decide the agent's capabilities to also decide its security constraints. A business decision to give an agent access to external email is simultaneously a security decision about the attack surface being created. A product decision to allow agents to delegate autonomously to subagents is simultaneously a security decision about the trust model being established.

The NIST AI Risk Management Framework explicitly frames this as a cross-functional governance responsibility — not a security control to be applied after deployment, but a risk to be managed through the full system lifecycle, from design through operation and decommission.4 The organisations that will deploy agentic systems securely are those that integrate security design into the same conversation as capability design, not those that treat security as a review step applied to a system that has already been built.

Chapter 14 examines what happens when the failure modes from Chapter 12 and the attack surface from this chapter converge with a specific adversarial goal: the production of disinformation at machine scale.


References

  • 1. OWASP (2025). OWASP Top 10 for Large Language Model Applications. Open Web Application Security Project. https://owasp.org/www-project-top-10-for-large-language-model-applications/
  • 2. Greshake, K., Abdelnabi, S., Mishra, S., Endres, C., Holz, T., & Fritz, M. (2023). Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injections. AISec '23: Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security. arXiv:2302.12173.
  • 3. Zou, A., Wang, Z., Kolter, J.Z., & Fredrikson, M. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv:2307.15043. Carnegie Mellon University / Center for AI Safety.
  • 4. National Institute of Standards and Technology. (2023). Artificial Intelligence Risk Management Framework (AI RMF 1.0). U.S. Department of Commerce. https://doi.org/10.6028/NIST.AI.100-1

Building agentic AI and wondering why alignment is harder than the technology? Get in touch

ADVERTISEMENT