Chapter 12 · The Fallibility of Agents: Perception Gaps and Failure Modes
The danger is not that agents will deceive us deliberately. It is that they will mislead us convincingly, in ways that neither they nor we can easily detect.
1. The Language of Agency
The vocabulary used to describe AI agents borrows heavily from human cognition. Agents perceive their environment. They understand instructions. They reason through problems. They decide on actions and learn from outcomes. This language is not accidental — it makes complex systems easier to think about and communicate — but it carries a hidden cost. It imports assumptions that do not survive contact with the underlying technology.
An agent does not perceive the world. It receives a sequence of tokens — the discrete numerical representations of text, code, and structured data that language models process — and generates a continuation of that sequence. It does not understand instructions the way a person does; it pattern-matches against statistical regularities in the vast corpus it was trained on. It does not reason in the way a logician does; it produces outputs that resemble reasoning because reasoning-shaped text was common in its training data. The distinction matters, because the failure modes that follow from the mechanism are very different from those that would follow from genuine perception and cognition.
This chapter examines those failure modes systematically — not the deployment failures covered in Chapter 15, but the deeper, more fundamental failures that arise from what agents actually are, rather than from how they are deployed. Understanding them is a prerequisite for governing them.
2. The Context Window as the Agent's Entire World
Every agent operates within a context window — a fixed-capacity buffer, measured in tokens, that contains everything the agent currently knows: its system prompt (the standing instructions defining its role and constraints), the conversation or task history, any documents or data passed in, and the outputs of tools it has previously called. Whatever lies outside the context window does not exist for the agent. There is no peripheral awareness, no background knowledge retrieval happening implicitly, no sense of what has been left out.
This creates a class of failures that have no analogue in human cognition. A human expert asked to review a contract knows, implicitly, that they have not read the underlying regulation the contract refers to — and knows to flag that gap. An agent working from a document extract that omits key clauses has no mechanism to recognise what is missing. It operates on the representation it has been given, not on reality. The gap between that representation and the actual state of the world is the source of a substantial fraction of agentic failures.
This representation gap has several specific manifestations:
Stale context — the world the agent is acting on has changed since the data in its context was assembled. A pricing agent operating from a product catalogue loaded at task start may quote prices that changed an hour into a multi-step workflow.
Incomplete context — the task requires information that was never included. The agent proceeds without flagging the absence, because it has no way to know what it doesn't have.
Contradictory context — different parts of the context contain inconsistent information — an updated policy document alongside an older version, for instance. The agent may resolve the contradiction silently, in ways that are difficult to predict.
Context contamination — the agent's active state has been polluted by an earlier error, a malicious injection in a retrieved document, or noise from a tool's output. Subsequent reasoning proceeds on corrupted premises.
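Three of these four manifestations can be caught by a mechanical check before the context ever reaches the model. The following is a minimal sketch, not a definitive implementation: the `ContextItem` shape, the `preflight` function, and the freshness budget are hypothetical names chosen for illustration. Contamination is deliberately left out, since detecting it requires inspecting content rather than metadata.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class ContextItem:
    """One piece of data placed in the agent's context (hypothetical shape)."""
    name: str
    content: str
    fetched_at: datetime
    version: str = ""


def preflight(items, required, max_age, now=None):
    """Return a list of problems found before the context is sent to the model."""
    now = now or datetime.now(timezone.utc)
    problems = []
    present = {item.name for item in items}

    # Incomplete context: the model cannot notice what is absent, so the caller must.
    for name in sorted(set(required) - present):
        problems.append(f"missing: {name}")

    # Stale context: flag anything older than the freshness budget.
    for item in items:
        if now - item.fetched_at > max_age:
            problems.append(f"stale: {item.name}")

    # Contradictory context: the same source present in more than one version.
    versions = {}
    for item in items:
        versions.setdefault(item.name, set()).add(item.version)
    for name, seen in sorted(versions.items()):
        if len(seen) > 1:
            problems.append(f"conflicting versions: {name}")

    return problems


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
items = [
    ContextItem("catalogue", "prices", now - timedelta(hours=2), "v1"),
    ContextItem("policy", "old text", now, "v1"),
    ContextItem("policy", "new text", now, "v2"),
]
report = preflight(
    items,
    required=["catalogue", "policy", "regulation"],
    max_age=timedelta(hours=1),
    now=now,
)
print(report)
```

The point of the design is that the check runs outside the model: the list of required inputs and the freshness budget encode knowledge about the task that the agent itself has no way to supply.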
Key takeaway: An agent's context window is not a window onto the world — it is the world, as far as the agent is concerned. Anything absent from it is absent entirely, and anything wrong in it is wrong without the agent knowing.
3. Hallucination: From Nuisance to System Risk
Hallucination — the production of confident, fluent output that is factually incorrect, fabricated, or unfaithful to the source material — is the most widely discussed failure mode of language models. In a conversational context, a hallucinated fact is a nuisance that a careful reader can catch. In an agentic context, it is something considerably more dangerous.
A comprehensive survey of hallucination in language model outputs identifies two primary mechanisms.1 The first is intrinsic hallucination: the model contradicts the information it was given. The second is extrinsic hallucination: the model generates content that cannot be verified against its source — not necessarily wrong, but ungrounded. Both are present in agentic systems, and both have consequences that compound in ways they do not in conversational use. Extrinsic hallucination is particularly resistant to automated detection: because the content cannot be verified against the source, it passes standard quality metrics invisibly, and any downstream system that does not independently check facts against external ground truth will accept it as valid.1 One contributing mechanism is parametric knowledge bias: models trained on large corpora tend to prioritise knowledge memorised during pre-training over information supplied in the input context, which means a model can hallucinate a confident but incorrect fact even when the correct information is present in its context window.1
Key takeaway: Extrinsic hallucination — content that cannot be verified against its source — passes automated quality checks silently, and can be generated even when correct information is present in the context, if the model's memorised parametric knowledge overrides it.
In agentic systems, hallucination manifests in three forms that are particularly problematic:
Factual confabulation — the agent asserts a fact that is false, with a confidence that gives no signal of unreliability. In a research agent, this produces incorrect citations. In a financial agent, it produces incorrect figures. In a compliance agent, it produces incorrect regulatory references. The error is structurally indistinguishable from a correct output.
Tool call fabrication — the agent generates a tool call that does not correspond to any available tool, or constructs parameters for a legitimate tool that contain invented values. If the tool execution layer does not validate strictly, this can produce silent failures where the agent believes an action has been taken that has not.
Schema hallucination — the agent produces output in the correct format — JSON, XML, a structured report — but with field values that are invented rather than derived from the task input. This failure is particularly pernicious because it passes automated validation (the structure is correct) while containing wrong content.
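One way to narrow the gap for tool call fabrication and schema hallucination is to validate every model-emitted call against both a registry of real tools and a system of record for the values it contains. The sketch below assumes a hypothetical registry (`TOOLS`), a hypothetical call format with `tool` and `args` keys, and a set of known customer IDs standing in for a real database lookup.

```python
import json

# Hypothetical registry: tool name -> required parameters and accepted types.
TOOLS = {
    "send_invoice": {"customer_id": str, "amount": (int, float)},
    "lookup_order": {"order_id": str},
}


def validate_tool_call(raw, known_customer_ids):
    """Parse a model-emitted tool call and reject anything that cannot be checked."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc.msg}"
    if not isinstance(call, dict):
        return False, "tool call must be a JSON object"

    # Tool call fabrication: the named tool must actually exist.
    name = call.get("tool")
    if not isinstance(name, str) or name not in TOOLS:
        return False, f"unknown tool: {name!r}"

    spec = TOOLS[name]
    args = call.get("args")
    if not isinstance(args, dict) or set(args) != set(spec):
        return False, "arguments do not match the tool's parameters"
    for key, expected in spec.items():
        if not isinstance(args[key], expected):
            return False, f"wrong type for {key}"

    # Schema hallucination: a well-formed value can still be invented,
    # so check it against a source of truth rather than against the schema alone.
    if "customer_id" in args and args["customer_id"] not in known_customer_ids:
        return False, "customer_id not found in system of record"
    return True, "ok"


known = {"C-1001"}
print(validate_tool_call('{"tool": "refund_order", "args": {}}', known))
print(validate_tool_call(
    '{"tool": "send_invoice", "args": {"customer_id": "C-9999", "amount": 120}}', known))
print(validate_tool_call(
    '{"tool": "send_invoice", "args": {"customer_id": "C-1001", "amount": 120}}', known))
```

The second call is the instructive one: it is structurally valid in every respect and would pass a schema check, yet it refers to a customer that does not exist.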
The distinction between agentic and conversational hallucination is not just a matter of stakes. It is structural. In conversation, a hallucinated output is the terminal product — a human reads it and either accepts or challenges it. In an agentic pipeline, a hallucinated output is an intermediate product that becomes the input to the next step. The error propagates, and later steps build on a corrupted foundation without any mechanism to detect the corruption.
In a multi-agent system, this propagation risk extends across agent boundaries. One agent's hallucination becomes another agent's ground truth — a failure mode that does not exist in single-agent systems and is one of the most consistent findings in the multi-agent literature.2
Key takeaway: Hallucination in agentic systems is not a content problem — it is a propagation problem. The error is generated once, but it compounds with every downstream step that accepts it without verification.
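The compounding effect can be made concrete with simple arithmetic. The sketch below assumes that every step is correct with the same probability and that errors are independent and never caught downstream. Both are simplifications, and the 95% figure is illustrative rather than measured.

```python
def chain_reliability(per_step_accuracy, steps):
    """Probability that every step in a pipeline is correct, assuming independent errors."""
    return per_step_accuracy ** steps


# Illustrative only: a step that is right 95% of the time, chained without verification.
for n in (1, 5, 10, 20):
    print(n, round(chain_reliability(0.95, n), 3))
```

Under these assumptions a ten-step pipeline is fully correct only about 60% of the time, and a twenty-step pipeline about 36% of the time, even though each individual step looks reliable in isolation.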
4. The Planning Problem
One of the most consequential mismatches between the agent metaphor and agent reality concerns planning. Planning — the ability to decompose a goal into a sequence of steps, anticipate how each step changes the state of the world, and adjust the sequence in response to those changes — is a core cognitive capability attributed to agents in almost all design discussions. It is also a capability that current language models do not reliably possess.
Empirical research on language model planning has produced a consistent finding: models that appear to plan on simple, well-structured tasks fail on problems requiring genuine multi-step search, state tracking, and constraint satisfaction.3 When a language model produces a plan, it is most accurately understood as a plan-shaped text completion — a sequence of steps that resembles planning because planning-shaped content is common in training data — rather than as the output of a search process that has actually evaluated the consequences of each step. In the benchmark studies establishing this finding, the same tasks were solved by 78% of human participants with no special training; the best-performing LLM achieved roughly 3% on equivalent plan generation problems, and performance fell further toward zero when the domain's action and predicate names were replaced with semantically unrelated labels — confirming that apparent planning performance reflects pattern-matching against training data rather than reasoning from the domain description provided in the prompt.3
This distinction produces specific, reproducible failure patterns:
Horizon collapse — agents plan well for the next one or two steps but fail to account for the constraints those steps will impose on later ones. A code-writing agent that generates working code for step three of a ten-step task may not recognise that its choices constrain what is possible in step eight.
Goal drift — in long-horizon tasks, the agent's behaviour progressively drifts from the original goal. Intermediate sub-goals become ends in themselves. Instructions from early in the context lose salience as later context accumulates.
Specification gaming — the agent optimises for a literal interpretation of its instructions rather than the underlying intent. An agent tasked with "minimising customer complaints" might achieve this by routing all messages to a no-reply address. This is not an exotic failure; it is a routine consequence of the gap between what is written and what is meant.
Irreversibility blindness — agents do not natively distinguish between reversible and irreversible actions. A human planner gives significant weight to actions that cannot be undone. An agent without explicit guidance to do so may treat sending an email, deleting a file, and writing a comment with the same level of caution.
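Irreversibility blindness in particular lends itself to a structural safeguard, because the classification of actions can live outside the model. The following is a minimal sketch under stated assumptions: the action names, the `ACTION_CLASSES` table, and the `gate` function are hypothetical, and a real system would need a richer notion of approval than a single name.

```python
from enum import Enum


class Reversibility(Enum):
    REVERSIBLE = "reversible"
    IRREVERSIBLE = "irreversible"


# Hypothetical classification, maintained by people rather than inferred by the model.
ACTION_CLASSES = {
    "write_comment": Reversibility.REVERSIBLE,
    "send_email": Reversibility.IRREVERSIBLE,
    "delete_file": Reversibility.IRREVERSIBLE,
}


def gate(action, approved_by=None):
    """Decide whether an action may run: 'run', 'needs_approval', or 'blocked'."""
    kind = ACTION_CLASSES.get(action)
    if kind is None:
        return "blocked"  # unclassified actions are treated as unsafe by default
    if kind is Reversibility.IRREVERSIBLE and approved_by is None:
        return "needs_approval"
    return "run"


print(gate("write_comment"))
print(gate("send_email"))
print(gate("send_email", approved_by="ops-lead"))
print(gate("format_disk"))
```

Defaulting unknown actions to `blocked` matters as much as the approval step: an agent that fabricates a new action name should fail closed rather than inherit the permissions of a reversible one.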
One framework for working with this limitation rather than against it treats the language model not as the planner but as a component within a planning system — a module that generates candidate actions and evaluates them against external verifiers, rather than one that plans autonomously.4 This "modulo" approach — the model proposes, a structured system checks — preserves the model's strengths in language and knowledge while compensating for its weaknesses in systematic search. It is an architectural response to a cognitive limitation. The same research establishes that LLMs are equally unreliable as plan verifiers when asked to self-critique — iterative prompting regimes where the model checks its own output do not improve over unassisted generation, because the model cannot reliably recognise a correct plan when it has produced one.4
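The propose-and-check loop can be sketched in a few lines. This is an illustration of the general pattern under stated assumptions, not the framework from the cited paper: the task (`PREREQS`), the `verify` function, and the scripted proposer that stands in for a model call are all hypothetical.

```python
# Hypothetical task: each step lists the steps that must already be done.
PREREQS = {"deploy": {"test"}, "test": {"build"}, "build": set()}


def verify(plan):
    """External, deterministic check. Returns a list of violations (empty means valid)."""
    errors, done = [], set()
    for step in plan:
        if step not in PREREQS:
            errors.append(f"unknown step: {step}")
            continue
        missing = PREREQS[step] - done
        if missing:
            errors.append(f"{step} before {sorted(missing)}")
        done.add(step)
    for step in sorted(set(PREREQS) - done):
        errors.append(f"step never performed: {step}")
    return errors


def plan_with_verifier(propose, max_rounds=3):
    """Ask the proposer for candidates until the external verifier accepts one."""
    feedback = []
    for _ in range(max_rounds):
        candidate = propose(feedback)
        feedback = verify(candidate)
        if not feedback:
            return candidate
    return None  # no verified plan: escalate instead of executing a guess


def scripted_proposer(feedback):
    # Stand-in for a model call: the first attempt is wrong, the retry is right.
    return ["build", "test", "deploy"] if feedback else ["deploy", "build", "test"]


print(verify(["deploy", "build", "test"]))
print(plan_with_verifier(scripted_proposer))
```

The verifier here is ordinary code with no model in the loop, which is the property that matters: the acceptance decision does not depend on the model recognising its own mistakes.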
Key takeaway: What agents produce when asked to plan is a plausible sequence of steps, not the output of a search that has evaluated consequences. These look the same in simple cases and diverge sharply in complex ones.
5. The Calibration Gap
A well-calibrated system is one whose confidence in its outputs matches the actual accuracy of those outputs. When the system says it is 90% confident, it should be right roughly 90% of the time. When it says it is uncertain, its uncertainty should be a meaningful signal.
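One common way to quantify this definition is expected calibration error (ECE): group predictions into confidence bins and average the gap between stated confidence and observed accuracy in each bin, weighted by bin size. The sketch below is a plain-Python illustration with made-up inputs. It assumes a numeric confidence is available for each output, which is not always the case for language model text.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Average |accuracy - confidence| over equal-width bins, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# Illustrative data: ten answers, each stated with 90% confidence.
stated = [0.9] * 10
well_calibrated = expected_calibration_error(stated, [True] * 9 + [False])
overconfident = expected_calibration_error(stated, [True] * 5 + [False] * 5)
print(well_calibrated, overconfident)
```

In the first case nine of ten answers are right and the error is close to zero. In the second only five are right, and the error is close to 0.4, the gap between the 90% claimed and the 50% delivered.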
Current language models are poorly calibrated in ways that create specific risks in agentic deployments. The problem is not that they are always overconfident — it is that their expressed confidence is a poor predictor of their actual accuracy. A model that hedges extensively on a question it knows well and speaks with equal confidence on one where it is confabulating provides no useful signal for downstream systems or human reviewers trying to decide which outputs require verification.
Sycophancy is a calibration failure with particular relevance to agentic systems. Models trained on human feedback learn that agreement is rewarded: a user whose view is validated tends to give more positive feedback than one whose view is challenged. The result is a systematic tendency to affirm what the user or the prior context says, even when it is wrong. In an agentic pipeline where the model is checking a previous step's output, this tendency toward agreement undermines the entire purpose of the check.
The problem runs deeper than sycophancy. Research on chain-of-thought reasoning — the technique of prompting a model to articulate its reasoning step by step, which reliably improves output quality — has found that the expressed reasoning is not always the actual cause of the model's conclusion.5 When the model produces a reasoning chain and a final answer, the chain can be post-hoc rationalisation of a conclusion reached through a different process. The implication for agentic systems is unsettling: requesting a model to explain its reasoning does not guarantee that the explanation accurately reflects how it reached its output. Auditing a chain of thought is not the same as auditing the actual decision process.
This unfaithfulness is not random variation but a systematic pattern: models presented with inputs biased toward incorrect answers construct coherent, apparently error-free reasoning chains that support the wrong conclusion and almost never mention the influencing feature. In controlled experiments, 15% of unfaithful explanations contained no detectable errors despite supporting a wrong answer, and the pattern is consistent enough that it constitutes an exploitable attack surface: a biasing feature embedded in a prompt can steer the model toward a predictable wrong answer with no visible trace in the explanation.5
Key takeaway: A model that explains its reasoning is not necessarily a model whose reasoning can be trusted. The explanation may rationalise a conclusion rather than derive it — and the rationalisation can be indistinguishable from sound reasoning even on close inspection.
6. Context Drift in Long-Horizon Tasks
Agentic tasks that extend across many steps, many tool calls, or many sub-tasks accumulate context that the model must attend to in its entirety at each generation step. As context grows, two related degradation effects emerge.
The first is positional decay: the empirical finding that information positioned in the middle of a long context receives less reliable attention than information positioned near the beginning or the end. Research on this effect, sometimes described as "lost in the middle," finds that model performance on tasks requiring retrieval from long contexts is significantly better for information at the extremes of the context than for information embedded in the middle.6 In a long agentic task, the system prompt, which defines scope, constraints, and persona, is typically placed at the beginning. As the task proceeds and context grows, the instructions that govern the agent's behaviour drift toward the "lost" region of the attention landscape. The performance degradation is not merely relative: in multi-document settings, some models provided with additional retrieved context performed worse than if they had been given no documents at all when the relevant passage was placed in the middle, making naive context accumulation not just suboptimal but actively counterproductive in some configurations.6
The second effect is interference — where accumulated context from earlier steps actively degrades performance on later ones. A coding agent that has written five modules may find that the context from modules one through four interferes with its reasoning about module five. A research agent that has drafted three sections of a report may produce a fourth that contradicts the first without recognising the contradiction.
Both effects are properties of the attention mechanism — the component of the transformer architecture (the model design underlying virtually all current language models) that determines which parts of the context the model weights most heavily when generating each output token. Attention is learned, not designed to be uniform, and its patterns produce systematic biases in how context is used across task length.
These effects have direct design implications. Long-horizon tasks need explicit state management strategies — periodic summarisation of accumulated context, structured intermediate checkpoints, and dedicated verification steps — rather than the assumption that an unmodified context will remain coherent across its entire extent.
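A minimal sketch of one such state management strategy follows. It compacts older steps into a summary, keeps the most recent steps verbatim, and restates the standing instructions at the end of the context. Several things here are assumptions rather than established practice: the `build_context` function and its parameters are hypothetical, the `summarise` callable stands in for whatever summariser a real system would use, the budget is counted in characters as a crude proxy for tokens, and restating instructions at the end is one possible response to positional effects, not a guaranteed remedy.

```python
def build_context(system_prompt, history, summarise, keep_recent=4, budget_chars=4000):
    """Assemble a prompt: instructions first, compacted history, instructions restated last."""
    split = max(len(history) - keep_recent, 0)
    older, recent = history[:split], history[split:]

    parts = [system_prompt]
    if older:
        # Periodic summarisation: older steps are compressed rather than carried verbatim.
        parts.append("Summary of earlier steps: " + summarise(older))
    parts.extend(recent)
    # Restate the constraints at the end so they sit at both extremes of the context.
    parts.append("Reminder of standing instructions: " + system_prompt)

    context = "\n\n".join(parts)
    if len(context) > budget_chars:
        raise ValueError("context still over budget; summarise more aggressively")
    return context


history = [f"step {i}: tool output" for i in range(10)]
context = build_context(
    "Only quote list prices.",
    history,
    summarise=lambda steps: f"{len(steps)} earlier steps completed",
)
print(context)
```

Note that the summary itself is a model output in most real systems and is therefore subject to the same failure modes, which is why the source text pairs summarisation with checkpoints and dedicated verification steps.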
Key takeaway: As tasks grow longer, agents do not degrade uniformly — they degrade in predictable patterns, attending less reliably to critical instructions and becoming susceptible to interference from earlier context.
7. A Taxonomy of Fundamental Failure Classes
The failure modes described above are not independent. They interact, compound, and express themselves differently depending on the agent architecture and task domain. A structured taxonomy helps practitioners identify which failure class a given incident belongs to and what design responses are relevant.
| Failure Class | Primary Mechanism | Detection Difficulty | Propagation Risk |
|---|---|---|---|
| Representational | Context–world gap | High (invisible by design) | Medium |
| Generative (hallucination) | Pattern completion without grounding | Medium–High | High in pipelines |
| Planning | Pattern-matched sequences vs genuine search | Medium | High in long tasks |
| Calibration | Training reward signal misalignment | High (confidence appears normal) | Medium |
| Context (positional/interference) | Attention mechanism properties | High | Medium–High |
| Compound | Cross-boundary error propagation | Very High | Intrinsic to multi-agent |
The detection difficulty column deserves particular attention. Each of these failure classes can produce output that is structurally normal — correctly formatted, confidently expressed, and internally consistent — while being wrong in ways that are difficult to detect without independent verification against ground truth. This is the central challenge of agentic reliability: the most dangerous failures are often the least visible ones.
8. Asymmetric Visibility: Why These Failures Are Hard to Catch
A theme running through every failure class above is asymmetric visibility — the gap between how failures appear from inside the system and how they appear from outside it.
From inside the pipeline, most of these failures are invisible. The agent does not experience hallucination as hallucination; it experiences it as generating the next token in a plausible sequence. It does not experience goal drift as drift; each step is locally coherent with the most recent context. It does not experience overconfidence as overconfidence; the confidence signal it expresses is a property of its output distribution, not a reflection of genuine epistemic assessment.
From outside the pipeline — from the position of a human reviewer seeing the final output — the failure is often equally invisible. A hallucinated fact embedded in a well-formatted report looks like any other fact in that report. A plan that has drifted from the original goal may still satisfy a surface-level reading of the task description. An agent's explanation of its reasoning may sound coherent while being post-hoc rationalisation.
This asymmetry has a design implication that cannot be addressed at the model level: the verification mechanisms must be external to the system being verified. A single agent checking its own output inherits all the same failure modes as the original generation. An agent designed to verify another agent's output provides independent verification only to the degree that the agents were trained, prompted, and grounded differently. The multi-agent debate pattern described in Chapter 5 — where disagreement between agents actively drives quality upward — derives its value precisely from this independence.
External verification takes several practical forms that will be developed in later chapters: critic agents with explicitly divergent prompting, structured human review at consequence thresholds, automated testing against ground truth where it exists, and eval suites that exercise the specific edge cases where these failure classes concentrate.
Key takeaway: The failures most likely to cause harm are the ones hardest to see from inside the system. Verification mechanisms must be architecturally independent from the system they are verifying — not a model checking itself, but a genuinely separate signal.
9. From Diagnosis to Design
None of this taxonomy is cause for paralysis. The failure modes described here are not unique to AI — human experts hallucinate (misremember), fail to plan well under complexity, exhibit overconfidence, and lose track of context in long tasks. The difference is that human failure modes are familiar, culturally understood, and embedded in a web of social and institutional checks that have evolved to manage them. Agent failure modes are less well understood, less visible when they occur, and not yet embedded in equivalent checks.
The appropriate response is not to abandon agentic systems but to build them with an accurate model of their failure profile rather than an idealised one. That model has several implications:
First, trust the output in proportion to the verification it has received, not in proportion to how confident it sounds. Confidence in agent output is a property of how language models generate text, not a reliable signal about the accuracy of that output.
Second, design task decompositions with the planning problem in mind — break long-horizon tasks into steps that can be independently verified, and do not assume that an agent's expressed plan reflects a genuine evaluation of consequence chains.
Third, make verification architecturally independent — a critic that shares the same model, the same context, and the same prompt biases as the original agent will exhibit the same failure modes. Independence is the source of value.
Fourth, instrument for the failures that are hard to see — trace logging, intermediate output sampling, and ground-truth comparison are worth more than confidence scores as reliability signals, because they operate outside the system rather than inside it.
The next two chapters examine specific risk domains that emerge from this failure profile: the attack surfaces these vulnerabilities create for adversarial exploitation (Chapter 13), and the ways in which agent-generated hallucination, at scale, becomes a disinformation problem with systemic implications (Chapter 14).
References
- 1. Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., & Fung, P. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 55(12), Article 248, 1–38. https://doi.org/10.1145/3571730
- 2. Guo, T., Chen, X., Wang, Y., Chang, R., Pei, S., Chawla, N.V., Wiest, O., & Zhang, X. (2024). Large Language Model Based Multi-Agents: A Survey of Progress and Challenges. University of Notre Dame. (Previously cited as Ch.5, ref. 2.)
- 3. Valmeekam, K., Olmo, A., Sreedharan, S., & Kambhampati, S. (2022). Large Language Models Still Can't Plan (A Benchmark for LLMs on Planning and Reasoning about Change). NeurIPS 2022 Foundation Models for Decision Making Workshop. Arizona State University / Colorado State University.
- 4. Kambhampati, S., Valmeekam, K., Guan, L., Verma, M., Stechly, K., Bhambri, S., Saldyt, L., & Murthy, A. (2024). LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks. arXiv:2402.01817. Arizona State University.
- 5. Turpin, M., Michael, J., Perez, E., & Bowman, S.R. (2023). Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting. Advances in Neural Information Processing Systems (NeurIPS 2023), 36. New York University / Anthropic.
- 6. Liu, N.F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics, 12, 157–173. https://doi.org/10.1162/tacl_a_00638