Chapter 11 · The Plumbing Problem: Technical Realities of System Integration

The architecture diagram shows how the pieces connect. This chapter explains what happens when they do.

What Architecture Diagrams Leave Out

There is a gap between the clean box-and-arrow diagrams that describe an agentic system and the engineering reality of making one work reliably. The boxes are easy to label: orchestrator, subagent, tool, database, external API. The arrows between them look simple. What they conceal is a class of problems — authentication flows, serialisation formats, error propagation, latency budgets, security boundaries — that consume the majority of integration engineering time.

These are not glamorous problems. They do not appear in product demos. They are also the problems that most frequently determine whether an agentic system reaches production — and stays there.

The practical implication is that integration work should be designed around the whole workflow, not around the agent alone. A production agent sits inside a stack of decisions: which workflow it serves, which tools it can call, which context it can trust, which permissions it holds, how it handles partial failure, and how the full path is evaluated. Chapter 16 develops this as the enterprise agentic workflow stack. The point matters here because most plumbing failures are not failures of intelligence; they are failures of boundary design between these layers.

Authentication and Authorisation at Scale

Authentication is the first technical problem agentic systems hit when connecting to real enterprise infrastructure. The problem is structural: most enterprise authentication systems were designed around human users, not processes.

The credential management problem. An agent that needs to access ten enterprise systems needs credentials for ten systems. Where do those credentials live? Who rotates them? What happens when they expire mid-task? These questions have clear answers in human-operated systems — a person notices when their password expires and resets it. In an agentic system operating at scale and potentially overnight, a credential expiry can silently fail a workflow in ways that are difficult to detect.

The identity problem. Many enterprise systems are not designed to grant access to a non-human principal. The agent needs to appear as some identity — a service account, a delegated user, a role — and the access controls on that identity need to be scoped appropriately. Over-permissioning is the most common pattern (giving the agent admin access because it is easier) and the most dangerous one.

Recommended patterns:

Use a secrets manager rather than embedding credentials in prompts or environment variables.
Grant agents the minimum permissions required for the task, not the maximum available.
Log every credentialed action for audit trail purposes.
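The first and third patterns can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: `CredentialProvider`, `Credential`, and the `fetch_secret` callable are hypothetical names, and `fetch_secret` stands in for whichever secrets manager is actually in use. The refresh margin addresses the mid-task expiry problem by renewing a credential shortly before it lapses.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Credential:
    token: str
    expires_at: float  # Unix timestamp


class CredentialProvider:
    """Fetches short-lived credentials on demand and refreshes them before expiry."""

    def __init__(self, fetch_secret: Callable[[str], Credential],
                 refresh_margin_s: float = 60.0,
                 clock: Callable[[], float] = time.time):
        self._fetch = fetch_secret          # hypothetical call into a secrets manager
        self._margin = refresh_margin_s
        self._clock = clock
        self._cache: Dict[str, Credential] = {}
        self.audit_log: List[dict] = []     # in-memory here; durable storage in practice

    def get(self, system: str, agent_id: str, action: str) -> str:
        cred = self._cache.get(system)
        # Refresh when nothing is cached or the credential is close to expiring.
        if cred is None or cred.expires_at - self._clock() <= self._margin:
            cred = self._fetch(system)
            self._cache[system] = cred
        # Record every credentialed action, never the token itself.
        self.audit_log.append({"ts": self._clock(), "agent": agent_id,
                               "system": system, "action": action})
        return cred.token
```

Least privilege is not visible in this sketch because it is a property of the identity behind `fetch_secret`: the scopes granted to that service account or role determine what the returned token can do.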

Tool Calling and Protocol Standards

The mechanism by which an agent invokes external tools is one of the most consequential architectural choices in an agentic system, with direct implications for reliability, security, and portability.

Function calling is the foundation — the model identifies when a tool should be used, generates a structured JSON payload describing the call, and the agent runtime executes it and passes the result back into the context. The reliability of this mechanism depends heavily on how well tools are described, how the model handles ambiguous inputs, and how errors in tool execution are surfaced back to the model.
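The runtime side of this loop can be sketched as follows. The payload shape (`name` plus `arguments`), the `TOOLS` registry, and the `get_order_status` tool are assumptions made for illustration; actual payload formats differ between model providers. The point is that every failure mode comes back as a structured result the model can read, rather than an exception that ends the run.

```python
import json
from typing import Any, Callable, Dict

TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function so the runtime can dispatch to it by name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def get_order_status(order_id: str) -> dict:
    # Hypothetical tool; a real one would query an order system.
    return {"order_id": order_id, "status": "shipped"}


def execute_tool_call(payload: str) -> dict:
    """Run one model-generated tool call and return a result for the context."""
    try:
        call = json.loads(payload)
        name, args = call["name"], call.get("arguments", {})
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError) as exc:
        return {"ok": False, "error": f"malformed tool call: {exc!r}"}
    fn = TOOLS.get(name)
    if fn is None:
        return {"ok": False, "error": f"unknown tool: {name}"}
    try:
        return {"ok": True, "result": fn(**args)}
    except TypeError as exc:
        # Typically wrong, missing, or non-dict arguments supplied by the model.
        return {"ok": False, "error": f"bad arguments for {name}: {exc}"}
    except Exception as exc:
        return {"ok": False, "error": f"{name} failed: {exc}"}
```

A call such as `execute_tool_call('{"name": "get_order_status", "arguments": {"order_id": "A1"}}')` returns an `ok` result, while an unknown tool name or invented parameter returns an error message the model can act on.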

Model Context Protocol (MCP) extends this into a standardised server-client architecture. An MCP server exposes a set of tools through a well-defined interface; an MCP client (the agent runtime) discovers and invokes them. This provides several advantages over ad-hoc function calling implementations:

Property	Ad-hoc Function Calling	MCP
Tool reusability	Low — tied to specific implementation	High — MCP server is portable
Discovery	Manual — tools must be registered explicitly	Automatic — client discovers available tools
Versioning	Unspecified	Defined in protocol
Security boundary	Depends on implementation	Explicit server-level isolation
Cross-model compatibility	Limited	By design

For new integrations, implementing tools as MCP servers is the more sustainable architecture. For existing integrations, wrapping them with an MCP-compatible interface when resources allow is a worthwhile investment. The standard does not remove the need for security review: every exposed tool still needs least-privilege permissions, parameter validation, logging, and clear human ownership. A portable unsafe tool is still unsafe — it is simply easier to reuse in more places.

Data Contracts and Schema Boundaries

Many integration failures begin one layer before error propagation becomes relevant: the agent and the tool disagree about the shape or meaning of the data being exchanged. Traditional distributed systems already struggle with schema evolution, backward compatibility, retries, and idempotency; agentic systems inherit all of those problems and add natural-language ambiguity on top of them.³

Every production-grade tool exposed to an agent should therefore have a contract that specifies:

Contract Element	Why It Matters
Input schema	Prevents the model from inventing parameters or sending malformed values
Output schema	Gives the agent predictable structure to reason over
Error schema	Ensures failures are interpretable rather than opaque exceptions
Idempotency behaviour	Prevents repeated tool calls from duplicating payments, tickets, orders, or notifications
Versioning rules	Allows tools to evolve without silently breaking agent behaviour
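As a rough illustration of the table above, a contract can be represented as data and checked before any tool runs. The sketch below uses only the standard library; `ToolContract`, `validate_input`, and the `create_ticket` example are hypothetical names, and a production system would more likely use a full schema language than a flat map of field names to Python types.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class ToolContract:
    name: str
    version: str                     # versioning rules hang off this field
    input_schema: Dict[str, type]    # field name -> required Python type
    idempotent: bool                 # whether a repeated call is safe


def validate_input(contract: ToolContract, args: Dict[str, Any]) -> List[str]:
    """Return a list of contract violations; an empty list means the call is valid."""
    problems: List[str] = []
    for field_name, expected in contract.input_schema.items():
        if field_name not in args:
            problems.append(f"missing field: {field_name}")
        elif not isinstance(args[field_name], expected):
            actual = type(args[field_name]).__name__
            problems.append(f"{field_name}: expected {expected.__name__}, got {actual}")
    for field_name in args:
        if field_name not in contract.input_schema:
            # The model invented a parameter the tool does not accept.
            problems.append(f"unexpected field: {field_name}")
    return problems


CREATE_TICKET = ToolContract(
    name="create_ticket",
    version="1.2.0",
    input_schema={"title": str, "priority": int},
    idempotent=False,
)
```

Returning the violations as a list, rather than raising on the first one, lets the runtime hand the model a complete description of what to correct in a single turn.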

For write operations, idempotency is especially important. If an agent times out after submitting a request and retries, the system must know whether it is repeating the same action or creating a second one. Without that guarantee, a recoverable latency problem can become a business-impacting data integrity problem.
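One common way to provide that guarantee is an idempotency key derived from the task and the call itself. The sketch below is illustrative only: `IdempotentExecutor` is a hypothetical name, and the in-memory dictionary stands in for the durable, shared store a real deployment would need. It also ignores the case where the first attempt is still in flight when the retry arrives.

```python
import hashlib
import json
from typing import Any, Callable, Dict


class IdempotentExecutor:
    """Runs a write operation at most once per idempotency key."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}   # assumption: durable storage in practice

    @staticmethod
    def key_for(task_id: str, tool: str, args: Dict[str, Any]) -> str:
        # Canonical JSON so that argument order does not change the key.
        canonical = json.dumps({"task": task_id, "tool": tool, "args": args},
                               sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def execute(self, key: str, operation: Callable[[], Any]) -> Any:
        if key in self._results:
            # A retry of the same action: return the earlier result, do not repeat it.
            return self._results[key]
        result = operation()
        self._results[key] = result
        return result
```

With this in place, an agent that times out and retries a payment with the same task ID and arguments receives the original result instead of creating a second payment.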

Key takeaway: Tool calls are not just function calls. They are distributed-system contracts, and agents need those contracts to be stricter — not looser — because their behaviour is probabilistic.

Error Propagation in Chained Agent Calls

Error handling in single-turn systems is straightforward: the call fails, you handle the exception, you log the error. In multi-step agentic workflows, errors have the additional property of propagation — a failure at step three of ten affects the validity of steps four through ten, and the agent may or may not recognise that its working context is now compromised.

Key takeaway: The specific danger in chained agent workflows isn't a loud failure — it's a silent one: an undetected error at step three propagates forward, producing final output that appears structurally valid but is built on corrupted state.

Three error handling principles are particularly important in agentic contexts:

Fail fast and explicitly. A tool that encounters an error should return a structured error response that the agent can interpret, not a generic exception that the agent might misinterpret or ignore. The error response should describe what failed, what the agent cannot now assume, and (where possible) what the agent should do instead.

Design for partial success. In long-horizon tasks, a failure in one step should not automatically abort the whole task. Define which steps are critical-path (failure aborts) and which are optional (failure degrades but does not abort). This requires deliberate task decomposition, not just error handling code.

Validate intermediate outputs. In high-stakes workflows, add validation checkpoints between agent steps. A lightweight validation layer that checks whether an intermediate output meets expected structural constraints can catch propagation errors before they compound.
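The three principles can be combined in a small workflow runner. This is a sketch under stated assumptions: `Step`, `StepResult`, and `run_workflow` are hypothetical names, steps run strictly in order, and validation is a simple predicate on each step's output.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class StepResult:
    ok: bool
    value: Any = None
    error: Optional[str] = None                            # what failed
    invalidates: List[str] = field(default_factory=list)   # what can no longer be assumed


@dataclass
class Step:
    name: str
    run: Callable[[Dict[str, Any]], Any]
    critical: bool = True                                   # failure aborts the workflow
    validate: Callable[[Any], bool] = lambda value: True    # checkpoint on the output


def run_workflow(steps: List[Step]) -> Dict[str, StepResult]:
    state: Dict[str, Any] = {}
    report: Dict[str, StepResult] = {}
    for step in steps:
        try:
            value = step.run(state)
            if not step.validate(value):
                raise ValueError("output failed validation checkpoint")
            state[step.name] = value
            report[step.name] = StepResult(ok=True, value=value)
        except Exception as exc:
            # Fail explicitly: record a structured error instead of continuing silently.
            report[step.name] = StepResult(ok=False, error=str(exc),
                                           invalidates=[step.name])
            if step.critical:
                break   # later steps would be built on missing or invalid state
    return report
```

An optional step that fails is recorded and skipped, while a critical step that fails, or whose output does not pass its checkpoint, stops the run before downstream steps can consume bad state.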

Observability: What You Need and Why

Debugging a deterministic system is hard. Debugging a non-deterministic system — one where the same input can produce different outputs, and where the intermediate steps are in natural language rather than structured code — requires purpose-built observability infrastructure.

Key takeaway: Standard application logging assumes deterministic, code-legible intermediate steps. Agentic systems have neither — purpose-built observability infrastructure is the only way to reconstruct what an agent actually did after the fact.

The minimum observability stack for a production agentic system:

Trace logging — every agent invocation, tool call, and model call logged with inputs, outputs, timing, and token counts. This is the equivalent of application logs in traditional systems, and it is the only reliable way to reconstruct what an agent actually did when something goes wrong.

Span correlation — trace IDs that propagate across an entire task, so that every step of a multi-agent workflow can be correlated back to a single user request. Without this, debugging distributed agent failures is close to impossible.

Latency distribution monitoring — not just average latency, but p95 and p99 latency. Agentic systems have long-tail latency distributions because some task paths are much longer than others. The average is misleading.

Cost monitoring — token consumption per task, per agent, per tool. In a multi-agent system where different steps use different models, total cost is not visible without explicit aggregation.

Error rate by tool — which tools are failing, how often, and with what error types. A tool that succeeds 98% of the time appears reliable in average metrics; it appears concerning when you see that it fails during the 2% of tasks that matter most.

Tools such as LangSmith, Weights & Biases, and Langfuse provide much of this infrastructure out of the box, with LangSmith integrating most directly with LangChain-based systems. For custom implementations, OpenTelemetry provides a vendor-neutral foundation.⁴
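To make the shape of this data concrete, the sketch below records spans under a shared trace ID and computes tail latency using only the standard library. It does not use the OpenTelemetry API or any of the products named above; `span`, `SPANS`, and `percentile` are hypothetical names, and the in-memory list stands in for a real trace backend.

```python
import math
import time
import uuid
from contextlib import contextmanager
from typing import Dict, List

SPANS: List[Dict] = []   # assumption: a real system would export these to a backend


@contextmanager
def span(trace_id: str, name: str, **attrs):
    """Record one step (model call, tool call) under a shared trace ID."""
    record = {"trace_id": trace_id, "span_id": uuid.uuid4().hex, "name": name, **attrs}
    start = time.perf_counter()
    try:
        yield record                 # callers may attach outputs or token counts
        record["status"] = "ok"
    except Exception as exc:
        record["status"] = "error"
        record["error"] = repr(exc)
        raise
    finally:
        record["duration_s"] = time.perf_counter() - start
        SPANS.append(record)


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile, e.g. pct=95 for p95."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Grouping `SPANS` by `trace_id` reconstructs a single task end to end, and grouping by `name` gives per-tool error rates and latency distributions; summing a token-count attribute per trace gives cost per task.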

Latency Budgets

Latency in agentic systems compounds across steps. A workflow that makes ten sequential model calls, each taking two seconds on average, has a baseline latency of twenty seconds before any tool execution time is included. For real-time or near-real-time applications, this is often unacceptable.

Latency budget management requires explicit planning:

Component	Typical Range	Optimisation Lever
Model inference (frontier)	2–15s per call	Use smaller model where quality permits
Model inference (small)	0.2–2s per call	Route simpler steps here by default
Tool execution	50ms–10s	Cache results; parallelise where possible
Inter-agent handoff	100ms–2s	Minimise context serialisation; use streaming
Total workflow (10 steps)	20s–3min	Parallelise independent steps; reduce step count

The single highest-leverage optimisation in most agentic systems is parallelising independent steps. If three research subtasks of similar duration can run simultaneously rather than sequentially, the time spent on those steps can drop by up to a factor of three. This requires explicit task decomposition and an orchestration layer that supports concurrent execution.
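The difference between the two execution strategies can be sketched with `asyncio`. The `research_subtask` coroutine is a hypothetical stand-in in which `asyncio.sleep` simulates the wait for a model or tool call; the subtasks are assumed to be independent, which is the precondition for running them concurrently.

```python
import asyncio
import time
from typing import List, Tuple


async def research_subtask(topic: str, seconds: float) -> str:
    # Stand-in for a model or tool call; the sleep simulates network latency.
    await asyncio.sleep(seconds)
    return f"notes on {topic}"


async def run_sequential(topics: List[str], seconds: float) -> List[str]:
    # Each subtask waits for the previous one: total time is the sum.
    return [await research_subtask(t, seconds) for t in topics]


async def run_parallel(topics: List[str], seconds: float) -> List[str]:
    # Independent subtasks run concurrently: total time is roughly the slowest one.
    results = await asyncio.gather(*(research_subtask(t, seconds) for t in topics))
    return list(results)


def timed(coro) -> Tuple[List[str], float]:
    """Run a coroutine to completion and return its result with elapsed seconds."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start
```

Both strategies return the same results in the same order, since `asyncio.gather` preserves input order. Only the elapsed time differs, which is why the optimisation is safe whenever the steps do not depend on each other's output.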

Security Attack Surfaces

Agentic systems introduce attack surfaces that do not exist in traditional application architectures. The OWASP Top 10 for Large Language Model Applications treats these risks as application-security concerns rather than model-behaviour curiosities, covering categories such as prompt injection, sensitive information disclosure, insecure output handling, and excessive agency.¹

Prompt injection is the most widely discussed. A malicious payload embedded in data that the agent reads — an email, a web page, a document — can instruct the agent to take actions not intended by the legitimate user or operator. The challenge is that the agent cannot always distinguish between instructions from its operator and instructions embedded in its environment. Empirical testing found that more capable models are significantly more vulnerable to injection attacks: the same instruction-following ability that makes a model useful applies equally to adversarially injected commands, making architectural controls a more reliable mitigation layer than model-level defences alone.²

Tool abuse occurs when an agent is manipulated into using tools in unintended ways — exfiltrating data through a communication tool, escalating permissions through an administrative tool, or chaining tool calls to achieve an outcome the agent was not designed to enable.

Cross-agent trust exploitation occurs in multi-agent systems, where a compromised subagent issues instructions to other agents that those agents treat as legitimate.

Mitigations:

Scope agent permissions to the minimum required for legitimate tasks.
Treat all content from the environment (emails, documents, web pages) as untrusted input.
Validate tool call parameters before execution, particularly for write operations.
Log all agent actions and flag unusual patterns for human review.
Apply the same security review to agent workflows as to any application code.
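Several of these mitigations can be enforced at a single chokepoint: a policy check that runs before any tool call executes. The sketch below is one possible shape, not a complete defence. `ToolPolicy`, `Decision`, and `check_tool_call` are hypothetical names, the `to` argument is an assumed convention for outbound messages, and the rule that writes are blocked while untrusted content is in context is a deliberately conservative design choice.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Set


@dataclass
class ToolPolicy:
    allowed_tools: Set[str]                                    # least-privilege scope
    write_tools: Set[str] = field(default_factory=set)         # tools with side effects
    allowed_recipient_domains: Set[str] = field(default_factory=set)


@dataclass
class Decision:
    allowed: bool
    reason: str        # logged alongside the call for later review


def check_tool_call(policy: ToolPolicy, tool: str, args: Dict[str, Any],
                    context_is_untrusted: bool) -> Decision:
    """Decide whether a model-proposed tool call may run."""
    if tool not in policy.allowed_tools:
        return Decision(False, f"{tool} is outside this agent's scope")
    if tool in policy.write_tools and context_is_untrusted:
        # Content read from emails, documents, or web pages may carry injected instructions.
        return Decision(False, "write requested with untrusted content in context; "
                               "route to human review")
    recipient = args.get("to")
    if recipient is not None:
        domain = str(recipient).rsplit("@", 1)[-1].lower()
        if domain not in policy.allowed_recipient_domains:
            return Decision(False, f"recipient domain {domain} is not allowlisted")
    return Decision(True, "ok")
```

Because the check sits in the runtime rather than in the prompt, it holds even when the model has been successfully manipulated, which is the sense in which architectural controls complement model-level defences.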

A useful way to operationalise these mitigations is to map them into a risk-management process rather than leaving them as individual engineering tasks. NIST's Generative AI Profile extends the AI RMF with risks specific to generative systems — including confabulation, data leakage, harmful content, and misuse — and provides a structured vocabulary for deciding which risks are governed, measured, managed, and monitored over time.⁵ For agentic systems, this matters because the same integration can be both a reliability concern and a security concern: a malformed tool response can cause an error, while a malicious one can cause compromise.

Security in agentic systems cannot be bolted on after the fact. The attack surfaces are deeply architectural, and the mitigations must be designed in from the start.

This chapter has treated integration as an engineering discipline: credentials, protocols, contracts, observability, latency, and attack surfaces. The next part of the book can now build on that foundation by examining what happens when technically integrated agents meet the less predictable realities of human behaviour, organisational incentives, and external manipulation.

References

1. OWASP (2025). OWASP Top 10 for Large Language Model Applications. Open Web Application Security Project.
2. Perez, F., & Ribeiro, I. (2022). Ignore Previous Prompt: Attack Techniques for Language Models. ML Safety Workshop, NeurIPS 2022. AE Studio.
3. Kleppmann, M. (2017). Designing Data-Intensive Applications. O'Reilly Media.
4. OpenTelemetry Authors (2023). OpenTelemetry Specification. Cloud Native Computing Foundation.
5. National Institute of Standards and Technology (2024). Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile (NIST AI 600-1). U.S. Department of Commerce. https://doi.org/10.6028/NIST.AI.600-1
