Chapter 16 · Staging Your Agentic Transformation: A Maturity Framework

The organisations that move fastest are not those that start boldest. They are those that sequence correctly.


The Problem with Maturity Frameworks

Maturity frameworks have a credibility problem. Consulting firms produce them by the dozen. They tend to look similar: a five-stage ladder from "unaware" to "optimised," each rung decorated with capability descriptors that tell you where you are but not how to move. They are useful for benchmarking and board presentations. They are less useful for the engineering team deciding what to build next quarter, or the CISO trying to understand what governance structure a particular deployment actually requires. The stakes of getting this wrong are high: industry analysis places AI agents at the peak of inflated expectations — the highest-failure-risk position on technology adoption curves — and projects that over 70% of enterprise agentic AI initiatives will fail by 2029, not because the technology is inadequate, but because organisations are deploying it without sufficient attention to use-case fit, risk tolerance, and compliance requirements.2

This chapter attempts something different. Rather than a framework for describing where your organisation is, it offers a framework for deciding what to do next — a sequencing guide grounded in the failure patterns that derail agentic programmes and the capability dependencies that make certain moves viable only after others are in place.

The central argument is simple: agentic transformation is not a capability race. It is a dependency graph. Moving to Stage 4 before Stage 2 is complete does not compress the timeline — it creates the conditions for the kind of visible, trust-destroying failure that sets programmes back by years.
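
The dependency-graph framing can be made concrete with a short sketch. The capability names and edges below are illustrative assumptions, loosely drawn from the prerequisites discussed later in this chapter, and are not a canonical list. The sketch uses Python's standard-library graphlib module to derive a build order in which every prerequisite precedes the capability that depends on it.

```python
from graphlib import TopologicalSorter

# Hypothetical capability graph: each key depends on the capabilities in its set.
DEPENDS_ON = {
    "structured_logging": set(),
    "access_control": set(),
    "evaluation_suite": {"structured_logging"},
    "scope_enforcement": {"access_control"},
    "escalation_logic": {"evaluation_suite", "scope_enforcement"},
    "delegated_operations": {"escalation_logic"},
}

def build_order(graph):
    """Return capabilities in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())

def missing_prerequisites(target, in_place, graph=DEPENDS_ON):
    """List direct and indirect prerequisites of `target` that are not in place."""
    seen, stack = set(), list(graph[target])
    while stack:
        cap = stack.pop()
        if cap in seen:
            continue
        seen.add(cap)
        stack.extend(graph[cap])
    return sorted(seen - set(in_place))
```

Calling missing_prerequisites("delegated_operations", {"structured_logging", "access_control"}) lists the three capabilities that would still have to be built first, which is the practical meaning of "skipping a stage".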


Why Sequence Matters More Than Speed

The pressure on enterprise AI programmes pushes predominantly upward, towards higher stages of autonomy. Boards want deployments. Vendors want contract expansions. Engineers want to build the interesting thing. The institutional incentive to skip stages is strong and constant.

The most advanced deployments share a pattern that public case studies rarely highlight: organisations that have moved beyond pilots into systems handling significant operational volume built the infrastructure layers before the application layers. They had logging before they had autonomy. They had evaluation pipelines before they had broad deployment. They had access control before they had agents touching sensitive data.

This is not conservatism. It is engineering. The failure modes of agentic systems are qualitatively different from the failure modes of earlier enterprise software. A misconfigured ERP system produces wrong outputs; humans catch them in review. An autonomous agent with misconfigured scope boundaries can act on those wrong outputs before any human sees them, and the action may be irreversible. The cost of getting the sequence wrong scales with the autonomy of the system. The gap between deployment intent and governance readiness is measurable: as of early 2026, nearly three in four companies planned to deploy agentic AI within two years, while only one in five reported having a mature model for governance of autonomous agents — meaning the majority of organisations approaching consequential autonomy are doing so without the oversight infrastructure to manage it.8

Key takeaway: Speed in agentic deployment is not measured by how quickly you move to the highest stage. It is measured by how quickly you build the infrastructure that makes higher stages safe to operate.


The Five-Stage Model

The framework below describes five stages of agentic maturity. Each stage has a primary capability, a characteristic risk, and a set of prerequisites that must be met before the transition to the next stage is made.

The transitions between stages represent capability prerequisites, not just temporal sequence. An organisation can take months to move between adjacent stages, or years; the timeline is less important than whether the prerequisites are genuinely in place.


Stage 1: Augmented Assistance

What it looks like: AI is used as a productivity tool by individuals. It drafts documents, summarises meetings, answers questions, generates code suggestions. Outputs are reviewed and accepted or rejected by a human before any action is taken. The AI has no system access, no tool calls, no persistent memory across sessions.

Why it is the right starting point: Stage 1 is where organisations build intuition. Teams learn what the model is good at, where it hallucinates, what prompting strategies produce reliable outputs, and what domains it should not be trusted with. This intuition is not available any other way — it must be earned through use.

The characteristic risk: The risk at Stage 1 is not operational — it is reputational and process-related. Teams that use AI tools informally, without logging, without governance, and without feedback mechanisms are building hidden technical debt. When the organisation wants to formalise and scale these capabilities, there is no data on what worked, no record of what the AI touched, and no evaluation baseline to improve from.

Prerequisites for Stage 2 transition:

  • A baseline prompt library and style guide for common use cases
  • An informal feedback mechanism for capturing what works and what fails
  • Basic data handling policies: what can and cannot be passed to an external model
  • Awareness training for at least the teams that will move to Stage 2

Key takeaway: Stage 1 is not a holding pattern. It is the period during which organisational AI literacy is built — and the quality of the intuition developed here directly determines how well Stage 2 is scoped.


Stage 2: Assisted Automation

What it looks like: AI is embedded into defined workflows. It executes sequences of steps — retrieving data, formatting outputs, generating structured artefacts — but within narrow, well-specified scope boundaries. A human reviews and approves outputs before they trigger downstream actions. Tool access is limited and explicit: the agent can read, but cannot write; can draft, but cannot send; can query, but cannot transact.

Why Stage 2 is where most programmes stall: The jump from Stage 1 to Stage 2 requires engineering infrastructure that does not exist in a Stage 1 deployment. Logging, evaluation suites, scope enforcement, and basic monitoring are not glamorous — they do not appear in product demos — but they are the load-bearing components of every higher stage. Organisations that skip this infrastructure to reach Stage 3 faster invariably rebuild it under pressure, during an incident, at considerably higher cost. Survey data tracks the stall point: as of mid-2025, approximately two-thirds of organisations using AI had not yet begun scaling it across the enterprise, with most still in the experimenting or piloting stages — a distribution consistent with widespread Stage 1–2 deployment and limited Stage 3 infrastructure.1 The production transition compounds the challenge: use cases estimated to take three months routinely extend to eighteen months once integration complexity emerges, and failures that were learning opportunities in pilots become business risks in production.8

The characteristic risk: Scope drift. Stage 2 workflows are designed with explicit boundaries, but agents are inventive in the absence of explicit prohibitions. A workflow designed to draft customer emails will, if given the opportunity, attempt to send them. The scope boundary must be enforced at the tool layer — by not giving the agent a send tool — rather than at the prompt layer, where it can be argued around. The principle is architectural: restrict by construction, not by instruction.
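
As a minimal sketch of "restrict by construction", consider a tool registry in which the send capability is simply never registered. The class and function names here (ToolRegistry, draft_email, ToolNotAvailable) are hypothetical and do not correspond to any particular agent framework.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

class ToolNotAvailable(Exception):
    """Raised when the agent requests a tool that was never registered."""

@dataclass
class ToolRegistry:
    _tools: Dict[str, Callable[..., str]] = field(default_factory=dict)

    def register(self, name: str, fn: Callable[..., str]) -> None:
        self._tools[name] = fn

    def call(self, name: str, **kwargs) -> str:
        # The boundary lives here, not in the prompt: unknown tools cannot run.
        if name not in self._tools:
            raise ToolNotAvailable(name)
        return self._tools[name](**kwargs)

def draft_email(to: str, body: str) -> str:
    return f"DRAFT to={to}: {body}"

# Stage 2 registry: drafting is available, sending is not registered at all.
stage2_tools = ToolRegistry()
stage2_tools.register("draft_email", draft_email)
```

However the model is prompted, a request for "send_email" fails at the registry, because there is nothing to call. A prompt-level prohibition offers no equivalent guarantee.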

Infrastructure checklist before moving to Stage 3:

Capability | Why it is required
Structured output logging | Enables audit, debugging, and evaluation
Evaluation suite (task-level) | Baseline for detecting regression when models update
Scope enforcement at tool layer | Prevents drift beyond intended boundaries
Human feedback capture | Feeds improvement cycle; surfaces edge cases
Incident classification | Distinguishes model failure, prompt failure, data failure
Basic access control | Ensures agents access only the data the use case requires
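
Several of these checklist items meet in the shape of a single log record. The sketch below assumes one record per agent run; the field names and the three failure classes are illustrative choices that mirror the checklist rather than a prescribed schema.

```python
import json
from dataclasses import dataclass, asdict
from enum import Enum
from typing import List, Optional

class FailureClass(str, Enum):
    MODEL = "model_failure"
    PROMPT = "prompt_failure"
    DATA = "data_failure"

@dataclass
class AgentLogRecord:
    run_id: str
    workflow: str
    tool_calls: List[str]
    output: str
    human_verdict: Optional[str] = None          # e.g. "approved" or "rejected"
    failure_class: Optional[FailureClass] = None  # set during incident triage

    def to_json(self) -> str:
        # Sorted keys keep records stable for diffing and downstream parsing.
        return json.dumps(asdict(self), sort_keys=True)
```

Capturing the human verdict and the failure class in the same record as the output is what later allows an evaluation suite to be built from real cases instead of synthetic ones.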

Two cautions on evaluation design inform how this checklist should be applied. First, because agents' behaviour is open-ended, probabilistic, and shaped by system-level interactions over time, evaluation cannot be a terminal checkpoint — it must be embedded as a continuous governing function across the deployment lifecycle, unifying development-time testing with runtime monitoring in a closed feedback loop.4 Second, single-trial success rates systematically overstate reliability: task-level pass rates can more than halve when the same agent handles the same task type repeatedly under realistic variation, which means evaluation suites should incorporate multi-run consistency testing alongside single-pass accuracy metrics.5
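
The second caution can be illustrated numerically. The sketch below compares a single-run pass rate with an estimate of the chance that k repeated runs of the same task all succeed. The estimator is written in the spirit of the consistency metric used in the τ-bench work cited above; treat the exact formulation, and the task names and counts, as assumptions to check against your own evaluation design.

```python
from math import comb

def pass_at_1(successes: int, trials: int) -> float:
    """Fraction of observed runs that succeeded."""
    return successes / trials

def pass_all_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k runs of a task all succeed, given
    `successes` out of `trials` observed runs (requires k <= trials)."""
    return comb(successes, k) / comb(trials, k)

def suite_summary(results, k):
    """results maps task id -> (successes, trials); returns both averages."""
    n = len(results)
    single = sum(pass_at_1(c, t) for c, t in results.values()) / n
    consistent = sum(pass_all_k(c, t, k) for c, t in results.values()) / n
    return single, consistent

# Hypothetical results: 8 runs per task.
example = {"refund": (6, 8), "rebook": (4, 8)}
single_rate, consistent_rate = suite_summary(example, k=4)
```

In this made-up example the single-run average is 0.625, while the four-run consistency estimate is roughly 0.11, which is the kind of gap a single-pass metric hides.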

Key takeaway: Stage 2 is not about the sophistication of the AI. It is about building the infrastructure that makes Stage 3 operable without requiring a human to watch every step.


Stage 3: Supervised Agency

What it looks like: Agents take sequences of actions autonomously within defined scope, with human oversight concentrated at decision points rather than applied to every output. A human does not review each step — but they review consequential steps, sample outputs systematically, and receive alerts when the agent operates outside its confidence band or encounters an edge case.

This is the stage at which agentic AI begins to generate the productivity gains that justify the investment. It is also the stage at which the consequences of infrastructure gaps become operational rather than theoretical.

The escalation design problem: Most Stage 3 failures are not failures of the AI — they are failures of the escalation design. The agent encounters something outside its scope, has no escalation path, and either halts (losing the value of autonomy) or proceeds (risking an error). Designing the escalation logic — what triggers human review, how it is routed, how quickly it must be resolved, what happens if it is not — is as important as designing the agent itself and is consistently underweighted in deployment planning.
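
The four design questions (trigger, routing, deadline, and timeout behaviour) can be written down as data before any agent is built. The following is a minimal sketch; the trigger names, queues, and SLA values are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EscalationRule:
    trigger: str       # condition raised by the agent or the monitor
    route_to: str      # queue or role that reviews the case
    sla_minutes: int   # how quickly a human must respond
    on_timeout: str    # what happens if nobody responds: "halt" or "rollback"

RULES = {
    "low_confidence": EscalationRule("low_confidence", "ops_review", 60, "halt"),
    "out_of_scope": EscalationRule("out_of_scope", "workflow_owner", 30, "halt"),
    "irreversible_action": EscalationRule("irreversible_action", "duty_manager", 15, "rollback"),
}

def decide(trigger, minutes_waiting, rules=RULES):
    """Return (action, destination) for a raised trigger."""
    rule = rules.get(trigger)
    if rule is None:
        # Unknown situations escalate to a default owner instead of proceeding.
        return ("escalate", "workflow_owner")
    if minutes_waiting > rule.sla_minutes:
        return (rule.on_timeout, rule.route_to)
    return ("escalate", rule.route_to)
```

The design choice worth noting is that no rule resolves to "proceed" on timeout, so an unanswered escalation never silently becomes an autonomous action.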

The characteristic risk at Stage 3: Over-escalation and under-escalation are both failure modes, but they fail in opposite directions. A system that escalates too aggressively is not autonomous — it is an elaborate routing mechanism that consumes human attention at scale. A system that escalates too rarely produces confident errors that reach downstream consequences undetected. Calibrating the threshold is an empirical exercise, not a design decision made once at deployment. The evaluation infrastructure from Stage 2 is what makes that calibration possible. The organisational importance of this calibration is empirically grounded: among the management practices most strongly associated with AI high performance, having defined processes for when model outputs require human validation ranks as the single highest-relative-importance differentiator — above governance, talent strategy, or technology infrastructure.1
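
Treating the threshold as an empirical quantity might look like the sketch below. It assumes the agent emits a confidence score and that a sample of production tasks has been human-reviewed for correctness; both assumptions, along with the sample data, are illustrative.

```python
def sweep_thresholds(samples, thresholds):
    """samples: list of (confidence, was_correct) from reviewed production tasks.
    Returns rows of (threshold, escalation_rate, undetected_error_rate)."""
    n = len(samples)
    rows = []
    for t in thresholds:
        escalated = sum(1 for conf, _ in samples if conf < t)
        undetected = sum(1 for conf, ok in samples if conf >= t and not ok)
        rows.append((t, escalated / n, undetected / n))
    return rows

def pick_threshold(rows, max_undetected_error_rate):
    """Lowest-escalation threshold that keeps undetected errors within budget."""
    within_budget = [r for r in rows if r[2] <= max_undetected_error_rate]
    return min(within_budget, key=lambda r: r[1])[0] if within_budget else None

# Hypothetical reviewed sample.
reviewed = [(0.95, True), (0.9, True), (0.85, False), (0.7, True),
            (0.6, False), (0.4, False), (0.3, True), (0.99, True)]
rows = sweep_thresholds(reviewed, [0.5, 0.8, 0.9])
```

Each row makes the trade-off explicit: raising the threshold lowers undetected errors and raises the share of tasks that consume human attention. Re-running the sweep as the task distribution shifts is what keeps the calibration empirical.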

Prerequisites for Stage 4 transition:

  • Escalation logic tested and tuned against real-world task distributions
  • Monitoring and alerting operational and actively reviewed
  • Governance framework: clear ownership of the agent's decisions and their consequences
  • Human roles explicitly redesigned around supervisory rather than execution functions
  • At least one full incident cycle completed: a failure occurred, was detected by the monitoring system, was investigated, and produced a change to the agent's scope or tooling

Stage 4: Delegated Operations

What it looks like: Agents handle end-to-end operational workflows with minimal human intervention in the execution path. Humans set objectives, review outcomes, and handle the exception cases that escalation logic routes to them. The agent manages its own retry logic, tool failures, and partial results. In some configurations, it manages sub-agents — spinning up specialised instances to handle components of a larger task.

At this stage, the agent is not augmenting a human process. It is running a process. The human's role has shifted from operator to supervisor, and the supervision is increasingly asynchronous — reviewing dashboards and exception queues rather than individual outputs.

What changes about risk at Stage 4: The consequence surface widens. Errors no longer affect individual tasks — they affect the process. A miscalibrated pricing agent at Stage 2 produces a wrong draft that a human catches. The same agent at Stage 4 may have processed thousands of quotes before an anomaly alert fires. The damage is not catastrophic if detection and recovery mechanisms are in place. It is severe if they are not.
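
A detection mechanism for the pricing example could be as simple as the rolling check sketched here. The window size, z-score limit, and minimum history are assumed values, and a production system would likely need more than a point-anomaly test.

```python
from collections import deque
from statistics import mean, pstdev

class QuoteMonitor:
    def __init__(self, window=50, z_limit=3.0, min_history=10):
        self.history = deque(maxlen=window)
        self.z_limit = z_limit
        self.min_history = min_history

    def observe(self, price: float) -> bool:
        """Return True if `price` is anomalous relative to recent history."""
        anomalous = False
        if len(self.history) >= self.min_history:
            mu, sigma = mean(self.history), pstdev(self.history)
            if sigma > 0 and abs(price - mu) / sigma > self.z_limit:
                anomalous = True
        if not anomalous:
            # Keep the baseline free of flagged values.
            self.history.append(price)
        return anomalous
```

Checking every quote as it is issued bounds the damage to the quotes produced between the fault and the first alert. Note that this sketch catches sudden outliers only; a slow drift in pricing would pass unnoticed and needs a separate check against a fixed reference.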

This is why the prerequisites for Stage 4 include not just operational readiness but governance readiness: clear accountability for the agent's decisions, a defined audit trail, a tested incident response procedure, and an answer to the question posed at the end of Chapter 15 — if this agent makes a consequential error, what is the detection mechanism, and what is the recovery path?

Stage 4 capability requirements:

Domain | Requirement
Architecture | Agent manages retry logic, tool failures, and partial results independently
Evaluation | System-level evaluation suite, not just task-level
Monitoring | Real-time anomaly detection with defined escalation SLAs
Governance | Documented accountability chain; audit trail per Chapter 25
Recovery | Tested rollback or correction procedures for each consequential action type
Human roles | Job design updated to reflect supervisory rather than execution function
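
The recovery requirement lends itself to an automated coverage check: every consequential action type should map to a rollback or correction procedure. The action names and handlers below are hypothetical, and the check confirms only that a procedure exists, not that it has been tested.

```python
# Hypothetical set of action types the agent can take with real-world effect.
CONSEQUENTIAL_ACTIONS = {"issue_quote", "apply_discount", "send_invoice"}

# Each handler returns the compensating action for a given record.
ROLLBACKS = {
    "issue_quote": lambda record: {"action": "withdraw_quote", "id": record["id"]},
    "apply_discount": lambda record: {"action": "remove_discount", "id": record["id"]},
}

def uncovered_actions(actions=CONSEQUENTIAL_ACTIONS, rollbacks=ROLLBACKS):
    """Action types with no registered recovery procedure."""
    return sorted(a for a in actions if a not in rollbacks)
```

Here uncovered_actions() reports "send_invoice", which would block a Stage 4 approval until a correction procedure is defined and exercised.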

The rarity of Stage 4 readiness in practice is worth stating directly: survey evidence finds that 84% of companies have not redesigned jobs or the nature of work around AI capabilities, and only 21% report having a mature governance model for autonomous agents — making the Stage 4 prerequisites not aspirational targets but genuinely exceptional among deploying organisations.8

Key takeaway: Stage 4 is not a technical milestone. It is an organisational one. The technical infrastructure is necessary but not sufficient — the governance, accountability, and human role redesign must be in place before consequential processes are delegated.


Stage 5: Orchestrated Intelligence

What it looks like: Networks of specialised agents collaborate on complex, multi-domain objectives. A goal is decomposed by an orchestrating layer; subtasks are routed to domain agents; outputs are aggregated, validated, and synthesised. No single agent sees the full task. Coordination is managed by the system, not by a human directing individual agents.

This is the architecture described in Chapter 5 — hierarchical multi-agent systems with orchestrators, specialist subagents, and dedicated verification agents. At Stage 5, these patterns are operational at scale, across multiple business functions, with agents that interact with both internal systems and external services.

Why Stage 5 is genuinely different in kind: The failure modes of multi-agent systems include failure modes that do not exist in single-agent deployments. Error propagation — a hallucination generated by one agent accepted and amplified by others downstream — is the most significant. Detection is harder because the error may not be visible in any single agent's outputs; it only emerges in the final synthesis. This requires system-level evaluation that specifically tests the pipeline as a whole, not individual agents in isolation.
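
A pipeline-level test of this kind can be sketched with stub agents. The two-agent pipeline, the trusted-source rule, and the injected claim are all invented for illustration; the point is the shape of the test, which injects a known-bad output into one agent and asserts on the final synthesis.

```python
def research_agent(question):
    return {"claim": "Q3 revenue was 12M", "source": "finance_db"}

def verifier_agent(finding, trusted_sources=("finance_db",)):
    """Reject findings whose source is not on the trusted list."""
    return finding["source"] in trusted_sources

def pipeline(question, research=research_agent, verify=verifier_agent):
    finding = research(question)
    if not verify(finding):
        return {"status": "escalated", "answer": None}
    return {"status": "ok", "answer": finding["claim"]}

def injected_research(question):
    # Known-bad output standing in for a hallucination from one agent.
    return {"claim": "Q3 revenue was 120M", "source": "unverified_memo"}

def error_propagates(bad_claim="120M", verify=verifier_agent):
    """True if the injected error reaches the pipeline's final answer."""
    result = pipeline("Q3 revenue?", research=injected_research, verify=verify)
    return result["answer"] is not None and bad_claim in result["answer"]
```

With the verifier in place, error_propagates() is False. Swapping in a verifier that accepts everything makes it True, which is exactly the regression a system-level suite exists to catch.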

Stage 5 is also where the platform decisions made in earlier stages become either assets or constraints. Organisations that built on interoperable standards — MCP for tool integration, A2A for agent-to-agent communication — find multi-agent orchestration technically tractable. Organisations that accumulated proprietary integrations at each earlier stage find that the integration debt dominates the engineering effort required to connect agents effectively.

The honest assessment: As of 2026, Stage 5 represents the frontier of enterprise agentic deployment. A small number of technology-intensive organisations operate systems that approach this description. Most enterprises that believe they are at Stage 5 are operating sophisticated Stage 4 systems with some multi-agent components. Survey data confirms the rarity: in 2025, no more than 10% of respondents reported scaling AI agents in any individual business function, and most of those doing so were scaling in only one or two functions — making fully orchestrated multi-agent deployment across an enterprise the exception rather than an aspirational norm.1 Empirical analysis of multi-agent frameworks further finds that performance gains over equivalent single-agent baselines are frequently minimal on standardised benchmarks, suggesting that architectural complexity does not reliably translate into proportional capability improvement — and that the case for Stage 5 must rest on demonstrated system-level evaluation rather than framework sophistication.7 The distinction matters less than the underlying capability: are the evaluation, monitoring, and governance systems robust enough to operate at the complexity level the architecture requires?


The Diagnostic Assessment

The following questions are designed to surface the gaps most likely to cause a transition to fail. They are not a checklist: answering every question affirmatively does not guarantee success. They are a diagnostic: a single negative answer in a prerequisite category indicates a gap that should be addressed before the transition is attempted.

Readiness for Stage 2

  1. Do you have a data handling policy that specifies what can be passed to external AI services?
  2. Do you have at least one team that can articulate, from their own experience, what the AI handles well and where it fails?
  3. Is there a named owner for the first automated workflow you intend to build?

Readiness for Stage 3

  1. Is structured output logging in place and actively reviewed?
  2. Have you run an evaluation suite against your Stage 2 workflows and established baseline performance metrics?
  3. Can you describe, in precise terms, what actions your Stage 2 agent is architecturally prohibited from taking — and why that prohibition is enforced at the tool layer rather than the prompt layer?
  4. Do you have a human feedback mechanism that captures failures and routes them into an improvement cycle?

Readiness for Stage 4

  1. Has your escalation logic been tuned against a real production task distribution — not a synthetic test set?
  2. Have you completed at least one full incident cycle: failure occurred, monitoring detected it, investigation identified the cause, and a change was made?
  3. Do you have a documented accountability chain — a named individual responsible for each category of consequential action the agent takes?
  4. Have the human roles that interact with the agent been explicitly redesigned for supervisory rather than execution function?

Readiness for Stage 5

  1. Do your individual Stage 4 agents have system-level evaluation coverage — or only task-level?
  2. Have you tested error propagation explicitly: does a known error injected into one agent's output corrupt the pipeline's final synthesis?
  3. Are your tool integrations built on interoperable standards, or on proprietary connectors that will need to be rebuilt when the agent topology changes?
  4. Is your governance framework — accountability, audit trail, incident response — defined at the system level, not just per-agent?
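
The diagnostic logic above, in which one negative answer blocks a transition, can be encoded directly. The keys are shorthand labels invented here for the questions. The sketch also assumes that gaps from earlier stages carry forward, and that unanswered questions count as negative; both are interpretive choices.

```python
DIAGNOSTIC = {
    2: ["data_handling_policy", "team_can_articulate_limits", "named_workflow_owner"],
    3: ["logging_reviewed", "eval_baseline", "tool_layer_prohibitions", "feedback_loop"],
    4: ["escalation_tuned_on_production", "incident_cycle_completed",
        "accountability_chain", "roles_redesigned"],
    5: ["system_level_eval", "error_propagation_tested",
        "interoperable_integrations", "system_level_governance"],
}

def gaps_for(target_stage, answers):
    """Unmet prerequisites for the target stage and every stage before it."""
    return [q for stage in range(2, target_stage + 1)
            for q in DIAGNOSTIC[stage]
            if not answers.get(q, False)]

def ready_for(target_stage, answers):
    # A single gap blocks the transition; an empty list only removes the block.
    return not gaps_for(target_stage, answers)
```

A True result from ready_for means no known blocker was found. Consistent with the caveat above, it does not mean the transition will succeed.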

Common Sequencing Errors

Three patterns appear consistently in programmes that stall or regress:

Skipping Stage 2 infrastructure to reach Stage 3 faster. The argument is that Stage 2 infrastructure — logging, evaluation, scope enforcement — adds delay without adding visible capability. This is true in the short term. In the medium term, the absence of evaluation infrastructure means that when Stage 3 agents behave unexpectedly, there is no baseline to diagnose against, no logged history to investigate, and no mechanism to distinguish a model failure from a prompt failure from a data failure. The diagnostic work that should have taken a day takes weeks. If the failure is public, the reputational cost compounds. Organisations that skip this infrastructure to reach Stage 3 faster invariably rebuild it under pressure, during an incident, at considerably higher cost. The scale of the readiness illusion that drives this error is empirically documented: a survey of nearly 4,000 business leaders and technical practitioners found that 87% of leaders believed their organisation's data ecosystem was capable of deploying AI at scale, while only 41% had successfully scaled AI solutions — and 70% of technical practitioners reported spending up to four hours daily resolving data problems their leaders believed were already solved.9

Treating governance as a Stage 4 problem. Governance requirements — accountability chains, audit trails, incident response procedures — are not relevant only when agents reach consequential autonomy. They need to be designed from Stage 2 onwards, because the decisions made at Stage 2 about data access, logging granularity, and scope boundaries either create or foreclose governance options at Stage 4. Retrofitting governance onto a Stage 4 deployment is significantly more expensive than building it in progressively. The urgency is compounded by the immaturity of the tooling: Gartner's 2025 AI Hype Cycle places AI Governance Platforms at the Innovation Trigger stage — the earliest point on the curve — meaning the platforms organisations will eventually rely on for audit, accountability, and compliance are still years from mainstream viability.2 The governance infrastructure gap is also measurable in current deployments: as of late 2025, only 64% of enterprises had adopted data security policies and 61% had training programmes for responsible AI use, leaving roughly a third of organisations deploying AI without the foundational governance structures that Stage 2 and above require.6

Declaring Stage 5 before Stage 4 is stable. The pressure to describe multi-agent architectures — from vendors, from engineers excited about the technology, from executives who have read about the potential — is real and consistent. Organisations that architect for Stage 5 before Stage 4 processes are stable end up with complex systems that nobody fully understands, monitoring that does not cover the orchestration layer, and failure modes that emerge only when multiple agents interact in production. The result is typically a forced simplification that could have been avoided by sequencing correctly.

Key takeaway: Every sequencing error described above is a form of the same mistake — optimising for visible complexity over invisible infrastructure. The infrastructure is what makes the complexity safe to operate.


The Governance Readiness Parallel

Agentic maturity and governance maturity must develop in parallel. An organisation with Stage 4 technical capability and Stage 1 governance is not at Stage 4 — it is at Stage 1, operating under the illusion of Stage 4. The technical capability enables consequential actions; the governance deficit means those actions are taken without accountability, without audit trail, and without the institutional clarity required to respond when something goes wrong. The scale of this mismatch in current deployments is not hypothetical: nearly three in four companies plan to deploy agentic AI within two years, yet only one in five report a mature model for governing autonomous agents — a gap that defines the primary risk of premature Stage 4 and Stage 5 transitions.8

The governance topics addressed in Part 7 of this book — regulatory compliance, testing and verification, oversight design, audit trail construction — are not add-ons to the technical programme. They are prerequisites for specific stages. A governance readiness column should sit alongside the technical readiness column in every stage-gate review.

The organisation that moves technical capability ahead of governance readiness is not accelerating transformation — it is accumulating liability. The organisation that develops governance readiness ahead of technical capability is building institutional trust that will pay dividends when higher-stage deployments face scrutiny.


Using the Framework Practically

Three use cases drive most applications of this framework in enterprise settings:

Portfolio assessment. A large organisation running multiple agentic initiatives simultaneously is almost certainly operating at different stages in different parts of the business. Mapping each initiative to its current stage — honestly, based on what infrastructure is actually in place rather than what is planned — identifies where shared infrastructure investments (logging platforms, evaluation frameworks, governance tooling) would generate the most leverage across the portfolio.
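
A portfolio assessment of this kind reduces to a small amount of bookkeeping. In the sketch below, the initiative names, stage numbers, and infrastructure labels are invented. The effective_stage property applies the rule from the governance section that a governance deficit caps the real stage.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Initiative:
    name: str
    technical_stage: int
    governance_stage: int
    missing_infrastructure: set = field(default_factory=set)

    @property
    def effective_stage(self) -> int:
        # Governance deficits cap the stage an initiative is really operating at.
        return min(self.technical_stage, self.governance_stage)

def shared_gaps(portfolio):
    """Infrastructure gaps ranked by how many initiatives they affect."""
    counts = Counter(gap for i in portfolio for gap in i.missing_infrastructure)
    return counts.most_common()

portfolio = [
    Initiative("claims_triage", 3, 2, {"evaluation_framework", "logging_platform"}),
    Initiative("quote_drafting", 2, 2, {"evaluation_framework"}),
    Initiative("supplier_onboarding", 4, 1,
               {"evaluation_framework", "logging_platform", "governance_tooling"}),
]
```

Ranking gaps by the number of initiatives they block gives a first-pass ordering for shared investment, and comparing technical_stage with effective_stage shows where readiness is being overstated.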

Stage-gate design. The prerequisites tables in each stage description above can be adapted into the approval criteria for a stage-gate review process. The value of this is not bureaucratic control — it is clarity. When a team knows exactly what infrastructure must be in place before a Stage 3 deployment is approved, they build it. When the criteria are vague, the infrastructure gap is rationalised away.

Board and executive communication. The maturity framework provides a vocabulary for communicating agentic programme status to non-technical audiences. "We are operating three workflows at Stage 3 and are building the governance infrastructure required to move two of them to Stage 4" is more informative than "we are scaling our agentic AI programme." It communicates ambition, progress, and the nature of the work remaining — without overstating readiness or understating risk.


References

  1. McKinsey & Company (2025). The State of AI in 2025: Agents, Innovation, and Transformation. QuantumBlack, AI by McKinsey. November 2025.
  2. Gartner (2025). Hype Cycle for Artificial Intelligence, 2025 (ID: G00828523). Gartner, Inc. June 2025.
  3. Andreessen Horowitz (2025). How 100 Enterprise CIOs Are Building and Buying Gen AI in 2025. Andreessen Horowitz. June 2025.
  4. Xia, B., Lu, Q., Zhu, L., Xing, Z., Zhao, D., & Zhang, H. (2024). Evaluation-Driven Development and Operations of LLM Agents: A Process Model and Reference Architecture. CSIRO's Data61 / Adelaide University / UNSW. arXiv:2411.13768.
  5. Yao, S., Shinn, N., Razavi, P., & Narasimhan, K. (2024). τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. Sierra.
  6. Wharton Human-AI Research & GBK Collective (2025). Accountable Acceleration: Gen AI Fast-Tracks Into the Enterprise. Wharton Human-AI Research & GBK Collective, University of Pennsylvania. October 2025.
  7. Cemri, M., Pan, M. Z., Yang, S., Agrawal, L. A., et al. (2025). Why Do Multi-Agent LLM Systems Fail? NeurIPS 2025, Track on Datasets and Benchmarks. arXiv:2503.13657.
  8. Deloitte AI Institute (2026). State of AI in the Enterprise: The Untapped Edge. Deloitte Consulting LLP. January 2026.
  9. Capital One / Morning Consult (2024). AI Readiness Survey: Are Companies Prepared for AI Adoption? Capital One. https://www.capitalone.com/tech/ai/ai-readiness-survey/
