Chapter 20 · Crossing the Valley: From Successful Pilot to Production Scale

A successful pilot is evidence that something works under favourable conditions. It is not evidence that it will survive contact with the rest of the organisation.


The Pilot That Never Became Anything

The pattern is familiar enough to have become a joke in enterprise AI circles. A team builds an agent pilot in eight weeks. It performs well, impresses the steering committee, and is declared a success. Six months later, it is still a pilot. A year later, the team has moved on, the model it was built on has been deprecated, and the output of the project is a slide deck and a set of lessons learned that nobody reads.

The evidence on pilot failure is now extensive and converging. MIT's NANDA initiative, in preliminary findings from a multi-method study covering over 300 publicly disclosed AI initiatives and structured interviews with representatives from 52 organisations, found that 95% of enterprise generative AI pilots deliver no measurable P&L impact within six months, with only 5% achieving scaled, profit-relevant production [1]. RAND Corporation's structured interviews with 65 experienced data scientists and engineers confirmed that, by industry estimates, AI projects fail at more than twice the rate of comparable IT projects without AI components, a gap that the practitioners RAND interviewed attributed overwhelmingly to leadership and organisational failures rather than technical ones [2]. S&P Global Market Intelligence's 2025 survey of over 1,000 enterprises across North America and Europe found that the share of companies abandoning most of their AI initiatives rose from 17% in 2024 to 42% in 2025, with the average organisation scrapping 46% of AI proofs of concept before reaching production [3]. The gap between a working pilot and a running production system is what practitioners have come to call the valley.

The valley is not primarily a technical problem. The technical challenges of moving from pilot to production — hardening integrations, building evaluation suites, implementing observability, managing credential rotation — are real and covered in earlier chapters. What kills most pilots is not that these problems prove unsolvable. It is that the organisational and commercial conditions required to solve them never materialise.

Key takeaway: The valley between pilot and production is paved with organisational failures, not technical ones. Understanding what those failures look like is the prerequisite for avoiding them.


The Pilot Trap

Pilots are structurally designed to succeed. This is not cynicism — it is an accurate description of the conditions that make a pilot a reliable test of whether an agent can work, and a poor test of whether it will work in production.

A typical enterprise AI pilot benefits from a set of conditions that production systems do not enjoy:

Controlled scope. Pilots are scoped to a bounded problem with well-understood inputs and outputs. The edge cases that make production hard — ambiguous inputs, unexpected data formats, systems that behave differently at month-end — are often explicitly excluded, handled manually by the pilot team, or simply not encountered during a short evaluation window.

Motivated team. The people building and running a pilot are usually the people who proposed it. They care about its success, they debug problems personally, and they are close enough to the system to recognise and compensate for its limitations. Production systems are operated by people who were handed a system they did not build.

Executive attention. Pilots run under the sponsor's watchful eye. Problems get escalated quickly. Resources are made available. The system gets the benefit of the doubt when it produces imperfect outputs. Production systems operate without this attention, and their failures are judged against the expectations set by the pilot's success.

Forgiving timelines. A pilot that takes twice as long as planned is a minor inconvenience. A production system that takes twice as long as planned is a missed SLA, a budget overrun, and a support ticket.

The result is that the pilot's reported success rate is a poor predictor of production success. It measures whether the technology can work under ideal conditions. It does not measure whether the organisation can support it under normal conditions.

The pilot trap is the mistake of treating pilot success as production readiness. Avoiding it requires explicitly designing for the conditions that production introduces — before the pilot ends.


The Funding Transition Problem

Most enterprise AI pilots live in innovation budgets: ring-fenced funds controlled by a CTO, CDO, or transformation office, designed to enable experimentation without the overhead of standard business justification. This is the right funding model for pilots. It is the wrong funding model for production systems.

Production systems need operational budget: recurring spend owned by the business unit that benefits from the system, integrated into annual planning cycles, and subject to the cost controls and ROI expectations that apply to all operational technology. The transition from innovation budget to operational budget is one of the most common failure points in the journey from pilot to production.

The failure mode is structural. The innovation team that ran the pilot does not own the operational budget of the business unit that benefits from it. The business unit leader who would need to own that budget was often not involved in the pilot and has no institutional relationship with the system being handed to them. They are being asked to take budget responsibility for a system they did not build, do not fully understand, and were not consulted about creating.

Bridging this gap requires deliberate action that most pilot programmes do not take:

Involve the budget owner from the start. The business unit leader whose function the agent will operate in should be involved in the pilot design, not just the readout. Their input into scope, success criteria, and failure modes builds the institutional ownership that makes the budget transition possible.

Define the operational cost model before the pilot ends. What will this system cost to run at production volume? What are the model API costs, the infrastructure costs, the human oversight costs? Surprises in the operational cost model are a common reason business unit leaders decline to take ownership of systems handed to them.

Plan the handover as a project, not an event. The transition from innovation budget to operational budget should be treated as a distinct project phase with its own milestones, stakeholder communication, and executive sponsorship — not as an automatic consequence of pilot success.
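
The cost questions above can be turned into a small run-rate model that the budget owner can inspect and challenge. The sketch below is illustrative only: the class, the function, and every figure in it are hypothetical placeholders rather than benchmarks, and a real model would use the organisation's own volumes and rates.

```python
from dataclasses import dataclass


@dataclass
class RunRateAssumptions:
    """Inputs to a monthly run-rate estimate. All values are hypothetical."""
    tasks_per_month: int
    model_cost_per_task: float      # model API spend per task
    infra_cost_per_month: float     # hosting, monitoring, integrations
    review_rate: float              # fraction of tasks a human reviews (0 to 1)
    review_minutes_per_task: float  # reviewer time per reviewed task
    reviewer_cost_per_hour: float   # loaded hourly cost of a reviewer


def monthly_run_rate(a: RunRateAssumptions) -> dict[str, float]:
    """Split monthly operating cost into model, infrastructure, and oversight."""
    if a.tasks_per_month <= 0:
        raise ValueError("tasks_per_month must be positive")
    model = a.tasks_per_month * a.model_cost_per_task
    review_hours = a.tasks_per_month * a.review_rate * a.review_minutes_per_task / 60
    oversight = review_hours * a.reviewer_cost_per_hour
    total = model + a.infra_cost_per_month + oversight
    return {
        "model": model,
        "infrastructure": a.infra_cost_per_month,
        "oversight": oversight,
        "total": total,
        "per_task": total / a.tasks_per_month,
    }


# Same unit assumptions at pilot volume and at production volume.
pilot = RunRateAssumptions(500, 0.12, 3000.0, 0.2, 3.0, 60.0)
production = RunRateAssumptions(20_000, 0.12, 3000.0, 0.2, 3.0, 60.0)

for label, scenario in (("pilot", pilot), ("production", production)):
    costs = monthly_run_rate(scenario)
    print(label, {name: round(value, 2) for name, value in costs.items()})
```

With these made-up inputs, infrastructure dominates at pilot volume while human oversight becomes the largest line at production volume. That shift is the kind of surprise worth surfacing before the handover conversation rather than during it.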

Key takeaway: A pilot that ends without a named operational budget owner and a defined cost model has not crossed the valley — it has parked at its edge.


Sponsor Risk

Every successful enterprise AI pilot has a sponsor: an executive who championed the idea, created the conditions for the pilot to run, and whose authority protected the team from the organisational friction that innovation always encounters.

Sponsor risk is the risk that this person moves on before the system is embedded. In large enterprises, executives typically stay in a given role for roughly two to four years, and pilot-to-production transitions routinely take longer than planned. The intersection of these two facts explains a significant fraction of pilots that stall despite initial success.
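
A back-of-the-envelope calculation shows how quickly these two facts intersect. The sketch below assumes, purely for illustration, that sponsor departures occur at a constant monthly rate (an exponential tenure model) and that mean tenure is 36 months, the midpoint of the two-to-four-year range. Neither assumption comes from the sources cited in this chapter, and the function name is hypothetical.

```python
import math


def departure_probability(transition_months: float, mean_tenure_months: float) -> float:
    """Chance the sponsor leaves during the transition, assuming a constant departure rate."""
    if transition_months < 0 or mean_tenure_months <= 0:
        raise ValueError("durations must be non-negative, and mean tenure positive")
    return 1.0 - math.exp(-transition_months / mean_tenure_months)


MEAN_TENURE_MONTHS = 36  # assumed midpoint of the two-to-four-year range

for months in (9, 18):
    p = departure_probability(months, MEAN_TENURE_MONTHS)
    print(f"{months}-month transition: {p:.0%} chance the sponsor has moved on")
```

Under these assumptions, a nine-month transition carries roughly a one-in-five chance of losing the sponsor, and a transition that slips to eighteen months carries roughly two in five. The exact figures matter less than the direction: every month of delay adds sponsor risk.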

When a sponsor departs, several things change simultaneously. The project loses its escalation path — problems that the sponsor would have resolved with a phone call now require navigating normal organisational processes. The project loses its political protection — competing priorities that the sponsor was suppressing surface. And the project loses its narrative owner — the person who could explain why this system matters and advocate for the resources it needs.

McKinsey's 2025 global survey of nearly 2,000 organisations across 105 countries found that AI high performers (organisations reporting meaningful enterprise-level financial impact) were three times more likely than peers to have senior leaders who actively championed AI initiatives over time, including continued funding through budget reprioritisation cycles and visible role modelling of AI adoption [4]. The implication runs in both directions: sustained sponsorship is a leading indicator of production success, and its absence is a leading indicator of stall.

Organisations that manage sponsor risk well do not simply accept it as a fact of life. They actively reduce the system's dependence on any single sponsor through three mechanisms:

Distributed ownership. Multiple senior stakeholders with skin in the game — not just a single sponsor but a coalition that includes the business unit leader, the CIO, and at least one operational manager whose team uses the system daily.

Embedded value. The faster a system becomes operationally embedded — handling real work, measured by real metrics, integrated into real processes — the harder it is to pause or cancel regardless of who the sponsor is. Pilots that remain in limited scope specifically to preserve optionality are maximising sponsor risk.

Documented value realisation. A clear, regularly updated record of what the system has delivered, for whom, and at what cost provides the evidence that a new sponsor needs to continue supporting a programme they inherited.


The Handoff Problem

The team that builds a pilot is optimised for building. The team that operates a production system is optimised for running. These are different skills, different working rhythms, and often different people.

In most enterprise AI programmes, the transition from build to run is handled implicitly: the build team finishes the pilot, documents what they can, and moves on to the next project. The operations team receives a system they did not build, documentation of varying quality, and the expectation that they will maintain and improve it.

This handoff creates a knowledge gap that is one of the most common sources of production instability. The build team accumulated implicit knowledge during the pilot: which edge cases matter, which model behaviours require prompt adjustments, which integrations are fragile and why, which outputs require human review regardless of the agent's confidence. None of this knowledge transfers reliably through documentation. Most of it transfers through the people who held it — and most of those people have moved on.

RAND's practitioner interviews identified a consistent downstream consequence: when data engineers and model builders departed, organisations without effective documentation often could not determine which datasets were reliable or how their meaning had changed over time. The operations team was then forced to rediscover knowledge that had already been paid for once, at the cost of time the programme rarely had [2].

Mitigating the handoff problem requires treating it as a design constraint, not a transition event:

Cross-train operations from the start. The people who will operate the production system should be involved in the pilot, not as observers but as active participants who understand the system's behaviour and its limitations.

Build for operability, not just capability. Systems that are easy to operate — with clear runbooks, obvious monitoring dashboards, straightforward rollback procedures, and well-defined escalation paths — survive handoffs better than systems optimised purely for performance.

Plan for a parallel-run period. The pilot team and the operations team should run the system jointly for a defined period before the handoff is complete. Problems discovered during the parallel run are cheaper to fix than problems discovered after the handoff is done.


Sequencing the Scale-Up

The decision of where to expand after a successful pilot is not obvious. The instinct is usually to add more use cases — if the agent works for use case A, let us now add B, C, and D. This instinct produces breadth at the cost of depth, and it is often the wrong call.

The alternative is to deepen the production system before expanding it: drive the initial use case to full production maturity — stable operations, good monitoring, clear ownership, reliable performance — before attempting to extend scope. This approach is slower to generate visible progress but produces a more durable foundation.

MIT NANDA's data reveals a relevant size dynamic here: mid-market organisations that successfully crossed the pilot-to-production threshold did so in roughly 90 days on average, while large enterprises took nine months or longer. The gap was driven not by resources but by where implementation authority was concentrated. Organisations that decentralised ownership to line managers while retaining executive accountability moved faster and sustained deployment more reliably than those that kept control centralised in transformation offices [1].

The sequencing decision should be driven by two factors:

Infrastructure leverage. Some use cases share infrastructure — integrations, evaluation frameworks, monitoring pipelines — in ways that make the second use case substantially cheaper to deploy than the first. Where this leverage exists, expanding is efficient. Where each new use case requires its own infrastructure, expanding is expensive.

Governance maturity. Use cases with higher stakes — those involving financial decisions, HR actions, customer commitments, or regulated data — require more mature governance infrastructure than lower-stakes use cases. The sequencing should ensure that governance infrastructure has been validated at lower stakes before it is relied upon at higher ones.

| Signal | Interpretation |
| --- | --- |
| Operations team is escalating fewer issues week-on-week | System is stabilising: consider scope expansion |
| Evaluation metrics have held steady through a model update | Evaluation infrastructure is reliable: confidence for next use case |
| Budget ownership is clear and operationally embedded | Foundation for expansion is in place |
| The build team is still fixing problems in the original use case | Do not expand: deepen first |
| Human oversight is being reduced because it "seems fine" | Governance is weakening: pause before expanding |
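
The signals above can be combined into a simple expansion gate. The sketch below is one possible encoding, not a prescribed method: the field names are hypothetical, and the precedence rule (any warning signal outweighs the positive ones, and all three positive signals are needed to expand) is an assumption layered on top of the table.

```python
from dataclasses import dataclass


@dataclass
class ScaleSignals:
    escalations_falling: bool              # operations escalations down week-on-week
    evals_stable_after_model_update: bool  # metrics held through a model update
    budget_ownership_clear: bool           # operational budget owner is in place
    build_team_still_fixing: bool          # builders still patching the first use case
    oversight_cut_without_evidence: bool   # reviews reduced because it "seems fine"


def expansion_verdict(s: ScaleSignals) -> str:
    """Return 'pause', 'deepen', or 'expand' for the current set of signals."""
    # Assumed precedence: warning signals are checked before positive ones.
    if s.oversight_cut_without_evidence:
        return "pause"
    if s.build_team_still_fixing:
        return "deepen"
    positives = (
        s.escalations_falling,
        s.evals_stable_after_model_update,
        s.budget_ownership_clear,
    )
    return "expand" if all(positives) else "deepen"


# Hypothetical programme: healthy operations, but builders are still patching.
print(expansion_verdict(ScaleSignals(True, True, True, True, False)))
```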

The Production Readiness Contract

The most effective scale-up programmes make the transition criteria explicit before the pilot begins. A production readiness contract is not a legal document; it is a practical agreement between the innovation team, the operating function, IT, risk, and the executive sponsor about what must be true before the system leaves pilot status.

| Readiness area | Production question | Evidence required |
| --- | --- | --- |
| Ownership | Who owns the agent after the pilot team leaves? | Named business owner, technical owner, and operations owner |
| Economics | What does it cost at production volume? | Run-rate model including model calls, infrastructure, oversight, and support |
| Reliability | How does it behave outside the happy path? | Evaluation results against edge cases and messy production inputs |
| Operability | Can the operations team run it without the builders? | Runbooks, dashboards, rollback procedure, and parallel-run completion |
| Governance | Who is accountable when it makes a consequential error? | Decision rights, escalation path, and audit trail requirements |

This contract prevents a common ambiguity: everyone agrees that the pilot was successful, but nobody agrees on what success obligates them to do next. If the readiness contract is not satisfied, the correct next step is not scale; it is remediation. If it is satisfied, the organisation has a defensible basis for moving from experimentation into operation.
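
One way to keep the contract honest is to record it as data and derive the next step from it, so that "scale" cannot be declared while evidence is still outstanding. The sketch below is a minimal illustration under that assumption. The class and function names are hypothetical, the evidence labels are shortened from the table, and the status shown is invented.

```python
from dataclasses import dataclass, field


@dataclass
class ReadinessArea:
    name: str
    question: str
    evidence_required: list[str]
    evidence_provided: set[str] = field(default_factory=set)

    def missing(self) -> list[str]:
        """Evidence items still outstanding for this area."""
        return [e for e in self.evidence_required if e not in self.evidence_provided]


def next_step(contract: list[ReadinessArea]) -> tuple[str, dict[str, list[str]]]:
    """Return 'scale' only when every area has all of its evidence."""
    gaps = {area.name: area.missing() for area in contract if area.missing()}
    return ("remediate" if gaps else "scale"), gaps


contract = [
    ReadinessArea("Ownership", "Who owns the agent after the pilot team leaves?",
                  ["business owner", "technical owner", "operations owner"]),
    ReadinessArea("Economics", "What does it cost at production volume?",
                  ["run-rate model"]),
    ReadinessArea("Reliability", "How does it behave outside the happy path?",
                  ["edge-case evaluation results"]),
    ReadinessArea("Operability", "Can the operations team run it without the builders?",
                  ["runbooks", "dashboards", "rollback procedure", "parallel run completed"]),
    ReadinessArea("Governance", "Who is accountable when it makes a consequential error?",
                  ["decision rights", "escalation path", "audit trail requirements"]),
]

# Hypothetical status: everything is in place except two items.
for area in contract:
    area.evidence_provided = set(area.evidence_required)
contract[0].evidence_provided.discard("operations owner")
contract[3].evidence_provided.discard("parallel run completed")

decision, gaps = next_step(contract)
print(decision)
for name, items in gaps.items():
    print(f"  {name}: missing {', '.join(items)}")
```

In this invented status the decision is remediation, with the two outstanding items listed by area, which gives the steering committee a concrete to-do list instead of a general sense that the pilot went well.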

Signals of Crossing vs. Stalling

The difference between a programme that is crossing the valley and one that is stalling is often invisible from the outside and obvious in retrospect. Several signals distinguish the two:

Crossing: The system is handling real work, including messy edge cases, and the operations team is managing it without constant support from the build team. Stalling: The system is handling a carefully maintained happy path and degrades immediately when inputs deviate from it.

Crossing: Business unit owners are asking for the agent to do more. They have seen enough performance to want to extend its scope. Stalling: Business unit owners are politely supportive but have quietly maintained the manual processes the agent was supposed to replace.

Crossing: When something goes wrong, there is a clear playbook for what to do. The operations team executes it without escalating to the build team. Stalling: When something goes wrong, everyone looks at the person who built it.

Crossing: The cost of operating the system is understood, tracked, and within the approved operational budget. Stalling: The system is still running on innovation budget while the operational budget question remains unresolved.

These signals do not require formal assessment processes. They surface naturally in the weekly cadence of operations — or they surface as the absence of any weekly cadence at all, which is itself a signal.

Key takeaway: The organisations that cross the valley reliably are not those that build the best pilots. They are those that design explicitly for the conditions that production introduces — before the pilot ends.


References

  1. Challapally, A., Pease, C., Raskar, R., & Chari, P. (July 2025). The GenAI Divide: State of AI in Business 2025. MIT NANDA (Project NANDA, preliminary findings), Massachusetts Institute of Technology.
  2. Ryseff, J., De Bruhl, B., & Newberry, S. (August 2024). The Root Causes of Failure for Artificial Intelligence Projects and How They Can Succeed: Avoiding the Anti-Patterns of AI. Research Report RRA2680-1. Santa Monica, CA: RAND Corporation. https://www.rand.org/pubs/research_reports/RRA2680-1.html
  3. S&P Global Market Intelligence (2025). Voice of the Enterprise: AI & Machine Learning, Use Cases 2025. 451 Research / S&P Global Market Intelligence. Survey of 1,006 respondents in North America and Europe, fielded October–November 2024.
  4. Singla, A., Sukharevsky, A., Yee, L., Chui, M., Hall, B., & Balakrishnan, T. (November 2025). The State of AI in 2025: Agents, Innovation, and Transformation. QuantumBlack, AI by McKinsey.