Chapter 21 · Measuring and Improving Agent Performance

Deployment is not the end of the work. For most agentic systems, it is the beginning of the harder part.

Two Measurement Failures

Enterprise AI programmes typically encounter measurement problems at two distinct moments. The first arrives before and during deployment: organisations track proxies — task completion rates, hours saved, model calls — rather than business outcomes, and the connection between those proxies and actual financial value is assumed rather than demonstrated. By 2025, 72% of enterprise decision-makers were formally measuring generative AI ROI, and three in four reported positive returns.¹ The headline looks reassuring. The detail is more complicated: a parallel review of over 300 enterprise AI implementations found that only 5% reached production with measurable profit-and-loss impact; for the remaining 95%, adoption was real but financial transformation was not.²

The second measurement failure arrives after deployment. An agent is deployed, performance is good, the team moves on to the next project. Six months later, a model update changes some subtle aspect of the agent's reasoning. A data source the agent relies on starts returning different formats. The edge cases that did not appear in the pilot begin accumulating in production. The agent's outputs drift, slowly and invisibly, away from the standard they met at launch. This failure mode — set-and-forget degradation — does not produce error logs or trigger alerts. It surfaces in business outcomes: customer complaints that are slightly more frequent, processes that take slightly longer, outputs that require slightly more human correction. By the time the degradation is visible, it has usually been ongoing for weeks or months.

This chapter addresses both failures. It covers how to build a credible pre-deployment business case, how to track value once agents are live, how to keep performance from degrading over time, and how to communicate results to board-level stakeholders who expect financial clarity that agentic AI rarely delivers cleanly.

Key takeaway: Agent performance degrades silently and business value is rarely measured accurately. The organisations that get this right design measurement into the programme before deployment and build an accountability cadence that runs continuously after it.

Why Standard ROI Frameworks Break Down

A standard software ROI calculation is reasonably tractable: you estimate the cost of the system, estimate the productivity gain it produces, convert that gain to a dollar value, and divide the net gain by the cost. The assumptions are strong but manageable.

Agentic AI breaks four of those assumptions simultaneously.

Non-linear value curves. Agentic systems frequently deliver minimal value at low deployment scale and disproportionate value at high scale, as agents begin to handle the long tail of edge cases that represent the majority of human effort. A system that handles 60% of cases automatically is not two-thirds as valuable as one that handles 90%; it may be only 20% as valuable, because the cases it leaves to humans are the complex ones that consume most of the time. This non-linearity makes early-stage ROI calculations systematically misleading. It is compounded by a structural feature of most current AI deployments: systems that do not retain context across interactions plateau quickly, while those capable of learning from feedback accumulate value over time, so the shape of the value curve depends as much on the system's learning architecture as on its deployment scale.²
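A small worked example can make the non-linearity concrete. The case mix below is an illustrative assumption, not data from the sources cited in this chapter: routine cases are taken to be 60% of volume at 5 minutes of human effort each, complex cases 40% of volume at 40 minutes each, and the agent is assumed to automate the cheapest cases first.

```python
# Hypothetical case mix: (share of case volume, human minutes per case).
CASE_MIX = [
    (0.60, 5.0),   # routine cases
    (0.40, 40.0),  # complex cases
]


def effort_share_automated(coverage: float) -> float:
    """Fraction of total human effort removed when the agent handles
    `coverage` of case volume, taking the cheapest cases first."""
    total_effort = sum(share * minutes for share, minutes in CASE_MIX)
    remaining = coverage
    saved = 0.0
    for share, minutes in sorted(CASE_MIX, key=lambda c: c[1]):
        handled = min(share, remaining)  # cannot handle more than exists
        saved += handled * minutes
        remaining -= handled
    return saved / total_effort


for coverage in (0.60, 0.90):
    share = effort_share_automated(coverage)
    print(f"{coverage:.0%} of cases automated -> {share:.1%} of effort removed")
```

Under these assumed numbers, automating 60% of cases removes roughly 16% of the human effort, while automating 90% removes roughly 79%, so the smaller system delivers about one fifth of the value of the larger one. Different case mixes will give different curves; the shape, not the specific figures, is the point.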

The attribution problem. An agent that assists a human analyst in producing a report creates value — but how much of that value should be attributed to the agent versus the analyst? When the agent handles the data gathering and the analyst handles the interpretation and judgment, the productivity gain from the agent is real but not cleanly separable from the human contribution. Organisations that attribute the full value of the output to the agent overstate ROI; organisations that attribute the full value to the human miss the agent's contribution.

Infrastructure investment payback periods. The evaluation infrastructure, observability tooling, governance frameworks, and integration work that good agentic deployments require are not recoverable in year one. The payback period for the full cost of a production agentic system — not just the model API costs but the complete operational infrastructure — typically runs eighteen to thirty-six months. S&P Global data finds that only 6% of organisations see AI payback in under a year and just 13% within twelve months, with most achieving satisfactory returns within two to four years.³ Standard annual ROI calculations look unfavourable during this period even when the investment is sound.

Delayed workforce effects. The productivity gains from agentic AI often materialise not immediately but as the human workforce adapts its working patterns — shifting from execution to oversight, from manual processing to exception handling. This adaptation takes time, and the value it generates is not visible in the first months of deployment.

Key takeaway: Applying a standard software ROI framework to an agentic AI programme produces numbers that are either discouraging (if costs are counted accurately) or misleading (if they are not). The framework needs to change before the measurement does.

The Pre-Deployment Business Case

A credible pre-deployment business case for an agentic AI programme does three things that most business cases do not: it is honest about uncertainty, it separates infrastructure investment from operational value, and it defines what success looks like in a way that can be verified after deployment.

Modelling the range, not the point estimate. The instinct in business case development is to produce a single number — "this programme will generate £4.2 million in annual savings." That number is almost always wrong, and it creates a hostage-to-fortune problem: when the actual result differs from the projection, the programme is declared a failure regardless of whether the investment was sound. Better practice is to model three scenarios — conservative, base, and optimistic — with explicit assumptions behind each, and to present the range rather than the point.

Separating infrastructure from operations. The cost structure of an agentic programme has two distinct layers: the infrastructure investment that is a prerequisite for all future value (evaluation systems, integration work, governance frameworks, team capability building) and the operational costs that scale with usage. A business case that blends these two layers makes it impossible to understand the economics of either. Separating them makes the investment horizon clear and sets the expectation that near-term returns will be below long-term returns.

Defining verification criteria. The business case should specify, before deployment, the metrics that will be used to assess whether it was right. What task volume will the agent handle by month six? What is the acceptable error rate for each error category? What human oversight costs will be maintained? These criteria serve two purposes: they give the programme a clear definition of success, and they prevent the goalposts from being moved when the results differ from the projection.
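The first two practices above can be sketched together: model three scenarios rather than one number, and keep the one-off infrastructure investment separate from recurring operating costs. Every figure, name, and the linear value ramp in this sketch is a hypothetical assumption chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cost layers, kept separate on purpose.
INFRASTRUCTURE_COST = 900_000.0    # one-off: evaluation, integration, governance
MONTHLY_OPERATING_COST = 20_000.0  # recurring; held flat here for simplicity


@dataclass(frozen=True)
class Scenario:
    name: str
    steady_state_monthly_value: float  # gross monthly benefit once fully ramped
    ramp_months: int                   # months of linear ramp to steady state


def payback_month(s: Scenario, horizon_months: int = 60) -> Optional[int]:
    """First month in which cumulative net value covers the infrastructure
    investment, or None if that does not happen within the horizon."""
    cumulative = -INFRASTRUCTURE_COST
    for month in range(1, horizon_months + 1):
        ramp = min(1.0, month / s.ramp_months)
        cumulative += s.steady_state_monthly_value * ramp - MONTHLY_OPERATING_COST
        if cumulative >= 0:
            return month
    return None


SCENARIOS = [
    Scenario("conservative", 50_000.0, 12),
    Scenario("base", 70_000.0, 9),
    Scenario("optimistic", 100_000.0, 6),
]

for s in SCENARIOS:
    print(f"{s.name}: payback in month {payback_month(s)}")
```

With these assumed inputs the payback month ranges from 15 (optimistic) to 40 (conservative), with 24 in the base case. Presenting that range, together with the assumptions behind each scenario, also yields natural verification criteria: the value ramp assumed for month six can be checked against what the agent actually delivers.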

Value Tracking in Production

Once agents are live, the measurement challenge shifts from estimation to observation. The temptation is to report the metrics that are easy to collect — model calls, task completion rates, API costs — rather than the metrics that matter. These technical metrics are necessary for operational management but insufficient for business case validation.

The metrics that matter for business value assessment organise into three tiers:

Output metrics measure what the agent produces: tasks completed, documents processed, decisions made, responses generated. These are the foundation layer — they confirm the agent is running and doing something. They do not confirm it is doing the right thing.

Quality metrics measure whether the agent's outputs are correct, complete, and useful: error rates by category, human correction rates, escalation rates, user satisfaction scores. This is the layer that determines whether the output volume generates value or generates rework. An agent completing 10,000 tasks at 60% accuracy is not producing 10,000 units of value.

Outcome metrics measure the downstream business effect: process cycle time, cost per transaction, customer satisfaction scores, revenue attributable to agent-assisted interactions. These are the metrics the board is actually asking about, and they are the hardest to collect because they require connecting the agent's observability infrastructure to the business systems that capture downstream results.

Tier	Example metric	Answers
Output	Tasks completed per day	Is it running?
Quality	Human correction rate by task type	Is it right?
Outcome	Process cycle time, cost per transaction	Is it valuable?
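The relationship between the output and quality tiers can be illustrated with a short calculation. The value per correct task and the rework cost per error below are hypothetical figures; in practice both would come from the business process the agent supports.

```python
def naive_value(tasks: int, value_per_task: float) -> float:
    """Output-tier view: every completed task counts as full value."""
    return tasks * value_per_task


def quality_adjusted_value(
    tasks: int,
    accuracy: float,
    value_per_correct_task: float,
    rework_cost_per_error: float,
) -> float:
    """Quality-tier view: correct tasks create value, errors create rework."""
    correct = tasks * accuracy
    errors = tasks - correct
    return correct * value_per_correct_task - errors * rework_cost_per_error


# Hypothetical inputs: 10,000 tasks, 60% accuracy, 4.00 of value per correct
# task, 5.00 of human rework per incorrect one.
print(naive_value(10_000, 4.0))
print(quality_adjusted_value(10_000, 0.60, 4.0, 5.0))
```

On these assumptions the output-tier figure is 40,000 while the quality-adjusted figure is about 4,000, a tenfold difference that an output-only dashboard would not reveal.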

The discipline of maintaining outcome metrics is where most programmes underinvest. A 2026 survey of 364 enterprise leaders confirms the scale of this gap: while 37.9% track the proportion of staff using AI tools, only 16.8% track investment per tool against its realised benefit, and just 13.5% measure delivery-speed improvement as a direct indicator of operational value.⁴

Business-Level KPIs vs. Technical Metrics

The monitoring infrastructure described in Chapter 11 produces technical metrics: token costs, latency distributions, tool call success rates, error logs. These metrics are essential for engineering teams managing the operational health of the system. They are not sufficient for business accountability.

The gap between technical health and business value is real and frequently ignored. A system can be technically healthy — low latency, clean logs, high tool call success rate — while producing business outcomes that are materially below expectations. Conversely, a system with elevated technical error rates may still be delivering excellent business value if those errors are in low-consequence categories that human oversight handles well.

Business-level accountability requires a different set of metrics, built around outcomes rather than operations:

Process metrics measure the effect of the agent on the business process it supports. If the agent handles invoice processing, the relevant process metric is not "invoices processed per hour" (output) but "invoice processing cycle time" and "exceptions requiring manual intervention" (process outcomes).

Quality by consequence tier distinguishes between error categories based on their downstream impact. A formatting error in a draft document is a different failure from an incorrect calculation in a financial analysis. Accountability metrics should be weighted by consequence — tracking not just error rates but error rates in high-consequence categories specifically.

Human override rate measures how often the humans operating alongside the agent choose not to accept its output. A rising override rate is an early warning signal of degradation — humans are detecting problems before they surface in outcome metrics. A falling override rate, paradoxically, can also be a warning signal if it reflects complacency rather than genuine performance improvement.

Metric	What it measures	Warning signal
Process cycle time	Business efficiency of the supported process	Rising without external cause
High-consequence error rate	Quality in the categories that matter most	Any sustained increase
Human override rate	Practical trust calibration by operators	Rising (degradation) or rapidly falling (complacency)
Escalation rate	Volume of edge cases beyond agent scope	Rising (scope creep) or falling (under-escalation)
Outcome variance	Consistency of results across similar inputs	Widening distribution
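Two of the metrics in the table lend themselves to a simple sketch: the error rate restricted to high-consequence outputs, and a check on whether the human override rate has moved away from its baseline in either direction. The record format, window lengths, and 25% tolerance are assumptions made for illustration, not recommended thresholds.

```python
from statistics import mean
from typing import Iterable, Sequence, Tuple


def error_rate_for_tier(records: Iterable[Tuple[str, bool]], tier: str = "high") -> float:
    """records: (consequence_tier, is_error) pairs from reviewed outputs.
    Returns the error rate within one tier, or 0.0 if the tier is empty."""
    in_tier = [is_error for record_tier, is_error in records if record_tier == tier]
    if not in_tier:
        return 0.0
    return sum(in_tier) / len(in_tier)


def override_signal(
    weekly_rates: Sequence[float],
    baseline_weeks: int = 8,
    recent_weeks: int = 4,
    tolerance: float = 0.25,
) -> str:
    """Compare the mean override rate of the most recent weeks with the
    mean of the earliest weeks and classify the relative change."""
    if len(weekly_rates) < baseline_weeks + recent_weeks:
        return "insufficient data"
    baseline = mean(weekly_rates[:baseline_weeks])
    recent = mean(weekly_rates[-recent_weeks:])
    if baseline == 0:
        return "rising" if recent > 0 else "stable"
    change = (recent - baseline) / baseline
    if change > tolerance:
        return "rising"   # possible degradation
    if change < -tolerance:
        return "falling"  # improvement, or possibly complacency
    return "stable"


reviewed = [("high", True), ("high", False), ("low", True),
            ("low", True), ("high", False), ("high", False)]
print(error_rate_for_tier(reviewed, "high"))
print(override_signal([0.10] * 8 + [0.14] * 4))
```

A "falling" result is deliberately not labelled as good news. As the table notes, it needs a human to judge whether the agent improved or the operators stopped checking.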

The Productivity and Skill-Erosion Trade-Off

One measurement that almost no business case includes, and that almost all should, is the cost of skill erosion.

When a task is automated, the people who previously performed it tend to lose the practice opportunities that built and maintained their capability. Evidence from one large-scale deployment complicates this picture: customer service agents who actively engaged with AI recommendations showed durable skill gains, continuing to outperform their pre-AI baseline even during system outages — suggesting that whether skill erosion occurs depends on how workers are structured to engage with AI, not simply on whether automation is present.⁵ Field evidence from professional knowledge work points in the same direction: AI creates large gains inside the capability frontier and measurable losses outside it, which means performance measurement must track task fit rather than treating productivity as a uniform effect across the whole workflow.⁶

The design implications of this finding are addressed in Chapter 22. The measurement implication belongs here: a business case that counts the productivity gain without counting the skill erosion cost is systematically overestimating the net value of the programme. Measurement data reinforces the training dimension: organisations with formal AI proficiency programmes report 2.7 times higher proficiency scores and 3.2 times greater productivity gains than those relying on informal adoption alone.⁴

A complete business case should acknowledge this risk explicitly and include the cost of mitigating it: the deliberate learning and development investment required to ensure that skill erosion does not outpace the productivity gain.

When ROI Is the Wrong Framing

Not every valuable agentic investment produces returns that are legible in standard ROI terms. Three categories of value are consistently underrepresented in business case frameworks:

Risk reduction. An agent that monitors compliance across thousands of transactions in real time reduces the risk of regulatory violation in a way that does not appear in a productivity calculation. The value is the expected cost of the violations it prevents — a number that is hard to estimate but that, in regulated industries, is often very large.

Capability optionality. Investing in the infrastructure for agentic AI — the evaluation systems, the integration layer, the governance frameworks — creates the option to deploy additional use cases at substantially lower marginal cost. The value of that option is real and can be estimated using option pricing methods. It is almost never included in the business case for the initial deployment.

Competitive positioning. In markets where agentic AI capability is becoming a competitive differentiator, the cost of not building the capability is a relevant consideration. This is difficult to quantify but belongs in an honest business case — particularly in technology-intensive sectors where the capability gap between early adopters and late followers is becoming measurable.
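For the risk reduction category, the idea of valuing an agent by the expected cost of the violations it prevents can be written down as a simple expected-loss calculation. The probabilities and the cost per incident below are hypothetical; in a real business case they would be estimated ranges, not single numbers.

```python
def expected_annual_loss(incident_probability: float, cost_per_incident: float) -> float:
    """Expected yearly loss from one class of incident."""
    return incident_probability * cost_per_incident


def risk_reduction_value(
    probability_without_agent: float,
    probability_with_agent: float,
    cost_per_incident: float,
) -> float:
    """Expected loss avoided per year by lowering the incident probability."""
    before = expected_annual_loss(probability_without_agent, cost_per_incident)
    after = expected_annual_loss(probability_with_agent, cost_per_incident)
    return before - after


# Hypothetical: annual chance of a regulatory violation falls from 8% to 2%,
# with an assumed cost of 5,000,000 per violation.
print(risk_reduction_value(0.08, 0.02, 5_000_000.0))
```

With these assumed inputs the expected loss avoided is about 300,000 per year. Running the same function across a low and a high estimate of each input gives the range that an honest business case would present.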

The Feedback Loop Design

An agent that does not learn from its production experience is permanently limited by its initial configuration. The feedback loop, the mechanism through which signals from production improve agent behaviour over time, is the difference between a system that compounds value and one that decays toward obsolescence. Research on evaluation-driven development and operations makes the same point structurally: evaluation must connect development-time tests, deployment gates, runtime monitoring, and post-incident learning into a continuous system rather than a one-off benchmark exercise.⁷

Feedback loops for agentic systems operate at three levels:

Immediate correction handles the case where an output is wrong and the error is caught in real time. A human reviewer rejects an agent output, corrects it, and the correction is logged. The value of this loop depends on whether the corrections are systematically reviewed — if they are simply logged and never analysed, the feedback exists but the loop is broken.

Pattern analysis aggregates corrections, overrides, and escalations over time to identify systematic failure modes: categories of input the agent handles poorly, edge cases that consistently require human intervention, conditions under which the agent's confidence calibration is misleading. This analysis should be a scheduled activity, not a reactive one — conducted on a defined cadence by someone with the authority to act on what they find.

Configuration improvement translates the pattern analysis into changes: prompt adjustments, tool configuration updates, scope boundary clarifications, evaluation suite additions to cover newly identified failure modes. This is the loop closure. Without it, the feedback loop is observational rather than corrective.

The discipline that makes this loop work is scheduling. Pattern analysis that is triggered only when something goes visibly wrong misses the gradual degradation that accumulates between incidents. Scheduled analysis, conducted on a defined cadence regardless of whether anything has gone wrong, catches the drift before it becomes a failure.
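The pattern analysis step can be sketched as a small aggregation over the correction log. The event format, a pair of input category and whether a human corrected the output, is an assumption for illustration; a real log would carry more fields.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def correction_rates(events: Iterable[Tuple[str, bool]]) -> List[Tuple[str, float, int]]:
    """events: (input_category, was_corrected) pairs from the review log.
    Returns (category, correction_rate, volume) sorted worst-first."""
    totals: Counter = Counter()
    corrected: Counter = Counter()
    for category, was_corrected in events:
        totals[category] += 1
        if was_corrected:
            corrected[category] += 1
    ranked = [(c, corrected[c] / totals[c], totals[c]) for c in totals]
    return sorted(ranked, key=lambda row: row[1], reverse=True)


# Hypothetical log for one review period.
log = [
    ("invoice", True), ("invoice", False),
    ("contract", True), ("contract", True),
    ("email", False),
]
for category, rate, volume in correction_rates(log):
    print(f"{category}: {rate:.0%} corrected across {volume} outputs")
```

Running this on a fixed schedule, and comparing each period's ranking with the previous one, is what turns logged corrections into a loop: the categories at the top of the list are the candidates for prompt changes, scope changes, or new evaluation cases.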

Model Update Management

Foundation model providers update their models on a cadence that is not synchronised with enterprise deployment cycles. A model update that improves performance on one class of tasks can degrade performance on another. An update that changes how the model responds to a specific prompt pattern can break an agent workflow that depended on that pattern's stability. In multi-agent systems the stakes are higher, because failures often arise from system design and inter-agent coordination rather than from any single model call; measuring only final outputs misses the interaction failures that drive production breakdowns.⁸

Model update management is one of the most underplanned aspects of enterprise agentic AI operations. Sculley et al.'s foundational analysis of ML technical debt identifies hidden feedback loops, undeclared data dependencies, and configuration instability as sources of silent degradation that are structurally difficult to detect through standard testing.¹⁰ Interview research on operational machine learning reinforces the same operational lesson: production ML systems fail through data, monitoring, ownership, and maintenance gaps at least as much as through modelling errors.⁹ In agentic systems, these risks are compounded by the non-deterministic nature of model outputs: the same input can produce different outputs across runs, making regression detection harder than in conventional software systems.

The typical pattern for model update failure is: provider deploys an update; the agent's behaviour changes in ways that are not immediately visible; performance metrics begin drifting; someone notices, weeks later, that something is different; a root cause investigation determines that the model update is responsible; the team attempts to adapt prompts and configurations to restore previous behaviour; they partially succeed, with some residual performance degradation.

Each step in this sequence is wasteful in proportion to how long it takes. The mitigation is a proactive model update protocol:

Maintain a regression test suite. A collection of inputs with known correct outputs, covering the range of tasks the agent handles including edge cases, is the only reliable mechanism for detecting when a model update has changed behaviour. This suite should be run automatically when a model update is detected and the results compared against the baseline.

Stage model updates. Rather than accepting model updates immediately in production, run the new model version in parallel against a shadow traffic sample, compare outputs against the current version, and promote the update only when the regression suite passes and the shadow comparison shows no significant divergence.

Maintain rollback capability. Most foundation model providers offer access to previous model versions for a defined period. The operational plan should specify under what conditions the team would roll back, who has authority to make that decision, and how quickly it can be executed.
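The regression and staging steps of the protocol above can be sketched as a promotion gate. Everything named here is hypothetical: the models are stand-in callables rather than any provider's API, exact string matching stands in for whatever grader a real suite would use, and the thresholds are placeholders. Each case is run several times because, as noted earlier, the same input can produce different outputs across runs.

```python
from typing import Callable, Sequence, Tuple

Model = Callable[[str], str]  # stand-in for a call to a specific model version


def pass_rate(model: Model, suite: Sequence[Tuple[str, str]], runs_per_case: int = 3) -> float:
    """Share of (input, expected) checks passed, repeating each case to
    account for non-deterministic outputs."""
    passed = 0
    for prompt, expected in suite:
        for _ in range(runs_per_case):
            if model(prompt).strip() == expected:
                passed += 1
    return passed / (len(suite) * runs_per_case)


def shadow_divergence(current: Model, candidate: Model, shadow_inputs: Sequence[str]) -> float:
    """Share of sampled production inputs where the two versions disagree."""
    differing = sum(1 for x in shadow_inputs if current(x) != candidate(x))
    return differing / len(shadow_inputs)


def should_promote(
    baseline_pass: float,
    candidate_pass: float,
    divergence: float,
    max_regression: float = 0.02,
    max_divergence: float = 0.10,
) -> bool:
    """Promote only if the suite has not regressed and shadow outputs agree."""
    return candidate_pass >= baseline_pass - max_regression and divergence <= max_divergence


# Toy stand-ins: the candidate mishandles any input containing "x".
current_model: Model = lambda p: p.upper()
candidate_model: Model = lambda p: p if "x" in p else p.upper()

suite = [("a", "A"), ("b", "B"), ("x", "X"), ("c", "C")]
shadow = ["a", "x", "b", "xy"]

decision = should_promote(
    pass_rate(current_model, suite),
    pass_rate(candidate_model, suite),
    shadow_divergence(current_model, candidate_model, shadow),
)
print("promote" if decision else "hold and investigate")
```

In this toy case the candidate fails one suite case in four and disagrees with the current version on half the shadow inputs, so the gate holds the update. The rollback step is organisational rather than computational: the same thresholds can serve as the pre-agreed conditions under which the team reverts.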

The Continuous Improvement Cadence

The accountability practices described in this chapter are only effective if they are institutionalised rather than ad hoc. The mechanism of institutionalisation is a cadence: a scheduled, recurring activity with defined participants, defined inputs, and defined outputs.

An agent health review should occur on two timescales:

Weekly operational review covers the metrics from the prior week against thresholds. Participants: the agent operations owner, the relevant business unit representative, and whoever is responsible for configuration updates. Duration: thirty minutes. Output: a list of actions with owners and deadlines. The meeting should not be a reporting event — it should be a decision event, producing specific actions or explicit decisions that no action is required.

Monthly strategic review covers trends over the prior month, feedback loop outputs, model update status, and any changes to the business process the agent supports that require agent reconfiguration. Participants: the operational team plus senior stakeholders from the business unit and IT governance. Duration: sixty to ninety minutes. Output: a written summary that feeds the board reporting process described below.

The temptation, especially as agent programmes grow, is to automate the governance cadence — replacing the meeting with a dashboard and an alert system. Dashboards and alerts are necessary but not sufficient. The human judgment required to distinguish signal from noise in complex operational data, and the accountability that comes from humans explicitly reviewing and signing off on agent performance, cannot be automated away without also automating away the accountability itself.

Key takeaway: The continuous improvement cadence is the mechanism through which deployed agents get better rather than worse. It requires scheduled human attention, clear ownership, and closed feedback loops — not better dashboards.

Board and Executive Reporting

The audience for a board-level AI programme report is not the engineering team. Boards and executive committees do not want to know about model call volumes or token costs. They want to know whether the investment is performing, whether the risks are managed, and whether the programme is on track to deliver what was promised.

A board-level AI programme report should address four questions in plain language:

What did we expect? A brief restatement of the commitments made in the original business case — the scenarios, the assumptions, the verification criteria.

What are we seeing? The outcome metrics — the business-level results — compared against the expectations. Not the technical metrics, or at least not leading with them.

What is the risk profile? A plain-language account of the most significant operational, regulatory, and reputational risks in the current programme — and what is being done about each.

What is the next decision? Board meetings are decision-making events. The report should arrive with a specific decision request — to expand scope, to continue investment, to address a specific risk, or to acknowledge that the programme is performing within the expected range and no decision is required.

The common failure mode in board reporting is presenting technical metrics to a non-technical audience and expecting them to infer business meaning. The result is either confusion (the board does not know what to make of the numbers) or false confidence (the numbers look good because they were selected to look good). Neither outcome serves the programme.

References

1. Wharton Human-AI Research & GBK Collective (2025). Accountable Acceleration: Gen AI Fast-Tracks Into the Enterprise. Wharton Human-AI Research & GBK Collective, University of Pennsylvania. October 2025.
2. MIT NANDA Initiative (2025). The GenAI Divide: State of AI in Business 2025. Massachusetts Institute of Technology. July 2025.
3. S&P Global Market Intelligence (2025). The Big Picture 2025: Generative Artificial Intelligence. S&P Global. 2025.
4. Larridin (2026). From AI Exploration to AI Accountability: Why Measurement Is Strategy. Larridin State of Enterprise AI Q1 2026. February 2026.
5. Brynjolfsson, E., Li, D., & Raymond, L.R. (2025). Generative AI at Work. Quarterly Journal of Economics, 140(2), 889–942.
6. Dell'Acqua, F., McFowland, E., Mollick, E., Lifshitz, H., Kellogg, K.C., Rajendran, S., Krayer, L., Candelon, F., & Lakhani, K.R. (2026). Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of Artificial Intelligence on Knowledge Worker Productivity and Quality. Organization Science. https://doi.org/10.1287/orsc.2025.21838
7. Xia, B., Lu, Q., Zhu, L., Xing, Z., Zhao, D., & Zhang, H. (2024). Evaluation-Driven Development and Operations of LLM Agents: A Process Model and Reference Architecture. CSIRO's Data61 / Adelaide University / UNSW. arXiv:2411.13768.
8. Cemri, M., Pan, M. Z., Yang, S., Agrawal, L. A., et al. (2025). Why Do Multi-Agent LLM Systems Fail? NeurIPS 2025, Track on Datasets and Benchmarks. arXiv:2503.13657.
9. Shankar, S., Garcia, R., Hellerstein, J., & Parameswaran, A. (2022). Operationalizing Machine Learning: An Interview Study. arXiv:2209.09125. UC Berkeley / MIT.
10. Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J., & Dennison, D. (2015). Hidden Technical Debt in Machine Learning Systems. Advances in Neural Information Processing Systems 28 (NIPS 2015), pp. 2503–2511.
11. Kleppmann, M. (2017). Designing Data-Intensive Applications. O'Reilly Media.
