
Chapter 2 · The Economics of AI: Capability, Speed, and Cost

Why the economics of running AI agents matter as much as the AI itself.


The Hidden Cost of Intelligence

Building a chatbot that answers a customer's question is a single transaction. Building an AI agent that autonomously researches a market, drafts a strategy, validates its own assumptions, and generates a final report is an entirely different proposition — and so is its cost.

As organisations move from experimenting with conversational AI to deploying agents that act, decide, and iterate across complex workflows, the question of cost stops being a technical footnote and becomes a boardroom concern. Chapter 1 introduced the shift from systems that merely respond to systems that act; this chapter examines the economic constraint that determines whether those systems can operate sustainably at scale.


Cost per Call vs. Cost per Outcome

The first mistake many teams make is measuring the cost of an agent by the price of a single model call. That is rarely the right unit. The economic unit that matters is the cost per successful outcome: the total cost of planning, tool calls, model calls, retries, verification, escalation, logging, and human review required to complete a task correctly.

A cheap model call can become expensive if it causes repeated retries. A more expensive model can become economical if it reaches a correct answer faster, uses fewer tools, or avoids human rework. This is why agent economics must be measured at the workflow level rather than the prompt level.
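The workflow-level measure described above can be sketched in a few lines of Python. This is a minimal illustration, not a standard accounting method: the `TaskRun` record, its field names, and the sample figures are all hypothetical, and it assumes human review time has already been priced in the same currency as the API spend.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """One attempt at a task, with every billable component recorded (hypothetical schema)."""
    model_cost: float   # all model calls in the run, including retries
    tool_cost: float    # paid tool or API calls
    review_cost: float  # human review or escalation, priced in the same currency
    succeeded: bool     # did the run end in a correct, accepted outcome?


def cost_per_successful_outcome(runs: list[TaskRun]) -> float:
    """Total spend across all runs, failed ones included, divided by successful runs."""
    total = sum(r.model_cost + r.tool_cost + r.review_cost for r in runs)
    successes = sum(1 for r in runs if r.succeeded)
    if successes == 0:
        raise ValueError("no successful runs: cost per outcome is undefined")
    return total / successes


# Illustrative figures only: two clean runs (one of which failed) and one run needing review.
runs = [
    TaskRun(model_cost=0.30, tool_cost=0.10, review_cost=0.00, succeeded=True),
    TaskRun(model_cost=0.30, tool_cost=0.10, review_cost=0.00, succeeded=False),
    TaskRun(model_cost=0.45, tool_cost=0.15, review_cost=0.60, succeeded=True),
]
print(f"cost per successful outcome: {cost_per_successful_outcome(runs):.2f}")
```

The failed run still appears in the numerator. That is the point of the metric: spend on attempts that do not produce a usable result is carried by the attempts that do.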


The Agent Trilemma

Consider a scenario that is becoming familiar in enterprise AI teams. A company builds an agent to automate competitive research — it browses the web, synthesises findings, drafts a report, and checks its own conclusions before delivering a final output. In testing, it performs brilliantly. Then the monthly API bill arrives. What felt like a productivity breakthrough turns out to cost more per report than the analyst it was meant to replace. The technology worked. The economics did not.

This tension has a name. Researchers have identified what is increasingly called the agent trilemma: the difficulty of achieving high performance, low cost, and fast execution at the same time. Optimising for any two of these tends to come at the expense of the third.

What makes this trilemma particularly sharp in agentic contexts is the compounding nature of model calls. Unlike a standalone chatbot interaction, an AI agent completing a sophisticated task may invoke a language model dozens of times — to plan, to research, to verify, to revise, and to summarise.

Empirical studies have found that some leading agentic systems incur average costs of up to $3 per task, and that even relatively simple queries can take up to 40 minutes to execute. At enterprise scale, those numbers multiply rapidly across thousands of daily tasks.

A simple example makes the point concrete. Imagine an internal research agent that uses eight model calls to plan, search, extract, synthesise, verify, revise, and format a report. It also makes three paid tool calls and fails one verification check, forcing a partial retry. The visible user experience is one request and one final answer. The economic reality is a chain of twelve billable operations. If that workflow costs only €0.40 when it succeeds on the first pass but €1.20 when retries and human review are included, the business case depends on the second number, not the first.
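The arithmetic in this example can be made explicit. The sketch below takes the call counts and the two prices from the paragraph above; the first-pass success rates are hypothetical inputs chosen only to show how the blended cost moves, and counting the partial retry as a single billable operation is one reading of the example.

```python
# Counts taken from the example: eight model calls, three paid tool calls, one partial retry.
MODEL_CALLS = 8
TOOL_CALLS = 3
RETRIES = 1  # assumption: the partial retry is counted as one billable operation

billable_operations = MODEL_CALLS + TOOL_CALLS + RETRIES


def expected_cost(first_pass_rate: float,
                  clean_cost: float = 0.40,
                  loaded_cost: float = 1.20) -> float:
    """Blend the first-pass cost with the cost including retries and human review.

    first_pass_rate is the share of tasks that succeed without retry or review.
    """
    if not 0.0 <= first_pass_rate <= 1.0:
        raise ValueError("first_pass_rate must be between 0 and 1")
    return first_pass_rate * clean_cost + (1.0 - first_pass_rate) * loaded_cost


print(f"billable operations behind one request: {billable_operations}")
for rate in (0.9, 0.7, 0.5):  # hypothetical first-pass success rates
    print(f"first-pass rate {rate:.0%}: expected cost per task {expected_cost(rate):.2f} EUR")
```

Under these assumptions, a workflow that succeeds on the first pass only half the time costs €0.80 per task on average, twice the figure a clean-run demo would suggest.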


Large vs. Small: Choosing the Right Model

The traditional instinct in AI deployment has been straightforward: use the most powerful model available. But frontier models come with a cost premium that makes them impractical as the default engine for every step of every agentic workflow.

This has driven significant research interest in Small Language Models (SLMs) — typically models ranging from 1 to 20 billion parameters — optimized specifically for constrained deployment environments.

| Model Size | Typical Cost | Best For | Limitation |
|---|---|---|---|
| Frontier (100B+) | High | Complex reasoning, ambiguous tasks | Cost at scale |
| Mid-size (20–70B) | Medium | General enterprise tasks | Balance point |
| Small (1–20B) | Low | Specific, well-defined tasks | Narrow capability |
| Edge (< 1B) | Negligible | On-device, offline | Very limited |

The instinct to equate model size with capability is understandable: for most of AI's recent history, it was largely correct. But it breaks down when tasks are narrow and well-defined. Compute-optimal scaling research also challenged the idea that simply increasing parameters is always the best path, showing that model size, training data, and compute budget must be balanced rather than treated as a one-way race toward larger models [6].

One particularly striking benchmarking result challenges the size-equals-capability assumption directly: a fine-tuned small language model achieved a 77.55% pass rate on a standard tool-use evaluation (ToolBench), significantly outperforming ChatGPT configurations that scored as low as 16–26% on the same benchmark. The result stemmed not from raw scale but from precise task alignment. The small model was trained exclusively on structured tool-calling patterns, while larger generalist models struggled with the format requirements and generated verbose responses where concise API calls were needed.


Tiered Model Routing

The most sophisticated response to the trilemma is not to choose one model and apply it universally, but to architect agent systems that intelligently route different tasks to appropriately sized models. This is the logic behind model cascades and routing systems such as FrugalGPT: route easy cases to cheaper models, escalate difficult cases to stronger models, and optimise for quality-adjusted cost rather than raw model prestige [7].

The logic is not unlike how professional services firms have always worked: a senior partner handles the ambiguous strategic question that requires judgement and experience; a junior associate handles the structured research that requires thoroughness and time. Nobody considers this a compromise; it is simply good resource allocation. Applying the same principle to AI models is what tiered routing makes possible.
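A cascade of this kind can be sketched as follows. This is an illustration of the general idea only, not FrugalGPT's implementation: the `Tier` type, the stub "models" (plain functions), the prices, and the acceptance check are all hypothetical stand-ins for real model calls and a real quality scorer.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tier:
    name: str
    cost_per_call: float           # hypothetical price, same currency throughout
    answer: Callable[[str], str]   # stand-in for a real model call


def cascade(task: str,
            tiers: list[Tier],
            accept: Callable[[str, str], bool]) -> tuple[str, str, float]:
    """Try tiers from cheapest to strongest and stop at the first accepted answer.

    Returns (tier name, answer, total cost of every tier tried).
    If no tier is accepted, the last tier's answer is returned for the caller to escalate.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    spent = 0.0
    result = ""
    for tier in tiers:
        result = tier.answer(task)
        spent += tier.cost_per_call
        if accept(task, result):
            return tier.name, result, spent
    return tiers[-1].name, result, spent


# Stub models: the small one gives up (empty answer) on anything mentioning "strategy".
small = Tier("small", 0.01, lambda t: "" if "strategy" in t else f"small: {t}")
frontier = Tier("frontier", 0.20, lambda t: f"frontier: {t}")


def non_empty(task: str, answer: str) -> bool:
    """Toy acceptance check; a real system would use a verifier or scoring model."""
    return bool(answer)


for task in ("extract invoice total", "draft market strategy"):
    name, _, spent = cascade(task, [small, frontier], non_empty)
    print(f"{task!r} handled by {name}, spend {spent:.2f}")
```

Note that an escalated task pays for both calls. A cascade therefore only saves money when the cheap tier is accepted often enough to offset the wasted calls on the cases it fails.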

Recent research demonstrated that by calibrating model complexity to task requirements, it was possible to retain 96.7% of the performance of a leading open-source agent while reducing per-task operational costs by over 28%. Crucially, these gains came not only from model selection but from right-sizing the agent framework itself — planning depth, tool configuration, and memory design all proved significant levers, and adding complexity beyond a threshold increased costs without improving outcomes.


When Cheaper Is Not Wiser

Amid the drive toward cost optimization, an important counterargument deserves equal weight. Economic analyses of LLM deployment have found that for tasks where errors carry meaningful real-world consequences — financial decisions, medical information, legal interpretation — the calculus shifts decisively toward using the most capable model available, regardless of cost.

The reasoning is straightforward: deployment costs, even for frontier models, are typically small relative to the economic impact of a consequential mistake. A medical AI system that misinterprets a diagnostic query does not save money by running on a cheaper model — it transfers cost from the API bill to somewhere far more serious. The same logic applies to financial advice that moves capital in the wrong direction, or legal guidance that leads a business into liability it could have avoided. In these contexts, the frontier model is not an extravagance. It is insurance.

Cost optimization is not a universal goal to be pursued in isolation. It is a design variable to be calibrated against the specific risk profile of each task.
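One way to make that calibration concrete is to add the expected cost of a mistake to the price of the call. The sketch below assumes a deliberately simple model (call cost plus error rate times cost of error); the per-call prices and error rates are hypothetical and would need to be measured for any real workload.

```python
def risk_adjusted_cost(call_cost: float, error_rate: float, cost_of_error: float) -> float:
    """Price of the call plus the expected cost of getting the answer wrong."""
    return call_cost + error_rate * cost_of_error


# Hypothetical figures: (cost per call, probability of a consequential error).
MODELS = {
    "small": (0.01, 0.05),
    "frontier": (0.20, 0.01),
}


def cheapest_after_risk(cost_of_error: float) -> str:
    """Pick the model with the lowest risk-adjusted cost for a given cost of error."""
    return min(
        MODELS,
        key=lambda name: risk_adjusted_cost(*MODELS[name], cost_of_error),
    )


for stakes in (1.0, 1000.0):  # a low-stakes task and a high-stakes one
    print(f"cost of error {stakes:>7.2f}: use {cheapest_after_risk(stakes)}")
```

With these numbers the small model wins while a mistake costs less than about 4.75 and the frontier model wins above that. The break-even point, not the price list, is what should drive the choice.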


Practical Design Principles

| Principle | What It Means in Practice |
|---|---|
| Map tasks before selecting models | Categorise each workflow step by complexity and cost of error |
| Build for routing, not uniformity | Design with dynamic model routing as a first-class feature |
| Measure cost-of-pass | Quantify the full cost of a successful outcome including retries |
| Fine-tune for specificity | Targeted fine-tuning often beats general frontier models at lower cost |
| Preserve capability headroom | Retain frontier access for genuinely complex reasoning |
| Prioritise the orchestrator role | In multi-agent systems, the manager/orchestrator model is the single most influential factor on overall team performance; allocate capability budget here first |
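The first principle, mapping tasks before selecting models, can be expressed as a small lookup. The sketch below is one possible rule set, not a recommended policy: the two-level scale, the `pick_tier` rules, and the workflow steps are hypothetical, and the tier names simply reuse the size bands from the table earlier in the chapter.

```python
from enum import Enum


class Level(Enum):
    LOW = 1
    HIGH = 2


def pick_tier(complexity: Level, error_cost: Level) -> str:
    """Map a workflow step to a model tier (illustrative rules only)."""
    if error_cost is Level.HIGH:
        return "frontier"   # consequential mistakes outweigh call cost
    if complexity is Level.HIGH:
        return "mid-size"   # hard but low-stakes work
    return "small"          # narrow, well-defined, low-stakes work


# Hypothetical workflow map: step -> (complexity, cost of error).
workflow = {
    "format report": (Level.LOW, Level.LOW),
    "synthesise findings": (Level.HIGH, Level.LOW),
    "verify financial figures": (Level.LOW, Level.HIGH),
}

plan = {step: pick_tier(*levels) for step, levels in workflow.items()}
for step, tier in plan.items():
    print(f"{step}: {tier}")
```

Writing the map down before any model is chosen makes the routing decision reviewable: a disagreement about which tier a step deserves becomes a disagreement about its complexity or its cost of error.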

The organisations that navigate this well will not necessarily be those with the largest AI budgets. They will be those that learn to treat cost as a design constraint from the beginning — not an afterthought to be optimised once something is already built and running. Getting model selection, routing logic, and framework design right before scaling is far cheaper than unpicking them after the fact.

The economics of AI agents, in other words, reward the same discipline that good engineering has always rewarded: thinking carefully before building, and building only what the problem actually requires. Once the economics are viable, the next constraint is perception: whether an agent can understand the documents, images, voice instructions, dashboards, and software environments in which real enterprise work actually happens.

References

  [1] Jhandi, P., Kazi, O., Subramanian, S., & Sendas, N. (2024). Small Language Models for Efficient Agentic Tool Calling. Amazon Web Services.
  [2] OPPO AI Agent Team (2025). Efficient Agents: Building Effective Agents While Reducing Cost. OPPO Research Institute.
  [3] Zellinger, M.J., & Thomson, M. (2025). Economic Evaluation of Large Language Models. California Institute of Technology.
  [4] Sabbatella, A. (2025). MALBO: Optimizing LLM-Based Multi-Agent Teams. University of Milano-Bicocca.
  [5] Sharma, R., & Mehta, M. (2025). Small Language Models for Agentic Systems: A Survey of Architectures, Capabilities, and Deployment Trade-offs. Northeastern University / University of Southern California.
  [6] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L.A., Welbl, J., Clark, A., et al. (2022). Training Compute-Optimal Large Language Models. Advances in Neural Information Processing Systems.
  [7] Chen, L., Zaharia, M., & Zou, J. (2023). FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance. arXiv:2305.05176.
