Chapter 18 · Accelerating Innovation: Agents in the Product Lifecycle

Speed is not the same as progress. Agents can compress the time between idea and output. Whether the output is any good is a different question.


Where Innovation Actually Gets Slow

The mythology of product development is that the hard part is the idea. In practice, the idea is usually the fast part. What is slow is everything that happens between the idea and the moment a customer can use something: the research to validate it, the specifications to describe it, the engineering work to build it, the testing to verify it, the documentation to explain it, and the ongoing monitoring to improve it after launch. In most organisations, most of this work is not creative in any meaningful sense — it is the necessary scaffolding around the creative work, and it consumes the majority of the time.

Agents are well-positioned to compress this scaffolding. They are not well-positioned to replace the judgement that determines whether the scaffolding is being erected around the right thing. This distinction matters enormously for how agentic tools are deployed in product and innovation contexts, and it is the distinction most frequently blurred in vendor narratives about AI-accelerated development.

This chapter traces agents through the stages of the product lifecycle — from early research and ideation to post-launch iteration — examining where they genuinely accelerate innovation, where they create the appearance of acceleration without the substance, and what the human role must retain if the output is to be worth accelerating.


The Product Lifecycle as a Pipeline

Product development is inherently a pipeline: work flows from one stage to the next, each stage transforming inputs into outputs that the next stage builds on. The pipeline metaphor has limits — real product development is messier, more iterative, and more socially complex than a diagram suggests — but it is analytically useful because it makes the bottlenecks visible.

Agents can contribute at every stage, but the nature of the contribution — and the risks — differ substantially across them. Before examining each stage, it is worth naming the systemic failure mode that spans all of them.


The Acceleration Trap

There is a failure mode in agentic product development that does not belong to any single stage — it emerges from the aggregate effect of agents operating across all of them. Call it the acceleration trap: the compression of cycle time creates the impression of progress when the underlying product decisions are not getting better.

The mechanism is straightforward. If agents reduce the time to produce a market research synthesis from a week to a day, the time to draft a specification from three days to half a day, and the time to write tests from a sprint to an afternoon, the cycle from concept to deployed feature compresses substantially. This looks like accelerated innovation. But the quality of the market research insight, the precision of the specification, and the coverage of the test suite depend on the quality of the human judgement applied to the agent's outputs at each stage — and the time available for that judgement has not increased proportionally. If anything, it has decreased: faster cycles create pressure to review more outputs in the same amount of time.
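The arithmetic behind this mechanism can be made concrete with a small sketch. The generation times follow the example above; the five-day week, the ten-day sprint and the review hours per output are assumptions chosen purely for illustration, not measured figures.

```python
# Illustrative figures only. Generation times follow the example in the text;
# the sprint length and the review hours are assumptions made for this sketch.
WORKING_HOURS_PER_DAY = 8
SPRINT_DAYS = 10  # assumption: a two-week sprint

generation_days_before = {"research": 5, "specification": 3, "tests": SPRINT_DAYS}
generation_days_after = {"research": 1, "specification": 0.5, "tests": 0.5}

# Assumption: a full-depth review takes the same time however the draft was made.
review_hours_needed = {"research": 8, "specification": 6, "tests": 6}


def cycle_days(generation_days, review_hours):
    """Total days per cycle: generation plus full-depth review."""
    review_days = sum(review_hours.values()) / WORKING_HOURS_PER_DAY
    return sum(generation_days.values()) + review_days


def review_share(generation_days, review_hours):
    """Fraction of the cycle that full-depth review occupies."""
    review_days = sum(review_hours.values()) / WORKING_HOURS_PER_DAY
    return review_days / cycle_days(generation_days, review_hours)


before = cycle_days(generation_days_before, review_hours_needed)
after = cycle_days(generation_days_after, review_hours_needed)
share_before = review_share(generation_days_before, review_hours_needed)
share_after = review_share(generation_days_after, review_hours_needed)

print(f"before: {before} days per cycle, review is {share_before:.0%} of it")
print(f"after:  {after} days per cycle, review is {share_after:.0%} of it")
```

Under these assumed numbers the cycle shrinks from 20.5 days to 4.5, and review grows from roughly 12% of the cycle to roughly 56%. Review becomes the visible bottleneck, which is where the pressure to shorten it comes from.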

This trap has a measurable organisational shadow. Enterprise surveys find that while three in four organisations now report positive AI ROI, 43% simultaneously report that sustained AI use is eroding employee skill proficiency — particularly among junior staff still building foundational capabilities.3 The two findings are not contradictory. Organisations can be producing outputs faster while progressively reducing the depth of judgement applied to each one, and the cost surfaces at launch or in the months after — when the accumulated quality debt is most expensive to correct.

The organisations that avoid the acceleration trap compress the generative time rather than the review time. They use the time savings to review more carefully, to include more stakeholder perspectives, to run more user research, and to revisit assumptions more frequently — not to move faster through a process that still requires the same depth of human attention.

The organisations that fall into the trap treat agent-generated outputs as closer to final than they are, compress the review step along with the generation step, and discover the quality debt at launch or in the months after — when the cost of correction is highest.

Key takeaway: The acceleration trap is not a failure of the technology — it is a failure of the review process. Agents compress generation time. They do not compress the time required for good human judgement. Organisations that treat these as the same thing are building speed at the expense of quality.

The sections below apply this lens to each stage of the product lifecycle.


Stage 1: Discovery and Research

Research is the stage where agents offer the most unambiguous acceleration. The work is information-intensive and largely standard in form, and the cost of a gap or error is usually recoverable: a missed insight in the research phase produces a flawed hypothesis that the ideation phase can still correct.

The applications cluster into three patterns:

Market and competitive intelligence. Agents that continuously monitor competitor product releases, pricing changes, patent filings, job postings (a signal of strategic direction), and public regulatory submissions can synthesise a structured intelligence briefing that would take an analyst team days to produce. The value is not just speed — it is coverage. Human analysts naturally concentrate attention on known competitors and familiar signals. An agent monitoring a broader surface can surface weak signals from unexpected sources.

Customer and user research synthesis. Product teams generate enormous volumes of qualitative research — interview transcripts, usability test recordings, support ticket logs, NPS verbatim comments, app store reviews. Synthesising this material into structured insight has historically required dedicated research operations capacity. Agents that can ingest, code, and summarise qualitative data at volume change the economics of this work substantially, making it feasible to synthesise research on a continuous basis rather than as a periodic programme.

Prior art and literature search. In R&D-intensive industries — pharmaceuticals, materials science, semiconductor design, medical devices — a significant proportion of early-stage research time is spent establishing what is already known, what patents exist, and what the current state of the science is. Agents with access to patent databases, academic literature repositories, and regulatory filing archives can compress this work from weeks to hours.

The characteristic risk at this stage operates in two directions. A human researcher tends to find what they are looking for — confirmation bias in search. An agent tends to find what is most semantically similar to the query it was given — a form of query-bounded retrieval that can miss important material framed in terminology the query did not anticipate. The practitioner response is explicit query diversification: run multiple formulations of the same underlying question, and treat the union of results rather than a single search pass as the research output.
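Query diversification as described above can be sketched in a few lines. This is a minimal illustration, assuming a retrieval backend exposed as a function from a query string to a list of document identifiers; `diversified_search`, `toy_search` and the sample index are hypothetical names and data invented for the example.

```python
from typing import Callable, Iterable


def diversified_search(
    formulations: Iterable[str],
    search: Callable[[str], list[str]],
) -> dict[str, set[str]]:
    """Run every formulation and return the union of the results.

    Each document id maps to the set of formulations that retrieved it, so a
    reviewer can see which items only a single phrasing surfaced.
    """
    found: dict[str, set[str]] = {}
    for query in formulations:
        for doc_id in search(query):
            found.setdefault(doc_id, set()).add(query)
    return found


# Toy stand-in for a real retrieval backend (hypothetical data).
_INDEX = {
    "customer churn drivers": ["doc-1", "doc-2"],
    "why users cancel subscriptions": ["doc-2", "doc-3"],
    "retention failure causes": ["doc-4"],
}


def toy_search(query: str) -> list[str]:
    return _INDEX.get(query, [])


results = diversified_search(_INDEX.keys(), toy_search)
single_source = sorted(d for d, queries in results.items() if len(queries) == 1)

print(sorted(results))   # the union across all formulations
print(single_source)     # items a single search pass could have missed
```

Tracking which formulation found each item is a design choice rather than a requirement: it makes visible how much of the union depended on one particular phrasing.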

A second risk is more structural: agent-synthesised research is subject to the extrinsic hallucination problem that affects all language model outputs — content that cannot be verified against its source, which passes standard quality checks silently because there is no ground truth to compare it against.1 A research synthesis that reads fluently and cites real papers is not the same as one whose claims have been verified against those papers. Retrieval pipelines return what is semantically relevant, not what is accurate.9 In product research, an undetected fabricated statistic or misattributed finding does not surface as an error until it has shaped decisions downstream.

Key takeaway: Agents excel at research coverage and synthesis speed, but their retrieval is bounded by query framing and their outputs are not self-verifying. The human's role in research is not to do what the agent does more slowly — it is to ask questions the agent would not think to ask, and to verify the claims the agent states most confidently.


Stage 2: Ideation and Concept Design

Ideation is the stage at which the agent's role becomes most contested. The argument for deploying agents in ideation is that they can generate large volumes of candidate ideas rapidly, explore design spaces that human teams would not reach due to time or cognitive constraints, and serve as a tireless sparring partner for concept evaluation. The argument against is that ideation is where domain expertise, user empathy, and creative judgement are most concentrated — and that these are precisely the capabilities agents most conspicuously lack.

Both arguments are partially right. The resolution is a distinction between generation and evaluation.

Agents are useful as generative tools in ideation because breadth of coverage is genuinely valuable early in a design process. A product team exploring directions for a new feature benefits from having fifty candidate concepts rather than five — even if most of the fifty are not viable — because the broader landscape exposes options and adjacencies that a narrower search would miss. The agent's lack of domain expertise is less damaging here than it would be later in the lifecycle, because the evaluation step that follows generation can filter out the weak candidates.

Agents are not useful as evaluative tools in ideation, and deploying them as such introduces a failure mode that is particularly insidious: the appearance of structured evaluation applied to a process that actually requires human judgement. An agent that scores fifty concept candidates against a rubric of criteria produces a ranked list that looks like an evaluation. It is not. The scores reflect the agent's interpretation of the rubric and the training distribution from which it extrapolates customer preferences — neither of which is a reliable substitute for a product manager's understanding of actual user needs in a specific market context. This is compounded by the calibration problem documented in Chapter 12: models express confidence that is a poor predictor of actual accuracy, and when an agent reasons through why one concept ranks above another, that expressed reasoning may be post-hoc rationalisation rather than genuine evaluation.5

The independence of the evaluation step is not merely good practice — it is architecturally necessary. Research on LLM planning and reasoning behaviour establishes that the same model cannot reliably verify outputs it has generated: iterative self-critique regimes do not improve accuracy, because the failure modes of generation and verification are shared by the same underlying system.10 In the generate-test-critique architecture this finding implies, the human evaluator is not a luxury — the generate and test steps cannot be collapsed into a single agent loop without losing the quality signal the separation is designed to provide.

The deployment pattern that works: use agents for breadth in generation, human experts for depth in evaluation. The handoff is explicit and the roles are not blurred.

The Intellectual Property Dimension

Ideation with generative AI raises an IP question that organisations are still working through: who owns the output? The U.S. Copyright Office's 2025 analysis of AI copyrightability provides the most precise available guidance: copyright protection requires human authorship of expressive elements, and prompts alone do not establish it — even highly detailed prompts do not give the prompter sufficient control over the model's outputs to constitute creative authorship.6 Protection is established when a human's creative selection, arrangement, or modification of AI-generated material is perceptible in the final work.

For product teams, this has a direct operational implication beyond legal compliance. The generation/evaluation split recommended in this chapter has a parallel IP logic: teams where human designers actively select, refine, and develop from agent-generated concepts are building a defensible authorship position; teams where agents generate and humans merely approve are not. The distinction between active creative contribution and passive approval is the same distinction the Copyright Office draws between protected and unprotected human involvement.

In domains where the creative output is the product — software products where code is the competitive asset, design products where visual originality is the source of value — legal counsel should be engaged on the IP implications of agentic generation before the workflow is standardised.6


Stage 3: Specification and Planning

Specification is unglamorous and essential. The work of translating a validated concept into the detailed functional, technical, and business requirements that engineers and designers can build against is slow, prone to gaps, and frequently the source of expensive downstream rework when the gaps are discovered during development. It is also one of the stages where agents add consistent, measurable value.

The core applications are:

Requirements drafting. Agents can generate structured first drafts of functional requirements documents, user stories, acceptance criteria, and technical specifications from a higher-level brief. The draft will not be correct without human review — it will contain gaps, ambiguities, and assumptions that need to be resolved — but it provides a starting point that significantly reduces the time to a complete, reviewed specification.

Dependency and risk mapping. Given a feature specification, an agent with access to the existing system architecture and codebase can identify likely dependencies, surface potential conflicts with existing functionality, and flag areas of technical complexity. This is not a substitute for an architect's review, but it provides an informed starting point for that review and surfaces issues that might otherwise be discovered late.

Estimation support. Effort estimation is one of the most consistently inaccurate activities in product development. Agents trained on historical project data can generate reference-class estimates — "tasks with similar characteristics in this codebase have taken between X and Y days" — that provide a more grounded starting point than blank-slate estimation. The value is in the anchoring, not in the precision of the estimate itself.

Backlog management. As product backlogs grow, the cost of managing them — triaging, prioritising, de-duplicating, and keeping specifications current — grows with them. Agents can perform routine backlog hygiene: flagging duplicate or near-duplicate items, identifying stale requirements that reference deprecated functionality, and generating summaries of thematic clusters that make prioritisation conversations more efficient. Requirement changes mid-sprint, however, represent a replanning task — generating a valid specification from a materially changed state — and benchmark evidence finds that LLM planning performance on precisely this task type is lower than even its already-poor baseline plan generation scores.8 Human judgement at change-point reviews is structurally irreplaceable rather than merely preferred.
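The near-duplicate flagging mentioned under backlog management can be approximated without any model at all. The sketch below uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure; the 0.8 threshold and the sample backlog are assumptions for illustration, and a production system would more likely compare embeddings than raw strings.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicates(items: dict[str, str], threshold: float = 0.8):
    """Return (id_a, id_b, similarity) for pairs of similar backlog titles."""
    flagged = []
    for (id_a, text_a), (id_b, text_b) in combinations(items.items(), 2):
        ratio = SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio()
        if ratio >= threshold:
            flagged.append((id_a, id_b, round(ratio, 2)))
    return flagged


# Hypothetical backlog entries.
backlog = {
    "PB-101": "Add CSV export to reports page",
    "PB-187": "Add CSV export to the reports page",
    "PB-204": "Fix login timeout on mobile",
}

for id_a, id_b, score in near_duplicates(backlog):
    print(f"{id_a} and {id_b} look similar (score {score}); a human decides whether to merge")
```

The output is a list of candidates for a person to confirm, not an automatic merge, which keeps the hygiene task separate from the prioritisation judgement.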

There are two structural limitations on specification agents that are easy to overlook. The first is the horizon-collapse problem documented in Chapter 12: an agent generating a functional specification handles the first few requirements coherently but fails to account for how early requirements constrain later ones, producing a document that is locally coherent section by section but contradictory or incomplete at the level of the whole.8 The second compounds the first: expanding a high-level brief into a fully specified acceptance criteria set is a partial-to-full goal expansion task, where unstated constraints must be inferred and satisfied across the full requirement set. Benchmark testing finds this is precisely where LLM planning performance degrades most sharply, with human accuracy at 78% against under 5% for leading models on equivalent structured planning tasks in the same domain.8

Asking the drafting agent to review its own specification for these gaps will not resolve them. Empirical evaluation establishes that LLMs are as unable to verify plans as they are to generate them: iterative self-critique regimes show no improvement over baseline generation and can make results worse by failing to recognise correct outputs already produced.10 The review must come from an independent source — a human architect checking constraint interactions across sections, or a structured tool verifying that acceptance criteria are consistent with the stated dependencies.
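One form the structured tool mentioned above might take is a deterministic check that runs independently of the drafting model. The sketch below assumes requirements are captured as records with explicit dependency lists; the `Requirement` type, its fields and the sample specification are hypothetical, and the check covers only undefined and self-referential dependencies, not the full constraint interactions a human architect would review.

```python
from dataclasses import dataclass, field


@dataclass
class Requirement:
    req_id: str
    text: str
    depends_on: list[str] = field(default_factory=list)


def check_dependencies(requirements: list[Requirement]) -> list[str]:
    """Return the problems found; an empty list means these checks passed."""
    known = {r.req_id for r in requirements}
    problems = []
    for req in requirements:
        for dep in req.depends_on:
            if dep == req.req_id:
                problems.append(f"{req.req_id} depends on itself")
            elif dep not in known:
                problems.append(f"{req.req_id} depends on undefined {dep}")
    return problems


# Hypothetical agent-drafted specification with a dangling reference.
spec = [
    Requirement("R1", "User can export a report"),
    Requirement("R2", "Export respects role permissions", ["R1", "R7"]),
]

problems = check_dependencies(spec)
print(problems or "no dependency problems found")
```

Because the check does not share the drafting model's failure modes, a pass or fail here carries information that a follow-up prompt to the same agent would not.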

Key takeaway: Specification work is where agent-generated first drafts have the highest unambiguous ROI — but the structural limitation of horizon collapse, the partial-to-full expansion failure, and the inability to self-verify mean that agent-generated specifications require genuinely independent review, not a follow-up prompt to the same agent.


Stage 4: Engineering and Development

Software engineering is the domain where agentic AI has moved furthest and fastest, and where the performance evidence is most extensive. Chapter 4 documented coding as the largest single category of enterprise AI spend, and Chapter 16 mapped the maturity stages through which agentic coding deployments progress, from inline suggestion (Stage 2) through autonomous issue resolution (Stage 3) to orchestrated multi-agent development pipelines (Stages 4–5). Those maturity stages are numbered independently of the lifecycle stages in this chapter and are referred to below as autonomy tiers. Readers should consult that framework for the full infrastructure requirements at each tier. What matters specifically in the product lifecycle context is a question Chapter 16 does not address: which tier of autonomy is appropriate at each point in a product's development?

The answer is not uniform across the lifecycle. The autonomy tier should be calibrated to how well-specified the task is and how costly a wrong assumption becomes if propagated through the pipeline unchecked.

Early in a new product's development — when the design is still exploratory, specifications are fluid, and assumptions are being tested — higher autonomy tiers create compounding risk. A multi-agent pipeline that implements a feature against a specification that will change tomorrow does not save time; it creates rework at the scale of the pipeline's output. The value of agentic coding at this stage concentrates in the lower tiers: code completion and suggestion, where the developer retains tight control over the implementation direction.

As a product matures, specifications stabilise, the codebase develops known patterns, and the categories of work become predictable. This is where higher autonomy tiers earn their cost. Autonomous issue resolution and multi-agent pipelines operate best on well-scoped, well-specified tasks against a stable codebase — precisely the conditions that mature products provide and new products do not.
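The calibration rule can be written down as an explicit policy, which at least makes the thresholds arguable. The sketch below is one possible encoding: the 0-to-1 scoring scale, the cut-off values and the three tier names are assumptions introduced for illustration, not values taken from Chapter 16 or from any benchmark.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    SUGGESTION = 1            # completion and suggestion, developer in control
    ISSUE_RESOLUTION = 2      # agent resolves well-scoped issues autonomously
    MULTI_AGENT_PIPELINE = 3  # orchestrated multi-agent development


def recommend_autonomy(spec_stability: float, error_cost: float) -> Autonomy:
    """Pick a tier from two subjective scores in [0, 1] (hypothetical scale).

    spec_stability: how unlikely the specification is to change soon.
    error_cost: how expensive a wrong assumption is once it propagates.
    """
    if not (0 <= spec_stability <= 1 and 0 <= error_cost <= 1):
        raise ValueError("scores must lie between 0 and 1")
    if spec_stability < 0.5:
        # Fluid specifications: keep the developer in tight control.
        return Autonomy.SUGGESTION
    if spec_stability >= 0.8 and error_cost <= 0.3:
        return Autonomy.MULTI_AGENT_PIPELINE
    return Autonomy.ISSUE_RESOLUTION


print(recommend_autonomy(0.3, 0.2).name)  # exploratory new product
print(recommend_autonomy(0.9, 0.2).name)  # mature product, contained risk
print(recommend_autonomy(0.9, 0.7).name)  # mature product, costly errors
```

The value of a function like this lies less in the numbers than in forcing the team to state, per task, how stable the specification is and what a wrong assumption would cost.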

The failure modes that arise when this calibration is wrong are not hypothetical. They are the compound failures documented in Chapter 12: hallucination generated by one agent accepted as valid by the next, goal drift across pipeline steps, errors that are structurally invisible until they surface in testing or production.7 In a product development pipeline, these compound errors do not just affect the current feature — they can propagate into the codebase in ways that create maintenance costs for subsequent development cycles.

A related limitation affects agents asked to apply established architectural patterns to new feature implementations. Benchmark testing finds near-zero performance in LLMs identifying when a previously demonstrated solution structure applies to a new problem and applying it correctly, suggesting that agentic coding systems generate fresh implementations rather than reliably recognising applicable patterns.8 In practice this means an agent may implement the same underlying pattern inconsistently across the codebase — creating technical debt that an experienced engineer applying a recognised pattern deliberately would not generate.

Test Generation and Quality Assurance

Test generation is one of the highest-value, lowest-risk applications of agents in development. Writing tests is the engineering task most consistently deferred under time pressure — because it is perceived as secondary to writing the feature — and the consequences of under-testing are paid late, when regressions are discovered in production rather than in development.

An agent that generates a test suite for newly written code — covering happy paths, edge cases, and the boundary conditions most likely to produce failures — does not replace engineering judgement about what matters to test. But it removes the blank-page problem for test writing in the same way that specification agents remove it for requirements drafting. The coverage improves, the time cost is lower, and the tests are written when the code is fresh rather than as a deferred task.

Agents in QA extend beyond test generation to test execution and analysis: running test suites, correlating failures with recent code changes, identifying regression patterns across releases, and generating structured diagnostic summaries. Evaluation-driven development — building continuous evaluation pipelines that run against every commit and provide structured feedback across quality dimensions — is the infrastructure that makes this sustainable at scale.9 The same process model positions evaluation as a quality gate at each lifecycle stage transition, not only at testing: each handoff from specification to development, and from development to launch, benefits from evaluation criteria designed before the transition rather than verified only after it.9


Stage 5: Launch and Release

Launch is the stage where the cost of errors is highest and the tolerance for agent autonomy is lowest. The outputs of this stage — release notes, user documentation, marketing materials, support articles, in-product copy — are customer-facing, public, and difficult to retract. They are also, in most organisations, produced under time pressure that creates the conditions for insufficient review.

Agents are well-suited to the generative work of this stage: drafting release notes from a structured changelog, generating user documentation from a functional specification, producing first drafts of support articles from a feature brief. The time savings are real, particularly when a release involves many features that each require their own documentation.

The deployment pattern that works is the same as at the specification stage: agent generates the first draft, human expert reviews and edits before publication. The review must be substantive — a check that the draft accurately represents the feature, uses the organisation's voice and terminology correctly, and does not contain technically incorrect statements — not a light pass for obvious errors.

The failure mode to design against is the review that does not happen because the draft looks good. This is where the acceleration trap bites hardest at launch. Agent-generated documentation that is fluent, well-structured, and broadly accurate but contains a subtle inaccuracy in a key instruction is more dangerous than a draft with obvious errors — the obvious errors would trigger a review; the plausible ones may not. Research on language model outputs establishes that structurally normal, confidently expressed output can be wrong in ways that are difficult to detect without independent verification against ground truth — asymmetric visibility is a fundamental property of how these systems generate text, not a correctable defect.1 The evaluation infrastructure for this stage should therefore include a factual accuracy check against the underlying feature specification, not just an editorial review for clarity and voice.
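A factual accuracy check against the specification does not have to be sophisticated to catch the kind of plausible error described above. As a deliberately narrow sketch, the function below flags any number in a documentation draft that never appears in the specification; the function name and the sample texts are hypothetical, and a real check would also need to cover units, feature names and procedural steps.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unsupported_numbers(draft: str, specification: str) -> list[str]:
    """Numbers that appear in the draft but nowhere in the specification."""
    spec_numbers = set(NUMBER.findall(specification))
    flagged: list[str] = []
    for number in NUMBER.findall(draft):
        if number not in spec_numbers and number not in flagged:
            flagged.append(number)
    return flagged


# Hypothetical specification and agent-drafted help article.
spec_text = "Exports are limited to 5000 rows and expire after 30 days."
draft_text = "You can export up to 50000 rows. Export links expire after 30 days."

print(unsupported_numbers(draft_text, spec_text))
```

Here the draft reads fluently and gets the expiry period right, but states a row limit the specification does not contain. An editorial review for clarity and voice would have no reason to stop on it.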

Key takeaway: At launch, the acceleration trap is at its most dangerous, because the outputs are public and the review pressure is highest. Fluency is not accuracy. A documentation review that stops at "this reads well" is not a quality gate.


Stage 6: Post-Launch Iteration

Post-launch is where the feedback loop that justifies the whole pipeline either closes or doesn't. The product is in the hands of users; the question is whether the organisation is learning from that usage fast enough to improve the product at the speed users expect.

Agents offer substantial value in this stage because the volume of feedback signals generated by a deployed product routinely exceeds what human analysis teams can process in a useful timeframe. The applications mirror those in the discovery stage but operate continuously rather than as a periodic programme:

Usage analytics interpretation. Agents that monitor product telemetry can surface anomalies, identify friction points in user flows, and generate hypotheses about the causes of observed behaviour patterns. The agent does not interpret the data with a product manager's contextual understanding — it pattern-matches against what it observes. The value is in surfacing signals that would otherwise be lost in volume; the interpretation remains human.

Support ticket analysis. Every support ticket is a signal about product quality, usability, or documentation gaps. The volume that flows through a mature product's support channel makes comprehensive manual analysis impractical. Agents that continuously classify, cluster, and summarise ticket patterns give product teams a real-time view of where users are struggling — a view that previously required a monthly analysis programme.

Regression detection. Post-launch, any change to the product can introduce a regression. Continuous agents monitoring error rates, performance metrics, and anomalous usage patterns against pre-deployment baselines provide an early warning system for degradations that would otherwise surface first in customer complaints.

Competitive monitoring continuity. The intelligence-gathering from Stage 1 does not end at launch. Markets evolve, competitors respond, and user expectations shift. Agents that maintain continuous competitive and market monitoring keep product teams informed without requiring a periodic research programme.
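The regression detection pattern above, comparing live metrics against a pre-deployment baseline, reduces to a simple statistical test in its most basic form. The sketch below assumes error rates sampled at regular intervals and uses a threshold of three baseline standard deviations; both the threshold and the sample values are assumptions for illustration.

```python
from statistics import mean, stdev


def detect_regression(
    baseline: list[float], current: list[float], k: float = 3.0
) -> bool:
    """True when the current mean exceeds the baseline mean by more than
    k baseline standard deviations."""
    if len(baseline) < 2 or not current:
        raise ValueError("need at least two baseline samples and one current sample")
    threshold = mean(baseline) + k * stdev(baseline)
    return mean(current) > threshold


# Hypothetical error rates (fraction of failed requests per interval).
pre_deploy = [0.010, 0.012, 0.011, 0.009, 0.010]
post_deploy_bad = [0.018, 0.021, 0.019]
post_deploy_ok = [0.011, 0.010, 0.012]

print(detect_regression(pre_deploy, post_deploy_bad))
print(detect_regression(pre_deploy, post_deploy_ok))
```

A check like this only raises the flag. Deciding whether the change is a genuine degradation, and what to do about it, remains with the product and engineering team.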

Key takeaway: The post-launch stage is where the feedback loop that justifies the entire development investment either closes tightly or leaks. Agents don't close the loop — they make the signals visible at the speed and volume that closing it actually requires.


IP, Authorship, and the Question of Craft

Two questions about the human role in agentic product development deserve explicit treatment before they become implicit and unexamined.

The IP question was introduced in the ideation section. It extends across the full lifecycle. In most product development contexts, what the organisation needs is not copyrightable output per se — it is functional products and their documentation. Copyright ownership of the specific text of a user story or a test file is rarely commercially significant. The IP question becomes more pointed in domains where the creative output is the product: software products where the code itself is the competitive asset, written products where the text is what is sold, and design products where the visual or functional originality is the source of value. In these domains, legal counsel should be engaged on the IP implications of agentic generation before the workflow is standardised.6

The craft question is less legalistic and more organisational. Product development capability (the institutional knowledge of how to write a good functional specification, how to structure a test suite, how to read customer research with appropriate scepticism) is built through practice. If agents handle the generative work routinely, the junior practitioners who would previously have built that capability by doing the work are left reviewing outputs instead: at best outputs that a senior practitioner wrote before them, at worst outputs that the agent wrote and a senior practitioner approved. Either way, they review without having built their own foundational understanding.

This is the same concern raised in Chapter 17 regarding junior engineers and coding tools. It applies equally across product disciplines. A junior product manager who reviews fifty agent-generated user stories a week without ever writing one from scratch is not building the craft that will eventually allow them to identify when the agent's story is wrong in a way that matters. The response is not to withhold agentic tools from junior practitioners — the productivity benefit to the organisation is real. It is to be deliberate about the learning experiences that produce capability, and to ensure those experiences are not entirely displaced by the efficiency of agent-generated first drafts.


What Agents Cannot Accelerate

The honest account of agents in the product lifecycle includes a list of what they cannot compress, regardless of how capable they become. These are not temporary limitations waiting for the next model generation to close — they are structural features of the work itself.

The time to understand users. Customer empathy is built through direct contact — interviews, observation, usage sessions, conversations with support teams. Agents can synthesise research findings, but they cannot replace the researcher's direct experience of watching a user struggle with a feature they designed. The product teams that understand their users best have not reduced that contact time; they have used agents to reduce the administrative cost of synthesis so more time is available for contact. An agent that produces a research summary in an hour instead of a day does not create understanding — it frees the researcher to create more of it.

The time for ideas to be tested by reality. Features need to be in front of real users before their quality is known. Agents can compress the time to build a feature; they cannot compress the time for a sufficient number of users to use it for a sufficient period to generate reliable signal about whether it works. The feedback loop requires real users, real conditions, and real time.

The time for organisational alignment. Product decisions that affect multiple teams, that require trade-offs between competing priorities, or that involve significant resource commitments cannot be accelerated by agent-generated analysis alone. The alignment work — the conversations, the reviews, the escalations — takes the time it takes because it is doing social and institutional work that analysis cannot substitute for. Faster analysis can inform alignment conversations; it cannot replace them.

The time to develop product instinct. The judgement that distinguishes a good product manager from a competent one is built over years of making decisions and observing their consequences. Agents can surface information relevant to those decisions more quickly. They cannot shortcut the accumulation of experience that makes the decisions good. Benchmark evidence is unambiguous about the gap: on structured planning and reasoning tasks equivalent in difficulty to the decisions product development routinely requires, human accuracy sits at 78% where the best language models reach under 5% — not because the humans are experts and the tasks are simple, but because the tasks require constraint satisfaction across interacting goals that pattern-matching cannot reliably perform.8 This is the deepest expression of the craft concern: instinct is pattern recognition built through practice, and it is precisely what the acceleration trap slowly degrades when review is treated as a bottleneck rather than as where the value is actually created.

Key takeaway: Agents compress the scaffolding. They do not compress the substance. The scaffolding is worth compressing — but only if the time recovered is invested in the substance, not consumed by producing more scaffolding faster.


References

  1. Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., & Fung, P. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 55(12), Article 248, 1–38. https://doi.org/10.1145/3571730 (Previously cited as Ch.13, ref. 1.)
  2. Brynjolfsson, E., Li, D., & Raymond, L.R. (2023). Generative AI at Work. National Bureau of Economic Research. NBER Working Paper No. 31161. (Background reference. Previously cited as Ch.7, ref. 6.)
  3. Wharton Human-AI Research & GBK Collective (2025). Accountable Acceleration: Gen AI Fast-Tracks Into the Enterprise. Wharton Human-AI Research & GBK Collective, University of Pennsylvania. October 2025. (Background reference. Previously cited as Ch.4, ref. 3; Ch.8, ref. 5; Ch.16, ref. 6.)
  4. McKinsey & Company (2025). The State of AI in 2025: Agents, Innovation, and Transformation. QuantumBlack, AI by McKinsey. November 2025. (Background reference. Previously cited as Ch.8, ref. 3; Ch.16, ref. 1.)
  5. Turpin, M., Michael, J., Perez, E., & Bowman, S.R. (2023). Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting. Advances in Neural Information Processing Systems (NeurIPS 2023), 36. New York University / Anthropic. (Previously cited as Ch.13, ref. 5.)
  6. U.S. Copyright Office (2025). Copyright and Artificial Intelligence, Part 2: Copyrightability. Report of the Register of Copyrights, January 2025. https://www.copyright.gov/ai/Copyright-and-Artificial-Intelligence-Part-2-Copyrightability-Report.pdf
  7. Guo, T., Chen, X., Wang, Y., Chang, R., Pei, S., Chawla, N.V., Wiest, O., & Zhang, X. (2024). Large Language Model Based Multi-Agents: A Survey of Progress and Challenges. University of Notre Dame. (Previously cited as Ch.5, ref. 2; Ch.13, ref. 2.)
  8. Valmeekam, K., Olmo, A., Sreedharan, S., & Kambhampati, S. (2022). Large Language Models Still Can't Plan (A Benchmark for LLMs on Planning and Reasoning about Change). NeurIPS 2022 Workshop on Foundational Models for Decision Making. Arizona State University / Colorado State University. (Previously cited as Ch.13, ref. 3.)
  9. Xia, B., Lu, Q., Zhu, L., Xing, Z., Zhao, D., & Zhang, H. (2024). Evaluation-Driven Development and Operations of LLM Agents: A Process Model and Reference Architecture. CSIRO's Data61 / Adelaide University / UNSW. arXiv:2411.13768. (Previously cited as Ch.12, ref. 7; Ch.16, ref. 4.)
  10. Kambhampati, S., Valmeekam, K., Guan, L., Verma, M., Stechly, K., Bhambri, S., Saldyt, L., & Murthy, A. (2024). LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks. Proceedings of the 41st International Conference on Machine Learning (ICML 2024). arXiv:2402.01817. Arizona State University. (Previously cited as Ch.13, ref. 4.)
