Chapter 3 · How AI Learned to See, Hear, and Read

The multimodal revolution and what it means for Agentic AI.

The End of the Text-Only Era

For most of the history of AI, models were specialists. A vision model could recognize objects but could not explain what it saw. A language model could write eloquently about a sunset but had no concept of what one looked like. That era is now over.

The rise of Multimodal Large Language Models (MLLMs) represents one of the most consequential shifts in AI development since the Transformer architecture itself. These systems do not simply switch between modalities — they reason across them simultaneously, much the way a human professional reads a chart, listens to an explanation, and formulates a response all at once.

For Agentic AI specifically, this convergence is not a nice-to-have enhancement. It is a prerequisite. Chapter 2 argued that an agent must be economically viable; this chapter adds the next condition: it must be able to perceive the working environment in which the task exists. An agent operating in the real world must be able to perceive that world in all its richness — documents, screenshots, voice commands, sensor feeds, video streams — and act upon that perception intelligently.

How Multimodal Models Work

Modern MLLMs are trained on mixed data: vast corpora containing text alongside images, transcripts alongside audio, and code alongside natural-language explanations. Training on these paired modalities allows the model to develop shared representations rather than separate, modality-specific ones.

The technical machinery combines three components: a vision encoder that converts images or video frames into numerical embeddings; a language model backbone that handles reasoning and generation; and a cross-modal bridge that aligns visual embeddings into the same conceptual space as the language model's internal representations.

Key takeaway: Most multimodal models are built from three parts (something that sees, something that thinks, and a bridge that makes them speak the same language).
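To make the three-part structure concrete, here is a minimal sketch in plain Python. It assumes the simplest kind of bridge, a single linear projection, and uses random numbers in place of a real vision encoder and language model. The dimensions, names, and weights are illustrative assumptions, not taken from any particular model.

```python
import random

def linear_project(vectors, weight, bias):
    """Apply y = W x + b to each vector (a plain linear bridge)."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]
        for vec in vectors
    ]

random.seed(0)
VISION_DIM, LLM_DIM = 4, 6  # toy sizes; real models use far wider embeddings

# 1. "Something that sees": pretend a vision encoder produced 3 patch embeddings.
patch_embeddings = [[random.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(3)]

# 2. "A bridge": project each patch into the language model's embedding width.
weight = [[random.gauss(0, 0.1) for _ in range(VISION_DIM)] for _ in range(LLM_DIM)]
bias = [0.0] * LLM_DIM
visual_tokens = linear_project(patch_embeddings, weight, bias)

# 3. "Something that thinks": the backbone receives one sequence of same-width vectors.
text_tokens = [[random.gauss(0, 1) for _ in range(LLM_DIM)] for _ in range(5)]
llm_input = visual_tokens + text_tokens

print(len(llm_input), "tokens of width", len(llm_input[0]))
```

After projection, the backbone cannot tell from shape alone which vectors began as pixels and which began as words. That is the sense in which the bridge makes the two sides "speak the same language".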

The field has undergone a significant architectural shift: early MLLMs trained vision and language jointly from scratch, while modern systems plug vision encoders into pretrained LLM backbones, leveraging reasoning capabilities already embedded in large language models. Models also differ significantly in how they bridge the two modalities — from simple linear projections and MLPs, to learnable query-based transformers (Q-Former), to dense cross-attention layers woven directly into the language backbone.⁴ Training now typically proceeds through multiple stages: large-scale pre-training to align modalities, instruction tuning for task versatility, and increasingly a reinforcement learning alignment stage to improve reasoning reliability and safety.³

Key takeaway: Modern multimodal models don't start from zero — they attach a vision system to a language model that already knows how to reason, then train the connection in stages.
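The bridging options mentioned above differ in how many visual tokens they hand to the language model. The sketch below illustrates the query-based idea in its barest form: a small, fixed set of query vectors attends over many patch embeddings and returns one pooled vector per query. It is a simplified, single-head illustration with random values. A real Q-Former adds learned projections, multiple layers, and training, none of which is shown here.

```python
import math
import random

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def query_pool(queries, patches):
    """Single-head dot-product attention: each query returns a weighted mix of patches."""
    dim = len(patches[0])
    pooled = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, p)) / math.sqrt(dim) for p in patches]
        weights = softmax(scores)
        pooled.append([sum(w * p[i] for w, p in zip(weights, patches)) for i in range(dim)])
    return pooled

random.seed(0)
DIM, NUM_PATCHES, NUM_QUERIES = 8, 196, 4  # toy sizes chosen for illustration
patches = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]
# In a real model these queries are learned parameters; here they are random.
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

visual_tokens = query_pool(queries, patches)
print(len(patches), "patches ->", len(visual_tokens), "visual tokens")
```

A linear or MLP projection keeps one token per patch, while a query-based bridge compresses the image to a fixed token budget. That is one reason the choice of bridge affects both cost and how much fine visual detail reaches the backbone.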

Multimodal Input vs. Multimodal Agency

A model that can read images is not automatically an agent. Multimodal input means the system can interpret different kinds of data. Multimodal agency begins only when that perception is connected to goals, tools, memory, state, feedback, and action. A model may describe a screenshot; an agent can inspect the screenshot, infer the user's intent, open the relevant application, change a setting, verify the result, and report back.

This distinction matters because many enterprise systems are marketed as agentic when they are really multimodal assistants wrapped in a fixed workflow. The agentic threshold is crossed when perception changes the next action the system chooses to take.
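The difference between describing and acting can be sketched in a few lines. Everything in this example is hypothetical: `FakeSettingsApp` stands in for a real application, and `Screen` stands in for whatever a multimodal model extracts from a screenshot. The point is only the shape of the loop, in which what the system perceives decides what it does next.

```python
from dataclasses import dataclass

@dataclass
class Screen:
    """Hypothetical stand-in for what a multimodal model reads off a screenshot."""
    dark_mode_enabled: bool

class FakeSettingsApp:
    """Toy environment; a real agent would drive a GUI or an API instead."""
    def __init__(self):
        self._dark_mode = False

    def screenshot(self) -> Screen:
        return Screen(dark_mode_enabled=self._dark_mode)

    def toggle_dark_mode(self) -> None:
        self._dark_mode = not self._dark_mode

def describe(screen: Screen) -> str:
    """Multimodal input only: interpret the screen, change nothing."""
    return f"Dark mode is {'on' if screen.dark_mode_enabled else 'off'}."

def run_agent(app: FakeSettingsApp, goal_dark_mode: bool, max_steps: int = 3) -> list[str]:
    """Multimodal agency: perception selects the next action, then is re-checked."""
    log = []
    for _ in range(max_steps):
        screen = app.screenshot()                        # perceive
        if screen.dark_mode_enabled == goal_dark_mode:   # verify against the goal
            log.append("verified: goal reached")
            return log
        app.toggle_dark_mode()                           # act on what was seen
        log.append("action: toggled dark mode")
    log.append("gave up: goal not reached")
    return log

app = FakeSettingsApp()
print(describe(app.screenshot()))
print(run_agent(app, goal_dark_mode=True))
```

A fixed workflow would toggle the setting regardless of what the screen showed. The agent version acts only when the perceived state differs from the goal, and it looks again before reporting success.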

The Modality Landscape

Modality	Input Tasks	Output Tasks	Maturity
Text	Reading, comprehension, Q&A	Writing, summarisation, translation	✦ Production
Images	Object recognition, OCR, chart analysis	Image generation, captioning	✦ Production
Audio	Speech recognition, tone detection	Speech synthesis, music generation	✦ Production
Code	Bug detection, code review	Code generation, refactoring	✦ Production
Documents	PDF/form parsing, layout understanding	Report generation, data extraction	✦ Production
Video	Action recognition, temporal reasoning	Video generation, captioning	◐ Maturing
3D / Spatial	Point cloud analysis, depth estimation	3D model generation, navigation	○ Early Stage
Sensor Data	IoT inputs, robotics perception	Control signals, anomaly alerts	○ Research

Enterprise Document Intelligence

For enterprise readers, the most important multimodal frontier is often not dramatic video generation or robotics. It is document intelligence: the ability to interpret PDFs, scanned forms, contracts, invoices, slide decks, dashboards, handwritten notes, charts, tables, and screenshots. These are the formats in which organisational knowledge actually lives.

A finance agent may need to read a spreadsheet screenshot embedded in a presentation, reconcile it against a PDF invoice, and then update a forecasting model. A legal agent may need to compare contract clauses across scanned documents. An HR agent may need to interpret policy PDFs, forms, and employee records without losing layout-dependent meaning. In all of these cases, text extraction alone is insufficient. The agent must understand structure, spatial layout, tables, visual emphasis, and the relationship between labels and values.

This is also where risk concentrates. A model that misreads a chart axis, invents a missing table value, or confuses two similar document sections can build a confident downstream action on a false perception. Multimodal capability therefore increases both opportunity and responsibility.
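One practical response to this risk is to check what the model extracted before letting an agent act on it. The sketch below assumes a hypothetical extraction result for an invoice, including a model-reported confidence score that not every system provides, and applies two cheap checks: a confidence threshold and an arithmetic reconciliation. The thresholds are placeholders rather than recommendations.

```python
from dataclasses import dataclass

@dataclass
class ExtractedInvoice:
    """Hypothetical output of a document-understanding model."""
    line_items: list[float]
    stated_total: float
    confidence: float  # assumed model-reported score between 0.0 and 1.0

def safe_to_act(invoice: ExtractedInvoice, min_confidence: float = 0.9,
                tolerance: float = 0.01) -> tuple[bool, str]:
    """Gate downstream actions on simple consistency checks."""
    if invoice.confidence < min_confidence:
        return False, "low extraction confidence: route to human review"
    computed = sum(invoice.line_items)
    if abs(computed - invoice.stated_total) > tolerance:
        return False, (f"total mismatch: items sum to {computed:.2f}, "
                       f"document states {invoice.stated_total:.2f}")
    return True, "checks passed"

good = ExtractedInvoice([120.00, 80.50], 200.50, 0.97)
misread = ExtractedInvoice([120.00, 80.50], 250.50, 0.97)  # e.g. one digit misread
print(safe_to_act(good))
print(safe_to_act(misread))
```

Checks like these cannot prove that a perception is right, but they can catch a misread that contradicts the document's own internal arithmetic before it becomes a confident downstream action.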

Leading Models Compared

Snapshot as of May 2026. Frontier model capabilities, prices, context windows, benchmark standings, and product names change rapidly. These tables are useful for strategic orientation, but operational decisions should always be checked against current official provider documentation before deployment.⁸ ⁹ ¹⁰ ¹¹

Architecture Overview

Model Family	Developer	Architecture	License	Primary Strength
GPT-5 / GPT-4o	OpenAI	Unified multimodal Transformer	Proprietary	General reasoning, creative tasks
Gemini 3 Pro / Flash	Google DeepMind	Native multimodal Transformer	Proprietary	Multimodal breadth, Google ecosystem
Claude 4 Opus / Sonnet	Anthropic	Hybrid reasoning Transformer	Proprietary	Safety, coding, enterprise reliability
Llama 4 Scout / Maverick	Meta AI	Mixture-of-Experts (MoE)	Open-weight	Open-weight flexibility, long context
Qwen2.5-VL 72B	Alibaba	Vision-Language Transformer	Open-weight	Multilingual, visual reasoning
Pixtral 12B	Mistral AI	Dense multimodal Transformer	Open-weight	Lightweight, deployable
MiniCPM-V 8B	ModelBest / Tsinghua	Compact multimodal Transformer	Open-weight	Edge deployment, mobile

Modality Support

Model	Text	Images	Audio	Video	Code	Context Window
GPT-5.2	✦ Excellent	✦ Excellent	✦ Native	✦ Yes	✦ Excellent	400K
Gemini 3 Pro	✦ Excellent	✦ Excellent	✦ Native	✦ Yes	✦ Strong	1M
Claude Sonnet 4.6	✦ Excellent	✦ Strong	◐ Limited	◐ Limited	✦ Excellent	200K / 1M β
Llama 4 Maverick	✦ Strong	✦ Strong	◐ Partial	✦ Yes	✦ Strong	1M
Llama 4 Scout	✦ Strong	✦ Strong	◐ Partial	✦ Yes	✦ Good	10M
Qwen2.5-VL 72B	✦ Strong	✦ Strong	◐ Limited	◐ Partial	✦ Good	128K
MiniCPM-V 8B	✦ Good	✦ Good	✕ None	✕ None	◐ Limited	32K

✦ Fully supported · ◐ Partial or experimental · ✕ Not supported · β = beta

Benchmark Performance

Model	MMLU	MATH 500	GPQA Diamond	SWE-bench
GPT-5.2	~92%	~97%	~93%	~55%
Gemini 3 Pro	~91%	~100% †	~90%	~78%
Gemini 3 Flash	~88%	~100% †	~85%	~78%
Claude Sonnet 4.6	~90%	~96%	~88%	~72%
Llama 4 Maverick	~87%	~93%	~80%	~65%
Qwen2.5-VL 72B	~85%	~90%	~75%	~60%

† With code execution enabled

Deployment and Cost

Model	Deployment	Input Cost / 1M tokens	Best Suited For
GPT-5.2	Cloud API	$1.75	Enterprise reasoning, complex workflows
GPT-5.2 Pro	Cloud API	$21.00	Mission-critical agentic tasks
Gemini 3 Flash	Cloud API	Low-cost tier	High-volume, cost-sensitive workloads
Claude Sonnet 4.6	Cloud / API	Mid-range	Enterprise safety, coding, agents
Llama 4 Maverick	Self-hosted / API	Infrastructure only	Private deployment, data sovereignty
MiniCPM-V 8B	On-device / Edge	Negligible	Mobile, offline, privacy-critical
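The two per-token prices in the table are enough for a rough sense of scale. The following sketch computes input-side cost only, using the listed prices for GPT-5.2 and GPT-5.2 Pro. The workload figures (tokens per task and tasks per day) are assumptions chosen for illustration, and output-token pricing, which the table does not list, is ignored.

```python
# USD per 1M input tokens, as listed in the Deployment and Cost table.
INPUT_PRICE_PER_MILLION = {"GPT-5.2": 1.75, "GPT-5.2 Pro": 21.00}

def input_cost(model: str, input_tokens: int) -> float:
    """Input-side cost in USD for the given number of tokens."""
    return INPUT_PRICE_PER_MILLION[model] * input_tokens / 1_000_000

tokens_per_task = 50_000  # assumption: a long multimodal prompt
tasks_per_day = 1_000     # assumption: a moderate enterprise workload

for model in INPUT_PRICE_PER_MILLION:
    daily = input_cost(model, tokens_per_task * tasks_per_day)
    print(f"{model}: ${daily:,.2f} per day (input only)")
```

At these assumed volumes the two tiers differ by a factor of twelve, which is why the choice of model tier is a routing decision to make per task, not a one-off setting.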

How the Frontier Models Differ in Practice

GPT-5 / GPT-4o (OpenAI) takes the "omni" approach most literally: a single unified architecture ingesting text, images, audio, and code simultaneously. Reported hallucination rates have dropped to around 6%, although the figure depends on the evaluation used. GPT-5.2 introduces a 400K context window and near-perfect performance on the AIME 2025 mathematics benchmark.

Gemini 3 Pro / Flash (Google DeepMind) is the most ambitious attempt to build multimodal AI as a platform rather than a product. Native Google ecosystem integration means agents can access real-world data streams in ways API-only systems cannot. Notably, Gemini 3 Flash matches the Pro variant on SWE-bench coding tasks (both at roughly 78% in the table above) despite being the lower-cost tier.

Claude 4 Opus / Sonnet (Anthropic) stakes out a deliberate position around trustworthiness rather than raw benchmark dominance. Its "extended thinking" mode is designed to produce more accurate outputs on complex multi-step problems. The "computer use" capability, which navigates and operates a graphical user interface, marks a meaningful step toward agents that act directly inside software environments.

Llama 4 Scout / Maverick (Meta AI) represents the open-weight frontier's most serious challenge to proprietary dominance. Its Mixture-of-Experts architecture activates only a fraction of total parameters for each token it processes. Scout's 10 million token context window, the largest among the models compared here, enables analysis of entire codebases in a single pass.
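The idea of activating only a fraction of the parameters can be illustrated with a generic top-k router. This is a textbook-style sketch, not a description of Llama 4's actual routing. The "experts" are toy functions, and the router scores are hard-coded where a real model would compute them with a learned layer, typically once per token.

```python
import math

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return [(i, exps[i] / total) for i in chosen]

def moe_layer(x, experts, router_logits, k=1):
    """Run only the selected experts and blend their outputs by router weight."""
    routes = top_k_route(router_logits, k)
    output = sum(weight * experts[i](x) for i, weight in routes)
    return output, [i for i, _ in routes]

# Four toy "experts"; in a real model each is a feed-forward network.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
router_logits = [0.1, 2.0, -1.0, 0.5]  # hypothetical scores for one token

output, used = moe_layer(3.0, experts, router_logits, k=2)
print("experts used:", used, "of", len(experts))
```

Only two of the four experts run for this input, so compute scales with the number of active experts while total capacity scales with the number of experts stored. The memory needed to hold all of the experts is unchanged, which matters for self-hosted deployment.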

MiniCPM-V 8B (ModelBest / Tsinghua) is best understood as evidence for a broader trend: the model size required to reach GPT-4V-level performance has been falling rapidly, while end-device compute capacity has been rising — a convergence Hu et al. liken to Moore's Law applied to MLLMs.⁵ In practice, 4-bit quantisation reduces the 8B model's memory footprint from roughly 16 GB to around 5 GB, and combined with NPU acceleration it achieves decoding speeds above human reading pace on current flagship smartphones.

Key takeaway: GPT-4V-level multimodal AI is moving off the cloud and onto your phone, and the gap to the frontier is closing faster than most people expect.
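The memory figures quoted for MiniCPM-V follow from simple arithmetic on weight precision. The sketch below counts weight storage only, assuming 16-bit weights before quantisation and exactly 4 bits per weight afterwards.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9

PARAMS = 8e9  # the 8B model discussed above

fp16_gb = weight_memory_gb(PARAMS, 16)
int4_gb = weight_memory_gb(PARAMS, 4)
print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f" 4-bit weights: {int4_gb:.1f} GB")
```

The 16-bit result matches the "roughly 16 GB" starting point. The 4-bit result comes out below the "around 5 GB" cited above. A plausible explanation, offered here as an assumption and not as something the source states, is that practical quantisation schemes store extra scaling data and keep some components at higher precision.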

The Connection to Agentic AI

The relevance of multimodal models to Agentic AI is architectural, not incidental. Chapter 2 showed that agentic systems must be economically routed; multimodality determines what those routed systems can perceive. Consider what a capable enterprise agent faces in a single complex task: a PDF containing scanned tables, a screenshot of a web application, a voice instruction from a user, a video of a process to replicate, and a codebase to modify. A text-only agent cannot handle most of this without extensive preprocessing pipelines.
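The example task can be checked against the Modality Support table to see where preprocessing would still be needed. In this sketch the support levels are transcribed from that table for three of the models. The mapping from each task input to a modality is an assumption, and treating anything short of full support as needing preprocessing is a deliberately strict simplification.

```python
FULL, PARTIAL, NONE = "full", "partial", "none"

# Support levels transcribed from the Modality Support table (✦, ◐, ✕).
MODEL_SUPPORT = {
    "Gemini 3 Pro":      {"text": FULL, "images": FULL, "audio": FULL,    "video": FULL,    "code": FULL},
    "Claude Sonnet 4.6": {"text": FULL, "images": FULL, "audio": PARTIAL, "video": PARTIAL, "code": FULL},
    "MiniCPM-V 8B":      {"text": FULL, "images": FULL, "audio": NONE,    "video": NONE,    "code": PARTIAL},
}

# Inputs from the example task, mapped to modalities (the mapping is an assumption).
TASK_INPUTS = {
    "scanned PDF tables": "images",
    "web app screenshot": "images",
    "voice instruction": "audio",
    "process video": "video",
    "codebase": "code",
}

def preprocessing_needed(model: str) -> list[str]:
    """Task inputs for which the model lacks full native support."""
    support = MODEL_SUPPORT[model]
    return [name for name, modality in TASK_INPUTS.items() if support[modality] != FULL]

for model in MODEL_SUPPORT:
    gaps = preprocessing_needed(model)
    print(f"{model}: {', '.join(gaps) if gaps else 'no preprocessing required'}")
```

A text-only model would fail this check for four of the five inputs, which is the practical meaning of the claim that multimodality is architectural and not incidental.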

Agents that understand the world as humans do, through multiple senses working together, will be the agents most capable of operating effectively alongside humans. Once these capabilities become reliable enough for everyday work, the enterprise question changes from "can the model understand this?" to "where does this create measurable organisational value?"

Current Limitations

Despite remarkable progress, five limitations define the honest state of the field:

Hallucination across modalities — Models may confidently describe something in an image that is not there, or misread a table and build analysis on incorrect numbers. This extends to broader trustworthiness concerns — including bias, safety, and ethical alignment — all of which are now recognised as distinct evaluation dimensions requiring dedicated benchmarks and targeted mitigation strategies.¹
Visual reasoning depth — Supporting an image modality and genuinely understanding images are different things. Specialised visual content such as medical scans or engineering diagrams often reveals significant gaps.
Computational cost — Processing an image alongside text requires significantly more computation than text-only inference, and video more still.
Cross-modal consistency — Maintaining consistent context across simultaneous inputs — spoken instruction, displayed document, live system state — remains an open engineering challenge.
Benchmark saturation and production mismatch — Many frontier models now score so highly on established benchmarks that the tests no longer differentiate meaningfully between them. Evaluation itself has become a specialised discipline, with over 200 dedicated benchmarks now spanning accuracy, hallucination, robustness, safety, and cross-cultural understanding — assessed through a combination of automated metrics, human annotation, and LLM-based judging.² Even then, benchmark performance should be treated as an imperfect signal: a model that performs well on curated multimodal tasks may still fail in noisy production settings where documents are incomplete, screenshots are low resolution, tables are rotated, APIs change, and user instructions are ambiguous. The field is already responding in practice: major community leaderboards, including Vellum's 2026 LLM Leaderboard, have dropped MMLU entirely as saturated and now track harder successor tests such as Humanity's Last Exam and ARC-AGI 2, which target multi-disciplinary breadth and novel pattern recognition rather than knowledge recall.⁷
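The saturation point in the last item can be seen in the chapter's own numbers. The sketch below takes the approximate scores from the Benchmark Performance table and computes, for each benchmark, the top score and the gap between the best and worst model. Treating a narrow gap near the ceiling as a sign of saturation is a rough heuristic, not a formal test, and the MATH 500 figures marked † were obtained with code execution enabled.

```python
# Approximate scores (percent), in table order: GPT-5.2, Gemini 3 Pro,
# Gemini 3 Flash, Claude Sonnet 4.6, Llama 4 Maverick, Qwen2.5-VL 72B.
SCORES = {
    "MMLU":         [92, 91, 88, 90, 87, 85],
    "MATH 500":     [97, 100, 100, 96, 93, 90],
    "GPQA Diamond": [93, 90, 85, 88, 80, 75],
    "SWE-bench":    [55, 78, 78, 72, 65, 60],
}

def spread(values):
    """Gap in percentage points between the best and worst score."""
    return max(values) - min(values)

for name, values in sorted(SCORES.items(), key=lambda item: spread(item[1])):
    print(f"{name}: top score {max(values)}, spread {spread(values)} points")
```

By this measure MMLU separates the listed models least and SWE-bench separates them most, which is consistent with leaderboards moving away from knowledge-recall tests toward harder, task-based ones.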

References

1. Huang, J., & Zhang, J. (2024). A Survey on Evaluation of Multimodal Large Language Models. Nanyang Technological University.
2. Li, J. et al. (2024). A Survey on Benchmarks of Multimodal Large Language Models. Tencent / PKU / NUS.
3. Li, Z. et al. (2025). A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges. University of Maryland.
4. Caffagni, D. et al. (2024). The Revolution of Multimodal Large Language Models: A Survey. University of Modena and Reggio Emilia.
5. Hu, Y. et al. (2025). Efficient GPT-4V Level Multimodal LLM for Edge Devices (MiniCPM-V). ModelBest / Tsinghua University.
6. Coogan, B. et al. (2025). Benchmarking MLLMs for Forensic Science and Medicine. University College London.
7. Vellum AI (2026). LLM Leaderboard 2026. Vellum AI.
8. OpenAI (2026). API Pricing and Model Documentation. OpenAI.
9. Google Cloud (2026). Gemini 3 Pro Model Documentation. Google Cloud.
10. Anthropic (2026). Introducing Claude Sonnet 4.6. Anthropic.
11. Meta AI (2026). Llama 4 Scout Model Card. Meta AI / Hugging Face.
