Chapter 3 · How AI Learned to See, Hear, and Read
The multimodal revolution and what it means for Agentic AI.
The End of the Text-Only Era
For most of the history of AI, models were specialists. A vision model could recognize objects but could not explain what it saw. A language model could write eloquently about a sunset but had no concept of what one looked like. That era is now over.
The rise of Multimodal Large Language Models (MLLMs) represents one of the most consequential shifts in AI development since the Transformer architecture itself. These systems do not simply switch between modalities — they reason across them simultaneously, much the way a human professional reads a chart, listens to an explanation, and formulates a response all at once.
For Agentic AI specifically, this convergence is not a nice-to-have enhancement. It is a prerequisite. An agent operating in the real world must be able to perceive that world in all its richness — documents, screenshots, voice commands, sensor feeds, video streams — and act upon that perception intelligently.
How Multimodal Models Work
Modern MLLMs are trained on mixed data: vast corpora containing text alongside images, transcripts alongside audio, and code alongside natural language explanations. Training on these paired modalities lets the model develop shared representations, so that a concept is encoded similarly whether it arrives as words or as pixels.
The technical machinery combines three components: a vision encoder that converts images or video frames into numerical embeddings; a language model backbone that handles reasoning and generation; and a cross-modal bridge that aligns visual embeddings into the same conceptual space as the language model's internal representations.
Key takeaway: Most multimodal models are built from three parts: something that sees, something that thinks, and a bridge that makes them speak the same language.
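The three-part layout can be sketched in a few lines of plain Python. This is a toy illustration, not any vendor's architecture: the dimensions, the random weights, and the function names (`encode_image`, `project`, `build_llm_input`) are assumptions made for the example, and a single linear projection is only the simplest kind of bridge.

```python
import random

random.seed(0)

VISION_DIM = 4   # toy width of the vision encoder's output (assumed)
LLM_DIM = 6      # toy width of the language model's token embeddings (assumed)


def encode_image(num_patches):
    """Stand-in vision encoder: one random VISION_DIM vector per image patch."""
    return [[random.uniform(-1, 1) for _ in range(VISION_DIM)]
            for _ in range(num_patches)]


# The "bridge": one linear projection from vision space into LLM space.
W = [[random.uniform(-0.5, 0.5) for _ in range(LLM_DIM)]
     for _ in range(VISION_DIM)]


def project(patch_embeddings):
    """Map each patch vector (VISION_DIM) to an LLM-sized vector (LLM_DIM)."""
    return [[sum(vec[i] * W[i][j] for i in range(VISION_DIM))
             for j in range(LLM_DIM)]
            for vec in patch_embeddings]


def build_llm_input(image_tokens, text_tokens):
    """The backbone sees one sequence: projected image tokens, then text tokens."""
    return image_tokens + text_tokens


patches = encode_image(num_patches=3)
image_tokens = project(patches)
text_tokens = [[0.0] * LLM_DIM for _ in range(2)]  # placeholder text embeddings
sequence = build_llm_input(image_tokens, text_tokens)
print(len(sequence), len(sequence[0]))  # 5 6
```

The point of the sketch is the shape change: three 4-dimensional patch vectors become three 6-dimensional "visual tokens" that sit in the same sequence as the text tokens, which is what lets the backbone attend over both.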
The field has undergone a significant architectural shift: early MLLMs trained vision and language jointly from scratch, while modern systems plug vision encoders into pretrained LLM backbones, leveraging reasoning capabilities already embedded in large language models. Models also differ significantly in how they bridge the two modalities, from simple linear projections and MLPs, to learnable query-based transformers (Q-Former), to dense cross-attention layers woven directly into the language backbone [4]. Training now typically proceeds through multiple stages: large-scale pre-training to align modalities, instruction tuning for task versatility, and increasingly a reinforcement learning alignment stage to improve reasoning reliability and safety [3].
Key takeaway: Modern multimodal models don't start from zero — they attach a vision system to a language model that already knows how to reason, then train the connection in stages.
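To make the query-based bridge more concrete, here is a heavily simplified sketch of the idea behind a Q-Former-style connector: a fixed set of learnable queries attends over however many patch embeddings the vision encoder produces, so the language model always receives the same number of visual tokens. The sketch assumes a single attention step with no separate key/value projections and uses random vectors in place of trained weights; real connectors are deeper and trained end to end.

```python
import math
import random

random.seed(1)
DIM = 4  # toy embedding width (assumed)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, patches):
    """Each query returns a softmax-weighted average of the patch vectors."""
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, p)) / math.sqrt(DIM)
                  for p in patches]
        weights = softmax(scores)
        outputs.append([sum(w * p[d] for w, p in zip(weights, patches))
                        for d in range(DIM)])
    return outputs


def random_vectors(n):
    return [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(n)]


queries = random_vectors(2)        # stand-in for learnable queries, fixed count
small_image = random_vectors(9)    # 9 patch embeddings
large_image = random_vectors(64)   # 64 patch embeddings

print(len(cross_attend(queries, small_image)))  # 2
print(len(cross_attend(queries, large_image)))  # 2
```

A linear or MLP projector, by contrast, emits one visual token per patch, so larger images cost more context. A query-based connector trades some visual detail for a fixed token budget.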
The Modality Landscape
| Modality | Input Tasks | Output Tasks | Maturity |
|---|---|---|---|
| Text | Reading, comprehension, Q&A | Writing, summarisation, translation | ✦ Production |
| Images | Object recognition, OCR, chart analysis | Image generation, captioning | ✦ Production |
| Audio | Speech recognition, tone detection | Speech synthesis, music generation | ✦ Production |
| Code | Bug detection, code review | Code generation, refactoring | ✦ Production |
| Documents | PDF/form parsing, layout understanding | Report generation, data extraction | ✦ Production |
| Video | Action recognition, temporal reasoning | Video generation, captioning | ◐ Maturing |
| 3D / Spatial | Point cloud analysis, depth estimation | 3D model generation, navigation | ○ Early Stage |
| Sensor Data | IoT inputs, robotics perception | Control signals, anomaly alerts | ○ Research |
Leading Models Compared
Architecture Overview
| Model Family | Developer | Architecture | License | Primary Strength |
|---|---|---|---|---|
| GPT-5 / GPT-4o | OpenAI | Unified multimodal Transformer | Proprietary | General reasoning, creative tasks |
| Gemini 3 Pro / Flash | Google DeepMind | Native multimodal Transformer | Proprietary | Multimodal breadth, Google ecosystem |
| Claude 4 Opus / Sonnet | Anthropic | Hybrid reasoning Transformer | Proprietary | Safety, coding, enterprise reliability |
| Llama 4 Scout / Maverick | Meta AI | Mixture-of-Experts (MoE) | Open-weight | Open-source flexibility, long context |
| Qwen2.5-VL 72B | Alibaba | Vision-Language Transformer | Open-weight | Multilingual, visual reasoning |
| Pixtral 12B | Mistral AI | Dense multimodal Transformer | Open-weight | Lightweight, deployable |
| MiniCPM-V 8B | ModelBest / Tsinghua | Compact multimodal Transformer | Open-weight | Edge deployment, mobile |
Modality Support
| Model | Text | Images | Audio | Video | Code | Context Window |
|---|---|---|---|---|---|---|
| GPT-5.2 | ✦ Excellent | ✦ Excellent | ✦ Native | ✦ Yes | ✦ Excellent | 400K |
| Gemini 3 Pro | ✦ Excellent | ✦ Excellent | ✦ Native | ✦ Yes | ✦ Strong | 1M |
| Claude Sonnet 4.6 | ✦ Excellent | ✦ Strong | ◐ Limited | ◐ Limited | ✦ Excellent | 200K / 1M β |
| Llama 4 Maverick | ✦ Strong | ✦ Strong | ◐ Partial | ✦ Yes | ✦ Strong | 1M |
| Llama 4 Scout | ✦ Strong | ✦ Strong | ◐ Partial | ✦ Yes | ✦ Good | 10M |
| Qwen2.5-VL 72B | ✦ Strong | ✦ Strong | ◐ Limited | ◐ Partial | ✦ Good | 128K |
| MiniCPM-V 8B | ✦ Good | ✦ Good | ✕ None | ✕ None | ◐ Limited | 32K |
✦ Fully supported · ◐ Partial or experimental · ✕ Not supported · β = beta
Benchmark Performance
| Model | MMLU | MATH 500 | GPQA Diamond | SWE-bench |
|---|---|---|---|---|
| GPT-5.2 | ~92% | ~97% | ~93% | ~55% |
| Gemini 3 Pro | ~91% | ~100% † | ~90% | ~78% |
| Gemini 3 Flash | ~88% | ~100% † | ~85% | ~78% |
| Claude Sonnet 4.6 | ~90% | ~96% | ~88% | ~72% |
| Llama 4 Maverick | ~87% | ~93% | ~80% | ~65% |
| Qwen2.5-VL 72B | ~85% | ~90% | ~75% | ~60% |
† With code execution enabled
Deployment and Cost
| Model | Deployment | Input Cost / 1M tokens | Best Suited For |
|---|---|---|---|
| GPT-5.2 | Cloud API | $1.75 | Enterprise reasoning, complex workflows |
| GPT-5.2 Pro | Cloud API | $21.00 | Mission-critical agentic tasks |
| Gemini 3 Flash | Cloud API | Low-cost tier | High-volume, cost-sensitive workloads |
| Claude Sonnet 4.6 | Cloud API | Mid-range | Enterprise safety, coding, agents |
| Llama 4 Maverick | Self-hosted / API | Infrastructure only | Private deployment, data sovereignty |
| MiniCPM-V 8B | On-device / Edge | Negligible | Mobile, offline, privacy-critical |
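Per-token prices are easier to compare once they are turned into a daily bill. The snippet below is a back-of-the-envelope sketch using the two input prices listed in the table; the workload figures (`tokens_per_task`, `tasks_per_day`) are invented for illustration, and output-token charges are deliberately left out.

```python
def input_cost_usd(num_tokens, price_per_million_usd):
    """Cost of input tokens only, given a price per one million tokens."""
    return num_tokens / 1_000_000 * price_per_million_usd


# Input prices per 1M tokens, taken from the table above.
PRICES = {"GPT-5.2": 1.75, "GPT-5.2 Pro": 21.00}

tokens_per_task = 50_000   # assumed average prompt size for an agentic task
tasks_per_day = 1_000      # assumed daily volume

for model, price in PRICES.items():
    daily = input_cost_usd(tokens_per_task * tasks_per_day, price)
    print(f"{model}: ${daily:,.2f} per day (input tokens only)")
# GPT-5.2: $87.50 per day (input tokens only)
# GPT-5.2 Pro: $1,050.00 per day (input tokens only)
```

Under these assumed volumes the two tiers differ by a factor of twelve, which is why the choice of tier tends to be made per workload rather than once per organisation.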
How the Frontier Models Differ in Practice
GPT-5 / GPT-4o (OpenAI) takes the "omni" approach most literally: a single unified architecture ingesting text, images, audio, and code simultaneously. Reported hallucination rates have dropped to around 6%. GPT-5.2 introduces a 400K context window and near-perfect performance on AIME 2025 mathematics benchmarks.
Gemini 3 Pro / Flash (Google DeepMind) is the most ambitious attempt to build multimodal AI as a platform rather than a product. Native Google ecosystem integration means agents can access real-world data streams in ways API-only systems cannot. Notably, the lighter Gemini 3 Flash matches or slightly exceeds the Pro variant on SWE-bench coding tasks.
Claude 4 Opus / Sonnet (Anthropic) has staked out a deliberate position around trustworthiness rather than raw benchmark dominance. Its "extended thinking" mode produces more accurate outputs on complex multi-step problems. The "computer use" capability, in which the model navigates and operates a graphical user interface, marks a meaningful step toward agents that act directly in software environments.
Llama 4 Scout / Maverick (Meta AI) represents the open-weight frontier's most serious challenge to proprietary dominance. Its Mixture-of-Experts architecture activates only a fraction of total parameters for each token, which keeps compute per token well below that of a dense model of the same total size. Scout's 10 million token context window, among the largest of any publicly available model at the time of writing, is intended to enable analysis of entire codebases in a single pass.
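The phrase "activates only a fraction of total parameters" is easier to picture with a toy router. The sketch below shows generic top-k expert routing; it is not Llama 4's actual router, and the expert count, `TOP_K`, dimensions, and random weights are all assumptions chosen to keep the example small.

```python
import math
import random

random.seed(2)

NUM_EXPERTS = 8   # assumed
TOP_K = 2         # assumed number of experts run per token
DIM = 4           # toy hidden size

# Each expert is a toy linear layer; the router holds one scoring vector per expert.
experts = [[[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]
           for _ in range(NUM_EXPERTS)]
router = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]


def matvec(matrix, vec):
    return [sum(row[i] * vec[i] for i in range(len(vec))) for row in matrix]


def moe_forward(token):
    scores = matvec(router, token)  # one score per expert
    chosen = sorted(range(NUM_EXPERTS), key=lambda e: scores[e], reverse=True)[:TOP_K]
    # Softmax over the chosen experts only, so their mixing weights sum to 1.
    m = max(scores[e] for e in chosen)
    exps = {e: math.exp(scores[e] - m) for e in chosen}
    total = sum(exps.values())
    output = [0.0] * DIM
    for e in chosen:  # only TOP_K of the NUM_EXPERTS experts are evaluated
        expert_out = matvec(experts[e], token)
        for d in range(DIM):
            output[d] += exps[e] / total * expert_out[d]
    return output, chosen


token = [0.5, -0.2, 0.1, 0.9]
output, chosen = moe_forward(token)
print(f"experts used: {len(chosen)} of {NUM_EXPERTS}")  # experts used: 2 of 8
```

All eight experts must still be held in memory, but only two are computed for this token. That is the sense in which a Mixture-of-Experts model can have a very large total parameter count while keeping per-token compute modest.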
MiniCPM-V 8B (ModelBest / Tsinghua) is best understood as evidence for a broader trend: the model size required to reach GPT-4V-level performance has been falling rapidly, while end-device compute capacity has been rising, a convergence Hu et al. liken to Moore's Law applied to MLLMs [5]. In practice, 4-bit quantisation reduces the 8B model's memory footprint from roughly 16 GB to around 5 GB, and combined with NPU acceleration it achieves decoding speeds above human reading pace on current flagship smartphones.
Key takeaway: Frontier-level multimodal AI is moving off the cloud and onto your phone — and the gap is closing faster than most people expect.
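The memory figures quoted for the 8B model follow from simple arithmetic on bits per parameter. The helper below is a rough sketch that counts raw weight storage only, in decimal gigabytes; the function name is made up for the example, and the gap between its 4 GB result and the roughly 5 GB quoted above is assumed to come from overhead such as quantisation scales and components kept at higher precision, which the text does not break down.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Raw weight storage in decimal gigabytes, ignoring all runtime overhead."""
    return num_params * bits_per_param / 8 / 1e9


params = 8e9  # an 8B-parameter model

print(weight_memory_gb(params, 16))  # 16.0  (16-bit weights)
print(weight_memory_gb(params, 4))   # 4.0   (4-bit weights, before overhead)
```

Activations and the key-value cache come on top of these numbers, so the real footprint on a phone also depends on context length and runtime.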
The Connection to Agentic AI
The relevance of multimodal models to Agentic AI is architectural, not incidental. Consider what a capable enterprise agent faces in a single complex task: a PDF containing scanned tables, a screenshot of a web application, a voice instruction from a user, a video of a process to replicate, and a codebase to modify. A text-only agent cannot handle most of this without extensive preprocessing pipelines.
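The enterprise task described above can be written down as a small data structure to show how much of it a text-only agent can consume directly. Everything here is hypothetical: the `Part` type, the file names, and the two capability sets are invented for illustration and do not correspond to any specific agent framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    modality: str  # e.g. "text", "image", "audio", "video", "document", "code"
    source: str    # hypothetical file name or location


# The five inputs from the scenario above (names are made up).
TASK = [
    Part("document", "scanned_tables.pdf"),
    Part("image", "webapp_screenshot.png"),
    Part("audio", "user_instruction.wav"),
    Part("video", "process_recording.mp4"),
    Part("code", "repository/"),
]


def split_by_support(parts, supported):
    """Separate parts the agent can read directly from those needing a pipeline."""
    direct = [p for p in parts if p.modality in supported]
    needs_preprocessing = [p for p in parts if p.modality not in supported]
    return direct, needs_preprocessing


TEXT_ONLY = {"text", "code"}
MULTIMODAL = {"text", "code", "image", "audio", "video", "document"}

for name, supported in [("text-only", TEXT_ONLY), ("multimodal", MULTIMODAL)]:
    direct, extra = split_by_support(TASK, supported)
    print(f"{name}: {len(direct)} direct, {len(extra)} need preprocessing")
# text-only: 1 direct, 4 need preprocessing
# multimodal: 5 direct, 0 need preprocessing
```

Each item in the "need preprocessing" bucket stands for a separate OCR, transcription, or captioning pipeline that has to be built, maintained, and trusted before the agent can begin reasoning.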
Agents that understand the world as humans do — through multiple senses working together — will be the agents most capable of operating effectively alongside humans.
Current Limitations
Despite remarkable progress, five limitations define the honest state of the field:
- Hallucination across modalities: Models may confidently describe something in an image that is not there, or misread a table and build analysis on incorrect numbers. This extends to broader trustworthiness concerns, including bias, safety, and ethical alignment, all of which are now recognised as distinct evaluation dimensions requiring dedicated benchmarks and targeted mitigation strategies [1].
- Visual reasoning depth: Supporting an image modality and genuinely understanding images are different things. Specialised visual content such as medical scans or engineering diagrams often reveals significant gaps.
- Computational cost: Processing an image alongside text requires significantly more computation than text-only inference, and video more still.
- Cross-modal consistency: Maintaining consistent context across simultaneous inputs (spoken instruction, displayed document, live system state) remains an open engineering challenge.
- Benchmark saturation: Many frontier models now score so highly on established benchmarks that the tests no longer differentiate meaningfully between them. Evaluation itself has become a specialised discipline, with over 200 dedicated benchmarks now spanning accuracy, hallucination, robustness, safety, and cross-cultural understanding, assessed through a combination of automated metrics, human annotation, and LLM-based judging [2]. The field is already responding in practice: major community leaderboards, including Vellum's 2026 LLM Leaderboard, have dropped MMLU entirely as saturated and now track harder successor tests such as Humanity's Last Exam and ARC-AGI 2, which target multi-disciplinary breadth and novel pattern recognition rather than knowledge recall [7].
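Benchmark saturation can be seen directly in the approximate scores from the Benchmark Performance table earlier in this chapter. The snippet below is a minimal sketch that uses the gap between the best and worst score as a crude stand-in for how well a benchmark separates models; the scores are the rounded table values, and `spread` is an illustrative measure, not an established metric.

```python
# Approximate scores (percent) copied from the Benchmark Performance table,
# in table order: GPT-5.2, Gemini 3 Pro, Gemini 3 Flash, Claude Sonnet 4.6,
# Llama 4 Maverick, Qwen2.5-VL 72B.
SCORES = {
    "MMLU":      [92, 91, 88, 90, 87, 85],
    "SWE-bench": [55, 78, 78, 72, 65, 60],
}


def spread(values):
    """Gap between the best and worst score, in percentage points."""
    return max(values) - min(values)


for name, values in SCORES.items():
    print(f"{name}: spread = {spread(values)} points")
# MMLU: spread = 7 points
# SWE-bench: spread = 23 points
```

With six models packed into a 7-point band near the ceiling, MMLU leaves little room to tell them apart, whereas the 23-point band on SWE-bench still does. This is the pattern that leads leaderboards to retire older tests.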
References
1. Huang, J., & Zhang, J. (2024). A Survey on Evaluation of Multimodal Large Language Models. Nanyang Technological University.
2. Li, J. et al. (2024). A Survey on Benchmarks of Multimodal Large Language Models. Tencent / PKU / NUS.
3. Li, Z. et al. (2025). A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges. University of Maryland.
4. Caffagni, D. et al. (2024). The Revolution of Multimodal Large Language Models: A Survey. University of Modena and Reggio Emilia.
5. Hu, Y. et al. (2025). Efficient GPT-4V Level Multimodal LLM for Edge Devices (MiniCPM-V). ModelBest / Tsinghua University.
6. Coogan, B. et al. (2025). Benchmarking MLLMs for Forensic Science and Medicine. University College London.
7. Vellum AI (2026). LLM Leaderboard 2026. Vellum AI.