Chapter 3 · How AI Learned to See, Hear, and Read

The multimodal revolution and what it means for Agentic AI.


The End of the Text-Only Era

For most of the history of AI, models were specialists. A vision model could recognize objects but could not explain what it saw. A language model could write eloquently about a sunset but had no concept of what one looked like. That era is now over.

The rise of Multimodal Large Language Models (MLLMs) represents one of the most consequential shifts in AI development since the Transformer architecture itself. These systems do not simply switch between modalities — they reason across them simultaneously, much the way a human professional reads a chart, listens to an explanation, and formulates a response all at once.

For Agentic AI specifically, this convergence is not a nice-to-have enhancement. It is a prerequisite. An agent operating in the real world must be able to perceive that world in all its richness — documents, screenshots, voice commands, sensor feeds, video streams — and act upon that perception intelligently.


How Multimodal Models Work

MLLMs learn from mixed data: vast corpora containing text alongside images, transcripts alongside audio, and code alongside natural-language explanations. Training on these pairings allows the model to develop representations that are shared across modalities.

The technical machinery combines three components: a vision encoder that converts images or video frames into numerical embeddings; a language model backbone that handles reasoning and generation; and a cross-modal bridge that maps visual embeddings into the same representation space as the language model's own token embeddings.

Key takeaway: Most multimodal models are built from three parts: something that sees, something that thinks, and a bridge that makes them speak the same language.
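The three-part design can be sketched in a few lines of plain Python. This is a toy illustration only, not any production model: the patch size, embedding widths, random weights, and the function names `vision_encoder`, `bridge`, and `embed_text` are all assumptions made for the example. What it shows is the data flow: image patches become vision embeddings, the bridge projects them to the language model's width, and the result is concatenated with the text token embeddings into one input sequence.

```python
import random

random.seed(0)

PATCH = 4        # hypothetical patch side length, in pixels
VISION_DIM = 8   # hypothetical vision embedding width
LLM_DIM = 16     # hypothetical language-model embedding width


def random_matrix(rows, cols):
    """Stand-in for learned weights: a rows x cols matrix of small random values."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(vec, matrix):
    """Multiply a row vector by a (len(vec) x cols) matrix."""
    return [sum(v * row[j] for v, row in zip(vec, matrix))
            for j in range(len(matrix[0]))]


def patchify(image, patch=PATCH):
    """Split an H x W grayscale image into flattened patch vectors."""
    patches = []
    for top in range(0, len(image), patch):
        for left in range(0, len(image[0]), patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


W_VISION = random_matrix(PATCH * PATCH, VISION_DIM)  # "something that sees"
W_BRIDGE = random_matrix(VISION_DIM, LLM_DIM)        # "the bridge"
TOKEN_TABLE = {}                                     # toy text embedding table


def vision_encoder(image):
    """One embedding per image patch."""
    return [matvec(p, W_VISION) for p in patchify(image)]


def bridge(vision_embeddings):
    """Project vision embeddings to the language model's width."""
    return [matvec(e, W_BRIDGE) for e in vision_embeddings]


def embed_text(tokens):
    """Look up (or lazily create) an embedding for each text token."""
    for tok in tokens:
        if tok not in TOKEN_TABLE:
            TOKEN_TABLE[tok] = [random.gauss(0.0, 0.1) for _ in range(LLM_DIM)]
    return [TOKEN_TABLE[tok] for tok in tokens]


def build_llm_input(image, prompt_tokens):
    """Visual tokens first, then text tokens: the sequence the backbone reads."""
    return bridge(vision_encoder(image)) + embed_text(prompt_tokens)


image = [[(r * 8 + c) / 64 for c in range(8)] for r in range(8)]
sequence = build_llm_input(image, ["describe", "this", "chart"])
print(len(sequence), len(sequence[0]))  # 4 image patches + 3 text tokens, width 16
```

The language model backbone itself is omitted here; the point is only that by the time it runs, image and text have been reduced to vectors of the same width in a single sequence.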

The field has undergone a significant architectural shift: early MLLMs trained vision and language jointly from scratch, while modern systems plug vision encoders into pretrained LLM backbones, leveraging reasoning capabilities already embedded in large language models. Models also differ significantly in how they bridge the two modalities: from simple linear projections and MLPs, to learnable query-based transformers (Q-Former), to dense cross-attention layers woven directly into the language backbone [4]. Training now typically proceeds through multiple stages: large-scale pre-training to align modalities, instruction tuning for task versatility, and increasingly a reinforcement learning alignment stage to improve reasoning reliability and safety [3].
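To make the difference between bridge designs concrete, the sketch below contrasts a linear projection with a query-based resampler. It is a heavy simplification under stated assumptions: the dimensions and random weights are invented, and `query_resampler` performs a single cross-attention step, whereas a real Q-Former stacks several transformer layers with learned key and value projections. The practical point it illustrates is token count: a projection emits one token per patch, while a query-based bridge emits one token per learnable query, however many patches there are.

```python
import math
import random

random.seed(0)

DIM = 8          # hypothetical embedding width
NUM_PATCHES = 64  # hypothetical number of image patches
NUM_QUERIES = 4   # hypothetical number of learnable queries


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear_projection(patch_embeddings, weight_columns):
    """Simplest bridge: one output token per input patch."""
    return [[dot(p, col) for col in weight_columns] for p in patch_embeddings]


def query_resampler(patch_embeddings, queries):
    """Query-based bridge, reduced to one cross-attention step.

    Each query scores every patch, and its output token is the
    attention-weighted average of the patch embeddings.
    """
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        weights = softmax([dot(q, p) / scale for p in patch_embeddings])
        outputs.append([sum(w * p[d] for w, p in zip(weights, patch_embeddings))
                        for d in range(len(q))])
    return outputs


patches = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]
weight_columns = [[random.gauss(0, 0.1) for _ in range(DIM)] for _ in range(DIM)]
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

projected = linear_projection(patches, weight_columns)
resampled = query_resampler(patches, queries)
print(len(projected), len(resampled))  # tokens handed to the language model
```

Fewer visual tokens mean a shorter sequence for the language model to process, at the price of compressing the image through a small number of queries.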

Key takeaway: Modern multimodal models don't start from zero — they attach a vision system to a language model that already knows how to reason, then train the connection in stages.
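The staged recipe can be written down as a small schedule of which components are updated at each stage. The stage names follow the text; everything else is an assumption for illustration. In particular, the choice of which parts are frozen and which are trainable mirrors one common pattern (train only the bridge first, then unfreeze the language model) and is not a description of any specific model's training run.

```python
from dataclasses import dataclass

COMPONENTS = frozenset({"vision_encoder", "bridge", "language_model"})


@dataclass(frozen=True)
class Stage:
    name: str
    objective: str
    trainable: frozenset  # components whose weights are updated in this stage


# Hypothetical schedule; the freeze/train split is an assumption.
STAGES = [
    Stage("alignment pre-training",
          "match image features to paired text",
          frozenset({"bridge"})),
    Stage("instruction tuning",
          "follow multimodal instructions",
          frozenset({"bridge", "language_model"})),
    Stage("reinforcement learning alignment",
          "reward reliable and safe answers",
          frozenset({"bridge", "language_model"})),
]


def frozen_components(stage):
    """Everything not listed as trainable stays fixed during the stage."""
    return sorted(COMPONENTS - stage.trainable)


for stage in STAGES:
    print(f"{stage.name}: train {sorted(stage.trainable)}, "
          f"freeze {frozen_components(stage)}")
```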


The Modality Landscape

| Modality | Input Tasks | Output Tasks | Maturity |
| --- | --- | --- | --- |
| Text | Reading, comprehension, Q&A | Writing, summarisation, translation | ✦ Production |
| Images | Object recognition, OCR, chart analysis | Image generation, captioning | ✦ Production |
| Audio | Speech recognition, tone detection | Speech synthesis, music generation | ✦ Production |
| Code | Bug detection, code review | Code generation, refactoring | ✦ Production |
| Documents | PDF/form parsing, layout understanding | Report generation, data extraction | ✦ Production |
| Video | Action recognition, temporal reasoning | Video generation, captioning | ◐ Maturing |
| 3D / Spatial | Point cloud analysis, depth estimation | 3D model generation, navigation | ○ Early Stage |
| Sensor Data | IoT inputs, robotics perception | Control signals, anomaly alerts | ○ Research |

Leading Models Compared

Architecture Overview

| Model Family | Developer | Architecture | License | Primary Strength |
| --- | --- | --- | --- | --- |
| GPT-5 / GPT-4o | OpenAI | Unified multimodal Transformer | Proprietary | General reasoning, creative tasks |
| Gemini 3 Pro / Flash | Google DeepMind | Native multimodal Transformer | Proprietary | Multimodal breadth, Google ecosystem |
| Claude 4 Opus / Sonnet | Anthropic | Hybrid reasoning Transformer | Proprietary | Safety, coding, enterprise reliability |
| Llama 4 Scout / Maverick | Meta AI | Mixture-of-Experts (MoE) | Open-weight | Open-weight flexibility, long context |
| Qwen2.5-VL 72B | Alibaba | Vision-Language Transformer | Open-weight | Multilingual, visual reasoning |
| Pixtral 12B | Mistral AI | Dense multimodal Transformer | Open-weight | Lightweight, deployable |
| MiniCPM-V 8B | ModelBest / Tsinghua | Compact multimodal Transformer | Open-weight | Edge deployment, mobile |

Modality Support

| Model | Text | Images | Audio | Video | Code | Context Window |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-5.2 | ✦ Excellent | ✦ Excellent | ✦ Native | ✦ Yes | ✦ Excellent | 400K |
| Gemini 3 Pro | ✦ Excellent | ✦ Excellent | ✦ Native | ✦ Yes | ✦ Strong | 1M |
| Claude Sonnet 4.6 | ✦ Excellent | ✦ Strong | ◐ Limited | ◐ Limited | ✦ Excellent | 200K / 1M β |
| Llama 4 Maverick | ✦ Strong | ✦ Strong | ◐ Partial | ✦ Yes | ✦ Strong | 1M |
| Llama 4 Scout | ✦ Strong | ✦ Strong | ◐ Partial | ✦ Yes | ✦ Good | 10M |
| Qwen2.5-VL 72B | ✦ Strong | ✦ Strong | ◐ Limited | ◐ Partial | ✦ Good | 128K |
| MiniCPM-V 8B | ✦ Good | ✦ Good | ✕ None | ✕ None | ◐ Limited | 32K |

✦ Fully supported · ◐ Partial or experimental · ✕ Not supported · β = beta

Benchmark Performance

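| Model | MMLU | MATH 500 | GPQA Diamond | SWE-bench |
| --- | --- | --- | --- | --- |
| GPT-5.2 | ~92% | ~97% | ~93% | ~55% |
| Gemini 3 Pro | ~91% | ~100% † | ~90% | ~78% |
| Gemini 3 Flash | ~88% | ~100% † | ~85% | ~78% |
| Claude Sonnet 4.6 | ~90% | ~96% | ~88% | ~72% |
| Llama 4 Maverick | ~87% | ~93% | ~80% | ~65% |
| Qwen2.5-VL 72B | ~85% | ~90% | ~75% | ~60% |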

† With code execution enabled

Deployment and Cost

| Model | Deployment | Input Cost / 1M tokens | Best Suited For |
| --- | --- | --- | --- |
| GPT-5.2 | Cloud API | $1.75 | Enterprise reasoning, complex workflows |
| GPT-5.2 Pro | Cloud API | $21.00 | Mission-critical agentic tasks |
| Gemini 3 Flash | Cloud API | Low-cost tier | High-volume, cost-sensitive workloads |
| Claude Sonnet 4.6 | Cloud / API | Mid-range | Enterprise safety, coding, agents |
| Llama 4 Maverick | Self-hosted / API | Infrastructure only | Private deployment, data sovereignty |
| MiniCPM-V 8B | On-device / Edge | Negligible | Mobile, offline, privacy-critical |
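The per-token prices in the table translate into budgets with simple arithmetic. The sketch below uses only the two numeric prices listed above; the workload (20,000 input tokens per request, 1,000 requests per day) is a made-up example, and output-token charges, which the table does not list, are left out.

```python
# Input prices in USD per 1M tokens, copied from the table above.
PRICE_PER_MILLION_INPUT_TOKENS = {
    "GPT-5.2": 1.75,
    "GPT-5.2 Pro": 21.00,
}


def input_cost_usd(model, input_tokens):
    """Cost of sending `input_tokens` tokens to `model` (input side only)."""
    return PRICE_PER_MILLION_INPUT_TOKENS[model] * input_tokens / 1_000_000


def monthly_input_cost_usd(model, tokens_per_request, requests_per_day, days=30):
    """Input-side cost of a steady workload over `days` days."""
    return input_cost_usd(model, tokens_per_request * requests_per_day * days)


# Hypothetical agent workload: 20,000 input tokens per request, 1,000 requests a day.
standard = monthly_input_cost_usd("GPT-5.2", 20_000, 1_000)
pro = monthly_input_cost_usd("GPT-5.2 Pro", 20_000, 1_000)
print(f"GPT-5.2:     ${standard:,.2f} per month")  # $1,050.00
print(f"GPT-5.2 Pro: ${pro:,.2f} per month")       # $12,600.00
```

At these list prices the Pro tier costs twelve times as much per input token, which is why the table reserves it for mission-critical tasks rather than routine traffic.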

How the Frontier Models Differ in Practice

GPT-5 / GPT-4o (OpenAI) takes the "omni" approach most literally: a single unified architecture that ingests text, images, audio, and code together. Reported hallucination rates have dropped to around 6%. GPT-5.2 introduces a 400K context window and near-perfect performance on the AIME 2025 mathematics benchmark.

Gemini 3 Pro / Flash (Google DeepMind) is the most ambitious attempt to build multimodal AI as a platform rather than a product. Native Google ecosystem integration means agents can access real-world data streams in ways API-only systems cannot. Notably, the smaller Gemini 3 Flash matches or slightly exceeds the Pro variant on SWE-bench coding tasks; both round to about 78% in the benchmark table above.

Claude 4 Opus / Sonnet (Anthropic) has staked out a deliberate position around trustworthiness rather than raw benchmark dominance. Its "extended thinking" mode is designed to produce more accurate outputs on complex multi-step problems. The "computer use" capability, in which the model navigates and operates a graphical user interface, marks a meaningful step toward agents that act directly in software environments.

Llama 4 Scout / Maverick (Meta AI) represents the open-weight frontier's most serious challenge to proprietary dominance. Its Mixture-of-Experts architecture activates only a fraction of the total parameters for each token, which keeps inference cost well below that of a dense model of the same total size. Scout's 10 million token context window, among the largest of any model currently available, is designed to allow analysis of entire codebases in a single pass.
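The idea that a Mixture-of-Experts layer runs only part of the network per token can be shown with a toy top-k router. This is a generic sketch of the technique, not Llama 4's implementation: the expert count, the value of `TOP_K`, the dimensions, and the random weights are all hypothetical.

```python
import math
import random

random.seed(0)

NUM_EXPERTS = 8  # hypothetical; not Llama 4's real configuration
TOP_K = 2        # hypothetical number of experts run per token
DIM = 4          # hypothetical hidden width


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_vector():
    return [random.gauss(0, 1) for _ in range(DIM)]


# Each expert is a DIM x DIM linear map, stored as a list of row vectors.
experts = [[random_vector() for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
# The router holds one scoring vector per expert.
router = [random_vector() for _ in range(NUM_EXPERTS)]


def route(token, k=TOP_K):
    """Pick the k best-scoring experts and normalise their gate weights."""
    scores = [dot(token, w) for w in router]
    top = sorted(range(NUM_EXPERTS), key=lambda i: scores[i], reverse=True)[:k]
    gates = softmax([scores[i] for i in top])
    return list(zip(top, gates))


def moe_layer(token, k=TOP_K):
    """Run only the selected experts and mix their outputs by gate weight."""
    output = [0.0] * DIM
    for index, gate in route(token, k):
        expert_out = [dot(token, row) for row in experts[index]]
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output


token = random_vector()
selected = route(token)
output = moe_layer(token)
print("experts used:", [index for index, _ in selected])
print("fraction of experts active:", TOP_K / NUM_EXPERTS)
```

With these toy numbers only a quarter of the expert weights are touched for any one token, even though all eight experts contribute to the model's total capacity.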

MiniCPM-V 8B (ModelBest / Tsinghua) is best understood as evidence for a broader trend: the model size required to reach GPT-4V-level performance has been falling rapidly, while end-device compute capacity has been rising, a convergence Hu et al. liken to Moore's Law applied to MLLMs [5]. In practice, 4-bit quantisation reduces the 8B model's memory footprint from roughly 16 GB to around 5 GB, and combined with NPU acceleration it achieves decoding speeds above human reading pace on current flagship smartphones.
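The memory figures follow from back-of-the-envelope arithmetic on the weights alone. The sketch below assumes exactly 8 billion parameters, 16-bit weights before quantisation, and decimal gigabytes; it ignores everything except the weights themselves.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory needed to store the weights, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 8e9  # assumption: exactly 8 billion parameters

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)
print(f"16-bit weights: {fp16_gb:.1f} GB")  # 16.0 GB
print(f"4-bit weights:  {int4_gb:.1f} GB")  # 4.0 GB
```

The raw 4-bit figure comes out at 4 GB rather than the roughly 5 GB quoted. The difference is plausibly overhead that this estimate leaves out, such as quantisation scale factors, any layers kept at higher precision, and runtime buffers, though the exact breakdown is not given in the text.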

Key takeaway: Frontier-level multimodal AI is moving off the cloud and onto your phone — and the gap is closing faster than most people expect.


The Connection to Agentic AI

The relevance of multimodal models to Agentic AI is architectural, not incidental. Consider what a capable enterprise agent faces in a single complex task: a PDF containing scanned tables, a screenshot of a web application, a voice instruction from a user, a video of a process to replicate, and a codebase to modify. A text-only agent cannot handle most of this without extensive preprocessing pipelines.
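The enterprise task described above can be sketched as a routing problem: which inputs can the agent's model consume directly, and which need a separate preprocessing pipeline first? The file names, the suffix-to-modality map, and the `plan_inputs` helper are all hypothetical, chosen only to mirror the five inputs in the example.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file suffix to modality.
MODALITY_BY_SUFFIX = {
    ".pdf": "document",
    ".png": "image",
    ".jpg": "image",
    ".wav": "audio",
    ".mp4": "video",
    ".py": "code",
    ".txt": "text",
}


def detect_modality(filename):
    """Guess the modality of an input from its file suffix."""
    suffix = PurePosixPath(filename).suffix.lower()
    return MODALITY_BY_SUFFIX.get(suffix, "unknown")


def plan_inputs(filenames, supported):
    """Split inputs into those the model reads directly and those it cannot."""
    plan = {"direct": [], "needs_preprocessing": []}
    for name in filenames:
        modality = detect_modality(name)
        bucket = "direct" if modality in supported else "needs_preprocessing"
        plan[bucket].append((name, modality))
    return plan


# The five inputs from the enterprise task (file names invented).
TASK_INPUTS = [
    "scanned_tables.pdf",
    "web_app_screenshot.png",
    "voice_instruction.wav",
    "process_recording.mp4",
    "billing_service.py",
]

text_only = plan_inputs(TASK_INPUTS, supported={"text", "code"})
multimodal = plan_inputs(
    TASK_INPUTS,
    supported={"text", "code", "document", "image", "audio", "video"},
)
print(len(text_only["needs_preprocessing"]), "inputs need preprocessing (text-only agent)")
print(len(multimodal["needs_preprocessing"]), "inputs need preprocessing (multimodal agent)")
```

For the text-only agent, four of the five inputs fall into the preprocessing bucket (OCR, transcription, frame captioning, and so on), each one an extra pipeline to build and a place where information can be lost before the model ever sees it.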

Agents that understand the world as humans do — through multiple senses working together — will be the agents most capable of operating effectively alongside humans.


Current Limitations

Despite remarkable progress, five limitations define the honest state of the field:

  1. Hallucination across modalities: Models may confidently describe something in an image that is not there, or misread a table and build analysis on incorrect numbers. This extends to broader trustworthiness concerns, including bias, safety, and ethical alignment, all of which are now recognised as distinct evaluation dimensions requiring dedicated benchmarks and targeted mitigation strategies [1].
  2. Visual reasoning depth: Supporting an image modality and genuinely understanding images are different things. Specialised visual content such as medical scans or engineering diagrams often reveals significant gaps.
  3. Computational cost: Processing an image alongside text requires significantly more computation than text-only inference, and video more still.
  4. Cross-modal consistency: Maintaining consistent context across simultaneous inputs (spoken instruction, displayed document, live system state) remains an open engineering challenge.
  5. Benchmark saturation: Many frontier models now score so highly on established benchmarks that the tests no longer differentiate meaningfully between them. Evaluation itself has become a specialised discipline, with over 200 dedicated benchmarks now spanning accuracy, hallucination, robustness, safety, and cross-cultural understanding, assessed through a combination of automated metrics, human annotation, and LLM-based judging [2]. The field is already responding in practice: major community leaderboards, including Vellum's 2026 LLM Leaderboard, have dropped MMLU entirely as saturated and now track harder successor tests such as Humanity's Last Exam and ARC-AGI 2, which target multi-disciplinary breadth and novel pattern recognition rather than knowledge recall [7].
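The computational cost in item 3 can be made concrete by counting tokens. The sketch assumes a ViT-style encoder that cuts the image into fixed-size patches and passes one token per patch to the language model with no pooling or compression; the 336-pixel input with 14-pixel patches is one common configuration, and the 60-frame clip is a hypothetical example.

```python
def image_tokens(width, height, patch_size):
    """Visual tokens for one image: one token per non-overlapping patch."""
    return (width // patch_size) * (height // patch_size)


def video_tokens(width, height, patch_size, frames):
    """Visual tokens for a clip, assuming every sampled frame is encoded in full."""
    return image_tokens(width, height, patch_size) * frames


single_image = image_tokens(336, 336, 14)
short_clip = video_tokens(336, 336, 14, frames=60)
print(single_image)  # 576
print(short_clip)    # 34560
```

A one-sentence text prompt is typically a few dozen tokens, so a single image at this resolution already dominates the input, and a short clip multiplies that again by the number of frames. This is why many systems compress or subsample visual tokens before they reach the language model.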

References

  1. Huang, J., & Zhang, J. (2024). A Survey on Evaluation of Multimodal Large Language Models. Nanyang Technological University.
  2. Li, J. et al. (2024). A Survey on Benchmarks of Multimodal Large Language Models. Tencent / PKU / NUS.
  3. Li, Z. et al. (2025). A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges. University of Maryland.
  4. Caffagni, D. et al. (2024). The Revolution of Multimodal Large Language Models: A Survey. University of Modena and Reggio Emilia.
  5. Hu, Y. et al. (2025). Efficient GPT-4V Level Multimodal LLM for Edge Devices (MiniCPM-V). ModelBest / Tsinghua University.
  6. Coogan, B. et al. (2025). Benchmarking MLLMs for Forensic Science and Medicine. University College London.
  7. Vellum AI (2026). LLM Leaderboard 2026. Vellum AI.
