Multi-modal LLMs
This session examines the evolution and architecture of Multimodal Large Language Models (MLLMs), which expand AI capabilities beyond text to include images, audio, video, and structured documents. The text details a three-stage pipeline consisting of a vision encoder, a projection layer for alignment, and an LLM backbone for joint reasoning. It further explores multimodal retrieval-augmented generation (RAG), comparing traditional data conversion methods with newer, native embedding techniques like ColPali that preserve visual fidelity.
Specialized fusion strategies and visual grounding mechanisms are discussed as the technical foundation for how these models “see” and interpret spatial relationships. Finally, the source outlines agentic AI patterns, demonstrating how multimodal perception enables autonomous agents to perform complex tasks like GUI automation and visual evidence verification.

Listen
| Speaker | Text |
|---|---|
| Alex | This is the brief on multi-modal large language models. Today, we’re reviewing 2026 lecture notes on how AI is evolving past just reading text, allowing autonomous agents to actually see, hear, and perceive the world. First, let’s look at the massive shift to multimodal AI. You know, models that only read are incredibly smart, but they’re fundamentally blind. I mean, if you feed them a chart by just extracting the text, the layout and visual meaning are totally destroyed. A text-only LLM is literally like a genius locked in a dark, silent room. Giving it multi-modal capabilities is like finally turning on the lights and opening a window. Second, how do we actually build this? Well, it takes a 3-stage pipeline. A vision encoder chops an image into a grid of tiny patches. Next, a projection layer translates those visual patches right into the LLM’s text-based embedding space. Then the LLM backbone processes both together. Now, wait, doesn’t chopping an image into tiny patches destroy the big picture? You’d think so, but the model uses self-attention to globally map how those pieces relate to each other. Finally, this absolutely changes data retrieval with multimodal RAG. Instead of using clunky OCR to pull text from a PDF, which totally ruins your charts, new models like ColPali embed document pages natively as images. You just search the visual representation directly. Now, isn’t standard text search just easier? Sure, until it completely fails because your most important data is locked inside an SEC revenue bar chart. Giving AI the ability to see and hear doesn’t just add cool features, it transforms language models from simple text calculators into genuine perceptual reasoners. |
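
To put rough numbers on the patch grid the brief describes, here is a minimal Python sketch. It assumes the 14-pixel patch and the 1,344 by 1,344 image used as the example in the longer discussion, plus a 64-token compressed budget. The function names `patch_token_count` and `attention_pairs` are hypothetical, and real encoders usually resize or pad images first rather than rejecting them.

```python
def patch_token_count(height: int, width: int, patch_size: int = 14) -> int:
    """Count non-overlapping patches: (height / patch) * (width / patch)."""
    if height % patch_size or width % patch_size:
        # Assumption for this sketch: reject sizes that do not tile evenly.
        raise ValueError("image sides must be multiples of patch_size")
    return (height // patch_size) * (width // patch_size)


def attention_pairs(num_tokens: int) -> int:
    """Token-to-token comparisons in one full self-attention pass (N * N)."""
    return num_tokens * num_tokens


raw_tokens = patch_token_count(1344, 1344)  # a 96 x 96 grid of patches
budget_tokens = 64                          # hypothetical compressed token budget

print(raw_tokens)                                                      # 9216
print(attention_pairs(raw_tokens) // attention_pairs(budget_tokens))  # 20736
```

Under these assumptions, one image yields 9,216 patch tokens, and full self-attention over them costs roughly 20,000 times more pairwise comparisons than attention over 64 tokens, which is why the projection stage matters.
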
| Speaker | Text |
|---|---|
| Alex | So, um, imagine you are trying to describe a really chaotic, messy financial scatter plot to someone over the phone using nothing but words. Oh, that sounds like a total nightmare, right? It’s impossible. You have to explain every single data point, the trajectory of the trend line, the weird clustering of outliers, uh, the spacing of the axis labels. Within 30 seconds, the person on the other end is just completely lost. |
| Sam | Yeah, they lose all the spatial context, the groupings, the trends, it’s all gone. |
| Alex | Exactly. It is an exercise in futility. But if you think about it, that is exactly how our most advanced AI agents experience the world today. Because, you know, as someone out there building retrieval-augmented generation pipelines and agentic systems, you already know these text-based models are brilliant. |
| Sam | They are, but fundamentally they are trapped in a sensory deprivation tank. |
| Alex | Yeah, that’s exactly the phrase for it. |
| Sam | It really is a severe cognitive bottleneck. I mean, we have engineered these massive, incredibly capable reasoning engines, but we force them to understand the world entirely through the highly compressed, lossy medium of tokenized text, |
| Alex | which is not how we operate at all. |
| Sam | No, human cognition doesn’t work that way. We integrate visual, auditory, and linguistic information continuously to make sense of our environment. When you force a model to rely purely on text, you are systematically blinding it to the reality of how information is actually structured, |
| Alex | which brings us to our mission for today’s deep dive. We are jumping into a fascinating set of spring 2026 lecture notes from a graduate course called, uh, Building Generative AI Business Solutions. |
| Sam | Really great material on this one. |
| Alex | Yeah, it’s dense, but so good. The focus is on a massive paradigm shift. We’re looking at how multimodal large language models, MLLMs, are transforming AI from blind text processors into genuine perceptual reasoners, |
| Sam | and this is huge for anyone building autonomous systems. |
| Alex | Exactly. We’re gonna explore the underlying math of how they actually see, how they fuse modalities, and most importantly for you listening, how to practically wire these concepts into the agent loops and retrieval systems you’re building right now. |
| Sam | To really appreciate the gravity of this shift, you just have to consider a standard text extraction pipeline processing a dense enterprise document. |
| Alex | Right, like a financial report or something. |
| Sam | Yeah. When you convert a complex formatted page into a flat text string, you are actively destroying semantic meaning. The column alignments in a financial table, the grouping of rows, the relative position of a warning label next to a schematic: that spatial layout isn’t just decoration. |
| Alex | No, it is the actual data. |
| Sam | Exactly. The layout is the data. |
| Alex | So, um, if text models only understand words, how on earth do you feed a massive high-resolution image into a text prompt without completely breaking the system’s architecture? |
| Sam | Well, it all comes down to what the lecture notes describe as the token efficiency asymmetry. This is basically the foundational mathematical hurdle of multimodal AI. |
| Alex | OK, token efficiency asymmetry. Break that down for us. |
| Sam | Sure. So text is incredibly dense and highly abstract. A single text token can represent a highly precise concept, right? Like a specific noun, a complex logical operator, a distinct mathematical function. |
| Alex | It packs a lot of punch into a tiny package. |
| Sam | Exactly. But images are the exact opposite. They’re continuous two-dimensional signals made of massive pixel arrays. They are incredibly dense in raw data, but very sparse in immediate semantic meaning. |
| Alex | So to a computer, an image is just a giant grid of numbers representing colors. And somehow we have to translate that raw grid into the highly refined conceptual language of a large language model. |
| Sam | Precisely. And to pull that off, modern vision language models rely on a strict 3-stage architectural pipeline. Stage 1 is the vision encoder. Typically this is a vision transformer that takes the raw image and mathematically chunks it into a grid of non-overlapping patches. |
| Alex | OK, let’s unpack this mechanism because it’s pretty wild. I’m picturing the vision encoder acting almost like a meticulous, highly analytical translator. That’s a good way to look at it. It takes a beautiful, complex painting, takes a pair of scissors, and cuts it up into a rigid grid of, say, 14-by-14-pixel squares. |
| Sam | Yeah, that’s the standard patch size. |
| Alex | And then it examines each individual square and describes its visual features, you know, the edges, the colors, the textures in a mathematical vector that the LLM natively understands. |
| Sam | That is the exact mechanism. But the mathematical reality of cutting up that painting introduces a massive computational nightmare. |
| Alex | Uh, why is that? |
| Sam | Well, the formula for the number of patches generated is the image height divided by the patch size, multiplied by the width divided by the patch size. |
| Alex | OK, so just basic area math. |
| Sam | Right. So if you hand the model a relatively standard high-resolution image, say 1,344 by 1,344 pixels, and you are using that standard 14-pixel patch, you suddenly generate over 9,000 patch tokens for a single image. |
| Alex | Wait, hold on, 9,000 tokens just to look at one single image? That eats up a context window incredibly fast. I mean, you’re burning through tokens before you even ask a question. |
| Sam | And it gets worse, because the LLM processes sequence tokens using self-attention. The computational cost scales quadratically. It is an O(N²) bottleneck. |
| Alex | Right, because every token has to look at every other token. |
| Sam | Exactly. So if you flood the context window with 9,000 visual tokens, the compute cost doesn’t just increase, it explodes quadratically. The model would just grind to a halt. |
| Alex | So there has to be a filter, right? You can’t just dump 9000 tokens into the prompt and expect it to run efficiently or cheaply, |
| Sam | which brings us to the crucial second stage, the projection layer. This is arguably the most active area of multimodal architectural research right now. |
| Alex | OK, so how does the projection layer fix the bottleneck? |
| Sam | Well, early on, models used really simple approaches like linear layers or MLPs, multi-layer perceptrons. These were computationally cheap to train, but they essentially acted as a direct 1-to-1 mapping tool. Oh, I see where this is going. Yeah, they took every single visual patch and mapped it directly to an LLM token, meaning you kept all 9,000 tokens, |
| Alex | which totally fails to solve our exploding context window problem. |
| Sam | It failed completely for high-resolution reasoning. That failure drove the industry toward advanced projection strategies like Q-Formers, or Querying Transformers. Q-Formers, OK. Instead of a direct map, a Q-Former introduces a fixed set of learnable query tokens. Think of it as a strict token budget. Maybe you only allow 32 or 64 query tokens total. |
| Alex | Oh, that’s a massive reduction. |
| Sam | It is. These query tokens use cross-attention to dynamically scan those 9,000 raw visual patches and extract only the most salient, task-relevant information. |
| Alex | I love that. It’s like, um, instead of sending all 9,000 puzzle pieces to the CEO’s desk, you send a team of 64 incredibly smart interns to look at the puzzle and come back with a highly compressed executive summary of what the picture actually shows. |
| Sam | That is a perfect analogy. And that compressed summary, those 64 rich tokens, is what finally gets passed to stage 3, which is the LLM backbone itself. |
| Alex | So the compression happens before the visual data ever hits the main reasoning engine. |
| Sam | Exactly. It saves you from that quadratic computation nightmare. |
| Alex | OK, so the interns bring back the summary, and now we have these translated visual tokens sitting next to our text tokens. But how do they actually interact? Because just putting a picture and a sentence in the same room doesn’t mean they understand each other, right? |
| Sam | That interaction is governed by token fusion. The architecture dictates when and how deeply the text and visual vectors mix. |
| Alex | Are there different ways to do that? |
| Sam | Yeah, definitely. The most aggressive approach is early fusion, where you concatenate all the visual and text tokens into a single unified sequence before the very first layer of the neural network. |
| Alex | Wow, that sounds like it would give you incredibly deep, rich spatial reasoning, because every single layer of the model gets to evaluate the text and the image simultaneously. |
| Sam | It does, but it comes with a massive cost. The computational overhead is astronomical because you are running that complex attention mechanism across the combined sequence for the entire depth of the model. |
| Alex | Right, the O(N²) problem comes back to bite you. |
| Sam | Exactly. Now, on the completely opposite end of the spectrum, you have late fusion. Here, the image and text are processed in entirely separate neural pathways, and their representations are only mathematically combined at the very final output layers, |
| Alex | which is like two people working on a massive group project in separate rooms, entirely isolated from one another and only comparing their notes five minutes before the actual presentation. |
| Sam | Yeah, that’s exactly it. It’s highly efficient, but you get incredibly weak cross-modal reasoning. |
| Alex | Right, because they couldn’t collaborate on the complex parts. |
| Sam | It’s terrible for complex tasks. To solve this, researchers developed cross-attention fusion, often referred to as the Flamingo-style architecture. Like the bird? Yeah, like the bird. This uses dedicated attention heads spaced at regular intervals specifically for injecting visual data into the text stream. It’s highly scalable and particularly great for ingesting video frames. But the gold standard today, what you see in the highest-performing production models, is hybrid fusion. Blending the best of both worlds. Exactly. Hybrid architectures use early fusion in the foundational lower layers to establish deep connections for critical tokens, and then they switch to sparse cross-attention in the higher layers. |
| Alex | That keeps the computational cost manageable, |
| Sam | right, while still preserving deep reasoning capabilities. |
| Alex | So once those tokens are fused, how does the model actually leverage them to formulate an answer? Because, say I ask the model, “What’s wrong with this car engine?” It needs to bridge my textual question with the specific pixels showing a cracked belt. |
| Sam | The notes detail a mechanism for this called grounding. Grounding is basically how visual tokens exert influence over the text generation process. Visual tokens essentially act as persistent contextual anchors. There are 3 types. Spatial grounding is when the model attends heavily to specific patch tokens to isolate locations, |
| Alex | like drawing a bounding box around that cracked engine belt. |
| Sam | Exactly. Temporal grounding applies that same logic across time for video, aligning textual concepts to specific frames. But semantic grounding is where the underlying math gets really interesting. |
| Alex | Let’s drill into semantic grounding. How does that anchor actually alter the model’s output? |
| Sam | Think of it as a distributional shift. The sheer presence of visual tokens in the context window actively shifts the LLM’s vocabulary probability distribution. So if the visual tokens represent a dense forest scene, they act as a gravitational pull. They increase the probability mass over tokens related to trees, leaves, shadows, and dirt while actively suppressing the probability of tokens related to, say, oceans or skyscrapers. |
| Alex | It forces the language generation engine to remain semantically tethered to the visual reality of the image. |
| Sam | Exactly. It keeps the model grounded in what it’s actually seeing. |
| Alex | That is fascinating, and it brings to mind a very practical, everyday prompt engineering problem for our listeners. Yeah. Since the model is processing these token sequences sequentially, and the visual tokens influence the text probabilities, does the order matter? Like, if I’m prompting a multimodal model, should the image go first or the text? |
| Sam | The empirical data from the course notes provides a definitive answer to that. Placing your text question before the image in the prompt yields a consistent 5 to 10% accuracy boost on complex analytical tasks. |
| Alex | Wait, up to 10% just from swapping the order of the inputs? Just the order. Let me think through the mechanics of why that happens. It has to be attentional priming, right? If you put the text first, you’re forcing the model to establish the relevant semantic concepts, telling it exactly what it needs to look for before it ever encounters the visual data. Exactly. But if you put the image first, the model has to process this massive raw grid of pixels blindly, trying to encode everything at once, and then gets hit with a surprise question at the end. |
| Sam | You’ve nailed the underlying mechanism. When the question tokens run through the attention heads first, they prime the network. That makes so much sense. So when those visual patches finally enter the attention window, the Q-Former and the cross-attention layers already know to ignore the background noise and hyperfocus their computational resources on the specific features you asked about. |
| Alex | This changes so much of how we need to build, and it perfectly sets up the next major hurdle: rethinking retrieval. Here’s where it gets really interesting. Let’s talk about multimodal RAG, because standard text-based retrieval-augmented generation is failing hard in the enterprise space right now. |
| Sam | It is fundamentally broken when confronted with the reality of complex documents. Take a standard SEC 10-K financial filing. |
| Alex | OK, classic messy document. |
| Sam | Right. Imagine a page featuring a complex bar chart that illustrates a critical downward trend in quarterly revenue over two years, surrounded by dense footnotes. If you run that page through a standard text extraction pipeline, what happens? |
| Alex | It’s a total disaster. You might get some messy OCR, you know, optical character recognition, that spits out disjointed axis labels and a random jumble of numbers. Yeah, just garbage text. But the visual trend itself, the actual slope of the bars showing the revenue collapse, is entirely lost. Your RAG system retrieves the text chunks but completely misses the most crucial piece of evidence on the page because it couldn’t see the chart. |
| Sam | Exactly. And the industry’s first reaction to this was what the notes call modality conversion. People tried to patch the sinking ship of text-only RAG. They would run OCR and then employ an early vision model to write a generic textual caption summarizing the chart, and then they’d index that caption into their vector database. |
| Alex | Right, because it fits neatly into their existing text-only infrastructure. |
| Sam | It’s convenient, sure, but a generic paragraph of text simply cannot capture the nuance of a complex scatter plot or a multilayered architectural diagram, |
| Alex | which forces a paradigm shift to native multimodal embeddings. Instead of destroying the image to create text, we need to embed the images directly. |
| Sam | Exactly. This is the breakthrough driven by late interaction architectures, specifically models like ColPali. ColPali, right. Instead of extracting text, you render the entire PDF page as an image. You process it through a vision encoder and you embed the visual representation directly into a shared vector space alongside your text queries. |
| Alex | So you preserve the absolute full visual semantics of the document. |
| Sam | You preserve everything. |
| Alex | OK, hold on, I need to throw a massive red flag on this concept. Uh oh, what’s the flag? If I understand late interaction correctly, a model like ColPali doesn’t just create one neat little vector for an entire PDF page. It creates a vector for every single 14-by-14-pixel patch of that rendered page. That is correct. OK, so if I have an enterprise corpus of, say, 50,000 highly complex documents, storing an array of thousands of dense vectors for every single page is going to cause my database infrastructure to melt down. That’s a lot of data. My AWS bill is going to bankrupt my startup in a week. How is this actually viable in production? |
| Sam | Your intuition is spot on. The storage and compute scale of per-patch visual embeddings is a terrifying infrastructural challenge. You can’t simply dump billions of visual patch vectors into a standard single-vector index and expect sub-second latency. This is why production systems are rapidly adopting hybrid indexing architectures. |
| Alex | OK, so how do we fix the AWS bill? |
| Sam | To make this work without bankrupting your startup, you actually run 3 parallel, highly specialized indices. |
| Alex | OK, we have to run 3 separate databases just to search one document. |
| Sam | In essence, yes. First, you maintain your traditional BM25 index. This is a sparse index, basically your classic keyword search. |
| Alex | Good old BM25. |
| Sam | It never dies, and it is absolutely crucial for finding exact matches on domain-specific acronyms, serial numbers, or unique product codes that dense semantic embeddings tend to blur over. |
| Alex | That makes sense. And the second one? |
| Sam | Second, you have your dense vector index. This handles your standard semantic text matching, finding paragraphs that share conceptual meaning with the query. |
| Alex | And the third one handles the visual patches. |
| Sam | All right. The third is a specialized tensor index. This is mathematically optimized to store and search those massive multidimensional arrays of per-patch visual embeddings required for late interaction matching. |
| Alex | OK, so it allows the system to compare the text query against the specific visual geography of the document patches. Exactly. But if I’m firing a query across a keyword database, a semantic text database, and a massive visual patch database, the results coming back are going to have completely different scoring metrics. |
| Sam | Oh, totally different scales. |
| Alex | So how do you combine that chaos into a single ranked list of useful documents? |
| Sam | That reconciliation is handled by an algorithm called reciprocal rank fusion, or RRF. RRF is brilliant in its simplicity. How does it work? It doesn’t try to mathematically normalize the disparate scoring systems of a keyword hit versus a visual tensor match, which is almost impossible anyway. Instead, it looks at the rank positions. Oh, like a ranked-choice voting system. Exactly like that. It takes the ranked lists from all three indices and combines them based on the mathematical reciprocals of their rank positions. |
| Alex | So it just cares about where it placed, not the raw score. |
| Sam | Right. So if a document ranks number 2 in the keyword search, number 5 in the semantic search, and number 1 in the visual patch search, RRF mathematically bubbles it to the very top. |
| Alex | It heavily rewards documents that perform well across multiple independent search modalities. |
| Sam | Yep. It’s how we move from clumsily chunking text to surgically searching the visual structure of human knowledge. |
| Alex | Wow. So once an agent has these two capabilities, once it can natively perceive visual data through token fusion and accurately retrieve complex visual documents through hybrid indexing, what can it actually do out in the wild that it couldn’t do before? How does this change agentic execution for our listeners building these systems? |
| Sam | The leap in execution capabilities is staggering. Let’s look at how visual perception seamlessly translates into planning. OK. Previously, if you wanted an agent to analyze a competitor’s financial report, it would stumble over the tables. Now an agent can look at that visual SEC page, natively understand the tabular layouts and the narrative context, and autonomously convert those raw pixels into structured symbolic logic. |
| Alex | So it can look at a visual table and instantly construct, like, a Neo4j knowledge graph. |
| Sam | Yes, a highly structured web of mathematical relationships and entities that a downstream planning agent can mathematically reason over. |
| Alex | It turns visual chaos into machine readable logic. That’s incredible. But what really catches my attention in the notes is how this changes software integration. Oh, this is the best part, because right now, if I want my AI agent to update a customer record in Salesforce or pull data from an ERP system. I have to spend weeks building custom backend API integrations. I have to manage authentication, endpoints, data schemas. It’s a massive headache. |
| Sam | And that is exactly what multimodal tool use renders obsolete. |
| Alex | Wait, obsolete? Completely? |
| Sam | Completely. Think about how you train a new human employee. You don’t hand them an API key and tell them to write a Python script to update the CRM. |
| Alex | No, I sit them in front of a monitor. |
| Sam | Right. They look at the graphical user interface, they visually locate the submit button, they move the mouse, and they click it. Multimodal agents can now do exactly the same thing. |
| Alex | The UI becomes the API. |
| Sam | Precisely. The agent takes a screenshot of the web browser. The vision encoder processes the visual layout. The LLM understands the semantic goal, like “update customer address,” and outputs the precise X and Y screen coordinates to move the digital mouse and execute a click. |
| Alex | That is mind blowing. You bypass the need for a backend integration entirely. |
| Sam | The agent navigates the software visually using the exact same interface designed for humans. |
| Alex | That completely shatters the bottleneck of software automation. But, um, what happens when the agent makes a mistake? Because in a text-only world, if an agent writes a script to generate a chart, it executes the code and just blindly hopes the output isn’t garbage, |
| Sam | which is why the most advanced architecture right now is the generate-render-evaluate loop, or the GRE loop. The GRE loop. Multimodal agents possess the ability to self-correct using their own spatial grounding. The agent writes the code for data visualization. It executes the code and renders the chart. But then it takes a picture of its own output and acts as its own judge. |
| Alex | Oh wow. So it leverages that semantic grounding we talked about earlier. Exactly. It looks at the image and sees that the axis labels are overlapping or that the color contrast makes the data unreadable. |
| Sam | Exactly. It spots the visual error natively, feeds that visual context back into its own prompt, rewrites its code to adjust the padding or the color hex codes, and tries again. |
| Alex | It is a closed loop of visual self-correction, and entirely independent of human oversight. |
| Sam | That’s the dream of autonomous agents right there. |
| Alex | To pull all of this together for everyone listening, we’ve traced a massive evolutionary leap today. We started by recognizing the severe cognitive limitations of blind text-only models. |
| Sam | The sensory deprivation tank. |
| Alex | Right. Then we unpacked the token efficiency asymmetry and how vision encoders and Q-Formers act like a team of interns, mathematically compressing massive pixel grids into a language the LLM can actually process, |
| Sam | without blowing up your compute budget. |
| Alex | Exactly. We explored how hybrid fusion and grounding anchor the model’s vocabulary to visual reality. Then we completely dismantled traditional RAG pipelines, highlighting the necessity of native embeddings and hybrid vector indices using RRF to retrieve complex visual knowledge. |
| Sam | And finally, we saw how these capabilities empower agents to ditch complex APIs, interact with graphical interfaces natively, and visually correct their own work. |
| Alex | It is a profound shift in capability. But before we wrap up, I know you had a final thought you wanted to leave everyone with. |
| Sam | Yeah, I do. If we project this trajectory just a few steps further, it introduces an incredibly provocative implication for the future of spatial computing. OK, I’m listening. Right now, we are talking about agents processing static screenshots, you know, point-in-time visual perception. But what happens when these multimodal architectures are wired into the continuous, always-on video feeds of augmented reality smart glasses? |
| Alex | Oh man. They stop analyzing discrete moments and start building continuous temporal world models. |
| Sam | Exactly. The agent won’t just look at a screen to click a button. It will continuously observe your physical environment, track the spatial relationships of the objects around you in real time, and anticipate your physical needs before you even articulate a prompt. |
| Alex | We are moving from AI that reads text in a vacuum to AI that possesses persistent spatial awareness of the physical world. |
| Sam | From a digital assistant to a continuous spatial co-pilot, it completely reframes how we will interact with technology. |
| Alex | That is something to think about as you’re building your next system. Thank you for joining us on this deep dive. Keep experimenting, keep pushing the boundaries of your agent architectures, and remember, your autonomous systems don’t have to be trapped in the dark anymore. |
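
The reciprocal rank fusion step discussed in the conversation can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the function name `reciprocal_rank_fusion`, the document IDs, and the three sample result lists are hypothetical. The smoothing constant `k = 60` is a conventional default that the notes do not specify.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs using only rank positions.

    Each list contributes 1 / (k + rank) to a document's score (rank starts at 1).
    k = 60 is a conventional smoothing constant and an assumption here.
    """
    scores = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the three indices (best match first).
keyword_hits = ["B", "A", "C", "D", "E"]    # sparse BM25 index
semantic_hits = ["C", "D", "E", "F", "A"]   # dense text-vector index
visual_hits = ["A", "G", "B", "H", "C"]     # per-patch tensor index

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits, visual_hits])
print(fused[0])  # A
```

Document "A" ranks 2nd, 5th, and 1st across the three lists, mirroring the example in the discussion, and comes out on top without any normalization of raw scores. A production system would typically also cap how many results each index contributes, which this sketch leaves out.
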
Presentation
- Architecting Multi-modal LLMs - How to add vision, audio and other cognitive perceptions to AI agents
Read
Additional Reading
- Radford, A., Kim, J. W., Hallacy, C., et al. (2021). “Learning Transferable Visual Models From Natural Language Supervision.” OpenAI. https://arxiv.org/abs/2103.00020
- Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). “Visual Instruction Tuning” (LLaVA). NeurIPS 2023. https://arxiv.org/abs/2304.08485
- Faysse, M., Sibille, H., Wu, T., et al. (2024). “ColPali: Efficient Document Retrieval with Vision Language Models.” https://arxiv.org/abs/2407.01449
- Girdhar, R., El-Nouby, A., Liu, Z., et al. (2023). “ImageBind: One Embedding Space To Bind Them All.” CVPR 2023. https://arxiv.org/abs/2305.05665
- RAGFlow Team (2025). “From RAG to Context: 2025 Year-End Review of RAG.” https://ragflow.io/blog/rag-review-2025-from-rag-to-context
- Beyer, L., Steiner, A., Pinto, A. S., et al. (2024). “PaliGemma: A Versatile 3B VLM for Transfer.” Google DeepMind. https://arxiv.org/abs/2407.07726
- BentoML (2026). “Multimodal AI: Best Open-Source Vision Language Models.” https://www.bentoml.com/blog/multimodal-ai-a-guide-to-open-source-vision-language-models