Multi-modal LLMs

This session examines the evolution and architecture of Multimodal Large Language Models (MLLMs), which expand AI capabilities beyond text to include images, audio, video, and structured documents. The text details a three-stage pipeline consisting of a vision encoder, a projection layer for alignment, and an LLM backbone for joint reasoning. It further explores multimodal retrieval-augmented generation (RAG), comparing traditional data conversion methods with newer, native embedding techniques like ColPali that preserve visual fidelity.

Specialized fusion strategies and visual grounding mechanisms are discussed as the technical foundation for how these models “see” and interpret spatial relationships. Finally, the source outlines agentic AI patterns, demonstrating how multimodal perception enables autonomous agents to perform complex tasks like GUI automation and visual evidence verification.

Figure: Infographic that summarizes the content of this session

Listen

Session Overview: Multi-modal LLMs

Transcript

Speaker	Text
Alex	This is the brief on multi-modal large language models. Today, we’re reviewing 2026 lecture notes on how AI is evolving past just reading text, allowing autonomous agents to actually see, hear, and perceive the world. First, let’s look at the massive shift to multimodal AI. You know, models that only read are incredibly smart, but they’re fundamentally blind. I mean, if you feed them a chart by just extracting the text, the layout and visual meaning are totally destroyed. A text-only LLM is literally like a genius locked in a dark, silent room. Giving it multi-modal capabilities is like finally turning on the lights and opening a window. Second, how do we actually build this? Well, it takes a 3-stage pipeline. A vision encoder chops an image into a grid of tiny patches. Next, a projection layer translates those visual patches right into the LLM’s text-based embedding space. Then the LLM backbone processes both together. Now, wait, doesn’t chopping an image into tiny patches destroy the big picture? You’d think so, but the model uses self-attention to globally map how those pieces relate to each other. Finally, this absolutely changes data retrieval with multimodal RAG. Instead of using clunky OCR to pull text from a PDF, which totally ruins your charts, new models like ColPali embed document pages natively as images. You just search the visual representation directly. Now, isn’t standard text search just easier? Sure, until it completely fails because your most important data is locked inside an SEC revenue bar chart. Giving AI the ability to see and hear doesn’t just add cool features, it transforms language models from simple text calculators into genuine perceptual reasoners.
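
The patch grid described in the brief can be sized with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the 1,344-pixel image, 14-pixel patch, and 64-token budget are the example figures used in the Deep Dive transcript further down. It counts the patch tokens a vision encoder would emit and compares the number of self-attention token pairs before and after compressing to a fixed query-token budget.

```python
def patch_token_count(height: int, width: int, patch_size: int = 14) -> int:
    """Count non-overlapping square patches: (height / patch) * (width / patch)."""
    if height % patch_size or width % patch_size:
        # Real encoders resize or pad first; this sketch simply rejects such sizes.
        raise ValueError("height and width must be multiples of patch_size")
    return (height // patch_size) * (width // patch_size)


def attention_pair_count(num_tokens: int) -> int:
    """Self-attention scores every token against every token: N * N pairs."""
    return num_tokens * num_tokens


raw_tokens = patch_token_count(1344, 1344)  # 96 patches per side
query_budget = 64                           # example budget from the Deep Dive

print(raw_tokens)                                                         # 9216
print(attention_pair_count(raw_tokens) // attention_pair_count(query_budget))  # 20736
```

Under these assumptions, a 64-token budget cuts the visual share of pairwise attention work by a factor of roughly twenty thousand, which is why the projection stage matters so much for high-resolution inputs.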

Deep Dive: Multi-modal LLMs

Transcript

Speaker	Text
Alex	So, um, imagine you are trying to describe a really chaotic, messy financial scatter plot to someone over the phone using nothing but words. Oh, that sounds like a total nightmare, right? It’s impossible. You have to explain every single data point, the trajectory of the trend line, the weird clustering of outliers, uh, the spacing of the axis labels. Within 30 seconds, the person on the other end is just completely lost.
Sam	Yeah, they lose all the spatial context, the groupings, the trends, it’s all gone.
Alex	Exactly. It is an exercise in futility. But if you think about it, that is exactly how our most advanced AI agents experience the world today, because, you know, as someone out there building retrieval augmented generation pipelines and agentic systems, you already know these text-based models are brilliant.
Sam	They are, but fundamentally they are trapped in a sensory deprivation tank.
Alex	Yeah, that’s exactly the phrase for it.
Sam	It really is a severe cognitive bottleneck. I mean, we have engineered these massive, incredibly capable reasoning engines, but we force them to understand the world entirely through the highly compressed, lossy medium of tokenized text,
Alex	which is not how we operate at all.
Sam	No, human cognition doesn’t work that way. We integrate visual, auditory, and linguistic information continuously to make sense of our environment. When you force a model to rely purely on text, you are systematically blinding it to the reality of how information is actually structured,
Alex	which brings us to our mission for today’s deep dive. We are jumping into a fascinating set of spring 2026 lecture notes from a graduate course called, uh, Building Generative AI Business Solutions.
Sam	Really great material on this one.
Alex	Yeah, it’s dense, but so good. The focus is on a massive paradigm shift. We’re looking at how multimodal large language models, MLLMs, are transforming AI from blind text processors into genuine perceptual reasoners,
Sam	and this is huge for anyone building autonomous systems.
Alex	Exactly. We’re gonna explore the underlying math of how they actually see, how they fuse modalities, and most importantly for you listening, how to practically wire these concepts into the agent loops and retrieval systems you’re building right now.
Sam	To really appreciate the gravity of this shift, you just have to consider a standard text extraction pipeline processing a dense enterprise document.
Alex	Right, like a financial report or something.
Sam	Yeah. When you convert a complex formatted page into a flat text string, you are actively destroying semantic meaning. The column alignments in a financial table, the grouping of rows, the relative position of a warning label next to a schematic. That spatial layout isn’t just decoration.
Alex	No, it is the actual data.
Sam	Exactly. The layout is the data.
Alex	So, um, if text models only understand words, how on earth do you feed a massive high-resolution image into a text prompt without completely breaking the system’s architecture?
Sam	Well, it all comes down to what the lecture notes describe as the token efficiency asymmetry. This is basically the foundational mathematical hurdle of multimodal AI.
Alex	OK, token efficiency asymmetry. Break that down for us.
Sam	Sure. So text is incredibly dense and highly abstract. A single text token can represent a highly precise concept, right? Like a specific noun, a complex logical operator, a distinct mathematical function.
Alex	It packs a lot of punch into a tiny package.
Sam	Exactly. But images are the exact opposite. They’re continuous two-dimensional signals made of massive pixel arrays. They are incredibly dense in raw data, but very sparse in immediate semantic meaning.
Alex	So to a computer, an image is just a giant grid of numbers representing colors. And somehow we have to translate that raw grid into the highly refined conceptual language of a large language model.
Sam	Precisely. And to pull that off, modern vision language models rely on a strict 3-stage architectural pipeline. Stage 1 is the vision encoder. Typically this is a vision transformer that takes the raw image and mathematically chunks it into a grid of non-overlapping patches.
Alex	OK, let’s unpack this mechanism because it’s pretty wild. I’m picturing the vision encoder acting almost like a meticulous, highly analytical translator. It takes a beautiful, complex painting, takes a pair of scissors, and cuts it up into a rigid grid of, say, 14 by 14 pixel squares.
Sam	That’s a good way to look at it. Yeah, that’s a standard patch size.
Alex	And then it examines each individual square and describes its visual features, you know, the edges, the colors, the textures in a mathematical vector that the LLM natively understands.
Sam	That is the exact mechanism. But the mathematical reality of cutting up that painting introduces a massive computational nightmare.
Alex	Uh, why is that?
Sam	Well, the formula for the number of patches generated is the image height divided by the patch size, multiplied by the width divided by the patch size.
Alex	OK, so just basic area math, right?
Sam	So if you hand the model a relatively standard high-resolution image, say 1,344 by 1,344 pixels, and you are using that standard 14-pixel patch, you suddenly generate over 9,000 patch tokens for a single image.
Alex	Wait, hold on, 9,000 tokens just to look at one single image? That eats up a context window incredibly fast. I mean, you’re burning through tokens before you even ask a question.
Sam	And it gets worse because the LLM processes sequence tokens using self-attention. The computational cost scales quadratically. It is an O(N²) bottleneck.
Alex	Right, because every token has to look at every other token.
Sam	Exactly. So if you flood the context window with 9,000 visual tokens, the compute cost doesn’t just increase, it blows up quadratically. The model would just grind to a halt.
Alex	So there has to be a filter, right? You can’t just dump 9000 tokens into the prompt and expect it to run efficiently or cheaply,
Sam	which brings us to the crucial second stage, the projection layer. This is arguably the most active area of multimodal architectural research right now.
Alex	OK, so how does the projection layer fix the bottleneck?
Sam	Well, early on, models used really simple approaches like linear layers or MLPs, multi-layer perceptrons. These were computationally cheap to train, but they essentially acted as a direct 1-to-1 mapping tool.
Alex	Oh, I see where this is going.
Sam	Yeah, they took every single visual patch and mapped it directly to an LLM token, meaning you kept all 9,000 tokens,
Alex	which totally fails to solve our exploding context window problem.
Sam	It failed completely for high-resolution reasoning. That failure drove the industry toward advanced projection strategies like Q-Formers, or Querying Transformers.
Alex	Q-Formers, OK.
Sam	Instead of a direct map, a Q-Former introduces a fixed set of learnable query tokens. Think of it as a strict token budget. Maybe you only allow 32 or 64 query tokens total.
Alex	Oh, that’s a massive reduction.
Sam	It is. These query tokens use cross attention to dynamically scan those 9000 raw visual patches and extract only the most salient task relevant information.
Alex	I love that. It’s like, um, instead of sending all 9000 puzzle pieces to the CEO’s desk, you send a team of 64 incredibly smart interns to look at the puzzle and come back with a highly compressed executive summary of what the picture actually shows.
Sam	That is a perfect analogy, and that compressed summary, those 64 rich tokens, is what finally gets passed to stage 3, which is the LLM backbone itself.
Alex	So the compression happens before the visual data ever hits the main reasoning engine.
Sam	Exactly. It saves you from that quadratic computation nightmare.
Alex	OK, so the interns bring back the summary, and now we have these translated visual tokens sitting next to our text tokens. But how do they actually interact? Because just putting a picture and a sentence in the same room doesn’t mean they understand each other, right?
Sam	That interaction is governed by token fusion. The architecture dictates when and how deeply the text and visual vectors mix.
Alex	Are there different ways to do that?
Sam	Yeah, definitely. The most aggressive approach is early fusion, where you concatenate all the visual and text tokens into a single unified sequence before the very first layer of the neural network.
Alex	Wow, that sounds like it would give you incredibly deep, rich spatial reasoning. Because every single layer of the model gets to evaluate the text and the image simultaneously.
Sam	It does, but it comes with a massive cost. The computational overhead is astronomical because you are running that complex attention mechanism across the combined sequence for the entire depth of the model.
Alex	Right, the O(N²·d) problem comes back to bite you.
Sam	Exactly. Now, on the completely opposite end of the spectrum, you have late fusion. Here, the image and text are processed in entirely separate neural pathways, and their representations are only mathematically combined at the very final output layers,
Alex	which is like two people working on a massive group project in separate rooms, entirely isolated from one another and only comparing their notes five minutes before the actual presentation.
Sam	Yeah, that’s exactly it. It’s highly efficient, but you get incredibly weak cross-modal reasoning.
Alex	Right, because they couldn’t collaborate on the complex parts.
Sam	It’s terrible for complex tasks. To solve this, researchers developed cross-attention fusion, often referred to as the Flamingo-style architecture.
Alex	Like the bird?
Sam	Yeah, like the bird. This uses dedicated cross-attention layers spaced at regular intervals specifically for injecting visual data into the text stream. It’s highly scalable and particularly great for ingesting video frames. But the gold standard today, what you see in the highest performing production models, is hybrid fusion.
Alex	Blending the best of both worlds.
Sam	Exactly. Hybrid architectures use early fusion in the foundational lower layers to establish deep connections for critical tokens, and then they switch to sparse cross-attention in the higher layers.
Alex	That keeps the computational cost manageable.
Sam	Right, while still preserving deep reasoning capabilities.
Alex	So once those tokens are fused, how does the model actually leverage them to formulate an answer? Because, say I ask the model, what’s wrong with this car engine? It needs to bridge my textual question with the specific pixels showing a cracked belt.
Sam	The notes detail a mechanism for this called grounding. Grounding is basically how visual tokens exert influence over the text generation process. Visual tokens essentially act as persistent contextual anchors. There are 3 types. Spatial grounding is when the model attends heavily to specific patch tokens to isolate locations,
Alex	like drawing a bounding box around that cracked engine belt.
Sam	Exactly. Temporal grounding applies that same logic across time for video, aligning textual concepts to specific frames, but semantic grounding is where the underlying math gets really interesting.
Alex	Let’s drill into semantic grounding. How does that anchor actually alter the model’s output?
Sam	Think of it as a distributional shift. The sheer presence of visual tokens in the context window actively shifts the LLM’s vocabulary probability distribution. So if the visual tokens represent a dense forest scene, they act as a gravitational pull. They increase the probability mass over tokens related to trees, leaves, shadows, and dirt while actively suppressing the probability of tokens related to, say, oceans or skyscrapers.
Alex	It forces the language generation engine to remain semantically tethered to the visual reality of the image.
Sam	Exactly. It keeps the model grounded in what it’s actually seeing.
Alex	That is fascinating, and it brings to mind a very practical, everyday prompt engineering problem for our listeners. Yeah. Since the model is processing these token sequences sequentially, and the visual tokens influence the text probabilities, does the order matter? Like, if I’m prompting a multimodal model, should the image go first or the text?
Sam	The empirical data from the course notes provides a definitive answer to that. Placing your text question before the image in the prompt yields a consistent 5 to 10% accuracy boost on complex analytical tasks.
Alex	Wait, up to 10% just from swapping the order of the inputs? Just the order. Let me think through the mechanics of why that happens. It has to be attentional priming, right? If you put the text first, you’re forcing the model to establish the relevant semantic concepts, telling it exactly what it needs to look for before it ever encounters the visual data. Exactly. But if you put the image first, the model has to process this massive raw grid of pixels blindly, trying to encode everything at once, and then gets hit with a surprise question at the end.
Sam	You’ve nailed the underlying mechanism. When the question tokens run through the attention heads first, they prime the network. That makes so much sense. So when those visual patches finally enter the attention window, the Q-Former and the cross-attention layers already know to ignore the background noise and hyperfocus their computational resources on the specific features you asked about.
Alex	This changes so much of how we need to build, and it perfectly sets up the next major hurdle: rethinking retrieval. Here’s where it gets really interesting. Let’s talk about multimodal RAG, because standard text-based retrieval augmented generation is failing hard in the enterprise space right now.
Sam	It is fundamentally broken when confronted with the reality of complex documents. Take a standard SEC 10-K financial filing.
Alex	OK, classic messy document, right?
Sam	Imagine a page featuring a complex bar chart that illustrates a critical downward trend in quarterly revenue over two years, surrounded by dense footnotes. If you run that page through a standard text extraction pipeline, what happens?
Alex	It’s a total disaster. You might get some messy OCR, you know, optical character recognition, that spits out disjointed axis labels and a random jumble of numbers. Yeah, just garbage text. But the visual trend itself, the actual slope of the bars showing the revenue collapse, is entirely lost. Your RAG system retrieves the text chunks but completely misses the most crucial piece of evidence on the page because it couldn’t see the chart.
Sam	Exactly. And the industry’s first reaction to this was what the notes call modality conversion. People tried to patch the sinking ship of text-only RAG. They would run OCR, then employ an early vision model to write a generic textual caption summarizing the chart, and then they’d index that caption into their vector database,
Alex	right, because it fits neatly into their existing text-only infrastructure.
Sam	It’s convenient, sure, but a generic paragraph of text simply cannot capture the nuance of a complex scatter plot or a multilayered architectural diagram,
Alex	which forces a paradigm shift to native multimodal embeddings. Instead of destroying the image to create text, we need to embed the images directly.
Sam	Exactly. This is the breakthrough driven by late interaction architectures, specifically models like ColPali.
Alex	ColPali, right.
Sam	Instead of extracting text, you render the entire PDF page as an image. You process it through a vision encoder and you embed the visual representation directly into a shared vector space alongside your text queries.
Alex	So you preserve the absolute full visual semantics of the document.
Sam	You preserve everything.
Alex	OK, hold on, I need to throw a massive red flag on this concept.
Sam	Uh oh, what’s the flag?
Alex	If I understand late interaction correctly, a model like ColPali doesn’t just create one neat little vector for an entire PDF page. It creates a vector for every single 14 by 14 pixel patch of that rendered page.
Sam	That is correct.
Alex	OK, so if I have an enterprise corpus of, say, 50,000 highly complex documents, storing an array of thousands of dense vectors for every single page is going to cause my database infrastructure to melt down.
Sam	That’s a lot of data.
Alex	My AWS bill is going to bankrupt my startup in a week. How is this actually viable in production?
Sam	Your intuition is spot on. The storage and compute scale of per-patch visual embeddings is a terrifying infrastructural challenge. You can’t simply dump billions of visual patch vectors into a standard single-vector index and expect subsecond latency. This is why production systems are rapidly adopting hybrid indexing architectures.
Alex	OK, so how do we fix the AWS bill?
Sam	To make this work without bankrupting your startup, you actually run 3 parallel, highly specialized indices.
Alex	OK, we have to run 3 separate databases just to search one document.
Sam	In essence, yes. First, you maintain your traditional BM25 index. This is a sparse index, basically your classic keyword search.
Alex	Good old BM25.
Sam	It never dies, and it is absolutely crucial for finding exact matches on domain-specific acronyms, serial numbers, or unique product codes that dense semantic embeddings tend to blur over.
Alex	That makes sense. And the second one.
Sam	Second, you have your dense vector index. This handles your standard semantic text matching, finding paragraphs that share conceptual meaning with the query.
Alex	And the third one handles the visual patches.
Sam	All right. The third is a specialized tensor index. This is mathematically optimized to store and search those massive multidimensional arrays of per-patch visual embeddings required for late interaction matching.
Alex	OK, so it allows the system to compare the text query against the specific visual geography of the document patches. But if I’m firing a query across a keyword database, a semantic text database, and a massive visual patch database, the results coming back are going to have completely different scoring metrics.
Sam	Oh, totally different scales.
Alex	So how do you combine that chaos into a single ranked list of useful documents?
Sam	That reconciliation is handled by an algorithm called reciprocal rank fusion, or RRF. RRF is brilliant in its simplicity.
Alex	How does it work?
Sam	It doesn’t try to mathematically normalize the disparate scoring systems of a keyword hit versus a visual tensor match, which is almost impossible anyway. Instead, it looks at the rank positions.
Alex	Oh, like a ranked choice voting system.
Sam	Exactly like that. It takes the ranked lists from all three indices and combines them based on the mathematical reciprocals of their rank positions.
Alex	So it just cares about where it placed, not the raw score.
Sam	Right. So if a document ranks number 2 in the keyword search, number 5 in the semantic search, and number 1 in the visual patch search, RRF mathematically bubbles it to the very top. It heavily rewards documents that perform well across multiple independent search modalities.
Alex	Yep. It’s how we move from clumsily chunking text to surgically searching the visual structure of human knowledge. Wow. So once an agent has these two capabilities, once it can natively perceive visual data through token fusion and accurately retrieve complex visual documents through hybrid indexing, what can it actually do out in the wild that it couldn’t do before? How does this change agentic execution for our listeners building these systems?
Sam	The leap in execution capabilities is staggering. Let’s look at how visual perception seamlessly translates into planning. OK. Previously, if you wanted an agent to analyze a competitor’s financial report, it would stumble over the tables. Now an agent can look at that visual SEC page, natively understand the tabular layouts and the narrative context, and autonomously convert those raw pixels into structured symbolic logic.
Alex	So it can look at a visual table and instantly construct, like, a Neo4j knowledge graph.
Sam	Yes, a highly structured web of mathematical relationships and entities that a downstream planning agent can mathematically reason over.
Alex	It turns visual chaos into machine-readable logic. That’s incredible. But what really catches my attention in the notes is how this changes software integration.
Sam	Oh, this is the best part.
Alex	Because right now, if I want my AI agent to update a customer record in Salesforce or pull data from an ERP system, I have to spend weeks building custom backend API integrations. I have to manage authentication, endpoints, data schemas. It’s a massive headache.
Sam	And that is exactly what multimodal tool use renders obsolete.
Alex	Wait, obsolete completely.
Sam	Completely. Think about how you train a new human employee. You don’t hand them an API key and tell them to write a Python script to update the CRM.
Alex	No, I sit them in front of a monitor.
Sam	Right. They look at the graphical user interface, they visually locate the submit button, they move the mouse, and they click it. Multimodal agents can now do exactly the same thing.
Alex	The UI becomes the API.
Sam	Precisely. The agent takes a screenshot of the web browser. The vision encoder processes the visual layout. The LLM understands the semantic goal, like “update customer address,” and outputs the precise X and Y screen coordinates to move the digital mouse and execute a click.
Alex	That is mind blowing. You bypass the need for a backend integration entirely.
Sam	The agent navigates the software visually using the exact same interface designed for humans.
Alex	That completely shatters the bottleneck of software automation. But, um, what happens when the agent makes a mistake? Because in a text-only world, if an agent writes a script to generate a chart, it executes the code and just blindly hopes the output isn’t garbage,
Sam	which is why the most advanced architecture right now is the generate-render-evaluate loop, or the GRE loop. Multimodal agents possess the ability to self-correct using their own spatial grounding. The agent writes the code for data visualization. It executes the code and renders the chart. But then it takes a picture of its own output and acts as its own judge.
Alex	Oh wow. So it leverages that semantic grounding we talked about earlier. Exactly. It looks at the image and sees that the axis labels are overlapping or that the color contrast makes the data unreadable.
Sam	Exactly. It spots the visual error natively, feeds that visual context back into its own prompt, rewrites its code to adjust the padding or the color hex codes, and tries again.
Alex	It is a closed loop of visual self-correction, entirely independent of human oversight.
Sam	That’s the dream of autonomous agents right there.
Alex	To pull all of this together for everyone listening, we’ve traced a massive evolutionary leap today. We started by recognizing the severe cognitive limitations of blind text-only models,
Sam	the sensory deprivation tank,
Alex	right. Then we unpacked the token efficiency asymmetry and how vision encoders and Q-Formers act like a team of interns, mathematically compressing massive pixel grids into a language the LLM can actually process,
Sam	without blowing up your compute budget.
Alex	Exactly, we explored how hybrid fusion and grounding anchor the model’s vocabulary to visual reality. Then we completely dismantled traditional RAG pipelines, highlighting the necessity of native embeddings and hybrid vector indices using RRF to retrieve complex visual knowledge.
Sam	And finally, we saw how these capabilities empower agents to ditch complex APIs, interact with graphical interfaces natively, and visually correct their own work.
Alex	It is a profound shift in capability. But before we wrap up, I know you had a final thought you wanted to leave everyone with.
Sam	Yeah, I do. If we project this trajectory just a few steps further, it introduces an incredibly provocative implication for the future of spatial computing.
Alex	OK, I’m listening.
Sam	Right now, we are talking about agents processing static screenshots, you know, point-in-time visual perception. But what happens when these multimodal architectures are wired into the continuous, always-on video feeds of augmented reality smart glasses?
Alex	Oh man. They stop analyzing discrete moments and start building continuous temporal world models.
Sam	Exactly. The agent won’t just look at a screen to click a button. It will continuously observe your physical environment, track the spatial relationships of the objects around you in real time, and anticipate your physical needs before you even articulate a prompt.
Alex	We are moving from AI that reads text in a vacuum to AI that possesses persistent spatial awareness of the physical world.
Sam	From a digital assistant to a continuous spatial co-pilot, it completely reframes how we will interact with technology.
Alex	That is something to think about as you’re building your next system. Thank you for joining us on this deep dive. Keep experimenting, keep pushing the boundaries of your agent architectures, and remember, your autonomous systems don’t have to be trapped in the dark anymore.
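
The reciprocal rank fusion step discussed in the Deep Dive can be sketched in a few lines. This is a minimal illustration, not the course's implementation: the function name and document IDs are hypothetical, and the smoothing constant k = 60 is an assumption (a commonly used default that the transcript does not specify). Each index contributes 1 / (k + rank) for every document it returns, and documents are sorted by the summed score, so only rank positions matter and raw scores from the three indices never need to be normalized.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs using rank positions only.

    rankings: list of lists, each ordered best-first (rank 1 first).
    k: smoothing constant; 60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from the three indices described in the transcript.
bm25_hits = ["doc_b", "doc_a", "doc_c", "doc_d", "doc_e"]    # keyword (sparse)
dense_hits = ["doc_c", "doc_d", "doc_b", "doc_e", "doc_a"]   # semantic text
visual_hits = ["doc_a", "doc_e", "doc_d", "doc_c", "doc_b"]  # per-patch visual

fused = reciprocal_rank_fusion([bm25_hits, dense_hits, visual_hits])
print(fused[0])  # doc_a (ranked 2nd, 5th, and 1st across the three lists)
```

Here doc_a mirrors the transcript's example of a document placing second, fifth, and first, and it ends up at the top of the fused list. A larger k flattens the differences between rank positions, while a smaller k rewards top placements more heavily.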

Presentation

Architecting Multi-modal LLMs - How to add vision, audio and other cognitive perceptions to AI agents

Read

Multimodal Large Language Models - Images, Audio, Video & Vector Retrieval Across Modalities

Additional Reading

Radford, A., Kim, J. W., Hallacy, C., et al. (2021). “Learning Transferable Visual Models From Natural Language Supervision.” OpenAI. https://arxiv.org/abs/2103.00020
Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). “Visual Instruction Tuning” (LLaVA). NeurIPS 2023. https://arxiv.org/abs/2304.08485
Faysse, M., Sibille, H., Wu, T., Omrani, B., Viaud, G., Hudelot, C., & Colombo, P. (2024). “ColPali: Efficient Document Retrieval with Vision Language Models.” https://arxiv.org/abs/2407.01449
Girdhar, R., El-Nouby, A., Liu, Z., et al. (2023). “ImageBind: One Embedding Space To Bind Them All.” CVPR 2023. https://arxiv.org/abs/2305.05665
RAGFlow Team (2025). “From RAG to Context: 2025 Year-End Review of RAG.” https://ragflow.io/blog/rag-review-2025-from-rag-to-context
Beyer, L., Steiner, A., Pinto, A. S., et al. (2024). “PaliGemma: A Versatile 3B VLM for Transfer.” Google DeepMind. https://arxiv.org/abs/2407.07726
BentoML (2026). “Multimodal AI: Best Open-Source Vision Language Models.” https://www.bentoml.com/blog/multimodal-ai-a-guide-to-open-source-vision-language-models