Multi-modal LLMs

This session examines the evolution and architecture of Multimodal Large Language Models (MLLMs), which extend AI capabilities beyond text to images, audio, video, and structured documents. It details a three-stage pipeline: a vision encoder, a projection layer that aligns visual features with the language model's embedding space, and an LLM backbone for joint reasoning. It then explores multimodal retrieval-augmented generation (RAG), comparing traditional data conversion methods with newer native embedding techniques such as ColPali that preserve visual fidelity.

Specialized fusion strategies and visual grounding mechanisms are discussed as the technical foundation for how these models “see” and interpret spatial relationships. Finally, the session outlines agentic AI patterns, showing how multimodal perception enables autonomous agents to perform complex tasks such as GUI automation and visual evidence verification.
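To make the three-stage pipeline described above more concrete, here is a minimal sketch of how data might flow through it, written in Python with NumPy. It is a toy illustration, not any particular model's implementation: the weights are random, the sizes are tiny, and all names (`patchify`, `vision_encoder`, `projector`, `build_llm_input`) are hypothetical. The LLM backbone itself is not modeled; the sketch stops at the sequence it would receive.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumptions); real models are far larger.
PATCH = 8          # patch side length in pixels
D_VISION = 32      # vision encoder feature size
D_MODEL = 64       # LLM embedding size
VOCAB = 100        # toy vocabulary size

# Random stand-ins for learned weights.
W_enc = rng.normal(size=(PATCH * PATCH * 3, D_VISION))
W_proj = rng.normal(size=(D_VISION, D_MODEL))
b_proj = np.zeros(D_MODEL)
embedding_table = rng.normal(size=(VOCAB, D_MODEL))


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)


def vision_encoder(image: np.ndarray) -> np.ndarray:
    """Stage 1 (stand-in): one feature vector per image patch."""
    return patchify(image, PATCH) @ W_enc


def projector(features: np.ndarray) -> np.ndarray:
    """Stage 2: linear projection from vision space into the LLM embedding space."""
    return features @ W_proj + b_proj


def build_llm_input(image: np.ndarray, token_ids: np.ndarray) -> np.ndarray:
    """Stage 3 input: projected visual tokens followed by text token embeddings."""
    visual_tokens = projector(vision_encoder(image))
    text_tokens = embedding_table[token_ids]
    return np.concatenate([visual_tokens, text_tokens], axis=0)


image = rng.random((32, 32, 3))          # toy 32x32 RGB image -> 16 patches
token_ids = np.array([5, 17, 42, 8, 3])  # toy prompt of 5 token ids
llm_input = build_llm_input(image, token_ids)
print(llm_input.shape)
```

In this sketch the 16 visual tokens and 5 text tokens end up in one sequence of 21 vectors with the same width, which is what allows a single backbone to attend over both modalities. A linear projector is only one possible design; other connector types exist and are not shown here.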

Figure: Infographic summarizing the content of this session

Listen

Session Overview: Multi-modal LLMs
Deep Dive: Multi-modal LLMs

Presentation

Read

Additional Reading

  1. Radford, A., Kim, J. W., Hallacy, C., et al. (2021). “Learning Transferable Visual Models From Natural Language Supervision.” OpenAI. https://arxiv.org/abs/2103.00020
  2. Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). “Visual Instruction Tuning” (LLaVA). NeurIPS 2023. https://arxiv.org/abs/2304.08485
  3. Faysse, M., Sibille, H., Wu, T., Omrani, B., Viaud, G., Hudelot, C., & Colombo, P. (2024). “ColPali: Efficient Document Retrieval with Vision Language Models.” https://arxiv.org/abs/2407.01449
  4. Girdhar, R., El-Nouby, A., Liu, Z., et al. (2023). “ImageBind: One Embedding Space To Bind Them All.” CVPR 2023. https://arxiv.org/abs/2305.05665
  5. RAGFlow Team (2025). “From RAG to Context: 2025 Year-End Review of RAG.” https://ragflow.io/blog/rag-review-2025-from-rag-to-context
  6. Beyer, L., Steiner, A., Pinto, A. S., et al. (2024). “PaliGemma: A Versatile 3B VLM for Transfer.” Google DeepMind. https://arxiv.org/abs/2407.07726
  7. BentoML (2026). “Multimodal AI: Best Open-Source Vision Language Models.” https://www.bentoml.com/blog/multimodal-ai-a-guide-to-open-source-vision-language-models