Stage 1 — Vision Encoder (ViT/CLIP): Splits image into P x P patches → linear projection → position embeddings → transformer encoder layers
Stage 2 — Projection / Adapter Layer: Maps vision encoder output dimensionality to LLM embedding space (e.g., 1024d → 4096d); MLP or Q-Former
Stage 3 — LLM Backbone + Token Fusion: Visual tokens prepended to text tokens → unified causal self-attention; no LLM architecture changes needed
Training Phase 1: Freeze vision encoder, train projection layer on image-caption pairs (alignment)
Training Phase 2: Fine-tune projection + LLM (or LoRA) on visual instruction-following datasets
Stage 1: Vision Encoder (ViT / CLIP)
The vision encoder transforms a raw image into a sequence of dense feature vectors. The dominant architecture is the Vision Transformer (ViT), which operates by splitting an input image of resolution H × W into a grid of non-overlapping patches, each of size P × P pixels. For a standard ViT-L/14 configuration with input resolution 224 × 224 and patch size P = 14, this produces N = (H/P) × (W/P) = 16 × 16 = 256 patches. Each patch is flattened into a vector of dimension P² × C = 14² × 3 = 588 and linearly projected to a d-dimensional embedding (1024 for ViT-L). Learnable position embeddings are added to each patch token to encode spatial location. A special [CLS] token is prepended to the sequence to aggregate global image semantics. The resulting sequence of N + 1 tokens is processed through L transformer encoder layers with multi-head self-attention.
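The arithmetic in the note above can be checked with a few lines of Python. This is a minimal sketch: the function name `vit_sequence_shape` and its parameters are illustrative and not taken from any library.

```python
def vit_sequence_shape(height, width, patch, channels=3, use_cls=True):
    """Return (num_patches, flat_patch_dim, sequence_length) for a ViT input."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    # N = (H / P) * (W / P) non-overlapping patches
    num_patches = (height // patch) * (width // patch)
    # each patch is flattened to P * P * C values before the linear projection
    flat_patch_dim = patch * patch * channels
    # the optional [CLS] token adds one position to the sequence
    seq_len = num_patches + (1 if use_cls else 0)
    return num_patches, flat_patch_dim, seq_len


print(vit_sequence_shape(224, 224, 14))  # (256, 588, 257)
print(vit_sequence_shape(336, 336, 14))  # (576, 588, 577)
```

Raising the input resolution from 224 to 336 at the same patch size more than doubles the patch count, which is the cost discussed on the patch-count slide.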
Stage 2: Projection / Adapter Layer
The vision encoder outputs embeddings in its own dimensionality (e.g., 1024d for ViT-L/14, 1408d for EVA-CLIP), which must be projected to match the LLM backbone’s token embedding space (e.g., 4096d for LLaMA-7B, 5120d for LLaMA-13B). Two dominant approaches exist for this alignment, MLP projection and Q-Former, which are examined in detail on the following slides.
Stage 3: LLM Backbone and Token Fusion
The projected visual tokens are introduced into the LLM’s input sequence alongside text tokens.
In the simplest formulation, visual tokens are prepended to the text token sequence: the LLM receives $[V_1, V_2, \ldots, V_N, T_1, T_2, \ldots, T_M]$ as its input, where $V_i$ are visual tokens and $T_j$ are text tokens.
The LLM’s causal self-attention mechanism then operates over this unified sequence — every text token can attend to all visual tokens and all preceding text tokens. Generation is autoregressive: the model predicts the next text token conditioned on all visual tokens and all previously generated text tokens. This fusion mechanism requires no architectural modification to the LLM itself — only the input representation changes.
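The attention pattern described above can be made concrete with a toy causal mask. This is a minimal sketch, assuming a plain lower-triangular mask over the concatenated sequence; the helper names are illustrative.

```python
def build_fused_sequence(num_visual, num_text):
    """Label the fused input: visual tokens first, then text tokens."""
    visual = [f"V{i + 1}" for i in range(num_visual)]
    text = [f"T{j + 1}" for j in range(num_text)]
    return visual + text


def causal_mask(length):
    """mask[q][k] is True when query position q may attend to key position k."""
    return [[k <= q for k in range(length)] for q in range(length)]


labels = build_fused_sequence(num_visual=3, num_text=2)
mask = causal_mask(len(labels))
for q, label in enumerate(labels):
    visible = [labels[k] for k in range(len(labels)) if mask[q][k]]
    print(label, "attends to", visible)
```

Because the visual tokens come first, every text position sees all of them. Under this plain mask the visual tokens are themselves causal (V1 does not see V2), and no position ever attends to later text.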
Stage 3: LLM Backbone and Token Fusion (cont’d)
The training procedure typically follows a two-phase approach.
Phase 1: Pre-training alignment. The vision encoder is frozen and only the projection layer is trained on large-scale image-caption pairs (e.g., LLaVA used 595K filtered pairs from CC3M; LLaVA-1.5 used a 558K-pair subset of LAION, CC, and SBU captions). The objective is to align the visual feature space with the LLM’s token space.
Phase 2: Visual instruction tuning — the projection layer and LLM (or LoRA adapters on the LLM) are jointly fine-tuned on curated instruction-following datasets that pair images with multi-turn question-answer conversations.
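The two phases amount to a table of which components receive gradient updates. The sketch below encodes that table. The component names are hypothetical, and keeping the vision encoder frozen in Phase 2 is an assumption, since the notes above only state which parts are fine-tuned.

```python
def trainable_components(phase, use_lora=False):
    """Return {component: is_trainable} for the given training phase."""
    if phase == 1:
        # alignment: only the projection layer is updated
        return {"vision_encoder": False, "projection": True,
                "llm_base": False, "llm_lora": False}
    if phase == 2:
        # instruction tuning: projection plus either the full LLM or LoRA adapters
        # assumption: the vision encoder stays frozen in this phase as well
        return {"vision_encoder": False, "projection": True,
                "llm_base": not use_lora, "llm_lora": use_lora}
    raise ValueError("phase must be 1 or 2")


for phase, lora in [(1, False), (2, False), (2, True)]:
    print(phase, lora, trainable_components(phase, lora))
```

In a real training script these flags would typically be applied by toggling gradient tracking per parameter group; that step is framework-specific and omitted here.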
Patch Count and Computational Cost
Patch count formula: N = (H / P) × (W / P), which sets the number of visual tokens per image
Dynamic resolution — resize to minimum needed for the task
Patch merging — combine adjacent patches in later layers
Windowed attention — local windows + periodic global attention → near-linear scaling
Image tokens have much lower info density than text tokens — central design challenge
Vision encoder pre-trained via CLIP contrastive learning before connecting to LLM
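To see why the patch count drives cost, the sketch below evaluates the patch count formula at a few resolutions and counts query-key pairs under full self-attention. The 2 × 2 merge factor is an illustrative assumption, not a value from the slides.

```python
def image_tokens(height, width, patch):
    """N = (H / P) * (W / P), using integer division."""
    return (height // patch) * (width // patch)


def attention_pairs(num_tokens):
    """Full self-attention scores every token against every token."""
    return num_tokens ** 2


def merged_tokens(num_tokens, merge=2):
    """Token count after merging each merge x merge neighbourhood into one."""
    return num_tokens // (merge * merge)


for side in (224, 448, 896):
    n = image_tokens(side, side, patch=14)
    print(f"{side}px: {n} tokens, {attention_pairs(n)} attention pairs, "
          f"{merged_tokens(n)} tokens after 2x2 merging")
```

Doubling the side length quadruples the token count and multiplies the attention pairs by sixteen, which is the pressure that dynamic resolution, patch merging, and windowed attention are meant to relieve.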
CLIP: Contrastive Language-Image Pre-training
Two parallel encoders: ViT (images) + text transformer (captions) → shared d-dimensional space
Trained on 400M image-caption pairs via InfoNCE contrastive loss
B × B similarity matrix per batch; loss pushes matching pairs together, non-matching apart
Temperature parameter τ controls distribution sharpness (learned; initialized to 0.07, so logits are scaled by 1/τ ≈ 14.3)
Large batch sizes essential (32,768): negative pairs per batch scale as B² − B
Creates shared embedding space enabling:
Zero-shot classification (embed class names as text, pick the class whose embedding is nearest to the image embedding)
Cross-modal retrieval
Linguistically-aligned visual features for VLMs
Limitation: 77-token text context; LLM2CLIP (2024) extends to thousands of tokens
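The contrastive objective on this slide can be sketched in plain Python. This is a toy, list-based version of a symmetric InfoNCE loss with a fixed temperature of 0.07; a real implementation would use batched tensor operations and a learned temperature, and the function names here are illustrative.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_row(logits, target):
    """Numerically stable -log softmax(logits)[target]."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch where pair i matches pair i."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    b = len(imgs)
    # B x B matrix of cosine similarities, scaled by 1 / temperature
    logits = [[sum(a * c for a, c in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # image -> text: row i should select column i
    i2t = sum(cross_entropy_row(logits[i], i) for i in range(b)) / b
    # text -> image: column j should select row j
    t2i = sum(cross_entropy_row([logits[i][j] for i in range(b)], j)
              for j in range(b)) / b
    return (i2t + t2i) / 2


def num_negative_pairs(batch_size):
    """Off-diagonal entries of the B x B similarity matrix."""
    return batch_size * batch_size - batch_size


aligned = [[1.0, 0.0], [0.0, 1.0]]
print(clip_loss(aligned, aligned))        # close to 0: matching pairs agree
print(clip_loss(aligned, aligned[::-1]))  # large: pairs are swapped
print(num_negative_pairs(32768))          # 1073709056
```

The diagonal of the similarity matrix holds the matching pairs; every off-diagonal entry acts as a negative, which is why the number of negatives grows quadratically with batch size.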
MLP Projection vs. Q-Former
MLP Projection (LLaVA-style):
2-layer MLP + GELU activation; ~20M parameters
1:1 patch-to-token mapping — preserves all spatial detail
Trade-off: all 256+ patch tokens passed to LLM, consuming context window
Q-Former (BLIP-2 / InstructBLIP-style):
Cross-attention with 32-64 learnable query tokens
Compresses 256+ visual tokens → 32-64 tokens
Queries selectively attend to most informative patches
Pre-trained in 2 stages: vision-language representation, then generative learning
Key trade-off: MLP preserves spatial fidelity (better for OCR, small objects); Q-Former saves context (enables more images per window)
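Both sides of this trade-off reduce to simple arithmetic. The sketch below estimates the parameter count of a two-layer MLP projector and the number of images that fit in a context window. It assumes the hidden width equals the LLM dimension, and the 4096-token context and 1024 reserved text tokens are hypothetical values chosen for illustration.

```python
def mlp_projector_params(d_vision, d_llm, bias=True):
    """Parameters of Linear(d_vision, d_llm) -> GELU -> Linear(d_llm, d_llm)."""
    layer1 = d_vision * d_llm + (d_llm if bias else 0)
    layer2 = d_llm * d_llm + (d_llm if bias else 0)
    return layer1 + layer2  # GELU has no parameters


def images_per_context(context_len, tokens_per_image, reserved_text_tokens):
    """How many images fit once some tokens are set aside for text."""
    available = context_len - reserved_text_tokens
    return max(0, available // tokens_per_image)


print(mlp_projector_params(1024, 4096))    # 20979712, roughly 21M
print(images_per_context(4096, 256, 1024)) # 12 images at 256 tokens each
print(images_per_context(4096, 32, 1024))  # 96 images at 32 tokens each
```

Under these assumptions, compressing each image to 32 query tokens fits eight times as many images in the same window, at the price of discarding per-patch detail.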
Four Fusion Strategies
Early Fusion (LLaVA, Qwen-VL): All visual + text tokens in single sequence from layer 1; richest cross-modal attention; cost O((V+T)²) per layer
Late Fusion: Separate streams until final layers; computationally efficient; weak cross-modal reasoning — only for coarse visual tasks
Cross-Attention Fusion (Flamingo): Gated cross-attention layers inserted at intervals (e.g., every 4th layer); Perceiver Resampler keeps visual token count fixed; scales well for video
Hybrid Fusion (reportedly Gemini, GPT-4o; architectural details are not public): Combines strategies, with early fusion for critical visual tokens and sparse cross-attention for the full set; balances cost and reasoning depth
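A rough way to compare the attention cost of early fusion and cross-attention fusion is to count query-key score pairs. The sketch below is a simplified cost model under stated assumptions: it ignores feed-forward cost and the vision encoder, treats V as the visual token count seen by the LLM, and uses illustrative sizes.

```python
def early_fusion_pairs(v, t, layers):
    """Every layer runs self-attention over all V + T tokens."""
    return layers * (v + t) ** 2


def cross_attention_pairs(v, t, layers, every=4):
    """Text-only self-attention in every layer, plus text-to-visual
    cross-attention in one layer out of `every`."""
    self_attn = layers * t ** 2
    cross_attn = (layers // every) * t * v
    return self_attn + cross_attn


v, t, layers = 256, 128, 32
print(early_fusion_pairs(v, t, layers))     # 4718592
print(cross_attention_pairs(v, t, layers))  # 786432
```

With these numbers the cross-attention variant scores about one sixth as many pairs, which illustrates why it scales better as V grows, for example with many video frames. The count says nothing about reasoning quality, which is the other half of the trade-off.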
Visual Grounding
Visual tokens are persistent anchors in the attention window throughout generation
Spatial Grounding: High attention on specific patch regions → bounding box reasoning, relative position, anomaly localization; fidelity depends on patch resolution
Semantic Grounding: Visual tokens shift output distribution toward visually-consistent language (distributional effect, not explicit reasoning)
Temporal Grounding (video): Frame-level attention aligns text to specific time segments
Prompting tip: Question-before-image yields 5-10% higher accuracy (attentional priming — model knows what to look for before seeing the image)
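Spatial grounding depends on mapping a highly attended patch back to pixel coordinates. The sketch below shows that mapping, assuming row-major patch order and an attention list that excludes any [CLS] token; the function names and the toy attention values are illustrative.

```python
def patch_to_box(index, image_width, patch):
    """Map a row-major patch index to its (left, top, right, bottom) pixel box."""
    grid_w = image_width // patch
    row, col = divmod(index, grid_w)
    left, top = col * patch, row * patch
    return (left, top, left + patch, top + patch)


def top_patch_box(attention, image_width, patch):
    """Return the index and pixel box of the patch with the highest attention."""
    best = max(range(len(attention)), key=lambda i: attention[i])
    return best, patch_to_box(best, image_width, patch)


# toy attention over a 16 x 16 grid (224 px image, 14 px patches)
attention = [0.0] * 256
attention[17] = 0.9  # second row, second column
print(top_patch_box(attention, image_width=224, patch=14))  # (17, (14, 14, 28, 28))
```

The box can be no finer than one patch, which is why grounding fidelity depends on patch resolution.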
Signal Routing (“Optic Nerve”): Projection layer bridges vision encoder → LLM token space
Planning / Memory (“Prefrontal Cortex”): LLM reasons over fused visual + text tokens to select actions
Execution: Actions produce new visual states → closed perception-action loop
Key insight: VLMs enable agents to navigate GUIs, verify visual outputs, process complex documents, and reason spatially — not just chatbots that can see