MSA 8700 — Module 7
| Category | Metrics | Key Question |
|---|---|---|
| Precision-Oriented | P@K, AP, MAP, R-Precision, AUC-PR | “Of the chunks we retrieved, how many were useful?” |
| Recall-Oriented | R@K, MR@K, Coverage Recall | “Of all useful chunks, how many did we find?” |
| Rank-Sensitive | DCG, NDCG, RR, MRR, ERR | “Are the best chunks ranked highest?” |
| RAG-Specific | Context Precision/Recall, Hit Rate, Chunk Attribution | “Does retrieval help generation?” |
| Evaluation Protocol | Nugget Recall, Metric@K Curves, Stratified Eval | “How do we run a rigorous evaluation?” |
In ranked retrieval evaluation, you need three things:
| Type | Values | Example | When to Use |
|---|---|---|---|
| Binary | relevant / not relevant | {"c1", "c3", "c5"} | Simple yes/no relevance |
| Graded | 0, 1, 2, … | {"c1": 2, "c2": 0, "c3": 1} | Chunk partially answers query |
The challenge in RAG: relevance is often partial — a chunk may contain part of an answer.
Binary judgments are often insufficient. Graded relevance captures this nuance.
RETRIEVED = ["c1", "c2", "c3", "c4", "c5",
             "c6", "c7", "c8", "c9", "c10"]
RELEVANT = {"c1", "c3", "c5", "c8"}
# Graded relevance: 0=not relevant, 1=partial, 2=highly relevant
GRADES = {
    "c1": 2, "c2": 0, "c3": 1, "c4": 0,
    "c5": 2, "c6": 0, "c7": 0, "c8": 1,
    "c9": 0, "c10": 0,
}
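| Rank | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Chunk | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 | c9 | c10 |
| Relevant | ✓ | | ✓ | | ✓ | | | ✓ | | |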
“Of the chunks we retrieved, how many were actually useful?”
These metrics penalize the retriever for returning irrelevant chunks.
The fraction of the top-K retrieved chunks that are relevant.
$$P@K = \frac{|\lbrace\text{relevant chunks in top-K}\rbrace|}{K}$$
Key properties:
retrieved = [c1, c2, c3, c4, c5] relevant = {c1, c3}
P@3 = 2/3 ≈ 0.667 (c1 and c3 are relevant; c2 is not)
P@5 = 2/5 = 0.400
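The P@K definition can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference implementation; the function name `precision_at_k` and the choice to divide by K even when fewer than K chunks are returned are assumptions that follow the formula as written.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunk IDs that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for chunk in retrieved[:k] if chunk in relevant)
    # Divide by k (not by the number of chunks actually returned),
    # matching the P@K formula.
    return hits / k


retrieved = ["c1", "c2", "c3", "c4", "c5"]
relevant = {"c1", "c3"}
print(round(precision_at_k(retrieved, relevant, 3), 3))  # 0.667
print(round(precision_at_k(retrieved, relevant, 5), 3))  # 0.4
```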
AP computes P@K at each position where a relevant chunk appears, then averages.
$$AP = \frac{1}{|R|} \sum_{k=1}^{n} P@k \cdot \mathbb{1}[\text{retrieved}[k] \in R]$$
Key properties:
retrieved = [c1, c2, c3, c4, c5] relevant = {c1, c3}
Hit at rank 1: P@1 = 1/1 = 1.000
Hit at rank 3: P@3 = 2/3 = 0.667
AP = (1.000 + 0.667) / 2 = 0.833
AP distinguishes between good and bad rankings:
Good ranking (relevant chunks at top):
[c1, c3, c5, c8, c2, c4, c6, c7, c9, c10]
AP = 1.000 (all relevant chunks ranked first)
Bad ranking (relevant chunks buried):
[c2, c4, c6, c7, c9, c1, c3, c5, c8, c10]
AP = (1/6 + 2/7 + 3/8 + 4/9) / 4 = 0.318 (relevant chunks only at ranks 6-9)
P@K alone would not distinguish rank quality — AP captures this.
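A short sketch of AP, following the formula with normalization by |R| (the total number of relevant chunks). The function name `average_precision` and the convention of returning 0.0 for an empty relevant set are assumptions for illustration.

```python
def average_precision(retrieved, relevant):
    """AP: sum of P@k at every rank k holding a relevant chunk, divided by |R|."""
    if not relevant:
        return 0.0  # assumed convention for queries with no relevant chunks
    hits = 0
    precision_sum = 0.0
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in relevant:
            hits += 1
            precision_sum += hits / rank  # P@rank at this hit
    return precision_sum / len(relevant)


relevant = {"c1", "c3", "c5", "c8"}
good = ["c1", "c3", "c5", "c8", "c2", "c4", "c6", "c7", "c9", "c10"]
bad = ["c2", "c4", "c6", "c7", "c9", "c1", "c3", "c5", "c8", "c10"]
print(round(average_precision(good, relevant), 3))  # 1.0
print(round(average_precision(bad, relevant), 3))   # 0.318
```

Both rankings contain the same four relevant chunks, so P@10 and R@10 are identical; only AP separates them.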
MAP is the mean of AP scores across multiple queries — the standard aggregate metric for comparing retrieval systems.
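$$MAP = \frac{1}{|Q|} \sum_{q=1}^{|Q|} AP(q)$$
import numpy as np

def mean_average_precision(queries):
    # queries: list of (retrieved, relevant) pairs, one per query.
    # average_precision(retrieved, relevant) is assumed to implement the AP formula for one query.
    ap_scores = [average_precision(r, rel) for r, rel in queries]
    return float(np.mean(ap_scores))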
Precision at exactly rank R, where R = |relevant|.
$$\text{R-Precision} = P@R \quad \text{where } R = |\text{relevant}|$$
Self-normalizes across queries with different numbers of relevant chunks.
relevant = {c1, c3, c5, c8} → R = 4
Top 4 retrieved: [c1, c2, c3, c4]
Hits in top-4: {c1, c3} → 2 of 4
R-Precision = 2/4 = 0.500
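R-Precision is P@K with K set to the size of the relevant set. A minimal sketch follows; the name `r_precision` is illustrative.

```python
def r_precision(retrieved, relevant):
    """Precision at rank R, where R is the number of relevant chunks."""
    r = len(relevant)
    if r == 0:
        return 0.0  # assumed convention when nothing is relevant
    hits = sum(1 for chunk in retrieved[:r] if chunk in relevant)
    return hits / r


RETRIEVED = [f"c{i}" for i in range(1, 11)]  # c1 .. c10
RELEVANT = {"c1", "c3", "c5", "c8"}
print(r_precision(RETRIEVED, RELEVANT))  # 0.5
```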
The interpolated P-R curve plots precision at standard recall breakpoints (0.0, 0.1, …, 1.0).
Interpolation rule: At each recall level $r$, precision is the max precision at any $r' \geq r$:
$$P_{\text{interp}}(r) = \max_{r' \geq r} P(r')$$
AUC-PR is the area under this curve (trapezoidal rule); it summarizes the entire curve in one number.
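The interpolation rule and the trapezoidal area can be sketched as below. This is one straightforward reading of the definition (11 recall breakpoints, precision set to 0 when no point reaches a recall level); the function names are hypothetical, and library implementations may differ in how they handle ties and endpoints.

```python
def interpolated_pr_curve(retrieved, relevant, levels=11):
    """Interpolated precision at evenly spaced recall breakpoints 0.0 .. 1.0."""
    # Raw (recall, precision) point after each rank.
    points = []
    hits = 0
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))
    recall_levels = [i / (levels - 1) for i in range(levels)]
    interp = []
    for r in recall_levels:
        # Max precision over all points whose recall is at least r.
        candidates = [p for rec, p in points if rec >= r]
        interp.append(max(candidates) if candidates else 0.0)
    return recall_levels, interp


def auc_trapezoid(xs, ys):
    """Area under a piecewise-linear curve via the trapezoidal rule."""
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2
               for i in range(len(xs) - 1))


RETRIEVED = [f"c{i}" for i in range(1, 11)]
RELEVANT = {"c1", "c3", "c5", "c8"}
recalls, precisions = interpolated_pr_curve(RETRIEVED, RELEVANT)
auc_pr = auc_trapezoid(recalls, precisions)
print([round(p, 3) for p in precisions])
print(round(auc_pr, 3))
```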
“Of all the relevant chunks that exist, how many did we actually retrieve?”
In RAG, recall is arguably more important than precision — a missed relevant chunk can mean a missed fact, leading to hallucination or incomplete answers.
The fraction of all relevant chunks that appear in the top-K results.
$$R@K = \frac{|\lbrace\text{relevant chunks in top-K}\rbrace|}{|\text{all relevant chunks}|}$$
Key difference from P@K:
relevant = {c1, c3, c5, c8} (4 relevant chunks)
R@1 = 1/4 = 0.250 (found c1)
R@3 = 2/4 = 0.500 (found c1, c3)
R@5 = 3/4 = 0.750 (found c1, c3, c5)
R@10 = 4/4 = 1.000 (found all)
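The same worked example as a small Python sketch. The name `recall_at_k` is illustrative; the only difference from a P@K implementation is the denominator.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0  # assumed convention when nothing is relevant
    hits = sum(1 for chunk in retrieved[:k] if chunk in relevant)
    # Denominator is |relevant|, not k.
    return hits / len(relevant)


RETRIEVED = [f"c{i}" for i in range(1, 11)]  # c1 .. c10
RELEVANT = {"c1", "c3", "c5", "c8"}
for k in (1, 3, 5, 10):
    print(k, recall_at_k(RETRIEVED, RELEVANT, k))  # 0.25, 0.5, 0.75, 1.0
```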
As K increases:
K P@K R@K
── ───── ─────
1 1.000 0.250 ← High precision, low recall
3 0.667 0.500
5 0.600 0.750
8 0.500 1.000
10 0.400 1.000 ← Low precision, high recall
The optimal K balances both — depends on your application.
A RAG-specific variant: does the retrieved set collectively cover all sub-questions or facets of a complex query?
$$\text{Coverage Recall} = \frac{|\lbrace\text{facets with} \geq 1 \text{ retrieved chunk}\rbrace|}{|\text{all facets}|}$$
Essential for multi-hop or multi-faceted queries.
Query: "Tell me about the Eiffel Tower"
Facet 1 (construction): {c1, c2} → covered by c1 ✓
Facet 2 (design): {c3, c5} → covered by c3 ✓
Facet 3 (renovation): {c8, c9} → c8 is at rank 8, not in top-5 ✗
Facet 4 (tourism): {c6, c7} → not in top-5 ✗
Coverage Recall @5 = 2/4 = 0.500 (rises to 4/4 = 1.000 at K=8)
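A sketch of Coverage Recall, assuming the ranking is the running example (c1 through c10 in order) and the facet-to-chunk mapping is the one listed for the Eiffel Tower query. The name `coverage_recall` and the dict layout are illustrative choices.

```python
def coverage_recall(retrieved, facets, k):
    """Fraction of facets with at least one supporting chunk in the top-k."""
    if not facets:
        return 0.0
    top_k = set(retrieved[:k])
    covered = sum(1 for chunks in facets.values() if top_k & set(chunks))
    return covered / len(facets)


RETRIEVED = [f"c{i}" for i in range(1, 11)]  # assumed ranking: c1 .. c10
FACETS = {
    "construction": {"c1", "c2"},
    "design": {"c3", "c5"},
    "renovation": {"c8", "c9"},
    "tourism": {"c6", "c7"},
}
print(coverage_recall(RETRIEVED, FACETS, 5))  # 0.5 (construction, design)
print(coverage_recall(RETRIEVED, FACETS, 8))  # 1.0 (c6 and c8 now included)
```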
These metrics account for where in the ranked list relevant chunks appear, and can handle graded relevance (not just binary).
Particularly well-suited to RAG — chunk relevance is rarely binary.
Uses graded relevance with a logarithmic discount to lower-ranked positions.
$$DCG@K = \sum_{i=1}^{K} \frac{\text{grade}_i}{\log_2(i + 1)}$$
The logarithmic discount:
| Rank | Divisor | Credit |
|---|---|---|
| 1 | $\log_2(2) = 1.000$ | 100% |
| 2 | $\log_2(3) = 1.585$ | 63% |
| 5 | $\log_2(6) = 2.585$ | 39% |
| 10 | $\log_2(11) = 3.459$ | 29% |
Higher-ranked relevant chunks contribute more to the score.
Retrieved: [c1, c2, c3, c4, c5]
Grades: 2 0 1 0 2
DCG@5 Calculation:
Rank 1: grade=2 → 2 / log₂(2) = 2.000
Rank 2: grade=0 → 0 / log₂(3) = 0.000
Rank 3: grade=1 → 1 / log₂(4) = 0.500
Rank 4: grade=0 → 0 / log₂(5) = 0.000
Rank 5: grade=2 → 2 / log₂(6) = 0.774
─────
DCG@5 = 3.274
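The calculation above as a short sketch, using the linear-gain form of DCG given in the formula (grade divided by log2(rank + 1)). The function name `dcg_at_k` is illustrative, and treating unjudged chunks as grade 0 is an assumption.

```python
import math


def dcg_at_k(retrieved, grades, k):
    """DCG@k with linear gain: sum of grade_i / log2(i + 1) over the top-k."""
    return sum(
        grades.get(chunk, 0) / math.log2(rank + 1)  # unjudged chunks count as 0
        for rank, chunk in enumerate(retrieved[:k], start=1)
    )


RETRIEVED = ["c1", "c2", "c3", "c4", "c5"]
GRADES = {"c1": 2, "c2": 0, "c3": 1, "c4": 0, "c5": 2}
print(round(dcg_at_k(RETRIEVED, GRADES, 5), 3))  # 3.274
```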
The gold standard for ranked evaluation. Normalizes DCG by the Ideal DCG (IDCG).
$$NDCG@K = \frac{DCG@K}{IDCG@K}$$
Actual DCG@5 = 3.274 (our ranking)
Ideal ranking = [c1, c5, c3, c8, ...] (sorted by grade)
IDCG@5 = 2/log₂(2) + 2/log₂(3) + 1/log₂(4) + 1/log₂(5) = 4.193 (best possible DCG)
NDCG@5 = 3.274 / 4.193 = 0.781
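A sketch of NDCG that builds the ideal ranking by sorting all judged grades in descending order and truncating to K. Function names are illustrative; the gain is linear, as in the DCG formula used in this module.

```python
import math


def dcg(grade_list):
    """DCG of a list of grades already in rank order."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(grade_list, start=1))


def ndcg_at_k(retrieved, grades, k):
    """DCG@k of the actual ranking divided by DCG@k of the ideal ranking."""
    actual = [grades.get(chunk, 0) for chunk in retrieved[:k]]
    ideal = sorted(grades.values(), reverse=True)[:k]
    idcg = dcg(ideal)
    return dcg(actual) / idcg if idcg > 0 else 0.0


RETRIEVED = [f"c{i}" for i in range(1, 11)]
GRADES = {"c1": 2, "c2": 0, "c3": 1, "c4": 0, "c5": 2,
          "c6": 0, "c7": 0, "c8": 1, "c9": 0, "c10": 0}
print(round(dcg(sorted(GRADES.values(), reverse=True)[:5]), 3))  # 4.193
print(round(ndcg_at_k(RETRIEVED, GRADES, 5), 3))                 # 0.781
```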
RR = $1 / \text{rank}$ of the first relevant chunk found.
$$RR = \frac{1}{\text{rank of first relevant chunk}}$$
MRR = mean of RR across multiple queries.
$$MRR = \frac{1}{|Q|} \sum_{q=1}^{|Q|} \frac{1}{\text{rank}_q}$$
Answers: “How quickly does the system surface at least one relevant chunk?”
Limitation: Only considers the first hit — ignores the rest of the ranked list.
Retrieved = [c1, c2, c3, ...] c1 is relevant
RR = 1/1 = 1.000
Retrieved = [c2, c4, c1, ...] c1 is first relevant (rank 3)
RR = 1/3 = 0.333
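Both quantities in a minimal sketch. Returning 0.0 when no relevant chunk is retrieved is a common convention and is assumed here; the function names are illustrative.

```python
def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant chunk, or 0.0 if none is retrieved."""
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """Mean RR over a list of (retrieved, relevant) pairs."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


queries = [
    (["c1", "c2", "c3"], {"c1"}),  # first hit at rank 1
    (["c2", "c4", "c1"], {"c1"}),  # first hit at rank 3
]
print(round(mean_reciprocal_rank(queries), 3))  # 0.667
```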
Extends MRR by modeling the probability a user is satisfied at each rank, with graded relevance.
$$ERR = \sum_{r=1}^{n} \frac{1}{r} \cdot R_r \cdot \prod_{i=1}^{r-1} (1 - R_i)$$
where $R_i = \text{grade}_i / \text{max\_grade}$
Cascading effect: If a highly relevant chunk is found early, the user is less likely to keep scanning.
Rank 1: grade=2 → R₁=1.0 P(satisfied)=1.0 → 1.0 × 1/1 = 1.000
Rank 2: grade=0 → R₂=0.0 P(reach rank 2)=0.0
ERR = 1.000 (user satisfied immediately at rank 1)
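The cascade can be sketched with a running "probability of reaching this rank" term. This follows the simplified satisfaction probability defined in this module (grade divided by max grade); the function name and the second example ranking are hypothetical.

```python
def expected_reciprocal_rank(retrieved, grades, max_grade=2):
    """ERR with satisfaction probability R_i = grade_i / max_grade."""
    err = 0.0
    p_reach = 1.0  # probability the user is still scanning at this rank
    for rank, chunk in enumerate(retrieved, start=1):
        r = grades.get(chunk, 0) / max_grade
        err += p_reach * r / rank
        p_reach *= 1.0 - r  # user continues only if not satisfied here
    return err


GRADES = {"c1": 2, "c2": 0, "c3": 1, "c5": 2, "c8": 1}
# Highly relevant chunk first: the user stops at rank 1.
print(expected_reciprocal_rank(["c1", "c2", "c3"], GRADES))  # 1.0
# Partially relevant chunk first: 0.5/1 + (0.5 * 1.0)/2
print(expected_reciprocal_rank(["c3", "c1", "c2"], GRADES))  # 0.75
```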
Metrics developed specifically for the RAG paradigm — bridging retrieval quality with generation quality.
Measures whether relevant chunks appear earlier in the ranked list.
Computed as Average Precision — equivalent to MAP for a single query. Popularized by the RAGAS framework.
$$\text{Context Precision} = AP(\text{retrieved}, \text{relevant})$$
Good order: [c1, c3, c5, c8, c2] → Context Precision = 1.000
Bad order: [c2, c4, c1, c3, c5] → Context Precision = (1/3 + 2/4 + 3/5) / 4 = 0.358
Relevant chunks ranked lower (and c8 missed) → lower score.
How much of the ground-truth answer can be attributed to retrieved context?
Uses an LLM judge to check whether each claim in a reference answer is supported.
$$\text{Context Recall} = \frac{|\lbrace\text{claims supported by context}\rbrace|}{|\text{all reference claims}|}$$
Reference claims:
1. "Eiffel Tower is 330m tall" → supported ✓
2. "Built in 1889" → supported ✓
3. "Designed by Gustave Eiffel" → NOT supported ✗
4. "Most visited monument in world" → supported ✓
Context Recall = 3/4 = 0.750
This is semantic, generation-aware recall — not pure retrieval.
The simplest retrieval metric: what fraction of queries have at least one relevant chunk in the top-K?
$$\text{Hit Rate@K} = \frac{|\lbrace\text{queries with} \geq 1 \text{ hit in top-K}\rbrace|}{|\text{all queries}|}$$
Coarse but practical, especially when you only need one good chunk to answer a question.
Query 1: top-3 has relevant chunk → hit ✓
Query 2: top-3 has relevant chunk → hit ✓
Query 3: top-3 has NO relevant → miss ✗
Query 4: top-3 has relevant chunk → hit ✓
Hit Rate @3 = 3/4 = 0.750
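A minimal sketch of Hit Rate@K over a query set. The four queries and their chunk IDs below are made-up placeholders chosen to reproduce the 3-of-4 pattern; the function name is illustrative.

```python
def hit_rate_at_k(queries, k):
    """Fraction of queries with at least one relevant chunk in the top-k."""
    if not queries:
        return 0.0
    hits = sum(
        1 for retrieved, relevant in queries
        if any(chunk in relevant for chunk in retrieved[:k])
    )
    return hits / len(queries)


# Hypothetical (retrieved, relevant) pairs, one per query.
queries = [
    (["a1", "a2", "a3", "a4"], {"a1"}),
    (["b1", "b2", "b3", "b4"], {"b3"}),
    (["x1", "x2", "x3", "x4"], {"x4"}),  # first hit at rank 4: miss at K=3
    (["d1", "d2", "d3", "d4"], {"d2"}),
]
print(hit_rate_at_k(queries, 3))  # 0.75
print(hit_rate_at_k(queries, 4))  # 1.0
```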
What fraction of retrieved chunks are actually cited or used by the generator?
$$\text{Attribution Rate} = \frac{|\lbrace\text{retrieved chunks cited in answer}\rbrace|}{|\text{retrieved chunks}|}$$
Low attribution → the retriever fetches technically relevant but practically ignored chunks, wasting context window space.
Retrieved: [c1, c2, c3, c4, c5, c6, c7, c8, c9, c10]
Cited: {c1, c3, c5}
Attribution Rate = 3/10 = 0.300
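Given the set of chunk IDs the generator cited, the rate is a one-line ratio. How the cited set is obtained (citation markers, an LLM judge, string overlap) is outside this sketch and is assumed to happen upstream; the function name is illustrative.

```python
def attribution_rate(retrieved, cited):
    """Fraction of retrieved chunks that the generated answer actually cites."""
    if not retrieved:
        return 0.0
    used = sum(1 for chunk in retrieved if chunk in cited)
    return used / len(retrieved)


RETRIEVED = [f"c{i}" for i in range(1, 11)]  # c1 .. c10
CITED = {"c1", "c3", "c5"}                   # assumed to be extracted upstream
print(attribution_rate(RETRIEVED, CITED))  # 0.3
```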
This bridges retrieval quality with generation quality.
Beyond individual metrics, rigorous RAG evaluation requires careful protocol design.
Decompose a reference answer into atomic facts (“nuggets”) and check which are covered by retrieved chunks.
More fine-grained than document-level judgments — handles multi-faceted queries better.
Nuggets for "Tell me about the Eiffel Tower":
1. "tower height" → supported by {c1, c2} ✓
2. "construction year" → supported by {c3} ✓
3. "architect name" → supported by {c8} ✓
4. "visitor statistics" → supported by {c11} ✗
Nugget Recall = 3/4 = 0.750 (c11 not retrieved)
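A sketch of Nugget Recall, assuming each nugget has already been mapped to the set of chunks that support it (in practice this mapping usually comes from human annotation or an LLM judge). Names and data layout are illustrative.

```python
def nugget_recall(nuggets, retrieved):
    """Fraction of nuggets supported by at least one retrieved chunk."""
    if not nuggets:
        return 0.0
    retrieved_set = set(retrieved)
    covered = sum(1 for support in nuggets.values() if retrieved_set & support)
    return covered / len(nuggets)


RETRIEVED = [f"c{i}" for i in range(1, 11)]  # c1 .. c10; c11 is never retrieved
NUGGETS = {
    "tower height": {"c1", "c2"},
    "construction year": {"c3"},
    "architect name": {"c8"},
    "visitor statistics": {"c11"},
}
print(nugget_recall(NUGGETS, RETRIEVED))  # 0.75
```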
Plot your metric as K varies from 1 to your context window limit.
The curve shape reveals:
K P@K R@K NDCG@K
── ───── ───── ──────
1 1.000 0.250 1.000
3 0.667 0.500 0.665
5 0.600 0.750 0.781 ← sweet spot?
8 0.500 1.000 0.856
10 0.400 1.000 0.856 ← diminishing returns
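A Metric@K table for the running example can be generated with a small loop. This sketch assumes the ranking c1 through c10, the grades from the running example, and linear-gain NDCG; the helper names are hypothetical. The rows can be passed to any plotting library to draw the curves.

```python
import math

RETRIEVED = [f"c{i}" for i in range(1, 11)]
GRADES = {"c1": 2, "c3": 1, "c5": 2, "c8": 1}  # chunks not listed have grade 0
RELEVANT = {c for c, g in GRADES.items() if g > 0}


def dcg(grade_list):
    """DCG of grades given in rank order."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(grade_list, start=1))


def metrics_at_k(k):
    """Return (P@k, R@k, NDCG@k) for the running example."""
    top = RETRIEVED[:k]
    hits = sum(1 for c in top if c in RELEVANT)
    ideal = sorted(GRADES.values(), reverse=True)[:k]
    ndcg = dcg([GRADES.get(c, 0) for c in top]) / dcg(ideal)
    return hits / k, hits / len(RELEVANT), ndcg


print(" K  P@K    R@K    NDCG@K")
for k in (1, 3, 5, 8, 10):
    p, r, n = metrics_at_k(k)
    print(f"{k:>2}  {p:.3f}  {r:.3f}  {n:.3f}")
```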
Break your query set into categories and report metrics per stratum.
A retriever might excel at factoid queries but fail on multi-hop ones — aggregate metrics can hide this.
Common strata:
Stratum MAP MRR MR@5 HitRate@5
───────── ───── ───── ───── ─────────
factoid 0.875 1.000 0.875 1.000
multi_hop 0.655 0.750 0.688 1.000
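Per-stratum reporting amounts to grouping queries by a label before averaging. The sketch below uses Recall@K as the metric and a tiny made-up query set (not the data behind the table above) to show how an aggregate can mask a weak stratum; all names are illustrative.

```python
from collections import defaultdict
from statistics import mean


def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant chunks found in the top-k."""
    hits = sum(1 for chunk in retrieved[:k] if chunk in relevant)
    return hits / len(relevant) if relevant else 0.0


def stratified_mean_recall(queries, k):
    """Mean Recall@k per stratum, plus the overall mean under the key 'ALL'."""
    by_stratum = defaultdict(list)
    for q in queries:
        score = recall_at_k(q["retrieved"], q["relevant"], k)
        by_stratum[q["stratum"]].append(score)
        by_stratum["ALL"].append(score)
    return {name: mean(scores) for name, scores in by_stratum.items()}


# Hypothetical query set for illustration only.
queries = [
    {"stratum": "factoid", "retrieved": ["a1", "a2"], "relevant": {"a1"}},
    {"stratum": "factoid", "retrieved": ["b1", "b2"], "relevant": {"b2"}},
    {"stratum": "multi_hop", "retrieved": ["m1", "m2"], "relevant": {"m1", "m9"}},
    {"stratum": "multi_hop", "retrieved": ["n1", "n2"], "relevant": {"n8", "n9"}},
]
report = stratified_mean_recall(queries, k=2)
print(report["factoid"], report["multi_hop"], report["ALL"])  # 1.0 0.25 0.625
```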
Retrieved: [c1, c2, c3, c4, c5, c6, c7, c8, c9, c10]
Relevant: {c1, c3, c5, c8}
Grades: c1:2 c2:0 c3:1 c4:0 c5:2 c6:0 c7:0 c8:1 c9:0 c10:0
PRECISION-ORIENTED RECALL-ORIENTED
P@1 = 1.000 R@1 = 0.250
P@3 = 0.667 R@3 = 0.500
P@5 = 0.600 R@5 = 0.750
P@10 = 0.400 R@10 = 1.000
AP = 0.692 Coverage@5 = 0.500
R-Precision = 0.500
RANK-SENSITIVE RAG-SPECIFIC
DCG@5 = 3.274 Context Precision = 0.692
NDCG@5 = 0.781 Hit Rate @3 = 0.750
RR = 1.000 Hit Rate @5 = 1.000
ERR = 1.000 Attribution Rate = 0.300
| Use Case | Recommended Metrics | Rationale |
|---|---|---|
| Missing chunks is catastrophic (medical, legal) | Recall@K, Context Recall | Must find all relevant evidence |
| Tight context window (few chunks) | NDCG@K, P@K with small K | Every slot must count |
| Conversational assistant | MRR | Just need one relevant chunk fast |
| General benchmarking (graded) | NDCG | Most principled overall choice |
| Multi-faceted queries | Coverage Recall, Nugget Recall | Must cover all aspects |
| Debug retriever vs. generator | Chunk Attribution Rate | Is the retriever helping? |
| Compare retrieval systems | MAP, Stratified Evaluation | Fair aggregate comparison |
Is missing a relevant chunk catastrophic?
├── YES → Use Recall@K + Context Recall
└── NO
│
Is the relevance binary or graded?
├── BINARY → Use MAP + MRR
└── GRADED → Use NDCG@K + ERR
│
Do you need per-category analysis?
├── YES → Add Stratified Evaluation
└── NO → Report aggregate + P/R curves
Always also consider:
• Metric@K curves to find optimal K
• Chunk Attribution Rate to debug retriever-generator gap
Never rely on a single metric — report at least one precision, one recall, and one rank-sensitive metric
Always stratify by query type to avoid hiding failures behind strong aggregate numbers
Plot Metric@K curves to find the optimal number of chunks to retrieve
Use NDCG when possible — it is the most principled choice for graded relevance
Retrieval evaluation assesses the text-chunk retrieval and ranking of the RAG system.

