RAG Evaluation

This session explores the intricacies of prompt engineering for large language models (LLMs), emphasizing its importance in optimizing LLM performance for specific tasks. Unlike traditional machine learning models, evaluating LLMs involves subjective metrics like context relevance, answer faithfulness, and prompt relevance.

The session highlights frameworks like ARES, which uses smaller LLMs as evaluators, and LLMA, which focuses on instruction-following capabilities. Techniques such as chain-of-thought prompting, few-shot prompting, and retrieval-augmented generation are presented as effective strategies to guide LLMs in reasoning, learning from examples, and leveraging external information.

The session also introduces public datasets like KILT and SuperGLUE, along with task-specific datasets such as Natural Questions, HotpotQA, and FEVER. These datasets provide standardized benchmarks to test and refine prompt engineering approaches. The hosts emphasize the need for creativity and experimentation in this evolving field, blending technical expertise with an intuitive understanding of language and machine learning.

Required Reading and Listening

Listen to the podcast (transcription):

Replacing Vibe Checks with Rigorous Evaluation

Transcript

Speaker	Text
Alex	Welcome back to the deep dive. I want to, uh, start today’s session by taking a quick mental trip back in time. Let’s go back to, say, 2019. You’re a data scientist. You’re building a classifier, maybe a churn prediction model or a fraud detector. You finish training. You run your test set, and you look at your console. What do you see?
Sam	Well, you see a confusion matrix, you see an F1 score, you see precision and recall, maybe an area-under-the-curve metric. It’s clean. It’s deterministic.
Alex	Exactly. It was comforting, wasn’t it?
Sam	Very comforting. You could sleep at night.
Alex	You knew if your model was drifting. You knew if the new deployment was better than the last one, because the F1 score went from, you know, 0.82 to 0.84.
Sam	The math told you exactly where you stood.
Alex	But today, today, our audience is building RAG pipelines. They’re fine-tuning Llama 3. They’re deploying agentic workflows that generate 1,000-word reports. And when they ask, when their product manager asks, “Is this model better?”, what’s the answer?
Sam	The answer is usually, uh, “Well, I looked at 5 examples and they seem pretty good.” The dreaded vibe check. It is the absolute crisis of our time in machine learning right now. We have moved from a world of hard metrics to a world of soft intuition. And for the engineers listening, the people who are used to rigorous CI/CD pipelines, this is a nightmare.
Alex	It really is. You cannot scale a vibe check.
Sam	No, you can’t.
Alex	You cannot automate a vibe check, and you certainly cannot explain to your VP of engineering that the reason you deployed a hallucinating model to production is because, uh, it felt right on Tuesday. Right?
Sam	“The vibes were immaculate” is not a valid defense in a postmortem.
Alex	So that is our mission today. We are going to completely dismantle the vibe check.
Sam	I love it. Let’s tear it down.
Alex	We are going to explore how to apply rigorous data science, statistics, and automation to the evaluation of generative AI. We’re talking about moving from “it looks good” to “we are 95% confident in this score within a specific confidence interval.”
Sam	We’re basically talking about evaluation engineering as a distinct discipline. We aren’t just summarizing papers today. We are effectively building a roadmap for a modern, robust evaluation stack.
Alex	We have 3 seminal sources on the table today, and we’re going to tackle them in 3 stages. We’re going to start at level one, which is evaluating the LLM itself. This comes from the paper Judging LLM-as-a-Judge,
Sam	which is all about the biases inherent when we ask models to grade other models.
Alex	Right. Then we move to level two, evaluating RAG systems with the ARES framework. This is the heavy hitter on statistics: how to evaluate context retrieval and faithfulness without, you know, hiring an army of human annotators.
Sam	The labeling bottleneck, a huge pain point for everyone right now.
Alex	Huge. And finally, level 3: evaluating agents with the PROXYQA framework. How do you grade long-form content where there is no gold standard answer?
Sam	It’s a natural progression, right? From the atomic unit of a single chat response up to the complexity of vector retrieval, all the way to the open-ended multi-step nature of agents.
Alex	Let’s get right into it. Level one, the foundation: Judging LLM-as-a-Judge. This research comes out of LMSYS Org, the team behind Chatbot Arena.
Sam	Right. For the data scientists listening, you likely know Chatbot Arena intimately. It’s that crowdsourced leaderboard where you enter a prompt, two anonymous models generate answers side by side, and you just vote on which is better.
Alex	Model A, Model B, or a tie.
Sam	Exactly. And it uses an Elo rating system, very similar to what they use in chess rankings, to stack rank the models. It is currently the gold standard for human preference alignment.
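For readers who have not worked with Elo, a minimal sketch of the textbook update rule follows. The K-factor of 32 and the function names are assumptions chosen for illustration; the leaderboard's actual rating computation may differ from this simple per-vote update.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one pairwise vote.

    score_a is 1.0 if A won, 0.0 if B won, and 0.5 for a tie.
    k (assumed value) controls how far a single vote moves the ratings.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two equally rated models; a voter prefers model A.
print(update_elo(1000.0, 1000.0, score_a=1.0))
```

An upset (a low-rated model beating a high-rated one) moves the ratings more than an expected win, which is what lets a few thousand votes produce a stable ranking.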
Alex	But here is the massive problem with human preference. It is slow, and it is insanely expensive.
Sam	Prohibitively expensive if you are doing rapid iteration, right?
Alex	If I am a data scientist tweaking a hyperparameter in my local RAG pipeline or testing a new few-shot prompt technique, I can’t submit my model to Chatbot Arena and wait 2 weeks for crowdsourced results. I need feedback in 5 minutes.
Sam	You need a metric you can put in a loop. So the industry pivoted hard to this concept of LLM-as-a-judge. The hypothesis was pretty straightforward. If a frontier model like GPT-4 is already highly aligned with human preferences, why don’t we just ask GPT-4 to act as the human?
Alex	We feed it the user prompt, answer A and answer B, and basically say, Hey, pick a winner.
Sam	Exactly. It sounds like a bit of an infinite loop using the model to grade the model, but the immediate question the community had was, does it actually work?
Alex	And the short answer from the paper is yes, but with massive asterisks.
Sam	Giant asterisks. The researchers found that a strong judge like GPT-4 matches human preference judgments over 80% of the time.
Alex	Now I know some listeners might scoff at 80%. They might think that’s a B minus, that’s 20% error.
Sam	But you have to contextualize that number. You have to realize that humans only agree with each other about 80%, 82% of the time on these subjective tasks.
Alex	So GPT-4 is effectively hitting the noise ceiling of human agreement.
Sam	Correct. If you and I both look at two haikus written by different models, we might disagree on which is better. Human preference is noisy. GPT-4 is as good as a generic human labeler at capturing the average preference.
Alex	However, and this is the crucial difference, unlike a human, a model has systematic algorithmic biases. If a human is tired, they might make a random error. If a model has a bias, it makes the exact same error a million times in a row.
Sam	And if you are building an automated evaluation pipeline, you need to account for these systematic biases or your metrics will be absolute garbage. You will be optimizing your system for the wrong things.
Alex	Let’s unpack these specific biases, because this is the real gotcha for anyone writing an eval script right now. The first one they identified is position bias.
Sam	Oh, this is a fascinating glitch. The research showed that if you present two answers to an LLM judge, answer A and answer B, the judge has a statistically significant preference for whichever answer is presented first.
Alex	Just because it showed up first in the context window?
Sam	Literally just because it was first. The older models like GPT-3.5 were notorious for this. It’s almost like cognitive laziness. The model reads the first option, decides it’s good enough, and then the mathematical bar for the second option becomes impossibly high.
Alex	It anchors on the first output. So if I’m running a regression test on my laptop, and I always pass my new fine-tuned model’s output as option A against my baseline as option B, I am artificially inflating my success rate.
Sam	You are grading on a rigged curve. Your new model has a built-in advantage, and the scary part is the judge model will give you a highly coherent, completely fabricated reason for its choice.
Alex	It rationalizes the bias.
Sam	Exactly. It will say, “I preferred option A because it was more concise and had better structure,” but if you run the exact same prompt and just swap the order, showing B first, it will confidently say, “I preferred option B because it was more concise and had better structure.”
Alex	It’s hallucinating the justification to fit the position bias. That is wild.
Sam	It’s a huge problem. So what is the engineering fix?
Alex	It’s simple, but it effectively doubles your inference bill. You have to enforce symmetry. You run the evaluation twice for every single judgment.
Sam	Right. First, A versus B. Then B versus A.
Alex	And you only count it as a win if the judge is consistent.
Sam	Ideally, yes. If the judge picks your new model when it’s in slot A and also picks your new model when it’s in slot B, that’s a true win. If it just picks whichever one is in slot A both times, that is a contradiction. It’s noise. You either treat it as a tie or you discard the sample entirely.
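The swap check described here can be sketched in a few lines. This is an illustration only: the `Judge` signature (a callable returning "first", "second", or "tie") is a hypothetical stand-in for whatever wrapper you have around your judge model, and inconsistent verdicts are mapped to a tie rather than discarded.

```python
from typing import Callable

# Hypothetical judge interface: (prompt, first_answer, second_answer) -> verdict,
# where the verdict is "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def swap_checked_verdict(judge: Judge, prompt: str, new: str, baseline: str) -> str:
    """Judge both orderings and return "new", "baseline", or "tie"."""
    forward = judge(prompt, new, baseline)    # new model shown in slot A
    backward = judge(prompt, baseline, new)   # new model shown in slot B

    if forward == "first" and backward == "second":
        return "new"        # new model wins regardless of position
    if forward == "second" and backward == "first":
        return "baseline"   # baseline wins regardless of position
    return "tie"            # contradiction or explicit tie: treat as noise


def always_first(prompt: str, first: str, second: str) -> str:
    """Toy judge with pure position bias, used to show the failure mode."""
    return "first"


print(swap_checked_verdict(always_first, "question", "answer A", "answer B"))
```

A judge that only ever prefers slot A contradicts itself across the two orderings, so its vote never counts as a win for either model.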
Alex	If you aren’t doing position swapping in your LLM-as-a-judge pipeline right now, your error bars are massive, and you don’t even know it. You’re flying blind. OK, that’s manageable. But the next bias is the one that really keeps me up at night regarding the future of model training: verbosity bias.
Sam	The length bias. This one is insidious because it aligns with a very real human flaw. We humans tend to think longer answers are smarter answers. The researchers tested this with what they called a repetitive list attack.
Alex	Walk us through how that attack works.
Sam	So they took a high-quality, concise answer from a model. Let’s say the prompt was about the causes of the French Revolution, and the model gave 5 clear, distinct bullet points.
Alex	A perfect, dense answer.
Sam	Right. Then they attacked it by manually rewriting it. They didn’t add any new historical facts. They just made the sentences much longer. They repeated the introduction and the conclusion, and they added a bunch of transitional fluff.
Alex	Just relentlessly padding the word count.
Sam	Pure filler. And the LLM judges, especially the mid-tier models, consistently rated the verbose repetitive answer as objectively better than the concise one.
Alex	Think about the incentives there. If you are using LLM-as-a-judge to fine-tune your model via reinforcement learning, you are mathematically optimizing your model to be a windbag.
Sam	You are literally training for yapping. You aren’t optimizing for information density. You’re optimizing for token generation, and we see this in the wild all the time.
Alex	Oh, absolutely. How often do you use a chatbot to ask a simple coding question, and it gives you 4 paragraphs of intro about what Python is before giving you the one-line fix?
Sam	“Certainly! Python is a versatile programming language.” Yes, we know. That behavior is likely a direct artifact of verbosity bias in the RLHF and evaluation stages.
Alex	So how do we fix that as evaluation engineers? Do we just tell the judge, “Don’t like long answers”?
Sam	You have to be very specific and aggressive in your system prompt. You can’t just say “be critical.” You have to explicitly instruct the judge: penalize repetition, prefer conciseness, do not favor length alone.
Alex	Does that actually work?
Sam	It mitigates it, but it’s a stubborn bias. GPT-4 is more resistant than others, which is why it’s the standard judge right now, but it’s definitely not immune.
Alex	Then there is self-enhancement bias. This one feels almost like an AI mirror test.
Sam	It really does. The data suggests that models have a slight but measurable preference for outputs that mimic their own training data or their own stylistic quirks.
Alex	GPT-4 slightly prefers GPT-4 outputs. Claude slightly prefers Claude outputs. It creates a massive echo chamber.
Sam	It creates a monoculture of style. If every data science team uses GPT-4 to grade their open source models. Every open source model, every specialized medical model, every legal AI is being pushed to sound exactly like GPT-4.
Alex	We might be actively penalizing valid, correct answers just because they use a slightly different tone or structure than what OpenAI baked into their alignment process.
Sam	That is a profound point. We are subtly standardizing the definition of intelligence to mean sounding like ChatGPT.
Alex	And for data scientists listening, the tactical takeaway here is that you need diversity in your judges. Don’t just use one API for all your evaluation.
Sam	Use an ensemble. Route some evals to Claude, some to Gemini, some to a local Llama 3 instance, or, as we’ll see in the next paper, use a smaller model fine-tuned entirely on your specific domain data.
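One simple way to combine several judges is a strict majority vote, sketched below. The verdict labels and the rule that anything short of a strict majority counts as a tie are assumptions made for illustration; the sources discussed here do not prescribe an aggregation rule.

```python
from collections import Counter


def ensemble_verdict(verdicts: list[str]) -> str:
    """Combine per-judge verdicts ("new", "baseline", "tie") by strict majority.

    If no label is chosen by more than half of the judges, the sample is
    reported as a "tie" so that a split panel never counts as a win.
    """
    if not verdicts:
        return "tie"
    label, votes = Counter(verdicts).most_common(1)[0]
    return label if votes > len(verdicts) / 2 else "tie"


# Verdicts from three different (hypothetical) judge backends on one sample.
print(ensemble_verdict(["new", "new", "baseline"]))
```

Because each judge carries its own stylistic preference, requiring agreement across vendors dilutes the self-enhancement effect of any single one.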
Alex	One last limitation on the judging aspect before we move into RAG: reasoning and logic.
Sam	This is critical for anyone building math bots or code assistants or multi-step logic agents. The paper found that LLMs are decent at grading creative writing or summarization. But they struggle immensely to grade logic errors if they don’t have external help.
Alex	This is the confident hallucination problem.
Sam	Right. Let’s say a student model outputs a math proof. The proof looks like a proof. It has the right formatting. The equations are properly indented. It ends with a confident “therefore, x equals 5.” But it makes a subtle arithmetic error in step 3.
Alex	The LLM judge often misses it.
Sam	The judge sees the vibe of a correct proof and gives it a pass. It rates it a 10 out of 10.
Alex	It’s like a teaching assistant who is grading 500 papers at 2 a.m. They’re too tired to check the actual math, so they just check if the handwriting is neat and the final answer is boxed.
Sam	Precisely. The fix here is what the researchers call reference-guided grading. You cannot ask the judge to solve the problem and grade the student at the exact same time.
Alex	It overwhelms the attention mechanism.
Sam	Yes, you must provide the ground truth answer, the golden reference, in the system prompt. Or, if you don’t have a golden reference, you use chain-of-thought grading. You force the judge to explicitly solve the problem step by step in its own scratch pad before it is allowed to look at the student’s answer.
Alex	You force it to do the hard work first. You ground the judge in reality before you let it issue a verdict.
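A rough sketch of how the two grading modes might be assembled into a judge prompt follows. The wording of the instructions, the `VERDICT:` output convention, and the function name are all illustrative assumptions, not the prompts used in the paper.

```python
from typing import Optional


def build_grading_prompt(question: str, student_answer: str,
                         reference: Optional[str] = None) -> str:
    """Build a judge prompt using a reference if available, else chain of thought."""
    parts = [f"Question:\n{question}"]
    if reference is not None:
        # Reference-guided grading: the judge compares, it does not solve.
        parts.append(f"Reference answer (treat as ground truth):\n{reference}")
        parts.append("Compare the student answer to the reference answer.")
    else:
        # Chain-of-thought grading: the judge must solve before it grades.
        parts.append("First solve the problem yourself, step by step. "
                     "Only then evaluate the student answer against your solution.")
    parts.append(f"Student answer:\n{student_answer}")
    parts.append("Finish with one line: 'VERDICT: correct' or 'VERDICT: incorrect'.")
    return "\n\n".join(parts)


print(build_grading_prompt("What is 17 * 3?", "x equals 5", reference="51"))
```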
Sam	Exactly. So that is the baseline. That is just evaluating a raw language model against another language model. But most of our listeners are dealing with something vastly more complex in production.
Alex	They’re building RAG systems, retrieval-augmented generation, and that brings us to level 2.
Sam	RAG completely changes the evaluation game because it introduces multi-component failure.
Alex	Right, let’s break that down. In a simple chatbot, if the answer is wrong, the LLM is wrong. It hallucinated. But in a RAG pipeline, if the answer is wrong, it could be the LLM hallucinating, or it could be the vector database retrieving a recipe for lasagna when the user asks for a specific legal precedent.
Sam	And distinguishing between those two distinct points of failure is the fundamental job of the data scientist. This brings us to the second paper, ARES.
Alex	ARES is an Automated Evaluation Framework for Retrieval-Augmented Generation Systems. This is probably the most rigorous paper of the bunch regarding pure statistical methodology.
Sam	It is a master class in evaluation data science. The core problem ARES tackles is what we call the labeling bottleneck.
Alex	I’m sure every listener has felt the pain of this bottleneck. You build a RAG system for a niche domain, let’s say, analyzing proprietary semiconductor manufacturing logs. To evaluate that system properly, you need a senior semiconductor engineer to sit down and read thousands of query-document-answer triples.
Sam	They have to look at the user’s question, read the log the database retrieved, read the AI’s generated summary, and manually score it. Is this log actually relevant to the query? Is the summary completely faithful to the log?
Alex	And that senior engineer costs $300 an hour and has significantly better things to do.
Sam	So you end up with maybe 50 labeled examples if you’re lucky, and 50 examples is nowhere near enough for statistical significance when you’re tweaking embedding models or chunk sizes.
Alex	So what do teams typically do in this situation? They usually grab off-the-shelf tools like RAGAS.
Sam	They do, and RAGAS is a great starting point, but it relies on generalized zero-shot prompts. It effectively uses GPT-4 to ask, “Is this retrieved text relevant to this question?”
Alex	Which goes back to the bias problem.
Sam	The ARES authors argue that these static heuristics don’t adapt well to domain shifts. A zero-shot prompt that works perfectly for checking relevance in Wikipedia articles might fail miserably when checking relevance in messy OCR-scanned technical manuals full of domain-specific acronyms.
Alex	So how does ARES solve this without bankrupting the company on expert human labeling?
Sam	They treat evaluation as an end-to-end data science pipeline, not just a clever prompt. It is a three-step process. Step 1 is synthetic data generation.
Alex	Now, I can immediately hear the listeners pausing the deep dive. Synthetic data, isn’t that just garbage in, garbage out? If the target model hallucinates and we use a model to generate the training data, aren’t we just compounding the errors?
Sam	That is the common, very valid fear. But ARES does something exceptionally clever. They don’t just generate positive, perfectly clean examples. They generate a comprehensive curriculum of failures.
Alex	A curriculum of failures. I like that phrasing. Explain how they build that.
Sam	Let’s focus on the three core RAG metrics: context relevance, answer faithfulness, and answer relevance.
Alex	Context relevance being, did the retriever fetch the right document? Answer faithfulness: did the LLM stick to the facts in that document without making things up? And answer relevance: did the final response actually answer the user’s initial question?
Sam	Exactly. So to generate synthetic data, they take a document from your specific proprietary corpus. They use a strong LLM (the paper used FLAN-T5 XXL, but you would likely use GPT-4 or Llama 3 today) to read the document and generate a relevant question that can be answered by it. That creates a positive pair, a good retrieval example.
Alex	OK, that makes sense.
Sam	But then they intentionally generate negatives. To simulate a retrieval failure, a context relevance failure, they take a generated question, but they swap the actual document with a completely random document from the corpus.
Alex	Ah, so they force a mismatch.
Sam	Yes, and to simulate a hallucination, a faithfulness failure, they provide the correct document, but they prompt the LLM to generate an answer that directly contradicts the facts in the text.
Alex	I see. So they’re procedurally generating the exact specific errors they want their evaluation metric to be able to catch.
Sam	They are building a massive data set of here is what a lie looks like in the context of semiconductor logs, and here is what a bad retrieval looks like. They are defining the boundary conditions of failure for the judge.
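The retrieval-failure half of that curriculum can be sketched without any model calls. The snippet assumes you already have (question, source document) positive pairs produced by an LLM; the `Example` class and function name are hypothetical, and the contradiction-style faithfulness negatives are omitted because they require prompting a generator.

```python
import random
from dataclasses import dataclass


@dataclass
class Example:
    question: str
    document: str
    label: int  # 1 = document is relevant to the question, 0 = it is not


def build_context_relevance_set(pairs: list[tuple[str, str]], seed: int = 0) -> list[Example]:
    """Turn positive (question, document) pairs into positives plus negatives.

    Each negative keeps the question but swaps in a randomly chosen
    *different* document, simulating a retrieval failure.
    """
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    documents = [doc for _, doc in pairs]
    examples = []
    for question, doc in pairs:
        examples.append(Example(question, doc, 1))
        others = [d for d in documents if d != doc]
        if others:  # a one-document corpus has nothing to swap in
            examples.append(Example(question, rng.choice(others), 0))
    return examples


pairs = [
    ("What caused the etch step to abort?", "Log 17: etch chamber pressure fault"),
    ("Which tool reported a wafer misalignment?", "Log 42: aligner tool A3 offset alarm"),
    ("When was the CMP slurry replaced?", "Log 58: slurry replaced at 02:10"),
]
for example in build_context_relevance_set(pairs):
    print(example.label, "|", example.question, "|", example.document)
```

The resulting labeled set is what a small classifier would be fine-tuned on to judge context relevance.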
Alex	That leads directly to step 2 of ARES: training lightweight judges.
Sam	Right. This is where it gets highly efficient. They don’t use a massive, expensive model like GPT-4 for the actual evaluation at runtime,
Alex	because running GPT-4 over tens of thousands of logs every time you update your vector index would cost a fortune.
Sam	It’s unsustainable. So instead, they take that massive synthetic data set they just generated, the positives and the intentional negatives, and they fine-tune a much smaller local model. Specifically, they use DeBERTa-v3-Large.
Alex	DeBERTa is an encoder-only, BERT-style model. It’s tiny compared to modern LLMs. You can literally run that on a standard CPU.
Sam	It’s incredibly fast and practically free to run, but here’s the magic: because it was fine-tuned entirely on your specific domain data using that synthetic curriculum of failures, it becomes an absolute specialist.
Alex	It becomes a sniper for finding errors in semiconductor logs, whereas GPT-4 is just a really smart generalist.
Sam	Exactly. The DeBERTa judge learns the specific vocabulary, the acronyms, the typical sentence structures of your data.
Alex	We have a fast, specialized, cheap judge, but we still have that lingering statistical doubt. It was trained entirely on synthetic data generated by an AI. How do we know we can actually trust its score when we deploy it on real messy user data?
Sam	That is the million-dollar question for any evaluation framework, and that is where step 3 comes in: prediction-powered inference, or PPI.
Alex	Let’s slow down here. PPI is the statistical core of this paper. This is the meat and potatoes for our data science audience. How does prediction-powered inference actually work?
Sam	PPI is a statistical method for correcting the bias of an automated model using a very small, highly trusted sample of real human data.
Alex	So remember that senior semiconductor engineer we talked about earlier? We still need him.
Sam	Yes, but instead of forcing him to label 5000 documents, which he won’t do, we ask him to label, say, 150 documents.
Alex	150 is totally manageable. That’s maybe one afternoon of work.
Sam	Exactly. So here is the mathematical workflow. You run your fast DeBERTa judge over your massive unlabeled production set of 10,000 user logs. The judge spits out a score. Let’s say it estimates that your system has an 85% accuracy rate.
Alex	But we don’t trust that 85% yet because the judge might be biased by its synthetic training data, right?
Sam	So we take those 150 examples that the human engineer carefully labeled, the ground truth. We run our DeBERTa judge on those exact same 150 examples, and we compare them. We compare the machine’s predictions against the human’s ground truth. We see exactly where the judge agreed with the human and, more importantly, where it systematically disagreed. We use this comparison to calculate a rectifier.
Alex	A rectifier, like a calibration weight?
Sam	Essentially, yes, it calculates the error rate or the specific directional bias of the judge on that ground truth sample.
Alex	You are calibrating your cheap instrument against a highly accurate, expensive instrument.
Sam	Precisely. And PPI uses that rectifier, along with some pretty elegant statistical theory leveraging things like the central limit theorem, to construct a confidence interval.
Alex	So instead of just blindly reporting 85% accuracy, what do you report?
Sam	It allows you to go to your engineering lead and say, based on the automated judge’s score on the 10,000 items and mathematically calibrated by its observed error on the 150 human labeled items, we are 95% confident that the true accuracy of the system is between 82% and 86%.
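The workflow just described corresponds to the standard prediction-powered estimator for a mean, sketched here with only the standard library. This follows the general PPI recipe (judge average minus the rectifier, with a normal-approximation interval) and is not necessarily identical to the ARES implementation; the function name and the toy numbers are illustrative.

```python
from math import sqrt
from statistics import NormalDist, fmean, variance


def ppi_mean_ci(judge_unlabeled, judge_labeled, human_labeled, confidence=0.95):
    """Confidence interval for the true mean score via prediction-powered inference.

    judge_unlabeled: judge scores on the large unlabeled set (e.g. 10,000 items)
    judge_labeled:   judge scores on the small human-labeled set (e.g. 150 items)
    human_labeled:   human scores on that same small set, in the same order
    """
    if len(judge_labeled) != len(human_labeled):
        raise ValueError("judge and human labels must cover the same items")

    # Rectifier: how far off the judge is, on average, where we know the truth.
    errors = [j - h for j, h in zip(judge_labeled, human_labeled)]
    rectifier = fmean(errors)

    # Point estimate: the judge's average, corrected by its measured bias.
    estimate = fmean(judge_unlabeled) - rectifier

    # Uncertainty comes from both the large sample and the small one.
    std_err = sqrt(variance(judge_unlabeled) / len(judge_unlabeled)
                   + variance(errors) / len(errors))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return estimate - z * std_err, estimate + z * std_err


# Toy data: 1 = judged correct, 0 = judged incorrect.
low, high = ppi_mean_ci([1] * 85 + [0] * 15, [1, 1, 1, 0], [1, 1, 0, 0])
print(f"{low:.3f} to {high:.3f}")
```

With only a handful of human labels, as in the toy call, the interval is very wide; the width shrinks as the labeled sample grows, which is the trade-off the 150-label budget is buying.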
Alex	That is the holy grail for ML engineering. You aren’t just giving a point estimate. You are providing a bounded metric with mathematically sound error bars,
Sam	and that is what actually allows you to deploy with confidence. If the lower bound of your confidence interval, that 82%, is above your business’s safety threshold, you ship the model.
Alex	If it’s below, you go back to the drawing board. It transforms evaluation from a game of vibes into rigorous risk management,
Sam	and the empirical results from the ARES paper backing this up were incredibly strong.
Alex	What kind of improvements did they see over the baselines?
Sam	By utilizing PPI, ARES required 78% fewer human annotations than traditional fine-tuning methods to reach the exact same level of statistical confidence.
Alex	That is a massive reduction in labeling costs.
Sam	And crucially, it significantly outperformed generalized tools like RAGAS in accurately ranking different RAG systems. When researchers intentionally shifted domains, like generating the synthetic training data on Wikipedia articles but then evaluating the system on dense scientific papers, ARES adapted seamlessly,
Alex	because you just run the synthetic generation step on the new scientific corpus.
Sam	Exactly. The static handwritten prompts in RAGAS failed to adapt to the new vocabulary, but ARES just built a new specialist judge for the new domain.
Alex	The tactical takeaway for anyone building RAG right now is clear. Don’t just write a massive grading prompt and hope for the best. Treat evaluation as a system. Build a dataset generator. Train a lightweight metric, and calibrate it with PPI.
Sam	It definitely sounds like a lot of work up front, but this is what true evaluation engineering looks like. You are building a machine to measure the machine, and once it’s built, it scales infinitely.
Alex	Let’s move to level 3, the frontier. We’ve done chatbots, we’ve done RAG. Now let’s talk about agents and long-form content generation.
Sam	This is arguably the hardest unsolved problem in the space right now.
Alex	We are looking at the paper PROXYQA. Let’s set the scenario. You have an agentic workflow. You prompt it: write a comprehensive deep-dive market research report on the economic impact of the 2024 Olympics on the Parisian real estate market.
Sam	The agent is going to do some research and generate a 2,000, maybe 3,000-word output.
Alex	There’s no single correct string of text to compare that to.
Sam	None. Standard NLP metrics like ROUGE or BLEU, which basically just measure n-gram overlap between a generation and a reference text, are completely useless here
Alex	because they measure the specific phrasing, not the underlying facts.
Sam	Right. If the reference says “the budget was large” and the AI says “the expenditures were massive,” ROUGE scores that very poorly because the words don’t match, even though the semantic meaning is identical.
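The budget example can be checked with a simplified unigram-overlap F1, sketched below in the spirit of ROUGE-1. This is an illustration only: the real ROUGE and BLEU implementations add details such as stemming, higher-order n-grams, and brevity penalties, and the function name is hypothetical.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over overlapping lowercase whitespace tokens (a ROUGE-1-style score)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # tokens shared by both texts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Same meaning, almost no shared words: only "the" overlaps.
print(unigram_f1("the expenditures were massive", "the budget was large"))  # 0.25
```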
Alex	As we discussed earlier, falling back to a vibe check is incredibly dangerous here because the generated report might flow beautifully. The grammar is flawless. It sounds highly authoritative,
Sam	but it might completely miss the single most important statistic about infrastructure spending or hallucinate the inflation rate.
Alex	Fluent but vacuous.
Sam	Exactly. So the PROXYQA researchers proposed a radically new framework. They argue that we need to stop trying to evaluate the text itself and start evaluating the information contained inside the text.
Alex	How do they mechanically separate the two?
Sam	They break the evaluation down into what they call atomic facts. They start with the user’s high-level prompt, the meta-question: write a report on the Paris Olympics. Then they use an LLM to generate a predefined list of proxy questions based on that topic. These are highly specific, factual, boolean, true or false questions that simply must be answerable if the generated report is truly comprehensive.
Alex	Give me a concrete example for the Paris Olympics report.
Sam	A proxy question might be: true or false, did the transport infrastructure projects finish ahead of schedule? Or true or false, did the total operating budget exceed €8 billion? Or true or false, does the report explicitly mention the impact of the games on short-term Airbnb rental prices?
Alex	I see the genius in that. They’re taking a highly subjective, open-ended essay prompt and converting it into a rigid checklist of objective facts.
Sam	Precisely. They convert a generative task, which is notoriously hard to evaluate, into a classification task, which is trivial to evaluate.
Alex	Walk us through the actual operational workflow of PROXYQA.
Sam	Step one, your agent generates the massive long-form report. Step two, you pass that generated report to an evaluator model, again, usually something robust like GPT-4. Step 3, you give that evaluator model the predefined list of proxy questions. And step 4, you prompt the evaluator: “Based strictly and entirely on the text provided in this report, answer these true or false questions. If the information required to answer a question is missing from the text, mark it as not found.”
Alex	So the final evaluation score isn’t a nebulous 7 out of 10 for narrative flow. The final score is a hard percentage. This report successfully contained the necessary information to answer 85% of the expected factual proxy questions.
Sam	Exactly. It specifically measures informativeness and knowledge coverage. It effectively calculates the information density of the final output.
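The scoring step at the end of that workflow reduces to comparing the evaluator's replies with the expected answers. A minimal sketch follows, assuming the evaluator's replies have already been parsed into "true", "false", or "not found"; the function name, data layout, and sample questions are illustrative rather than taken from the paper.

```python
def proxy_qa_score(expected: dict[str, bool], answers: dict[str, str]) -> float:
    """Fraction of proxy questions the report allowed the evaluator to answer correctly.

    expected: proxy question -> gold true/false answer
    answers:  proxy question -> evaluator reply ("true", "false", or "not found")
    A missing or "not found" reply counts as a miss.
    """
    if not expected:
        raise ValueError("need at least one proxy question")
    correct = 0
    for question, gold in expected.items():
        reply = answers.get(question, "not found").strip().lower()
        if reply == ("true" if gold else "false"):
            correct += 1
    return correct / len(expected)


expected = {
    "Transport projects finished ahead of schedule?": False,
    "Operating budget exceeded 8 billion euros?": True,
    "Report mentions short-term rental prices?": True,
    "Report mentions hotel occupancy?": True,
}
answers = {
    "Transport projects finished ahead of schedule?": "false",
    "Operating budget exceeded 8 billion euros?": "true",
    "Report mentions short-term rental prices?": "not found",
    "Report mentions hotel occupancy?": "True",
}
print(proxy_qa_score(expected, answers))  # 0.75
```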
Alex	This is so actionable for data scientists. Think about the debugging cycle. If your agent’s score suddenly drops from 90% to 60%, you don’t just stare at a block of text wondering what went wrong. You can look at the specific proxy questions that failed.
Sam	Right. You can say, “Oh, look at this, we are consistently missing all the proxy questions related to financial metrics. Our agent is good at narrative, but bad at numbers. We need to go adjust the retrieval weights for our financial documents in the vector database.”
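That debugging loop becomes mechanical if each proxy question carries a topic tag. The tagging scheme below is an assumption added for illustration (the discussion does not say the questions are categorized), and the function name is hypothetical.

```python
from collections import defaultdict


def failure_rate_by_category(results):
    """Return {category: fraction of proxy questions failed}.

    results: iterable of (category, passed) pairs, one per proxy question.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        if not passed:
            failures[category] += 1
    return {category: failures[category] / totals[category] for category in totals}


results = [
    ("financial", False),
    ("financial", False),
    ("narrative", True),
    ("narrative", False),
]
print(failure_rate_by_category(results))  # {'financial': 1.0, 'narrative': 0.5}
```

A category whose failure rate jumps between runs points at the retrieval source or prompt section to inspect first.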
Alex	It makes the qualitative explicitly quantitative.
Sam	And the empirical findings from the PROXYQA paper were really revealing, especially regarding the current state of the models we use today.
Alex	Let’s dive into those findings. Let’s talk about the gap between open source and proprietary models on this specific metric.
Sam	It was a stark contrast. The researchers tested major open source models like Llama 2 13B and Vicuna against proprietary models like GPT-4 on these complex long-form tasks.
Alex	And how did Llama perform?
Sam	Well, the Llama models often generated very long reports. They produced a massive amount of text.
Alex	Verbosity bias rearing its head again.
Sam	Exactly. But when they ran the PROXYQA evaluation, when they actually checked for facts, the scores were dismally low. The open source models were just yapping.
Alex	They were filling the space with fluff, structural repetition, and vague generalities. Sounding smart without actually saying anything.
Sam	The information density was terrible. Conversely, GPT-4 Turbo often generated significantly shorter reports, but it hit a much higher percentage of the proxy questions. It was denser. It actually respected the user’s time and delivered the facts.
Alex	But the real winner in their benchmarks wasn’t just a raw LLM. It was RAG, right?
Sam	Specifically, web-augmented RAG. They tested Bing Chat, which has real-time search capabilities. It completely dominated the benchmark.
Alex	Even beating GPT-4 Turbo?
Sam	Yes, even GPT-4 Turbo, when it was cut off from the web and had to rely entirely on its internal weights, struggled on the really hard, obscure proxy questions.
Alex	It proves that for long-form, comprehensive content generation, parametric memory, what the model memorized during its pre-training phase, is simply not enough. You absolutely need active external retrieval,
Sam	which ties us beautifully right back to the ARES paper.
Alex	Right. To get a high PROXYQA score at the agent level, you inherently need a high ARES score at the retrieval level. If your context retrieval is bad, your agent’s report will inevitably miss facts.
Sam	It’s a full-stack evaluation strategy. The LMSYS paper tests your foundational model’s basic conversational and logic ability. ARES tests your system’s retrieval and grounding layer, and PROXYQA tests your agent’s ability to synthesize that information and achieve comprehensive coverage of a topic.
Alex	You really can’t ignore any layer of that stack if you want a reliable product.
Sam	You can’t. They are interlocked.
Alex	I want to take a moment to synthesize all of this for the listeners. We have thrown a massive number of acronyms, statistical methods, and frameworks at you today. If I am a lead data scientist sitting at my desk right now listening to this, and I want to overhaul my team’s evaluation strategy by next quarter, what is the step-by-step roadmap?
Sam	I think it comes down to three very concrete, actionable directives. First, sanitize your judge.
Alex	If you are currently using a simple LLM-as-a-judge prompt, fix it today.
Sam	Yes, you must immediately implement the swap check from the LMSYS paper. Compare answer A to answer B, and then force the system to compare B to A. If the judge disagrees with itself, throw the result out. It’s noise.
Alex	And update your system prompts to explicitly penalize verbosity. That is the lowest-hanging fruit.
Sam	Absolutely. Step two, for your RAG pipelines, you need to transition aggressively towards synthetic data generation.
Alex	Stop waiting for your actual users to generate enough logs and stop waiting for your product managers to label massive Excel sheets.
Sam	Use a strong frontier model to generate your own golden sets from your existing documents. Generate the positive examples and intentionally generate the negative failures. Build that curriculum of
Alex	failure and implement prediction-powered inference. The Python libraries for PPI are open source and readily available. It allows you to take a tiny human sample and project it onto your massive unlabeled dataset with mathematical confidence intervals.
Sam	Exactly. And step three, if you are building multi-step agents, adopt the atomic fact mindset from ProxyQA.
Alex	Stop trying to evaluate the final output text as a monolithic block.
Sam	Define what success actually looks like in terms of specific, extractable bits of information. Even if you just sit down and handwrite 50 true/false proxy questions for your core test set, that is infinitely more valuable than doing a vague vibe check on the final paragraph.
Alex	Listening to this roadmap, it really seems like the fundamental role of the data scientist is shifting dramatically. We used to be model trainers. We spent our time tuning hyperparameters, tweaking learning rates, designing architectures. But now, since the foundational models are largely commodities we just pull off the shelf via an API, our job is becoming that of evaluation architects.
Sam	I think that is an incredibly accurate assessment. The competitive moat for a company is no longer the foundational model itself. Everyone has access to GPT-4 or Llama 3. The real moat is your organizational ability to rigorously measure if that model is actually working for your specific, highly specialized use case.
Alex	The engineering teams that can scientifically measure their system’s performance using PPI, synthetic failures, and proxy tasks are the ones who will be able to iterate safely and win the market.
Sam	And the teams that keep relying on vibe checks are going to hit a performance ceiling they simply can’t break through, because they won’t even have the metrics to tell them why they are failing.
Alex	They’ll just be tweaking prompts randomly in the dark. Maybe if I type, you are a highly intelligent expert in all caps, the error rate will go down.
Sam	That’s not computer science. That’s superstition. Papers like ARES and ProxyQA give us the engineering tools to move past that.
Alex	I want to end today’s deep dive on a bit of a provocation. We talked earlier about self-enhancement bias in the LMSYS paper, right?
Sam	Models preferring their own style.
Alex	In the ProxyQA paper, the researchers noted a similar phenomenon. They found that GPT-based evaluators were suspiciously overconfident when grading answers that were generated by GPT models.
Sam	Yes, this is what we call the closed-loop danger.
Alex	Think about the implications of the stack we just outlined. If we use GPT-4 to generate the synthetic training data in ARES, and we use GPT-4 to power the agent writing the report in ProxyQA, and then we use GPT-4 as the final judge to score that report, aren’t we just building a massive, expensive hall of mirrors?
Sam	We are at extreme risk of creating a monoculture of intelligence. If the judge model and the student model share the exact same underlying architecture and the exact same pre-training data, they fundamentally share the exact same blind spots.
Alex	If GPT-4 is inherently bad at a certain type of spatial reasoning, it won’t even notice when the student model makes a massive spatial reasoning error. It will just wave it through.
Sam	It creates a false sense of perfection. The metrics look great. But the system is actually broken in ways the AI simply cannot perceive.
Alex	So what is the ultimate solution to the closed loop?
Sam	Diversity in your evaluation stack. Model ensembling.
Alex	Don’t just rely on one provider.
Sam	Exactly. If you use GPT to generate the data, use Claude to judge it. Use a locally fine-tuned Llama model for the proxy extraction. Break the architectural loop,
Alex	and most importantly, keep the human in the loop.
Sam	Always. Through methods like prediction-powered inference, the human provides the fundamental anchor to reality.
Alex	The human sets the semantic standard, and the AI simply scales the measurement of that standard.
Sam	If you cut the human anchor entirely, you inevitably drift into statistical noise. You need that small, highly curated sample of ground truth to keep the entire automated system tethered to reality.
Alex	That is a very powerful place to leave it. We’ve covered a massive amount of ground today. We’ve moved from the vulnerability of “it looks good to me” to the engineering rigor of “we are 95% confident within a 3% margin of error based on atomic fact coverage.”
Sam	It is a difficult journey to build that infrastructure, but it’s the only way to build reliable AI applications in production.
Alex	To our listeners, the data scientists and evaluation engineers out there, we will link all three seminal papers in the show notes: Judging LLM-as-a-Judge, ARES, and ProxyQA.
Sam	Read them.
Alex	They’re required reading. Stop doing vibe checks. Start building real evaluation pipelines. Keep digging deeper.
Sam	Keep digging.
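
The swap check described in the roadmap above can be sketched in a few lines. This is only an illustration under stated assumptions, not code from the LMSYS paper: the `Judge` interface (a callable that returns "first", "second", or "tie") and both toy judges are hypothetical stand-ins for a real LLM call.

```python
from typing import Callable, Optional

# Hypothetical judge interface (an assumption for this sketch): it receives
# (question, first_answer, second_answer) and returns "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def swap_check(judge: Judge, question: str, answer_a: str, answer_b: str) -> Optional[str]:
    """Return "A", "B", or "tie" if the judge agrees with itself in both
    presentation orders, otherwise None (the comparison is treated as noise)."""
    forward = judge(question, answer_a, answer_b)   # A shown first
    backward = judge(question, answer_b, answer_a)  # B shown first

    # Translate positional verdicts back to answer labels.
    forward_label = {"first": "A", "second": "B", "tie": "tie"}[forward]
    backward_label = {"first": "B", "second": "A", "tie": "tie"}[backward]

    return forward_label if forward_label == backward_label else None


def position_biased_judge(question: str, first: str, second: str) -> str:
    """Toy judge that always prefers whatever it sees first."""
    return "first"


def keyword_judge(question: str, first: str, second: str) -> str:
    """Toy judge that prefers the answer containing the word 'Paris'."""
    first_hit, second_hit = "Paris" in first, "Paris" in second
    if first_hit == second_hit:
        return "tie"
    return "first" if first_hit else "second"


if __name__ == "__main__":
    q = "What is the capital of France?"
    print(swap_check(position_biased_judge, q, "Paris", "Lyon"))  # inconsistent judge
    print(swap_check(keyword_judge, q, "Paris", "Lyon"))          # consistent judge
```

A position-biased judge contradicts itself once the order is swapped, so its verdict is discarded, while an order-independent judge returns the same label both times. Whether to drop inconsistent pairs or count them as ties is a design choice left to the team.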

Presentations

Notebooks

Response Evaluation: compares the RAG responses to given questions against the ground-truth answers
- Example notebook RAG_Evaluation
- Example code in response_evaluation
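
As a complement to ground-truth comparison, the "atomic fact" idea from the transcript can be sketched as a proxy-question coverage score. This is a minimal sketch and not the ProxyQA implementation or the code in response_evaluation: `ProxyQuestion`, `proxy_coverage`, and the keyword-based answerer are hypothetical names, and in practice the answerer would be an LLM that reads the report and answers each true/false question.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Sequence


@dataclass(frozen=True)
class ProxyQuestion:
    text: str       # a true/false question written by a human expert
    expected: bool  # the ground-truth answer


# Hypothetical evaluator interface: answers a question using only the report.
AnswerFn = Callable[[str, str], bool]


def proxy_coverage(report: str, questions: Sequence[ProxyQuestion], answer_fn: AnswerFn) -> float:
    """Fraction of proxy questions answered correctly from the report alone."""
    if not questions:
        raise ValueError("at least one proxy question is required")
    correct = sum(1 for q in questions if answer_fn(report, q.text) == q.expected)
    return correct / len(questions)


def make_keyword_answerer(keywords: Mapping[str, str]) -> AnswerFn:
    """Toy stand-in for an LLM evaluator: answers True when the keyword
    registered for the question appears in the report."""
    def answer(report: str, question: str) -> bool:
        return keywords[question].lower() in report.lower()
    return answer


if __name__ == "__main__":
    report = "ARES trains lightweight judges on synthetic data."
    questions = [
        ProxyQuestion("Does the report say synthetic data is used?", True),
        ProxyQuestion("Does the report claim no human labels are needed?", False),
        ProxyQuestion("Does the report mention confidence intervals?", True),
    ]
    answerer = make_keyword_answerer({
        questions[0].text: "synthetic data",
        questions[1].text: "no human labels",
        questions[2].text: "confidence interval",
    })
    print(proxy_coverage(report, questions, answerer))
```

Because the score is a fraction of covered facts rather than a judgment of the whole text, a long report gains nothing from padding.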
Retrieval Evaluation: assesses the text-chunk retrieval and ranking of the RAG system
- Example notebook RAG_Retrieval_Evaluation
- Example code in retrieval_evaluation
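
Retrieval and ranking quality are commonly summarized with recall@k and mean reciprocal rank (MRR). The sketch below shows those two standard metrics. It assumes chunks are identified by string IDs and that a labeled set of relevant chunk IDs exists per question; the metrics used in retrieval_evaluation may differ.

```python
from typing import Sequence, Set, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the relevant chunks that appear in the top-k retrieved chunks."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per question."""
    if not runs:
        raise ValueError("at least one query is required")
    return sum(reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs) / len(runs)


if __name__ == "__main__":
    retrieved = ["c7", "c2", "c9", "c4"]  # ranked chunk IDs returned by the retriever
    relevant = {"c2", "c4"}               # chunk IDs labeled as relevant
    print(recall_at_k(retrieved, relevant, k=2))
    print(reciprocal_rank(retrieved, relevant))
    print(mean_reciprocal_rank([(retrieved, relevant), (["c1"], {"c3"})]))
```

Recall@k measures whether the needed chunks were fetched at all, while MRR rewards placing a relevant chunk near the top of the ranking.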

Reading

Paper: ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems
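
The transcript's advice to project a small human-labeled sample onto a large unlabeled set refers to prediction-powered inference (PPI). The following is a hand-rolled sketch of the basic PPI estimate of a mean with a normal-approximation confidence interval. It is not the API of any PPI library or of ARES itself; the function name and arguments are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist, fmean, variance
from typing import Sequence, Tuple


def ppi_mean_ci(
    human_labels: Sequence[float],        # gold labels on the small annotated sample
    judge_on_labeled: Sequence[float],    # judge predictions on that same sample
    judge_on_unlabeled: Sequence[float],  # judge predictions on the large unlabeled set
    alpha: float = 0.05,
) -> Tuple[float, float, float]:
    """Return (estimate, lower, upper) for the true mean label.

    The estimate is the judge's mean on the unlabeled set, corrected by the
    judge's average error (the "rectifier") measured on the labeled sample.
    """
    if len(human_labels) != len(judge_on_labeled):
        raise ValueError("labeled inputs must have the same length")
    n, big_n = len(human_labels), len(judge_on_unlabeled)
    if n < 2 or big_n < 2:
        raise ValueError("need at least two points in each sample")

    residuals = [y - f for y, f in zip(human_labels, judge_on_labeled)]
    estimate = fmean(judge_on_unlabeled) + fmean(residuals)

    # Both sources of uncertainty contribute to the standard error.
    std_err = sqrt(variance(judge_on_unlabeled) / big_n + variance(residuals) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate, estimate - z * std_err, estimate + z * std_err


if __name__ == "__main__":
    human = [1.0, 0.0, 1.0, 1.0]                # 1.0 = answer judged faithful by a human
    judge_labeled = [1.0, 1.0, 1.0, 1.0]        # the judge is too lenient on one item
    judge_unlabeled = [1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0]
    print(ppi_mean_ci(human, judge_labeled, judge_unlabeled))
```

With samples this small the interval is very wide; the example only shows the mechanics. The human labels correct the judge's systematic bias, and the unlabeled predictions narrow the interval when the judge is reasonably accurate.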

Additional Resources

ARES is a framework for evaluating Retrieval-Augmented Generation (RAG) models.
G-Eval is a framework that uses LLMs with chain-of-thought (CoT) prompting to evaluate LLM outputs against any custom criteria. It is described as the most versatile metric type that deepeval offers, intended to cover almost any use case with human-like accuracy.
RAGAS is a library that provides tools for evaluating Large Language Model (LLM) applications.