RAG Evaluation
This session explores the intricacies of prompt engineering for large language models (LLMs), emphasizing its importance in optimizing LLM performance for specific tasks. Unlike traditional machine learning models, evaluating LLMs involves subjective metrics like context relevance, answer faithfulness, and prompt relevance.
The session highlights frameworks like ARES, which uses smaller LLMs as evaluators, and LLMA, which focuses on instruction-following capabilities. Techniques such as chain-of-thought prompting, few-shot prompting, and retrieval-augmented generation are presented as effective strategies to guide LLMs in reasoning, learning from examples, and leveraging external information.
The session also introduces public datasets like KILT and SuperGLUE, along with task-specific datasets such as Natural Questions, HotpotQA, and FEVER. These datasets provide standardized benchmarks to test and refine prompt engineering approaches. The hosts emphasize the need for creativity and experimentation in this evolving field, blending technical expertise with an intuitive understanding of language and machine learning.
Required Reading and Listening
Listen to the podcast (transcription):
| Speaker | Text |
|---|---|
| Alex | Welcome back to the deep dive. I want to, uh, start today’s session by taking a quick mental trip back in time. Let’s go back to, say, 2019. You’re a data scientist. You’re building a classifier, maybe a churn prediction model or a fraud detector. You finish training. You run your test set, and you look at your console. What do you see? |
| Sam | Well, you see a confusion matrix, you see an F1 score, you see precision and recall, maybe an area-under-the-curve metric. It’s clean. It’s deterministic. |
| Alex | Exactly. It was comforting, wasn’t it? |
| Sam | Very comforting. You could sleep at night. |
| Alex | You knew if your model was drifting. You knew if the new deployment was better than the last one, because the F1 score went from, you know, 0.82 to 0.84. |
| Sam | The math told you exactly where you stood. |
| Alex | But today, today, our audience is building RAG pipelines. They’re fine-tuning Llama 3. They’re deploying agentic workflows that generate 1,000-word reports. And when they ask, when their product manager asks, "Is this model better?", what’s the answer? |
| Sam | The answer is usually, uh, "Well, I looked at 5 examples and they seem pretty good." The dreaded vibe check. It is the absolute crisis of our time in machine learning right now. We have moved from a world of hard metrics to a world of soft intuition. And for the engineers listening, the people who are used to rigorous CI/CD pipelines, this is a nightmare. |
| Alex | It really is. You cannot scale a vibe check. |
| Sam | No, you can’t. |
| Alex | You cannot automate a vibe check, and you certainly cannot explain to your VP of engineering that the reason you deployed a hallucinating model to production is because, uh, it felt right on Tuesday. Right? |
| Sam | "The vibes were immaculate" is not a valid defense in a postmortem. |
| Alex | So that is our mission today. We are going to completely dismantle the vibe check. |
| Sam | I love it. Let’s tear it down. |
| Alex | We are going to explore how to apply rigorous data science, statistics, and automation to the evaluation of generative AI. We’re talking about moving from "it looks good" to "we are 95% confident in this score within a specific confidence interval." |
| Sam | We’re basically talking about evaluation engineering as a distinct discipline. We aren’t just summarizing papers today. We are effectively building a roadmap for a modern, robust evaluation stack. |
| Alex | We have 3 seminal sources on the table today, and we’re going to tackle them in 3 stages. We’re going to start at level one, which is evaluating the LLM itself. This comes from the paper Judging LLM-as-a-Judge, |
| Sam | which is all about the biases inherent when we ask models to grade other models. |
| Alex | Right. Then we move to level two, evaluating RAG systems with the ARES framework. This is the heavy hitter on statistics, how to evaluate context retrieval and faithfulness without, you know, hiring an army of human annotators. |
| Sam | The labeling bottleneck, a huge pain point for everyone right now. |
| Alex | Huge. And finally, level 3: evaluating agents with the PROXYQA framework. How do you grade long-form content where there is no gold standard answer? |
| Sam | It’s a natural progression, right? From the atomic unit of a single chat response up to the complexity of vector retrieval, all the way to the open-ended multi-step nature of agents. |
| Alex | Let’s get right into it. Level one, the foundation: Judging LLM-as-a-Judge. This research comes out of LMSYS Org, the team behind Chatbot Arena. |
| Sam | Right. For the data scientists listening, you likely know Chatbot Arena intimately. It’s that crowdsourced leaderboard where you enter a prompt, two anonymous models generate answers side by side, and you just vote on which is better. |
| Alex | Model A, Model B, or a tie. |
| Sam | Exactly. And it uses an Elo rating system, very similar to what they use in chess rankings, to stack-rank the models. It is currently the gold standard for human preference alignment. |
| Alex | But here is the massive problem with human preference. It is slow. And it is insanely expensive. |
| Sam | Prohibitively expensive if you are doing rapid iteration, right? |
| Alex | If I am a data scientist tweaking a hyperparameter in my local RAG pipeline or testing a new few-shot prompt technique, I can’t submit my model to Chatbot Arena and wait 2 weeks for crowdsourced results. I need feedback in 5 minutes. |
| Sam | You need a metric you can put in a loop. So the industry pivoted hard to this concept of LLM as a judge. The hypothesis was pretty straightforward. If a frontier model like GPT-4 is already highly aligned with human preferences, why don’t we just ask GPT-4 to act as the human? |
| Alex | We feed it the user prompt, answer A and answer B, and basically say, "Hey, pick a winner." |
| Sam | Exactly. It sounds like a bit of an infinite loop, using the model to grade the model, but the immediate question the community had was, does it actually work? |
| Alex | And the short answer from the paper is yes, but with massive asterisks. |
| Sam | Giant asterisks. The researchers found that a strong judge like GPT-4 matches human preference judgments over 80% of the time. |
| Alex | Now I know some listeners might scoff at 80%. They might think that’s a B minus, that’s 20% error. |
| Sam | But you have to contextualize that number. You have to realize that humans only agree with each other about 80, 82% of the time on these subjective tasks. |
| Alex | So GPT-4 is effectively hitting the noise ceiling of human agreement. |
| Sam | Correct. If you and I both look at two haikus written by different models, we might disagree on which is better. Human preference is noisy. GPT-4 is as good as a generic human labeler at capturing the average preference. |
| Alex | However, and this is the crucial difference, unlike a human, a model has systematic algorithmic biases. If a human is tired, they might make a random error. If a model has a bias, it makes the exact same error a million times in a row. |
| Sam | And if you are building an automated evaluation pipeline, you need to account for these systematic biases or your metrics will be absolute garbage. You will be optimizing your system for the wrong things. |
| Alex | Let’s unpack these specific biases, because this is the real gotcha for anyone writing an eval script right now. The first one they identified is position bias. |
| Sam | Oh, this is a fascinating glitch. The research showed that if you present two answers to an LLM judge, answer A and answer B, the judge has a statistically significant preference for whichever answer is presented first, |
| Alex | just because it showed up first in the context window, |
| Sam | literally just because it was first. The older models like GPT-3.5 were notorious for this. It’s almost like cognitive laziness. The model reads the first option, decides it’s good enough, and then the mathematical bar for the second option becomes impossibly high. |
| Alex | It anchors on the first output. So if I’m running a regression test on my laptop, and I always pass my new fine-tuned model’s output as option A against my baseline as option B, I am artificially inflating my success rate. |
| Sam | You are grading on a rigged curve. Your new model has a built-in advantage, and the scary part is the judge model will give you a highly coherent, completely fabricated reason for its choice. |
| Alex | It rationalizes the bias. |
| Sam | Exactly. It will say, "I preferred option A because it was more concise and had better structure," but if you run the exact same prompt and just swap the order to show B first, it will confidently say, "I preferred option B because it was more concise and had better structure." |
| Alex | It’s hallucinating the justification to fit the position bias. That is wild. |
| Sam | It’s a huge problem. So what is the engineering fix? |
| Alex | It’s simple, but it effectively doubles your inference bill. You have to enforce symmetry. You run the evaluation twice for every single judgment. |
| Sam | Right. First, A versus B. |
| Alex | Then B versus A, and you only count it as a win if the judge is consistent. |
| Sam | Ideally, yes. If the judge picks your new model when it’s in slot A and also picks your new model when it’s in slot B, that’s a true win. If it just picks whichever one is in slot A both times, that is a contradiction. It’s noise. You either treat it as a tie or you discard the sample entirely. |
| Alex | If you aren’t doing position swapping in your LLM-as-a-judge pipeline right now, your error bars are massive, and you don’t even know it. You’re flying blind. OK, that’s manageable. But the next bias is the one that really keeps me up at night regarding the future of model training: verbosity bias. |
| Sam | The length bias. This one is insidious because it aligns with a very real human flaw. We humans tend to think longer answers are smarter answers. The researchers tested this with what they called a repetitive list attack. |
| Alex | Walk us through how that attack works. |
| Sam | So they took a high-quality, concise answer from a model. Let’s say the prompt was about the causes of the French Revolution, and the model gave 5 clear, distinct bullet points. |
| Alex | A perfect, dense answer. |
| Sam | Right. Then they attacked it by manually rewriting it. They didn’t add any new historical facts. They just made the sentences much longer. They repeated the introduction and the conclusion, and they added a bunch of transitional fluff. |
| Alex | Just relentlessly padding the word count. |
| Sam | Pure filler. And the LLM judges, especially the mid-tier models, consistently rated the verbose repetitive answer as objectively better than the concise one. |
| Alex | Think about the incentives there. If you are using LLM-as-a-judge to fine-tune your model via reinforcement learning, you are mathematically optimizing your model to be a windbag. |
| Sam | You are literally training for yapping. You aren’t optimizing for information density. You’re optimizing for token generation, and we see this in the wild all the time. |
| Alex | Oh, absolutely. How often do you use a chatbot to ask a simple coding question, and it gives you 4 paragraphs of intro about what Python is before giving you the one-line fix? |
| Sam | "Certainly, Python is a versatile programming language." Yes, we know. That behavior is likely a direct artifact of verbosity bias in the RLHF and evaluation stages. |
| Alex | So how do we fix that as evaluation engineers? Do we just tell the judge, "Don’t like long answers"? |
| Sam | You have to be very specific and aggressive in your system prompt. You can’t just say "be critical." You have to explicitly instruct the judge: penalize repetition, prefer conciseness, do not favor length alone. |
| Alex | Does that actually work? |
| Sam | It mitigates it, but it’s a stubborn bias. GPT-4 is more resistant than others, which is why it’s the standard judge right now, but it’s definitely not immune. |
| Alex | Then there is self-enhancement bias. This one feels almost like an AI mirror test. |
| Sam | It really does. The data suggests that models have a slight but measurable preference for outputs that mimic their own training data or their own stylistic quirks. |
| Alex | GPT-4 slightly prefers GPT-4 outputs. Claude slightly prefers Claude outputs. It creates a massive echo chamber. |
| Sam | It creates a monoculture of style. If every data science team uses GPT-4 to grade their open-source models, every open-source model, every specialized medical model, every legal AI is being pushed to sound exactly like GPT-4. |
| Alex | We might be actively penalizing valid, correct answers just because they use a slightly different tone or structure than what OpenAI baked into their alignment process. |
| Sam | That is a profound point. We are subtly standardizing the definition of intelligence to mean sounding like ChatGPT. |
| Alex | And for data scientists listening, the tactical takeaway here is that you need diversity in your judges. Don’t just use one API for all your evaluation. |
| Sam | Use an ensemble. Route some evals to Claude, some to Gemini, some to a local Llama 3 instance, or, as we’ll see in the next paper, use a smaller model fine-tuned entirely on your specific domain data. |
| Alex | One last limitation on the judging aspect before we move into RAG: reasoning and logic. |
| Sam | This is critical for anyone building math bots or code assistants or multi-step logic agents. The paper found that LLMs are decent at grading creative writing or summarization. But they struggle immensely to grade logic errors if they don’t have external help. |
| Alex | This is the confident hallucination problem. |
| Sam | Right. Let’s say a student model outputs a math proof. The proof looks like a proof. It has the right formatting. The equations are properly indented. It ends with a confident "therefore, x equals 5." But it makes a subtle arithmetic error in step 3. |
| Alex | The LLM judge often misses it. |
| Sam | The judge sees the vibe of a correct proof and gives it a pass. It rates it a 10 out of 10. |
| Alex | It’s like a teaching assistant who is grading 500 papers at 2 a.m. They’re too tired to check the actual math, so they just check if the handwriting is neat and the final answer is boxed. |
| Sam | Precisely. The fix here is what the researchers call reference-guided grading. You cannot ask the judge to solve the problem and grade the student at the exact same time. |
| Alex | It overwhelms the attention mechanism. |
| Sam | Yes, you must provide the ground truth answer, the golden reference, in the system prompt. Or, if you don’t have a golden reference, you use chain-of-thought grading. You force the judge to explicitly solve the problem step by step in its own scratch pad before it is allowed to look at the student’s answer. |
| Alex | You force it to do the hard work first. You ground the judge in reality before you let it issue a verdict. |
| Sam | Exactly. So that is the baseline. That is just evaluating a raw language model against another language model. But most of our listeners are dealing with something vastly more complex in production. |
| Alex | They’re building RAG systems, retrieval-augmented generation, and that brings us to level 2. |
| Sam | RAG completely changes the evaluation game because it introduces multi-component failure. |
| Alex | Right, let’s break that down. In a simple chatbot, if the answer is wrong, the LLM is wrong. It hallucinated. But in a RAG pipeline, if the answer is wrong, it could be the LLM hallucinating, or it could be the vector database retrieving a recipe for lasagna when the user asks for a specific legal precedent. |
| Sam | And distinguishing between those two distinct points of failure is the fundamental job of the data scientist. This brings us to the second paper. |
| Alex | ARES stands for an Automated Evaluation Framework for Retrieval-Augmented Generation Systems. This is probably the most rigorous paper of the bunch regarding pure statistical methodology. |
| Sam | It is a master class in evaluation data science. The core problem ARES tackles is what we call the labeling bottleneck. |
| Alex | I’m sure every listener has felt the pain of this bottleneck. You build a RAG system for a niche domain, let’s say, analyzing proprietary semiconductor manufacturing logs. To evaluate that system properly, you need a senior semiconductor engineer to sit down and read thousands of query-document-answer triples. |
| Sam | They have to look at the user’s question, read the log the database retrieved, read the AI’s generated summary, and manually score it. Is this log actually relevant to the query? Is the summary completely faithful to the log? |
| Alex | And that senior engineer costs $300 an hour and has significantly better things to do. |
| Sam | So you end up with maybe 50 labeled examples if you’re lucky, and 50 examples is nowhere near enough for statistical significance when you’re tweaking embedding models or chunk sizes. |
| Alex | So what do teams typically do in this situation? They usually grab off-the-shelf tools like RAGAS. |
| Sam | They do, and RAGAS is a great starting point, but it relies on generalized zero-shot prompts. It effectively uses GPT-4 to ask, "Is this retrieved text relevant to this question?" |
| Alex | Which goes back to the bias problem. |
| Sam | The ARES authors argue that these static heuristics don’t adapt well to domain shifts. A zero-shot prompt that works perfectly for checking relevance in Wikipedia articles might fail miserably when checking relevance in messy, OCR-scanned technical manuals full of domain-specific acronyms. |
| Alex | So how does ARES solve this without bankrupting the company on expert human labeling? |
| Sam | They treat evaluation as an end-to-end data science pipeline, not just a clever prompt. It is a three-step process. Step 1 is synthetic data generation. |
| Alex | Now, I can immediately hear the listeners pausing the deep dive. Synthetic data, isn’t that just garbage in, garbage out? If the target model hallucinates and we use a model to generate the training data, aren’t we just compounding the errors? |
| Sam | That is the common, very valid fear. But ARES does something exceptionally clever. They don’t just generate positives, perfectly clean examples. They generate a comprehensive curriculum of failures. |
| Alex | A curriculum of failures. I like that phrasing. Explain how they build that. |
| Sam | Let’s focus on the three core RAG metrics: context relevance, answer faithfulness, and answer relevance. |
| Alex | Context relevance being, did the retriever fetch the right document? Answer faithfulness, did the LLM stick to the facts in that document without making things up? And answer relevance, did the final response actually answer the user’s initial question? |
| Sam | Exactly. So to generate synthetic data, they take a document from your specific proprietary corpus. They use a strong LLM (the paper used FLAN-T5 XXL, but you would likely use GPT-4 or Llama 3 today) to read the document and generate a relevant question that can be answered by it. That creates a positive pair, a good retrieval example. |
| Alex | OK, that makes sense. |
| Sam | But then they intentionally generate negatives. To simulate a retrieval failure, a context relevance failure, they take a generated question, but they swap the actual document with a completely random document from the corpus. |
| Alex | Ah, so they force a mismatch. |
| Sam | Yes, and to simulate a hallucination, a faithfulness failure, they provide the correct document, but they prompt the LLM to generate an answer that directly contradicts the facts in the text. |
| Alex | I see. So they’re procedurally generating the exact specific errors they want their evaluation metric to be able to catch. |
| Sam | They are building a massive data set of here is what a lie looks like in the context of semiconductor logs, and here is what a bad retrieval looks like. They are defining the boundary conditions of failure for the judge. |
| Alex | That leads directly to step 2 of ARES: training lightweight judges. |
| Sam | Right. This is where it gets highly efficient. They don’t use a massive, expensive model like GPT-4 for the actual evaluation at runtime, |
| Alex | because running GPT-4 over tens of thousands of logs every time you update your vector index would cost a fortune. |
| Sam | It’s unsustainable. So instead, they take that massive synthetic dataset they just generated, the positives and the intentional negatives, and they fine-tune a much smaller local model. Specifically, they use DeBERTa-v3-Large. |
| Alex | DeBERTa is an encoder-only, BERT-style model. It’s tiny compared to modern LLMs. You can literally run that on a standard CPU. |
| Sam | It’s incredibly fast and practically free to run, but here’s the magic: because it was fine-tuned entirely on your specific domain data using that synthetic curriculum of failures, it becomes an absolute specialist. |
| Alex | It becomes a sniper for finding errors in semiconductor logs, whereas GPT-4 is just a really smart generalist. |
| Sam | Exactly. The DeBERTa judge learns the specific vocabulary, the acronyms, the typical sentence structures of your data. |
| Alex | We have a fast, specialized, cheap judge, but we still have that lingering statistical doubt. It was trained entirely on synthetic data generated by an AI. How do we know we can actually trust its score when we deploy it on real messy user data? |
| Sam | That is the million-dollar question for any evaluation framework, and that is where step 3 comes in: prediction-powered inference, or PPI. |
| Alex | Let’s slow down here. PPI is the statistical core of this paper. This is the meat and potatoes for our data science audience. How does prediction-powered inference actually work? |
| Sam | PPI is a statistical method for correcting the bias of an automated model using a very small, highly trusted sample of real human data. |
| Alex | So remember that senior semiconductor engineer we talked about earlier? We still need him. |
| Sam | Yes, but instead of forcing him to label 5000 documents, which he won’t do, we ask him to label, say, 150 documents. |
| Alex | 150 is totally manageable. That’s maybe one afternoon of work. |
| Sam | Exactly. So here is the mathematical workflow. You run your fast DeBERTa judge over your massive unlabeled production set of 10,000 user logs. The judge spits out a score. Let’s say it estimates that your system has an 85% accuracy rate. |
| Alex | But we don’t trust that 85% yet because the judge might be biased by its synthetic training data, right? |
| Sam | So we take those 150 examples that the human engineer carefully labeled, the ground truth. We run our DeBERTa judge on those exact same 150 examples, and we compare them. We compare the machine’s predictions against the human’s ground truth. We see exactly where the judge agreed with the human and, more importantly, where it systematically disagreed. We use this comparison to calculate a rectifier. |
| Alex | A rectifier, like a calibration weight? |
| Sam | Essentially, yes, it calculates the error rate or the specific directional bias of the judge on that ground truth sample. |
| Alex | You are calibrating your cheap instrument against a highly accurate, expensive instrument. |
| Sam | Precisely. And PPI uses that rectifier, along with some pretty elegant statistical theory leveraging things like the central limit theorem, to construct a confidence interval. |
| Alex | So instead of just blindly reporting 85% accuracy, what do you report? |
| Sam | It allows you to go to your engineering lead and say, "Based on the automated judge’s score on the 10,000 items, and mathematically calibrated by its observed error on the 150 human-labeled items, we are 95% confident that the true accuracy of the system is between 82% and 86%." |
| Alex | That is the holy grail for ML engineering. You aren’t just giving a point estimate. You are providing a bounded metric with mathematically sound error bars, |
| Sam | and that is what actually allows you to deploy with confidence. If the lower bound of your confidence interval, that 82%, is above your business’s safety threshold, you ship the model. |
| Alex | If it’s below, you go back to the drawing board. It transforms evaluation from a game of vibes into rigorous risk management, |
| Sam | and the empirical results from the ARES paper backing this up were incredibly strong. |
| Alex | What kind of improvements did they see over the baselines? |
| Sam | By utilizing PPI, ARES required 78% fewer human annotations than traditional fine-tuning methods to reach the exact same level of statistical confidence. |
| Alex | That is a massive reduction in labeling costs. |
| Sam | And crucially, it significantly outperformed generalized tools like RAGAS in accurately ranking different RAG systems. When researchers intentionally shifted domains, like generating the synthetic training data on Wikipedia articles but then evaluating the system on dense scientific papers, ARES adapted seamlessly |
| Alex | because you just run the synthetic generation step on the new scientific corpus. |
| Sam | Exactly. The static handwritten prompts in RAGAS failed to adapt to the new vocabulary, but ARES just built a new specialist judge for the new domain. |
| Alex | The tactical takeaway for anyone building RAG right now is clear. Don’t just write a massive grading prompt and hope for the best. Treat evaluation as a system. Build a dataset generator. Train a lightweight metric, and calibrate it with PPI. |
| Sam | It definitely sounds like a lot of work up front, but this is what true evaluation engineering looks like. You are building a machine to measure the machine, and once it’s built, it scales infinitely. |
| Alex | Let’s move to level 3, the frontier. We’ve done chatbots, we’ve done RAG. Now let’s talk about agents and long-form content generation. |
| Sam | This is arguably the hardest unsolved problem in the space right now. |
| Alex | We are looking at the PROXYQA paper. Let’s set the scenario. You have an agentic workflow. You prompt it: "Write a comprehensive deep-dive market research report on the economic impact of the 2024 Olympics on the Parisian real estate market." |
| Sam | The agent is going to do some research and generate a 2,000, maybe 3,000-word output. |
| Alex | There’s no single correct string of text to compare that to. |
| Sam | None. Standard NLP metrics like ROUGE or BLEU, which basically just measure n-gram overlap between a generation and a reference text, are completely useless here |
| Alex | because they measure the specific phrasing, not the underlying facts. |
| Sam | Right. If the reference says "the budget was large" and the AI says "the expenditures were massive," ROUGE scores that very poorly because the words don’t match, even though the semantic meaning is identical. |
| Alex | And as we discussed earlier, falling back to a vibe check is incredibly dangerous here because the generated report might flow beautifully. The grammar is flawless. It sounds highly authoritative, |
| Sam | but it might completely miss the single most important statistic about infrastructure spending or hallucinate the inflation rate. |
| Alex | Fluent but vacuous. |
| Sam | Exactly. So the PROXYQA researchers proposed a radically new framework. They argue that we need to stop trying to evaluate the text itself and start evaluating the information contained inside the text. |
| Alex | How do they mechanically separate the two? |
| Sam | They break the evaluation down into what they call atomic facts. They start with the user’s high-level prompt, the meta-question: "Write a report on the Paris Olympics." Then they use an LLM to generate a predefined list of proxy questions based on that topic. These are highly specific, factual, boolean, true or false questions that simply must be answerable if the generated report is truly comprehensive. |
| Alex | Give me a concrete example for the Paris Olympics report. |
| Sam | A proxy question might be: "True or false, did the transport infrastructure projects finish ahead of schedule?" Or, "True or false, did the total operating budget exceed €8 billion?" Or, "True or false, does the report explicitly mention the impact of the games on short-term Airbnb rental prices?" |
| Alex | I see the genius in that. They’re taking a highly subjective, open-ended essay prompt and converting it into a rigid checklist of objective facts. |
| Sam | Precisely. They convert a generative task, which is notoriously hard to evaluate, into a classification task, which is trivial to evaluate. |
| Alex | Walk us through the actual operational workflow of PROXYQA. |
| Sam | Step one, your agent generates the massive long-form report. Step two, you pass that generated report to an evaluator model, again, usually something robust like GPT-4. |
| Alex | Step 3. |
| Sam | Step 3, you give that evaluator model the predefined list of proxy questions. And step 4, you prompt the evaluator: "Based strictly and entirely on the text provided in this report, answer these true or false questions. If the information required to answer a question is missing from the text, mark it as not found." |
| Alex | So the final evaluation score isn’t a nebulous 7 out of 10 for narrative flow. The final score is a hard percentage: this report successfully contained the necessary information to answer 85% of the expected factual proxy questions. |
| Sam | Exactly. It specifically measures informativeness and knowledge coverage. It effectively calculates the information density of the final output. |
| Alex | This is so actionable for data scientists. Think about the debugging cycle. If your agent’s score suddenly drops from 90% to 60%, you don’t just stare at a block of text wondering what went wrong. You can look at the specific proxy questions that failed. |
| Sam | Right. You can say, "Oh, look at this, we are consistently missing all the proxy questions related to financial metrics. Our agent is good at narrative, but bad at numbers. We need to go adjust the retrieval weights for our financial documents in the vector database." |
| Alex | It makes the qualitative explicitly quantitative. |
| Sam | And the empirical findings from the PROXYQA paper were really revealing, especially regarding the current state of the models we use today. |
| Alex | Let’s dive into those findings. Let’s talk about the gap between open source and proprietary models on this specific metric. |
| Sam | It was a stark contrast. The researchers tested major open-source models like Llama 2 13B and Vicuna against proprietary models like GPT-4 on these complex long-form tasks. |
| Alex | And how did Llama perform? |
| Sam | Well, the Llama models often generated very long reports. They produced a massive amount of text, |
| Alex | Verbosity bias rearing its head again. |
| Sam | Exactly. But when they ran the PROXYQA evaluation, when they actually checked for facts, the scores were dismally low. The open-source models were just |
| Alex | yapping. |
| Sam | They were filling the space with fluff, structural repetition, and vague generalities. |
| Alex | Sounds smart without actually saying anything. |
| Sam | The information density was terrible. Conversely, GPT-4 Turbo often generated significantly shorter reports, but it hit a much higher percentage of the proxy questions. It was denser. It actually respected the user’s time and delivered the facts. |
| Alex | But the real winner in their benchmarks wasn’t just a raw LLM. It was RAG, right? |
| Sam | Specifically, web-augmented RAG. They tested Bing Chat, which has real-time search capabilities. It completely dominated the benchmark, |
| Alex | even beating GPT-4 Turbo. |
| Sam | Yes, even GPT-4 Turbo, when it was cut off from the web and had to rely entirely on its internal weights, struggled on the really hard, obscure proxy questions. |
| Alex | It proves that for long-form, comprehensive content generation, parametric memory, meaning what the model memorized during its pre-training phase, is simply not enough. You absolutely need active external retrieval. |
| Sam | Which ties us beautifully right back to the ARES paper. |
| Alex | Right. To get a high PROXYQA score at the agent level, you inherently need a high ARES score at the retrieval level. If your context retrieval is bad, your agent’s report will inevitably miss facts. |
| Sam | It’s a full-stack evaluation strategy. The LMSYS paper tests your foundational model’s basic conversational and logic ability, ARES tests your system’s retrieval and grounding layer, and PROXYQA tests your agent’s ability to synthesize that information and achieve comprehensive coverage of a topic. |
| Alex | You really can’t ignore any layer of that stack if you want a reliable product. |
| Sam | You can’t. They are interlocked. |
| Alex | I want to take a moment to synthesize all of this for the listeners. We have thrown a massive number of acronyms, statistical methods, and frameworks at you today. If I am a lead data scientist sitting at my desk right now listening to this, and I want to overhaul my team’s evaluation strategy by next quarter, what is the step-by-step roadmap? |
| Sam | I think it comes down to three very concrete, actionable directives. First, sanitize your judge. |
| Alex | If you are currently using a simple LLM-as-a-judge prompt, fix it today. |
| Sam | Yes, you must immediately implement the swap check from the LMSYS paper. Compare answer A to answer B, and then force the system to compare B to A. If the judge disagrees with itself, throw the result out. It’s noise. |
| Alex | And update your system prompts to explicitly penalize verbosity. That is the lowest-hanging fruit. |
| Sam | Absolutely. Step two, for your RAG pipelines, you need to transition aggressively towards synthetic data generation. |
| Alex | Stop waiting for your actual users to generate enough logs and stop waiting for your product managers to label massive Excel sheets. |
| Sam | Use a strong frontier model to generate your own golden sets from your existing documents. Generate the positive examples and intentionally generate the negative failures. Build that curriculum of failure. |
| Alex | And implement prediction-powered inference. The Python libraries for PPI are open source and readily available. It allows you to take a tiny human-labeled sample and project it onto your massive unlabeled dataset with mathematical confidence intervals. |
| Sam | Exactly. And step three, if you are building multi-step agents, adopt the atomic fact mindset from PROXYQA. |
| Alex | Stop trying to evaluate the final output text as a monolithic block. |
| Sam | Define what success actually looks like in terms of specific, extractable bits of information. Even if you just sit down and manually write 50 true/false proxy questions for your core test set, that is infinitely more valuable than doing a vague vibe check on the final paragraph. |
| Alex | Listening to this roadmap, it really seems like the fundamental role of the data scientist is shifting dramatically. We used to be model trainers. We spent our time tuning hyperparameters, tweaking learning rates, designing architectures. But now, since the foundational models are largely commodities we just pull off the shelf via an API, our job is becoming that of evaluation architects. |
| Sam | I think that is an incredibly accurate assessment. The competitive moat for a company is no longer the foundational model itself. Everyone has access to GPT-4 or Llama 3. The real moat is your organizational ability to rigorously measure whether that model is actually working for your specific, highly specialized use case. |
| Alex | The engineering teams that can scientifically measure their system’s performance using PPI, synthetic failures, and proxy tasks are the ones who will be able to iterate safely and win the market. |
| Sam | And the teams that keep relying on vibe checks are going to hit a performance ceiling they simply can’t break through, because they won’t even have the metrics to tell them why they are failing. |
| Alex | They’ll just be tweaking prompts randomly in the dark. Maybe if I type “you are a highly intelligent expert” in all caps, the error rate will go down. |
| Sam | That’s not computer science. That’s superstition. Papers like ARES and PROXYQA give us the engineering tools to move past that. |
| Alex | I want to end today’s deep dive on a bit of a provocation. We talked earlier about self-enhancement bias in the LMSYS paper, right? |
| Sam | Models preferring their own style. |
| Alex | In the PROXYQA paper, the researchers noted a similar phenomenon. They found that GPT-based evaluators were suspiciously overconfident when grading answers that were generated by GPT models. |
| Sam | Yes, this is what we call the closed loop danger. |
| Alex | Think about the implications of the stack we just outlined. If we use GPT-4 to generate the synthetic training data in ARES, and we use GPT-4 to power the agent writing the report in PROXYQA, and then we use GPT-4 as the final judge to score that report, aren’t we just building a massive, expensive hall of mirrors? |
| Sam | We are at extreme risk of creating a monoculture of intelligence. If the judge model and the student model share the exact same underlying architecture and the exact same pre-training data, they fundamentally share the exact same blind spots. |
| Alex | If GPT-4 is inherently bad at a certain type of spatial reasoning, it won’t even notice when the student model makes a massive spatial reasoning error. It will just wave it through. |
| Sam | It creates a false sense of perfection. The metrics look great. But the system is actually broken in ways the AI simply cannot perceive. |
| Alex | So what is the ultimate solution to the closed loop? |
| Sam | Diversity in your evaluation stack. Model ensembling. |
| Alex | Don’t just rely on one provider. |
| Sam | Exactly. If you use GPT to generate the data, use Claude to judge it. Use a locally fine-tuned Llama model for the proxy extraction. Break the architectural loop. |
| Alex | And most importantly, keep the human in the loop. |
| Sam | Always. Through methods like prediction-powered inference, the human provides the fundamental anchor to reality. |
| Alex | The human sets the semantic standard, and the AI simply scales the measurement of that standard. |
| Sam | If you cut the human anchor entirely, you inevitably drift into statistical noise. You need that small, highly curated sample of ground truth to keep the entire automated system tethered to reality. |
| Alex | That is a very powerful place to leave it. We’ve covered a massive amount of ground today. We’ve moved from the vulnerability of “it looks good to me” to the engineering rigor of “we are 95% confident within a 3% margin of error based on atomic fact coverage.” |
| Sam | It is a difficult journey to build that infrastructure, but it’s the only way to build reliable AI applications in production. |
| Alex | To our listeners, the data scientists and evaluation engineers out there, we will link all three seminal papers in the show notes: Judging LLM-as-a-Judge, ARES, and PROXYQA. |
| Sam | Read them. |
| Alex | They’re required reading. Stop doing vibe checks. Start building real evaluation pipelines. Keep digging deeper. |
| Sam | Keep digging. |
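
Two of the roadmap steps from the discussion, the swap check and atomic proxy-question scoring, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation from either paper: `judge` is a hypothetical callable wrapping whatever LLM you use and returning `"first"`, `"second"`, or `"tie"`; `toy_judge` is a deliberately biased stand-in so the example runs offline; and the proxy questions and answers are invented.

```python
from typing import Callable, Dict, Optional

# Hypothetical judge signature (an assumption for this sketch):
# (question, first_answer, second_answer) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def swap_check(judge: Judge, question: str, answer_a: str, answer_b: str) -> Optional[str]:
    """Judge both orderings; return "A", "B", or "tie" if consistent, else None."""
    forward = judge(question, answer_a, answer_b)   # A shown first
    backward = judge(question, answer_b, answer_a)  # B shown first

    # Map positional verdicts back to the underlying answers.
    forward_winner = {"first": "A", "second": "B", "tie": "tie"}[forward]
    backward_winner = {"first": "B", "second": "A", "tie": "tie"}[backward]

    # Disagreement means the verdict depended on position: discard it as noise.
    return forward_winner if forward_winner == backward_winner else None


def proxy_coverage(predicted: Dict[str, bool], gold: Dict[str, bool]) -> float:
    """Fraction of true/false proxy questions answered correctly from a report."""
    if not gold:
        raise ValueError("need at least one proxy question")
    # A question the evaluator could not answer (missing key) counts as wrong.
    correct = sum(1 for q, truth in gold.items() if predicted.get(q) == truth)
    return correct / len(gold)


def toy_judge(question: str, first: str, second: str) -> str:
    """Offline stand-in: prefers the longer answer, and the first slot on a length tie."""
    return "first" if len(first) >= len(second) else "second"


if __name__ == "__main__":
    # Equal-length answers: the toy judge just picks slot one, so the check rejects it.
    print(swap_check(toy_judge, "q", "abcd", "wxyz"))
    # The longer answer wins in both orderings, so the verdict is kept.
    print(swap_check(toy_judge, "q", "short", "a much longer answer"))

    gold = {"Did revenue grow?": True, "Was the CEO replaced?": False, "Is margin above 10%?": True}
    predicted = {"Did revenue grow?": True, "Was the CEO replaced?": False}
    print(round(proxy_coverage(predicted, gold), 3))
```

Note that the toy judge passes the swap check in the second case only because it has a verbosity bias. The swap check removes position noise; it does not remove the other biases discussed in the session, which is why the hosts pair it with prompts that penalize verbosity.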
Presentations
Notebooks
- Response Evaluation: compares the RAG responses to given questions against the ground-truth answers
- Example notebook RAG_Evaluation
- Example code in response_evaluation
- Retrieval Evaluation: assesses the text-chunk retrieval and ranking of the RAG system
- Example notebook RAG_Retrieval_Evaluation
- Example code in retrieval_evaluation
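
As a rough illustration of what retrieval and ranking evaluation can measure, the sketch below computes two common ranking metrics, hit rate at k and mean reciprocal rank (MRR). It is not taken from the notebooks listed above; the function names, chunk IDs, and relevance labels are all hypothetical.

```python
from typing import Sequence, Set


def hit_rate_at_k(ranked: Sequence[Sequence[str]], relevant: Sequence[Set[str]], k: int) -> float:
    """Share of queries with at least one relevant chunk among the top k results."""
    if not ranked:
        raise ValueError("need at least one query")
    hits = sum(
        1 for results, rel in zip(ranked, relevant)
        if any(chunk_id in rel for chunk_id in results[:k])
    )
    return hits / len(ranked)


def mean_reciprocal_rank(ranked: Sequence[Sequence[str]], relevant: Sequence[Set[str]]) -> float:
    """Average of 1/rank of the first relevant chunk (0 when none is retrieved)."""
    if not ranked:
        raise ValueError("need at least one query")
    total = 0.0
    for results, rel in zip(ranked, relevant):
        for position, chunk_id in enumerate(results, start=1):
            if chunk_id in rel:
                total += 1.0 / position
                break  # only the first relevant chunk counts
    return total / len(ranked)


if __name__ == "__main__":
    # Hypothetical retrieved chunk IDs per query, best match first.
    ranked = [["c1", "c7", "c3"], ["c4", "c2", "c9"], ["c5", "c6", "c8"]]
    # Hypothetical ground-truth relevant chunk IDs per query.
    relevant = [{"c1"}, {"c9"}, {"c0"}]

    print(hit_rate_at_k(ranked, relevant, k=2))
    print(hit_rate_at_k(ranked, relevant, k=3))
    print(mean_reciprocal_rank(ranked, relevant))
```

Hit rate only asks whether a relevant chunk appears at all, while MRR also rewards ranking it near the top, so reporting both separates retrieval failures from ranking failures.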
Reading
Additional Resources
- ARES is a framework for evaluating Retrieval-Augmented Generation (RAG) models.
- G-Eval is a framework that uses LLMs with chain-of-thought (CoT) prompting to evaluate LLM outputs against any custom criteria. DeepEval describes G-Eval as its most versatile metric, capable of evaluating almost any use case with human-like accuracy.
- RAGAS is a library that provides tools to supercharge the evaluation of Large Language Model (LLM) applications.