Data Science on Agentic System

The focus of this session is on managing agentic AI systems using established data science methodologies rather than treating them as unique technical mysteries. We explore how techniques such as funnel analysis, queueing theory, and statistical process control can be directly applied to monitoring and evaluating intelligent systems.

Through a detailed mapping of classic analytics to AI-specific challenges - including classifying agent failure modes, optimizing retrieval quality, and detecting performance degradation - we illustrate how familiar analogies like clickstream data and call center operations translate to instrumenting execution logs and designing controlled experiments.

Manage Agentic AI with Traditional Analytics

Transcript

Speaker	Text
Alex	This is the brief on mapping traditional data science to agentic AI. You know, managing intelligent agents doesn’t actually require brand new math. It’s really just a business process where your existing analytics toolkit applies perfectly. So let’s dive right into the three ways we’re going to use what we already know. First up, treat agent logs as event data. Think of an agent’s execution path like tracking a shopper’s customer journey on a website. A task completion rate is literally just your conversion rate. Grab a Sankey diagram to visualize exactly where the agent drops off. Second, we’ve got to move to performance monitoring overtime. Agent systems degrade silently with no error codes, just worse answers. So how do you know your agent got worse before your users start complaining? Well, we use statistical process control and change point detection to catch task distribution drift, exactly like tracking KPI erosion or concept drift in business forecasting. Finally, let’s shift from tracking quality to managing system speed. Treat multi-agent coordination as operations analytics, because a multi-agent pipeline is simply a queuing network. It’s basically a digital call center or a hospital triage. Just apply Little’s law. Bottleneck analysis to fix those latency and load balancing issues. The bottom line is, you already have the tools to evaluate, monitor, and manage AI. You just need to map them to this new domain. Stop looking for a magic AI metric and start putting your classic analytics toolkit to work.

Speaker

Text

Alex

This is the brief on mapping traditional data science to agentic AI. You know, managing intelligent agents doesn’t actually require brand new math. It’s really just a business process where your existing analytics toolkit applies perfectly. So let’s dive right into the three ways we’re going to use what we already know. First up, treat agent logs as event data. Think of an agent’s execution path like tracking a shopper’s customer journey on a website. A task completion rate is literally just your conversion rate. Grab a Sankey diagram to visualize exactly where the agent drops off. Second, we’ve got to move to performance monitoring overtime. Agent systems degrade silently with no error codes, just worse answers. So how do you know your agent got worse before your users start complaining? Well, we use statistical process control and change point detection to catch task distribution drift, exactly like tracking KPI erosion or concept drift in business forecasting. Finally, let’s shift from tracking quality to managing system speed. Treat multi-agent coordination as operations analytics, because a multi-agent pipeline is simply a queuing network. It’s basically a digital call center or a hospital triage. Just apply Little’s law. Bottleneck analysis to fix those latency and load balancing issues. The bottom line is, you already have the tools to evaluate, monitor, and manage AI. You just need to map them to this new domain. Stop looking for a magic AI metric and start putting your classic analytics toolkit to work.

Deep Dive

Forcing Structured Outputs with Constrained Decoding

Transcript

Speaker	Text
Alex	You know, it really wasn’t that long ago that building an AI data pipeline felt like, um, like negotiating with a highly unpredictable hostage taker. Oh
Sam	yeah, completely. It was stressful.
Alex	I mean, I’m talking about the dark ages of prompt engineering, which, if you are actively building agentic AI solutions right now, was essentially just last year,
Sam	right? It really was just yesterday in tech
Alex	time, exactly. You’d write this, you know, beautifully optimized, complex data pipeline. You’d feed your contests into the LLM, and you’d literally beg it in the system
Sam	prompt. So the begging return strictly valid
Alex	JSON. Yes, no preamble, no conversational filler. My job depends
Sam	on this, and you just cross your fingers and hit run,
Alex	right? And the model would process your request, generate this perfect data structure, and then right before the opening bracket politely append, sure. Here’s the JSON you requested, which
Sam	instantly detonates your entire pipeline.
Alex	Yeah, a total crash, just a parser error and everything breaks. Or,
Sam	you know, it would decide to get creative with your SQL query, like hallucinating a missing comma or um inventing an entirely new column out of thin air
Alex	because probabilistically that column just looked like it belonged there.
Sam	Exactly. We were spending half our compute cycles just running rejects patches. Endless string manipulation scripts just trying to force probabilistic text generators to behave like, well. Deterministic software components, which they just aren’t. No, they aren’t. The friction was immense and the failure rates in production were just totally unacceptable for any serious grad student or developer.
Alex	And that is exactly the mission for today’s deep dive. We are taking a stack of lecture materials, research notes, and API documentation on generative LLMs and structured outputs, and we are unpacking. How this problem was actually solved. It’s
Sam	a really fascinating evolution, honestly. It
Alex	really is. We’re going to get under the hood of text generation. We’ll explore how the math of constrained decoding finally guarantees valid syntax and look at the massive architectural differences between deploying on OpenAI versus, you know, local inference on a llama,
Sam	which is a huge trap for a lot of people building local agents right now. Oh,
Alex	massive trap. We’ll also establish the scheme of design best practices you need for type-safe agents. But I mean, to appreciate the cure, we first really have to understand the disease, right?
Sam	Why do they fail out of the box?
Alex	Exactly? Why do LLMs fail so spectacularly at structured output natively?
Sam	Well, it really comes down to confronting the mathematical reality of token by token generation. Let’s break down that autoregressive process. OK, let’s unpack this. Walk us through it. So when you send a prompt, The model doesn’t like think of the entire response as one cohesive document. It tokenizes your input, converts it to embeddings, and passes those through its transformer layers, right? The standard forward pass, exactly. And the final output layer computes what we call logits. This is a raw, unnormalized vector of scores representing the entire vocabulary of the model. And
Alex	for modern models, that vocabulary is what, massive,
Sam	huge, anywhere from 50,000 to 100,000 distinct tokens.
Alex	OK, wait, so every single generation step, the model is computing a raw score for all 100,000 possible next
Sam	tokens. Correct. Every single time it generates a chunk of text. And those raw logics are then passed through a soft max function
Alex	which converts them into probabilities, right,
Sam	right, it converts them into a probability distribution that sums exactly to one. So given all the tokens generated so far, the model calculates the probability of what comes next. OK, I’m with you. But how it actually chooses the winning token from that massive distribution depends entirely on the decoding strategy you configure, right,
Alex	because we have all these sampling parameters we tweak in our API calls. Like you have greedy decoding, which is just the model statically picking the absolute highest probability token every single time,
Sam	exactly zero creativity.
Alex	And then you have temperature, and temperature isn’t just a, you know, a randomness style, right? It’s mathematically altering the logits before the soft max function.
Sam	That’s a great way to put it. It shifts the math,
Alex	right? So if you set the temperature below one, you are sharpening the distribution, making the likely tokens even more dominant. If you set it above one, you are flattening it, making lower probability tokens more viable.
Sam	That is a crucial distinction. And beyond temperature, you have the sampling filters. Like Top K.
Alex	Remind us how top K
Sam	works
Alex	again. Sure,
Sam	Top K truncates the list to only consider the K most likely tokens, and it just discards the rest of the tail, so you’re ignoring the really unlikely stuff entirely.
Alex	Got it. And then there’s topec or nucleus sampling. Yeah,
Sam	Top K dynamically samples from a subset of tokens whose cumulative probability hits your threshold, say 90%.
Alex	OK, and I’ve seen minf popping up a lot lately too.
Sam	Yeah, the newer MIFO strategy filters out. Any tokens that fall below a specific percentage of the most likely tokens probability. It’s really effective.
Alex	So when you look at all of these strategies, you know, greedy, temperature, top up, they’re really just different mathematical flavors of educated guessing.
Sam	That’s exactly what they are weighted dice rolls,
Alex	which perfectly explains why unconstrained generation fails for strict data structures. Working with a standard LLM is like, um, it’s like working with a highly educated improv actor.
Sam	I love that analogy,
Alex	right, because they are brilliant at keeping the scene going based on the context, and they can pattern match the rhythm of a conversation perfectly, but they are absolutely terrible at filling out a strict tax return without supervision.
Sam	Yeah, because they just go with the vibe. The model has seen millions of JSON files in its training data, so it knows the general vibe of JSON,
Alex	but
Sam	it
Alex	has no hard mathematical guarantee of the syntax.
Sam	Exactly. It will happily mix types. Like giving you the key age with the value 25 spelled out as a string instead of the integer 25,
Alex	just because the word 25 was probabilistically likely in that specific linguistic context, right?
Sam	It lacks structural awareness entirely. It operates strictly on statistical associations. It doesn’t actually know what a JSON schema is.
Alex	It just knows that an opening brace is usually followed by a quotation mark.
Sam	Yep. So here’s the million dollar question for the developers listening. If the model is fundamentally just rolling weighted dice to pick the next token out of 100,000 options, how do we physically force it to only pick the ones that form valid
Alex	code? OK, yes. How do we do that? Because begging didn’t work.
Sam	Begging definitely didn’t work. This is where we introduced the mechanism of constrained decoding. If probabilistic guessing is the disease, constrained decoding is, is the cure. I like the sound of that. Instead of letting the model sample freely from the entire post-op max vocabulary, we introduce a formal grammar. Think of it as a finite state machine that runs parallel to the generation process.
Alex	OK, a finite state machine, so it’s tracking the state of the output at every step.
Sam	Exactly. At step T, this state machine evaluates your schema and the partial output generated so far. Then it determines exactly which tokens are syntactically lethal to generate next.
Alex	I like to think of it like a mechanical template placed over a piano keyboard. Oh, that’s a good visual. Like if you are only allowed to play the notes in a specific chord, the template physically locks down all the wrong keys. So even if the pianist, or in this case the LLM tries to mash a wrong key because it probabilistically feels right. The key simply won’t depress.
Sam	That is perfectly stated. It just blocks the action.
Alex	But what is the actual math happening under the hood to lock down those keys?
Sam	What’s fascinating here is the mathematical trick. It happens right at the logic computation phase before the softmax function is applied. OK, before softmax, right? The constrained decoding engine identifies all the syntactically invalid tokens for that specific step. It then masks their probabilities by setting their raw logic scores to negative infinity.
Alex	Oh wow, negative infinity, because when you pass negative infinity through an exponential softmax function, it exponentiates to exactly 0,
Sam	precisely 0, not a tiny fraction, but actual 0. Those invalid tokens now have a mathematically 0% chance of being sampled,
Alex	so they’re completely off the table, yep.
Sam	The remaining valid tokens, the keys that aren’t locked down, were then renormalized, so there are probabilities sum back to one. And
Alex	then the model just samples from that restricted pool using your standard temperature or top settings.
Sam	Exactly. The benefit is absolute. You get guaranteed syntax validity and you can finally throw away your rejects postprocessing scripts. Wait,
Alex	I need to push back on the efficiency of this though. You’re saying we’re running a complex finite state machine against a vocabulary of 100,000 tokens at every single generation step. If my model outputs 500 tokens, I’m doing that grammar evaluation 50 million times. Doesn’t that introduce like Massive computational overhead and latency. It absolutely does. Especially for listeners who are building local agents and counting every millisecond of time to first token. That sounds incredibly heavy.
Sam	It is a significant computational bottleneck, and you’ve hit on one of the major engineering challenges in the space. Building the finite state machine from a complex JSON schema takes upfront compute, and
Alex	evaluating the mask at every step adds latency.
Sam	Right. Frameworks mitigate this through heavy caching. If the state machine knows that the next 15 tokens must be a specific string, it can cache that path. But the overhead is real, so it’s not a free lunch. Definitely not. And it gets even more complicated when you factor in subword tokenization. Wait,
Alex	how so? Why does subword tokenization mess it up? Well,
Sam	the grammar engine is evaluating bytes or characters, right? But the LLM generates tokens, which are often chunks of characters grouped together. Let’s say your schema strictly requires the key status. The finite state machine is looking for a quotation mark, but the LLM’s vocabulary might not output a standalone quotation mark. Oh, I see where this is going. It’s most likely token might be a combined subword like an opening brace attached to an S and a T. The grammar engine sees a token starting with a brace instead of a panics and masks it to negative infinity. Ah,
Alex	so the token boundaries don’t cleanly align with the syntax boundaries of JSON. The FSM accidentally bans the correct path because the LMN tried to output it as a multi-character check.
Sam	Exactly the problem. Modern constrained decoding engines like Outlines or the one built in Lamaat CPP have to do incredibly complex look-ahead operations just
Alex	to figure. If a subword token contains the valid byte sequence. Yeah,
Sam	it is computationally expensive, but it is really the only way to achieve deterministic structure.
Alex	Here’s where it gets really interesting, because now we have to talk about how the major APIs actually expose this underlying math to developers. And right now there is a massive point of confusion in the ecosystem between older JSON mode and true structured outputs.
Sam	Complating those two is a dangerous trap. It really is.
Alex	Break it down for us. What is the actual difference?
Sam	Well, JSON mode, which you typically enable by passing response format, equals type JSON object in OpenAI’s API, merely guarantees that the final string will successfully parse as JSON. That’s it. Just that it parses, that is the extent of the promise. It is a legacy feature supported by older models like GPT 3.5 turbo. It does not guarantee schema adherence, so
Alex	it doesn’t guarantee your field names are correct.
Sam	Nope, it doesn’t enforce strict typing. And it absolutely allows the model to hallucinate entirely new unexpected fields, right?
Alex	So Jason mode is basically like ordering a sandwich at a deli. You have a structural guarantee that you are going to get something between two pieces of bread, but the fillings are a total surprise.
Sam	Exactly. You might get turkey, you might get an old shoe.
Alex	But structured output, on the other hand, which uses type JSON schema, is like a highly specific legally binding catering contract. Yes.
Sam	It utilizes the actual schema aware negative infinity constraint decoding we just broke down. It guarantees that required fields are present. The integer types aren’t secretly strings. And no extra fields are hallucinated,
Alex	but you have to use newer models for it, right, like GPT-40 and many are the newer GP 240 releases, correct?
Sam	So for an engineer setting up their API calls right now, the question becomes, is there any valid reason to ever use the old JSON load again, right? Why would anyone use it unless you are forced to use a legacy model for compliance reasons or your task. Just incredibly open-ended,
Alex	like a chatbot returning deeply varied, unstructured configurations. Yeah,
Sam	where you can tolerate structural variants. If not, the answer is no. If you need predictability for a typed backend, structured outputs are the baseline.
Alex	But this raises a critical issue about your deployment environment because you might be prototyping. The privacy first agents, you are almost certainly using a llama,
Sam	and the difference in how these two APIs handle structured output is a serious lesson in defensive engineering.
Alex	It is a massive gotcha for developers. Let’s trace exactly what happens when you call an API, demand a strict JSON schema. But the underlying model you were targeting doesn’t actually support constrained decoding. OK,
Sam	so if you are using OpenAI and you request structured outputs on an unsupported model, the API acts as a strict bouncer. It outright rejects the request.
Alex	You get a hard 400 bad
Sam	request error,
Alex	exactly,
Sam	stating the model does not support structured outputs,
Alex	which is fantastic. That is exactly what you want in software engineering. Fail loudly, fail cleanly, and fail immediately so I can catch the exception.
Sam	But alama operates very differently. A llama uses a generic format parameter. If you use a supported model like llama 3.1, a llama brilliantly converts your JSON schema into a grammar
Alex	and runs that internal constraint decoding runtime with the negative infinity masking. Yep,
Sam	and it works beautifully. But if you pass that same strict schema to a smaller or older open source model in a llama. That doesn’t support structured output.
Alex	What happens?
Sam	A llama silently ignores the constraint.
Alex	Wait,
Sam	it just pretends you didn’t
Alex	ask.
Sam	It completely ignores it. It falls completely back to untrusted, probabilistic, Jason-ish output.
Alex	Oh, that is terrifying. If a llama fails silently like that, you could deploy an agent thinking it’s completely. Type safe only to have it blow up in production with a downstream JSON decode error.
Sam	All because the model randomly decided to inject a conversational preamble and a llama just let it happen.
Alex	So if the inference engine won’t protect us, how should developers proactively defend against this silent failure when running local models?
Sam	Architecturally, you have to treat a llama’s schema parameter as a hint, not a guarantee. Unless you have explicitly verified the model’s capabilities in the release notes, OK, treat it as a hint. You defend against this by wrapping your API client code in strict validation layers, typically using a library like Pedantic in Python. But you don’t just validate and crash, you build an automatic retry loop.
Alex	OK, walk me through the mechanics of that retry loop. What does that architecture actually look like in code?
Sam	Sure, you fire the generation request to a llama. You take the raw string response and pass it to your pedantic model using model validate JSON. And
Alex	if a llama silently failed and the model hallucinated a bad type,
Sam	pedantic throws a validation error. You catch that exception. You extract the exact string representation of the error.
Alex	For example, field age expected integer got string.
Sam	Exactly. You then format a new user prompt. Saying your previous output was invalid. Here’s the exact system error, and you insert that error. Please correct the
Alex	JSON, and you just append that to the message history and call the model again.
Sam	Yes, you typically loop this up to 3 times before falling back to a human in the loom or heart failure.
Alex	That makes total sense. So we have the mechanism to dodge the silent fallbacks, but you know, a validation loop is only as good as the instructions we give it to validate against, true, which brings us to the actual art of crafting JSON schemas for these agents, because it’s not just a matter of dumping your massive post-Gresswold database schema into the prompt and hoping for the best.
Sam	No, schema designed for constrained decoding is a very specific discipline. The golden rule from the scose material is keep it flat and shallow,
Alex	flat and shallow. Why is that so important?
Sam	Deeply nested jason trees, where an object contains a list of objects which contain more objects, they choke the constrained decoder. It creates too many branching paths for the finite state machine to evaluate efficiently at generation time, which spikes your latency. Exactly. You want explicit types, rigid boundaries like minimum and maximum lengths, and most importantly, you want to use enums for fixed vocabularies.
Alex	Give me an example of the enum thing. Well,
Sam	if a status field in your backend can only be pending, approved, or rejected, use an enum in the schema. Do not let the model probabilistically guess the strength, right,
Alex	because it might guess processing and break everything. The notes also highlight one specific setting as an absolute must use setting additional properties to false.
Sam	Oh yeah, that is non-negotiable.
Alex	If you don’t explicitly set that flag in your schema definition, you are leaving the door wide open for the model to hallucinate entirely new fields that your back end has no idea how to parse,
Sam	and the constrained decoder will let them right through because you didn’t tell it not to. That flag is mandatory for type safety, but equally important to knowing how to build a schema is knowing when not to use structured output at all.
Alex	Wait, really? When should we avoid it?
Sam	If your agent is executing complex reasoning tasks that require chain of thought, where it needs to think step by step before arriving at an answer, forcing that output into a JSON string is actively harmful to the model’s intelligence.
Alex	I’ve heard this. Why does the format degrade the intelligence? Is it a compute allocation issue?
Sam	Exactly. When you force chain of thought inside a Jason string field, the model suddenly has to dedicate its limited attention mechanism and compute to escaping quotation marks, managing new line characters, and maintaining string syntax
Alex	rather than actually reasoning through the logical problem,
Sam	right? The cognitive overhead of the formatting degrades its problem solving capability.
Alex	So we shouldn’t force our agent to do its deep thinking inside a tiny constrained JSON box. What’s the architectural workaround for that?
Sam	We use the reasoning then structuring hybrid approach. You basically break the task into two steps. First, let the mater reason freely in natural language. Just a
Alex	standard unstructured text prompt.
Sam	Yeah, let it use its full auto-regressive power to think through the problem in an unstructured scratch pad. Then once it reaches a conclusion, you make a second lightweight API call with strict structured output enabled.
Alex	Uh, and you simply ask it to extract the final data points from its own natural language reasoning into your schema.
Sam	Exactly. It works beautifully.
Alex	OK, so let’s say we’ve architected everything perfectly. We’ve built a flat, shallow schema with additional properties set to false. We are using GP 4 Mini with strict constrained decoding. Sounds solid. We We have pedantic retry loops in place just in case. We force the model to follow our syntax flawlessly. Our data pipeline is bulletproof now,
Sam	right? Well, it is bulletproof against syntax errors, but we have to face the final and perhaps most difficult reality check here. Uh
Alex	oh,
Sam	syntax does not equal semantics. Ah,
Alex	the ultimate boss fight.
Sam	The source material highlights a massive real-world data point that illustrates this perfectly. Researchers analyzed over 50,000 LLM generated SQL queries.
Alex	50,000. Wow. Yeah,
Sam	and these were queries generated using structured output paradigms. The SQL syntax was mathematically flawless. The brackets matched, the keywords were correct, but
Alex	I’m guessing it didn’t
Sam	work. Semantic errors ran rampant throughout the entire data set,
Alex	meaning the code compiles perfectly, but it executes the completely wrong business logic.
Sam	Precisely. The study showed that 35 to 40% of the errors were schema misunderstandings. The LLM would generate a valid SQL query, but it would assume standard naming conventions based on its training data.
Alex	So it would guess a table was called users or revenue,
Sam	right, instead of your actual internal database name like U Darodv2, it pattern matches syntax, but it doesn’t inherently understand your specific business context. That makes total sense. Furthermore, another 25 to 30% of the errors were incorrect joined paths. The model would successfully join two tables but completely miss the required intermediate junction table, returning wildly inaccurate data.
Alex	So what does this all mean for the developers listening? I mean, constrained decoding solves the typing problem, but it doesn’t solve the understanding problem. No, it doesn’t. We haven’t created a perfect thinker. We’ve basically just elevated the LLS. From making stupid typographical errors to making highly articulate, syntactically perfect logical errors. That
Sam	is the core issue. And this is why postprocessing is never dead. It just moves up the stack.
Alex	Right? You no longer need rejects to find a missing comma.
Sam	Exactly. But you absolutely still need a semantic layer in your application. You need code that defines valid joined paths and verifies table names before execution.
Alex	And you still rely heavily on those pedantic retry loops, I imagine.
Sam	Yes, but not to fix formatting. You use them to provide explicit error feedback to the model. Like, your syntax was perfect, but this joined path does not exist in our database schema. Try again.
Alex	Wow, what a journey this has been today. We started with the sheer chaos of begging an LLM for valid JSON in the system prompt. A dark time. We trace that problem all the way down to the auto-regressive token. By token trap where models are just rolling dice across 100,000 logics. Yeah, educated guessing. We looked at the mathematical elegance and let’s be honest, the computational cost of constrained decoding using negative infinity masking to force syntax validity.
Sam	It’s heavy, but it works.
Alex	We compared the strict bounds of OpenAI’s 400 bad requests to the wild west, silent fallbacks of a llama. And we established that flat schemas, enums, and pedantic validation loops are really the holy trinity for agentic data science workflows.
Sam	If we step back and look at where this is all heading, it points to a massive paradigm shift in how we treat these models. What do you mean? Well, as constrained decoding becomes perfectly integrated at the hardware and inference level, the chaotic conversational art of prompt engineering is rapidly dying. We’re returning to rigorous software engineering, which is a relief, honestly, it is. But it makes you wonder about the future of model architecture itself. If the end goal for backend agents is purely deterministic type-safe data pipelines, will future models even need to be trained on conversational human text? Oh wow, I never thought about that. We might see a radical divergence where consumer models speak English, but backend agentic models are trained exclusively on abstract syntax trees and strict formal grammars, uh, abandoning natural language entirely. The language part of large language models might actually become obsolete for the very tasks we are building today.
Alex	That is a fascinating thought to leave on. Are we approaching a future where our AI agents don’t even know how to say hello, but they can write a flawless nested sequel join on the first try? It’s very possible. Thank you all for joining. Joining us on this deep dive. As you head back to your code editors and terminal windows, go do yourselves a favor. Audit your API parameters, build out those pedantic retry loops, and go test additional properties. False in your next agent build. Let’s make sure none of you ever have to read the words, sure, here’s the Jason you requested ever again.

Presentation

Data Science for Agentic AI

Lecture Notes

Data Science for Agentic AI: Managing, Monitoring, and Evaluating Intelligent Agent Systems

Mapping Classic Data Science Techniques to Agentic AI Management

Classic Analytics Problem	Agentic AI Equivalent	Key Technique	Business Analogy	Primary Metrics
Conversion funnel analysis	Task completion path analysis	Sankey diagrams, sequence analysis	Clickstream / Customer Journey Analytics	Task completion rate, Mean steps-to-completion, Tool call frequency, Path diversity
Concept drift in forecasting	Agent performance degradation	Change-point detection, SPC control charts	KPI Trend Monitoring / Demand Forecasting / SLA Tracking	Answer correctness, Task success rate, Latency, Token consumption
Call center queueing / workforce mgmt	Multi-agent coordination & load balancing	Little’s Law, utilization analysis, bottleneck ID	Hospital Patient Flow / Supply Chain Optimization	Utilization rate, throughput, end-to-end latency ($W$), tasks in-flight ($L$)
Survey instrument validation	LLM-as-judge calibration	Cohen’s kappa, inter-rater reliability	Social science survey validation	Krippendorff’s alpha, Cohen’s kappa, bias/calibration metrics
Search relevance optimization	Retrieval quality in RAG pipelines	NDCG, Precision@k, Recall@k	E-commerce Search / Recommendation Systems	Precision@k, Recall@k, NDCG, Similarity scores
Customer complaint classification	Agent failure taxonomy	Multi-label classification, cost-sensitive learning	Fraud Detection / Ticket Routing	Hamming loss, subset accuracy, per-label F1, Precision/Recall
A/B testing marketing campaigns	Agent configuration experiments	t-test, Mann-Whitney U, effect size estimation	Marketing Campaign Experiments / Clinical Trials	p-value, Cohen’s d (Effect size), Statistical significance
Demand forecasting	Query volume & resource planning	Time-series decomposition, capacity models	Retail Demand Forecasting	Trend, Seasonality, Residuals, GPU/Resource utilization
Customer segmentation	Agent behavioral clustering	K-means, HDBSCAN, UMAP	CRM Behavioral Cohort Analysis	Cluster membership, silhouette scores (inferred), behavioral feature vectors
Fraud / anomaly detection	Unusual agent behavior detection	Isolation Forest, LOF, DBSCAN	Network Intrusion Detection	Outlier scores, noise points