Data Science on Agentic System
The focus of this session is on managing agentic AI systems using established data science methodologies rather than treating them as unique technical mysteries. We explore how techniques such as funnel analysis, queueing theory, and statistical process control can be directly applied to monitoring and evaluating intelligent systems.
Through a detailed mapping of classic analytics to AI-specific challenges - including classifying agent failure modes, optimizing retrieval quality, and detecting performance degradation - we illustrate how familiar analogies like clickstream data and call center operations translate to instrumenting execution logs and designing controlled experiments.

Manage Agentic AI with Traditional Analytics
Transcript
| Speaker | Text |
|---|---|
| Alex | This is the brief on mapping traditional data science to agentic AI. You know, managing intelligent agents doesn’t actually require brand new math. It’s really just a business process where your existing analytics toolkit applies perfectly. So let’s dive right into the three ways we’re going to use what we already know. First up, treat agent logs as event data. Think of an agent’s execution path like tracking a shopper’s customer journey on a website. A task completion rate is literally just your conversion rate. Grab a Sankey diagram to visualize exactly where the agent drops off. Second, we’ve got to move to performance monitoring over time. Agent systems degrade silently with no error codes, just worse answers. So how do you know your agent got worse before your users start complaining? Well, we use statistical process control and change point detection to catch task distribution drift, exactly like tracking KPI erosion or concept drift in business forecasting. Finally, let’s shift from tracking quality to managing system speed. Treat multi-agent coordination as operations analytics, because a multi-agent pipeline is simply a queuing network. It’s basically a digital call center or a hospital triage. Just apply Little’s Law and bottleneck analysis to fix those latency and load balancing issues. The bottom line is, you already have the tools to evaluate, monitor, and manage AI. You just need to map them to this new domain. Stop looking for a magic AI metric and start putting your classic analytics toolkit to work. |
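
The funnel idea from the brief can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the stage names (`received`, `planned`, `tool_called`, `answered`) and the sample runs are hypothetical, and it assumes each agent run is logged as an ordered list of stage events.

```python
from collections import Counter

# Hypothetical ordered stages of an agent task; real logs will use their own names.
STAGES = ["received", "planned", "tool_called", "answered"]


def funnel(runs):
    """Return (stage, runs_reaching_stage, step_conversion_rate) for each stage."""
    reached = Counter()
    for events in runs:
        seen = set(events)
        for stage in STAGES:
            if stage not in seen:
                break  # the run dropped off before this stage
            reached[stage] += 1

    rows = []
    previous = len(runs)
    for stage in STAGES:
        count = reached[stage]
        rate = count / previous if previous else 0.0
        rows.append((stage, count, rate))
        previous = count
    return rows


# Made-up execution logs: one list of stage events per agent run.
runs = [
    ["received", "planned", "tool_called", "answered"],
    ["received", "planned", "tool_called"],
    ["received", "planned"],
    ["received", "planned", "tool_called", "answered"],
]

report = funnel(runs)
for stage, count, rate in report:
    print(f"{stage:12s} reached={count} step_conversion={rate:.2f}")

# Task completion rate plays the role of the overall conversion rate.
completion_rate = report[-1][1] / len(runs)
```

The per-stage counts are the same numbers a Sankey diagram would draw as flows, so the stage with the lowest step conversion is where the agent most often drops off.
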
Deep Dive
Forcing Structured Outputs with Constrained Decoding
Transcript
| Speaker | Text |
|---|---|
| Alex | You know, it really wasn’t that long ago that building an AI data pipeline felt like, um, like negotiating with a highly unpredictable hostage taker. |
| Sam | Oh yeah, completely. It was stressful. |
| Alex | I mean, I’m talking about the dark ages of prompt engineering, which, if you are actively building agentic AI solutions right now, was essentially just last year, right? |
| Sam | It really was just yesterday in tech time. |
| Alex | Exactly. You’d write this, you know, beautifully optimized, complex data pipeline. You’d feed your context into the LLM, and you’d literally beg it in the system prompt. |
| Sam | So the begging: “Return strictly valid JSON.” |
| Alex | Yes. “No preamble, no conversational filler. My job depends on this.” |
| Sam | And you just cross your fingers and hit run, right? |
| Alex | And the model would process your request, generate this perfect data structure, and then right before the opening bracket politely append, “Sure, here’s the JSON you requested,” |
| Sam | which instantly detonates your entire pipeline. |
| Alex | Yeah, a total crash, just a parser error and everything breaks. |
| Sam | Or, you know, it would decide to get creative with your SQL query, like hallucinating a missing comma or, um, inventing an entirely new column out of thin air |
| Alex | because probabilistically that column just looked like it belonged there. |
| Sam | Exactly. We were spending half our compute cycles just running regex patches, endless string manipulation scripts just trying to force probabilistic text generators to behave like, well, deterministic software components, which they just aren’t. No, they aren’t. The friction was immense and the failure rates in production were just totally unacceptable for any serious grad student or developer. |
| Alex | And that is exactly the mission for today’s deep dive. We are taking a stack of lecture materials, research notes, and API documentation on generative LLMs and structured outputs, and we are unpacking how this problem was actually solved. |
| Sam | It’s a really fascinating evolution, honestly. |
| Alex | It really is. We’re going to get under the hood of text generation. We’ll explore how the math of constrained decoding finally guarantees valid syntax and look at the massive architectural differences between deploying on OpenAI versus, you know, local inference on Ollama, |
| Sam | which is a huge trap for a lot of people building local agents right now. |
| Alex | Oh, massive trap. We’ll also establish the schema design best practices you need for type-safe agents. But I mean, to appreciate the cure, we first really have to understand the disease, right? |
| Sam | Why do they fail out of the box? |
| Alex | Exactly. Why do LLMs fail so spectacularly at structured output natively? |
| Sam | Well, it really comes down to confronting the mathematical reality of token by token generation. Let’s break down that autoregressive process. OK, let’s unpack this. Walk us through it. So when you send a prompt, the model doesn’t, like, think of the entire response as one cohesive document. It tokenizes your input, converts it to embeddings, and passes those through its transformer layers, right? The standard forward pass, exactly. And the final output layer computes what we call logits. This is a raw, unnormalized vector of scores representing the entire vocabulary of the model. |
| Alex | And for modern models, that vocabulary is what, massive? |
| Sam | Huge, anywhere from 50,000 to 100,000 distinct tokens. |
| Alex | OK, wait, so every single generation step, the model is computing a raw score for all 100,000 possible next tokens. |
| Sam | Correct. Every single time it generates a chunk of text. And those raw logits are then passed through a softmax function |
| Alex | which converts them into probabilities, right? |
| Sam | Right, it converts them into a probability distribution that sums exactly to one. So given all the tokens generated so far, the model calculates the probability of what comes next. OK, I’m with you. But how it actually chooses the winning token from that massive distribution depends entirely on the decoding strategy you configure, |
| Alex | right, because we have all these sampling parameters we tweak in our API calls. Like you have greedy decoding, which is just the model statically picking the absolute highest probability token every single time, |
| Sam | exactly, zero creativity. |
| Alex | And then you have temperature, and temperature isn’t just a, you know, a randomness dial, right? It’s mathematically altering the logits before the softmax function. |
| Sam | That’s a great way to put it. It shifts the math, |
| Alex | right? So if you set the temperature below one, you are sharpening the distribution, making the likely tokens even more dominant. If you set it above one, you are flattening it, making lower probability tokens more viable. |
| Sam | That is a crucial distinction. And beyond temperature, you have the sampling filters, like top-k. |
| Alex | Remind us how top-k works again. |
| Sam | Sure, top-k truncates the list to only consider the k most likely tokens, and it just discards the rest of the tail, so you’re ignoring the really unlikely stuff entirely. |
| Alex | Got it. And then there’s top-p, or nucleus sampling. |
| Sam | Yeah, top-p dynamically samples from the smallest subset of tokens whose cumulative probability hits your threshold, say 90%. |
| Alex | OK, and I’ve seen min-p popping up a lot lately too. |
| Sam | Yeah, the newer min-p strategy filters out any tokens that fall below a specific percentage of the most likely token’s probability. It’s really effective. |
| Alex | So when you look at all of these strategies, you know, greedy, temperature, top-p, they’re really just different mathematical flavors of educated guessing. |
| Sam | That’s exactly what they are: weighted dice rolls, |
| Alex | which perfectly explains why unconstrained generation fails for strict data structures. Working with a standard LLM is like, um, it’s like working with a highly educated improv actor. |
| Sam | I love that analogy, |
| Alex | right, because they are brilliant at keeping the scene going based on the context, and they can pattern match the rhythm of a conversation perfectly, but they are absolutely terrible at filling out a strict tax return without supervision. |
| Sam | Yeah, because they just go with the vibe. The model has seen millions of JSON files in its training data, so it knows the general vibe of JSON, |
| Alex | but it has no hard mathematical guarantee of the syntax. |
| Sam | Exactly. It will happily mix types, like giving you the key “age” with the value 25 spelled out as a string instead of the integer 25, |
| Alex | just because the word 25 was probabilistically likely in that specific linguistic context, right? |
| Sam | It lacks structural awareness entirely. It operates strictly on statistical associations. It doesn’t actually know what a JSON schema is. |
| Alex | It just knows that an opening brace is usually followed by a quotation mark. |
| Sam | Yep. So here’s the million dollar question for the developers listening. If the model is fundamentally just rolling weighted dice to pick the next token out of 100,000 options, how do we physically force it to only pick the ones that form valid code? |
| Alex | OK, yes. How do we do that? Because begging didn’t work. |
| Sam | Begging definitely didn’t work. This is where we introduce the mechanism of constrained decoding. If probabilistic guessing is the disease, constrained decoding is the cure. I like the sound of that. Instead of letting the model sample freely from the entire post-softmax vocabulary, we introduce a formal grammar. Think of it as a finite state machine that runs parallel to the generation process. |
| Alex | OK, a finite state machine, so it’s tracking the state of the output at every step. |
| Sam | Exactly. At step T, this state machine evaluates your schema and the partial output generated so far. Then it determines exactly which tokens are syntactically legal to generate next. |
| Alex | I like to think of it like a mechanical template placed over a piano keyboard. Oh, that’s a good visual. Like if you are only allowed to play the notes in a specific chord, the template physically locks down all the wrong keys. So even if the pianist, or in this case the LLM, tries to mash a wrong key because it probabilistically feels right, the key simply won’t depress. |
| Sam | That is perfectly stated. It just blocks the action. |
| Alex | But what is the actual math happening under the hood to lock down those keys? |
| Sam | What’s fascinating here is the mathematical trick. It happens right at the logit computation phase before the softmax function is applied. OK, before softmax, right? The constrained decoding engine identifies all the syntactically invalid tokens for that specific step. It then masks their probabilities by setting their raw logit scores to negative infinity. |
| Alex | Oh wow, negative infinity, because when you pass negative infinity through an exponential softmax function, it exponentiates to exactly 0, |
| Sam | precisely 0, not a tiny fraction, but actual 0. Those invalid tokens now have a mathematically 0% chance of being sampled, |
| Alex | so they’re completely off the table, yep. |
| Sam | The remaining valid tokens, the keys that aren’t locked down, are then renormalized so their probabilities sum back to one. |
| Alex | And then the model just samples from that restricted pool using your standard temperature or top-p settings. |
| Sam | Exactly. The benefit is absolute. You get guaranteed syntax validity and you can finally throw away your regex postprocessing scripts. |
| Alex | Wait, I need to push back on the efficiency of this though. You’re saying we’re running a complex finite state machine against a vocabulary of 100,000 tokens at every single generation step. If my model outputs 500 tokens, I’m doing that grammar evaluation 50 million times. Doesn’t that introduce, like, massive computational overhead and latency? It absolutely does. Especially for listeners who are building local agents and counting every millisecond of time to first token. That sounds incredibly heavy. |
| Sam | It is a significant computational bottleneck, and you’ve hit on one of the major engineering challenges in the space. Building the finite state machine from a complex JSON schema takes upfront compute, and |
| Alex | evaluating the mask at every step adds latency. |
| Sam | Right. Frameworks mitigate this through heavy caching. If the state machine knows that the next 15 tokens must be a specific string, it can cache that path. But the overhead is real, so it’s not a free lunch. Definitely not. And it gets even more complicated when you factor in subword tokenization. |
| Alex | Wait, how so? Why does subword tokenization mess it up? |
| Sam | Well, the grammar engine is evaluating bytes or characters, right? But the LLM generates tokens, which are often chunks of characters grouped together. Let’s say your schema strictly requires the key “status”. The finite state machine is looking for a quotation mark, but the LLM’s vocabulary might not output a standalone quotation mark. Oh, I see where this is going. Its most likely token might be a combined subword like an opening brace attached to an S and a T. The grammar engine sees a token starting with a brace instead of a quotation mark, panics, and masks it to negative infinity. |
| Alex | Ah, so the token boundaries don’t cleanly align with the syntax boundaries of JSON. The FSM accidentally bans the correct path because the LLM tried to output it as a multi-character chunk. |
| Sam | Exactly the problem. Modern constrained decoding engines like Outlines or the one built into llama.cpp have to do incredibly complex look-ahead operations just |
| Alex | to figure out if a subword token contains the valid byte sequence. |
| Sam | Yeah, it is computationally expensive, but it is really the only way to achieve deterministic structure. |
| Alex | Here’s where it gets really interesting, because now we have to talk about how the major APIs actually expose this underlying math to developers. And right now there is a massive point of confusion in the ecosystem between older JSON mode and true structured outputs. |
| Sam | Conflating those two is a dangerous trap. It really is. |
| Alex | Break it down for us. What is the actual difference? |
| Sam | Well, JSON mode, which you typically enable by passing `response_format={"type": "json_object"}` in OpenAI’s API, merely guarantees that the final string will successfully parse as JSON. That’s it. Just that it parses, that is the extent of the promise. It is a legacy feature supported by older models like GPT-3.5 Turbo. It does not guarantee schema adherence, |
| Alex | so it doesn’t guarantee your field names are correct. |
| Sam | Nope, it doesn’t enforce strict typing. And it absolutely allows the model to hallucinate entirely new unexpected fields, right? |
| Alex | So JSON mode is basically like ordering a sandwich at a deli. You have a structural guarantee that you are going to get something between two pieces of bread, but the fillings are a total surprise. |
| Sam | Exactly. You might get turkey, you might get an old shoe. |
| Alex | But structured output, on the other hand, which uses type `json_schema`, is like a highly specific legally binding catering contract. Yes. |
| Sam | It utilizes the actual schema-aware negative infinity constrained decoding we just broke down. It guarantees that required fields are present, that integer types aren’t secretly strings, and that no extra fields are hallucinated, |
| Alex | but you have to use newer models for it, right, like GPT-4o mini or the newer GPT-4o releases, correct? |
| Sam | So for an engineer setting up their API calls right now, the question becomes, is there any valid reason to ever use the old JSON mode again, right? Why would anyone use it unless you are forced to use a legacy model for compliance reasons or your task is just incredibly open-ended, |
| Alex | like a chatbot returning deeply varied, unstructured configurations. |
| Sam | Yeah, where you can tolerate structural variance. If not, the answer is no. If you need predictability for a typed backend, structured outputs are the baseline. |
| Alex | But this raises a critical issue about your deployment environment, because you might be prototyping on OpenAI, but for privacy-first agents, you are almost certainly using Ollama, |
| Sam | and the difference in how these two APIs handle structured output is a serious lesson in defensive engineering. |
| Alex | It is a massive gotcha for developers. Let’s trace exactly what happens when you call an API and demand a strict JSON schema, but the underlying model you are targeting doesn’t actually support constrained decoding. |
| Sam | OK, so if you are using OpenAI and you request structured outputs on an unsupported model, the API acts as a strict bouncer. It outright rejects the request. |
| Alex | You get a hard 400 Bad Request error, |
| Sam | exactly, stating the model does not support structured outputs, |
| Alex | which is fantastic. That is exactly what you want in software engineering. Fail loudly, fail cleanly, and fail immediately so I can catch the exception. |
| Sam | But Ollama operates very differently. Ollama uses a generic format parameter. If you use a supported model like Llama 3.1, Ollama brilliantly converts your JSON schema into a grammar |
| Alex | and runs that internal constrained decoding runtime with the negative infinity masking. |
| Sam | Yep, and it works beautifully. But if you pass that same strict schema to a smaller or older open source model in Ollama that doesn’t support structured output, |
| Alex | What happens? |
| Sam | Ollama silently ignores the constraint. |
| Alex | Wait, it just pretends you didn’t ask? |
| Sam | It completely ignores it. It falls completely back to untrusted, probabilistic, JSON-ish output. |
| Alex | Oh, that is terrifying. If Ollama fails silently like that, you could deploy an agent thinking it’s completely type safe, only to have it blow up in production with a downstream JSON decode error. |
| Sam | All because the model randomly decided to inject a conversational preamble and Ollama just let it happen. |
| Alex | So if the inference engine won’t protect us, how should developers proactively defend against this silent failure when running local models? |
| Sam | Architecturally, you have to treat Ollama’s schema parameter as a hint, not a guarantee, unless you have explicitly verified the model’s capabilities in the release notes. OK, treat it as a hint. You defend against this by wrapping your API client code in strict validation layers, typically using a library like Pydantic in Python. But you don’t just validate and crash, you build an automatic retry loop. |
| Alex | OK, walk me through the mechanics of that retry loop. What does that architecture actually look like in code? |
| Sam | Sure, you fire the generation request to Ollama. You take the raw string response and pass it to your Pydantic model using `model_validate_json`. |
| Alex | And if Ollama silently failed and the model hallucinated a bad type, |
| Sam | Pydantic raises a validation error. You catch that exception. You extract the exact string representation of the error. |
| Alex | For example, “field age expected integer, got string.” |
| Sam | Exactly. You then format a new user prompt saying, “Your previous output was invalid. Here’s the exact system error,” and you insert that error, “Please correct the JSON,” |
| Alex | and you just append that to the message history and call the model again. |
| Sam | Yes, you typically loop this up to 3 times before falling back to a human in the loop or a hard failure. |
| Alex | That makes total sense. So we have the mechanism to dodge the silent fallbacks, but you know, a validation loop is only as good as the instructions we give it to validate against, true, which brings us to the actual art of crafting JSON schemas for these agents, because it’s not just a matter of dumping your massive PostgreSQL database schema into the prompt and hoping for the best. |
| Sam | No, schema design for constrained decoding is a very specific discipline. The golden rule from the source material is keep it flat and shallow, |
| Alex | flat and shallow. Why is that so important? |
| Sam | Deeply nested JSON trees, where an object contains a list of objects which contain more objects, they choke the constrained decoder. It creates too many branching paths for the finite state machine to evaluate efficiently at generation time, which spikes your latency. Exactly. You want explicit types, rigid boundaries like minimum and maximum lengths, and most importantly, you want to use enums for fixed vocabularies. |
| Alex | Give me an example of the enum thing. |
| Sam | Well, if a status field in your backend can only be pending, approved, or rejected, use an enum in the schema. Do not let the model probabilistically guess the string, |
| Alex | right, because it might guess “processing” and break everything. The notes also highlight one specific setting as an absolute must-use: setting `additionalProperties` to false. |
| Sam | Oh yeah, that is non-negotiable. |
| Alex | If you don’t explicitly set that flag in your schema definition, you are leaving the door wide open for the model to hallucinate entirely new fields that your back end has no idea how to parse, |
| Sam | and the constrained decoder will let them right through because you didn’t tell it not to. That flag is mandatory for type safety, but equally important to knowing how to build a schema is knowing when not to use structured output at all. |
| Alex | Wait, really? When should we avoid it? |
| Sam | If your agent is executing complex reasoning tasks that require chain of thought, where it needs to think step by step before arriving at an answer, forcing that output into a JSON string is actively harmful to the model’s intelligence. |
| Alex | I’ve heard this. Why does the format degrade the intelligence? Is it a compute allocation issue? |
| Sam | Exactly. When you force chain of thought inside a JSON string field, the model suddenly has to dedicate its limited attention mechanism and compute to escaping quotation marks, managing new line characters, and maintaining string syntax |
| Alex | rather than actually reasoning through the logical problem, |
| Sam | right? The cognitive overhead of the formatting degrades its problem solving capability. |
| Alex | So we shouldn’t force our agent to do its deep thinking inside a tiny constrained JSON box. What’s the architectural workaround for that? |
| Sam | We use the reasoning-then-structuring hybrid approach. You basically break the task into two steps. First, let the model reason freely in natural language. |
| Alex | Just a standard unstructured text prompt. |
| Sam | Yeah, let it use its full auto-regressive power to think through the problem in an unstructured scratch pad. Then once it reaches a conclusion, you make a second lightweight API call with strict structured output enabled. |
| Alex | Uh, and you simply ask it to extract the final data points from its own natural language reasoning into your schema. |
| Sam | Exactly. It works beautifully. |
| Alex | OK, so let’s say we’ve architected everything perfectly. We’ve built a flat, shallow schema with `additionalProperties` set to false. We are using GPT-4o mini with strict constrained decoding. Sounds solid. We have Pydantic retry loops in place just in case. We force the model to follow our syntax flawlessly. Our data pipeline is bulletproof now, right? |
| Sam | Well, it is bulletproof against syntax errors, but we have to face the final and perhaps most difficult reality check here. |
| Alex | Uh oh. |
| Sam | Syntax does not equal semantics. |
| Alex | Ah, the ultimate boss fight. |
| Sam | The source material highlights a massive real-world data point that illustrates this perfectly. Researchers analyzed over 50,000 LLM generated SQL queries. |
| Alex | 50,000. Wow. |
| Sam | Yeah, and these were queries generated using structured output paradigms. The SQL syntax was mathematically flawless. The brackets matched, the keywords were correct, but |
| Alex | I’m guessing it didn’t work. |
| Sam | Semantic errors ran rampant throughout the entire data set, |
| Alex | meaning the code compiles perfectly, but it executes the completely wrong business logic. |
| Sam | Precisely. The study showed that 35 to 40% of the errors were schema misunderstandings. The LLM would generate a valid SQL query, but it would assume standard naming conventions based on its training data. |
| Alex | So it would guess a table was called users or revenue, |
| Sam | right, instead of your actual internal database name like U Darodv2. It pattern matches syntax, but it doesn’t inherently understand your specific business context. That makes total sense. Furthermore, another 25 to 30% of the errors were incorrect join paths. The model would successfully join two tables but completely miss the required intermediate junction table, returning wildly inaccurate data. |
| Alex | So what does this all mean for the developers listening? I mean, constrained decoding solves the typing problem, but it doesn’t solve the understanding problem. No, it doesn’t. We haven’t created a perfect thinker. We’ve basically just elevated the LLMs from making stupid typographical errors to making highly articulate, syntactically perfect logical errors. |
| Sam | That is the core issue. And this is why postprocessing is never dead. It just moves up the stack. |
| Alex | Right? You no longer need regex to find a missing comma. |
| Sam | Exactly. But you absolutely still need a semantic layer in your application. You need code that defines valid join paths and verifies table names before execution. |
| Alex | And you still rely heavily on those Pydantic retry loops, I imagine. |
| Sam | Yes, but not to fix formatting. You use them to provide explicit error feedback to the model. Like, your syntax was perfect, but this join path does not exist in our database schema. Try again. |
| Alex | Wow, what a journey this has been today. We started with the sheer chaos of begging an LLM for valid JSON in the system prompt. A dark time. We traced that problem all the way down to the autoregressive token-by-token trap where models are just rolling dice across 100,000 logits. Yeah, educated guessing. We looked at the mathematical elegance and, let’s be honest, the computational cost of constrained decoding using negative infinity masking to force syntax validity. |
| Sam | It’s heavy, but it works. |
| Alex | We compared the strict bounds of OpenAI’s 400 Bad Request to the wild west, silent fallbacks of Ollama. And we established that flat schemas, enums, and Pydantic validation loops are really the holy trinity for agentic data science workflows. |
| Sam | If we step back and look at where this is all heading, it points to a massive paradigm shift in how we treat these models. What do you mean? Well, as constrained decoding becomes perfectly integrated at the hardware and inference level, the chaotic conversational art of prompt engineering is rapidly dying. We’re returning to rigorous software engineering, which is a relief, honestly, it is. But it makes you wonder about the future of model architecture itself. If the end goal for backend agents is purely deterministic type-safe data pipelines, will future models even need to be trained on conversational human text? Oh wow, I never thought about that. We might see a radical divergence where consumer models speak English, but backend agentic models are trained exclusively on abstract syntax trees and strict formal grammars, uh, abandoning natural language entirely. The language part of large language models might actually become obsolete for the very tasks we are building today. |
| Alex | That is a fascinating thought to leave on. Are we approaching a future where our AI agents don’t even know how to say hello, but they can write a flawless nested SQL join on the first try? It’s very possible. Thank you all for joining us on this deep dive. As you head back to your code editors and terminal windows, go do yourselves a favor. Audit your API parameters, build out those Pydantic retry loops, and go test `additionalProperties: false` in your next agent build. Let’s make sure none of you ever have to read the words, “Sure, here’s the JSON you requested,” ever again. |
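
The negative-infinity masking step discussed in the deep dive can be illustrated with a toy example. This is a sketch under stated assumptions, not how any particular engine is implemented: the five-token vocabulary, the logit values, and the hand-written sets of allowed tokens are all made up, whereas a real constrained decoder derives the allowed set from a grammar or schema at every step. The function also assumes at least one token is allowed.

```python
import math
import random


def masked_softmax(logits, allowed, temperature=1.0):
    """Softmax over logits after setting every disallowed entry to -inf.

    Assumes `allowed` contains at least one valid index.
    """
    masked = [
        (score / temperature) if i in allowed else float("-inf")
        for i, score in enumerate(logits)
    ]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(score - peak) for score in masked]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]  # renormalized over the allowed tokens


# Toy vocabulary and logits (hypothetical values).
vocab = ["Sure", "{", '"', "}", "hello"]
logits = [3.0, 1.0, 0.5, 0.2, 2.0]

# Step 1: pretend the grammar says a JSON object must start with "{".
allowed_first = {vocab.index("{")}
probs_first = masked_softmax(logits, allowed_first)

# Step 2: after "{", a quote (start of a key) or "}" (empty object) is legal.
allowed_next = {vocab.index('"'), vocab.index("}")}
probs_next = masked_softmax(logits, allowed_next)

# Sampling still happens, but only over tokens with non-zero probability.
token = random.choices(vocab, weights=probs_next, k=1)[0]
print(probs_first, probs_next, token)
```

Even though “Sure” has the highest raw logit, its probability after masking is exactly zero, so the conversational preamble cannot be sampled.
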
Presentation
Lecture Notes
Mapping Classic Data Science Techniques to Agentic AI Management
| Classic Analytics Problem | Agentic AI Equivalent | Key Technique | Business Analogy | Primary Metrics |
|---|---|---|---|---|
| Conversion funnel analysis | Task completion path analysis | Sankey diagrams, sequence analysis | Clickstream / Customer Journey Analytics | Task completion rate, Mean steps-to-completion, Tool call frequency, Path diversity |
| Concept drift in forecasting | Agent performance degradation | Change-point detection, SPC control charts | KPI Trend Monitoring / Demand Forecasting / SLA Tracking | Answer correctness, Task success rate, Latency, Token consumption |
| Call center queueing / workforce mgmt | Multi-agent coordination & load balancing | Little’s Law, utilization analysis, bottleneck ID | Hospital Patient Flow / Supply Chain Optimization | Utilization rate, throughput, end-to-end latency ($W$), tasks in-flight ($L$) |
| Survey instrument validation | LLM-as-judge calibration | Cohen’s kappa, inter-rater reliability | Social science survey validation | Krippendorff’s alpha, Cohen’s kappa, bias/calibration metrics |
| Search relevance optimization | Retrieval quality in RAG pipelines | NDCG, Precision@k, Recall@k | E-commerce Search / Recommendation Systems | Precision@k, Recall@k, NDCG, Similarity scores |
| Customer complaint classification | Agent failure taxonomy | Multi-label classification, cost-sensitive learning | Fraud Detection / Ticket Routing | Hamming loss, subset accuracy, per-label F1, Precision/Recall |
| A/B testing marketing campaigns | Agent configuration experiments | t-test, Mann-Whitney U, effect size estimation | Marketing Campaign Experiments / Clinical Trials | p-value, Cohen’s d (Effect size), Statistical significance |
| Demand forecasting | Query volume & resource planning | Time-series decomposition, capacity models | Retail Demand Forecasting | Trend, Seasonality, Residuals, GPU/Resource utilization |
| Customer segmentation | Agent behavioral clustering | K-means, HDBSCAN, UMAP | CRM Behavioral Cohort Analysis | Cluster membership, silhouette scores (inferred), behavioral feature vectors |
| Fraud / anomaly detection | Unusual agent behavior detection | Isolation Forest, LOF, DBSCAN | Network Intrusion Detection | Outlier scores, noise points |
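
As one concrete instance of the table, the "Agent performance degradation" row can be sketched with a simplified Shewhart-style control chart. The daily success rates below are invented for illustration, and using the sample standard deviation of a baseline window with 3-sigma limits is an assumption; production SPC charts often estimate sigma from moving ranges and add run rules.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Return (lower, center, upper) control limits from a baseline window."""
    center = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation of the baseline
    return center - sigmas * spread, center, center + sigmas * spread


def out_of_control(values, lower, upper):
    """Indices of observations that fall outside the control limits."""
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical daily task success rates for an agent.
baseline = [0.92, 0.91, 0.93, 0.90, 0.92, 0.94, 0.91, 0.92]
recent = [0.92, 0.91, 0.84, 0.83]

lower, center, upper = control_limits(baseline)
flagged = out_of_control(recent, lower, upper)
print(f"limits=({lower:.3f}, {upper:.3f}) flagged_days={flagged}")
```

The same pattern applies to the other metrics in that row, such as latency or token consumption, by swapping in a different series.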