MSA 8700 — Module 11
Discussion Prompt: “What does a ‘bounce’ look like in an agent system? Is it always bad? When might a single-step completion be desirable, and when does it signal a problem?”
Track these as time series:
Apply standard decomposition: trend (is quality declining?), seasonality (do certain times of day or week show different performance — e.g., batch jobs competing for GPU resources?), residual (noise vs. signal). Rolling averages (7-day, 28-day) smooth noise and reveal trend.
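A minimal sketch of the rolling-average step only, assuming a short list of hypothetical daily success rates. The data and the `rolling_mean` helper are illustrative, and a full decomposition would also estimate seasonality and residuals.

```python
from statistics import mean

def rolling_mean(values, window):
    """Trailing rolling mean; positions with fewer than `window` points yield None."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(mean(values[i + 1 - window : i + 1]))
    return out

# Hypothetical daily task-success rates over two weeks (illustrative only).
daily_success = [0.92, 0.91, 0.93, 0.90, 0.92, 0.91, 0.93,
                 0.89, 0.88, 0.90, 0.87, 0.86, 0.88, 0.85]

smoothed = rolling_mean(daily_success, window=7)

# Compare the first and last full windows for a crude trend estimate.
trend_delta = smoothed[-1] - smoothed[6]
print(f"7-day mean moved by {trend_delta:+.3f}")
```

A negative `trend_delta` suggests declining quality. In practice the 28-day window would be compared alongside the 7-day one to separate short-term noise from a sustained trend.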
Control charts — X-bar, R-charts, CUSUM — are used identically in manufacturing QA, service level monitoring, and agent performance tracking. The logic:
CUSUM (Cumulative Sum) charts are particularly valuable for detecting small, sustained shifts — the kind of slow degradation that agents exhibit when upstream data changes or model weights drift subtly.
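The idea can be sketched with a one-sided tabular CUSUM that watches for downward shifts. The target, slack, and threshold values below are assumptions chosen for illustration, as are the scores. In a real deployment they would be derived from a known-good baseline period.

```python
def cusum_lower(values, target, slack, threshold):
    """One-sided tabular CUSUM for downward shifts.

    Returns the index of the first alarm, or None if no alarm fires.
    """
    s = 0.0
    for i, x in enumerate(values):
        # Accumulate only shortfalls larger than the slack allowance;
        # the sum resets to zero whenever the process recovers.
        s = max(0.0, s + (target - slack - x))
        if s > threshold:
            return i
    return None

# Hypothetical daily quality scores: stable near 0.90, then a small drop to ~0.87.
scores = [0.90, 0.91, 0.89, 0.90, 0.90, 0.91, 0.89,
          0.87, 0.87, 0.87, 0.86, 0.87, 0.87, 0.86]

alarm_at = cusum_lower(scores, target=0.90, slack=0.01, threshold=0.05)
print("first alarm at index:", alarm_at)
```

No single day after the shift looks alarming on its own (0.87 versus 0.90). The alarm fires because the shortfalls accumulate, which is why CUSUM suits slow degradation.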
Algorithms like PELT (Pruned Exact Linear Time) or Bayesian Online Change-Point Detection identify the moment a distribution shifts. In manufacturing, this detects when a machine tool begins producing slightly out-of-spec parts. In agent systems, it detects when answer quality shifts — perhaps because the retrieval index was updated, a model version changed, or the distribution of user queries evolved.
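To make the idea concrete, here is a deliberately simplified sketch that finds a single change point by brute force. It is not PELT or Bayesian online detection. It only shows the shared principle of choosing the split that best explains the data as two segments with different means. The scores are hypothetical.

```python
def best_single_split(values):
    """Return the index k that minimizes the summed squared error of
    fitting one mean to values[:k] and another mean to values[k:]."""
    def sse(segment):
        m = sum(segment) / len(segment)
        return sum((x - m) ** 2 for x in segment)

    best_k, best_cost = None, float("inf")
    for k in range(1, len(values)):  # both segments must be non-empty
        cost = sse(values[:k]) + sse(values[k:])
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k

# Hypothetical weekly answer-quality scores with a shift after the fifth point.
quality = [4.2, 4.1, 4.3, 4.2, 4.2, 3.5, 3.4, 3.6, 3.4, 3.5]

change_index = best_single_split(quality)
print("distribution shifts at index:", change_index)
```

Once a change index is found, it can be compared against the deployment log (index rebuilds, model version changes) to look for a plausible cause.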
Concept drift in business forecasting occurs when the relationship between features and target changes — a demand model trained in 2024 underperforms in 2025 because consumer behavior shifted. In agent systems, the equivalent is task distribution drift: the queries users ask evolve, the documents in the knowledge base age, or the external APIs the agent calls change their response format. The agent’s behavior degrades not because the agent changed, but because the world did.
Key Insight: “How do you know your agent got worse before your users tell you?” The answer is the same as in any business: instrument the process, track the metrics, set control limits, and automate anomaly detection. Statistical process control (SPC) is not only a manufacturing technique; it is a process monitoring technique that applies to any repeated operation.
Agent outputs fail in categorizable ways. A working taxonomy:
| Failure Category | Description | Business Analogy |
|---|---|---|
| Hallucination | Agent generates factually incorrect content with high confidence | Fraudulent transaction — looks legitimate, isn’t |
| Tool failure | External tool returns an error or unexpected result | Supply chain disruption — upstream dependency fails |
| Misrouting | Agent selects the wrong tool or wrong sub-agent for the task | Ticket misrouted to the wrong department |
| Context loss | Agent loses track of conversation history or task state | Customer has to repeat their problem to a new rep |
| Prompt injection | Adversarial input causes the agent to deviate from instructions | Social engineering attack on a call center agent |
| Infinite loop | Agent enters a cycle of repeated actions without progress | Escalation loop — ticket bounces between departments |
| Partial completion | Agent completes part of the task but misses key requirements | Order shipped incomplete |
A single agent output can exhibit multiple failure modes simultaneously — a hallucinated answer that also lost context and was produced after a silent tool failure. This is a multi-label classification problem, not multi-class. The distinction matters for metric selection (use Hamming loss, subset accuracy, or per-label F1 rather than simple accuracy).
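The three metrics named above can be computed by hand on a small example. This is a sketch under assumed data: the three label columns and the four labeled outputs are hypothetical.

```python
LABELS = ["hallucination", "tool_failure", "context_loss"]

# Each row is one agent output; 1 means the failure mode is present.
y_true = [[1, 0, 1], [0, 0, 0], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 0, 0], [0, 1, 0], [1, 0, 1]]

def hamming_loss(true, pred):
    """Fraction of individual label slots predicted incorrectly."""
    wrong = sum(t != p for tr, pr in zip(true, pred) for t, p in zip(tr, pr))
    return wrong / (len(true) * len(true[0]))

def subset_accuracy(true, pred):
    """Fraction of outputs whose entire label set is predicted exactly."""
    return sum(tr == pr for tr, pr in zip(true, pred)) / len(true)

def per_label_f1(true, pred, j):
    """F1 for label column j, treating it as its own binary problem."""
    tp = sum(1 for tr, pr in zip(true, pred) if tr[j] == 1 and pr[j] == 1)
    fp = sum(1 for tr, pr in zip(true, pred) if tr[j] == 0 and pr[j] == 1)
    fn = sum(1 for tr, pr in zip(true, pred) if tr[j] == 1 and pr[j] == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

print("Hamming loss:", hamming_loss(y_true, y_pred))
print("Subset accuracy:", subset_accuracy(y_true, y_pred))
for j, name in enumerate(LABELS):
    print(name, "F1:", round(per_label_f1(y_true, y_pred, j), 3))
```

Subset accuracy is the strictest of the three because one wrong label makes the whole output count as wrong. Per-label F1 shows which failure mode the classifier struggles with.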
The cost of errors is severe and asymmetric, exactly as in fraud detection:
This asymmetry demands threshold tuning, class weighting, or cost-sensitive loss functions — the same toolkit used in fraud detection, medical screening, and safety-critical classification.
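Threshold tuning under asymmetric costs can be sketched as follows. The 10:1 cost ratio, the detector scores, and the candidate thresholds are all assumptions for illustration. Real costs must come from the domain.

```python
def total_cost(scores, labels, threshold, cost_fn, cost_fp):
    """Total misclassification cost when flagging every score >= threshold."""
    cost = 0.0
    for s, y in zip(scores, labels):
        flagged = s >= threshold
        if y == 1 and not flagged:
            cost += cost_fn   # missed failure
        elif y == 0 and flagged:
            cost += cost_fp   # false alarm
    return cost

# Hypothetical detector scores and true labels (1 = real failure).
scores = [0.95, 0.80, 0.60, 0.40, 0.35, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 0, 0, 0]
candidates = [0.25, 0.50, 0.75]

# Assumed costs: a missed failure is 10x as expensive as a false alarm.
best_asymmetric = min(
    candidates, key=lambda t: total_cost(scores, labels, t, cost_fn=10, cost_fp=1)
)
# With equal costs, the preferred threshold moves.
best_symmetric = min(
    candidates, key=lambda t: total_cost(scores, labels, t, cost_fn=1, cost_fp=1)
)
print(best_asymmetric, best_symmetric)
```

With equal costs the highest threshold wins. Once missed failures are priced at ten times a false alarm, the lowest threshold wins, even though it produces more false positives.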
Key Insight: Deciding what counts as “failure” is a domain problem, not a modeling problem. This is identical to the challenge of defining “churn” in a CRM system (inactive for 30 days? 60 days? Reduced usage? Canceled subscription?) or defining “fraud” in a payments system (rule-based threshold? Statistical anomaly? Confirmed investigation?). The taxonomy above is a starting point — every deployment will require domain-specific refinement based on the use case, the user population, and the consequences of each failure type.
Agent failure labels are expensive to obtain — they require expert review of each output. Active learning addresses this by prioritizing the most informative examples for labeling:
The result: a usable classifier with far fewer labeled examples than random sampling would require.
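One common way to pick "most informative" examples is uncertainty sampling: send the outputs the current classifier is least sure about to the expert. The sketch below assumes a classifier that emits a failure probability per output. The probabilities and the `select_for_labeling` helper are hypothetical, and other selection strategies exist.

```python
def select_for_labeling(probabilities, budget):
    """Return indices of the `budget` outputs whose predicted failure
    probability is closest to 0.5 (the most uncertain predictions)."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:budget]

# Hypothetical predicted failure probabilities for six unlabeled outputs.
probs = [0.02, 0.48, 0.91, 0.55, 0.30, 0.99]

to_label = select_for_labeling(probs, budget=2)
print("send these outputs to expert review:", to_label)
```

After the expert labels the selected outputs, the classifier is retrained and the selection is repeated on the remaining pool.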
Consider a binary “hallucination detector” evaluated on 500 agent outputs:
| | Predicted: Hallucination | Predicted: Correct |
|---|---|---|
| Actual: Hallucination | 38 (TP) | 12 (FN) |
| Actual: Correct | 25 (FP) | 425 (TN) |
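The standard metrics follow directly from the four cells of this matrix. The snippet below only restates the table's counts and applies the textbook definitions.

```python
# Counts taken from the confusion matrix above.
tp, fn = 38, 12   # actual hallucinations: caught / missed
fp, tn = 25, 425  # actual correct outputs: falsely flagged / passed

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of flagged outputs, how many were hallucinations
recall = tp / (tp + fn)      # of hallucinations, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

# Baseline: a "detector" that never flags anything.
baseline_accuracy = (fp + tn) / total

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f} baseline={baseline_accuracy:.3f}")
```

Accuracy is 0.926, but a detector that never flags anything already reaches 0.900 because only 50 of the 500 outputs are hallucinations. Precision (about 0.60) and recall (0.76) are the more informative numbers here.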
Discussion Prompt: “In your own agent system, what failure categories would you define? How would you decide the boundary between ‘partially correct’ and ‘failure’?”
RAG system retrieval quality is evaluated with the same metrics used in e-commerce search and recommendation engine evaluation:
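Three metrics of this kind are Precision@k, Recall@k, and NDCG. The sketch below computes them for a single query. The document ids and graded relevance values are hypothetical, and the NDCG variant shown uses linear gain with a log2 discount, which is one of several conventions.

```python
import math

def precision_at_k(ranked_ids, relevant_ids, k):
    """Share of the top-k results that are relevant."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k

def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of all relevant documents that appear in the top k."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / len(relevant_ids)

def ndcg_at_k(ranked_ids, gains, k):
    """NDCG@k with linear gain; `gains` maps doc id -> graded relevance."""
    dcg = sum(
        gains.get(d, 0) / math.log2(rank + 2)
        for rank, d in enumerate(ranked_ids[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical retrieval result for one query.
ranked = ["d3", "d7", "d1", "d9", "d4"]
gains = {"d1": 3, "d3": 2, "d5": 1}   # graded relevance judgments
relevant = set(gains)

p3 = precision_at_k(ranked, relevant, 3)
r3 = recall_at_k(ranked, relevant, 3)
n3 = ndcg_at_k(ranked, gains, 3)
print(round(p3, 3), round(r3, 3), round(n3, 3))
```

Precision and recall ignore position within the top k. NDCG penalizes the retriever for ranking the most relevant document (`d1`) third instead of first.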
A natural analytical question: does retrieval similarity score predict answer quality? This is a regression / correlation analysis problem. Extract features from retrieval:
Fit a regression model predicting answer quality (human-rated or LLM-judged) from these features. The feature importance ranking tells you which aspects of retrieval matter most for downstream quality — actionable intelligence for tuning the retrieval pipeline.
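As a first pass before a multi-feature model, a single retrieval feature can be checked against answer quality with a correlation and a one-variable least-squares fit. The sketch assumes one hypothetical feature (top-1 similarity score) and made-up quality ratings.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - slope * mx, slope

# Hypothetical per-query data: top-1 similarity and rated answer quality (1-5).
top1_similarity = [0.55, 0.62, 0.70, 0.78, 0.85, 0.91]
answer_quality = [2.0, 2.5, 3.5, 3.0, 4.5, 4.0]

r = pearson_r(top1_similarity, answer_quality)
intercept, slope = fit_line(top1_similarity, answer_quality)
print(f"r={r:.2f}, quality ~ {intercept:.2f} + {slope:.2f} * similarity")
```

A strong correlation on one feature justifies the fuller regression with several retrieval features. A weak one suggests that similarity scores alone do not explain answer quality.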
Which query clusters have poor retrieval performance? Embed user queries, cluster them (K-means or HDBSCAN), and compute per-cluster retrieval metrics. This reveals underserved segments — query types where the knowledge base has gaps or the embedding model performs poorly.
This is identical to identifying underserved customer segments in retail analytics: which customer cohorts have low conversion rates? What products are they searching for that the catalog doesn’t carry?
Key Insight: RAG quality is a regression problem. Predict answer quality from retrieval features, identify the weak spots, and invest in the clusters where retrieval fails. The analytical framework is identical to diagnosing why certain customer segments have poor conversion on an e-commerce site.
A multi-agent system is a queuing network. Queries arrive, wait for an available agent instance, get processed through a sequence of steps (sub-agents, tools, LLM calls), and exit. This is structurally identical to:
L = λW
This holds for any stable queuing system. If your agent handles λ = 5 queries/minute and each takes W = 30 seconds (0.5 minutes) on average, you’ll have L = 5 × 0.5 = 2.5 tasks in flight on average. If W increases to 60 seconds, L doubles to 5, and if you only have 3 agent instances (each handling one task at a time), the queue begins to grow.
Little’s Law is taught in every operations management course. It applies to agent systems without modification.
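The arithmetic can be checked in a few lines. The sketch treats W as the mean processing time and assumes each instance handles one task at a time. Both are simplifying assumptions rather than statements about any particular deployment.

```python
def tasks_in_flight(arrival_rate_per_min, mean_time_sec):
    """Little's Law: L = lambda * W, with W converted to minutes."""
    return arrival_rate_per_min * (mean_time_sec / 60.0)

arrival_rate = 5   # queries per minute
instances = 3      # assumed: one task per instance at a time

in_flight_30s = tasks_in_flight(arrival_rate, 30)   # W = 30 s
in_flight_60s = tasks_in_flight(arrival_rate, 60)   # W = 60 s

# If the required concurrency exceeds the number of instances,
# work arrives faster than it can be served and the queue grows.
overloaded_30s = in_flight_30s > instances
overloaded_60s = in_flight_60s > instances
print(in_flight_30s, overloaded_30s, in_flight_60s, overloaded_60s)
```

At W = 30 seconds the three instances are busy but keep up. At W = 60 seconds the system needs five tasks in flight and only has capacity for three.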
Theory of Constraints (TOC): The throughput of the entire system is limited by its slowest component. In a multi-agent pipeline:
This is the same analysis used to identify the bottleneck station in a manufacturing line or the overloaded department in a call center.
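A minimal sketch of bottleneck identification, assuming a three-stage pipeline with hypothetical stage names, instance counts, and mean service times. Each stage's capacity is its instance count divided by its service time, and the stage with the lowest capacity caps the whole system.

```python
# Hypothetical pipeline stages (names and numbers are illustrative).
stages = {
    "planner":     {"instances": 2, "mean_service_sec": 4.0},
    "retriever":   {"instances": 4, "mean_service_sec": 6.0},
    "synthesizer": {"instances": 2, "mean_service_sec": 20.0},
}

def capacity_per_min(stage):
    """Maximum tasks per minute the stage can complete."""
    return stage["instances"] * 60.0 / stage["mean_service_sec"]

capacities = {name: capacity_per_min(s) for name, s in stages.items()}

# The slowest stage limits end-to-end throughput.
bottleneck = min(capacities, key=capacities.get)
system_capacity = capacities[bottleneck]

# Utilization of every stage at an assumed arrival rate of 5 tasks/minute.
arrival_rate = 5.0
utilization = {name: arrival_rate / cap for name, cap in capacities.items()}

print(bottleneck, system_capacity, utilization)
```

Adding planner or retriever instances here would not raise throughput at all. Only changes to the synthesizer stage (more instances or a shorter service time) move the system limit.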
Tool call durations and sub-agent processing times are rarely normally distributed. They typically follow Log-Normal or Gamma distributions — a long right tail with occasional slow calls. Fitting the correct distribution matters for:
This is the same distributional modeling used for service times in retail checkout, hospital procedure durations, and warehouse pick times.
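A Log-Normal fit can be sketched with the standard library alone: take logs, estimate the mean and standard deviation, and read percentiles off the fitted curve. The durations are hypothetical, and this simple moment-based fit on the logs is only one of several possible estimation approaches.

```python
import math
from statistics import NormalDist, mean, stdev

def fit_lognormal(durations):
    """Estimate (mu, sigma) of the underlying normal from log-durations."""
    logs = [math.log(d) for d in durations]
    return mean(logs), stdev(logs)

def lognormal_quantile(mu, sigma, q):
    """Quantile q of a Log-Normal(mu, sigma) distribution."""
    return math.exp(mu + sigma * NormalDist().inv_cdf(q))

# Hypothetical tool-call durations in seconds, with one slow outlier.
durations = [0.8, 1.1, 0.9, 1.4, 1.0, 6.5, 1.2, 0.7, 2.3, 1.0]

mu, sigma = fit_lognormal(durations)
p50 = lognormal_quantile(mu, sigma, 0.50)
p95 = lognormal_quantile(mu, sigma, 0.95)
print(f"median={p50:.2f}s p95={p95:.2f}s mean={mean(durations):.2f}s")
```

The fitted median sits below the arithmetic mean because the right tail pulls the mean upward. This is why tail percentiles, not averages, are usually the better basis for timeouts and latency targets.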
Gantt-style charts of parallel agent execution reveal the critical path — the longest sequential chain of dependencies that determines end-to-end latency.
This is identical to critical path analysis in project management.
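The critical path is the longest path through the dependency graph. The sketch below assumes a small hypothetical task graph in which two searches run in parallel. Step names and durations are illustrative.

```python
from functools import lru_cache

# Hypothetical step durations (seconds) and dependencies.
durations = {"plan": 2.0, "search_a": 5.0, "search_b": 3.0,
             "rank": 1.5, "synthesize": 4.0}
depends_on = {"plan": [], "search_a": ["plan"], "search_b": ["plan"],
              "rank": ["search_a", "search_b"], "synthesize": ["rank"]}

@lru_cache(maxsize=None)
def finish_time(step):
    """Earliest finish: a step starts when its slowest dependency finishes."""
    start = max((finish_time(d) for d in depends_on[step]), default=0.0)
    return start + durations[step]

def critical_path(last_step):
    """Walk backwards, always following the latest-finishing dependency."""
    path = [last_step]
    while depends_on[path[-1]]:
        path.append(max(depends_on[path[-1]], key=finish_time))
    return list(reversed(path))

end_to_end = finish_time("synthesize")
path = critical_path("synthesize")
print(end_to_end, path)
```

Speeding up `search_b` would not reduce end-to-end latency, because it is not on the critical path. Only steps on the path are worth optimizing first.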
Discussion Prompt: “In your own agent system, which component do you suspect is the bottleneck? How would you measure it? What would you do if you confirmed it?”
To compare agent configurations (different models, prompts, tool sets, retrieval strategies), run a controlled experiment:
This is standard A/B testing methodology, applied to agent configurations rather than website layouts or ad copy.
Compare quality scores between configurations:
Key Insight: Statistical significance ≠ practical significance. A new retrieval strategy might produce a statistically significant 0.3-point improvement on a 100-point quality scale with p < 0.01 — and be completely irrelevant in practice. Always report effect size (Cohen’s d, or raw difference in meaningful units) alongside the p-value. This is the same lesson taught in every marketing experimentation course: a statistically significant 0.1% lift in click-through rate is real but may not justify the engineering effort.
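Cohen's d for two independent samples can be computed as below. The scores are hypothetical and built so that the variant improves every score by exactly 0.3 points on a 100-point scale, echoing the example above.

```python
import math
from statistics import mean, variance

def cohens_d(control, treatment):
    """Standardized mean difference using the pooled sample variance."""
    n_c, n_t = len(control), len(treatment)
    pooled_var = (
        (n_c - 1) * variance(control) + (n_t - 1) * variance(treatment)
    ) / (n_c + n_t - 2)
    return (mean(treatment) - mean(control)) / math.sqrt(pooled_var)

# Hypothetical quality scores on a 100-point scale.
baseline = [70, 72, 68, 71, 69]
variant = [70.3, 72.3, 68.3, 71.3, 69.3]

raw_difference = mean(variant) - mean(baseline)
d = cohens_d(baseline, variant)
print(f"raw difference={raw_difference:.2f} points, Cohen's d={d:.2f}")
```

A d of about 0.19 falls just below Cohen's conventional cutoff for a "small" effect (0.2). With enough queries it would reach statistical significance anyway, which is exactly why the p-value alone is not enough.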
When you cannot afford to commit 50% of traffic to a possibly inferior configuration for weeks, use multi-armed bandit algorithms (Thompson Sampling, UCB) to adaptively allocate traffic. Configurations that perform well get more traffic; poor performers get less. This is the same approach used in dynamic ad serving and personalized recommendation — exploit the best-known option while continuing to explore.
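A minimal simulation of Thompson Sampling with Beta posteriors, assuming binary success/failure outcomes and two configurations whose true success rates (0.55 and 0.80) are made up for the demonstration. In production the outcome would come from real task results rather than a random draw.

```python
import random

def thompson_sampling(true_rates, n_rounds, seed=0):
    """Simulate traffic allocation; returns how often each arm was chosen."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        # Draw a plausible success rate for each arm from its Beta posterior
        # (uniform Beta(1, 1) prior), then route the query to the best draw.
        samples = [
            rng.betavariate(successes[i] + 1, failures[i] + 1)
            for i in range(n_arms)
        ]
        arm = samples.index(max(samples))
        pulls[arm] += 1
        # Simulated outcome; replace with the observed task result in practice.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls

pulls = thompson_sampling(true_rates=[0.55, 0.80], n_rounds=2000)
print("queries routed to each configuration:", pulls)
```

Early rounds split traffic roughly evenly because both posteriors are wide. As evidence accumulates, the weaker configuration is sampled less and less often without ever being ruled out completely.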
When using an LLM to judge agent output quality, treat the LLM-judge as a measurement instrument subject to the same validation requirements as a survey instrument in social science:
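One such validation check is agreement between the LLM judge and a human rater, measured with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The ten paired ratings below are hypothetical.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected if both raters labeled independently at their own base rates.
    expected = sum(
        counts_a[label] * counts_b[label]
        for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical verdicts on the same ten agent outputs.
human = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
judge = ["pass", "pass", "fail", "pass", "pass", "pass", "fail", "pass", "fail", "fail"]

kappa = cohens_kappa(human, judge)
print(f"raw agreement=0.80, kappa={kappa:.2f}")
```

Raw agreement of 80% shrinks to a kappa of about 0.58 once chance agreement is removed. Kappa is the more honest summary when one verdict is much more common than the other.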
Discussion Prompt: “If you were to run an A/B test on your own agent system, what is the single most important metric you would track, and how many queries would you need to detect a meaningful difference?”
Represent each agent session as a behavioral feature vector: tool usage profile (which tools, how often), task length, latency, token count, outcome (success/failure), number of retries. Apply K-means, HDBSCAN, or Gaussian Mixture Models to discover natural groupings.
These clusters are agent behavioral segments — the equivalent of customer segments in CRM analytics.
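Apply Apriori or FP-Growth to tool call sequences to discover frequent patterns: “When the agent calls web_search, it almost always calls citation_formatter next” or “Retrieval followed by a second retrieval attempt is associated with task failure.” Apriori and FP-Growth treat each session as an unordered set of tools, so order-sensitive patterns like these require encoding consecutive calls as items (for example, the pair web_search→citation_formatter) or using a sequential pattern miner such as PrefixSpan.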
This is market basket analysis applied to agent behavior rather than shopping carts.
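A simplified stand-in for full association rule mining is to count consecutive tool-call pairs and report a confidence for each frequent pair. This is not Apriori or FP-Growth, only the same support/confidence idea restricted to adjacent calls. The traces are hypothetical, and the tool names `retrieve` and `summarize` are invented for the example.

```python
from collections import Counter

# Hypothetical tool-call traces, one list per agent session.
traces = [
    ["web_search", "citation_formatter", "summarize"],
    ["retrieve", "retrieve", "summarize"],
    ["web_search", "citation_formatter"],
    ["retrieve", "web_search", "citation_formatter", "summarize"],
]

def bigram_rules(traces, min_support=2):
    """Map each frequent (tool, next_tool) pair to its confidence:
    the share of times `tool` was followed by `next_tool`, among all
    occurrences of `tool` that had any following call."""
    antecedent_counts = Counter()
    pair_counts = Counter()
    for trace in traces:
        for current, following in zip(trace, trace[1:]):
            antecedent_counts[current] += 1
            pair_counts[(current, following)] += 1
    return {
        pair: count / antecedent_counts[pair[0]]
        for pair, count in pair_counts.items()
        if count >= min_support
    }

rules = bigram_rules(traces)
for (current, following), confidence in rules.items():
    print(f"{current} -> {following}: confidence {confidence:.2f}")
```

Joining each rule with the session outcome (success or failure) would turn these patterns into candidate failure predictors.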
Embed agent traces (using their behavioral feature vectors or by embedding the full trace text) and project to 2D with UMAP or t-SNE. The resulting scatter plot reveals clusters, outliers, and structural patterns in agent behavior — a behavioral landscape map.
Isolation Forest, Local Outlier Factor (LOF), or DBSCAN’s noise points flag unusual agent sessions — sessions that don’t fit any normal behavioral cluster.
This is the same anomaly detection pipeline used in fraud analytics, network intrusion detection, and rare event identification in CRM.
Key Insight: Unsupervised methods are discovery tools. They answer the question you didn’t know to ask: “What behavioral patterns exist in my agent system that I haven’t designed for?”
| Classic Analytics Problem | Agentic AI Equivalent | Key Technique |
|---|---|---|
| Conversion funnel analysis | Task completion path analysis | Sankey diagrams, sequence analysis |
| Concept drift in forecasting | Agent performance degradation | Change-point detection, SPC control charts |
| Customer complaint classification | Agent failure taxonomy | Multi-label classification, cost-sensitive learning |
| Search relevance optimization | Retrieval quality in RAG pipelines | NDCG, Precision@k, Recall@k |
| Call center queueing / workforce mgmt | Multi-agent coordination & load balancing | Little’s Law, utilization analysis, bottleneck ID |
| A/B testing marketing campaigns | Agent configuration experiments | t-test, Mann-Whitney U, effect size estimation |
| Customer segmentation | Agent behavioral clustering | K-means, HDBSCAN, UMAP |
| Fraud / anomaly detection | Unusual agent behavior detection | Isolation Forest, LOF, DBSCAN |
| Survey instrument validation | LLM-as-judge calibration | Cohen’s kappa, inter-rater reliability |
| Demand forecasting | Query volume & resource planning | Time-series decomposition, capacity models |
System Description
A large research university has deployed an agentic AI research assistant to support graduate students across all departments. The system architecture:
The system has been running for 10 weeks. It was well-received at launch.
The Problem — Six Weeks Post-Launch
The following issues have been reported or observed:
Question 1 — Diagnose the Performance Drop
The satisfaction score declined from 4.2 to 3.4 over six weeks.
Question 2 — Design a Failure Classifier
Users report answers that “sound right but cite wrong papers.”
Question 3 — Analyze the Queueing Imbalance
One agent instance handles 60% of traffic while seven others share 40%.
Question 4 — Design an Evaluation Experiment
The engineering team wants to test a new retrieval strategy (hybrid search: dense + sparse) against the current approach (dense only).
Question 5 — Propose a Monitoring Dashboard
You are asked to design an ongoing Grafana dashboard for this system with exactly 5 KPIs.
Every technique covered today — funnel analysis, SPC, classification, queueing theory, experimental design, clustering — is a technique students already know, applied to a new domain. The mapping is not metaphorical; it is structural. Agent logs are event data. Agent failures are classifiable defects. Agent pipelines are queuing networks. Agent evaluations are experiments.
You already have all the tools. The concepts transfer directly. The challenge is not learning new techniques — it is recognizing which technique applies to which agent management problem, and having the discipline to instrument, measure, and analyze before intervening.

