A Failure Taxonomy for Agentic Systems
Before a catalog of solutions, a catalog of problems. The patterns in this series each exist
because something failed in a specific, diagnosable way. This post documents the six failure
classes observed across 1,500 logged production runs—with the observed frequency, a
representative runs.jsonl signature, and a pointer to the pattern that reduced
each class’s rate.
The answer to “where do quality problems originate?” is rarely where you expect. Synthesis failures account for a smaller share than most practitioners assume. Retrieval and revision failures together account for more.
The Six Classes
F1 — Retrieval Failures high frequency
Search returned insufficient, misleading, or redundant content. The producer synthesizes from an impoverished or noisy context and produces output that fails Completeness or Groundedness in round 1.
Observed signature: r1 completeness < 6.0 with tool_calls.novelty_scores = [2, 3, 2, 3, 2]. Every score sits in the narrow low band {2, 3}, which indicates that the novelty scale has collapsed: when scores cluster at {2, 3} rather than spreading across 1–10, the gate is firing but providing no differentiation. The deep-dive on 123 runs showed specificity as the weakest first-round dimension at 6.65, consistently below depth (6.97); specificity fails when retrieved content is too general to ground concrete claims.
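The collapsed-scale signature lends itself to a mechanical check. The sketch below is illustrative only: the function name and both thresholds (`max_spread`, `min_samples`) are assumptions, not values taken from the harness.

```python
from typing import Sequence


def novelty_scale_collapsed(
    scores: Sequence[float],
    max_spread: float = 1.0,  # assumed threshold, not from the harness
    min_samples: int = 4,     # assumed minimum sample size before judging
) -> bool:
    """Return True when the scores occupy a band too narrow to differentiate sources."""
    if len(scores) < min_samples:
        return False
    return (max(scores) - min(scores)) <= max_spread


collapsed = novelty_scale_collapsed([2, 3, 2, 3, 2])  # the signature quoted above
healthy = novelty_scale_collapsed([1, 4, 7, 9, 3])    # a spread across the scale
```

A spread test is deliberately crude. It flags the {2, 3} pattern without assuming anything about where on the 1–10 scale the scores happen to cluster.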
Memory contamination variant: low-quality observations re-injected from memory crowd out diverse context. Two copies of a 6.2-scoring run were found occupying two of the top-four memory retrieval slots, anchoring synthesis at the same quality ceiling. The fix deduplicates by title and soft-penalizes observations below 7.0 at retrieval time.
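One way the dedupe-and-penalize step could look is sketched below. The 7.0 floor comes from the text. Everything else is an assumption for illustration: the field names (`title`, `score`, `similarity`), the 0.5 penalty multiplier, and the function name.

```python
from typing import Dict, List


def rerank_memory(
    observations: List[Dict],
    quality_floor: float = 7.0,  # threshold stated in the text
    penalty: float = 0.5,        # assumed multiplier for the soft penalty
    top_k: int = 4,
) -> List[Dict]:
    """Deduplicate by title, down-weight low-quality observations, return top_k."""
    best_by_title: Dict[str, Dict] = {}
    for obs in observations:
        key = obs["title"].strip().lower()
        kept = best_by_title.get(key)
        # Keep only the highest-scoring copy of each title.
        if kept is None or obs["score"] > kept["score"]:
            best_by_title[key] = obs

    def adjusted(obs: Dict) -> float:
        # Soft penalty: low-quality items stay eligible but rank lower.
        weight = penalty if obs["score"] < quality_floor else 1.0
        return obs["similarity"] * weight

    return sorted(best_by_title.values(), key=adjusted, reverse=True)[:top_k]
```

The penalty is soft by design: a below-floor observation can still surface when nothing better matches, but it can no longer hold two of the top four slots.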
Fixed by: Novelty Gate (B2), Planner-First (B1), Dual-Backend Memory Store (B3)
F2 — Planning Failures medium frequency
The planner generated queries that missed the task’s core requirement: too broad, too literal, or biased toward what the model already knows rather than what needs to be found. The research stage fetches real content, but none of it is what the task actually needed.
Observed signature: high novelty scores in round 1 (the content is new to the knowledge state) but low first-round composite scores despite passing the novelty gate. The planner correctly identified that nothing in memory covered the topic, then generated queries that covered the wrong aspect of it.
The review of MagenticOne’s architecture identified the fix: a closed-book prior knowledge pass before any web search, asking the producer what it already knows and what specific gaps remain. This front-loads the knowledge audit, making gap-filling queries more targeted. The planner now receives this closed-book self-assessment as context before generating queries.
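The ordering of the two passes can be sketched as follows. This is a hypothetical outline, not the harness's planner: `ask_model` stands in for whatever call reaches the producer model, and the prompt wording is invented for illustration.

```python
from typing import Callable, List


def plan_queries(task: str, ask_model: Callable[[str], str]) -> List[str]:
    """Run a closed-book self-assessment, then plan queries against the stated gaps."""
    # Step 1: closed-book pass, before any web search.
    assessment = ask_model(
        "Without searching, state what you already know about this task "
        f"and list the specific gaps that remain.\nTask: {task}"
    )
    # Step 2: the planner receives the self-assessment as context.
    plan = ask_model(
        "Write one search query per line targeting only the gaps below.\n"
        f"Task: {task}\nSelf-assessment:\n{assessment}"
    )
    return [line.strip() for line in plan.splitlines() if line.strip()]
```

Passing the model call in as a parameter keeps the sketch testable with a stub and makes the sequencing explicit: the gap list exists before the first query is written.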
Fixed by: Planner-First (B1) with closed-book pre-pass, Novelty Gate (B2)
F3 — Synthesis Failures medium frequency
The producer hallucinated, hedged excessively, or lost the thread of the task. Fluent output that fails on Groundedness, Specificity, or Depth in ways that neither the retrieval pipeline nor the evaluation loop fully prevent.
Count-check retry: enumerated tasks (tasks asking for a specific number of items) trigger count_check_retry in 25% of runs, with a −0.39 score penalty when the retry fires. The retry indicates the model produced an output with the wrong count; the penalty reflects that retried outputs are often shallower than first-pass outputs on non-count dimensions—the model fills the count at the cost of depth.
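A count check of this kind can be approximated with a regular expression over list markers. The sketch below assumes that enumerated outputs use numbered or bulleted lines. The function name and the pattern are illustrative, not the harness's actual `count_check_retry` logic.

```python
import re

# Matches lines that start with "1.", "2)", "-" or "*" followed by content.
_ITEM = re.compile(r"^[ \t]*(?:\d+[.)]|[-*])[ \t]+\S", re.MULTILINE)


def needs_count_retry(output: str, expected_items: int) -> bool:
    """Return True when the number of list items differs from the requested count."""
    return len(_ITEM.findall(output)) != expected_items


draft = "1. Alpha\n2. Beta\n3. Gamma\n"
retry = needs_count_retry(draft, expected_items=5)  # three items found, five requested
```

A check like this only verifies the count. It says nothing about whether the items added on retry carry the same depth as the first-pass items, which is where the measured penalty comes from.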
Synthesis instruction effect: a controlled experiment showed that the
prose_depth synthesis instruction produces more grounded outputs
(delta=+1.223 vs baseline) but does not yet clear the 1.5-point significance bar. The effect
is directionally positive but smaller than expected, which suggests that the production
ceiling for this task type is bounded by retrieval quality, not synthesis instruction quality.
F4 — Evaluation Failures medium frequency
The evaluator scored a low-quality output above threshold, or a high-quality output below it.
Evaluation failures are insidious because they are invisible in runs.jsonl—a
false pass looks identical to a true pass.
Evaluator hallucination: the evaluator claims content is absent when it is present, causing the producer to add redundant coverage in revision. Most common on Completeness, where the evaluator must reason about absence rather than presence.
Evaluator drift: scoring calibration shifts across Ollama version updates or quantization changes. A model well-calibrated at deployment may inflate or deflate its scores months later. Without detection, a drifted evaluator converts true-fail runs into logged PASSes—degrading the training data used by the data flywheel.
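One possible detection scheme is to re-score a frozen calibration set after every model or runtime update and compare the result against the scores recorded at deployment. The sketch below assumes such a calibration set exists. The 0.5-point tolerance and the function name are placeholders.

```python
from statistics import mean
from typing import Sequence, Tuple


def evaluator_drift(
    reference: Sequence[float],
    current: Sequence[float],
    tolerance: float = 0.5,  # assumed tolerance in score points
) -> Tuple[float, bool]:
    """Return (mean signed shift, drifted?) for the same outputs scored at two times."""
    if not reference or len(reference) != len(current):
        raise ValueError("score lists must be non-empty and the same length")
    shift = mean(c - r for c, r in zip(current, reference))
    return shift, abs(shift) > tolerance


# Scores for the same three frozen outputs, at deployment and after an update.
shift, drifted = evaluator_drift([6.0, 7.0, 8.0], [6.8, 7.9, 8.7])
```

A signed mean distinguishes inflation from deflation. An evaluator that becomes noisier without shifting its mean would need a variance check in addition.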
Self-evaluation bias: outputs rated by the same model that produced them average 0.9 points higher than those rated by a separate evaluator using the same rubric. The gap is largest on Groundedness.
Academic grounding: A systematic evaluation of 13 LLM judges (arXiv:2406.12624v6) found that only the largest models achieve reasonable alignment with human evaluators—yet still lag inter-human agreement by up to five points. A companion prompt-sensitivity study found that eight of nine judges show always-A position bias under semantically equivalent prompt paraphrases. The harness’s Model Role Separation (A2) and Evaluator Pool (A3) patterns are direct responses to both findings.
F5 — Revision Failures high frequency
The producer failed to act on dimensional feedback, acted on it superficially, or improved one dimension while degrading another. Analysis of 57 multi-round Wiggum runs found that 12 (21%) showed revision regression—the final score lower than the round-1 score.
Cycling: a producer presented with the same dimensional feedback across two consecutive rounds produces outputs with identical scores and nearly identical dimension breakdowns. Cycling detection was added after observing runs stuck at a plateau; stopping early saves approximately 1,300 seconds of wasted inference per stuck run.
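A minimal sketch of a cycling check follows, under the assumption that each round yields a composite score and a dict of dimension scores. The 0.25 tolerance for "nearly identical" is an illustrative choice, and this is not the code in wiggum.py.

```python
from typing import Dict, List


def is_cycling(
    composites: List[float],
    dimensions: List[Dict[str, float]],
    dim_tolerance: float = 0.25,  # assumed meaning of "nearly identical"
) -> bool:
    """Return True when the last two rounds have equal composites and near-equal dimensions."""
    if len(composites) < 2 or len(dimensions) < 2:
        return False
    if composites[-1] != composites[-2]:
        return False
    prev, last = dimensions[-2], dimensions[-1]
    if prev.keys() != last.keys():
        return False
    return all(abs(last[name] - prev[name]) <= dim_tolerance for name in last)
```

When the check returns True, the loop can stop and return the current best output instead of spending another round on feedback the producer has already failed to act on.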
Context overflow: long documents exceed the producer’s effective
context window during revision. The model truncates mid-sentence, producing
partial outputs that score worse than the original. The root cause is often a hardcoded
num_ctx in the model’s Modelfile that the runtime override does not
supersede. Fix: explicit num_predict and num_ctx overrides on
every revision call site.
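The sketch below shows what an explicit override could look like, assuming the harness calls Ollama's HTTP generate endpoint, which accepts an `options` object carrying `num_ctx` and `num_predict`. The function name, the model name, and the numeric values are placeholders. Whether request-level options take precedence over a Modelfile value in a given setup is worth verifying rather than assuming.

```python
import json
from typing import Any, Dict


def revision_payload(
    model: str,
    prompt: str,
    num_ctx: int = 8192,      # placeholder context size
    num_predict: int = 2048,  # placeholder output-token cap
) -> Dict[str, Any]:
    """Build a generate request that always states its context and output budgets."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx, "num_predict": num_predict},
    }


# "producer-model" is a hypothetical model name.
body = json.dumps(revision_payload("producer-model", "Revise the draft ..."))
```

Routing every revision call through one builder means no call site can silently fall back to a default.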
Best-round restoration: when the loop exhausts all rounds without passing, the output written to disk is now the highest-scoring round’s content, not the final round’s content. Before this fix, 12 historical regressions were writing worse content than round 1 as the “final” output.
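Best-round restoration reduces to an argmax over the per-round scores. The sketch below assumes the loop keeps every round's output alongside its score. The function name is illustrative.

```python
from typing import List, Tuple


def best_round(outputs: List[str], scores: List[float]) -> Tuple[int, str]:
    """Return (zero-based round index, content) of the highest-scoring round."""
    if not outputs or len(outputs) != len(scores):
        raise ValueError("outputs and scores must be non-empty and the same length")
    # max() returns the first maximum, so ties resolve to the earliest round.
    index = max(range(len(scores)), key=scores.__getitem__)
    return index, outputs[index]


index, content = best_round(["draft r1", "draft r2", "draft r3"], [7.1, 7.6, 6.9])
```

Resolving ties toward the earliest round is a design choice: when a later revision did not improve the score, the earlier content has been through fewer rewrites.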
Fixed by: Wiggum Loop (C1), Surgical Compressor (C3), cycling detection in wiggum.py
F6 — Infrastructure Failures low frequency
Model timeouts, VRAM exhaustion, and context overflow that are not caused by content but by the operational environment. Low in frequency but high in cost: an infrastructure failure blocks the pipeline entirely rather than producing a recoverable bad output.
VRAM exhaustion: running the producer, evaluator, and planner simultaneously without managing residency budgets exhausts VRAM, and the failure manifests as generation slowdown rather than a hard OOM error: the model does not crash, it generates tokens at 0.1× normal speed. This is the hardest failure class to diagnose without the Chrome Trace file, where it appears as an unusually wide synthesis block.
Context overflow: the pipeline.md wiki file was being injected
wholesale at 14.8K characters, bloating the synthesis context to 27K+ characters and
exceeding effective context length. The fix replaced wholesale injection with
gap-targeted extraction (8K cap), selectively stitching relevant sections.
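Gap-targeted extraction can be sketched as ranking sections by how often they mention the gap terms and stitching them until the cap is reached. The 8K cap is from the text. The heading-based split, the term-count ranking, and the function name are assumptions about one way to do it.

```python
import re
from typing import List


def extract_for_gaps(wiki_text: str, gap_terms: List[str], cap: int = 8000) -> str:
    """Stitch the sections most relevant to the gap terms, within a character cap."""
    # Split before every heading line; assumes the wiki file uses '#' headings.
    sections = [s.strip() for s in re.split(r"(?m)^(?=#)", wiki_text) if s.strip()]
    terms = [t.lower() for t in gap_terms]

    def hits(section: str) -> int:
        lowered = section.lower()
        return sum(lowered.count(term) for term in terms)

    picked: List[str] = []
    used = 0
    for section in sorted(sections, key=hits, reverse=True):
        if hits(section) == 0:
            break  # remaining sections mention none of the gap terms
        cost = len(section) + 2  # two characters for the joining blank line
        if used + cost > cap:
            continue  # skip sections that would overflow the budget
        picked.append(section)
        used += cost
    return "\n\n".join(picked)
```

Skipping an oversized section rather than truncating it keeps each injected section intact, at the cost of occasionally leaving budget unused.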
Academic grounding: An empirical analysis of bugs across five widely-used LLM inference engines (arXiv:2506.09713v2) identifies memory leaks, out-of-memory (OOM) errors, incorrect tensor shapes, and configuration-induced performance degradation as the four most prevalent bug classes in production deployments—exactly the failure modes the harness’s Keep-Alive Budget (A4) and Surgical Compressor (C3) were designed to prevent.
Dimension-Level Failure Signatures
The dimensional breakdown is more diagnostic than the composite score for identifying which failure class is active. The pattern is consistent across the 1,500-run dataset:
- Low Specificity + Low Groundedness → Retrieval failure (sources were too broad or too few)
- Low Completeness + High Relevance → Planning failure (found relevant content but missed the scope)
- Low Depth + High Specificity → Synthesis failure (model enumerated specifics without analyzing them)
- High Completeness + Low Groundedness → Synthesis failure (broad coverage, fabricated citations)
- Non-monotonic round-over-round scores → Revision failure (regression pattern)
- Score identical across rounds → Revision failure (cycling)
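The signatures above can be encoded as a small rule table. In the sketch below, the cut-offs for "low" and "high" and the lowercase dimension keys are assumptions for illustration, not calibrated values from the 1,500-run dataset.

```python
from typing import Dict, List, Optional

LOW, HIGH = 6.5, 7.5  # assumed cut-offs for "low" and "high"; tune on real data


def classify_failure(dims: Dict[str, float], round_scores: List[float]) -> str:
    """Map a first-round dimension breakdown and round scores to a failure class."""

    def low(name: str) -> bool:
        value: Optional[float] = dims.get(name)
        return value is not None and value < LOW

    def high(name: str) -> bool:
        value: Optional[float] = dims.get(name)
        return value is not None and value >= HIGH

    # Round-over-round patterns are checked first because they point at revision.
    if len(round_scores) >= 2:
        if len(set(round_scores)) == 1:
            return "F5 revision (cycling)"
        if any(b < a for a, b in zip(round_scores, round_scores[1:])):
            return "F5 revision (regression)"
    if low("specificity") and low("groundedness"):
        return "F1 retrieval"
    if low("completeness") and high("relevance"):
        return "F2 planning"
    if (low("depth") and high("specificity")) or (
        high("completeness") and low("groundedness")
    ):
        return "F3 synthesis"
    return "unclassified"
```

Runs that match no rule come back as "unclassified" rather than being forced into a class. F4 and F6 are absent because neither leaves a reliable dimensional signature.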
The Value of Named Failures
Before this taxonomy was written down, every bad run was diagnosed ad hoc. A low score was a symptom without a cause; a failed run was a mystery without a category. The taxonomy does not fix the failures—the patterns in subsequent posts do. But it provides the vocabulary that makes the patterns navigable: each pattern entry in the catalog specifies which failure class it addresses and by how much.
The diagnostic workflow is straightforward: load runs.jsonl into pandas, filter
for FAIL runs, look at the first-round dimensional breakdown, and map to the failure class
table. The class determines which pattern to reach for.
```python
import pandas as pd

# Load one JSON object per line and keep only the failed runs.
df = pd.read_json("data/runs.jsonl", lines=True)
fails = df[df["final"] == "FAIL"].copy()

# Extract first-round dimensions
fails["dims_r1"] = fails["wiggum_dimensions"].apply(
    lambda d: d.get("round_1", {}) if isinstance(d, dict) else {}
)

# Identify revision failures (regression pattern)
def is_regression(scores):
    if not isinstance(scores, list) or len(scores) < 2:
        return False
    return scores[-1] < scores[0]

fails["regression"] = fails["wiggum_scores"].apply(is_regression)
print(f"Revision regressions: {fails['regression'].sum()} / {len(fails)} fails")
```
The patterns in the next seven posts are the answers to what this query surfaces. Each one was derived from observing a failure class in production, diagnosing its root cause, and implementing a solution whose consequences were measured over subsequent runs.
What the Literature Leaves Open
Several questions raised by this body of research remain unresolved — and bear directly on how the harness should diagnose and respond to its own failures:
- Can autoresearch reliably detect cross-class failure trade-offs — cases where reducing the F2 planning rate increases the F1 retrieval rate, or where tighter evaluation thresholds inflate F5 revision counts — without a controlled experiment for each combination?
- Are there runtime features observable before a PASS/FAIL verdict (latency distribution, token budget consumption, retrieval hit rate) that predict which failure class is most likely, and could the harness reroute early rather than remediate late?
- How will the F6 infrastructure bug class distribution shift as the harness adopts quantized models — where memory footprint and tensor-shape behavior differ substantially from full-precision inference — and does the Keep-Alive Budget need a corresponding recalibration?
- What is the right counterfactual test for distinguishing genuine quality improvement from evaluator drift: if Wiggum scores rise but user-perceived output quality does not, which diagnostic in the taxonomy catches the discrepancy first?