A Failure Taxonomy for Agentic Systems

May 22, 2026 • 11 min read

Before a catalog of solutions, a catalog of problems. The patterns in this series each exist because something failed in a specific, diagnosable way. This post documents the six failure classes observed across 1,500 logged production runs—with the observed frequency, a representative runs.jsonl signature, and a pointer to the pattern that reduced each class’s rate.

The answer to “where do quality problems originate?” is rarely where you expect. Synthesis failures account for a smaller share than most practitioners assume. Retrieval and revision failures together account for more.

Failure Class Frequency — 1,500 Logged Runs

A run may exhibit failures in more than one class; frequencies sum to more than 100%. Retrieval and Revision failures are the most prevalent. Infrastructure failures are rare but costly — a single VRAM exhaustion event blocks the pipeline until manual intervention.

The Six Classes

F1 — Retrieval Failures high frequency

Search returned insufficient, misleading, or redundant content. The producer synthesizes from an impoverished or noisy context and produces output that fails Completeness or Groundedness in round 1.

Observed signature: r1 completeness < 6.0 with tool_calls.novelty_scores = [2, 3, 2, 3, 2]. Every score is compressed into a two-value low band, which indicates that the novelty scale has collapsed. When novelty scores cluster at {2, 3} rather than spreading across 1–10, the gate is firing but providing no differentiation. A deep-dive on 123 runs showed Specificity as the weakest first-round dimension at 6.65, consistently below Depth (6.97): specificity fails when retrieved content is too general to ground concrete claims.
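The collapse signature above can be checked mechanically. The following is a minimal sketch, not the harness's own detector: the function name and both thresholds (max_spread, low_ceiling) are assumptions chosen to match the [2, 3, 2, 3, 2] example.

```python
def novelty_scale_collapsed(scores, max_spread=1, low_ceiling=3):
    """Return True when every novelty score sits in a narrow low band.

    max_spread and low_ceiling are illustrative thresholds (assumptions),
    not values taken from the harness.
    """
    if len(scores) < 3:
        return False  # too few tool calls to judge the distribution
    spread = max(scores) - min(scores)
    return spread <= max_spread and max(scores) <= low_ceiling


print(novelty_scale_collapsed([2, 3, 2, 3, 2]))  # collapsed band
print(novelty_scale_collapsed([2, 7, 4, 9, 1]))  # healthy spread
```

A flag like this is only a prompt to inspect the run; a topic that is truly well covered in memory could produce the same low band legitimately.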

Memory contamination variant: low-quality observations re-injected from memory crowd out diverse context. Two copies of a 6.2-scoring run were found occupying two of the top-four memory retrieval slots, anchoring synthesis at the same quality ceiling. The fix deduplicates by title and soft-penalizes observations below 7.0 at retrieval time.
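One way to picture the deduplicate-and-penalize fix is the sketch below. It assumes each memory observation is a dict with title, similarity, and score keys; those field names, the penalty factor of 0.5, and the function name are hypothetical. Only the 7.0 floor and the four retrieval slots come from the text.

```python
def rerank_memory(observations, quality_floor=7.0, penalty=0.5, top_k=4):
    """Deduplicate by title, then soft-penalize low-scoring observations.

    observations: list of dicts with 'title', 'similarity', 'score'
    (hypothetical field names). penalty is an assumed multiplier.
    """
    best_by_title = {}
    for obs in observations:
        key = obs["title"].strip().lower()
        kept = best_by_title.get(key)
        # Keep only the most similar copy of each title.
        if kept is None or obs["similarity"] > kept["similarity"]:
            best_by_title[key] = obs

    def adjusted(obs):
        # Soft penalty: low-quality runs are demoted, not excluded.
        factor = penalty if obs["score"] < quality_floor else 1.0
        return obs["similarity"] * factor

    return sorted(best_by_title.values(), key=adjusted, reverse=True)[:top_k]
```

Because the penalty is multiplicative rather than a hard filter, a low-scoring observation can still surface when nothing better matches the query.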

Fixed by: Novelty Gate (B2), Planner-First (B1), Dual-Backend Memory Store (B3)

F2 — Planning Failures medium frequency

The planner generated queries that missed the task’s core requirement: too broad, too literal, or biased toward what the model already knows rather than what needs to be found. The research stage fetches real content, but none of it is what the task actually needed.

Observed signature: high novelty scores in round 1 (the content is new to the knowledge state) but low first-round composite scores despite passing the novelty gate. The planner correctly identified that nothing in memory covered the topic, then generated queries that covered the wrong aspect of it.

The review of MagenticOne’s architecture identified the fix: a closed-book prior knowledge pass before any web search, asking the producer what it already knows and what specific gaps remain. This front-loads the knowledge audit, making gap-filling queries more targeted. The planner now receives this closed-book self-assessment as context before generating queries.
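The closed-book pre-pass can be sketched as two model calls chained together. Everything here is an assumption for illustration: ask_model stands in for whatever function calls the producer or planner, and the prompt wording is invented, not the harness's actual prompts.

```python
def plan_with_closed_book(task, ask_model):
    """Run a closed-book knowledge audit, then plan gap-filling queries.

    ask_model: any callable taking a prompt string and returning a string
    (a hypothetical stand-in for the real model call).
    """
    audit_prompt = (
        "Without searching, list what you already know about the task "
        "and the specific gaps that remain.\n\nTask: " + task
    )
    self_assessment = ask_model(audit_prompt)

    planner_prompt = (
        "Task: " + task + "\n\n"
        "Closed-book self-assessment:\n" + self_assessment + "\n\n"
        "Write search queries that target only the listed gaps, one per line."
    )
    raw = ask_model(planner_prompt)
    # One query per non-empty line of the planner's reply.
    return [line.strip() for line in raw.splitlines() if line.strip()]
```

Passing the model call in as a parameter keeps the sketch testable with a stub and independent of any particular backend.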

Fixed by: Planner-First (B1) with closed-book pre-pass, Novelty Gate (B2)

F3 — Synthesis Failures medium frequency

The producer hallucinated, hedged excessively, or lost the thread of the task. Fluent output that fails on Groundedness, Specificity, or Depth in ways that neither the retrieval pipeline nor the evaluation loop fully prevent.

Count-check retry: enumerated tasks (tasks asking for a specific number of items) trigger count_check_retry in 25% of runs, with a −0.39 score penalty when the retry fires. The retry indicates the model produced an output with the wrong count; the penalty reflects that retried outputs are often shallower than first-pass outputs on non-count dimensions—the model fills the count at the cost of depth.
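A count check of this kind could look like the sketch below. The harness's real count_check_retry logic is not shown in this post, so the regular expression and both helper names are assumptions: the check simply counts numbered or bulleted lines and compares the total with the requested number.

```python
import re

# Matches lines such as "1. item", "2) item", "- item", "* item".
ITEM_LINE = re.compile(r"^\s*(?:\d+[.)]|[-*])\s+\S")


def count_items(output):
    """Count list-style lines in a model output."""
    return sum(1 for line in output.splitlines() if ITEM_LINE.match(line))


def needs_count_retry(expected, output):
    """True when the output does not contain the requested number of items."""
    return count_items(output) != expected


draft = "1. Retrieval\n2. Planning\n3. Synthesis\n"
print(needs_count_retry(5, draft))
```

A line-based count will miss items written as running prose, so a production check would likely need a more forgiving parser.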

Synthesis instruction effect: a controlled experiment showed that the prose_depth synthesis instruction produces more grounded outputs (delta=+1.223 vs baseline) but does not yet clear the 1.5-point significance bar. The effect is positive but smaller than expected, which suggests that the production ceiling for this task type is bounded by retrieval quality, not synthesis instruction quality.

Fixed by: Planner-First (B1), Semantic Chunker (B4), Surgical Compressor (C3)

F4 — Evaluation Failures medium frequency

The evaluator scored a low-quality output above threshold, or a high-quality output below it. Evaluation failures are insidious because they are invisible in runs.jsonl—a false pass looks identical to a true pass.

Evaluator hallucination: the evaluator claims content is absent when it is present, causing the producer to add redundant coverage in revision. Most common on Completeness, where the evaluator must reason about absence rather than presence.

Evaluator drift: scoring calibration shifts across Ollama version updates or quantization changes. A model well-calibrated at deployment may inflate or deflate its scores months later. Without detection, a drifted evaluator converts true-fail runs into logged PASSes—degrading the training data used by the data flywheel.
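One possible drift detector is to re-score a fixed set of reference outputs after every model or runtime update and compare against the scores recorded at deployment. The post does not describe the harness's actual mechanism, so the sketch below is an assumption throughout, including the 0.5-point tolerance.

```python
from statistics import mean


def drift_report(baseline, current, tolerance=0.5):
    """Compare evaluator scores on a fixed anchor set.

    baseline, current: dicts mapping anchor id -> score (hypothetical
    structure). tolerance is an assumed alert threshold in score points.
    """
    shared = sorted(set(baseline) & set(current))
    if not shared:
        raise ValueError("no shared anchors to compare")
    shift = mean(current[k] - baseline[k] for k in shared)
    return {
        "n": len(shared),
        "mean_shift": round(shift, 3),
        "drifted": abs(shift) > tolerance,
    }


at_deploy = {"a": 7.0, "b": 6.0, "c": 8.0}
after_update = {"a": 7.8, "b": 6.9, "c": 8.7}
print(drift_report(at_deploy, after_update))
```

A positive mean shift on unchanged anchors points to inflation in the evaluator rather than improvement in the producer.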

Self-evaluation bias: outputs rated by the same model that produced them average 0.9 points higher than those rated by a separate evaluator using the same rubric. The gap is largest on Groundedness.

Academic grounding: A systematic evaluation of 13 LLM judges (arXiv:2406.12624v6) found that only the largest models achieve reasonable alignment with human evaluators—yet still lag inter-human agreement by up to five points. A companion prompt-sensitivity study found that eight of nine judges show always-A position bias under semantically equivalent prompt paraphrases. The harness’s Model Role Separation (A2) and Evaluator Pool (A3) patterns are direct responses to both findings.

Fixed by: Model Role Separation (A2), Evaluator Pool (A3), Dimensional Rubric (C2)

F5 — Revision Failures high frequency

The producer failed to act on dimensional feedback, acted on it superficially, or improved one dimension while degrading another. Analysis of 57 multi-round Wiggum runs found that 12 (21%) showed revision regression—the final score lower than the round-1 score.

Cycling: a producer presented with the same dimensional feedback across two consecutive rounds produces outputs with identical scores and nearly identical dimension breakdowns. Cycling detection was added after observing runs stuck at a plateau; stopping them early saves approximately 1,300 seconds of wasted inference per stuck run.
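A cycling check along these lines compares the last two rounds. This is a sketch, not the code in wiggum.py: the round structure (a dict with score and dimensions keys) and the 0.05 tolerance are assumptions.

```python
def is_cycling(rounds, tol=0.05):
    """True when the last two rounds have near-identical scores and dimensions.

    rounds: list of dicts like {"score": 6.8, "dimensions": {"depth": 6.5}}
    (hypothetical structure). tol is an assumed tolerance in score points.
    """
    if len(rounds) < 2:
        return False
    prev, last = rounds[-2], rounds[-1]
    if abs(last["score"] - prev["score"]) > tol:
        return False
    dims = set(prev["dimensions"]) | set(last["dimensions"])
    return all(
        abs(last["dimensions"].get(d, 0.0) - prev["dimensions"].get(d, 0.0)) <= tol
        for d in dims
    )
```

Calling this after each evaluation lets the loop exit as soon as a plateau appears instead of spending the remaining rounds.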

Context overflow: long documents exceed the producer’s effective context window during revision. The model truncates mid-sentence, producing partial outputs that score worse than the original. The root cause is often a hardcoded num_ctx in the model’s Modelfile that the runtime override does not supersede. Fix: explicit num_predict and num_ctx overrides on every revision call site.
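For an Ollama backend, explicit overrides can be passed in the options field of each generate request. The sketch below builds such a request. The numeric values are placeholders rather than the harness's settings, and the helper names are hypothetical; the endpoint shown is Ollama's default local address.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def revision_payload(model, prompt, num_ctx=16384, num_predict=4096):
    """Build a generate request with explicit context and output limits.

    num_ctx and num_predict defaults are placeholder values (assumptions).
    """
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx, "num_predict": num_predict},
    }


def send(payload, url=OLLAMA_URL, timeout=600):
    """POST the payload to a running Ollama server and return the parsed reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())
```

Routing every revision call through one payload builder makes it harder for a single call site to fall back silently to a default context size.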

Best-round restoration: when the loop exhausts all rounds without passing, the output written to disk is now the highest-scoring round’s content, not the final round’s content. Before this fix, 12 historical regressions were writing worse content than round 1 as the “final” output.
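Best-round restoration reduces to picking the maximum-scoring round once the loop ends without a pass. The sketch below assumes rounds are kept as (score, content) pairs, which is a hypothetical representation.

```python
def best_round(rounds):
    """Return (index, score, content) of the highest-scoring round.

    rounds: list of (score, content) tuples in round order (assumed shape).
    Ties resolve to the earliest round.
    """
    if not rounds:
        raise ValueError("no rounds to choose from")
    index = max(range(len(rounds)), key=lambda i: rounds[i][0])
    score, content = rounds[index]
    return index, score, content


history = [(6.8, "round 1 draft"), (6.5, "round 2 draft"), (6.1, "round 3 draft")]
print(best_round(history))
```

In this regression case the round-1 draft is written to disk rather than the weaker final round.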

Fixed by: Wiggum Loop (C1), Surgical Compressor (C3), cycling detection in wiggum.py

F6 — Infrastructure Failures low frequency

Model timeouts, VRAM exhaustion, and context overflow that are not caused by content but by the operational environment. Low in frequency but high in cost: an infrastructure failure blocks the pipeline entirely rather than producing a recoverable bad output.

VRAM exhaustion: running the producer, evaluator, and planner simultaneously without managing residency budgets causes out-of-memory conditions that surface as generation slowdown rather than hard errors. The model does not crash; it generates tokens at 0.1× normal speed. This is the hardest failure class to diagnose without the Chrome Trace file, where it appears as an unusually wide synthesis block.
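The slowdown can also be caught from the generation statistics returned with each response. Ollama's generate responses include eval_count (tokens produced) and eval_duration (nanoseconds), which give a tokens-per-second figure. The baseline rate and the 0.25 alert ratio below are assumptions, as are the function names.

```python
def tokens_per_second(response):
    """Compute generation speed from eval_count and eval_duration (ns)."""
    duration_s = response.get("eval_duration", 0) / 1e9
    if duration_s <= 0:
        return None  # no timing data available
    return response.get("eval_count", 0) / duration_s


def is_degraded(response, baseline_tps, ratio=0.25):
    """True when speed falls below an assumed fraction of the baseline."""
    tps = tokens_per_second(response)
    return tps is not None and tps < baseline_tps * ratio


slow = {"eval_count": 200, "eval_duration": 50_000_000_000}
print(tokens_per_second(slow), is_degraded(slow, baseline_tps=40.0))
```

Logging this figure per call gives a numeric counterpart to the wide synthesis block visible in the trace.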

Context overflow: the pipeline.md wiki file was being injected wholesale at 14.8K characters, bloating the synthesis context to 27K+ characters and exceeding effective context length. The fix replaced wholesale injection with gap-targeted extraction (8K cap), selectively stitching relevant sections.
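Gap-targeted extraction can be approximated by splitting the wiki file into sections, ranking them by how often they mention the gap terms, and stitching the best ones together under the cap. The sketch assumes Markdown-style headings in pipeline.md and simple keyword counting; the harness's real relevance scoring is not described here, and the function name is hypothetical.

```python
import re


def extract_for_gaps(wiki_text, gap_terms, cap=8000):
    """Stitch the sections most relevant to gap_terms, within a character cap.

    Assumes Markdown '#' headings delimit sections. Relevance is a plain
    keyword count, which is an illustrative stand-in.
    """
    sections = [s for s in re.split(r"(?m)^(?=#{1,6} )", wiki_text) if s.strip()]
    terms = [t.lower() for t in gap_terms]

    def hits(section):
        low = section.lower()
        return sum(low.count(t) for t in terms)

    # Highest hit count first; original order breaks ties.
    ranked = sorted(
        ((hits(s), i, s) for i, s in enumerate(sections)),
        key=lambda item: (-item[0], item[1]),
    )
    picked, used = [], 0
    for score, idx, section in ranked:
        if score == 0 or used + len(section) > cap:
            continue
        picked.append((idx, section))
        used += len(section)
    # Restore document order so the stitched text reads naturally.
    return "".join(section for _, section in sorted(picked))
```

Sections with no matching terms are never injected, so an unrelated wiki page contributes nothing to the synthesis context.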

Academic grounding: An empirical analysis of bugs across five widely-used LLM inference engines (arXiv:2506.09713v2) identifies memory leaks, out-of-memory (OOM) errors, incorrect tensor shapes, and configuration-induced performance degradation as the four most prevalent bug classes in production deployments—exactly the failure modes the harness’s Keep-Alive Budget (A4) and Surgical Compressor (C3) were designed to prevent.

Fixed by: Keep-Alive Budget (A4), Semantic Chunker (B4), Surgical Compressor (C3)

Dimension-Level Failure Signatures

Mean First-Round Score by Dimension (n=1,500 runs)

Specificity is the weakest first-round dimension (6.65), consistently below Depth (6.97). This is a retrieval signal, not a synthesis signal: the model cannot be specific about things it was not given specific sources for.

The dimensional breakdown is more diagnostic than the composite score for identifying which failure class is active. The pattern is consistent across the 1,500-run dataset:

Low Specificity + Low Groundedness → Retrieval failure (sources were too broad or too few)
Low Completeness + High Relevance → Planning failure (found relevant content but missed the scope)
Low Depth + High Specificity → Synthesis failure (model enumerated specifics without analyzing them)
High Completeness + Low Groundedness → Synthesis failure (broad coverage, fabricated citations)
Non-monotonic round-over-round scores → Revision failure (regression pattern)
Score identical across rounds → Revision failure (cycling)
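The signature table above can be turned into a small lookup. The sketch assumes lowercase dimension keys and cutoffs of 6.5 for "low" and 7.5 for "high"; the post does not define those thresholds, so treat them as placeholders.

```python
def classify_failure(dims, round_scores, low=6.5, high=7.5, tol=0.05):
    """Map first-round dimensions and round scores to candidate failure classes.

    dims: dict of dimension name -> first-round score (assumed key names).
    low, high, tol are illustrative thresholds, not harness values.
    """
    def is_low(name):
        value = dims.get(name)
        return value is not None and value < low

    def is_high(name):
        value = dims.get(name)
        return value is not None and value >= high

    labels = []
    if is_low("specificity") and is_low("groundedness"):
        labels.append("F1 retrieval")
    if is_low("completeness") and is_high("relevance"):
        labels.append("F2 planning")
    if is_low("depth") and is_high("specificity"):
        labels.append("F3 synthesis (specifics without analysis)")
    if is_high("completeness") and is_low("groundedness"):
        labels.append("F3 synthesis (fabricated citations)")
    if len(round_scores) >= 2:
        if all(abs(s - round_scores[0]) <= tol for s in round_scores[1:]):
            labels.append("F5 revision (cycling)")
        elif any(b < a for a, b in zip(round_scores, round_scores[1:])):
            labels.append("F5 revision (regression)")
    return labels
```

A run can match more than one rule, which is consistent with the frequency chart summing to more than 100%.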

The Value of Named Failures

Before this taxonomy was written down, every bad run was diagnosed ad hoc. A low score was a symptom without a cause; a failed run was a mystery without a category. The taxonomy does not fix the failures—the patterns in subsequent posts do. But it provides the vocabulary that makes the patterns navigable: each pattern entry in the catalog specifies which failure class it addresses and by how much.

The diagnostic workflow is straightforward: load runs.jsonl into pandas, filter for FAIL runs, look at the first-round dimensional breakdown, and map to the failure class table. The class determines which pattern to reach for.

import pandas as pd

df = pd.read_json("data/runs.jsonl", lines=True)
fails = df[df["final"] == "FAIL"].copy()

# Extract first-round dimensions
fails["dims_r1"] = fails["wiggum_dimensions"].apply(
    lambda d: d.get("round_1", {}) if isinstance(d, dict) else {}
)

# Identify revision failures (regression pattern)
def is_regression(scores):
    if not isinstance(scores, list) or len(scores) < 2:
        return False
    return scores[-1] < scores[0]

fails["regression"] = fails["wiggum_scores"].apply(is_regression)
print(f"Revision regressions: {fails['regression'].sum()} / {len(fails)} fails")

The patterns in the next seven posts are the answers to what this query surfaces. Each one was derived from observing a failure class in production, diagnosing its root cause, and implementing a solution whose consequences were measured over subsequent runs.

Where This Sits Against MAST

The six classes above were derived from one system’s logs, which raises the obvious question: do they generalize? The closest academic analog is MAST, the Multi-Agent System Failure Taxonomy (arXiv:2503.13657), the first taxonomy of failure modes in multi-agent AI systems. It was built from MAST-Data, a corpus of more than 1,600 annotated execution traces collected across seven popular multi-agent frameworks running GPT-4, Claude 3, Qwen2.5, and CodeLlama backends. Its three top-level categories, validated with high inter-annotator agreement (κ = 0.88), are specification issues, inter-agent misalignment, and task verification failures.

The two taxonomies map cleanly where the architectures overlap:

Specification issues ↔ F2 Planning failures. MAST’s ambiguous-task and role-definition failures are the multi-agent generalization of the planner generating queries that miss the task’s core requirement. The closed-book pre-pass in Planner-First (B1) is a specification fix.
Task verification failures ↔ F4 Evaluation failures. MAST’s missing or shallow verification maps directly onto evaluator hallucination, drift, and self-evaluation bias. Model Role Separation (A2) and the Evaluator Pool (A3) exist because verification by the producing agent is unreliable—MAST reaches the same conclusion at the inter-agent level.
Inter-agent misalignment has no analog in a single-pipeline harness: it emerges only when multiple agents hold divergent task state. It becomes relevant the moment the harness scales to multi-agent DAG execution, which is why worktree isolation and DAG dependency contracts appear in the post covering multi-agent execution.
F1, F5, and F6 sit below MAST’s level of analysis. Retrieval, revision, and infrastructure failures are per-agent substrate problems. MAST assumes each agent’s pipeline works and asks what breaks between agents; this taxonomy asks what breaks inside one.

MAST’s headline finding is worth stating plainly: across frameworks and models, the authors attribute the bulk of failures to system design rather than model capability—the improvement headroom is in the scaffolding. That is the harness thesis, arrived at independently from 1,600 traces of other people’s systems.

Academic grounding: The production literature corroborates from several angles. A system-level taxonomy of LLM applications (arXiv:2511.19933) catalogs fifteen hidden failure modes invisible to knowledge-and-reasoning benchmarks. A study of agentic systems in production (arXiv:2605.01604) identifies seven production-specific failure modes and finds that ROUGE, BERTScore, and accuracy fail to detect four of them entirely—and catch the remaining three only after a multi-cycle lag, which is the argument for dimensional rubrics over surface metrics. And Aegis (arXiv:2508.19504) shows that fixing the environment alone—observability, computation offloading, speculative actions—raises agent success rates 6.7–12.5% with zero changes to the agent or model.

What the Literature Leaves Open

Several questions raised by this body of research remain unresolved — and bear directly on how the harness should diagnose and respond to its own failures:

Can autoresearch reliably detect cross-class failure trade-offs — cases where reducing the F2 planning rate increases the F1 retrieval rate, or where tighter evaluation thresholds inflate F5 revision counts — without a controlled experiment for each combination?
Are there runtime features observable before a PASS/FAIL verdict (latency distribution, token budget consumption, retrieval hit rate) that predict which failure class is most likely, and could the harness reroute early rather than remediate late?
How will the F6 infrastructure bug class distribution shift as the harness adopts quantized models — where memory footprint and tensor-shape behavior differ substantially from full-precision inference — and does the Keep-Alive Budget need a corresponding recalibration?
What is the right counterfactual test for distinguishing genuine quality improvement from evaluator drift: if Wiggum scores rise but user-perceived output quality does not, which diagnostic in the taxonomy catches the discrepancy first?
