
The Wiggum Loop: Cross-Model Evaluation and Dimensional Revision

The first time you run a research agent without an evaluator, you will probably be impressed. The output will be long. It will be fluent. It will have headings and bullet points and citations and a confident, authoritative tone that reads like something a competent analyst spent an afternoon on.

It will also, in a meaningful fraction of cases, be wrong—not in the way a hallucination is wrong, where a model fabricates a fact that is obviously false on inspection, but in the way that a first draft is wrong: structurally correct but thin in the places that matter, confident about things it should hedge, and silent about the gaps a domain expert would immediately notice. The model does not know that it does not know.

The problem compounds when you ask it to evaluate itself. In controlled experiments, outputs rated by the same model that produced them average 0.9 points higher on a six-dimension quality rubric than outputs rated by a separate evaluator model using the same rubric. The gap is largest on the Groundedness dimension—because a model that invented a claim is also the most likely to believe it is grounded.

Research backing: Self-preference bias is well characterized in the literature. A 2026 study across 20 mainstream LLMs found that advanced model capabilities are “often uncorrelated or even negatively correlated with low self-preference bias”: larger models are not more reliable self-judges. A structured multi-dimensional evaluation strategy reduces self-preference bias by an average of 31.5%, which supports the Dimensional Rubric approach (arXiv:2604.22891). The bias has been traced mechanistically to a preference for familiar, low-perplexity text (arXiv:2410.21819): a model rates its own fluent output highly regardless of factual accuracy. Critically, stronger models exhibit more pronounced harmful self-preference precisely when they are wrong (arXiv:2504.03846). They struggle most to recognize their own errors, which makes cross-model evaluation not just a quality practice but a necessary error-correction mechanism.

There are two loops that address this problem. They operate at different levels and compose naturally. Understanding the distinction between them is a prerequisite for understanding either one.

Posts 1–11 — May 22, 2026

Eleven posts on the complete agentic harness architecture, from first principles to hardware.

  1. The Wiggum Loop: Cross-Model Evaluation and Dimensional Revision. The producer–evaluator separation pattern; how Ralph and Wiggum loops compose; score trajectory data from 1,500 production runs.
  2. The Harness Thesis: Why Scaffolding Beats Model Selection. The empirical case for scaffolding over model choice; the five-stage pipeline; the silent overwrite failure that started it all.
  3. The Pipeline in Motion: Tracing a Task Through All Eleven Subsystems. A single research query traced from CLI entry through decomposition, retrieval, synthesis, evaluation, and JSONL persistence.
  4. A Failure Taxonomy for Agentic Systems. Six failure classes (retrieval, synthesis, planning, evaluation, revision, infrastructure) ranked by frequency and cost, with real data from 1,500 runs.
  5. Inference Patterns: The Substrate Layer. Inference Shim, Model Role Separation, Evaluator Pool, and Keep-Alive Budget: the substrate patterns prerequisite to everything else.
  6. Context Engineering: What Reaches the Model. Planner-First, Novelty Gate, Dual-Backend Memory Store, Semantic Chunker, and Dynamic Context Injection.
  7. Verification Patterns: Measuring and Improving Output Quality. Dimensional Rubric, Surgical Compressor, and ReAct Comparator: the three patterns that close the quality loop.
  8. Orchestration Patterns: Scaling to Multi-Agent Execution. DAG Orchestrator, Worktree Context, MCP Dispatch Router, and Skill Registry for coordinating multiple agents.
  9. Security Patterns: Constraining the Agent. AST Guard, Path Sandbox, Injection Scanner, and CDP Guard, all implemented with the Python stdlib.
  10. Observability and the Data Flywheel. RunTrace, JSONL Audit Log, Chrome Trace Exporter, Data Flywheel, RL Rollout, and Literature Review Pipeline.
  11. Parallel Inference: Hardware Substrates for LLM Workloads. GPU memory hierarchy, SM architecture, llama.cpp, vLLM, GGUF quantization, and KV cache pressure.

Two Loops: Ralph and Wiggum

Ralph Loop (Outer)

Geoffrey Huntley's Ralph pattern is a bash while-loop that feeds PROMPT.md to an LLM CLI, persists the evolving plan on disk as shared state across isolated context windows, and commits progress after each iteration. Each pass starts cold—no accumulated context, just the plan file.

  • Governs: should the agent keep going?
  • Scope: task-level iteration
  • Cycle driver: completion detection or plan state
  • Commit on each pass

Wiggum Loop (Inner)

The Wiggum Loop operates inside one agent iteration, after synthesis and before the Ralph outer loop decides whether to commit. A separate evaluator model scores the synthesis on six named dimensions. Dimensional feedback—not a composite score—routes back to the producer for targeted revision.

  • Governs: is this output good enough to accept?
  • Scope: output-quality gating within one iteration
  • Cycle driver: 8.0 composite threshold, max 3 rounds
  • Persist only on pass

The two loops compose cleanly: Ralph decides whether the task is complete; Wiggum decides whether the current iteration’s output clears the quality bar. A production pipeline typically runs Wiggum inside every non-trivial synthesis step, then returns the pass/fail result to the Ralph outer loop, which decides whether to continue, revise the plan, or commit.

Ralph and Wiggum: Nested Loop Architecture
Ralph (outer) iterates across context windows, committing plan progress. Wiggum (inner) gates quality within each synthesis pass before Ralph can proceed.
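The nesting can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the harness code: `ralph_loop`, `GateResult`, and every callable parameter are hypothetical names, and the choice to commit only when the gate passes is one reading of how the two loops hand off.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GateResult:
    """Verdict returned by the inner (Wiggum) quality gate."""
    passed: bool
    score: float


def ralph_loop(
    run_iteration: Callable[[str], str],                 # plan -> draft, fresh context each call
    wiggum_gate: Callable[[str], GateResult],            # draft -> verdict from a separate evaluator
    update_plan: Callable[[str, str, GateResult], str],  # returns the next plan text
    is_complete: Callable[[str], bool],
    commit: Callable[[str, str], None],
    plan: str,
    max_iterations: int = 10,
) -> str:
    """Outer loop: keep iterating on the plan until it is complete."""
    for _ in range(max_iterations):
        if is_complete(plan):
            break
        draft = run_iteration(plan)      # only the plan carries over between passes
        verdict = wiggum_gate(draft)     # inner loop gates this pass's output
        if verdict.passed:
            commit(plan, draft)          # assumption: commit only accepted output
        plan = update_plan(plan, draft, verdict)
    return plan
```

Passing the verdict into `update_plan` lets the outer loop record a failed gate in the plan file instead of silently retrying the same step.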

The Wiggum Loop in Detail

Named affectionately for Chief Clancy Wiggum of Springfield—a figure who catches most problems, misses some, and is constitutionally unable to evaluate his own blind spots—the loop has four components:

Wiggum Loop — Evaluate–Revise Cycle
Maximum three evaluation rounds. 75% of achievable quality improvement occurs in round 1; 18% in round 2. Beyond three rounds, marginal gain falls below measurement noise.

Implementation

The loop entry point in harness/wiggum.py:

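from pathlib import Path

# ModelConfig, RunTrace, WiggumResult, _evaluate and _revise are not shown here.

PASS_THRESHOLD = 8.0  # composite score required to accept an output
MAX_ROUNDS     = 3    # evaluation rounds before giving up

def loop(
    task: str,
    output_path: str,
    models: ModelConfig,
    parent_trace: RunTrace,
) -> WiggumResult:
    content = Path(output_path).read_text(encoding="utf-8")
    round_scores: list[float] = []
    # Seeding with the run ID pins one evaluator for every round of this run.
    evaluator = select_evaluator(seed=parent_trace.run_id)

    for round_num in range(1, MAX_ROUNDS + 1):
        parent_trace.enter_stage(f"wiggum_round_{round_num}")
        result = _evaluate(content, task, evaluator, models)
        round_scores.append(result["score"])

        if result["score"] >= PASS_THRESHOLD:
            # Accepted: write the (possibly revised) content back and stop.
            Path(output_path).write_text(content, encoding="utf-8")
            return WiggumResult(passed=True, final_score=result["score"],
                                round_scores=round_scores, rounds_taken=round_num)
        # Revise only while another round remains, so every revision gets scored.
        if round_num < MAX_ROUNDS:
            content = _revise(content, task, result, models.producer)

    # Rounds exhausted: the last scored draft is written back with passed=False.
    Path(output_path).write_text(content, encoding="utf-8")
    return WiggumResult(passed=False, final_score=round_scores[-1],
                        round_scores=round_scores, rounds_taken=MAX_ROUNDS)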

The Key Insight: Dimensions Beat Scores

The loop’s most important output is not the composite score. It is the dimensional breakdown. The evaluator scores six named dimensions independently (Relevance, Completeness, Depth, Specificity, Structure, and Groundedness) before computing a composite.

Here is a representative two-round run from runs.jsonl, showing how the dimensional breakdown exposes exactly what revision must fix:

{
  "wiggum_rounds": 2,
  "wiggum_scores": [6.4, 8.3],
  "wiggum_dimensions": {
    "round_1": {
      "relevance": 8.0, "completeness": 5.5, "depth": 6.0,
      "specificity": 6.5, "structure": 7.5, "groundedness": 5.0
    },
    "round_2": {
      "relevance": 8.5, "completeness": 8.0, "depth": 8.0,
      "specificity": 8.5, "structure": 8.5, "groundedness": 8.0
    }
  }
}

The composite scores (6.4 → 8.3) tell you revision worked. The dimensional breakdown tells you what it fixed. This run had strong structure and relevance in round 1 but failed on coverage and sourcing. The revision prompt highlights only Completeness and Groundedness—not all six dimensions. Presenting the producer with a six-item checklist reliably causes superficial changes across all dimensions while missing the depth needed to fix the ones that actually failed.
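A minimal sketch of how a focused revision prompt could be assembled from the dimensional scores. The helper names, the "two weakest dimensions" rule, the prompt wording, and the unweighted-mean composite are all assumptions made for illustration; the unweighted mean does reproduce the logged composites above to within rounding (about 6.42 and 8.25).

```python
from statistics import mean


def composite(dimensions: dict[str, float]) -> float:
    """Unweighted mean of the dimension scores (assumed aggregation)."""
    return mean(dimensions.values())


def focus_dimensions(dimensions: dict[str, float], k: int = 2) -> list[str]:
    """Names of the k lowest-scoring dimensions, weakest first."""
    return sorted(dimensions, key=dimensions.get)[:k]


def revision_prompt(task: str, dimensions: dict[str, float],
                    feedback: dict[str, str], k: int = 2) -> str:
    """Build a prompt that mentions only the weakest dimensions."""
    lines = [f"Task: {task}", "Revise the draft to fix only these weaknesses:"]
    for name in focus_dimensions(dimensions, k):
        note = feedback.get(name, "no evaluator comment recorded")
        lines.append(f"- {name} ({dimensions[name]:.1f}): {note}")
    lines.append("Leave sections that are not mentioned above unchanged.")
    return "\n".join(lines)


round_1 = {
    "relevance": 8.0, "completeness": 5.5, "depth": 6.0,
    "specificity": 6.5, "structure": 7.5, "groundedness": 5.0,
}
# Hypothetical evaluator comments, keyed by dimension name.
feedback = {
    "completeness": "two of the five sub-questions are not covered",
    "groundedness": "several claims have no supporting source",
}
prompt = revision_prompt("Summarize recent work on X", round_1, feedback)
```

The final instruction line is one way to discourage the broad, shallow edits that a six-item checklist tends to produce.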

Deterministic Evaluator Selection

The evaluator is never chosen arbitrarily. select_evaluator() hashes the run ID and maps it to a position in the pool:

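import hashlib
import os

# Comma-separated model names; blank entries (unset variable, trailing comma) are dropped.
EVALUATOR_POOL = [m.strip() for m in os.getenv("HARNESS_EVALUATOR_POOL", "").split(",") if m.strip()]

def select_evaluator(seed: str) -> str:
    if not EVALUATOR_POOL:
        return DEFAULT_EVALUATOR  # fallback constant, not shown here
    # MD5 serves only as a stable, non-cryptographic hash of the run ID.
    idx = int(hashlib.md5(seed.encode()).hexdigest(), 16) % len(EVALUATOR_POOL)
    return EVALUATOR_POOL[idx]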

The hash-based selection ensures the same run always uses the same evaluator (reproducibility), while different runs are distributed across the pool (bias dilution). A model that consistently over-scores Groundedness by 0.5 points will inflate the pass rate if used exclusively; distributed across a four-model pool, the bias becomes variance that is visible in the analytics and correctable.
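Both properties are easy to check empirically. The sketch below reuses the same hash-and-modulo rule but takes the pool as an argument so it can run standalone; the `pick` helper, the pool names, and the run IDs are made up for illustration.

```python
import hashlib
from collections import Counter


def pick(seed: str, pool: list[str]) -> str:
    """Same rule as select_evaluator(), with the pool passed in explicitly."""
    idx = int(hashlib.md5(seed.encode()).hexdigest(), 16) % len(pool)
    return pool[idx]


pool = ["eval-a", "eval-b", "eval-c", "eval-d"]  # hypothetical model names

# Reproducibility: the same run ID always maps to the same evaluator.
same_run = {pick("run-0001", pool) for _ in range(5)}

# Bias dilution: many run IDs spread across the whole pool.
counts = Counter(pick(f"run-{i:04d}", pool) for i in range(1000))
for name in pool:
    print(name, counts[name])
```

One consequence of the modulo step: changing the pool size remaps most run IDs to a different evaluator, so a re-run is only reproducible against the pool it was originally logged with.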

The Numbers

Across 1,500 logged runs, the loop raises mean output quality from 6.87 (single-pass) to 8.12 (post-loop)—a gain of 1.25 points—and converts a 61% first-pass rate into an 89% final-pass rate. The round-by-round trajectory is consistent:

Score Trajectory Across Evaluation Rounds (n=1,500 runs)
Mean and ±1 std dev across all runs that reached each round. Declining std dev in round 3 reflects that surviving runs are increasingly difficult to improve, not that improvement is easier.
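Headline figures of this kind can be recomputed from the run log. The sketch below assumes each line of runs.jsonl carries a `wiggum_scores` list like the record shown earlier, and that the last entry is the score of the output that was finally written; `summarize` is a hypothetical helper, not part of the harness.

```python
import json
from statistics import mean

PASS_THRESHOLD = 8.0


def summarize(lines: list[str]) -> dict[str, float]:
    """Aggregate first-round and final-round scores across logged runs."""
    runs = [json.loads(line) for line in lines if line.strip()]
    first = [r["wiggum_scores"][0] for r in runs]
    final = [r["wiggum_scores"][-1] for r in runs]
    n = len(runs)
    return {
        "mean_first": mean(first),
        "mean_final": mean(final),
        "gain": mean(final) - mean(first),
        "first_pass_rate": sum(s >= PASS_THRESHOLD for s in first) / n,
        "final_pass_rate": sum(s >= PASS_THRESHOLD for s in final) / n,
    }


# Three made-up records standing in for the real log.
sample = [
    '{"wiggum_scores": [6.4, 8.3]}',
    '{"wiggum_scores": [8.5]}',
    '{"wiggum_scores": [6.8, 7.2, 7.6]}',
]
stats = summarize(sample)
```

Against the real log, the same function would be called as `summarize(Path("runs.jsonl").read_text().splitlines())`.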

The Revision Step

Temperature during revision is 0.3 by design. At temperature 0.0 the producer settles into the nearest local optimum it can reach deterministically, which in practice means minimal changes. Temperature 0.3 introduces enough variation for meaningful structural revision while maintaining coherence.

The revision compression mode is distinct from evaluation compression. Before evaluation, all second-level headings are preserved verbatim and bodies are condensed (the evaluator must assess structural coverage). Before revision, sections referenced in the evaluator’s feedback are preserved verbatim and everything else is condensed (the producer must see exactly what it wrote in the criticized passage).
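The two modes can be illustrated with a small sketch. It assumes the draft is Markdown with `## ` second-level headings, and it uses plain truncation as a stand-in for whatever condensing the real compressor does; all function names and the `limit` parameter are hypothetical.

```python
import re


def split_sections(markdown: str) -> list[tuple[str, str]]:
    """Split on second-level headings into (heading, body) pairs."""
    parts = re.split(r"(?m)^(## .*)$", markdown)
    # parts[0] is any preamble; after that, headings and bodies alternate.
    sections = [("", parts[0])] if parts[0].strip() else []
    sections += [(parts[i], parts[i + 1]) for i in range(1, len(parts), 2)]
    return sections


def condense(body: str, limit: int) -> str:
    """Stand-in condenser: collapse whitespace and truncate."""
    text = " ".join(body.split())
    return text if len(text) <= limit else text[:limit].rstrip() + " [...]"


def compress(markdown: str, keep_verbatim: frozenset[str] = frozenset(),
             limit: int = 80) -> str:
    out = []
    for heading, body in split_sections(markdown):
        title = heading.removeprefix("## ").strip()
        kept = body.strip() if title in keep_verbatim else condense(body, limit)
        out.append("\n".join(part for part in (heading, kept) if part))
    return "\n\n".join(out)


def compress_for_evaluation(markdown: str, limit: int = 80) -> str:
    """Every heading survives verbatim; every body is condensed."""
    return compress(markdown, limit=limit)


def compress_for_revision(markdown: str, criticized: list[str],
                          limit: int = 80) -> str:
    """Sections named in the evaluator feedback are kept word for word."""
    return compress(markdown, keep_verbatim=frozenset(criticized), limit=limit)


doc = "## Methods\n" + "m " * 100 + "\n## Results\n" + "r " * 100
eval_view = compress_for_evaluation(doc, limit=20)
rev_view = compress_for_revision(doc, ["Results"], limit=20)
```

Sharing one `compress` function between the modes keeps the heading handling identical, so the evaluator and the producer always see the same section outline.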

Failure Modes of the Loop Itself

The Wiggum Loop catches most quality problems. It is named for a known-imperfect inspector because that is an accurate description of what it is.

Revision regression. The producer improves a failing dimension while degrading a passing one. A document with strong Depth but weak Groundedness may emerge from revision with improved citations but condensed analytical sections. The composite trajectory can hide this: a run that scores 6.8 → 7.2 → 7.6 looks like steady progress, yet it is still below threshold after three rounds, and the document is materially different from round 1 in ways that are not all improvements. The regression shows up in the per-dimension scores, which move non-monotonically even when the composite climbs.
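One way to surface this failure mode is to diff the per-dimension scores between consecutive rounds and flag anything that dropped. The sketch below is illustrative: the `regressions` helper and its `tolerance` value are assumptions, not harness parameters.

```python
def regressions(before: dict[str, float], after: dict[str, float],
                tolerance: float = 0.25) -> dict[str, tuple[float, float]]:
    """Dimensions whose score fell by more than `tolerance` between rounds."""
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and after[name] < before[name] - tolerance
    }


# Made-up scores matching the scenario above: citations improve, analysis thins out.
round_1 = {"depth": 8.0, "groundedness": 5.0, "structure": 7.5}
round_2 = {"depth": 6.5, "groundedness": 7.5, "structure": 7.5}
flagged = regressions(round_1, round_2)
for name, (old, new) in flagged.items():
    print(f"{name}: {old} -> {new}")
```

A small tolerance keeps ordinary evaluator noise from being reported as a regression.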

Evaluator hallucination. An evaluator that claims “the output does not address X” when X is clearly present will cause the producer to add redundant coverage of X in its revision. Most common on Completeness, where the evaluator must reason about what is absent rather than what is present.

Evaluator drift. A model’s scoring calibration shifts over Ollama version updates or quantization changes. The evaluator pool with rotation converts systematic drift into measurable variance. Periodic recalibration runs against a held-out human-rated set are the correct diagnostic.
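A recalibration run can be as simple as comparing an evaluator's scores with human ratings on the same held-out items. The sketch below assumes both are available as dictionaries keyed by item ID; the function names and the 0.3-point alert limit are hypothetical.

```python
from statistics import mean


def calibration_offset(evaluator_scores: dict[str, float],
                       human_scores: dict[str, float]) -> float:
    """Mean (evaluator - human) difference over the items both have rated."""
    shared = evaluator_scores.keys() & human_scores.keys()
    if not shared:
        raise ValueError("no overlapping items to compare")
    return mean(evaluator_scores[item] - human_scores[item] for item in shared)


def has_drifted(offset_now: float, offset_baseline: float,
                limit: float = 0.3) -> bool:
    """True when the offset moved by more than `limit` since the baseline run."""
    return abs(offset_now - offset_baseline) > limit


# Made-up ratings for three held-out documents.
human = {"doc-1": 7.5, "doc-2": 6.5, "doc-3": 8.0}
evaluator = {"doc-1": 8.0, "doc-2": 7.0, "doc-3": 8.5}
offset = calibration_offset(evaluator, human)
```

Tracking the offset per evaluator and per model version makes it possible to tell a pool-wide shift from a single model that changed after an update.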

Context starvation. When pre-evaluation compression removes too much, the evaluator scores structural completeness on a skeleton that misrepresents the content. The 6,000-character threshold was chosen conservatively on this axis.

Tuning the Loop

| Parameter | Default | Effect of raising | Effect of lowering |
|---|---|---|---|
| PASS_THRESHOLD | 8.0 | Stricter quality gate; more runs exhaust all rounds | More outputs pass in round 1; lower average quality |
| MAX_ROUNDS | 3 | More recovery attempts; higher latency ceiling | Faster pipeline; more runs fail without passing |
| REVISION_FOCUS_THRESHOLD | 7.0 | Fewer dimensions in revision prompt; tighter focus | More dimensions highlighted; broader but shallower changes |

Tune PASS_THRESHOLD last, after establishing a baseline with the default. The right threshold depends on your task domain, quality expectations, and latency budget—not on a universal standard. The 8.0 default was calibrated for general research synthesis. Do not touch REVISION_FOCUS_THRESHOLD without A/B evidence; it is the parameter most likely to introduce subtle regressions.

What the 11% Tells You

The 11% of runs that exhaust all three rounds without passing are the most interesting part of the dataset. They are not random. They cluster around specific task types (narrow technical domains where the retrieval corpus is thin), specific failure classes (groundedness failures when claimed sources don’t exist), and specific model combinations (producers and evaluators from the same family, which share distributional biases the rubric cannot overcome).
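Clusters like these fall out of a simple group-by over the failed runs. In the sketch below, only `wiggum_scores` comes from the log record shown earlier; the `task_type`, `producer_family`, and `evaluator_family` fields are hypothetical and stand in for whatever metadata the log actually records.

```python
from collections import Counter

PASS_THRESHOLD = 8.0


def failed_runs(runs: list[dict]) -> list[dict]:
    """Runs whose final logged score is still below the threshold."""
    return [r for r in runs if r["wiggum_scores"][-1] < PASS_THRESHOLD]


def cluster_by(runs: list[dict], key: str) -> Counter:
    """Count failed runs per value of one metadata field."""
    return Counter(r.get(key, "unknown") for r in failed_runs(runs))


def same_family_failures(runs: list[dict]) -> int:
    """Failed runs where producer and evaluator share a model family."""
    return sum(
        r.get("producer_family") is not None
        and r.get("producer_family") == r.get("evaluator_family")
        for r in failed_runs(runs)
    )


# Made-up records; the metadata field names are assumptions.
runs = [
    {"wiggum_scores": [6.8, 7.2, 7.6], "task_type": "narrow-technical",
     "producer_family": "fam-a", "evaluator_family": "fam-a"},
    {"wiggum_scores": [6.4, 8.3], "task_type": "general",
     "producer_family": "fam-a", "evaluator_family": "fam-b"},
    {"wiggum_scores": [5.9, 6.5, 7.0], "task_type": "narrow-technical",
     "producer_family": "fam-b", "evaluator_family": "fam-c"},
]
by_task = cluster_by(runs, "task_type")
```

Comparing each cluster's share of failures with its share of all runs is what separates a real pattern from a task type that is simply common.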

The Wiggum Loop is not a guarantee. It is a reliable quality lift for the runs it handles, and a structured failure record for the runs it does not. Both are valuable—but only if you read the records. The run log is the data; the patterns are the explanation.
