May 22, 2026 • 15 min read • Agentic Harness Engineering Series

Verification Patterns: Measuring and Improving Output Quality

The Dimensional Rubric defines what "quality" means in the harness. The Surgical Compressor makes evaluation and revision feasible for long documents. The ReAct Comparator tells you when to use a different evaluation strategy entirely.

The Wiggum Loop (post 6) described the evaluate–revise cycle as a whole: a separate evaluator model scores the synthesis, dimensional feedback routes back to the producer, and the cycle repeats up to three times. But the loop depends on three support patterns that were left implicit: a rubric that defines what the evaluator scores, a compressor that makes scoring and revision feasible for long documents, and a decision framework for choosing between the Wiggum Loop and a ReAct evaluation strategy.

This post covers those three patterns — C2, C3, and C4 of the verification section.

C2 — The Dimensional Rubric

A composite quality score is useful for threshold gating: it tells you whether an output passed or failed. It tells you almost nothing about how to revise. "Your score is 6.4 out of 10" gives the producer no actionable information. "Your score on Specificity is 5.5: you made four broad claims without concrete data or examples, particularly in section 3" gives the producer something to act on.

The Dimensional Rubric defines quality across six named dimensions, each scored at 0.5-point resolution:

C2 — The Six Dimensions and Their First-Round Score Distribution

First-round dimensional scores across 1,500 runs. Specificity (6.65) and Depth (6.97) are consistently the weakest dimensions. Relevance (7.42) is consistently the strongest.

The six dimensions are:

| Dimension | What it measures | Typical first-round weakness |
|---|---|---|
| Relevance | Does the output address the task directly? | Usually highest; the model stays on topic |
| Completeness | Are all major aspects of the task addressed? | Misses edge cases and limitations |
| Depth | Does the output go beyond surface-level treatment? | Correct but shallow; lacks mechanistic explanation |
| Specificity | Are claims supported by concrete data, examples, or citations? | Consistently weakest; model makes broad unanchored claims |
| Structure | Is the output organized with clear logical flow? | Headings present but parallel structure often weak |
| Groundedness | Are claims traceable to retrieved sources? | Improves when Novelty Gate reduces redundant sources |

The rubric is a structured prompt template in harness/prompts/eval.py that instructs the evaluator to score each dimension independently before computing the composite. The independence constraint is important: evaluators that score holistically tend to anchor on their initial gestalt impression and adjust each dimension toward it. Forced independence reduces this anchoring effect.
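The independence constraint can be sketched as a prompt builder plus a strict reply parser. This is an illustration only, not the contents of harness/prompts/eval.py: the prompt wording, the JSON reply schema, the 0 to 10 range, and the names `build_eval_prompt` and `parse_eval_reply` are all assumptions. The one property it tries to capture is that the evaluator is asked for per-dimension verdicts and never for a total.

```python
import json

# Dimension names come from the post; everything else here is hypothetical.
DIMENSIONS = ("Relevance", "Completeness", "Depth",
              "Specificity", "Structure", "Groundedness")


def build_eval_prompt(task: str, document: str) -> str:
    """Ask for one independent verdict per dimension and no composite."""
    lines = [
        "You are evaluating a document against a task.",
        f"Task: {task}",
        "Score each dimension below on its own, from 0 to 10 in steps of 0.5.",
        "Do not form an overall impression first, and do not report a total.",
    ]
    lines += [f"- {name}" for name in DIMENSIONS]
    lines += [
        'Reply with JSON only: {"<dimension>": {"score": <number>, '
        '"feedback": "<text>"}, ...}',
        "Document:",
        document,
    ]
    return "\n".join(lines)


def parse_eval_reply(reply: str) -> dict[str, float]:
    """Validate the evaluator's JSON reply and return per-dimension scores."""
    data = json.loads(reply)
    scores = {}
    for name in DIMENSIONS:
        score = float(data[name]["score"])
        # Reject scores outside the assumed range or off the 0.5-point grid.
        if not 0 <= score <= 10 or (score * 2) % 1 != 0:
            raise ValueError(f"{name}: {score} is not on the 0.5-point grid")
        scores[name] = score
    return scores
```

Because the composite is computed by the harness rather than reported by the model, the evaluator has no opportunity to pick a total first and back-fill the dimensions to match it.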

Equal weighting across the six dimensions was the result of an A/B experiment. Alternative weighting schemes (upweighting Groundedness for research tasks, upweighting Structure for formal reports) failed to produce statistically significant quality improvements over equal weights on the reference task distribution. This could change with a significantly different task type distribution.

The 8.0 composite threshold — the pass/fail line for the Wiggum Loop — was calibrated against a human-rated holdout set of 200 outputs. Outputs scoring at or above 8.0 were rated as "good enough for intended use" by human reviewers at a 90% agreement rate.
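As a worked example of the gating arithmetic, the sketch below computes an equal-weight composite and applies the 8.0 pass line. The example scores are invented for illustration and are not measurements from the harness; the helper names `composite`, `passes`, and `weakest` are hypothetical, and whether the real harness rounds the composite is not stated in this post.

```python
from statistics import mean

PASS_THRESHOLD = 8.0  # composite pass line stated in the post


def composite(scores: dict[str, float]) -> float:
    """Equal-weight composite: the plain mean of the dimension scores."""
    return mean(scores.values())


def passes(scores: dict[str, float]) -> bool:
    """Gate on the composite only; dimensions are not gated individually."""
    return composite(scores) >= PASS_THRESHOLD


def weakest(scores: dict[str, float], n: int = 2) -> list[str]:
    """Names of the n lowest-scoring dimensions, lowest first."""
    return sorted(scores, key=scores.get)[:n]


# Invented first-round scores, for illustration only.
example = {
    "Relevance": 8.5, "Completeness": 8.0, "Depth": 7.0,
    "Specificity": 5.5, "Structure": 8.0, "Groundedness": 7.5,
}
print(round(composite(example), 2))  # 44.5 / 6, prints 7.42
print(passes(example))               # prints False
print(weakest(example))              # prints ['Specificity', 'Depth']
```

The composite decides whether another revision cycle runs; the weakest dimensions decide what the revision feedback should focus on.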

The rubric is not universal. It was calibrated for long-form research synthesis. For code generation, replace Groundedness with Correctness (does the code execute and produce the expected output?) and Depth with Parsimony (is the solution unnecessarily complex?). Modifying the prompt template is a two-minute change; recalibrating the threshold takes a holdout evaluation session.
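The code-generation variant described above amounts to a two-entry substitution on the dimension list. A minimal sketch, assuming the rubric's dimensions are held as a simple tuple of names (the real template may store them differently); `adapt_dimensions` is a hypothetical helper:

```python
RESEARCH_DIMENSIONS = ("Relevance", "Completeness", "Depth",
                       "Specificity", "Structure", "Groundedness")

# Substitutions named in the post for code-generation tasks.
CODE_SWAPS = {"Groundedness": "Correctness", "Depth": "Parsimony"}


def adapt_dimensions(dimensions: tuple[str, ...],
                     swaps: dict[str, str]) -> tuple[str, ...]:
    """Replace named dimensions in place, keeping the original order."""
    unknown = set(swaps) - set(dimensions)
    if unknown:
        raise ValueError(f"cannot swap unknown dimensions: {sorted(unknown)}")
    return tuple(swaps.get(name, name) for name in dimensions)


CODE_DIMENSIONS = adapt_dimensions(RESEARCH_DIMENSIONS, CODE_SWAPS)
print(CODE_DIMENSIONS)
# prints ('Relevance', 'Completeness', 'Parsimony', 'Specificity', 'Structure', 'Correctness')
```

The swap itself is the cheap part. The 8.0 threshold was calibrated against the research-synthesis dimensions, so it should not be assumed to carry over to the adapted rubric without a new holdout session.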

Research grounding: The independence constraint (scoring each dimension before computing the composite) is directly supported by empirical work on judge failure modes. A systematic reliability study found no judge model to be uniformly reliable across benchmarks: consistent sensitivity to formatting changes, paraphrasing, verbosity adjustments, and label flipping was observed across four state-of-the-art judges. (arXiv:2603.05399) A separate sensitivity analysis found that 8 of 9 judges exhibit “always-A” position bias: when the candidate being evaluated appears first in a pairwise prompt, it receives inflated scores regardless of content. (arXiv:2604.23478) Forcing per-dimension scoring before any composite is computed is a structural defense against both effects: the evaluator must commit to a per-dimension verdict before it can anchor on a holistic impression, and there is no pairwise ordering when scoring individual dimensions. An ensemble approach that automatically discovers missing evaluation dimensions from judge failure cases improved GPT-4o benchmark agreement from 87.2% to 90.5%, which suggests the multi-dimensional structure is recoverable from failures in a way that holistic scores are not. (arXiv:2510.06538)

C3 — The Surgical Compressor

The Wiggum Loop evaluates and revises synthesis outputs. Those outputs are often 4,000–12,000 characters — research summaries, technical reports, multi-section surveys. The evaluator model has a finite context window. The producer model receiving revision feedback also has a finite context window. Attempting to pass full-length documents directly to both causes context overflow on longer outputs and, more subtly, causes the evaluator to miss structure when the document is truncated mid-section.

The Surgical Compressor addresses this with two distinct compression strategies for two different purposes:

```python
# Thresholds are character counts.

# For evaluation: preserve all section headings, condense bodies
def summarize_for_eval(content: str, task: str,
                       threshold: int = 6_000) -> str:
    if len(content) <= threshold:
        return content  # no compression needed
    # Extract all H2/H3 headings verbatim + condensed section bodies
    return _section_preserving_compress(content)

# For revision: preserve the specific sections the evaluator criticized
def summarize_for_revision(content: str, feedback: str,
                           threshold: int = 5_000) -> str:
    if len(content) <= threshold:
        return content  # no compression needed
    # Parse feedback for section references, keep those verbatim
    cited_sections = _extract_cited_sections(feedback)
    return _section_targeted_compress(content, cited_sections)
```

C3 — Surgical Compressor: Two Compression Modes

Section-preserving compression for evaluation keeps all headings visible so the evaluator can score Structure and Completeness. Section-targeted compression for revision keeps the exact passages the evaluator criticized verbatim.

The asymmetry between the two modes reflects the asymmetric needs of evaluator and producer. The evaluator needs to see the document's structure — what sections exist, in what order, at what level of detail — to score Structure and Completeness accurately. If the evaluator receives a summarized document that condenses section bodies to one sentence each but preserves all headings, it can make valid structural assessments. If the evaluator receives a document truncated after the third section, it will score Completeness falsely low.
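A minimal sketch of section-preserving compression follows. It is not the harness's `_section_preserving_compress`: it assumes the document uses markdown-style `##` and `###` headings, and it condenses each body by keeping only its first sentence, a crude stand-in for whatever summarizer the real implementation uses. All function names are hypothetical.

```python
import re

# Assumption: sections are introduced by markdown H2/H3 headings.
HEADING = re.compile(r"^#{2,3} .+$", re.MULTILINE)


def split_sections(content: str) -> list[tuple[str, str]]:
    """Return (heading, body) pairs; text before the first heading has heading ''."""
    sections = []
    heading, start = "", 0
    for match in HEADING.finditer(content):
        sections.append((heading, content[start:match.start()].strip()))
        heading, start = match.group(0), match.end()
    sections.append((heading, content[start:].strip()))
    return [(h, b) for h, b in sections if h or b]


def first_sentence(body: str) -> str:
    """Keep text up to the first sentence-ending punctuation followed by whitespace."""
    match = re.search(r"(?<=[.!?])\s", body)
    return body[:match.start()] if match else body


def section_preserving_compress(content: str) -> str:
    """Keep every heading verbatim and condense each body to one sentence."""
    parts = []
    for heading, body in split_sections(content):
        condensed = first_sentence(body)
        parts.append("\n".join(p for p in (heading, condensed) if p))
    return "\n\n".join(parts)


doc = ("Intro text. More intro.\n\n## Methods\nWe did A. Then B.\n\n"
       "### Details\nFine print here. Extra.\n\n## Results\nIt worked. Very well.")
print(section_preserving_compress(doc))
```

Every heading survives, in order, so the evaluator still sees the full outline even though each body has shrunk to a single sentence.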

The producer needs the opposite: it needs to see exactly what it wrote in the passages the evaluator criticized, so it can make targeted improvements without losing track of the surrounding context. A producer receiving a fully condensed document that happens to include one-sentence summaries of the criticized sections cannot make meaningful revisions — it doesn't remember what it originally wrote. Section-targeted compression preserves the cited sections verbatim and condenses everything else.
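Section-targeted compression can be sketched the same way. The version below assumes the evaluator cites sections by number ("section 3", as in the feedback example earlier in this post), that numbering counts `##`/`###` headings in document order starting at 1, and that uncited bodies are condensed to their first sentence. These are illustrative assumptions, not the behavior of the harness's `_extract_cited_sections` or `_section_targeted_compress`.

```python
import re


def extract_cited_sections(feedback: str) -> set[int]:
    """Collect section numbers mentioned as 'section N' in the feedback."""
    found = re.findall(r"\bsection\s+(\d+)", feedback, flags=re.IGNORECASE)
    return {int(number) for number in found}


def condense(body: str) -> str:
    """Crude stand-in for a summarizer: keep only the first sentence."""
    match = re.search(r"(?<=[.!?])\s", body)
    return body[:match.start()] if match else body


def section_targeted_compress(content: str, cited_sections: set[int]) -> str:
    """Keep cited sections verbatim and condense every other section."""
    # With a capturing group, re.split returns
    # [preamble, heading1, body1, heading2, body2, ...].
    pieces = re.split(r"^(#{2,3} .+)$", content, flags=re.MULTILINE)
    preamble = pieces[0].strip()
    parts = [condense(preamble)] if preamble else []
    pairs = zip(pieces[1::2], pieces[2::2])
    for number, (heading, body) in enumerate(pairs, start=1):
        body = body.strip()
        if number not in cited_sections:
            body = condense(body)
        parts.append(f"{heading}\n{body}" if body else heading)
    return "\n\n".join(parts)


doc = ("## Intro\nOpening claim. Background detail.\n\n"
       "## Methods\nWe did A. Then B.\n\n"
       "## Results\nIt worked. Numbers were vague.")
feedback = "Specificity is weak, particularly in section 3."
print(section_targeted_compress(doc, extract_cited_sections(feedback)))
```

Here only the third section reaches the producer in full; the other two shrink to one sentence each but keep their headings, so the producer can still see where the criticized passage sits in the document.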

The two-threshold design (6,000 characters for evaluation, 5,000 for revision) avoids compression overhead on short outputs: neither compressor fires on an output at or below its threshold. Separately, a 20,000-character input cap prevents OOM errors on pathologically long outputs from skills that generate extended reports.

C4 — The ReAct Comparator

The Wiggum Loop is not the only evaluation loop available. ReAct (Reason+Act) loops are the evaluation strategy used in most published agentic frameworks: the evaluating model interleaves reasoning steps with action steps (search, code execution, lookup), observing the results of its actions and updating its assessment. This is powerful for tasks where the evaluation itself requires active investigation.

The two approaches have complementary failure modes, which is why choosing between them is a first-class design decision rather than a preference:

C4 — Wiggum Loop vs. ReAct: Failure Mode Comparison

Wiggum fails when the rubric doesn't capture what matters (specification error). ReAct fails when the evaluator's reasoning about what to verify is itself wrong (reasoning error). For research synthesis, Wiggum outperforms. For code verification, ReAct does.

The practical decision criteria:

| Task type | Recommended loop | Reason |
|---|---|---|
| Long-form research synthesis | Wiggum | Six-dimension rubric covers the quality space; distinct evaluator model eliminates self-evaluation bias |
| Code generation / debugging | ReAct | Evaluator can execute code and observe errors; rubric scoring of code quality is unreliable |
| Factual Q&A | ReAct | Evaluator can look up ground truth; dimensional rubric doesn't capture factual correctness well |
| Creative tasks | Neither | Both loops rely on automated evaluation; creative quality requires human assessment |
| Multi-step planning | Wiggum | Plan quality is assessable via rubric (Structure, Completeness); execution-level verification needs ReAct |

There is also a hardware argument. The Wiggum Loop requires a distinct evaluator model, which ideally means two model instances resident in VRAM at the same time. On 8–16 GB VRAM hardware the two may not fit together, which forces sequential loading: one model is unloaded before the other is loaded. A ReAct loop, using a single model to both reason and verify, fits in the same VRAM budget as a single-model pipeline. On constrained hardware, ReAct may be the only feasible evaluation strategy.
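The comparator is a decision aid rather than a component, but the criteria above are simple enough to write down as a lookup, which can be handy as a checklist when adding a task type. The sketch below is purely illustrative: the task-type keys, the `Loop` enum, and the `can_host_evaluator` flag are hypothetical names, and the fallback rule is one reading of the hardware argument, not harness behavior.

```python
from enum import Enum


class Loop(Enum):
    WIGGUM = "wiggum"
    REACT = "react"
    HUMAN = "human"  # "Neither" in the table: needs human assessment


# Recommendations transcribed from the decision table; keys are hypothetical.
RECOMMENDED = {
    "research_synthesis": Loop.WIGGUM,
    "code_generation": Loop.REACT,
    "factual_qa": Loop.REACT,
    "creative": Loop.HUMAN,
    "multi_step_planning": Loop.WIGGUM,
}


def choose_loop(task_type: str, can_host_evaluator: bool = True) -> Loop:
    """Pick a loop by task type, falling back to ReAct when no evaluator fits.

    can_host_evaluator is False when the hardware cannot load a distinct
    evaluator model at all, even sequentially (an assumed interpretation).
    """
    loop = RECOMMENDED[task_type]
    if loop is Loop.WIGGUM and not can_host_evaluator:
        return Loop.REACT
    return loop


print(choose_loop("research_synthesis"))                            # Loop.WIGGUM
print(choose_loop("research_synthesis", can_host_evaluator=False))  # Loop.REACT
print(choose_loop("creative"))                                      # Loop.HUMAN
```

An unknown task type raises `KeyError` on purpose: a new task type should trigger the design analysis, not silently inherit a default loop.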

The ReAct Comparator is reference material. It introduces no new code — it documents the decision analysis that should precede any deployment. The right time to consult it is when designing a new task type, not after observing that scores are low.

How the Section C Patterns Compose

The four verification patterns (C1–C4) form a coherent subsystem around one question: is this output good enough, and if not, how is it failing? The Dimensional Rubric (C2) defines what "good enough" means across six dimensions. The Wiggum Loop (C1, post 6) runs the evaluate–revise cycle using the rubric as its scoring instrument. The Surgical Compressor (C3) ensures the evaluator and producer can both work with the document regardless of its length. The ReAct Comparator (C4) provides the decision framework for when this entire subsystem should be replaced.
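How C1 through C3 fit together can be sketched as a single function that takes the evaluator, producer, and both compressors as injected callables. This is a simplified illustration, not the loop from post 6: it reads "up to three times" as up to three revisions after an initial evaluation, uses an equal-weight mean as the composite, and assumes the producer returns a complete new draft. The name `run_verification` and its signature are hypothetical.

```python
from statistics import mean
from typing import Callable

Scores = dict[str, float]


def run_verification(
    draft: str,
    evaluate: Callable[[str], tuple[Scores, str]],   # returns (scores, feedback)
    revise: Callable[[str, str], str],               # (document, feedback) -> new draft
    compress_for_eval: Callable[[str], str],
    compress_for_revision: Callable[[str, str], str],
    threshold: float = 8.0,
    max_revisions: int = 3,
) -> tuple[str, list[float]]:
    """Evaluate, then revise and re-evaluate until the composite passes
    or the revision budget is spent. Returns the final draft and the
    composite score recorded after each evaluation."""
    scores, feedback = evaluate(compress_for_eval(draft))
    history = [mean(scores.values())]
    # history holds one entry per evaluation, so len(history) - 1 revisions ran.
    while history[-1] < threshold and len(history) <= max_revisions:
        draft = revise(compress_for_revision(draft, feedback), feedback)
        scores, feedback = evaluate(compress_for_eval(draft))
        history.append(mean(scores.values()))
    return draft, history
```

Keeping the rubric inside `evaluate` and the compressors behind their own callables mirrors the division of labor described above: the rubric can be retuned without touching the loop or the compression logic.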

The practical implication: the most common tuning operation in the verification subsystem is adjusting the Dimensional Rubric — updating scoring anchors, reweighting dimensions for a new task type, or adding a seventh dimension for domain-specific quality. Everything else — the loop mechanics, the compressor, the loop-selection logic — is stable infrastructure that rarely needs modification.

The next post covers Section D — Orchestration Patterns — where the harness scales from single-task runs to multi-agent parallel execution: the DAG Orchestrator, Worktree Context isolation, MCP Dispatch Router for cross-instance routing, and the Skill Registry.
