May 25, 2026 • 14 min read • Agentic Harness Engineering Series

The Regression Harness: Eval Suite, Criterion Functions, and Experimental Infrastructure

Nine benchmark tasks. Eleven criterion functions. One composite score. And above it all, a three-persona panel that evaluates whether the experiment itself produced genuine knowledge. This post documents the full measurement stack that underlies every controlled experiment and autoresearch iteration in the harness.


Why a Regression Harness?

The Wiggum Loop (C1) and the Dimensional Rubric (C2) give you a score for any given synthesis run. That score is useful for gating output quality at inference time. It is not sufficient for detecting regressions across harness changes.

The distinction matters. When you swap a model, modify a Modelfile, change the synthesis instruction, or refactor a pipeline stage, you want to know: did this change break something that was working? A single ad-hoc run tells you nothing — it collapses all variation (task choice, model randomness, instruction sensitivity) into one uninterpretable data point. A regression harness runs a fixed, diverse set of tasks with deterministic content criteria and produces a comparable score before and after every change.

Design rule: Run the eval suite after any model swap, Modelfile change, or harness modification to detect regressions before they accumulate. The suite is not a quality certificate — it is a change detector.
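As a change detector, the suite boils down to comparing per-task scores before and after a modification. The sketch below is a minimal illustration under assumptions: `regressed_tasks` and `tolerance` are hypothetical names, not part of eval_suite.py, and the scores are made-up inputs.

```python
def regressed_tasks(before: dict[str, float],
                    after: dict[str, float],
                    tolerance: float = 0.0) -> list[str]:
    """Return task IDs whose score dropped by more than `tolerance`.

    `tolerance` (hypothetical) absorbs run-to-run noise so that tiny
    drops are not reported as regressions.
    """
    return sorted(
        task for task in before
        if task in after and (before[task] - after[task]) > tolerance
    )


if __name__ == "__main__":
    # Illustrative scores only, not measurements from the harness.
    before = {"T_A": 8.0, "T_B": 7.5, "T_H": 6.0}
    after = {"T_A": 8.1, "T_B": 6.5, "T_H": 5.0}
    print(regressed_tasks(before, after, tolerance=0.5))
```

Comparing per task rather than only on the aggregate keeps a gain on one task from hiding a drop on another.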

The harness measurement infrastructure has three layers:

Layer 1 — eval_suite.py

Task registry (9 tasks + T_MEM), criterion function library (11 functions), composite score formula, and CLI. The baseline regression gate.

Layer 2 — experiment_runner.py

Completely randomized design (CRD) runner. Reads an ExperimentSpec JSON, generates a randomized run order, applies treatment env vars, checkpoints after every run. Used by the four documented experiments.

Layer 3 — experiment_panel.py

Three-persona meta-evaluator. After a CRD run completes, three models evaluate the experiment itself — one for design rigor, one for epistemic validity, one for actionability — and emit a joint KEEP / REVISE / REDESIGN decision.

The Task Registry

The suite defines ten named tasks. Nine run the agent on representative inputs and check file-based output criteria. One (T_MEM) is a memory retrieval smoke test that directly queries the MemoryStore without running the full agent pipeline.

ID	Type	Description	Key criteria
T_A	enumerated	Top 5 context engineering techniques	`exact_sections(5)`, `has_impl_notes`
T_B	best practices	Cost envelope management in production agents	`min_sections(3)`, `has_impl_notes`
T_C	enumerated	Top 3 failure modes in multi-agent systems	`exact_sections(3)`, `has_impl_notes`
T_D	enumerated	Top 3 context window management strategies	`exact_sections(3)`, `has_impl_notes`
T_E	best practices	Prompt injection defense in production systems	`min_sections(3)`, `has_impl_notes`
T_ANN	annotation	`/annotate` fixture: NANDA-format paper annotation	`has_nanda_sections`, `no_annotate_artifacts`
T_F	introspection	`/introspect` skill registry and pipeline stages	`mentions_skill_names`, `no_hallucinated_skills`
T_G	file synthesis	Read autoresearch program file and summarize design	`min_sections(2)`, `no_file_path_refs`
T_H	OOD	Top 5 nutrient synergies for cognitive performance	`exact_sections(5)`, `has_impl_notes`
T_MEM	memory smoke	Direct `MemoryStore` query — no agent invocation	`paper_count > 0`, `retrieval_returns_results`

The task taxonomy is intentional. T_A through T_E are the domain tasks used in the four controlled experiments and serve as the autoresearch optimization target (T_B specifically). T_ANN tests a skill path (annotation) that bypasses the main synthesis pipeline entirely. T_F tests introspection — whether the agent accurately reports its own capabilities without hallucinating skill names. T_G tests file-based synthesis, a structurally different input mode. T_H is deliberately out-of-domain: a regression in the pipeline's general capability should appear here before it appears in the in-domain tasks, since the model has less cached intuition to lean on.

T_MEM runs separately from the agent tasks. It connects directly to the MemoryStore SQLite backend, verifies paper count, and issues a raw retrieval query. It is the only test that bypasses the agent runtime entirely — which means it will catch storage regressions that the agent tasks would mask (the agent simply retrieves less context silently rather than failing).
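A storage-level smoke test of this shape can be sketched with the standard sqlite3 module. Everything about the schema here is an assumption: the `papers` table, its `title` column, and the `LIKE` lookup are hypothetical stand-ins for whatever the real MemoryStore uses for storage and retrieval.

```python
import sqlite3


def memory_smoke_test(conn: sqlite3.Connection, query: str) -> dict[str, bool]:
    """Check the two T_MEM-style conditions directly against the database.

    Assumes a hypothetical `papers(title)` table; the real store's schema
    and retrieval mechanism are not shown in this post.
    """
    paper_count = conn.execute("SELECT COUNT(*) FROM papers").fetchone()[0]
    hits = conn.execute(
        "SELECT title FROM papers WHERE title LIKE ?", (f"%{query}%",)
    ).fetchall()
    return {
        "paper_count > 0": paper_count > 0,
        "retrieval_returns_results": bool(hits),
    }


def demo_store(populated: bool) -> sqlite3.Connection:
    """Build an in-memory database for the demo (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE papers (id INTEGER PRIMARY KEY, title TEXT)")
    if populated:
        conn.execute("INSERT INTO papers (title) VALUES (?)",
                     ("Judge reliability under paraphrase",))
    return conn


if __name__ == "__main__":
    print(memory_smoke_test(demo_store(populated=True), "judge"))
    print(memory_smoke_test(demo_store(populated=False), "judge"))
```

The empty store fails both checks outright, which is the failure an agent task would hide by quietly retrieving less context.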

The Criterion Function Library

Each task carries a list of criterion functions. A criterion is a pure function from output string to (passed: bool, detail: str). The library has eleven functions covering four categories of check.

Size and Structure Checks

These check that the output is substantively long enough and organized correctly — the minimum bar before content quality becomes meaningful.

import re

def min_bytes(n: int):
    """Pass when the UTF-8 encoded output is at least n bytes."""
    def check(content: str):
        b = len(content.encode("utf-8"))
        return b >= n, f"{b} bytes (need >= {n})"
    return check

def min_lines(n: int):
    def check(content: str):
        lines = content.count("\n") + 1
        return lines >= n, f"{lines} lines (need >= {n})"
    return check

def exact_sections(n: int):
    """Exactly n H2-level content sections, excluding structural headers."""
    structural = {"introduction", "conclusion", "summary", "overview",
                  "background", "references"}
    def check(content: str):
        headers = re.findall(r'^##\s+(.+)', content, re.MULTILINE)
        items   = [h for h in headers
                   if re.sub(r'^[\d.\s]+', '', h).strip().lower()
                   not in structural]
        return len(items) == n, f"{len(items)} content sections (need exactly {n})"
    return check

def min_sections(n: int):
    def check(content: str):
        headers = re.findall(r'^##\s+\S', content, re.MULTILINE)
        return len(headers) >= n, f"{len(headers)} H2 sections (need >= {n})"
    return check

The distinction between exact_sections and min_sections maps directly onto the task taxonomy. Enumerated tasks (T_A: "top 5") get exact_sections — the agent is wrong if it returns four or six items. Best-practices tasks (T_B, T_E) get min_sections — there is no canonical count, and imposing one would be an arbitrary constraint.

exact_sections strips structural headers from the count. A document with five content sections plus an introduction does not score as six sections. This required two passes of iteration before stabilizing: early versions were fooled by numbered headers (## 1. Introduction) and by variations like ## Overview that were clearly preamble rather than content.
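To make the difference concrete, the sketch below restates the two section criteria from above and runs them on a small sample document with a numbered introduction and a conclusion. The sample text is invented for illustration.

```python
import re

STRUCTURAL = {"introduction", "conclusion", "summary", "overview",
              "background", "references"}


def exact_sections(n: int):
    def check(content: str):
        headers = re.findall(r'^##\s+(.+)', content, re.MULTILINE)
        # Drop leading numbering ("1. ") before comparing to structural names.
        items = [h for h in headers
                 if re.sub(r'^[\d.\s]+', '', h).strip().lower()
                 not in STRUCTURAL]
        return len(items) == n, f"{len(items)} content sections (need exactly {n})"
    return check


def min_sections(n: int):
    def check(content: str):
        headers = re.findall(r'^##\s+\S', content, re.MULTILINE)
        return len(headers) >= n, f"{len(headers)} H2 sections (need >= {n})"
    return check


SAMPLE = """# Context Window Management

## 1. Introduction
Preamble text.

## 1. Sliding window
Body.

## 2. Summarization
Body.

## 3. Retrieval
Body.

## Conclusion
Wrap-up.
"""

print(exact_sections(3)(SAMPLE))  # (True, '3 content sections (need exactly 3)')
print(min_sections(3)(SAMPLE))    # (True, '5 H2 sections (need >= 3)')
```

The same document counts as three sections under exact_sections and five under min_sections, because only the former filters out structural headers.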

Content Quality Checks

def no_placeholders():
    BAD = ["[placeholder]", "TODO", "brief implementation note",
           "add example here", "implementation note here"]
    def check(content: str):
        found = [b for b in BAD if b.lower() in content.lower()]
        return len(found) == 0, ("clean" if not found
                                 else f"placeholder text found: {found}")
    return check

def has_impl_notes():
    MARKERS = ["implementation note", "example:", "```",
               "**example", "**implementation"]
    def check(content: str):
        found = any(m.lower() in content.lower() for m in MARKERS)
        return found, ("has implementation notes/examples" if found
                       else "no implementation notes or examples found")
    return check

def no_file_path_refs():
    def check(content: str):
        patterns = [r'saved to ~/\S+\.md', r'save.*~/\S+\.md',
                    r'written to ~/\S+\.md']
        found = any(re.search(p, content, re.IGNORECASE) for p in patterns)
        return not found, ("clean" if not found
                           else "output contains file path reference")
    return check

has_impl_notes is the criterion most directly connected to score variance. Its presence in a task's criterion list is what drives the criteria_rate term of the composite score. In the autoresearch experiments, the NAMED-SYSTEMS instruction variants caused has_impl_notes to fail: instructions such as "cite a named tool" tended to elicit structured lists without code blocks or implementation examples, so the outputs satisfied the literal request but not the evaluable form. This interaction between instruction phrasing and criterion satisfaction is one of the design tensions the autoresearch loop is trying to navigate.

no_file_path_refs catches a specific class of producer artifact: the model reporting that it "saved the document to ~/Desktop/..." inside the document body. This is a leakage of the task framing into the content — the agent is supposed to save the file silently, not narrate the save operation in the output.
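Both failure modes are easy to reproduce on toy inputs. The sketch below restates has_impl_notes and no_file_path_refs from above and applies them to three invented outputs: a bare tool list, the same list with an example line, and a version that narrates its own save operation.

```python
import re


def has_impl_notes():
    MARKERS = ["implementation note", "example:", "```",
               "**example", "**implementation"]
    def check(content: str):
        found = any(m.lower() in content.lower() for m in MARKERS)
        return found, ("has implementation notes/examples" if found
                       else "no implementation notes or examples found")
    return check


def no_file_path_refs():
    def check(content: str):
        patterns = [r'saved to ~/\S+\.md', r'save.*~/\S+\.md',
                    r'written to ~/\S+\.md']
        found = any(re.search(p, content, re.IGNORECASE) for p in patterns)
        return not found, ("clean" if not found
                           else "output contains file path reference")
    return check


# Invented sample outputs; the tool names are placeholders.
LIST_ONLY = "## Tools\n- ToolA for checkpointing\n- ToolB for budget counters\n"
WITH_EXAMPLE = LIST_ONLY + "\nExample: cap each run at a fixed token budget.\n"
NARRATED = WITH_EXAMPLE + "\nThe report was saved to ~/Desktop/cost_envelope.md\n"

print(has_impl_notes()(LIST_ONLY))       # fails: a list with no example marker
print(has_impl_notes()(WITH_EXAMPLE))    # passes: contains "Example:"
print(no_file_path_refs()(WITH_EXAMPLE)) # passes: no save narration
print(no_file_path_refs()(NARRATED))     # fails: narrates the save path
```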

Skill-Specific Checks

def has_nanda_sections():
    REQUIRED  = ["**Topic**", "**Motivation**", "**Contribution**"]
    EVIDENCE  = ["**Evidence", "**Broad impact**", "**Narrow impact**"]
    def check(content: str):
        missing      = [s for s in REQUIRED if s not in content]
        has_evidence = any(s in content for s in EVIDENCE)
        if missing:
            return False, f"missing sections: {missing}"
        if not has_evidence:
            return False, "missing evidence/impact section"
        return True, "all core sections present"
    return check

def mentions_skill_names(required: list[str]):
    def check(content: str):
        found = [s for s in required if s in content]
        ok    = len(found) >= max(1, len(required) // 2)
        return ok, f"found {len(found)}/{len(required)} required skills: {found}"
    return check

def no_hallucinated_skills():
    FAKE = ["/research", "/visualize", "/translate", "/validate",
            "/compress", "/enumerate", "/summarize", "/writetofile",
            "/writetodisk", "/writetolog"]
    def check(content: str):
        found = [f for f in FAKE if f in content.lower()]
        return len(found) == 0, ("clean" if not found
                                 else f"hallucinated skills: {found}")
    return check

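has_nanda_sections is specific to T_ANN and validates the structured annotation format (Topic, Motivation, Contribution, Evidence/Impact). It accepts any of three markers for the evidence section (a label beginning with **Evidence, **Broad impact**, or **Narrow impact**) to accommodate variation in the model's phrasing across runs. The match is case-sensitive, so capitalization variants outside these three markers are not accepted.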

no_hallucinated_skills maintains a blocklist of skill names that appear plausible but do not exist in the registry. Models have a consistent tendency to invent /summarize, /validate, and /visualize — all obvious names for things a capable assistant should be able to do. The blocklist must be maintained: as the real skill registry grows, some previously-hallucinated names may become real, which would cause false positives.
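One way to keep the blocklist honest is to cross-check it against the live registry whenever a skill is added. The sketch below is an assumption about how that could look: `stale_blocklist_entries` is a hypothetical helper, and the registry contents are invented for the demo (with /summarize imagined as a newly added real skill).

```python
FAKE = ["/research", "/visualize", "/translate", "/validate",
        "/compress", "/enumerate", "/summarize", "/writetofile",
        "/writetodisk", "/writetolog"]


def stale_blocklist_entries(blocklist: list[str], registry: set[str]) -> list[str]:
    """Return blocklisted names that now exist in the real skill registry.

    Any name returned here should be removed from the blocklist, otherwise
    the criterion will flag a correct mention as a hallucination.
    """
    real = {name.lower() for name in registry}
    return [name for name in blocklist if name.lower() in real]


if __name__ == "__main__":
    # Hypothetical registry state after a /summarize skill is introduced.
    registry = {"/annotate", "/introspect", "/summarize"}
    print(stale_blocklist_entries(FAKE, registry))  # ['/summarize']
```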

Heading Structure Checks

def has_h1_heading():
    def check(content: str):
        ok = bool(re.search(r'^#\s+\S', content.strip(), re.MULTILINE))
        return ok, ("has H1 heading" if ok else "missing H1 heading")
    return check

has_h1_heading is used only by T_F (introspection). The introspect skill is expected to produce a structured document with a top-level heading. The criterion catches the case where the model produces a reasonable-looking document but omits the H1, making it structurally incomplete.

Knowledge Base Injection

Enumerated tasks carry a kb_file path pointing to a knowledge base document in harness/eval/knowledge_base/. The runner injects this file into the task string when two conditions are both true: the file exists, and the task uses exact_sections.

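kb = task_def.get("kb_file", "")
# Assumes each factory tags its closure (e.g. check.__name__ = "exact_sections(5)");
# an untagged closure reports __name__ == "check" and would never match.
is_enumerated = any(c.__name__.startswith("exact_sections")
                    for c in task_def.get("criteria", []))
# Inject only when a KB path is set, the file exists, and the task is enumerated.
if kb and os.path.exists(kb) and is_enumerated:
    task_str += f" read {kb}"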

The design choice here is conservative: injection only for tasks where a wrong count is a clear failure. Best-practices tasks get no injection because the knowledge base structure might inadvertently constrain what the agent includes — a KB with three subsections on cost management might cause T_B to produce exactly three sections when five would score better. For enumerated tasks the constraint is explicit and correct by definition.

The KB files are curated markdown documents, not raw retrieval results. Each contains the canonical framing of the task topic with enough structure to anchor the agent's section count without dictating content. They are reviewed and updated when the underlying domain literature meaningfully changes.
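The injection check identifies enumerated tasks by reading `__name__` on each criterion. A closure returned from a factory is named after the inner function by default, so the check only works if the factory assigns a descriptive name. The excerpts in this post do not show how the harness does that; the sketch below shows one plausible approach, with `untagged_factory` and `tagged_factory` as hypothetical names.

```python
def untagged_factory(n: int):
    """A factory that returns its closure without renaming it."""
    def check(content: str):
        return True, ""
    return check


def tagged_factory(n: int):
    """A factory that labels its closure so callers can identify it."""
    def check(content: str):
        return True, ""
    # Assumed convention: encode the factory name and argument in __name__.
    check.__name__ = f"exact_sections({n})"
    return check


if __name__ == "__main__":
    print(untagged_factory(5).__name__)  # check
    print(tagged_factory(5).__name__)    # exact_sections(5)
    print(tagged_factory(5).__name__.startswith("exact_sections"))  # True
```

Tagging also makes per-criterion log lines readable, since the argument is part of the name.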

The Composite Score

The suite computes one scalar metric that combines evaluation quality (WIGGUM r1 scores from runs.jsonl) with content structure validity (criterion pass rate):

composite = 0.7 × mean_wiggum_r1 + 0.3 × criteria_rate × 10

The 70/30 split encodes a priority: WIGGUM measures substantive quality (relevance, depth, specificity, groundedness), while criteria_rate measures structural validity (correct section count, no placeholders, has implementation examples). A document can have impeccable structure and be shallow. A document can be substantively excellent and have wrong section count. The composite requires both to score well.

The score_suite() function is the entry point for autoresearch. When the autoresearch loop proposes a modification to the synthesis instruction, it calls score_suite(task_ids=["T_B"]) to get a single float representing the effect of that change. The loop keeps the change if new_score − baseline > 0.1, otherwise reverts via git reset HEAD~1 --soft (a soft reset moves HEAD back one commit but leaves the rejected edit in the index and working tree).

Why criteria_rate × 10? The WIGGUM r1 score is on a 0–10 scale. criteria_rate is on a 0–1 scale. Multiplying by 10 puts them on the same scale before the weighted sum. Without this, a perfect criteria_rate of 1.0 would contribute only 0.3 × 1.0 = 0.3 to the composite, against up to 0.7 × 10 = 7.0 from the WIGGUM term, so structural failures would barely move the score.
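The formula and the keep rule fit in a few lines. This is a sketch of the arithmetic described above, not the suite's actual code: `composite_score` and `keep_change` are hypothetical names, and the input values are illustrative.

```python
def composite_score(mean_wiggum_r1: float, criteria_rate: float) -> float:
    """0.7 x WIGGUM (0-10 scale) + 0.3 x criteria pass rate rescaled to 0-10."""
    return 0.7 * mean_wiggum_r1 + 0.3 * criteria_rate * 10


def keep_change(new_score: float, baseline: float, min_gain: float = 0.1) -> bool:
    """Keep a proposed change only if it beats the baseline by more than min_gain."""
    return (new_score - baseline) > min_gain


if __name__ == "__main__":
    # 0.7 * 8.0 + 0.3 * 0.75 * 10 = 5.6 + 2.25 = 7.85
    score = composite_score(mean_wiggum_r1=8.0, criteria_rate=0.75)
    print(round(score, 2))
    print(keep_change(new_score=score, baseline=7.80))  # gain of 0.05: discard
    print(keep_change(new_score=8.0, baseline=7.80))    # gain of 0.20: keep
```

Failing one of four criteria at a WIGGUM mean of 8.0 costs 0.75 composite points, which is far above the 0.1 keep threshold.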

WIGGUM scores are matched back to tasks by fingerprint strings — a fragment of the task description unique enough to identify it in runs.jsonl:

FIXED_FINGERPRINTS = {
    "T_A": "top 5 context engineering",
    "T_B": "cost envelope management",
    "T_C": "3 most common failure modes",
    "T_D": "context window management strategies",
    "T_E": "prompt injection defense",
}

For tasks without fixed fingerprints (T_F, T_G, T_H, T_ANN), the fingerprint is derived from the output filename stem. This means task descriptions must be stable — a reworded T_B task string will fail to match its runs in runs.jsonl and silently return a mean_wiggum of zero.
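The silent-zero failure is easy to see in a small sketch of fingerprint matching. The record layout is an assumption: the `task` and `wiggum_r1` field names are hypothetical, since the actual runs.jsonl schema is not shown here, and `mean_wiggum_for` is an illustrative helper rather than the suite's function.

```python
import json
from statistics import mean


def mean_wiggum_for(fingerprint: str, jsonl_lines: list[str]) -> float:
    """Average the scores of runs whose task text contains the fingerprint.

    Returns 0.0 when nothing matches, mirroring the silent-zero behaviour
    described above.
    """
    scores = []
    for line in jsonl_lines:
        if not line.strip():
            continue
        record = json.loads(line)
        if fingerprint.lower() in record.get("task", "").lower():
            scores.append(float(record["wiggum_r1"]))
    return mean(scores) if scores else 0.0


if __name__ == "__main__":
    # Invented records using the assumed field names.
    original = [
        json.dumps({"task": "Best practices for cost envelope management", "wiggum_r1": 8}),
        json.dumps({"task": "Best practices for cost envelope management", "wiggum_r1": 7}),
    ]
    reworded = [
        json.dumps({"task": "Best practices for budget controls", "wiggum_r1": 9}),
    ]
    print(mean_wiggum_for("cost envelope management", original))  # 7.5
    print(mean_wiggum_for("cost envelope management", reworded))  # 0.0
```

Returning None or raising on an empty match would surface the mismatch instead of folding it into the composite as a zero.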

CLI Modes

The suite exposes five CLI modes covering different use patterns:

Flag	What it does	When to use it
`python -m harness.eval.eval_suite`	Full run: execute all tasks then check all criteria	After any harness change; the primary regression gate
`--fast`	Criteria check only against existing output files; no agent calls	When you want to re-check criteria after editing them without re-running all tasks
`--no-wiggum`	Run tasks but skip the Wiggum evaluation loop	Testing the producer in isolation; faster turnaround when WIGGUM scores aren't needed
`--score [--tasks T_B,T_D]`	Print composite float and exit; optional task subset filter	Called by autoresearch to evaluate proposed instruction changes
`--generated [path]`	Load additional tasks from a `generated_tasks.json` file	Testing with dynamically generated task variants; TinyTroupe integration

The --score mode is the one that matters most for the autoresearch loop. It runs tasks fresh, reads the new WIGGUM scores from runs.jsonl, evaluates all criteria, and emits a single float to stdout. The autoresearch script captures this float and makes the keep/discard decision against the current baseline.
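On the calling side, capturing that float might look like the sketch below. It assumes the score is the last non-empty line of stdout, which this post does not confirm; `parse_score` and `run_score` are hypothetical helpers, and only the command line itself comes from the table above.

```python
import subprocess
import sys


def parse_score(stdout: str) -> float:
    """Parse the composite float, assumed to be the last non-empty stdout line."""
    lines = [ln.strip() for ln in stdout.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("no score found on stdout")
    return float(lines[-1])


def run_score(tasks: str = "T_B") -> float:
    """Invoke the suite in --score mode and return the parsed composite.

    Requires the harness package to be importable; not executed in the demo.
    """
    result = subprocess.run(
        [sys.executable, "-m", "harness.eval.eval_suite",
         "--score", "--tasks", tasks],
        capture_output=True, text=True, check=True,
    )
    return parse_score(result.stdout)


if __name__ == "__main__":
    print(parse_score("7.85\n"))               # 7.85
    print(parse_score("running T_B\n6.4\n"))   # 6.4
```

Using check=True makes a crashed suite raise instead of being parsed as a score.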

The Experiment Runner

Layer 2 formalizes the four-experiment methodology documented in the experimental methodology post. Instead of running the agent manually with different settings and noting results, the CRD runner reads a JSON spec, generates a randomized run order, applies treatment variables via environment overrides, and checkpoints after every run.

An ExperimentSpec has six required fields:

{
  "title":      "Prior Knowledge Pass — Context Engineering",
  "hypothesis": "Injecting a KB doc into T_A raises mean wiggum_r1 by >= 0.5",
  "falsified_if": "mean(on) - mean(off) < 0.5 across all replications",
  "factor": {
    "name":   "prior_knowledge_pass",
    "levels": ["off", "on"]
  },
  "tasks":        ["T_A"],
  "replications": 3,
  "response_variables":   ["wiggum_r1", "pass_rate"],
  "controlled_variables": {"producer": "qwen2.5:32b", "evaluator": "..."},
  "mutable_scope": {
    "type": "env",
    "var":  "HARNESS_PRIOR_KNOWLEDGE_PASS",
    "levels": {"off": "", "on": "1"}
  }
}

The mutable_scope field specifies exactly one thing the treatment changes. This enforces single-variable isolation: if you need to test two changes simultaneously, you need two specs. The runner applies the treatment as an environment variable override before spawning each agent subprocess, so the harness code reads it normally — there is no special treatment injection path.

Randomization is seeded (seed=42) so the run order is deterministic and reproducible, but the order still prevents systematic confounding from time-of-day effects or model warm-up state. The --resume flag lets you pick up a partially completed experiment by skipping (task_id, treatment, rep) tuples already in the checkpoint file.
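A seeded, resumable run order can be sketched as follows. This is an illustration of the idea, not the runner's code: `build_run_order` and `pending_runs` are hypothetical names, and the checkpoint is modelled as a plain set of completed tuples.

```python
import itertools
import random

Run = tuple[str, str, int]  # (task_id, treatment, rep)


def build_run_order(tasks: list[str], levels: list[str],
                    replications: int, seed: int = 42) -> list[Run]:
    """Enumerate every (task, treatment, rep) cell and shuffle with a fixed seed."""
    runs = list(itertools.product(tasks, levels, range(1, replications + 1)))
    random.Random(seed).shuffle(runs)  # local RNG: does not touch global state
    return runs


def pending_runs(order: list[Run], completed: set[Run]) -> list[Run]:
    """Resume support: keep the original order, skip runs already checkpointed."""
    return [run for run in order if run not in completed]


if __name__ == "__main__":
    order = build_run_order(["T_A"], ["off", "on"], replications=3)
    for run in order:
        print(run)
    done = set(order[:2])  # pretend the first two runs were checkpointed
    print(len(pending_runs(order, done)), "runs remaining")
```

Because the shuffle uses its own seeded generator, a resumed experiment regenerates the same order and continues where it stopped.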

The Experiment Panel

After a CRD run completes, the experiment panel evaluates the experiment itself as a knowledge-producing artifact. This is meta-evaluation: not "did the agent produce good output?" but "did this experiment design produce reliable, actionable knowledge?"

Three models run in parallel, each with an epistemically distinct persona:

Methodologist

Evaluates design rigor: falsifiability, confound control, replication adequacy, run order randomization. Flags design flaws that make results uninterpretable.

Verdicts: SOUND, MARGINAL, UNSOUND

Knowledge Auditor

Evaluates epistemic validity: did outputs change in response to feedback? Do conclusions follow from observed scores? Are alternative explanations addressed?

Verdicts: VALID, INCONCLUSIVE, INVALID

Loop Optimizer

Evaluates actionability: is there a specific next variable to change? Are effect sizes large enough to distinguish signal from noise? Produces a concrete next-experiment proposal.

Verdicts: ADVANCE, REVISE, REDESIGN

The panel's decision logic is a precedence cascade:

def panel_decision(methodologist: str, auditor: str, optimizer: str) -> str:
    """Combine the three persona verdicts into KEEP / REVISE / REDESIGN."""
    if methodologist == "UNSOUND":
        return "REDESIGN"    # design invalid, no point reading the data

    if methodologist in ("SOUND", "MARGINAL") and auditor == "VALID" and optimizer == "ADVANCE":
        return "KEEP"        # all three favorable: update baseline, design next experiment

    if optimizer == "REDESIGN":
        return "REDESIGN"    # findings insufficient to advance

    return "REVISE"          # mixed verdicts: more replications or cleaner isolation needed

The Methodologist has veto power: if the design is unsound, the data is uninterpretable regardless of what the scores show. This prevents the common failure mode of building on findings from a confounded experiment.
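Enumerating every verdict combination shows how strict the cascade is. The sketch restates the decision logic as a function (the name `panel_decision` is a label chosen for this example) and tallies the outcome over all 27 combinations of the three verdict sets.

```python
from collections import Counter
from itertools import product

METHODOLOGIST = ("SOUND", "MARGINAL", "UNSOUND")
AUDITOR = ("VALID", "INCONCLUSIVE", "INVALID")
OPTIMIZER = ("ADVANCE", "REVISE", "REDESIGN")


def panel_decision(methodologist: str, auditor: str, optimizer: str) -> str:
    """Precedence cascade: methodologist veto, then KEEP, then optimizer REDESIGN."""
    if methodologist == "UNSOUND":
        return "REDESIGN"
    if auditor == "VALID" and optimizer == "ADVANCE":
        return "KEEP"
    if optimizer == "REDESIGN":
        return "REDESIGN"
    return "REVISE"


counts = Counter(
    panel_decision(m, a, o)
    for m, a, o in product(METHODOLOGIST, AUDITOR, OPTIMIZER)
)
print(dict(counts))  # REDESIGN: 15, REVISE: 10, KEEP: 2
```

Only two of the 27 combinations reach KEEP, both requiring a VALID audit and an ADVANCE recommendation on a design that is not UNSOUND.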

The Loop Optimizer is the only persona required to produce a next_experiment_suggestion. It must name the factor, comparison levels, tasks, and the specific file or function to change if code is involved. This keeps the experimental loop closed: each panel review ends with a concrete proposal for what to test next, not just a verdict on what was tested.

How the Suite Evolves

The task registry and criterion library are not frozen. Tasks are added when a new pipeline capability needs coverage (T_ANN was added when the /annotate skill was introduced; T_G was added when file-based synthesis became a distinct input mode). Criteria are updated when the harness's standards change — no_file_path_refs was added after noticing that models were narrating their save operations inside document bodies.

The hardest part of evolving the suite is the fingerprint matching problem. Adding a task is straightforward. Changing a task's description breaks its fingerprint and silently disconnects it from historical WIGGUM scores in runs.jsonl. When descriptions must change, the fingerprint map must be updated explicitly.

The generated task system (--generated) is the forward path for scaling task coverage without manually curating each task. It reads a generated_tasks.json file produced by an external generator (TinyTroupe in the current integration) and materializes criterion functions from a criteria_specs declarative format. This lets the task count grow proportionally to the harness's capability surface without requiring manual criterion engineering for every new scenario.
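Materializing criteria from a declarative spec amounts to a lookup from names to factories. The criteria_specs format is not shown in this post, so the `{"name": ..., "args": [...]}` shape below is purely an assumption, as are the `FACTORIES` table and the `materialize` helper; the two factories are restated from the library above.

```python
import re


def min_bytes(n: int):
    def check(content: str):
        b = len(content.encode("utf-8"))
        return b >= n, f"{b} bytes (need >= {n})"
    return check


def min_sections(n: int):
    def check(content: str):
        headers = re.findall(r'^##\s+\S', content, re.MULTILINE)
        return len(headers) >= n, f"{len(headers)} H2 sections (need >= {n})"
    return check


# Hypothetical registry mapping spec names to criterion factories.
FACTORIES = {"min_bytes": min_bytes, "min_sections": min_sections}


def materialize(specs: list[dict]) -> list:
    """Turn declarative specs (assumed shape) into callable criteria."""
    criteria = []
    for spec in specs:
        name = spec["name"]
        if name not in FACTORIES:
            raise ValueError(f"unknown criterion: {name}")
        criteria.append(FACTORIES[name](*spec.get("args", [])))
    return criteria


if __name__ == "__main__":
    specs = [{"name": "min_sections", "args": [2]},
             {"name": "min_bytes", "args": [10]}]
    doc = "## First\nBody.\n\n## Second\nBody.\n"
    for criterion in materialize(specs):
        print(criterion(doc))
```

Rejecting unknown names up front keeps a typo in a generated file from silently dropping a check.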

The regression harness is the instrument that makes all other measurement meaningful. Without it, “did this change help?” is answered by intuition. With it, the question has a number — and the number is comparable across every change ever made to the harness.

What the Literature Leaves Open

Three papers from the harness’s lit-review corpus speak directly to the eval suite’s design.

The Judge Reliability Harness (2603.05399) is an open-source library that constructs validation suites to stress-test LLM judges across binary judgment accuracy and ordinal grading. Its central finding: “no judge was found to be uniformly reliable across all benchmarks,” with meaningful sensitivity to text formatting, paraphrasing, verbosity, and label flipping. The eval_suite.py task registry is implementing this exact pattern for WIGGUM rather than for arbitrary judges — the T_A–T_H task battery is a judge reliability harness applied to the harness’s own evaluator.

JuStRank (2412.09569) conducts the first large-scale study of LLM judges as system rankers rather than instance evaluators. It validates that aggregating judgment scores over multiple outputs produces a meaningful system-level quality signal distinct from per-instance scoring. This is exactly what score_suite() does: aggregate WIGGUM scores across T_A, T_B, T_C (and optionally T_D, T_E) to produce a single scalar representing harness-wide quality. The paper’s methodology also identifies judge bias toward specific systems as a separate concern from per-instance accuracy — an unaddressed risk if the same evaluator model is used in both production runs and eval suite runs.

Conformal Prediction for LLM-as-a-Judge (2509.18658) proposes constructing prediction intervals from a single evaluation run rather than reporting scalar scores. As documented in the autoresearch convergence post, the composite score’s baseline is a point estimate that can be set by a single lucky run (grounded=8, specificity=9), creating a threshold that subsequent experiments can rarely beat. Interval-valued composite scores would make the keep/discard threshold stochastically aware.

Open question: Should score_suite() return (mean, lower, upper) using conformal intervals over eval-n runs rather than a scalar, and should the autoresearch keep rule be “new interval strictly above old interval” rather than “new scalar > baseline + 0.1”?
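To show what an interval-based keep rule would change, here is a sketch that deliberately swaps in a simpler technique: a normal-approximation interval (mean ± z × standard error) over repeated runs, not the conformal procedure from the paper. `score_interval` and `keep_if_interval_above` are hypothetical names and the scores are invented.

```python
from math import sqrt
from statistics import mean, stdev


def score_interval(scores: list[float], z: float = 1.96) -> tuple[float, float, float]:
    """Return (mean, lower, upper) using a normal approximation.

    This is a stand-in for a conformal interval, used only to illustrate
    the shape of the return value. Requires at least two runs.
    """
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    m = mean(scores)
    half_width = z * stdev(scores) / sqrt(len(scores))
    return m, m - half_width, m + half_width


def keep_if_interval_above(new: tuple[float, float, float],
                           old: tuple[float, float, float]) -> bool:
    """Keep only when the new lower bound clears the old upper bound."""
    return new[1] > old[2]


if __name__ == "__main__":
    baseline = score_interval([7.0, 7.2, 7.1])
    clear_win = score_interval([8.0, 8.1, 7.9])
    noisy = score_interval([7.2, 7.0, 7.4])
    print(keep_if_interval_above(clear_win, baseline))  # True
    print(keep_if_interval_above(noisy, baseline))      # False
```

Under this rule a single lucky run can no longer set the baseline, because a one-run sample has no interval at all.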