May 23, 2026 • 18 min read • Agentic Harness Engineering Series

Experimental Methodology: Four Experiments, One Pipeline

A progression through four completely randomized designs that exposed the evaluator ceiling, the producer ceiling, and the bottleneck that neither model upgrade could fix — the synthesis instruction itself.

The patterns described across this series were not designed in advance. They were derived from a controlled empirical process: observe a failure, diagnose its root cause, apply a targeted intervention, measure whether behavior changed. This post documents that process — the experimental designs, the data, and what each experiment forced the next one to test.

The framework uses Montgomery-style completely randomized designs (CRDs) — the same statistical methodology used in industrial process improvement. Each factor level (task type) receives a fixed number of replications, run order is randomized to prevent confounding from search cache drift or model warm-up effects, and response variables are specified before any run begins. Hypotheses are written before data is collected so that "interesting" post-hoc patterns don't masquerade as confirmations.
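A randomized run plan of this kind can be sketched in a few lines. This is a minimal illustration, not the series' actual code: the function name `randomized_run_order` and the seed handling are assumptions, and the task IDs are the ones defined later in this post.

```python
import random

TASKS = ["T_A", "T_B", "T_C"]  # factor levels (task types)
REPLICATIONS = 3               # replications per level


def randomized_run_order(tasks, replications, seed):
    """Return every task repeated `replications` times, in shuffled order."""
    rng = random.Random(seed)  # seeded so the plan can be reproduced and audited
    plan = [task for task in tasks for _ in range(replications)]
    rng.shuffle(plan)          # randomization guards against drift confounds
    return plan


if __name__ == "__main__":
    # A different seed per experiment gives an independent permutation.
    for run_id, task in enumerate(randomized_run_order(TASKS, REPLICATIONS, seed=1), start=1):
        print(run_id, task)
```

Fixing the plan before any run starts matters as much as the shuffle itself: it removes the temptation to reorder runs after seeing early results.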

The Measurement Infrastructure

Before the first formal experiment, the pipeline needed an audit trail. Every run appends one JSON record to runs.jsonl:

{
  "task": "top 5 context engineering techniques...",
  "producer_model": "pi-qwen",
  "evaluator_model": "glm4:9b",
  "total_search_chars": 3125,
  "output_bytes": 1525,
  "output_lines": 20,
  "wiggum_scores": [9.0],
  "wiggum_dims": {"relevance": 9, "completeness": 9, "depth": 8, "specificity": 8, "structure": 10},
  "wiggum_rounds": 1,
  "final": "PASS",
  "run_duration_s": 142.3,
  "input_tokens": 8240,
  "output_tokens": 1621
}

The append-only format means every run is recoverable regardless of what happens downstream. analytics.py reads the log and computes cross-run summaries; inspect_run.py gives per-run forensics. This infrastructure made the regime shift visible before the formal experiments began.
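The append-only pattern itself is small. The sketch below shows one way to write and read such a log; the helper names `append_run` and `load_runs` are hypothetical, and the real analytics.py and inspect_run.py may work differently.

```python
import json
from pathlib import Path


def append_run(record: dict, path: str = "runs.jsonl") -> None:
    """Append one run as a single JSON line; earlier lines are never rewritten."""
    with Path(path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_runs(path: str = "runs.jsonl") -> list[dict]:
    """Read every recorded run, skipping blank lines."""
    log = Path(path)
    if not log.exists():
        return []
    with log.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Because each record is one self-contained line, a crash mid-run can at worst leave a truncated final line; it cannot corrupt earlier records.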

The Foundational Finding: Dual-Search Regime Shift

The first hypothesis the pipeline data tested was not experimental — it was observational. A comparison of six runs (three single-search, three dual-search) produced a finding stark enough to anchor every subsequent experiment:

| Metric | Single Search (n=3) | Dual Search (n=3) | Delta |
| --- | --- | --- | --- |
| Avg output bytes | 1,149 | 1,817 | +58% |
| Avg output lines | 10.7 | 31.0 | +190% |
| First Wiggum score | 7.7 | 9.0 | +1.3 |
| Avg Wiggum rounds | 1.3 | 1.0 | −0.3 |

Every metric improved with dual search. The prior correlation between search chars and output quality appeared strongly positive (r ≈ +0.9). Dual search was made the default: always run two queries before synthesizing, with a 1,800-char quality floor that triggers a fallback if merged results fall short.
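The dual-search default with its quality floor might look roughly like this. It is a sketch under stated assumptions: `search` and `fallback` are hypothetical callables standing in for whatever backends the harness uses, and the post does not specify what the fallback actually does.

```python
from typing import Callable, Sequence

QUALITY_FLOOR_CHARS = 1_800  # floor quoted in the post


def gather_context(
    queries: Sequence[str],
    search: Callable[[str], str],    # hypothetical: returns result text for one query
    fallback: Callable[[str], str],  # hypothetical: extra retrieval when results are thin
    floor: int = QUALITY_FLOOR_CHARS,
) -> str:
    """Run every query, merge the results, and top up if below the floor."""
    merged = "\n\n".join(search(q) for q in queries)
    if len(merged) < floor:
        # Assumption: the fallback re-queries using the first query.
        merged = merged + "\n\n" + fallback(queries[0])
    return merged
```

Synthesis would then run on the returned text, so the producer never sees a context thinner than the floor unless the fallback also comes back short.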

The positive correlation between search chars and output quality held at regime scale — single-search vs dual-search runs are qualitatively different. Later experiments would show that within the dual-search regime (all runs 2,900–3,600 chars), search volume becomes a non-predictor (r = −0.577, noise). Leading indicators lose their predictive value once the floor they measured has been consistently cleared.

Dual-Search Regime Shift — Output Bytes and First-Pass Score

Single-search vs dual-search runs. Each bar is one run. The regime boundary (dashed) corresponds to enabling dual search by default. Score axis (right) shown as dots.

Experiment Design Framework

The four formal experiments used the same task set and CRD structure, enabling direct cross-experiment comparisons. The factor is task type at three levels:

| ID | Type | Count constraint | Domain |
| --- | --- | --- | --- |
| T_A | enumerated | Top 5 (explicit) | Context engineering techniques |
| T_B | best_practices | Open-ended | Cost envelope management |
| T_C | enumerated | Top 3 (explicit) | Agent failure modes |

Three replications per task type, randomized run order (independent permutation per experiment). The identical task prompts across all four experiments mean that differences in output metrics are attributable to harness changes, not task variation. Response variables are collected automatically from runs.jsonl.

Experiment 01: Pipeline Generalization Study

Question: Does the dual-search harness produce consistent quality across different task types and count constraints?

Nine runs, 3 × 3 CRD. Producer: qwen2.5:7b. Evaluator: glm4:9b. Pass threshold: score ≥ 8.

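| Run | Task | Search chars | Output bytes | Score r1 | Rounds | Final |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | T_C | 3,174 | 1,322 | 9 | 1 | PASS |
| 2 | T_A | 3,125 | 1,525 | 9 | 1 | PASS |
| 3 | T_B | 3,201 | 2,501 | 9 | 1 | PASS |
| 4 | T_A | 3,031 | 3,332 | 9 | 1 | PASS |
| 5 | T_C | 3,577 | 1,291 | 9 | 1 | PASS |
| 6 | T_B | 3,154 | 3,262 | 9 | 1 | PASS |
| 7 | T_C | 3,165 | 2,649 | 9 | 1 | PASS |
| 8 | T_B | 2,952 | 2,952 | 9 | 1 | PASS |
| 9 | T_A | 3,484 | 2,110 | 9 | 1 | PASS |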

Pass rate: 9/9. The Wiggum loop generalized across all task types without failure. But the per-task statistics told a less tidy story:

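| Task | Bytes mean | Bytes std | CV | Score mean | Rounds mean |
| --- | --- | --- | --- | --- | --- |
| T_A | 2,322 | 922 | 39.7% | 9.0 | 1.0 |
| T_B | 2,905 | 383 | 13.2% | 9.0 | 1.0 |
| T_C | 1,754 | 775 | 44.2% | 9.0 | 1.0 |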

T_C — the most explicitly constrained task ("top 3") — had the highest output variance at CV = 44.2%, exceeding the 40% falsification threshold. T_B, the open-ended task, was the most consistent at CV = 13.2%. Counterintuitive finding: explicit count constraints introduce brittleness. When the model over-delivers (Run 7 produced 7 failure modes for a "top 3" task), the evaluator passed it at 9/10 without enforcing the count rule. The Wiggum loop was correct that the output was good — but the task wasn't completed as specified.
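The CV figures can be reproduced directly from the output-bytes column of the run table. The sketch below assumes the sample standard deviation (n − 1 denominator), which is what matches the reported values; the function name `cv_percent` is illustrative.

```python
from statistics import mean, stdev

# Output bytes per task, copied from the experiment 01 run table.
OUTPUT_BYTES = {
    "T_A": [1525, 3332, 2110],
    "T_B": [2501, 3262, 2952],
    "T_C": [1322, 1291, 2649],
}

FALSIFICATION_THRESHOLD = 40.0  # percent, as stated in the post


def cv_percent(values):
    """Coefficient of variation: sample std divided by mean, in percent."""
    return 100 * stdev(values) / mean(values)


if __name__ == "__main__":
    for task, values in OUTPUT_BYTES.items():
        cv = cv_percent(values)
        flag = "exceeds threshold" if cv > FALSIFICATION_THRESHOLD else "ok"
        print(f"{task}: CV = {cv:.1f}% ({flag})")
```

With only three values per task, one outlier dominates the result: run 7's 2,649 bytes is what pushes T_C over the threshold.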

Evaluator ceiling detected. All 18 runs across experiments 01 and 02 returned wiggum_rounds = 1 with score_r1 = 9.0. The revision loop — the mechanism that should surface and fix quality gaps — was completely dormant. The root cause: glm4:9b assigns 9/10 to any output that is structurally complete and topically correct, regardless of depth or specificity. Raising the pass threshold from 8 to 9 in experiment 02 had zero effect because the evaluator's score distribution never moved. Threshold changes are a no-op if the evaluator never scores below the new threshold.

Experiment 02: Harness Upgrade Impact Study

Question: Does adding count constraint enforcement, a raised pass threshold, and task-type-specific evaluator criteria improve consistency?

Same CRD. Three changes under test: (1) harness-side count check re-synthesizes if item count is wrong, (2) pass threshold raised to 9, (3) per-task-type scoring criteria injected (enumerated / best_practices / research).
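A harness-side count check along the lines of change (1) could be sketched as follows. The parsing rules here are assumptions for illustration: the real harness may extract the expected count and detect list items differently.

```python
import re


def expected_count(task: str):
    """Pull N out of a 'top N' task prompt; None for open-ended tasks."""
    match = re.search(r"\btop\s+(\d+)\b", task, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None


def count_items(output: str) -> int:
    """Count lines that start a numbered item such as '1.' or '2)'."""
    return len(re.findall(r"^\s*\d+[.)]\s+", output, flags=re.MULTILINE))


def needs_resynthesis(task: str, output: str) -> bool:
    """True when the task states a count and the output does not match it."""
    wanted = expected_count(task)
    return wanted is not None and count_items(output) != wanted
```

Running the check in the harness rather than the evaluator prompt makes it deterministic: an output with seven items for a "top 3" task fails regardless of how the evaluator scores its quality.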

Results: 9/9 PASS again. Zero count_check_retry events (the count was right on every first synthesis). All nine runs were routed to the correct task type. CV improved for constrained tasks: T_A 39.7% → 22.8%, T_C 44.2% → 32.4%. T_B remained the most stable.

But the central hypothesis — that raising the threshold would surface runs requiring revision — was definitively falsified. The evaluator scored 9/10 on every first pass, identical to experiment 01. The harness improvements had measurable effects on consistency but zero effect on the metric that mattered: revision loop activation. The ceiling was an evaluator problem, not a threshold or criteria problem.

Design implication: when every run scores the same, the metric has become uninformative. A uniformly high first-pass score means either quality is genuinely high, the evaluator is too lenient, or the pass threshold is too low. Distinguish these by using a stricter evaluator before concluding the harness has no room to improve. Experiment 03 was designed exactly to test this.
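A quick diagnostic makes the "uninformative metric" condition checkable. This is a sketch with a hypothetical helper name, applied to the 18 first-pass scores from experiments 01 and 02.

```python
from statistics import pstdev


def diagnose_scores(scores, threshold):
    """Summarize whether a score column can drive the revision loop at all."""
    spread = pstdev(scores)
    return {
        "spread": spread,
        # A run is revised only when its score falls below the pass threshold.
        "revisions_triggered": sum(score < threshold for score in scores),
        "informative": spread > 0,
    }


# Every first-pass score in experiments 01 and 02 was 9.0.
FIRST_PASS = [9.0] * 18

if __name__ == "__main__":
    for threshold in (8, 9):
        print(threshold, diagnose_scores(FIRST_PASS, threshold))
```

Both thresholds trigger zero revisions, which is the no-op described above: moving the bar changes nothing when no score sits between the old bar and the new one.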

Experiment 03: Evaluator Upgrade

Question: Does replacing glm4:9b with Qwen3-Coder:30b produce genuine score variance and activate the revision loop?

Same CRD and task set. Evaluator changed to Qwen3-Coder:30b (30B parameters, roughly 3× the parameter count of glm4:9b). Evaluator prompt updated with calibration anchors and a rule requiring named issues for any dimension scored ≤8. Pass threshold: 8.0.

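| Run | Task | Score r1 | Score final | Rounds | Gain | Final |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | T_C | 7.0 | 8.8 | 2 | +1.8 | PASS |
| 2 | T_A | 7.0 | 6.9 | 3 | −0.1 | FAIL |
| 3 | T_C | 7.0 | 7.0 | 3 | 0.0 | FAIL |
| 4 | T_B | 6.0 | 7.2 | 3 | +1.2 | FAIL |
| 5 | T_A | 7.0 | 6.8 | 3 | −0.2 | FAIL |
| 6 | T_B | 8.1 | 8.1 | 1 | 0 | PASS |
| 7 | T_C | 7.0 | 8.1 | 2 | +1.1 | PASS |
| 8 | T_B | 7.5 | 8.2 | 2 | +0.7 | PASS |
| 9 | T_A | 7.0 | 7.0 | 3 | 0.0 | FAIL |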

Overall: 4/9 PASS. T_A: 0/3. The evaluator upgrade worked exactly as intended — first-pass score std rose from effectively 0 to 0.55, mean dropped from 9.0 to 7.07, revision loop activated in 8/9 runs. The unexpected result was what the revision loop revealed: qwen2.5:7b could not respond to depth feedback on T_A.
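The summary figures follow directly from the run table. The sketch below recomputes them; the tuple layout is an illustrative choice, and the sample standard deviation is assumed because it matches the reported 0.55.

```python
from statistics import mean, stdev

THRESHOLD = 8.0

# (run, task, score_r1, score_final, rounds) from the experiment 03 table.
EXP03 = [
    (1, "T_C", 7.0, 8.8, 2), (2, "T_A", 7.0, 6.9, 3), (3, "T_C", 7.0, 7.0, 3),
    (4, "T_B", 6.0, 7.2, 3), (5, "T_A", 7.0, 6.8, 3), (6, "T_B", 8.1, 8.1, 1),
    (7, "T_C", 7.0, 8.1, 2), (8, "T_B", 7.5, 8.2, 2), (9, "T_A", 7.0, 7.0, 3),
]

first_pass = [row[2] for row in EXP03]
passes = sum(row[3] >= THRESHOLD for row in EXP03)   # final score clears the bar
revised = sum(row[4] > 1 for row in EXP03)           # more than one round ran
t_a_passes = sum(row[3] >= THRESHOLD for row in EXP03 if row[1] == "T_A")

if __name__ == "__main__":
    print(f"first-pass mean {mean(first_pass):.2f}, std {stdev(first_pass):.2f}")
    print(f"pass {passes}/9, revised {revised}/9, T_A pass {t_a_passes}/3")
```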

Two distinct failure modes appeared in T_A revision:

| Mode | Pattern | Mechanism |
| --- | --- | --- |
| Regression | 7.0 → 6.9 (run 2), 7.0 → 6.8 (run 5) | Producer rewrites sections and removes content while trying to add depth; shorter, shallower result |
| Stagnation | 7.0 → 7.0 (run 9; run 3 on T_C showed the same pattern) | Producer edits surface wording without addressing the underlying depth gap the evaluator identified |
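These modes can be labeled automatically from the first-pass and final scores already recorded per run. The classifier below is a sketch: the function name, the `improved_but_failed` label, and the 0.05 tolerance are assumptions, not part of the harness.

```python
def classify_revision(score_r1, score_final, threshold=8.0, eps=0.05):
    """Label the outcome of a revision loop from its first and final scores."""
    if score_final >= threshold:
        return "pass"
    delta = score_final - score_r1
    if delta < -eps:
        return "regression"           # revision made the output worse
    if abs(delta) <= eps:
        return "stagnation"           # revision changed nothing measurable
    return "improved_but_failed"      # moved up, but not past the threshold


if __name__ == "__main__":
    # T_A runs from experiment 03: (run, score_r1, score_final).
    for run, r1, final in [(2, 7.0, 6.9), (5, 7.0, 6.8), (9, 7.0, 7.0)]:
        print(run, classify_revision(r1, final))
```

Logging the label alongside each run would let the cross-run summaries separate "producer cannot respond to feedback" from "producer responds, but not enough".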

Producer ceiling exposed. The Wiggum loop design was correct and the evaluator was now calibrated; the bottleneck was the producer. qwen2.5:7b can produce output that scores ≥8.0 on open-ended tasks (T_B) and short enumerated tasks (T_C), where depth per item is achievable. For T_A (five items, each requiring a concrete implementation note) the ceiling was ~7.0 regardless of revision. The evaluator correctly identified missing depth per item, but the producer could not add it. Depth and specificity both scored 6.0 on every T_A first pass, and revision did not move them.

Experiment 04: Producer Upgrade

Question: Does replacing qwen2.5:7b with qwen2.5:32b Q4_K_M break the T_A ceiling?

Same task set, same evaluator, same threshold. Producer changed to the 32B-parameter model (approximately 20 GB at Q4_K_M quantization). The experiment logged 16 records rather than 9 because a MarkItDown URL enrichment integration landed mid-experiment, so the per-task run counts are unequal (4, 7, and 5).

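| Task | Score r1 mean | Rounds mean | Pass rate | Depth r1 | Spc r1 | Bytes mean |
| --- | --- | --- | --- | --- | --- | --- |
| T_A | 8.00 ±0.74 | 1.25 | 4/4 (100%) | 7.2 | 7.0 | 2,293 |
| T_B | 6.97 ±0.53 | 2.71 | 4/7 (57%) | 6.1 | 5.9 | 2,198 |
| T_C | 7.54 ±0.80 | 2.00 | 4/5 (80%) | 6.8 | 6.4 | 1,491 |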

Overall: 12/16 PASS (75%), up from 4/9 (44%) in experiment 03. All five hypotheses were confirmed.

Cross-Experiment Progression — T_A First-Pass Score and Pass Rate

T_A (top 5 enumerated) across all four experiments. The evaluator upgrade (exp-03) activated the revision loop but exposed the producer ceiling. The producer upgrade (exp-04) broke the ceiling. T_A first-pass score went 9.0 → 9.0 → 7.0 → 8.0 across the experiment series — the drops and recovery track which component was the actual bottleneck at each stage.

T_B: The Remaining Bottleneck

The most counterintuitive finding from experiment 04 was T_B. The 32B producer generates shorter T_B output than the 7B model — 2,198 bytes vs 3,288 bytes, a 33% reduction. Depth and specificity on first pass are essentially unchanged: T_B depth_r1 = 6.1 (vs experiment 03's 6.0). The parameter upgrade that definitively solved T_A had no effect on T_B's depth dimension.

The dimension data makes the cause clear. T_A depth improvement was +1.2 points; T_B depth improvement was +0.1 points. A 4.5× parameter increase brought no depth improvement on open-ended tasks. This is a synthesis instruction problem, not a model capability problem. The SYNTH_INSTRUCTION doesn't push hard enough on depth for best-practices task types, and the 32B complies faithfully with a weak instruction — it's more capable, so it follows the (insufficiently demanding) instruction more precisely.
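The comparison reduces to a few subtractions over figures already quoted in this post. The sketch below only restates that arithmetic; the variable and function names are illustrative.

```python
# First-pass depth scores as (experiment 03, experiment 04).
DEPTH_R1 = {"T_A": (6.0, 7.2), "T_B": (6.0, 6.1)}

# Mean T_B output bytes as (7B producer, 32B producer).
T_B_BYTES = (3288, 2198)


def depth_gain(task):
    """Change in first-pass depth after the producer upgrade."""
    before, after = DEPTH_R1[task]
    return round(after - before, 1)


def reduction_pct(before, after):
    """Percentage drop from `before` to `after`, rounded to a whole percent."""
    return round(100 * (before - after) / before)


if __name__ == "__main__":
    print("T_A depth gain:", depth_gain("T_A"))
    print("T_B depth gain:", depth_gain("T_B"))
    print("T_B byte reduction (%):", reduction_pct(*T_B_BYTES))
```

Seeing the same model change produce a twelvefold difference in depth gain between two task types is what points away from capability and toward the instruction each task type receives.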

The bottleneck has shifted three times. The dual-search change solved the research quality bottleneck. The evaluator upgrade unmasked the producer ceiling. The producer upgrade solved the producer ceiling — and unmasked the synthesis instruction as the remaining bottleneck. Each fix reveals the next constraint.

Bottleneck Chain: What Each Experiment Fixed and What It Revealed

Each intervention solved one constraint and made the next constraint visible. Harness improvement is a sequential bottleneck-elimination process.

The Autoresearch Loop

With T_B depth/specificity identified as the synthesis instruction bottleneck, the pipeline became self-targeting. autoresearch.py is an autonomous optimizer that mutates the SYNTH_INSTRUCTION string and runs controlled experiments to measure whether the mutation improved the composite score:

composite = 0.7 × mean_wiggum_r1 + 0.3 × criteria_rate × 10

Each iteration: propose a modification to the synthesis instruction (using Qwen3-Coder:30b as proposer, optionally kimi-k2.5:cloud for structural diversity), run the eval suite, keep the commit if new_score − baseline > 0.1, otherwise revert via git reset HEAD~1 --soft. The only mutable scope is the instruction string between sentinel markers — the harness itself is off-limits.
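The scoring formula and keep rule translate almost literally into code. This is a minimal sketch rather than the contents of autoresearch.py: it assumes `criteria_rate` is a fraction in [0, 1] (which the ×10 scaling suggests), and it leaves out the git commit and revert steps.

```python
KEEP_MARGIN = 0.1  # minimum improvement over baseline required to keep a mutation


def composite(mean_wiggum_r1: float, criteria_rate: float) -> float:
    """Weighted blend of first-pass score (0-10) and criteria hit rate (0-1)."""
    return 0.7 * mean_wiggum_r1 + 0.3 * criteria_rate * 10


def decide(new_score: float, baseline: float, margin: float = KEEP_MARGIN) -> str:
    """Keep the mutated instruction only if it beats baseline by more than the margin."""
    return "KEEP" if new_score - baseline > margin else "DISCARD"
```

The margin acts as a crude noise filter: with only a handful of eval runs per proposal, improvements smaller than 0.1 are treated as indistinguishable from run-to-run variation.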

Session 3 results illustrate what the optimizer discovers and discards:

| Experiment | Change attempted | Score | Decision |
| --- | --- | --- | --- |
| 6 | Failure modes + detection/mitigation per strategy | 8.350 | DISCARD −0.233 |
| 7 | "When NOT to use" + input boundaries framing | 8.915 | KEEP +0.332 |
| 8 | Measurable success criteria / validation tests | 8.915 | DISCARD +0.000 |
| 9 | Confidence ratings (High/Med/Low) per library | 7.965 | DISCARD −0.950 |

The largest single negative signal: experiment 9's confidence ratings caused a −0.950 regression. The evaluator reads uncertainty markers as lack of authority. Hedging actively hurts depth scores. This finding is impossible to derive from first principles — it requires running the pipeline and measuring it.

The largest single positive: experiment 7's "when NOT to use" framing (+0.332) was discovered by Kimi (a cloud-based proposer), which explored structurally different angles from the local proposer. Local sessions 1–2 converged on "add more code / implementation detail." Session 3's cloud proposer immediately tried constraint framing and boundary conditions — approaches the local proposer never reached.
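Replaying the Session 3 scores shows how the baseline ratchets upward only on a KEEP. This is a sketch: the starting baseline of 8.583 is not stated in the post and is inferred here from experiment 6 (8.350 + 0.233), and the 0.1 margin is the keep rule quoted earlier.

```python
KEEP_MARGIN = 0.1

# (experiment id, composite score) as reported for Session 3.
SESSION_3 = [(6, 8.350), (7, 8.915), (8, 8.915), (9, 7.965)]

# Assumption: inferred from experiment 6's score and its reported delta.
START_BASELINE = 8.583


def replay(results, baseline, margin=KEEP_MARGIN):
    """Apply the keep rule in order; the baseline moves only when a change is kept."""
    log = []
    for exp_id, score in results:
        delta = round(score - baseline, 3)
        decision = "KEEP" if delta > margin else "DISCARD"
        log.append((exp_id, delta, decision))
        if decision == "KEEP":
            baseline = score
    return log, baseline


if __name__ == "__main__":
    history, final_baseline = replay(SESSION_3, START_BASELINE)
    for entry in history:
        print(entry)
    print("final baseline:", final_baseline)
```

Experiment 8 is the instructive case: it scores exactly at the new baseline set by experiment 7, so it is discarded even though its score looks identical to the best result.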

The autoresearch loop encodes the full experimental methodology in software. It runs controlled experiments (each proposal is one experiment), measures a response variable (composite score), applies a decision rule (keep if delta > threshold), and maintains a git-auditable history of what worked. The runs.jsonl audit log described in Post 10 provides the data; the autoresearch loop is the systematic optimizer that acts on it.

Statistical Addendum: What Three Replications Can and Cannot Tell You

Three replications per task type is the practical floor for a CRD. It is insufficient for most inferential statistics. The experiments do not report Mann-Whitney U tests or confidence intervals on the per-task means because three samples produce unreliable estimates of those quantities.

What three replications can reliably detect: large effects. The T_A failure in experiment 03 (0/3 PASS with regression on 2/3 attempts) does not require a test to interpret. The 4/4 T_A success in experiment 04 following a single producer change does not require a test. The signal-to-noise ratio at these effect sizes is high enough that informal reasoning suffices.

The coefficient of variation (CV) analysis — std / mean per task per experiment — is the primary summary statistic used. It is dimensionless, interpretable, and does not require distributional assumptions. The 20% stability threshold used throughout the experiments is a practical engineering criterion, not a statistical one: at CV < 20%, the pipeline produces outputs consistent enough to serve as training data for the Data Flywheel (Post 10) without requiring per-sample human review.

Formal inference would require at least 6–8 replications per condition to detect medium-sized effects with reasonable power. That cost is not warranted for iterative engineering decisions — the right tool is rapid pilot experiments followed by a single confirmatory run, which is precisely what the autoresearch loop implements.

Three-Model Architecture

The four experiments converged on a three-model architecture where each model occupies its correct niche:

| Role | Model | Size | Rationale |
| --- | --- | --- | --- |
| Producer | qwen2.5:32b Q4_K_M | ~20 GB | Breaks enumerated task ceiling; reliable revision without regression |
| Evaluator | Qwen3-Coder:30b | ~18 GB | Different architecture family; scores critically; names specific issues |
| Planner / Compressor | glm4:9b | ~5 GB | Fast; handles lightweight JSON generation and memory compression |

The key constraint is that the evaluator must be more capable than the producer in the dimensions being evaluated — not necessarily larger in total parameters, but specifically more discriminating on depth and specificity. glm4:9b failed as evaluator not because it was small but because it was not calibrated to penalize shallow outputs. A smaller model with better calibration would work. The experiments happened to find a calibrated model that was also larger.
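One way to keep these role assignments explicit is a small immutable config. The structure below is purely illustrative: the class and function names are hypothetical, and the model strings are copied from the table rather than taken from any runtime's exact tag format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRole:
    role: str
    model: str
    approx_size_gb: int  # approximate footprint, from the table above


ARCHITECTURE = (
    ModelRole("producer", "qwen2.5:32b Q4_K_M", 20),
    ModelRole("evaluator", "Qwen3-Coder:30b", 18),
    ModelRole("planner_compressor", "glm4:9b", 5),
)


def model_for(role: str) -> str:
    """Look up the model assigned to a role; unknown roles raise KeyError."""
    for entry in ARCHITECTURE:
        if entry.role == role:
            return entry.model
    raise KeyError(role)


def total_size_gb() -> int:
    """Approximate combined footprint if all three models are resident at once."""
    return sum(entry.approx_size_gb for entry in ARCHITECTURE)
```

Keeping producer and evaluator as separate named roles makes it harder to accidentally point both at the same model, which would reintroduce the self-grading leniency the experiments uncovered.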

The harness patterns described throughout this series — dual-backend memory (Post 5), the Wiggum Loop (Post 6), dimensional rubrics (Post 7), the Data Flywheel (Post 10) — were all validated or refined against the runs.jsonl audit log. The experimental methodology is not separate from the harness; it is the mechanism by which the harness was built. Every pattern in Posts 1–11 has a row in a JSONL file that justified it.
