May 23, 2026 • 18 min read • Agentic Harness Engineering Series

Experimental Methodology: Four Experiments, One Pipeline

A progression through four completely randomized designs that exposed the evaluator ceiling, the producer ceiling, and the bottleneck that neither model upgrade could fix — the synthesis instruction itself.

The patterns described across this series were not designed in advance. They were derived from a controlled empirical process: observe a failure, diagnose its root cause, apply a targeted intervention, measure whether behavior changed. This post documents that process — the experimental designs, the data, and what each experiment forced the next one to test.

The framework uses Montgomery-style completely randomized designs (CRDs) — the same statistical methodology used in industrial process improvement. Each factor level (task type) receives a fixed number of replications, run order is randomized to prevent confounding from search cache drift or model warm-up effects, and response variables are specified before any run begins. Hypotheses are written before data is collected so that "interesting" post-hoc patterns don't masquerade as confirmations.
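As a rough illustration of the run-order randomization described above, the sketch below builds a 3 × 3 schedule and shuffles it with a recorded seed. The function name and the seed value are hypothetical and are not taken from the harness.

```python
import random

TASKS = ["T_A", "T_B", "T_C"]  # factor levels (task types)
REPLICATIONS = 3               # runs per task type


def randomized_run_order(seed: int) -> list[str]:
    """Return a shuffled schedule with a fixed number of runs per task type."""
    runs = [task for task in TASKS for _ in range(REPLICATIONS)]
    rng = random.Random(seed)  # recording the seed makes the permutation reproducible
    rng.shuffle(runs)
    return runs


# Hypothetical seed; each experiment would draw its own independent permutation.
order = randomized_run_order(seed=1)
```

Fixing the schedule before any run starts keeps the ordering decision independent of the results.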


The Measurement Infrastructure

Before the first formal experiment, the pipeline needed an audit trail. Every run appends one JSON record to runs.jsonl:

{
  "task": "top 5 context engineering techniques...",
  "producer_model": "pi-qwen",
  "evaluator_model": "glm4:9b",
  "total_search_chars": 3125,
  "output_bytes": 1525,
  "output_lines": 20,
  "wiggum_scores": [9.0],
  "wiggum_dims": {"relevance": 9, "completeness": 9, "depth": 8, "specificity": 8, "structure": 10},
  "wiggum_rounds": 1,
  "final": "PASS",
  "run_duration_s": 142.3,
  "input_tokens": 8240,
  "output_tokens": 1621
}

The append-only format means every run is recoverable regardless of what happens downstream. analytics.py reads the log and computes cross-run summaries; inspect_run.py gives per-run forensics. This infrastructure made the regime shift visible before the formal experiments began.
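A minimal sketch of the append-and-summarize pattern, assuming only the field names shown in the record above. The helper names are hypothetical and do not reproduce analytics.py or inspect_run.py; the demo records use illustrative values.

```python
import json
import tempfile
from pathlib import Path


def append_run(record: dict, log_path: Path) -> None:
    """Append one run as a single JSON line; existing lines are never rewritten."""
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def load_runs(log_path: Path) -> list[dict]:
    """Read every record back, skipping blank lines."""
    with log_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def summarize(runs: list[dict]) -> dict:
    """Cross-run summary: run count, mean first-round score, pass rate."""
    first_scores = [run["wiggum_scores"][0] for run in runs]
    passes = sum(1 for run in runs if run["final"] == "PASS")
    return {
        "runs": len(runs),
        "mean_first_score": sum(first_scores) / len(first_scores),
        "pass_rate": passes / len(runs),
    }


# Demo with two abbreviated, illustrative records written to a temporary log.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "runs.jsonl"
    append_run({"wiggum_scores": [9.0], "wiggum_rounds": 1, "final": "PASS"}, log)
    append_run({"wiggum_scores": [7.0, 7.0, 6.9], "wiggum_rounds": 3, "final": "FAIL"}, log)
    summary = summarize(load_runs(log))
```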

The Foundational Finding: Dual-Search Regime Shift

The first finding was observational rather than experimental. A comparison of six runs (three single-search, three dual-search) produced a result stark enough to anchor every subsequent experiment:

Metric	Single Search (n=3)	Dual Search (n=3)	Delta
Avg output bytes	1,149	1,817	+58%
Avg output lines	10.7	31.0	+190%
First Wiggum score	7.7	9.0	+1.3
Avg Wiggum rounds	1.3	1.0	−0.3

Every metric improved with dual search. Across these early runs, the correlation between search chars and output quality appeared strongly positive (r ≈ +0.9). Dual search became the default: always run two queries before synthesizing, with a 1,800-char quality floor that triggers a fallback if the merged results fall short.
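The dual-search default with its quality floor might look roughly like the sketch below. The post does not say what the fallback does, so both the search backend and the fallback are passed in as hypothetical callables and stubbed for the demo.

```python
from typing import Callable

QUALITY_FLOOR_CHARS = 1_800  # floor stated in the post


def gather_context(
    queries: tuple[str, str],
    search: Callable[[str], str],
    fallback: Callable[[str], str],
) -> tuple[str, bool]:
    """Run both queries, merge the results, and apply the quality floor.

    Returns (context, used_fallback).
    """
    parts = [search(query) for query in queries]
    merged = "\n\n".join(part for part in parts if part)
    if len(merged) < QUALITY_FLOOR_CHARS:
        return fallback(merged), True
    return merged, False


# Stub backends, for illustration only.
def fake_search(query: str) -> str:
    return f"result for {query}: " + "x" * 1_000


def fake_fallback(merged: str) -> str:
    return merged + "\n\n" + "fallback text " * 100


context, used_fallback = gather_context(("query one", "query two"), fake_search, fake_fallback)
thin_context, thin_used = gather_context(("q1", "q2"), lambda q: "short", fake_fallback)
```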

The positive correlation between search chars and output quality held at regime scale: single-search and dual-search runs are qualitatively different. Later experiments showed that within the dual-search regime (all runs 2,900–3,600 chars), search volume stops being a predictor. Across the nine Experiment 01 runs, search chars and output bytes correlate at r = −0.577, which at n = 9 is hard to distinguish from noise, and the first-pass score did not vary at all. Leading indicators lose their predictive value once the floor they measured has been consistently cleared.
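The within-regime correlation can be checked against the Experiment 01 results table reported later in this post. The sketch below computes Pearson's r between search chars and output bytes for those nine runs; the helper is written out by hand and is not the analytics.py implementation.

```python
import math

# (search_chars, output_bytes) for Experiment 01 runs 1-9, copied from the results table.
RUNS = [
    (3174, 1322), (3125, 1525), (3201, 2501),
    (3031, 3332), (3577, 1291), (3154, 3262),
    (3165, 2649), (2952, 2952), (3484, 2110),
]


def pearson_r(pairs: list[tuple[float, float]]) -> float:
    """Pearson correlation coefficient of paired observations."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(RUNS)
print(f"r = {r:.3f}")  # prints r = -0.577
```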

Dual-Search Regime Shift — Output Bytes and First-Pass Score

Single-search vs dual-search runs. Each bar is one run. The regime boundary (dashed) corresponds to enabling dual search by default. Score axis (right) shown as dots.

Experiment Design Framework

The four formal experiments used the same task set and CRD structure, enabling direct cross-experiment comparisons. The factor is task type at three levels:

ID	Type	Count constraint	Domain
T_A	enumerated	Top 5 (explicit)	Context engineering techniques
T_B	best_practices	Open-ended	Cost envelope management
T_C	enumerated	Top 3 (explicit)	Agent failure modes

Three replications per task type, randomized run order (independent permutation per experiment). The identical task prompts across all four experiments mean that differences in output metrics are attributable to harness changes, not task variation. Response variables are collected automatically from runs.jsonl.

Experiment 01: Pipeline Generalization Study

Question: Does the dual-search harness produce consistent quality across different task types and count constraints?

Nine runs, 3 × 3 CRD. Producer: qwen2.5:7b. Evaluator: glm4:9b. Pass threshold: score ≥ 8.

Run	Task	Search chars	Output bytes	Score r1	Rounds	Final
1	T_C	3,174	1,322	9	1	PASS
2	T_A	3,125	1,525	9	1	PASS
3	T_B	3,201	2,501	9	1	PASS
4	T_A	3,031	3,332	9	1	PASS
5	T_C	3,577	1,291	9	1	PASS
6	T_B	3,154	3,262	9	1	PASS
7	T_C	3,165	2,649	9	1	PASS
8	T_B	2,952	2,952	9	1	PASS
9	T_A	3,484	2,110	9	1	PASS

Pass rate: 9/9. The wiggum loop generalized across all task types without failure. But the per-task statistics told a less tidy story:

Task	Bytes mean	Bytes std	CV	Score mean	Rounds mean
T_A	2,322	922	39.7%	9.0	1.0
T_B	2,905	383	13.2%	9.0	1.0
T_C	1,754	775	44.2%	9.0	1.0

T_C, the most explicitly constrained task ("top 3"), had the highest output variance at CV = 44.2%, exceeding the 40% falsification threshold. T_B, the open-ended task, was the most consistent at CV = 13.2%. Counterintuitive finding: explicit count constraints introduce brittleness. When the model over-delivers (Run 7 produced 7 failure modes for a "top 3" task), the evaluator passed it at 9/10 without enforcing the count rule. The Wiggum loop was right that the output was good, but the task was not completed as specified.
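The per-task CV figures can be reproduced from the output bytes in the run table. This is a small sketch using the sample standard deviation; the 40% constant is the falsification threshold named above, and the variable names are illustrative.

```python
from statistics import mean, stdev

# Output bytes per task, copied from the Experiment 01 run table.
OUTPUT_BYTES = {
    "T_A": [1525, 3332, 2110],
    "T_B": [2501, 3262, 2952],
    "T_C": [1322, 1291, 2649],
}
FALSIFICATION_CV = 0.40  # threshold named in the post


def coefficient_of_variation(values: list[float]) -> float:
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


cv = {task: coefficient_of_variation(values) for task, values in OUTPUT_BYTES.items()}
flagged = [task for task, value in cv.items() if value > FALSIFICATION_CV]

for task, value in cv.items():
    print(f"{task}: CV = {value:.1%}")
```

T_A lands just under the threshold at 39.7%, so only T_C is flagged.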

Evaluator ceiling detected. All 18 runs across experiments 01 and 02 returned wiggum_rounds = 1 with score_r1 = 9.0. The revision loop — the mechanism that should surface and fix quality gaps — was completely dormant. The root cause: glm4:9b assigns 9/10 to any output that is structurally complete and topically correct, regardless of depth or specificity. Raising the pass threshold from 8 to 9 in experiment 02 had zero effect because the evaluator's score distribution never moved. Threshold changes are a no-op if the evaluator never scores below the new threshold.

Experiment 02: Harness Upgrade Impact Study

Question: Does adding count constraint enforcement, a raised pass threshold, and task-type-specific evaluator criteria improve consistency?

Same CRD. Three changes under test: (1) harness-side count check re-synthesizes if item count is wrong, (2) pass threshold raised to 9, (3) per-task-type scoring criteria injected (enumerated / best_practices / research).
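A harness-side count check could be as simple as the sketch below. It assumes enumerated outputs use numbered lines such as "1." or "2)", which the post does not specify, and it takes the re-synthesis step as a hypothetical callable.

```python
import re
from typing import Callable

# Assumption: items appear as numbered lines ("1. ..." or "1) ...").
ITEM_PATTERN = re.compile(r"^[ \t]*\d+[.)][ \t]+", re.MULTILINE)


def count_items(text: str) -> int:
    """Count numbered list items in a draft."""
    return len(ITEM_PATTERN.findall(text))


def enforce_count(
    draft: str,
    expected: int,
    resynthesize: Callable[[str, int], str],
    max_retries: int = 1,
) -> tuple[str, int]:
    """Re-synthesize while the item count is wrong; return (draft, retries used)."""
    retries = 0
    while count_items(draft) != expected and retries < max_retries:
        draft = resynthesize(draft, expected)
        retries += 1
    return draft, retries


# Stub re-synthesis for the demo: keep only the first `expected` lines.
def keep_first_n(text: str, expected: int) -> str:
    return "\n".join(text.splitlines()[:expected])


over_delivered = "1. Context loss\n2. Tool misuse\n3. Looping\n4. Hallucinated output\n"
fixed, retries_used = enforce_count(over_delivered, 3, keep_first_n)
_, retries_when_correct = enforce_count(fixed, 3, keep_first_n)
```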

Results: 9/9 PASS again. Zero count_check_retry events fired (the count was right on every first synthesis), and task-type routing was correct on all 9 runs. CV improved for the constrained tasks: T_A 39.7% → 22.8%, T_C 44.2% → 32.4%. T_B remained the most stable.

But the central hypothesis — that raising the threshold would surface runs requiring revision — was definitively falsified. The evaluator scored 9/10 on every first pass, identical to experiment 01. The harness improvements had measurable effects on consistency but zero effect on the metric that mattered: revision loop activation. The ceiling was an evaluator problem, not a threshold or criteria problem.

Design implication: when every run scores the same, the metric has become uninformative. A uniformly high first-pass score has three possible explanations: quality is genuinely high, the evaluator is too lenient, or the pass threshold is too low. Distinguish among them by using a stricter evaluator before concluding the harness has no room to improve. Experiment 03 was designed to test exactly this.
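One way to catch an uninformative metric early is a small diagnostic over the logged scores. The sketch below is an assumption about how such a check could look, not part of the harness; it uses the uniform 9.0 scores from experiments 01 and 02 to show why moving the threshold from 8 to 9 changed nothing.

```python
from statistics import pstdev


def diagnose_scores(first_pass_scores: list[float], rounds: list[int], threshold: float) -> dict:
    """Summarize whether first-pass scores carry any information."""
    spread = pstdev(first_pass_scores)
    revised = sum(1 for r in rounds if r > 1)
    below = sum(1 for s in first_pass_scores if s < threshold)  # pass rule is score >= threshold
    return {
        "score_std": spread,
        "runs_revised": revised,
        "runs_below_threshold": below,
        "informative": spread > 0 or revised > 0,
    }


# Nine runs, all scored 9.0 on the first pass with a single round.
scores = [9.0] * 9
rounds = [1] * 9

at_threshold_8 = diagnose_scores(scores, rounds, threshold=8.0)
at_threshold_9 = diagnose_scores(scores, rounds, threshold=9.0)
```

Both calls report zero runs below threshold, which is the no-op the experiment observed.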

Experiment 03: Evaluator Upgrade

Question: Does replacing glm4:9b with Qwen3-Coder:30b produce genuine score variance and activate the revision loop?

Same CRD and task set. Evaluator changed to Qwen3-Coder:30b (30B parameters, roughly 3× the parameter count of glm4:9b). Evaluator prompt updated with calibration anchors and a rule requiring named issues for any dimension scored ≤8. Pass threshold: 8.0.

Run	Task	Score r1	Score final	Rounds	Gain	Final
1	T_C	7.0	8.8	2	+1.8	PASS
2	T_A	7.0	6.9	3	−0.1	FAIL
3	T_C	7.0	7.0	3	0.0	FAIL
4	T_B	6.0	7.2	3	+1.2	FAIL
5	T_A	7.0	6.8	3	−0.2	FAIL
6	T_B	8.1	8.1	1	0.0	PASS
7	T_C	7.0	8.1	2	+1.1	PASS
8	T_B	7.5	8.2	2	+0.7	PASS
9	T_A	7.0	7.0	3	0.0	FAIL

Overall: 4/9 PASS. T_A: 0/3. The evaluator upgrade worked exactly as intended — first-pass score std rose from effectively 0 to 0.55, mean dropped from 9.0 to 7.07, revision loop activated in 8/9 runs. The unexpected result was what the revision loop revealed: qwen2.5:7b could not respond to depth feedback on T_A.
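The summary figures above follow directly from the run table. A short sketch that recomputes them, with the values copied from the table and the sample standard deviation assumed:

```python
from statistics import mean, stdev

# Experiment 03, runs 1-9, copied from the results table.
SCORE_R1 = [7.0, 7.0, 7.0, 6.0, 7.0, 8.1, 7.0, 7.5, 7.0]
SCORE_FINAL = [8.8, 6.9, 7.0, 7.2, 6.8, 8.1, 8.1, 8.2, 7.0]
ROUNDS = [2, 3, 3, 3, 3, 1, 2, 2, 3]
PASS_THRESHOLD = 8.0

mean_r1 = mean(SCORE_R1)
std_r1 = stdev(SCORE_R1)  # sample standard deviation
passes = sum(1 for score in SCORE_FINAL if score >= PASS_THRESHOLD)
revised = sum(1 for r in ROUNDS if r > 1)

print(f"mean r1 = {mean_r1:.2f}, std r1 = {std_r1:.2f}")
print(f"pass = {passes}/{len(SCORE_FINAL)}, revision loop active in {revised}/{len(ROUNDS)}")
```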

Two distinct failure modes appeared in T_A revision:

Mode	Pattern	Mechanism
Regression	7.0 → 6.9 (run 2), 7.0 → 6.8 (run 5)	Producer rewrites sections and removes content while trying to add depth; the result is shorter and shallower
Stagnation	7.0 → 7.0 (run 9; run 3 on T_C shows the same pattern)	Producer edits surface wording without addressing the underlying depth gap the evaluator identified

Producer ceiling exposed. The wiggum loop design was correct; the evaluator was now calibrated. The bottleneck was the producer. qwen2.5:7b can produce output that scores ≥8.0 on open-ended tasks (T_B) and short enumerated tasks (T_C) when depth per item is achievable. For T_A — five items, each requiring a concrete implementation note — the ceiling was ~7.0 regardless of revision. The evaluator correctly identified missing depth per item, but the producer could not add it. Depth and specificity dimensions both scored 6.0 on every T_A first pass, and revision did not move them.
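The two failure modes can be labeled mechanically from the first-pass and final scores. The classifier below is a sketch with hypothetical names; the rows are copied from the Experiment 03 table.

```python
# (run, task, score_r1, score_final, rounds) from the Experiment 03 table.
RUNS = [
    (1, "T_C", 7.0, 8.8, 2), (2, "T_A", 7.0, 6.9, 3), (3, "T_C", 7.0, 7.0, 3),
    (4, "T_B", 6.0, 7.2, 3), (5, "T_A", 7.0, 6.8, 3), (6, "T_B", 8.1, 8.1, 1),
    (7, "T_C", 7.0, 8.1, 2), (8, "T_B", 7.5, 8.2, 2), (9, "T_A", 7.0, 7.0, 3),
]
EPS = 1e-9  # tolerance for float comparison


def classify(score_r1: float, score_final: float, rounds: int) -> str:
    """Label what the revision loop did to a run."""
    if rounds == 1:
        return "no_revision"
    gain = score_final - score_r1
    if gain < -EPS:
        return "regression"
    if gain <= EPS:
        return "stagnation"
    return "improvement"


outcomes = {run: classify(r1, final, rounds) for run, _task, r1, final, rounds in RUNS}
t_a_outcomes = {run: outcomes[run] for run, task, *_ in RUNS if task == "T_A"}
print(t_a_outcomes)
```

Note that "improvement" says nothing about passing: run 4 gained 1.2 points and still finished below the threshold.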

Experiment 04: Producer Upgrade

Question: Does replacing qwen2.5:7b with qwen2.5:32b Q4_K_M break the T_A ceiling?

Same task set, same evaluator, same threshold. Producer changed to the 32B-parameter model (approximately 20 GB at Q4_K_M quantization). The experiment produced 16 run records rather than 9 because of a MarkItDown URL enrichment integration mid-run, so replications are unbalanced: 4 for T_A, 7 for T_B, and 5 for T_C.

Task	Score r1 mean	Rounds mean	Pass rate	Depth r1	Specificity r1	Bytes mean
T_A	8.00 ±0.74	1.25	4/4 (100%)	7.2	7.0	2,293
T_B	6.97 ±0.53	2.71	4/7 (57%)	6.1	5.9	2,198
T_C	7.54 ±0.80	2.00	4/5 (80%)	6.8	6.4	1,491

Overall: 12/16 PASS (75%), up from 4/9 (44%) in experiment 03. All five hypotheses were confirmed; three results stand out:

T_A ceiling broken. 4/4 PASS. score_r1 improved from 7.0 ±0.00 to 8.00 ±0.74. Depth +1.2, specificity +1.0 vs experiment 03. Revision rounds: 3.0 → 1.25.
Zero revision regressions. Experiment 03 had two regression events (7.0 → 6.9 and 7.0 → 6.8). The 32B model consistently improves on evaluator feedback rather than degrading. This matters more than first-pass score: a reliable revision loop means the ceiling is now set by the synthesis instruction and evaluator, not by the producer's ability to respond to feedback.
Overall first-pass quality improved. Mean score_r1 rose from 7.07 to 7.41.
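Because the replications are unbalanced, the overall first-pass mean is a run-weighted average of the per-task means. The sketch below recomputes it from the Experiment 04 table; since the per-task means are already rounded, the pooled value is approximate.

```python
# (task, runs, passes, mean score_r1) from the Experiment 04 table.
ROWS = [
    ("T_A", 4, 4, 8.00),
    ("T_B", 7, 4, 6.97),
    ("T_C", 5, 4, 7.54),
]

total_runs = sum(runs for _, runs, _, _ in ROWS)
total_passes = sum(passes for _, _, passes, _ in ROWS)
pooled_r1 = sum(runs * score for _, runs, _, score in ROWS) / total_runs
pass_rate = total_passes / total_runs

print(f"runs = {total_runs}, pass rate = {pass_rate:.0%}, pooled score_r1 = {pooled_r1:.2f}")
```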

Cross-Experiment Progression — T_A First-Pass Score and Pass Rate

T_A (top 5 enumerated) across all four experiments. The evaluator upgrade (exp-03) activated the revision loop but exposed the producer ceiling. The producer upgrade (exp-04) broke the ceiling. T_A first-pass score went 9.0 → 9.0 → 7.0 → 8.0 across the experiment series; the drop and the recovery track which component was the actual bottleneck at each stage.

T_B: The Remaining Bottleneck

The most counterintuitive finding from experiment 04 was T_B. The 32B producer generates shorter T_B output than the 7B model — 2,198 bytes vs 3,288 bytes, a 33% reduction. Depth and specificity on first pass are essentially unchanged: T_B depth_r1 = 6.1 (vs experiment 03's 6.0). The parameter upgrade that definitively solved T_A had no effect on T_B's depth dimension.

The dimension data makes the cause clear. T_A depth improved by +1.2 points; T_B depth improved by +0.1. A roughly 4.5× parameter increase brought essentially no depth improvement on the open-ended task. This points to a synthesis instruction problem, not a model capability problem. The SYNTH_INSTRUCTION doesn't push hard enough on depth for best-practices task types, and the 32B model complies faithfully with a weak instruction: being more capable, it follows the insufficiently demanding instruction more precisely.

The bottleneck has shifted three times. The dual-search change solved the research quality bottleneck. The evaluator upgrade unmasked the producer ceiling. The producer upgrade solved the producer ceiling — and unmasked the synthesis instruction as the remaining bottleneck. Each fix reveals the next constraint.

Bottleneck Chain: What Each Experiment Fixed and What It Revealed

Each intervention solved one constraint and made the next constraint visible. Harness improvement is a sequential bottleneck-elimination process.

The Autoresearch Loop

With T_B depth/specificity identified as the synthesis instruction bottleneck, the pipeline became self-targeting. autoresearch.py is an autonomous optimizer that mutates the SYNTH_INSTRUCTION string and runs controlled experiments to measure whether the mutation improved the composite score:

composite = 0.7 × mean_wiggum_r1 + 0.3 × criteria_rate × 10
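Written as a function, the composite might look like the sketch below. It assumes criteria_rate is a fraction between 0 and 1 (which the × 10 factor suggests, putting both terms on a 0–10 scale); the function name and the example inputs are illustrative.

```python
def composite_score(mean_wiggum_r1: float, criteria_rate: float) -> float:
    """0.7 x mean first-round score + 0.3 x criteria rate scaled to 0-10."""
    if not 0.0 <= criteria_rate <= 1.0:
        # Assumption: the rate is a fraction, not a percentage.
        raise ValueError("criteria_rate is expected to be a fraction in [0, 1]")
    return 0.7 * mean_wiggum_r1 + 0.3 * criteria_rate * 10


# Illustrative inputs: 0.7 * 8.5 = 5.95 and 0.3 * 0.9 * 10 = 2.70.
example = composite_score(mean_wiggum_r1=8.5, criteria_rate=0.9)
print(f"{example:.2f}")
```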

Each iteration: propose a modification to the synthesis instruction (using Qwen3-Coder:30b as proposer, optionally kimi-k2.5:cloud for structural diversity), run the eval suite, keep the commit if new_score − baseline > 0.1, otherwise revert via git reset HEAD~1 --soft. The only mutable scope is the instruction string between sentinel markers — the harness itself is off-limits.

Session 3 results illustrate what the optimizer discovers and discards:

Experiment	Change attempted	Score	Decision
6	Failure modes + detection/mitigation per strategy	8.350	DISCARD −0.233
7	"When NOT to use" + input boundaries framing	8.915	KEEP +0.332
8	Measurable success criteria / validation tests	8.915	DISCARD +0.000
9	Confidence ratings (High/Med/Low) per library	7.965	DISCARD −0.950
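The keep-or-discard rule can be replayed against this table. In the sketch below the starting baseline of 8.583 is not stated in the post; it is inferred from the table as score minus delta (8.350 + 0.233 and 8.915 − 0.332 agree), and the function name is hypothetical.

```python
MIN_GAIN = 0.1  # keep only if new_score - baseline > 0.1

# (experiment, composite score) from the Session 3 table.
PROPOSALS = [(6, 8.350), (7, 8.915), (8, 8.915), (9, 7.965)]


def replay(proposals: list[tuple[int, float]], baseline: float):
    """Apply the decision rule in order; a kept proposal becomes the new baseline."""
    log = []
    for experiment, score in proposals:
        delta = score - baseline
        decision = "KEEP" if delta > MIN_GAIN else "DISCARD"
        log.append((experiment, decision, round(delta, 3)))
        if decision == "KEEP":
            baseline = score
    return log, baseline


# Starting baseline inferred from the table, not stated in the post.
decisions, final_baseline = replay(PROPOSALS, baseline=8.583)
for experiment, decision, delta in decisions:
    print(experiment, decision, f"{delta:+.3f}")
```

Experiment 8 matches the new baseline exactly, so it is discarded even though it did no harm: the rule requires a gain above 0.1, not just a non-negative delta.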

The largest single negative signal: experiment 9's confidence ratings caused a −0.950 regression. The evaluator appears to read uncertainty markers as a lack of authority, so hedging actively hurts depth scores. A result like this is hard to predict from first principles; it only shows up when the pipeline is run and measured.

The largest single positive: experiment 7's "when NOT to use" framing (+0.332) was discovered by Kimi (a cloud-based proposer), which explored structurally different angles from the local proposer. Local sessions 1–2 converged on "add more code / implementation detail." Session 3's cloud proposer immediately tried constraint framing and boundary conditions — approaches the local proposer never reached.

The autoresearch loop encodes the full experimental methodology in software. It runs controlled experiments (each proposal is one experiment), measures a response variable (composite score), applies a decision rule (keep if delta > threshold), and maintains a git-auditable history of what worked. The runs.jsonl audit log described in Post 10 provides the data; the autoresearch loop is the systematic optimizer that acts on it.

Statistical Addendum: What Three Replications Can and Cannot Tell You

Three replications per task type is the practical floor for a CRD. It is insufficient for most inferential statistics. The experiments do not report Mann-Whitney U tests or confidence intervals on the per-task means because three samples produce unreliable estimates of those quantities.

What three replications can reliably detect: large effects. The T_A failure in experiment 03 (0/3 PASS with regression on 2/3 attempts) does not require a test to interpret. The 4/4 T_A success in experiment 04 following a single producer change does not require a test. The signal-to-noise ratio at these effect sizes is high enough that informal reasoning suffices.

The coefficient of variation (CV) analysis — std / mean per task per experiment — is the primary summary statistic used. It is dimensionless, interpretable, and does not require distributional assumptions. The 20% stability threshold used throughout the experiments is a practical engineering criterion, not a statistical one: at CV < 20%, the pipeline produces outputs consistent enough to serve as training data for the Data Flywheel (Post 10) without requiring per-sample human review.

Formal inference would need far more data. At 6–8 replications per condition, a two-sample comparison has reasonable power only for large effects (roughly 1.5 standard deviations or more); medium-sized effects need dozens of runs per condition. That cost is not warranted for iterative engineering decisions. The right tool is rapid pilot experiments followed by a single confirmatory run, which is precisely what the autoresearch loop implements.

Three-Model Architecture

The four experiments converged on a three-model architecture where each model occupies its correct niche:

Role	Model	Size	Rationale
Producer	qwen2.5:32b Q4_K_M	~20 GB	Breaks enumerated task ceiling; reliable revision without regression
Evaluator	Qwen3-Coder:30b	~18 GB	Different architecture family; scores critically; names specific issues
Planner / Compressor	glm4:9b	~5 GB	Fast; handles lightweight JSON generation and memory compression

The key constraint is that the evaluator must be more capable than the producer in the dimensions being evaluated — not necessarily larger in total parameters, but specifically more discriminating on depth and specificity. glm4:9b failed as evaluator not because it was small but because it was not calibrated to penalize shallow outputs. A smaller model with better calibration would work. The experiments happened to find a calibrated model that was also larger.

The harness patterns described throughout this series — dual-backend memory (Post 5), the Wiggum Loop (Post 6), dimensional rubrics (Post 7), the Data Flywheel (Post 10) — were all validated or refined against the runs.jsonl audit log. The experimental methodology is not separate from the harness; it is the mechanism by which the harness was built. Every pattern in Posts 1–11 has a row in a JSONL file that justified it.
