Experimental Methodology: Four Experiments, One Pipeline
A progression through four completely randomized designs that exposed the evaluator ceiling, the producer ceiling, and the bottleneck that neither model upgrade could fix — the synthesis instruction itself.
The patterns described across this series were not designed in advance. They were derived from a controlled empirical process: observe a failure, diagnose its root cause, apply a targeted intervention, measure whether behavior changed. This post documents that process — the experimental designs, the data, and what each experiment forced the next one to test.
The framework uses Montgomery-style completely randomized designs (CRDs) — the same statistical methodology used in industrial process improvement. Each factor level (task type) receives a fixed number of replications, run order is randomized to prevent confounding from search cache drift or model warm-up effects, and response variables are specified before any run begins. Hypotheses are written before data is collected so that "interesting" post-hoc patterns don't masquerade as confirmations.
Posts 12 & 14 — May 23, 2026
The Measurement Infrastructure
Before the first formal experiment, the pipeline needed an audit trail. Every run appends one JSON record to runs.jsonl:
    {
      "task": "top 5 context engineering techniques...",
      "producer_model": "pi-qwen",
      "evaluator_model": "glm4:9b",
      "total_search_chars": 3125,
      "output_bytes": 1525,
      "output_lines": 20,
      "wiggum_scores": [9.0],
      "wiggum_dims": {"relevance": 9, "completeness": 9, "depth": 8, "specificity": 8, "structure": 10},
      "wiggum_rounds": 1,
      "final": "PASS",
      "run_duration_s": 142.3,
      "input_tokens": 8240,
      "output_tokens": 1621
    }
The append-only format means every run is recoverable regardless of what happens downstream. analytics.py reads the log and computes cross-run summaries; inspect_run.py gives per-run forensics. This infrastructure made the regime shift visible before the formal experiments began.
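As a minimal sketch of the append-and-summarize pattern described above: the helper names `append_run`, `load_runs`, and `summarize` are hypothetical (the source does not show the internals of analytics.py), and only fields that appear in the sample record are used.

```python
import json
from pathlib import Path


def append_run(log_path: Path, record: dict) -> None:
    """Append one run as a single JSON line; earlier lines are never rewritten."""
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def load_runs(log_path: Path) -> list:
    """Read every non-empty line of the log back as a dict."""
    with log_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def summarize(runs: list) -> dict:
    """Cross-run summary over fields present in the sample record."""
    first_scores = [r["wiggum_scores"][0] for r in runs]
    return {
        "runs": len(runs),
        "pass_rate": sum(r["final"] == "PASS" for r in runs) / len(runs),
        "mean_first_score": sum(first_scores) / len(first_scores),
        "mean_rounds": sum(r["wiggum_rounds"] for r in runs) / len(runs),
    }
```

Because each run is one line, a partially written or corrupted record affects only that line, and the rest of the log can still be parsed.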
The Foundational Finding: Dual-Search Regime Shift
The first hypothesis the pipeline data tested was not experimental — it was observational. A comparison of six runs (three single-search, three dual-search) produced a finding stark enough to anchor every subsequent experiment:
| Metric | Single Search (n=3) | Dual Search (n=3) | Delta |
|---|---|---|---|
| Avg output bytes | 1,149 | 1,817 | +58% |
| Avg output lines | 10.7 | 31.0 | +190% |
| First Wiggum score | 7.7 | 9.0 | +1.3 |
| Avg Wiggum rounds | 1.3 | 1.0 | −0.3 |
Every metric improved with dual search. The prior correlation between search chars and output quality appeared strongly positive (r ≈ +0.9). Dual search was made the default: always run two queries before synthesizing, with a 1,800-char quality floor that triggers a fallback if merged results fall short.
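The dual-search default with its quality floor might look roughly like the sketch below. The `search` and `fallback` callables are placeholders, and the source does not say what the fallback actually does, so treat that branch as an assumption.

```python
from typing import Callable

QUALITY_FLOOR_CHARS = 1_800  # floor quoted in the text


def dual_search(
    primary_query: str,
    secondary_query: str,
    search: Callable[[str], str],
    fallback: Callable[[str], str],
) -> str:
    """Run two queries, merge the results, and fall back if the merge is too thin."""
    results = (search(primary_query), search(secondary_query))
    merged = "\n\n".join(text for text in results if text)
    if len(merged) < QUALITY_FLOOR_CHARS:
        # Hypothetical fallback: here it simply fetches extra material
        # for the primary query and appends it to what was found.
        extra = fallback(primary_query)
        merged = "\n\n".join(text for text in (merged, extra) if text)
    return merged
```

Passing the search backends in as callables keeps the floor logic testable without network access.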
The positive correlation between search chars and output quality held at regime scale — single-search vs dual-search runs are qualitatively different. Later experiments would show that within the dual-search regime (all runs 2,900–3,600 chars), search volume becomes a non-predictor (r = −0.577, noise). Leading indicators lose their predictive value once the floor they measured has been consistently cleared.
Single-search vs dual-search runs. Each bar is one run. The regime boundary (dashed) corresponds to enabling dual search by default. Score axis (right) shown as dots.
Experiment Design Framework
The four formal experiments used the same task set and CRD structure, enabling direct cross-experiment comparisons. The factor is task type at three levels:
| ID | Type | Count constraint | Domain |
|---|---|---|---|
| T_A | enumerated | Top 5 (explicit) | Context engineering techniques |
| T_B | best_practices | Open-ended | Cost envelope management |
| T_C | enumerated | Top 3 (explicit) | Agent failure modes |
Three replications per task type, randomized run order (independent permutation per experiment). The identical task prompts across all four experiments mean that differences in output metrics are attributable to harness changes, not task variation. Response variables are collected automatically from runs.jsonl.
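A randomized run order for this 3 × 3 design can be generated in a few lines. This is an illustrative sketch, not the author's script; the function name and the use of a recorded seed are assumptions.

```python
import random

TASKS = ["T_A", "T_B", "T_C"]
REPLICATIONS = 3


def crd_run_order(seed: int) -> list:
    """Return a shuffled run order in which each task appears REPLICATIONS times."""
    runs = [task for task in TASKS for _ in range(REPLICATIONS)]
    rng = random.Random(seed)  # seeded so the permutation can be logged and reproduced
    rng.shuffle(runs)
    return runs


# One independent permutation per experiment: use a different seed for each.
for experiment, seed in enumerate([101, 102, 103, 104], start=1):
    print(f"exp-{experiment:02d}:", crd_run_order(seed))
```

Recording the seed alongside the results lets a later reader confirm that the order was fixed before the runs began.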
Experiment 01: Pipeline Generalization Study
Question: Does the dual-search harness produce consistent quality across different task types and count constraints?
Nine runs, 3 × 3 CRD. Producer: qwen2.5:7b. Evaluator: glm4:9b. Pass threshold: score ≥ 8.
| Run | Task | Search chars | Output bytes | Score r1 | Rounds | Final |
|---|---|---|---|---|---|---|
| 1 | T_C | 3,174 | 1,322 | 9 | 1 | PASS |
| 2 | T_A | 3,125 | 1,525 | 9 | 1 | PASS |
| 3 | T_B | 3,201 | 2,501 | 9 | 1 | PASS |
| 4 | T_A | 3,031 | 3,332 | 9 | 1 | PASS |
| 5 | T_C | 3,577 | 1,291 | 9 | 1 | PASS |
| 6 | T_B | 3,154 | 3,262 | 9 | 1 | PASS |
| 7 | T_C | 3,165 | 2,649 | 9 | 1 | PASS |
| 8 | T_B | 2,952 | 2,952 | 9 | 1 | PASS |
| 9 | T_A | 3,484 | 2,110 | 9 | 1 | PASS |
Pass rate: 9/9. The Wiggum loop generalized across all task types without failure. But the per-task statistics told a less tidy story:
| Task | Bytes mean | Bytes std | CV | Score mean | Rounds mean |
|---|---|---|---|---|---|
| T_A | 2,322 | 922 | 39.7% | 9.0 | 1.0 |
| T_B | 2,905 | 383 | 13.2% | 9.0 | 1.0 |
| T_C | 1,754 | 775 | 44.2% | 9.0 | 1.0 |
T_C — the most explicitly constrained task ("top 3") — had the highest output variance at CV = 44.2%, exceeding the 40% falsification threshold. T_B, the open-ended task, was the most consistent at CV = 13.2%. Counterintuitive finding: explicit count constraints introduce brittleness. When the model over-delivers (Run 7 produced 7 failure modes for a "top 3" task), the evaluator passed it at 9/10 without enforcing the count rule. The Wiggum loop was correct that the output was good — but the task wasn't completed as specified.
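The per-task figures can be recomputed from the output-bytes column of the run table. The sketch below assumes the sample standard deviation (n − 1 denominator), which is the choice that reproduces the reported std and CV values.

```python
from statistics import mean, stdev

# Output bytes per task, copied from the Experiment 01 run table.
EXP01_OUTPUT_BYTES = {
    "T_A": [1525, 3332, 2110],
    "T_B": [2501, 3262, 2952],
    "T_C": [1322, 1291, 2649],
}
FALSIFICATION_THRESHOLD_PCT = 40.0


def coefficient_of_variation(values: list) -> float:
    """Sample standard deviation divided by the mean, as a percentage."""
    return 100 * stdev(values) / mean(values)


for task, sizes in EXP01_OUTPUT_BYTES.items():
    cv = coefficient_of_variation(sizes)
    verdict = "exceeds" if cv > FALSIFICATION_THRESHOLD_PCT else "within"
    print(f"{task}: mean={mean(sizes):.0f} std={stdev(sizes):.0f} "
          f"CV={cv:.1f}% ({verdict} threshold)")
```

With only three values per task, a single outlier such as Run 7 dominates the standard deviation, which is why T_C crosses the threshold.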
Evaluator ceiling detected. All 18 runs across experiments 01 and 02 returned wiggum_rounds = 1 with score_r1 = 9.0. The revision loop — the mechanism that should surface and fix quality gaps — was completely dormant. The root cause: glm4:9b assigns 9/10 to any output that is structurally complete and topically correct, regardless of depth or specificity. Raising the pass threshold from 8 to 9 in experiment 02 had zero effect because the evaluator's score distribution never moved. Threshold changes are a no-op if the evaluator never scores below the new threshold.
Experiment 02: Harness Upgrade Impact Study
Question: Does adding count constraint enforcement, a raised pass threshold, and task-type-specific evaluator criteria improve consistency?
Same CRD. Three changes under test: (1) harness-side count check re-synthesizes if item count is wrong, (2) pass threshold raised to 9, (3) per-task-type scoring criteria injected (enumerated / best_practices / research).
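A harness-side count check could be sketched as follows. The regexes and function names are assumptions for illustration: the sketch treats unindented numbered or bulleted lines as items and reads N from a "top N" phrase in the task prompt.

```python
import re
from typing import Optional

# Assumption: items are unindented lines starting with "1." / "1)" or "-" / "*".
ITEM_LINE = re.compile(r"^(?:\d+[.)]|[-*])\s+\S")


def count_items(output: str) -> int:
    """Count top-level list items in a synthesized answer."""
    return sum(1 for line in output.splitlines() if ITEM_LINE.match(line))


def expected_count(task: str) -> Optional[int]:
    """Extract N from a 'top N' task prompt; None means the task is open-ended."""
    match = re.search(r"\btop\s+(\d+)\b", task, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None


def needs_resynthesis(task: str, output: str) -> bool:
    """True when an explicit count constraint exists and the output violates it."""
    wanted = expected_count(task)
    return wanted is not None and count_items(output) != wanted
```

Under this check, a seven-item answer to a "top 3" prompt would be sent back for re-synthesis before the evaluator ever sees it.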
Results: 9/9 PASS again. Zero count_check_retry events (count was right on every first synthesis). All 9 correct task-type routing. CV improved for constrained tasks: T_A 39.7% → 22.8%, T_C 44.2% → 32.4%. T_B remained the most stable.
But the central hypothesis — that raising the threshold would surface runs requiring revision — was definitively falsified. The evaluator scored 9/10 on every first pass, identical to experiment 01. The harness improvements had measurable effects on consistency but zero effect on the metric that mattered: revision loop activation. The ceiling was an evaluator problem, not a threshold or criteria problem.
Design implication: when every run scores the same, the metric has become uninformative. A uniformly high first-pass score means either quality is genuinely high, the evaluator is too lenient, or the pass threshold is too low. Distinguish these by using a stricter evaluator before concluding the harness has no room to improve. Experiment 03 was designed exactly to test this.
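Both conditions in this design implication are cheap to check mechanically. In the sketch below the function names and the 0.1 spread cutoff are illustrative assumptions, not values from the experiments.

```python
from statistics import pstdev


def is_saturated(first_pass_scores: list, min_std: float = 0.1) -> bool:
    """Flag a score series whose spread is too small to tell runs apart."""
    return len(first_pass_scores) > 1 and pstdev(first_pass_scores) < min_std


def threshold_change_is_noop(first_pass_scores: list, old: float, new: float) -> bool:
    """Raising a 'score >= old' rule to 'score >= new' changes nothing
    unless at least one score falls in [old, new)."""
    return not any(old <= score < new for score in first_pass_scores)


exp01_scores = [9.0] * 9  # every first-pass score in Experiment 01
print(is_saturated(exp01_scores))                        # spread is zero
print(threshold_change_is_noop(exp01_scores, 8.0, 9.0))  # no score between 8 and 9
```

Running a check like this before changing a threshold shows in advance whether the change can affect any run.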
Experiment 03: Evaluator Upgrade
Question: Does replacing glm4:9b with Qwen3-Coder:30b produce genuine score variance and activate the revision loop?
Same CRD and task set. Evaluator changed to Qwen3-Coder:30b (30B parameters, roughly 3× the parameter count of glm4:9b). Evaluator prompt updated with calibration anchors and a rule requiring named issues for any dimension scored ≤8. Pass threshold: 8.0.
| Run | Task | Score r1 | Score final | Rounds | Gain | Final |
|---|---|---|---|---|---|---|
| 1 | T_C | 7.0 | 8.8 | 2 | +1.8 | PASS |
| 2 | T_A | 7.0 | 6.9 | 3 | −0.1 | FAIL |
| 3 | T_C | 7.0 | 7.0 | 3 | 0.0 | FAIL |
| 4 | T_B | 6.0 | 7.2 | 3 | +1.2 | FAIL |
| 5 | T_A | 7.0 | 6.8 | 3 | −0.2 | FAIL |
| 6 | T_B | 8.1 | 8.1 | 1 | 0.0 | PASS |
| 7 | T_C | 7.0 | 8.1 | 2 | +1.1 | PASS |
| 8 | T_B | 7.5 | 8.2 | 2 | +0.7 | PASS |
| 9 | T_A | 7.0 | 7.0 | 3 | 0.0 | FAIL |
Overall: 4/9 PASS. T_A: 0/3. The evaluator upgrade worked exactly as intended — first-pass score std rose from effectively 0 to 0.55, mean dropped from 9.0 to 7.07, revision loop activated in 8/9 runs. The unexpected result was what the revision loop revealed: qwen2.5:7b could not respond to depth feedback on T_A.
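The summary figures follow directly from the run table. As a quick check, assuming the reported spread is a sample standard deviation:

```python
from statistics import mean, stdev

# (score_r1, rounds) per run, copied from the Experiment 03 table.
EXP03 = [(7.0, 2), (7.0, 3), (7.0, 3), (6.0, 3), (7.0, 3),
         (8.1, 1), (7.0, 2), (7.5, 2), (7.0, 3)]

first_pass = [score for score, _ in EXP03]
revised = sum(1 for _, rounds in EXP03 if rounds > 1)

print(f"mean score_r1 = {mean(first_pass):.2f}")
print(f"sample std    = {stdev(first_pass):.2f}")
print(f"revision loop activated in {revised}/{len(EXP03)} runs")
```

Six of the nine first-pass scores are exactly 7.0, so most of the spread comes from runs 4, 6, and 8.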
Two distinct failure modes appeared in revision, mostly on T_A (runs 2, 5, and 9 are T_A; run 3 is T_C):
| Mode | Pattern | Mechanism |
|---|---|---|
| Regression | 7.0 → 6.9 (run 2), 7.0 → 6.8 (run 5) | Producer rewrites sections and removes content while trying to add depth; shorter, shallower result |
| Stagnation | 7.0 → 7.0 (runs 3, 9) | Producer edits surface wording without addressing the underlying depth gap the evaluator identified |
Producer ceiling exposed. The Wiggum loop design was correct; the evaluator was now calibrated. The bottleneck was the producer. qwen2.5:7b can produce output that scores ≥8.0 on open-ended tasks (T_B) and short enumerated tasks (T_C) when depth per item is achievable. For T_A (five items, each requiring a concrete implementation note) the ceiling was ~7.0 regardless of revision. The evaluator correctly identified missing depth per item, but the producer could not add it. Depth and specificity dimensions both scored 6.0 on every T_A first pass, and revision did not move them.
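The revision outcomes can be labeled mechanically from the first and final scores in the Experiment 03 table. The category names and the 0.05 tolerance below are illustrative choices, not part of the original harness.

```python
PASS_THRESHOLD = 8.0

# run -> (task, score_r1, score_final), copied from the Experiment 03 table.
EXP03_SCORES = {
    1: ("T_C", 7.0, 8.8), 2: ("T_A", 7.0, 6.9), 3: ("T_C", 7.0, 7.0),
    4: ("T_B", 6.0, 7.2), 5: ("T_A", 7.0, 6.8), 6: ("T_B", 8.1, 8.1),
    7: ("T_C", 7.0, 8.1), 8: ("T_B", 7.5, 8.2), 9: ("T_A", 7.0, 7.0),
}


def classify(score_r1: float, score_final: float, eps: float = 0.05) -> str:
    """Label what the revision loop did to a run."""
    if score_r1 >= PASS_THRESHOLD:
        return "passed_first_try"
    gain = score_final - score_r1
    if gain < -eps:
        return "regression"
    if abs(gain) <= eps:
        return "stagnation"
    return "recovered" if score_final >= PASS_THRESHOLD else "improved_but_failed"


outcomes = {run: classify(r1, final) for run, (_, r1, final) in EXP03_SCORES.items()}
for run, (task, _, _) in EXP03_SCORES.items():
    print(run, task, outcomes[run])
```

Grouping the labels by task makes the pattern visible at a glance: every T_A run ends in regression or stagnation, while T_B and T_C runs mostly recover.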
Experiment 04: Producer Upgrade
Question: Does replacing qwen2.5:7b with qwen2.5:32b Q4_K_M break the T_A ceiling?
Same task set, same evaluator, same threshold. Producer changed to the 32B parameter model (approximately 20GB at Q4_K_M quantization). The experiment logged 16 records rather than 9 because of a MarkItDown URL enrichment integration performed mid-run, so replications per task are unequal (4, 7, and 5).
| Task | Score r1 mean | Rounds mean | Pass rate | Depth r1 | Spc r1 | Bytes mean |
|---|---|---|---|---|---|---|
| T_A | 8.00 ±0.74 | 1.25 | 4/4 (100%) | 7.2 | 7.0 | 2,293 |
| T_B | 6.97 ±0.53 | 2.71 | 4/7 (57%) | 6.1 | 5.9 | 2,198 |
| T_C | 7.54 ±0.80 | 2.00 | 4/5 (80%) | 6.8 | 6.4 | 1,491 |
Overall: 12/16 PASS (75%). Experiment 03 was 4/9 (44%). All five hypotheses were confirmed, including:
- T_A ceiling broken. 4/4 PASS. score_r1 improved from 7.0 ±0.00 to 8.00 ±0.74. Depth +1.2, specificity +1.0 vs experiment 03. Revision rounds: 3.0 → 1.25.
- Zero revision regressions. Experiment 03 had two regression events (7.0 → 6.9 and 7.0 → 6.8). The 32B consistently improves on evaluator feedback rather than degrading. This matters more than first-pass score: a reliable revision loop means the ceiling is now set by the synthesis instruction and evaluator, not by the producer's ability to respond to feedback.
- Overall first-pass quality improved. Mean score_r1 rose from 7.07 to 7.41.
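Because Experiment 04 has unequal run counts per task, the overall figures have to be weighted by run count. A short check against the per-task table:

```python
# task -> (mean score_r1, runs, passes), copied from the Experiment 04 table.
EXP04 = {
    "T_A": (8.00, 4, 4),
    "T_B": (6.97, 7, 4),
    "T_C": (7.54, 5, 4),
}

total_runs = sum(runs for _, runs, _ in EXP04.values())
total_pass = sum(passes for _, _, passes in EXP04.values())

# Weight each task mean by its run count; a plain average of the three
# task means would over-represent T_A and under-represent T_B.
pooled_mean = sum(m * runs for m, runs, _ in EXP04.values()) / total_runs
naive_mean = sum(m for m, _, _ in EXP04.values()) / len(EXP04)

print(f"pass rate   = {total_pass}/{total_runs} = {total_pass / total_runs:.0%}")
print(f"pooled mean = {pooled_mean:.2f}")
print(f"naive mean  = {naive_mean:.2f}")
```

The pooled mean matches the reported overall first-pass score; the unweighted mean comes out higher because T_B, the weakest task, has the most runs.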
T_A (top 5 enumerated) across all four experiments. The evaluator upgrade (exp-03) activated the revision loop but exposed the producer ceiling. The producer upgrade (exp-04) broke the ceiling. T_A first-pass score went 9.0 → 9.0 → 7.0 → 8.0 across the experiment series — the drops and recovery track which component was the actual bottleneck at each stage.
T_B: The Remaining Bottleneck
The most counterintuitive finding from experiment 04 was T_B. The 32B producer generates shorter T_B output than the 7B model — 2,198 bytes vs 3,288 bytes, a 33% reduction. Depth and specificity on first pass are essentially unchanged: T_B depth_r1 = 6.1 (vs experiment 03's 6.0). The parameter upgrade that definitively solved T_A had no effect on T_B's depth dimension.
The dimension data makes the cause clear. T_A depth improvement was +1.2 points; T_B depth improvement was +0.1 points. A roughly 4.5× parameter increase brought essentially no depth improvement on open-ended tasks. This points to a synthesis instruction problem, not a model capability problem. The SYNTH_INSTRUCTION doesn't push hard enough on depth for best-practices task types, and the 32B complies faithfully with a weak instruction: it is more capable, so it follows the (insufficiently demanding) instruction more precisely.
The bottleneck has shifted three times. The dual-search change solved the research quality bottleneck. The evaluator upgrade unmasked the producer ceiling. The producer upgrade solved the producer ceiling and unmasked the synthesis instruction as the remaining bottleneck. Each intervention solved one constraint and made the next one visible: harness improvement is a sequential bottleneck-elimination process.
The Autoresearch Loop
With T_B depth/specificity identified as the synthesis instruction bottleneck, the pipeline became self-targeting. autoresearch.py is an autonomous optimizer that mutates the SYNTH_INSTRUCTION string and runs controlled experiments to measure whether the mutation improved the composite score:
composite = 0.7 × mean_wiggum_r1 + 0.3 × criteria_rate × 10
Each iteration: propose a modification to the synthesis instruction (using Qwen3-Coder:30b as proposer, optionally kimi-k2.5:cloud for structural diversity), run the eval suite, keep the commit if new_score − baseline > 0.1, otherwise revert via git reset HEAD~1 --soft. Note that a soft reset only moves HEAD and leaves the index and working tree untouched, so the instruction text itself still has to be restored before the next proposal. The only mutable scope is the instruction string between sentinel markers; the harness itself is off-limits.
Session 3 results illustrate what the optimizer discovers and discards:
| Experiment | Change attempted | Score | Decision |
|---|---|---|---|
| 6 | Failure modes + detection/mitigation per strategy | 8.350 | DISCARD −0.233 |
| 7 | "When NOT to use" + input boundaries framing | 8.915 | KEEP +0.332 |
| 8 | Measurable success criteria / validation tests | 8.915 | DISCARD +0.000 |
| 9 | Confidence ratings (High/Med/Low) per library | 7.965 | DISCARD −0.950 |
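The composite formula and the keep/discard rule can be replayed against the Session 3 rows. In this sketch the starting baseline of 8.583 is inferred from the table (8.350 + 0.233) rather than stated in the text, and promoting a kept score to the new baseline is an assumption that happens to be consistent with experiment 8's +0.000 delta.

```python
KEEP_MARGIN = 0.1


def composite(mean_wiggum_r1: float, criteria_rate: float) -> float:
    """0.7 x mean first-pass score (0-10) + 0.3 x criteria rate (0-1) scaled to 0-10."""
    return 0.7 * mean_wiggum_r1 + 0.3 * criteria_rate * 10


def replay(baseline: float, scores: list) -> list:
    """Apply 'keep if score - baseline > KEEP_MARGIN' to a sequence of scores."""
    decisions = []
    for score in scores:
        delta = score - baseline
        if delta > KEEP_MARGIN:
            decisions.append(("KEEP", round(delta, 3)))
            baseline = score  # assumed: a kept proposal becomes the new baseline
        else:
            decisions.append(("DISCARD", round(delta, 3)))
    return decisions


# Scores for experiments 6-9 from the Session 3 table.
for decision in replay(8.583, [8.350, 8.915, 8.915, 7.965]):
    print(decision)
```

Experiment 8 is discarded even though it ties the best score, because a zero delta does not clear the 0.1 margin. That margin is what keeps the loop from accumulating neutral changes.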
The largest single negative signal: experiment 9's confidence ratings caused a −0.950 regression. The evaluator appears to read uncertainty markers as lack of authority, so hedging actively hurts depth scores. This finding is hard to predict from first principles; it took running the pipeline and measuring it.
The largest single positive: experiment 7's "when NOT to use" framing (+0.332) was discovered by Kimi (a cloud-based proposer), which explored structurally different angles from the local proposer. Local sessions 1–2 converged on "add more code / implementation detail." Session 3's cloud proposer immediately tried constraint framing and boundary conditions — approaches the local proposer never reached.
The autoresearch loop encodes the full experimental methodology in software. It runs controlled experiments (each proposal is one experiment), measures a response variable (composite score), applies a decision rule (keep if delta > threshold), and maintains a git-auditable history of what worked. The runs.jsonl audit log described in Post 10 provides the data; the autoresearch loop is the systematic optimizer that acts on it.
Statistical Addendum: What Three Replications Can and Cannot Tell You
Three replications per task type is the practical floor for a CRD. It is insufficient for most inferential statistics. The experiments do not report Mann-Whitney U tests or confidence intervals on the per-task means because three samples produce unreliable estimates of those quantities.
What three replications can reliably detect: large effects. The T_A failure in experiment 03 (0/3 PASS with regression on 2/3 attempts) does not require a test to interpret. The 4/4 T_A success in experiment 04 following a single producer change does not require a test. The signal-to-noise ratio at these effect sizes is high enough that informal reasoning suffices.
The coefficient of variation (CV) analysis — std / mean per task per experiment — is the primary summary statistic used. It is dimensionless, interpretable, and does not require distributional assumptions. The 20% stability threshold used throughout the experiments is a practical engineering criterion, not a statistical one: at CV < 20%, the pipeline produces outputs consistent enough to serve as training data for the Data Flywheel (Post 10) without requiring per-sample human review.
Formal inference would require far more data. Even 6–8 replications per condition give reasonable power only for very large effects (differences of roughly 1.5 standard deviations or more); medium-sized effects need dozens of replications per condition. That cost is not warranted for iterative engineering decisions. The right tool is rapid pilot experiments followed by a single confirmatory run, which is precisely what the autoresearch loop implements.
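A rough sense of how replication counts scale with effect size comes from the standard normal-approximation formula for comparing two group means. This is a back-of-envelope sketch, not a calculation the experiments performed; the defaults (two-sided α = 0.05, 80% power) are conventional assumptions, and the approximation runs slightly optimistic at very small n.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size_d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate replications per condition for a two-sided, two-sample comparison.

    effect_size_d is the difference in means divided by the common standard deviation.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size_d) ** 2)


for d in (0.5, 0.8, 1.5, 2.0):
    print(f"d = {d}: about {n_per_group(d)} replications per condition")
```

The required count falls with the square of the effect size, which is why only very large shifts, such as T_A going from 0/3 to 4/4, are detectable at the replication counts used here.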
Three-Model Architecture
The four experiments converged on a three-model architecture where each model occupies its correct niche:
| Role | Model | Size | Rationale |
|---|---|---|---|
| Producer | qwen2.5:32b Q4_K_M | ~20 GB | Breaks enumerated task ceiling; reliable revision without regression |
| Evaluator | Qwen3-Coder:30b | ~18 GB | Different architecture family; scores critically; names specific issues |
| Planner / Compressor | glm4:9b | ~5 GB | Fast; handles lightweight JSON generation and memory compression |
The key constraint is that the evaluator must be more capable than the producer in the dimensions being evaluated — not necessarily larger in total parameters, but specifically more discriminating on depth and specificity. glm4:9b failed as evaluator not because it was small but because it was not calibrated to penalize shallow outputs. A smaller model with better calibration would work. The experiments happened to find a calibrated model that was also larger.
The harness patterns described throughout this series — dual-backend memory (Post 5), the Wiggum Loop (Post 6), dimensional rubrics (Post 7), the Data Flywheel (Post 10) — were all validated or refined against the runs.jsonl audit log. The experimental methodology is not separate from the harness; it is the mechanism by which the harness was built. Every pattern in Posts 1–11 has a row in a JSONL file that justified it.