May 28, 2026 • 14 min read • Agentic Harness Engineering Series

Building the Detectors: What Actually Shipped After 107 Experiments

The previous post proposed four convergence detectors to prevent the failure modes that caused 65 experiments to advance nowhere. Here is what actually got built — and how each detector diverged from its original design.


The autoresearch loop in harness/autoresearch.py now has all four convergence mechanisms in place. None of them looks quite like the design. The TF-IDF cosine similarity from Detector 1 was replaced with a compiled regex ban list. The Shannon entropy monitor from Detector 2 collapsed into the same ban list, served differently. The hard global-exit from Detector 4 was traded for a cloud model consultation. Detector 3 shipped intact but as a flag rather than an automatic trigger.

Along the way, two additions that were not in the original design turned out to matter more than any of the four detectors: a pre-proposal research phase that grounds the proposer in current literature before it writes anything, and a routing check that prevents proposals from targeting instructions the current task set doesn’t actually exercise.

This post walks through each detector, what was proposed, what shipped, and why the gap exists.

The Design vs. What Shipped

Detector	Proposed mechanism	Shipped as	Status
1 — Semantic Attractor Guard	TF-IDF cosine similarity ≥ 0.65 vs recent discards	Compiled regex ban list (`_PROSE_BAN_PATTERNS`) + `_validate_proposal()`	Changed
2 — Family Entropy Monitor	Shannon entropy over 10-experiment sliding window; auto-inject ban when entropy < threshold	Accumulated hard-ban section in `PROPOSE_PROMPT`; `_active_instruction_keys()` routing check	Changed
3 — Baseline Re-estimation	Automatic at K consecutive discards with `eval-n=3`	`--eval-n N` multi-sample averaging + `--reset-baseline` flag (user-triggered)	Shipped
4 — Global Convergence Exit	Hard stop when experiments > N_max with no advance in last M	Soft guard on consecutive propose failures (≥ 10); Kimi unblocking at ≥ 6 discards	Replaced

Detector 1: Regex over Cosine Similarity

The original design called for computing TF-IDF cosine similarity between each new proposal and the last N discards. A match above 0.65 would reject the proposal before spending an eval run. The intent was to catch the HOW-FIRST attractor mechanically, before 14 compute runs got wasted on variations of the same blocked approach.
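
For reference, the proposed check can be sketched in a few lines. This is a minimal illustration of the design as described, not code from harness/autoresearch.py: it uses a hand-rolled TF-IDF in pure Python instead of scikit-learn, and the function names (`max_similarity_to_discards`, `is_attractor_repeat`) and the tokenizer are assumptions made for the example. Only the 0.65 threshold comes from the design.

```python
import math
import re
from collections import Counter

SIMILARITY_THRESHOLD = 0.65  # threshold named in the original design


def _tokens(text: str) -> list[str]:
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def _tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Build smoothed TF-IDF vectors over a small corpus of strings."""
    tokenized = [_tokens(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def _cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def max_similarity_to_discards(proposal: str, recent_discards: list[str]) -> float:
    """Highest cosine similarity between the proposal and any recent discard."""
    if not recent_discards:
        return 0.0
    vectors = _tfidf_vectors([proposal, *recent_discards])
    return max(_cosine(vectors[0], v) for v in vectors[1:])


def is_attractor_repeat(proposal: str, recent_discards: list[str]) -> bool:
    """True when the proposal is too close to something already discarded."""
    return max_similarity_to_discards(proposal, recent_discards) >= SIMILARITY_THRESHOLD
```

Because TF-IDF is lexical, a check like this only generalizes as far as the proposal shares vocabulary with earlier discards.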

What shipped instead is a two-stage validation gate in _validate_proposal(). The first stage is a routing check (discussed below). The second is a compiled regex scan against _PROSE_BAN_PATTERNS — three patterns that match the structural signatures of the families that had most reliably produced attractor lock:

_PROSE_BAN_PATTERNS = [
    re.compile(r'\b(narrative|prose paragraph|flowing prose)\b', re.IGNORECASE),
    re.compile(r'\b(sequential|step[- ]by[- ]step|ordered steps)\b', re.IGNORECASE),
    re.compile(r'\b(logical[- ]chain|chain of reasoning|linear progression)\b', re.IGNORECASE),
]

Any proposed instruction matching one of these patterns is rejected before the eval suite runs. The validation gate also enforces the full hard-ban list: the proposer is shown the ban list in its prompt, but in case it tries anyway, _validate_proposal() re-checks independently and refuses.

Proposed

Similarity-based: catches novel phrasings of banned families if they are semantically close to previous discards, even if the surface text differs. Generalizes without enumeration. Requires scikit-learn at runtime.

Shipped

Pattern-based: catches exact surface-level mentions of banned terms. Fast, zero-dependency, and deterministic. Does not generalize to paraphrases of banned families that use different terminology.

The tradeoff is explicit. The regex approach misses a proposer that paraphrases a banned family without using any banned term: “tell it as a story with connective tissue between ideas” passes all three compiled patterns. The cosine approach would have caught it if any prior discard used similar language. The decision to ship the regex version was pragmatic: the proposer (Qwen3-Coder:30b) is strong enough that, once the ban list is in the prompt, it rarely tries the exact banned phrasing. The rejection gate exists as a backstop, not as the primary defense. The primary defense is the prompt-level ban list accumulated over 107 experiments.

Implementation note: A secondary guard not in the original design is also present in _validate_proposal(): the routing check. If the proposer targets SYNTH_INSTRUCTION_PROSE but the current task set does not exercise that instruction (determined by _active_instruction_keys()), the proposal is rejected regardless of content. This prevents the common failure where the proposer correctly identifies an improvement but proposes it for an instruction path that the running tasks never invoke.
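
Put together, the two stages could look roughly like the sketch below. The patterns are copied from the listing above; the function name `validate_proposal_sketch`, its signature, and the `(ok, reason)` return shape are assumptions for illustration, since the post does not show the real `_validate_proposal()` body.

```python
import re

# Patterns as listed in the post.
_PROSE_BAN_PATTERNS = [
    re.compile(r'\b(narrative|prose paragraph|flowing prose)\b', re.IGNORECASE),
    re.compile(r'\b(sequential|step[- ]by[- ]step|ordered steps)\b', re.IGNORECASE),
    re.compile(r'\b(logical[- ]chain|chain of reasoning|linear progression)\b', re.IGNORECASE),
]


def validate_proposal_sketch(
    target_key: str,
    instruction_text: str,
    active_keys: set[str],
) -> tuple[bool, str]:
    """Hypothetical two-stage gate: routing check first, then the regex scan."""
    # Stage 1: reject proposals aimed at a slot the current tasks never exercise.
    if target_key not in active_keys:
        return False, f"routing: {target_key} is not exercised by the current task set"
    # Stage 2: reject any instruction that mentions a banned surface pattern.
    for pattern in _PROSE_BAN_PATTERNS:
        match = pattern.search(instruction_text)
        if match:
            return False, f"banned phrase: {match.group(0)!r}"
    return True, "ok"
```

Running the routing check first means a misrouted proposal is refused without its text being scanned at all, which matches the "regardless of content" behavior described above.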

Detector 2: Accumulated Ban over Entropy Computation

The entropy monitor design computed Shannon entropy over a sliding window of family labels, then auto-injected a ban when one family dominated. The idea was to let the ban list grow organically from the loop’s own observed behavior rather than requiring manual curation.
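
A rough sketch of that proposed monitor follows. The 10-experiment window comes from the design table; the 1.0-bit threshold, the function names, and the family labels used in the usage note are placeholders, because the post does not give the intended values.

```python
import math
from collections import Counter, deque

WINDOW = 10              # sliding-window size from the design table
ENTROPY_THRESHOLD = 1.0  # bits; placeholder value, not from the post


def shannon_entropy(labels) -> float:
    """Shannon entropy (in bits) of the empirical label distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def family_to_ban(window: deque, threshold: float = ENTROPY_THRESHOLD):
    """Return the dominant family once the window is full and entropy is low.

    Returns None while the window is still filling or while the proposer
    is spreading its attempts across enough distinct families.
    """
    if len(window) < WINDOW:
        return None
    if shannon_entropy(window) >= threshold:
        return None
    return Counter(window).most_common(1)[0][0]
```

In a loop, each experiment's family tag would be appended to a `deque(maxlen=WINDOW)`, and a non-None result would add that family to the ban list without human review.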

What shipped is a prompt-level hard-ban section in PROPOSE_PROMPT that serves the same purpose through a different mechanism. Instead of computing entropy, the ban list is maintained as a manually-verified accumulation of families that produced consecutive discards. It is injected into every proposal request. At 107 experiments in, it covers at least seven major families:

NARRATIVE / flowing prose (from experiment 25 attractor)
SEQUENTIAL / step-by-step enumeration (from HOW-FIRST attractor, exps 55–68)
LOGICAL-CHAIN / linear progression (adjacent to sequential)
COMPARISON format (A-vs-B structure)
HOW-FIRST family and all five named variants
NAMED-SYSTEMS / FAILURE-MODES / LIFECYCLE-COVERAGE (from grounded-targeting phase, exps 70–93)
PROBLEM→PRACTICE→CAVEAT three-part structure (from recent experiment block)

The proposer sees this list in full before generating any proposal. The _validate_proposal() regex check then provides a mechanical backstop for the most common families.

Proposed

Automated: entropy falls below threshold → ban injected without human review. Self-updating with each experiment. Requires labeling each proposal with a coarse family tag.

Shipped

Semi-automated: the hard-ban section grows manually when a pattern of consecutive discards is identified. Human decision required to add a new family. The ban text is visible in the prompt, not hidden in a computation.

The practical gap between these approaches has been smaller than expected. The proposer is strong enough that each new experiment tends to explore a genuinely novel angle rather than re-entering a known attractor. The entropy collapse that the original design was guarding against has not recurred since the ban list reached its current size. Whether this is because the ban list is comprehensive, or because the remaining search space has not yet exhausted its diversity, will become clear as the experiment count grows.

The _active_instruction_keys() routing check, new in the shipped version, adds a complementary guard that the entropy monitor design did not include. It determines which of the three instruction slots (SYNTH_INSTRUCTION, SYNTH_INSTRUCTION_COUNT, SYNTH_INSTRUCTION_PROSE) are actually exercised by the current task set, and constrains the proposer to target only those slots. Without it, the proposer can cycle through all three slots without restriction, producing surface novelty that doesn’t translate to score movement.
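
One way to picture the slot computation is as a lookup from task IDs to the slots each task exercises. The sketch below assumes such a mapping exists; the `task_slots` argument, the function name, and the example mapping in the usage note are hypothetical, and the real `_active_instruction_keys()` may derive this differently.

```python
INSTRUCTION_SLOTS = (
    "SYNTH_INSTRUCTION",
    "SYNTH_INSTRUCTION_COUNT",
    "SYNTH_INSTRUCTION_PROSE",
)


def active_instruction_keys_sketch(
    task_ids: list[str],
    task_slots: dict[str, set[str]],
) -> set[str]:
    """Union of the instruction slots exercised by the given tasks.

    task_slots is an assumed mapping from task id to the slots that task
    invokes; unknown task ids contribute nothing.
    """
    active: set[str] = set()
    for task_id in task_ids:
        active |= task_slots.get(task_id, set())
    # Ignore anything that is not one of the three known slots.
    return active & set(INSTRUCTION_SLOTS)
```

With a made-up mapping in which `T_B` uses `SYNTH_INSTRUCTION` and `T_W` uses `SYNTH_INSTRUCTION` and `SYNTH_INSTRUCTION_COUNT`, a run over those two tasks would leave `SYNTH_INSTRUCTION_PROSE` inactive, so any proposal targeting it would be refused.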

Detector 3: Baseline Re-estimation, Shipped as Intended

This detector was implemented closest to its original design. The core capability is run_eval(task_ids, n), which runs the eval suite n times over the given tasks and returns the mean composite score. Calling it with n=3 gives a lower-variance estimate of the instruction’s true expected value than a single point sample from a noisy distribution.

def run_eval(task_ids: list[str], n: int = 1) -> float:
    """Run eval suite n times and return mean composite score."""
    scores = []
    for i in range(n):
        score = _single_eval_run(task_ids)
        scores.append(score)
        if n > 1:
            print(f"  [eval {i+1}/{n}] composite={score:.3f}")
    return round(sum(scores) / len(scores), 4)

Two CLI flags expose this capability:

--eval-n N — sets the number of samples per eval step throughout the run. At --eval-n 3, each keep/discard decision is based on the mean of three runs rather than a single outcome.
--reset-baseline — forces a fresh baseline measurement using the current --eval-n before entering the proposal loop. This is the practical resolution to the contamination failure: re-run the baseline instruction with multi-sample averaging before starting, regardless of what the stored single-eval baseline says.

The 8.740 baseline established in experiment 25 was re-estimated at 8.530 with --eval-n 3 --reset-baseline. That 0.21-point correction removed the false ceiling that 65 experiments had been racing against. The loop now has a realistic target.

The one gap from the original design: re-estimation is not automatic at K consecutive discards. It requires a deliberate --reset-baseline invocation. The automatic trigger was deferred because instrumenting it required a reliable way to distinguish “the baseline is genuinely above every proposal” from “the baseline was contaminated and everything only looks below it.” The user-triggered version is the pragmatic first implementation; automatic triggering at 15 consecutive discards is a straightforward next step once the multi-sample baseline tooling has been exercised in production.
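
The simplest form of that automatic trigger sidesteps the question of why the loop is stuck and just re-measures once the discard streak gets long enough. The sketch below is not in the shipped loop; `maybe_reestimate_baseline` and the `measure_baseline` callable are hypothetical names, with the callable standing in for a call such as run_eval on the baseline instruction.

```python
from typing import Callable

REESTIMATE_AFTER = 15   # consecutive discards, the figure mentioned in the post
REESTIMATE_SAMPLES = 3  # matches --eval-n 3


def maybe_reestimate_baseline(
    baseline: float,
    consecutive_discards: int,
    measure_baseline: Callable[[int], float],
    threshold: int = REESTIMATE_AFTER,
    samples: int = REESTIMATE_SAMPLES,
) -> tuple[float, bool]:
    """Return (baseline, re_estimated).

    Below the threshold the stored baseline is returned untouched. At or
    above it, the baseline is re-measured with multi-sample averaging.
    """
    if consecutive_discards < threshold:
        return baseline, False
    return measure_baseline(samples), True
```

A caller would presumably also reset its discard counter when the second element is True, so that one long streak does not trigger a re-measurement on every subsequent iteration.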

Detector 4: The Exit That Became a Consultation

The original Detector 4 proposed a hard stop: if the loop had run more than N_max experiments with no advance in the last M, declare convergence and exit with a structured report. The reasoning was that an indefinitely-running optimizer with no advance signal is wasting compute and should surface its own failure rather than continuing silently.
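
As proposed, the exit rule is a small predicate over the run history. The sketch below assumes each experiment is recorded as "keep" or "discard" (the "keep" value appears in the loop's logs; "discard" is assumed), and the `ConvergenceReport` fields are invented for illustration since the post does not specify the report format. It also assumes `m` is at least 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConvergenceReport:
    """Hypothetical structured report emitted when the loop gives up."""
    experiments_run: int
    last_advance_exp: Optional[int]  # 1-based experiment number, None if never
    experiments_since_advance: int


def check_convergence(statuses: list[str], n_max: int, m: int) -> Optional[ConvergenceReport]:
    """Return a report when more than n_max experiments ran with no keep in the last m."""
    total = len(statuses)
    if total <= n_max:
        return None
    if "keep" in statuses[-m:]:
        return None
    last_advance = max(
        (i + 1 for i, status in enumerate(statuses) if status == "keep"),
        default=None,
    )
    since = total - last_advance if last_advance is not None else total
    return ConvergenceReport(total, last_advance, since)
```

The loop would break as soon as this returns a report, leaving the decision about what to try next to a human or an outer loop.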

What shipped is philosophically different. The loop does have a soft abort guard:

if consecutive_propose_failures >= 10:
    print("[autoresearch] abort: too many consecutive propose failures")
    break

But this fires on propose failures — cases where the proposer model itself fails to generate a parseable output — not on consecutive discards. It is an error-handling guard, not a convergence exit. The loop will still run indefinitely past a converged score distribution as long as the proposer keeps generating syntactically valid (but repeatedly discarded) proposals.

The practical replacement for the global exit is the Kimi unblocking mechanism:

KIMI_STUCK_THRESHOLD = 6  # consecutive discards before consulting cloud model

def get_kimi_unblock_suggestion(
    current_instruction: str,
    recent_discards: list[str],
    baseline: float,
    ban_list: list[str],
) -> str:
    """Consult kimi-k2.5:cloud for a fresh direction when local model is stuck."""
    ...

When consecutive_discards ≥ KIMI_STUCK_THRESHOLD, the loop pauses its local proposal loop and calls get_kimi_unblock_suggestion(). This passes the current instruction, the last six discarded proposals, the current baseline, and the full ban list to kimi-k2.5:cloud — a cloud model with a different prior on instruction design than the local Qwen3-Coder:30b proposer. The suggestion is returned as a candidate instruction that the local proposer can then refine or build on.

Proposed

Convergence exit: stop the loop, emit a structured report, and require a human or outer loop to decide what to do next. The loop terminates cleanly with full diagnosis.

Shipped

Unblocking consultation: pause the local loop at stuck threshold, consult an external model with a different prior, then continue. The loop does not stop — it gets injected with a fresh direction.

This is a substantively different design philosophy. The original exit treated convergence as a terminal state requiring a human decision. The shipped unblocking treats it as a local minimum that can be escaped by querying an oracle with a broader search space. The oracle (Kimi) is a cloud model with exposure to different instruction patterns than the locally run proposer, so its suggestions are more likely to be novel relative to the local search history than variations on the families the proposer has already exhausted.

The tradeoff is cost and privacy: each Kimi consultation sends the current instruction and recent discard history to an external endpoint. At six discarded proposals per stuck event and the current experiment cadence, the cost is negligible. But the loop no longer exits cleanly when stuck. It keeps running, potentially cycling through Kimi suggestions that also fail to advance the baseline. A Kimi suggestion that generates its own consecutive discards does not re-trigger the unblocking threshold until KIMI_STUCK_THRESHOLD new discards accumulate, which can extend a non-advancing run significantly.
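
The re-trigger behavior is easier to see as a counter trace. The sketch below assumes the stuck counter resets both on an advance and after each consultation, which is one reading of the behavior described here; `consultation_points` is a hypothetical helper, not a function in the loop.

```python
KIMI_STUCK_THRESHOLD = 6  # value from the post


def consultation_points(statuses: list[str], threshold: int = KIMI_STUCK_THRESHOLD) -> list[int]:
    """1-based experiment numbers after which a consultation would fire.

    Assumes the stuck counter resets on a keep and again after every
    consultation, so a fresh run of `threshold` discards is needed each time.
    """
    points = []
    stuck = 0
    for exp, status in enumerate(statuses, start=1):
        if status == "keep":
            stuck = 0
            continue
        stuck += 1
        if stuck >= threshold:
            points.append(exp)
            stuck = 0
    return points
```

Under that assumption, thirteen straight discards produce consultations after experiments 6 and 12 only, and nothing in the trace ever stops the loop.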

Finding: Replacing a convergence exit with an unblocking oracle converts a “stop and report” failure mode into a “run longer, possibly escape” one. Whether this is better depends on whether the oracle’s suggestions provide genuine exploration or merely defer the same convergence. After 107 experiments, Kimi unblocking has been invoked twice; in one case it provided an advance, in the other it did not. The sample size is too small to generalize.

Two New Additions Not in the Original Design

The original post was a diagnosis. Writing the fixes introduced two mechanisms that had no equivalent in the four proposed detectors.

New Addition 1

Pre-Proposal Research Context (`gather_proposal_context()`)

Before generating each proposal, the loop now runs a short web research phase: a DuckDuckGo search for recent work on instruction design and synthesis quality, followed by MarkItDown extraction of the most relevant results. The output is passed to the proposer as a research context block, alongside the experiment history and ban list.

The motivation was a specific failure pattern: the proposer was generating proposals based entirely on the history of what had and hadn’t worked in the current experiment set — an increasingly narrow prior as the experiment count grew. By injecting current literature, the proposer has access to instruction strategies that have not yet appeared in the experiment history and are therefore not banned.

def gather_proposal_context(task: str) -> str:
    """Run DuckDuckGo search + MarkItDown extraction before each proposal."""
    query = f"LLM synthesis instruction design best practices {task[:60]}"
    results = _ddg_search(query, max_results=3)
    passages = [_markitdown_extract(r["url"]) for r in results if r.get("url")]
    return "\n\n---\n\n".join(p for p in passages if p)[:4000]

This runs on every proposal iteration when --mode auto or --mode explore is active, and is suppressed in --mode exploit. The explore/exploit distinction matters: in early exploration, fresh context is valuable; in late exploitation when the proposer is refining a near-optimal instruction, the overhead of a web search adds latency without proportional benefit.

New Addition 2

Explore / Exploit Mode Control (`--mode`)

Three modes control the loop’s behavior:

--mode explore: always re-gather research context before each proposal; higher prior on novel directions over refinement
--mode exploit: skip research phase; proposer works from experiment history alone; prioritizes refinement of a known-good approach
--mode auto (default): switches to explore behavior after PLATEAU_DISCARDS=3 consecutive discards, then resets to exploit on the next advance

PLATEAU_DISCARDS = 3
PLATEAU_DELTA    = 0.05

def _should_explore(consecutive_discards: int, mode: str) -> bool:
    if mode == "explore":
        return True
    if mode == "exploit":
        return False
    # auto: explore after plateau
    return consecutive_discards >= PLATEAU_DISCARDS

In practice, auto mode means the loop runs lean (no web research overhead) when it is making progress and switches to a broader search when it stalls. The PLATEAU_DELTA=0.05 constant defines the minimum composite improvement to count as “progress” when resetting the discard counter.
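
The interaction between the two constants can be traced with a small helper. This is a sketch of the counter update as described, with hypothetical function names; the real loop may track the counter differently.

```python
PLATEAU_DISCARDS = 3
PLATEAU_DELTA = 0.05


def update_consecutive_discards(consecutive_discards: int, delta: float) -> int:
    """Reset the counter on real progress, otherwise extend the streak."""
    return 0 if delta >= PLATEAU_DELTA else consecutive_discards + 1


def auto_mode_trace(deltas: list[float]) -> list[str]:
    """Behavior auto mode would choose for the proposal after each result."""
    trace = []
    streak = 0
    for delta in deltas:
        streak = update_consecutive_discards(streak, delta)
        trace.append("explore" if streak >= PLATEAU_DISCARDS else "exploit")
    return trace
```

For example, three results below the 0.05 bar followed by a 0.11 improvement would yield exploit, exploit, explore, exploit: the web research phase switches on for exactly one proposal and off again after the advance.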

Observability: Rich JSONL Logging with Thinking CoT

A structural improvement not captured in the four detector descriptions is the addition of rich JSONL logging. The original loop wrote to autoresearch.tsv — a flat file with experiment number, composite score, and a description field. The shipped version writes in parallel to data/autoresearch.jsonl:

# Each record in data/autoresearch.jsonl
{
  "exp":         43,
  "status":      "keep",
  "composite":   8.640,
  "delta":      +0.110,
  "description": "Changed SYNTH_INSTRUCTION to require...",
  "instruction": "...",
  "thinking":    "<thinking>...</thinking>",
  "dims":        {"depth": 8, "grounded": 8, "specificity": 9, ...},
  "task_ids":    ["T_B", "T_W"],
  "eval_n":      1,
  "mode":        "auto",
  "kimi_used":   false,
  "ts":          "2026-05-15T..."
}

The thinking field captures the proposer’s full extended-thinking chain-of-thought when the model supports it. This turns out to be the most useful debugging artifact: the thinking trace shows which constraints the proposer weighted most heavily, which directions it considered and rejected, and where its reasoning diverged from experiment evidence. The TSV file is still written for quick inspection; the JSONL file is the artifact that supports retrospective analysis.
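
As an example of the retrospective analysis this enables, the sketch below computes the keep rate per mode. It assumes one JSON object per line and uses only the `status` and `mode` fields shown in the record above; the helper names are made up for the example.

```python
import json
from typing import Iterable


def parse_jsonl(lines: Iterable[str]) -> list[dict]:
    """Parse one JSON object per non-empty line."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


def keep_rate_by_mode(records: list[dict]) -> dict[str, float]:
    """Fraction of experiments with status 'keep', grouped by mode."""
    totals: dict[str, int] = {}
    keeps: dict[str, int] = {}
    for rec in records:
        mode = rec.get("mode", "unknown")
        totals[mode] = totals.get(mode, 0) + 1
        if rec.get("status") == "keep":
            keeps[mode] = keeps.get(mode, 0) + 1
    return {mode: keeps.get(mode, 0) / count for mode, count in totals.items()}
```

Usage would be along the lines of `with open("data/autoresearch.jsonl", encoding="utf-8") as fh: rates = keep_rate_by_mode(parse_jsonl(fh))`. The same pattern extends to filtering on `kimi_used` or pulling the `thinking` field for a specific experiment.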

The eval subprocess also sets WIGGUM_PANEL=1 and RESEARCH_CACHE=1 in its environment. Panel evaluation mode enables the multi-persona scoring path in Wiggum for relevant tasks; research cache skips the web research phase for tasks that have already gathered context, which cuts per-experiment latency by 30–60% for cached tasks.

Where Things Stand

After 107 experiments, the accumulated ban list covers enough families that the proposer’s effective search space is substantially narrower than it was at experiment 1. This is both a feature and a risk. The ban list prevents re-entering known attractors; it also prevents re-entering approaches that failed to beat a now-corrected baseline but might have performed well against a truthful one.

The original post’s third finding still holds: “a ban list that encodes ‘this approach cannot beat baseline’ is not the same as one that encodes ‘this approach does not work.’” With the baseline now re-estimated at 8.530, some approaches that were banned after failing to reach 8.740 deserve a second look at the lower bar. The narrative-plus-word-count approach from experiment 25 — which was subsequently banned for failing to advance past the score it had set — is the clearest candidate.

The most consequential decision in the shipped implementation was the choice to replace the global convergence exit with the Kimi unblocking mechanism. The original design would have surfaced the loop’s convergence as a clean signal requiring a human decision. The shipped design injects an external suggestion and continues running. At the current experiment scale, the two choices are roughly equivalent in practice. At higher experiment counts, the shipped approach risks an asymmetric burn: the loop consults Kimi at threshold, continues running, Kimi suggestions also fail, and the loop runs another stretch with no mechanism to surface the fact that even external intervention didn’t help. Adding a secondary exit condition — “if Kimi unblocking fails to produce an advance within M subsequent experiments, exit with a structured report” — would close that gap.
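
That secondary exit is small enough to sketch directly. Nothing below exists in the loop today: the function name, the report keys, and the "keep"/"discard" status strings are assumptions, and M is left as a parameter because the post does not propose a value.

```python
from typing import Optional


def unblock_exit_report(statuses_since_consult: list[str], m: int) -> Optional[dict]:
    """Return a report when m experiments after a consultation produced no keep.

    Returns None while fewer than m experiments have run since the
    consultation, or when any of the first m produced an advance.
    """
    window = statuses_since_consult[:m]
    if len(window) < m or "keep" in window:
        return None
    return {
        "reason": "no advance after unblocking consultation",
        "window": m,
        "experiments_since_consult": len(statuses_since_consult),
    }
```

A non-None result would be the loop's cue to break and surface the report, restoring the "stop and ask a human" behavior for the case where external intervention did not help.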

One item from the previous post remains unimplemented: conformal prediction intervals on the baseline, which would report it as 8.46–8.74 rather than 8.740. The multi-sample --eval-n flag provides the raw material; converting those samples into an interval-valued baseline is the next instrumentation step. With interval baselines, the advance/discard decision becomes “does the new instruction’s expected composite exceed the baseline interval’s upper bound” rather than a comparison against a scalar that may be a sampling artifact.
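
The decision rule itself is simple once an interval exists. In the sketch below, `naive_interval` is only a placeholder that takes the min and max of the baseline samples; it is not a conformal interval, and both function names are hypothetical. The point is the shape of the comparison, not the interval construction.

```python
from statistics import mean


def naive_interval(samples: list[float]) -> tuple[float, float]:
    """Placeholder interval: min and max of the samples (not conformal)."""
    if not samples:
        raise ValueError("need at least one baseline sample")
    return min(samples), max(samples)


def is_advance(candidate_scores: list[float], baseline_interval: tuple[float, float]) -> bool:
    """Advance only when the candidate's mean clears the interval's upper bound."""
    _lower, upper = baseline_interval
    return mean(candidate_scores) > upper
```

Against an interval of (8.46, 8.74), a candidate whose three runs average 8.78 would count as an advance, while one averaging 8.65 would not, even though 8.65 beats the re-estimated scalar baseline of 8.530.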
