May 25, 2026 • 16 min read • Agentic Harness Engineering Series

When the Loop Defeats Itself: Convergence Failures in Autonomous Prompt Optimization

Ninety experiments. Zero advances past baseline. Three nested failure modes — a corrupted diagnostic, a single-eval baseline contaminated by a lucky run, and a ban list that erased its own best approach — and four convergence detectors that would have caught each one.

The autoresearch loop in harness/autoresearch.py is a hill-climber: a proposer LLM suggests a change to the synthesis instruction, the eval suite measures the output quality, and the change is kept only if it beats the current baseline by at least 0.1 composite points. The design is straightforward and, for the first 25 experiments, it worked. The baseline climbed from an unscored initial state to 8.740 — the best composite score the system had ever produced.

Then it ran for 65 more experiments without advancing once.

This post is the forensic account of why. The short answer is three nested failures, each compounding the others: the proposer was given a wrong diagnosis of which dimension to target; the baseline score it was trying to beat was established from a single lucky evaluation run rather than the expected value of the instruction; and the growing hard-ban list eventually prohibited the only instruction family that worked. The longer answer requires looking at what the runs actually showed.

The Loop in Brief

Each iteration of the autoresearch loop consists of four steps. First, the proposer reads the current synthesis instruction, the experiment history, the hard-ban list, and the most recent evaluator feedback, then outputs a proposed replacement. Second, the proposed instruction is written to agent.py and the eval suite runs a full research task from scratch. Third, a composite score is computed:

composite = 0.7 × mean_wiggum_r1 + 0.3 × criteria_rate × 10

mean_wiggum_r1 is the WIGGUM evaluator score for the first synthesis pass, a weighted sum across six dimensions: relevance (0.20), completeness (0.20), depth (0.25), grounded (0.15), specificity (0.10), and structure (0.10). criteria_rate is the fraction of task-specific criteria the output satisfies — for T_B these are structural checks: minimum length, minimum sections, no placeholder text, no file path references, and the presence of implementation notes. Fourth, if the composite exceeds baseline + 0.1, the instruction is kept and the baseline is updated. Otherwise, the instruction is reverted and the experiment is logged as a discard.
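
The scoring and keep/discard rule described above can be sketched in a few lines. This is a minimal illustration rather than the harness code: the names composite_score, should_keep, and MIN_ADVANCE are invented here, and only the formula and the 0.1 margin come from the text.

```python
MIN_ADVANCE = 0.1  # margin from the text: keep only if composite > baseline + 0.1


def composite_score(mean_wiggum_r1: float, criteria_rate: float) -> float:
    """composite = 0.7 * mean_wiggum_r1 + 0.3 * criteria_rate * 10."""
    return 0.7 * mean_wiggum_r1 + 0.3 * criteria_rate * 10


def should_keep(composite: float, baseline: float, margin: float = MIN_ADVANCE) -> bool:
    """Keep the proposed instruction only if it clears baseline + margin."""
    return composite > baseline + margin


if __name__ == "__main__":
    score = composite_score(mean_wiggum_r1=8.2, criteria_rate=1.0)
    print(f"{score:.3f}")                     # 8.740
    print(should_keep(score, baseline=8.74))  # False: tying the baseline is not an advance
```

Note that a run which merely reproduces the baseline score is discarded, since the rule requires a strict improvement of more than 0.1.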

The loop ran task T_B: "Search for best practices for cost envelope management in production AI agents." Every experiment targeted SYNTH_INSTRUCTION_PROSE, the instruction used for non-technical best-practices tasks.

Failure Mode 1: The Corrupted Diagnostic

The proposer doesn't navigate blind. PROPOSE_PROMPT includes a diagnostic section that tells it which dimension is the bottleneck and why, so it can design targeted changes rather than exploring at random. At experiment 55, the diagnostic said:

"depth=7 is the ONLY bottleneck. Every other dimension is at 8 or 9. Moving depth from 7→8 adds +0.25 to the composite. Why depth stays at 7: outputs describe WHAT each practice is and WHY it works, but do NOT explain HOW to apply it."

The proposer read this and did exactly what it was told. For the next 14 experiments (exps 55–68), every proposal was a variation of the same HOW-FIRST instruction: lead with implementation steps, require ordered sequences a practitioner can follow, frame each practice as a mini-tutorial. All 14 discarded.

The diagnostic was factually wrong. Querying runs.jsonl across the 20 most recent T_B evaluations told a different story:

| Dimension    | Observed values                      | Interpretation            |
|--------------|--------------------------------------|---------------------------|
| relevance    | 9 (constant)                         | Not a lever               |
| completeness | 8 (constant)                         | Not a lever               |
| depth        | 7 (permanent)                        | Not a lever (immovable)   |
| grounded     | 6 or 8 (bimodal)                     | Primary variance driver   |
| specificity  | 8 or 9 (correlated with grounded)    | Secondary variance driver |
| structure    | 9 (constant)                         | Not a lever               |
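
A per-dimension summary like the one in the table can be produced with a short audit over the run log. The sketch below is illustrative only: the record layout (a "task" key and a "scores" dict on each JSONL line) and both function names are assumptions, not the harness's actual schema, and the sample data is built from the two profiles discussed in this post rather than from the real log.

```python
import json
from pathlib import Path

DIMENSIONS = ["relevance", "completeness", "depth", "grounded", "specificity", "structure"]


def load_recent_runs(path: str, task_id: str, limit: int = 20) -> list[dict]:
    """Read a JSONL log and return the last `limit` records for one task.

    Assumes one JSON object per line with a "task" key (hypothetical schema).
    """
    lines = Path(path).read_text().splitlines()
    records = [json.loads(line) for line in lines if line.strip()]
    return [r for r in records if r.get("task") == task_id][-limit:]


def classify_dimensions(runs: list[dict]) -> dict[str, str]:
    """Label each dimension as constant or varying across the given runs.

    Assumes each record carries a "scores" dict keyed by dimension name.
    """
    if not runs:
        raise ValueError("no runs to classify")
    labels = {}
    for dim in DIMENSIONS:
        values = sorted({run["scores"][dim] for run in runs})
        if len(values) == 1:
            labels[dim] = f"constant at {values[0]}"
        else:
            labels[dim] = f"varies over {values}"
    return labels


if __name__ == "__main__":
    low = dict(relevance=9, completeness=8, depth=7, grounded=6, specificity=8, structure=9)
    high = dict(low, grounded=8, specificity=9)
    sample = [{"task": "T_B", "scores": low}] * 3 + [{"task": "T_B", "scores": high}]
    for dim, label in classify_dimensions(sample).items():
        print(f"{dim:13s} {label}")
```

Running an audit of this kind before writing or trusting a diagnostic makes a frozen dimension (constant in every run) easy to tell apart from a genuinely variable one.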

Depth was not fluctuating. It had been 7 in every single run. The real variance was in grounded, which alternated between 6 and 8 with no apparent pattern. The two resulting dimension profiles map precisely to the two score clusters the system kept producing:

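| Profile | grounded | specificity | wiggum r1 | composite | vs baseline |
|---------|----------|-------------|-----------|-----------|-------------|
| Low     | 6        | 8           | 7.8       | 8.460     | −0.280      |
| High    | 8        | 9           | 8.2       | 8.740     | baseline    |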

The arithmetic is exact. For the low profile: 9×0.20 + 8×0.20 + 7×0.25 + 6×0.15 + 8×0.10 + 9×0.10 = 7.75, which rounds to 7.8. For the high profile: 9×0.20 + 8×0.20 + 7×0.25 + 8×0.15 + 9×0.10 + 9×0.10 = 8.15, which rounds to 8.2. The bimodal distribution is entirely explained by grounded and specificity toggling together between their two states.
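
The arithmetic can be checked mechanically. The sketch below uses Python's decimal module to avoid floating-point noise and assumes half-up rounding to one decimal place, which is consistent with 7.75 → 7.8 and 8.15 → 8.2 but is not confirmed from the evaluator's code. The function and constant names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

# Dimension weights as listed in the loop description.
WEIGHTS = {
    "relevance": Decimal("0.20"),
    "completeness": Decimal("0.20"),
    "depth": Decimal("0.25"),
    "grounded": Decimal("0.15"),
    "specificity": Decimal("0.10"),
    "structure": Decimal("0.10"),
}


def wiggum_r1(scores: dict[str, int]) -> Decimal:
    """Weighted sum over the six dimensions, rounded half-up to one decimal (assumed)."""
    raw = sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)
    return raw.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


def composite(scores: dict[str, int], criteria_rate: Decimal = Decimal("1")) -> Decimal:
    """composite = 0.7 * wiggum_r1 + 0.3 * criteria_rate * 10."""
    return Decimal("0.7") * wiggum_r1(scores) + Decimal("0.3") * criteria_rate * 10


LOW = dict(relevance=9, completeness=8, depth=7, grounded=6, specificity=8, structure=9)
HIGH = dict(LOW, grounded=8, specificity=9)

if __name__ == "__main__":
    for name, profile in (("low", LOW), ("high", HIGH)):
        print(name, wiggum_r1(profile), composite(profile))
    # low 7.8 8.46
    # high 8.2 8.74
```

The composite is computed from the rounded wiggum score: using the unrounded 7.75 would give 8.425 rather than the observed 8.460.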

The diagnostic had been written early in the experiment run, before sufficient data had accumulated to distinguish between a permanently flat dimension and one that happened to be low in the available samples. It was never updated. The proposer spent 14 experiments in a confident, self-reinforcing attractor that could not possibly work, because the dimension it was trying to move was already frozen.

Finding 1: An optimizer's steering document is a first-class experimental artifact. A diagnostic encoding stale or inferred beliefs will produce attractor lock that the optimizer cannot self-escape, because the proposer has no access to the underlying data — only to what the diagnostic tells it.

After the diagnostic was corrected to target grounded and completeness instead of depth, the proposer escaped the HOW-FIRST attractor. It entered a new one. Five new target angles — NAMED-SYSTEMS, FAILURE-MODES, LIFECYCLE-COVERAGE, EVIDENCE-ANCHOR, SCOPE-BOUNDARIES — produced 24 more consecutive discards (exps 70–93), all at the same 8.460 composite. The grounded dimension remained stubbornly at 6 regardless of what the instruction demanded.

This pointed to a second problem underneath the first.

Failure Mode 2: Baseline Contamination

Every experiment was trying to beat 8.740. That score was established at experiment 25 from a single evaluation run. With --eval-n 1 (the default), the baseline is whatever the eval suite returns on one execution of the task.

The dimension data shows that any given T_B run lands on one of exactly two profiles: grounded=6 (composite ≈ 8.46) or grounded=8 (composite ≈ 8.74). In the 20 most recent T_B runs logged in runs.jsonl, grounded=8 appeared approximately 25–30% of the time. Experiment 25 happened to produce a grounded=8 run. The baseline was set from that run and never re-examined.

The implication is structural: every subsequent experiment was trying to beat a 75th-percentile sample of its own score distribution. The true expected composite for the baseline instruction — the score you would get on average, not in the best case — was approximately 8.53, not 8.740. An optimizer designed to advance by 0.1 points was asked to consistently clear a bar set at the one-in-four outcome.

The math: Expected composite ≈ 0.70×E[wiggum_r1] + 3.0, where E[wiggum_r1] ≈ 0.25×8.2 + 0.75×7.8 = 7.9 → composite ≈ 8.53. The baseline was 8.74. To beat it, any instruction needed to produce grounded=8 more reliably than baseline did, not just sometimes.

This is the deeper reason why every grounded-targeting instruction failed: not because they were bad approaches, but because the bar they needed to clear was set from an outlier. A new instruction that produced grounded=8 on 50% of runs would have an expected composite of ≈ 8.60 (0.70×8.0 + 3.0), still below 8.74. It would be discarded every time the eval landed on grounded=6, and even a grounded=8 run would only tie the baseline at 8.74 rather than clear the required 8.84 (baseline + 0.1).
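
The expected-value argument can be checked numerically. The sketch below assumes the two-profile model from this section (wiggum r1 of 7.8 or 8.2, criteria_rate of 1.0); the function name and constants are illustrative, not harness identifiers.

```python
LOW_WIGGUM, HIGH_WIGGUM = 7.8, 8.2  # the two observed wiggum r1 profiles
BASELINE, MIN_ADVANCE = 8.74, 0.1   # stored baseline and required margin


def expected_composite(p_high: float, criteria_rate: float = 1.0) -> float:
    """Expected composite when grounded=8 occurs with probability p_high."""
    expected_wiggum = p_high * HIGH_WIGGUM + (1 - p_high) * LOW_WIGGUM
    return 0.7 * expected_wiggum + 0.3 * criteria_rate * 10


if __name__ == "__main__":
    for p in (0.25, 0.50, 1.00):
        print(f"p(grounded=8)={p:.2f} -> expected composite {expected_composite(p):.2f}")
    # p(grounded=8)=0.25 -> expected composite 8.53
    # p(grounded=8)=0.50 -> expected composite 8.60
    # p(grounded=8)=1.00 -> expected composite 8.74
    print(expected_composite(1.0) > BASELINE + MIN_ADVANCE)  # False
```

Under this model, even an instruction that produced grounded=8 on every run would top out at 8.74, which does not exceed baseline + 0.1.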

There is a separate effect compounding this. When a new instruction adds a specific requirement — "cite a named tool," "include a failure mode," "state a scope boundary" — the producer model follows it, which changes the character of the output. The evaluator scores organic, naturally-grounded content differently from formulaic additions. Fourteen out of 24 experiments that required naming a tool or benchmark scored 8.240 rather than 8.460: the additional requirement caused criteria_rate to drop from 1.0 to 0.83 (5 of 6 criteria satisfied instead of 6), because the structural requirement apparently replaced content the has_impl_notes() criterion depended on. Explicit demands for grounded content do not produce the same evaluator reward as grounded content that emerges naturally.

Finding 2: A single-eval baseline is a point sample from a noisy distribution. If that sample is at the high end of the distribution, the optimizer is racing against an unlucky draw for every subsequent experiment. Baseline re-estimation with eval-n ≥ 3 should be a periodic obligation, not a one-time initialization.
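
To see why eval-n ≥ 3 helps, consider how likely the averaged baseline is to land on the top cluster. The sketch below assumes runs are independent and that each lands on the high cluster with probability 0.25, the lower end of the 25–30% observed above; both are modelling assumptions, and the function name is illustrative.

```python
from math import comb

LOW, HIGH = 8.46, 8.74  # the two composite clusters reported above


def mean_composite_distribution(p_high: float, n: int) -> dict[float, float]:
    """Map each possible mean-of-n composite to its probability.

    Assumes independent runs, each landing on HIGH with probability p_high.
    """
    dist = {}
    for k in range(n + 1):  # k = number of HIGH runs among the n evaluations
        mean = round((k * HIGH + (n - k) * LOW) / n, 3)
        dist[mean] = comb(n, k) * p_high**k * (1 - p_high) ** (n - k)
    return dist


if __name__ == "__main__":
    for n in (1, 3):
        print(f"eval-n={n}")
        for mean, prob in sorted(mean_composite_distribution(0.25, n).items()):
            print(f"  baseline {mean:.3f} with probability {prob:.3f}")
```

With a single eval, the baseline is set at 8.74 one time in four; with three evals, all three runs must land high, which under these assumptions happens about 1.6% of the time.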

Failure Mode 3: The Self-Defeating Ban List

The hard-ban list in PROPOSE_PROMPT grew experiment by experiment as approaches failed. By experiment 68, it included over a dozen families, among them narrative format and word count constraints.

The ban was justified: every attempt to use these approaches after experiment 25 had failed to advance the baseline. The ban encoded that failure correctly.

The problem is what the ban concealed. Experiment 25 — the one that established the 8.740 baseline — had used exactly these techniques. Its description: "Changed SYNTH_INSTRUCTION_PROSE to require a narrative format with word count constraint." The actual instruction it wrote:

"output ONLY the markdown starting with #   Write a concise narrative (200–300 words) explaining how prompt specificity impacts LLM reasoning performance. Use clear, accessible language and describe one practical example where increasing prompt detail led to better output quality."

Notice the instruction is topic-specific to prompt specificity — not cost management, which is the T_B task. The model almost certainly ignored the topic direction and wrote about cost management as instructed by the task prompt. What it kept was the structural signature: narrative format, tight word count, no enumerated list. That instruction got a grounded=8 run. The baseline was set.

Every subsequent experiment that tried narrative format or word count constraints was trying to advance beyond 8.740, not merely reach it. They failed because beating 8.740 requires getting a run better than the best profile (grounded=8, composite=8.740), which means completeness or grounded would need to reach 9 — mathematically possible but not achieved. The ban was then added: "NARRATIVE always discards, WORD COUNT always discards."

Both statements were technically accurate. Both were dangerously misleading. The ban captured the failure to advance and translated it into the instruction that these approaches "don't work" — erasing the information that they were responsible for the best score the system had ever produced.

Finding 3: A ban list that encodes "this approach cannot beat baseline" and a ban list that encodes "this approach does not work" are not the same thing. The former is a local navigation constraint; the latter is a global assessment of approach quality. Conflating them erases information about what baseline the approach established, which is exactly the information needed to diagnose baseline contamination.

The Compound Effect

The three failure modes did not operate independently. They amplified each other through the loop's feedback structure.

The corrupted diagnostic sent the proposer into the HOW-FIRST attractor. After 14 failed experiments, the correct diagnostic was installed and the ban list was updated with the entire HOW-FIRST family. The proposer moved to grounded-targeting approaches. These also failed, for reasons the diagnostic could not explain: every grounded-targeting addition was hitting a baseline established by a lucky run with a completely different instruction structure. The proposer, seeing that its new approaches failed for the same reasons the old ones did, exhausted five new target angles in another 24 experiments.

Meanwhile, the approaches that could theoretically have produced a more reliable expected score — narrative format, tight word count, the properties of the exp-25 instruction — were banned. The proposer could not propose them even if the research context pointed toward them, because the ban list said they always discard.

The loop had constructed a trap: the correct target (more reliable grounded=8) was prohibited; the prohibited approach (narrative+word count) was the only instruction family that had ever produced the baseline; the baseline itself was an outlier that the true expected value of any instruction was unlikely to reach. The loop could run indefinitely without advancing.

Four Convergence Detectors

Each failure mode has a natural detector. The detectors operate at different points in the loop and target different signals.

Detector 1

Semantic Attractor Guard (pre-eval)

Before running an evaluation, compare the proposed description against the last N discards using bigram overlap or TF-IDF cosine similarity. If the similarity to any recent discard exceeds a threshold, reject the proposal without spending an eval run and re-prompt with an explicit note identifying the closest match.

The threshold wants to be loose enough to permit genuine novelty while tight enough to catch "HOW-FIRST, variant 7." A cosine similarity ≥ 0.65 or bigram overlap ≥ 50% is a reasonable starting point. This detector catches attractor lock before it consumes compute.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def _proposal_is_attractor(
    description: str, recent_discards: list[str], threshold: float = 0.65
) -> tuple[bool, str | None]:
    """Return (True, closest_discard) if the proposal is too similar to a recent discard."""
    if not recent_discards:
        return False, None
    # Fit unigram + bigram TF-IDF over the discards plus the proposal (last row).
    corpus = recent_discards + [description]
    vecs = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(corpus)
    # Cosine similarity of the proposal against every recent discard.
    sims = cosine_similarity(vecs[-1], vecs[:-1])[0]
    best = int(np.argmax(sims))
    if sims[best] >= threshold:
        return True, recent_discards[best]
    return False, None

Detector 2

Family Entropy Monitor

Tag each proposal with a coarse family label derived from its description (e.g., "HOW-FIRST," "NAMED-SYSTEMS," "FAILURE-MODES"). Track a sliding window of family labels across the last 10 experiments. Compute Shannon entropy over the label distribution. If entropy falls below a threshold — meaning one family dominates — auto-inject a hard ban for the dominant family into PROPOSE_PROMPT rather than waiting for a human to notice the pattern.

This is the semantic upgrade to the existing PLATEAU_DISCARDS logic. The current plateau detector fires on score stagnation; this one fires on semantic stagnation. A loop can cycle through five variations of the same approach, all producing slightly different scores, without triggering the score-based plateau — but the entropy would collapse.

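import math
from collections import Counter

def _family_entropy(family_labels: list[str]) -> float:
    """Shannon entropy (in bits) of the family label distribution."""
    counts = Counter(family_labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def _dominant_family(family_labels: list[str], entropy_threshold: float = 1.0, window: int = 10) -> str | None:
    """Return the dominant family if entropy over the last `window` labels collapses."""
    recent = family_labels[-window:]
    if len(recent) < window:
        # Too few experiments to judge; this also avoids indexing an empty Counter.
        return None
    if _family_entropy(recent) < entropy_threshold:
        return Counter(recent).most_common(1)[0][0]
    return None

Detector 3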

Baseline Validity Re-estimation

After K consecutive discards (a reasonable default is 15), automatically re-run the baseline instruction with eval-n=3 and compare the new estimate to the stored single-eval baseline. If the re-estimated expected value falls more than delta_threshold below the stored baseline, update it. A stored baseline of 8.74 that re-estimates to 8.53 is not a ceiling that experiments are failing to reach — it is a false ceiling constructed from a sampling artifact.

This detector would have resolved the contamination problem at experiment 40 instead of experiment 90. The cost is one eval-n=3 re-estimate (three eval runs) per K consecutive discards, which is negligible compared to the compute wasted chasing an outlier.

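def _maybe_reestimate_baseline(
    consecutive_discards: int,
    reestimate_interval: int,
    task_ids: list[str],
    current_baseline: float,
    delta_threshold: float,
    eval_n: int = 3,
) -> float:
    """Re-estimate the baseline after every `reestimate_interval` consecutive discards."""
    # Skip when the streak is empty (0 % k == 0 would otherwise trigger a re-run
    # right after an advance) and whenever it is not a multiple of the interval.
    if consecutive_discards == 0 or consecutive_discards % reestimate_interval != 0:
        return current_baseline
    print(f"[convergence] re-estimating baseline with eval-n={eval_n}...")
    # run_eval is not defined in this snippet; it is assumed to be the harness's
    # existing eval entry point, returning the mean composite over n runs.
    new_estimate = run_eval(task_ids, n=eval_n)
    # Only lower the baseline, and only when the gap exceeds the threshold.
    if new_estimate < current_baseline - delta_threshold:
        print(f"[convergence] baseline updated: {current_baseline:.3f} → {new_estimate:.3f} (single-eval outlier)")
        return new_estimate
    return current_baseline

Detector 4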

Global Convergence Exit

A hard stop: if the total number of experiments exceeds N_max and there have been zero advances since experiment M_last_advance, declare convergence, log the diagnosis, and exit. The current loop has no exit condition other than an interrupt signal. It will run indefinitely in a converged state, burning compute without any mechanism to surface the fact that it has stopped making progress.

The exit condition should be accompanied by a structured convergence report: the last advance, the current baseline, the distribution of recent scores, and the dominant family from the entropy monitor. This report is the input to the next iteration of the loop design — the information needed to decide whether to change the producer model, the evaluator, the task, or the eval-n.

def _is_globally_converged(
    history: list[dict],
    max_experiments: int = 100,
    min_advance_lookback: int = 40,
) -> bool:
    if len(history) < max_experiments:
        return False
    recent = history[-min_advance_lookback:]
    return all(h["status"] == "discard" for h in recent)
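
The structured convergence report described above could be assembled along the following lines. This is a sketch under assumptions: only the "status" key and its "discard" value appear in the loop code shown here, while the "score" and "family" keys, the report field names, and the function name are hypothetical. The dominant family is taken as the most common label in the window, a simplification of the entropy monitor.

```python
from collections import Counter


def build_convergence_report(history: list[dict], baseline: float, window: int = 40) -> dict:
    """Summarise the state of a converged loop for whoever redesigns it.

    Assumes each history entry has "status" plus hypothetical "score" and
    "family" keys. Any status other than "discard" is treated as an advance.
    """
    last_advance = None
    for index, entry in enumerate(history):
        if entry["status"] != "discard":
            last_advance = index  # keeps the position of the most recent advance
    recent = history[-window:]
    families = Counter(entry["family"] for entry in recent)
    return {
        "experiments": len(history),
        "last_advance_index": last_advance,
        "baseline": baseline,
        "recent_score_counts": dict(Counter(entry["score"] for entry in recent)),
        "dominant_family": families.most_common(1)[0][0] if families else None,
    }


if __name__ == "__main__":
    history = [{"status": "keep", "score": 8.74, "family": "NARRATIVE"}]
    history += [{"status": "discard", "score": 8.46, "family": "HOW-FIRST"}] * 5
    print(build_convergence_report(history, baseline=8.74, window=5))
```

A report in which every recent score sits in one cluster below the baseline is itself a hint that the baseline, not the proposals, deserves scrutiny.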

What the Loop Actually Needs

The four detectors address the symptoms. The underlying requirement is that an autonomous optimizer treat its own steering document — the diagnostic, the ban list, the target angles — as experimental artifacts subject to the same scrutiny as any other hypothesis.

The autoresearch loop was designed to update the synthesis instruction based on evidence. It was not designed to update its beliefs about which dimensions to target, which approaches are viable, or whether the baseline itself was measured correctly. Those beliefs were encoded in a prompt that the proposer read but could not question. The loop updated the artifact it was told to optimize while the beliefs that governed its optimization remained fixed and wrong.

A more robust design separates the two update loops: an outer loop that periodically audits the diagnostic against the accumulated runs data and updates the ban list with context preserved (banning an approach because it cannot advance a given baseline is different from banning it because it cannot reach that baseline), and an inner loop that proposes instruction changes within the constraints the outer loop defines.
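
One way to preserve that context is to store each ban with the baseline it was measured against and the best score the family ever produced. The sketch below is purely illustrative: the BanEntry class, its fields, the tolerance value, and the rendered wording are all hypothetical and do not reflect the existing PROPOSE_PROMPT format.

```python
from dataclasses import dataclass


@dataclass
class BanEntry:
    family: str               # e.g. "NARRATIVE"
    baseline_at_ban: float    # baseline the family failed to beat
    best_score_seen: float    # best composite this family ever produced
    tolerance: float = 0.005  # scores this close to the baseline count as reaching it

    @property
    def reached_baseline(self) -> bool:
        """True if the family has matched the baseline it is banned against."""
        return self.best_score_seen >= self.baseline_at_ban - self.tolerance

    def render(self) -> str:
        """Text for the proposer prompt that preserves why the family is banned."""
        if self.reached_baseline:
            return (
                f"{self.family}: reaches the current baseline ({self.baseline_at_ban:.3f}) "
                "but has not advanced past it. Local constraint; revisit if the baseline changes."
            )
        return (
            f"{self.family}: best score {self.best_score_seen:.3f} is below the baseline "
            f"({self.baseline_at_ban:.3f}). Has not reached baseline."
        )


if __name__ == "__main__":
    print(BanEntry("NARRATIVE", baseline_at_ban=8.74, best_score_seen=8.74).render())
    print(BanEntry("HOW-FIRST", baseline_at_ban=8.74, best_score_seen=8.46).render())
```

Rendered this way, a family that established the baseline reads differently from one that never got near it, and the outer loop can lift the first kind of ban automatically whenever the baseline is re-estimated.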

The GEPA framework, discussed in the companion post, formalizes this separation through Pareto frontier selection and minibatch screening. The convergence detectors described here are the practical entry point: simpler to instrument, visible in the existing loop structure, and sufficient to prevent the most costly failure modes from running undetected for 65 experiments.

The 8.740 baseline, for what it is worth, re-estimated to 8.530 with eval-n=3 and --reset-baseline. The loop, recalibrated, has a realistic target to aim for again.

What the Literature Leaves Open

Three papers from the harness’s own lit-review corpus land directly on the failure modes documented here.

JudgeSense (2604.23478) introduces the Judge Sensitivity Score (JSS) — the fraction of semantically equivalent prompt paraphrases yielding identical verdicts. On factuality tasks, judges cluster near a JSS of 0.63 due to what the authors call a “polarity-inverted prompt artifact,” recovering to ~0.9 after correction. This is the mechanism behind the grounded=6↔8 oscillation: the WIGGUM evaluator’s sensitivity to synthesis output phrasing is what makes the baseline stochastic, not any property of the synthesis instruction. The autoresearch loop spent 90 experiments trying to solve an evaluator noise problem by changing producer instructions.

FormatSpread (2310.11324) shows that LLM sensitivity to prompt formatting “remains even when increasing the number of examples or performing instruction tuning.” This is consistent with why neither the 14 HOW-FIRST experiments nor the 24 grounded-targeting experiments (NAMED-SYSTEMS and the four other target angles) stabilized the grounded dimension: the variance source is structural to the evaluator and not responsive to synthesis instruction changes.

Conformal Prediction for LLM-as-a-Judge (2509.18658) proposes constructing continuous prediction intervals from a single run, with the interval midpoint as a low-bias alternative to raw scores. The contaminated 8.740 baseline was a point-estimate trap: one lucky sample (grounded=8, specificity=9) set a scalar threshold the loop could never reliably beat. Interval-valued baselines would have surfaced the bimodal score distribution before the loop ran at all.

Open question: Could the autoresearch loop use conformal prediction intervals on the baseline — reporting it as 8.46–8.74 rather than 8.740 — so that advance/discard decisions are robust to evaluator noise rather than chasing a single lucky sample?