From Hill-Climbing to Pareto
DSPy's GEPA optimizer (Agrawal et al., 2025) formalizes exactly what autoresearch.py does by hand — but adds Pareto frontier selection, minibatch screening, and rich textual feedback. Those three ideas explain, precisely, why the hand-rolled loop oscillates and plateaus.
The autoresearch loop in the harness is a hand-rolled prompt optimizer: propose a change to SYNTH_INSTRUCTION, run the eval suite, keep the change if it improves the composite score, discard it otherwise, and repeat. After 100+ experiments this loop plateaued at 8.74 — stuck below the 9.0 PASS threshold — with a distinctive oscillation pattern: experiments 20–38 kept proposing two alternative strategies in rotation, keeping neither long enough to escape the attractor.
GEPA (Genetic-Pareto, Agrawal et al., 2025) is a prompt optimizer built on the same principle, but with three structural differences that would have prevented both problems. This post maps each GEPA concept to a concrete failure mode in the harness loop and describes what the upgrade path looks like.
I1 — The Autoresearch Loop as Primitive GEPA
The two systems share the same skeleton. In GEPA terms, the autoresearch loop is:
| GEPA concept | autoresearch.py implementation | Gap |
|---|---|---|
| Candidate pool (Pareto frontier) | Single scalar-best: baseline_score | No diversity preservation |
| Sample a candidate to mutate | Always mutates the current committed instructions | Single lineage — no crossover |
| Collect execution traces + textual feedback | get_recent_eval_feedback() + _extract_discarded() | Feedback was broken (off-by-one bug) until exp 38 |
| LLM reflection → propose new instruction | propose_instructions() via Qwen3-Coder:30b | Matches |
| Minibatch rollout → quick filter | Full run_eval(task_ids, eval_n) every experiment | No cheap screening; bad candidates cost as much as good ones |
| Update Pareto frontier | if delta > threshold: baseline_score = score | Winner-take-all; complementary candidates evicted immediately |
Three gaps matter. The rest of the post addresses each in turn.
I2 — Gap 1: Pareto Frontier vs. Scalar Best
The oscillation in experiments 20–38 had a specific shape: the proposer alternated between two instruction variants, A and B, each beating the other's score by a small margin on alternate runs. Neither was consistently better; each was better on a different subset of eval tasks. Scalar-best selection collapsed to whichever happened to win on the last run and immediately evicted the other.
GEPA's Pareto frontier keeps both. A candidate survives on the frontier if it achieves the best score on at least one evaluation instance — not if it's the global maximum. In the oscillation case, A would survive because it wins on task T_E; B would survive because it wins on T_D. The next mutation samples from both, with probability proportional to coverage (how many instances each candidate "wins"). This guarantees that complementary strategies are retained and explored simultaneously rather than taking turns.
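To make the survival rule concrete, here is a small sketch of per-task frontier membership and coverage-weighted sampling. It is an illustration under stated assumptions, not GEPA's or the harness's actual code: the function names `pareto_frontier` and `sample_parent` are hypothetical, and the per-task scores are made-up numbers chosen to mirror the A/B oscillation.

```python
import random


def pareto_frontier(scores: dict[str, dict[str, float]]) -> dict[str, int]:
    """Map each surviving candidate to its coverage (number of tasks it wins).

    scores[candidate][task] is that candidate's score on that task.
    A candidate survives if it holds (or ties) the best score on at least one task.
    """
    tasks = {task for per_task in scores.values() for task in per_task}
    coverage: dict[str, int] = {}
    for task in sorted(tasks):
        best = max(per_task.get(task, float("-inf")) for per_task in scores.values())
        for name, per_task in scores.items():
            if per_task.get(task, float("-inf")) == best:
                coverage[name] = coverage.get(name, 0) + 1
    return coverage


def sample_parent(coverage: dict[str, int], rng: random.Random) -> str:
    """Pick a frontier candidate with probability proportional to its coverage."""
    names = list(coverage)
    return rng.choices(names, weights=[coverage[n] for n in names], k=1)[0]


# Illustrative (invented) per-task scores for three instruction variants.
scores = {
    "A": {"T_D": 8.4, "T_E": 9.1},
    "B": {"T_D": 9.0, "T_E": 8.5},
    "C": {"T_D": 8.3, "T_E": 8.4},  # wins on no task, so it is dominated
}
frontier = pareto_frontier(scores)
parent = sample_parent(frontier, random.Random(0))
```

With these numbers A and B each win one task and both stay on the frontier, while C wins nowhere and is dropped. A scalar-best rule would have kept only whichever of A or B had the higher composite on the most recent run.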
A minimal approximation doesn't require DSPy as a dependency. A ParetoPool class with four operations covers it; note that it ranks candidates by composite score rather than tracking per-task wins:
import dataclasses
import random


@dataclasses.dataclass
class PoolCandidate:
    synth: str
    synth_count: str
    synth_prose: str
    score: float
    description: str
    experiment: int


class ParetoPool:
    """Keeps up to `size` candidates, ranked by composite score.

    The caller adds only kept experiments, so every member was the best
    at some point. On overflow, the lowest-scoring member is replaced
    if the newcomer beats it.
    """

    def __init__(self, size: int = 4):
        self.size = size
        self._pool: list[PoolCandidate] = []

    def add(self, candidate: PoolCandidate) -> None:
        if len(self._pool) < self.size:
            self._pool.append(candidate)
        else:
            worst_idx = min(range(len(self._pool)),
                            key=lambda i: self._pool[i].score)
            if candidate.score > self._pool[worst_idx].score:
                self._pool[worst_idx] = candidate

    def sample(self) -> PoolCandidate | None:
        """Proportional-to-score sampling; returns None if pool empty."""
        if not self._pool:
            return None
        scores = [c.score for c in self._pool]
        lo = min(scores)
        # Shift by the minimum so the weakest member keeps a small non-zero weight.
        weights = [s - lo + 0.01 for s in scores]
        total = sum(weights)
        r = random.random()
        cum = 0.0
        for c, w in zip(self._pool, weights):
            cum += w / total
            if r <= cum:
                return c
        return self._pool[-1]  # guard against floating-point shortfall

    def best_score(self) -> float | None:
        return max((c.score for c in self._pool), default=None)

    def summary(self) -> str:
        return "\n".join(
            f" [{c.score:.3f}] exp {c.experiment}: {c.description[:60]}"
            for c in sorted(self._pool, key=lambda c: -c.score)
        ) or "(empty)"
Integration into the main loop requires two changes: when an experiment is kept, add it to the pool; at the start of each iteration, sample a parent from the pool and pass its instructions as current to the proposer. The git state continues to reflect only the scalar-best committed candidate — the pool is in-memory, not in git. When a pool-sampled parent's mutation is discarded, git checkout -- agent.py restores the committed scalar-best automatically.
Scalar-best collapses to one lineage immediately on keep; the Pareto pool retains up to N complementary candidates and samples from all of them, weighted by score.
I3 — Gap 2: Textual Feedback as Optimization Signal
GEPA's second key insight is that scalar scores are a lossy compression of what went wrong. A score of 8.2 on depth doesn't tell the proposer whether the agent wrote shallow summaries, failed to cite sources, or misunderstood the task domain. Textual feedback — evaluator comments, failure traces, parse errors — tells the proposer why the score was what it was, enabling targeted repairs rather than random mutations.
The harness evaluator already produces rich textual feedback. The Wiggum evaluator returns issues, feedback, and per-dimension scores for each run. get_recent_eval_feedback() surfaces this in the proposer prompt:
[T_B round 1 score=8.1 rel=9 cmp=8 dep=7 grounded=8 spc=8 str=9]
issue: depth dimension penalized — implementation steps listed without
concrete parameter values or configuration examples
feedback: the response identifies the right practices but treats each
as a general recommendation; a practitioner cannot apply
them without further research
This is rich enough for a capable proposer to make a targeted change: "add parameter values and configuration examples to depth-heavy items." The oscillation in experiments 20–38 persisted despite this feedback being available because _extract_discarded() had an off-by-one bug — it was reading the tasks column as the description, giving the proposer "- T_B" for every discard instead of the actual experiment description. The proposer had no memory of what it had tried and why it failed.
That bug was fixed (parts[5] → parts[6] in the TSV column order after the tasks column was added), but it illustrates a principle GEPA makes explicit: the feedback channel is as important as the scoring function. An optimizer with a rich scorer and a broken feedback channel degrades to random search.
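One way to make this class of bug harder to reintroduce is to read the log by column name instead of by position, so that adding a column cannot silently shift an index. The sketch below assumes the TSV has a header row; the column names and the sample rows are hypothetical and may not match the real autoresearch.tsv layout.

```python
import csv
import io


def extract_discarded(tsv_text: str) -> list[str]:
    """Return descriptions of discarded experiments, looked up by column name."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row["description"] for row in reader if row["status"] == "discard"]


# Hypothetical layout and invented rows, used only to exercise the parser.
# Note that "tasks" sits directly before "description", as in the bug above.
sample = (
    "experiment\tcommit\tscore\tbaseline\tstatus\ttasks\tdescription\n"
    "1\tabc1234\t8.10\t8.30\tdiscard\tT_B\tprioritize citations over depth\n"
    "2\tdef5678\t8.40\t8.30\tkeep\tT_B\tadd concrete parameter values\n"
)
discarded = extract_discarded(sample)
```

Here the proposer would receive "prioritize citations over depth" rather than "T_B", regardless of how many columns precede the description.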
Academic grounding: Agrawal et al. (2025) show that GEPA outperforms scalar-reward GRPO and other prompt optimizers on 7 of 8 benchmarks with dramatically fewer rollouts. The advantage concentrates on tasks where the scoring function is noisy or coarse — exactly the regime where harness eval operates (single composite float from a single-judge LLM). The paper frames this as "GEPA preserves natural-language traces from LLM-based workflows rather than reducing them to numerical rewards, mirroring human diagnostic processes."
I4 — Gap 3: Minibatch Screening Before Full Eval
Every experiment in the autoresearch loop, whether the candidate is clearly broken or genuinely promising, incurs the same eval cost: eval_n full runs across all task IDs. With --eval-n 3 and three tasks, a single experiment costs ~45–60 minutes of wall-clock time. Roughly 80% of experiments are discards. The majority of that compute is spent confirming that bad candidates are bad.
GEPA's minibatch rollout addresses this asymmetry. Before committing to a full evaluation, GEPA runs the candidate on a small minibatch of training instances — typically 2–4 examples. Candidates that fail the minibatch are discarded immediately without a full eval run. Only candidates that clear the minibatch threshold proceed to full evaluation.
The harness equivalent: run _run_eval_once([task_ids[0]]) before the full run_eval(task_ids, eval_n). The screen uses a single task, single sample — the fastest possible eval. If the quick score falls below an absolute floor (not a relative-to-baseline comparison, which would require per-task baselines), the candidate is discarded immediately:
MINIBATCH_FLOOR = 6.5  # absolute floor: below this the instruction is clearly broken


def run_eval_screened(task_ids: list[str], n: int, baseline: float
                      ) -> tuple[float, bool]:
    """Quick 1-task screen; full eval only if instruction isn't broken.

    `baseline` is accepted for call-site symmetry but is unused: the screen
    compares against an absolute floor, not against the baseline.
    Returns (score, did_full_eval).
    """
    screen_task = [task_ids[0]]
    print(f" [screen] quick eval on {screen_task[0]}...")
    quick = _run_eval_once(screen_task)
    if quick < MINIBATCH_FLOOR:
        print(f" [screen] {quick:.3f} < floor {MINIBATCH_FLOOR}: "
              f"skipping full eval (clearly broken instruction)")
        return quick, False
    print(f" [screen] {quick:.3f} >= floor, proceeding to full eval")
    return run_eval(task_ids, n), True
The floor is set at 6.5 rather than relative to baseline because the proposer occasionally produces outputs that aren't instructions at all — GEPA calls these "hallucinated documents." The existing check rejects instructions over 1,200 characters or with more than 3 newlines; the floor check catches the ones that pass the length check but produce nonsense scores. A score below 6.5 on any single task is a reliable signal of a broken instruction: good instructions on this harness cluster in the 8.0–9.5 range.
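The structural check described above can be sketched in a few lines. The thresholds come from the text; the function name `looks_like_instruction` is an assumption, and the harness's real check may be organized differently.

```python
MAX_CHARS = 1200     # instructions longer than this are rejected
MAX_NEWLINES = 3     # more line breaks than this suggests a document, not an instruction


def looks_like_instruction(text: str) -> bool:
    """Cheap structural gate that runs before any eval is spent on a proposal."""
    return len(text) <= MAX_CHARS and text.count("\n") <= MAX_NEWLINES
```

This gate costs nothing to run, so it belongs in front of the floor check: the floor only needs to catch proposals that look like instructions but score like nonsense.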
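The expected saving: with a full eval at 45–60 minutes and ~15 minutes per single-task quick eval, screening saves roughly 30–45 minutes on each candidate that clears the length check but fails the floor. Candidates that pass the screen pay the extra ~15 minutes on top of the full eval, so the net saving depends on how many of the ~80% of discards fall below the floor rather than merely below baseline. On long runs with many broken proposals, this is significant.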
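The trade-off can be worked through as simple expected-value arithmetic. The sketch below uses the timing figures quoted in this post; the fraction of candidates that fail the floor is a free parameter here, not a measured value, and the function names are hypothetical.

```python
def expected_minutes(full_eval: float, screen: float, p_fail_floor: float) -> float:
    """Expected wall-clock minutes per experiment with a screen in front.

    Every candidate pays for the screen; only those that pass it
    go on to pay for the full eval.
    """
    return screen + (1.0 - p_fail_floor) * full_eval


def break_even_fail_rate(full_eval: float, screen: float) -> float:
    """Fraction of candidates that must fail the floor for screening to pay off."""
    return screen / full_eval


# Using a 60-minute full eval and a 15-minute screen:
no_failures = expected_minutes(60, 15, 0.0)    # screen is pure overhead
half_fail = expected_minutes(60, 15, 0.5)
threshold = break_even_fail_rate(60, 15)
```

With a 45–60 minute full eval and a ~15 minute screen, screening breaks even when between a quarter and a third of candidates fall below the floor. Below that rate, the screen adds time rather than saving it.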
I5 — A Note on Zeta 2 as a Specialized Proposer
GEPA proposes instruction changes using a general-purpose LLM. The harness uses Qwen3-Coder:30b — a 30B reasoning model — because the proposer needs to understand the evaluation context, the history of failed experiments, and the structure of the instruction text well enough to make a targeted single change. That reasoning demand justifies a large model.
Zed Industries' Zeta 2 (May 2026) is a different kind of model: an 8B code-edit prediction model fine-tuned from Seed-Coder-8B-Base on next-edit suggestion tasks. Its prompt format — suffix-prefix-middle with a git-merge-style editable region and an explicit edit history — is optimized for proposing targeted code changes given a context of prior edits:
<[fim-suffix]>
instruction text that comes after the editable region
<[fim-prefix]><filename>edit_history
--- a/agent.py
+++ b/agent.py
-old instruction line
+new instruction line (attempt 1 — discarded)
--- a/agent.py
+++ b/agent.py
-instruction line (attempt 2)
+new instruction line (attempt 2 — kept)
<filename>agent.py
current instruction before editable region
<<<<<<< CURRENT
the instruction text|cursor|to mutate
=======
<[fim-middle]>
This format is structurally a good fit for the proposer task: the edit history is exactly autoresearch.tsv rendered as diffs, and the editable region is the instruction block between the sentinel markers. The key limitation is that Zeta 2 was trained on code edits, not instruction rewrites. The instruction text is natural language, not code — Zeta 2's training distribution doesn't include "rewrite this prompt to emphasize implementation depth over source citations." A general reasoning model handles the semantic component that Zeta 2 would miss.
The better fit for Zeta 2 in the harness is as a specialized model for code-generation subtasks within the agent: tasks where the agent is asked to write or modify code, not instruction text. There, Zeta 2's next-edit format and 8B footprint are advantages over a 30B general model: faster, cheaper, and specifically trained for the code-edit task distribution.
The practical split: use a large general model (Qwen3-Coder:30b or equivalent) for the proposer, which reasons about evaluation history and instruction semantics; use a specialized small model (Zeta 2 or similar) for code-generation subtasks within the agent's output pipeline. The proposer's job is understanding; the agent's code task is pattern completion — these call for different models.
I6 — What a GEPA-Upgraded Autoresearch Loop Looks Like
Putting the three changes together, the loop becomes:
pareto_pool = ParetoPool(size=4)
pareto_pool.add(PoolCandidate(..., score=baseline_score, experiment=0))
experiment = 0  # experiment 0 is the baseline

while True:
    experiment += 1

    # 1. Sample parent from Pareto pool (not always current best)
    parent = pareto_pool.sample()
    current = {
        "synth": parent.synth,
        "synth_count": parent.synth_count,
        "synth_prose": parent.synth_prose,
    }

    # 2. Propose mutation of parent with pool context visible to proposer
    pool_summary = pareto_pool.summary()
    proposal = propose_instructions(current, history, eval_feedback,
                                    research_context, pool_summary,
                                    parent_experiment=parent.experiment)

    # 3. Apply and commit mutation
    write_instructions(proposal["synth"], proposal["synth_count"],
                       proposal["synth_prose"])
    git_commit(f"autoresearch exp {experiment}: {proposal['description']}")

    # 4. Screen before full eval
    score, did_full_eval = run_eval_screened(task_ids, eval_n, baseline_score)

    # 5. Update pool and baseline
    if score > baseline_score + delta_threshold:
        status = "keep"
        baseline_score = score
        pareto_pool.add(PoolCandidate(
            synth=proposal["synth"],
            synth_count=proposal["synth_count"],
            synth_prose=proposal["synth_prose"],
            score=score,
            description=proposal["description"],
            experiment=experiment,
        ))
    else:
        status = "discard"
        git_reset_discard()  # restores committed scalar-best

    log_experiment(experiment, score, baseline_score, status,
                   proposal["description"], task_ids)
The key behavioral change is in step 1. Rather than always proposing against the globally best committed instructions, the proposer sees a sampled parent, which might be the second-best or third-best candidate in the pool. This breaks the single-lineage attractor that caused the oscillation. If one lineage eventually pulls ahead cleanly and dominates the pool, the sampling converges back to near-scalar-best behavior naturally.
The proposer prompt also gains a {pool_summary} field that tells it which candidates exist and their scores. This context lets the proposer avoid re-proposing a variant it knows is already in the pool at a lower score — a more principled version of the PREVIOUSLY TRIED AND FAILED section that currently drives the hard-banned list.
What the Literature Leaves Open
- GEPA's Pareto frontier is defined over per-instance scores from a minibatch, not over per-task composite scores from a fixed eval suite. In the harness context, where the eval suite is a small fixed set of tasks (T_A through T_E), does Pareto selection over tasks provide meaningful diversity, or does the small task count collapse it back to scalar-best in practice?
- GEPA's textual feedback mechanism assumes the feedback is interpretable by the proposer. When feedback is generated by the same LLM family as the proposer (e.g., Qwen3 evaluating and Qwen3 proposing), does the proposer systematically overfit to the evaluator's blind spots rather than improving actual output quality?
- Minibatch screening rejects candidates below an absolute floor. In a regime where all candidates cluster in the 8.0–9.0 range — as in the harness plateau — the floor check rarely fires. What is the right design for a relative screen that correctly identifies candidates unlikely to beat the current best without requiring per-task baselines?
- GEPA's crossover step — combining best-performing modules from distinct lineages — is optional in the harness context, since there is only one mutable module (the synthesis instruction). Does the single-module constraint limit the benefit of Pareto selection to cases where distinct instruction strategies genuinely differ, or does it remain useful even for single-instruction optimization?
- For the proposer specialization question: at what eval task complexity does a 30B general reasoning model outperform an 8B code-edit model (such as Zeta 2) for instruction rewriting, and is that crossover point stable across different instruction domains (technical vs. prose vs. structured)?