May 25, 2026 • 14 min read • Agentic Harness Engineering Series

What SkillOpt Gets Right

Microsoft's SkillOpt (arXiv 2605.23904) is a gradient-free skill optimizer built around three design choices our autoresearch loop doesn't have: a proactive validation gate, a fast/slow epoch structure, and a persistent best_skill.md artifact. Each one exposes a concrete gap. One we've partially patched; two we haven't touched.

SkillOpt dropped on arXiv the same week we hit experiments 105–106: two consecutive runs proposing the identical NAMED-SYSTEMS instruction variant, keeping neither, looping. The timing was irritating and instructive. SkillOpt is a gradient-free framework for optimizing LLM agent skills—it uses batch rollouts, LLM reflection, and a hard acceptance gate to converge on a best_skill.md that transfers across task instances. Reading the README against our loop log made the structural gaps immediately visible.

This post is a gap analysis, not a SkillOpt explainer. The three gaps are: (1) the loop proposes before it validates, (2) all updates happen at one timescale, and (3) the best instruction is never written to a file. For each gap I'll describe what SkillOpt does, what the harness does instead, what we built in response, and what the literature says about the underlying problem.

On the lit reviews: To ground each gap, we ran four targeted literature sweeps using the harness's own lit_review_skill.py. The queries were: gradient-free prompt optimization, validation gating convergence criteria, hierarchical multi-timescale LLM updates, and skill library procedural knowledge persistence. Reviews 3 and 4 pulled directly relevant papers. Reviews 1 and 2 drifted: the arXiv fetch surfaced papers on prompt injection and on stopping criteria in mathematical optimization rather than the OPRO/APE/DSPy literature we were aiming for. What follows draws on what the sweeps found, honestly labeled.

I1 — SkillOpt in One Paragraph

SkillOpt runs in epochs. Within each epoch: sample a batch of task instances, run rollouts using the current skill instruction, collect per-instance scores, and pass the full trace—including failures—to an LLM proposer that generates a candidate revision. The candidate is then evaluated on a held-out validation set. If it improves the validation score, it becomes the new current instruction for the next epoch. If it doesn't, the current instruction is unchanged. At the end of training, the best-validated instruction is written to best_skill.md—a file that persists independently of the training loop and can be loaded by any downstream agent that needs the skill.
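
The loop described above can be sketched in a few lines. This is a minimal illustration of the propose-then-validate structure as this post summarizes it, not SkillOpt's actual code or API: the function name optimize_skill and the rollout and propose callables are hypothetical stand-ins for the agent rollout and the LLM proposer.

```python
import random
from typing import Callable, Sequence


def optimize_skill(
    initial: str,
    train_tasks: Sequence[str],
    val_tasks: Sequence[str],
    rollout: Callable[[str, str], float],  # (instruction, task) -> score
    propose: Callable[[str, list[tuple[str, float]]], str],  # (instruction, trace) -> candidate
    epochs: int = 5,
    batch_size: int = 4,
    seed: int = 0,
) -> str:
    """Propose-then-validate loop; returns the best validated instruction."""
    rng = random.Random(seed)

    def val_score(instruction: str) -> float:
        # Mean score on the held-out validation tasks.
        return sum(rollout(instruction, t) for t in val_tasks) / len(val_tasks)

    current, current_val = initial, val_score(initial)
    for _ in range(epochs):
        batch = rng.sample(list(train_tasks), min(batch_size, len(train_tasks)))
        # The full trace, failures included, goes to the proposer.
        trace = [(task, rollout(current, task)) for task in batch]
        candidate = propose(current, trace)
        cand_val = val_score(candidate)
        # Hard gate: the candidate replaces the current instruction only
        # if it strictly improves the held-out score.
        if cand_val > current_val:
            current, current_val = candidate, cand_val
    return current
```

The caller would write the returned string to best_skill.md. The point of the sketch is that the proposer only ever sees training-batch traces, while acceptance depends only on tasks the proposer never saw.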

The three structural elements—validation gate, epoch structure, persistent artifact—are individually known ideas. What SkillOpt does is compose them into a coherent gradient-free optimization loop. The autoresearch loop has none of the three; it proposes, evaluates on the full task suite, and keeps the current-best in a running Python variable.

I2 — Gap 1: Validation Gating

The most immediate gap. In autoresearch, a proposal is accepted if its composite score exceeds the current baseline by a delta threshold. Both training and acceptance use the same eval suite. There is no held-out set; the proposer can and does overfit to the tasks it has been scored on. Experiments 105–106 were a textbook example: the NAMED-SYSTEMS variant scored above baseline on the eval tasks, was accepted, scored below baseline on the next run (different random seed), was discarded, and was then re-proposed unchanged. Two experiments wasted, same instruction at the end.

Figure (I2): Propose-Then-Validate vs. Propose-Then-Accept. SkillOpt (top) screens proposals against a held-out validation set before accepting; the autoresearch loop (bottom) accepts on the same task distribution used for training, with no independent check.

The literature on stopping criteria—even the mathematical optimization literature that our validation-gating sweep accidentally surfaced—is consistent on this point. Kaur et al. (2026, arXiv 2602.22107, "Don't stop me now") ran an empirical study comparing early stopping on validation accuracy versus validation loss across neural classifiers. The punchline: early stopping on accuracy is the worst of the options tested, inferior to loss-based stopping and to post-hoc selection across all epochs. The mechanism is that accuracy is coarser and noisier than loss; a single-epoch accuracy read has high variance and will fire the acceptance criterion on variance rather than signal. For the harness, the analogy is direct: the composite Wiggum score is a coarse aggregate (single LLM judge, three tasks, n≤5 samples). Accepting on a single eval run—which is what experiments 105–106 did—is accepting on noise.

A regret-based stopping criterion from Bayesian optimization (Pruher et al., 2026, arXiv 2605.22561) frames the same issue differently: a stopping rule is sound only if it provides an ε-optimality guarantee with high probability, which requires either a distributional bound on the evaluation function or multiple independent draws. A single composite score from a single eval run provides neither.
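
A back-of-envelope calculation makes the "multiple independent draws" point concrete. The sketch below is illustrative only and rests on assumptions not taken from the harness: the baseline is treated as known, each eval run adds independent Gaussian noise with standard deviation run_sd to the true score, and the proposal has no true effect. The function name false_accept_prob is hypothetical.

```python
from statistics import NormalDist


def false_accept_prob(delta: float, run_sd: float, runs: int = 1) -> float:
    """P(a no-effect proposal clears baseline + delta on every one of `runs` runs).

    Assumption (illustrative): each run's composite score equals the
    baseline plus independent zero-mean Gaussian noise with standard
    deviation `run_sd`.
    """
    p_single = 1.0 - NormalDist(mu=0.0, sigma=run_sd).cdf(delta)
    # Independent runs multiply: all of them must clear the threshold.
    return p_single ** runs


# With no delta threshold, noise alone accepts half of all no-effect proposals
# on one run and a quarter of them when two independent runs must agree.
one_run = false_accept_prob(delta=0.0, run_sd=0.3, runs=1)   # 0.5
two_runs = false_accept_prob(delta=0.0, run_sd=0.3, runs=2)  # 0.25
```

If the runs are correlated (same judge, same tasks, similar sampling), the second run buys less than this independence assumption suggests.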

The TDD governance paper (2026, arXiv 2604.26615) that the gradient-free sweep surfaced—off-target for OPRO/APE but unexpectedly useful here—makes a similar argument from a software engineering angle. Operationalizing TDD in LLM pipelines requires "validation gates" that act as go/no-go checkpoints before a candidate is committed; without them, the pipeline's Red-Green-Refactor cycle collapses into a Red-Red loop where nothing is ever demonstrably green.

What We Built: Kimi Unblock

The change we made this week addresses cycling reactively, not validation proactively. When consecutive_discards ≥ KIMI_STUCK_THRESHOLD (default 6), the loop calls get_kimi_unblock_suggestion()—a cloud model (kimi-k2.5) consulted for an outside perspective on why the proposer is stuck. The suggestion is injected into the next proposer call as {kimi_guidance}.

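import os
import re

import ollama  # Ollama Python client

# KIMI_MODEL and _KIMI_UNBLOCK_PROMPT are defined elsewhere in the harness.
KIMI_STUCK_THRESHOLD = int(os.environ.get("KIMI_STUCK_THRESHOLD", "6"))

def get_kimi_unblock_suggestion(current, history, eval_feedback,
                                consecutive_discards) -> str:
    """Ask the Kimi model why the proposer is stuck; return its suggestion."""
    prompt = _KIMI_UNBLOCK_PROMPT.format(
        synth=current["synth"],
        history=history[-3000:],  # keep only the tail of the history
        eval_feedback=eval_feedback,
        consecutive_discards=consecutive_discards,
    )
    response = ollama.chat(
        model=KIMI_MODEL,
        messages=[{"role": "user", "content": prompt}],
        options={"temperature": 0.7},
    )
    suggestion = response["message"]["content"]
    # Strip any <think>...</think> reasoning block before returning.
    suggestion = re.sub(r"<think>.*?</think>", "", suggestion,
                        flags=re.DOTALL).strip()
    return suggestion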

This is reactive: it fires after the cycle has already consumed 6 experiments. SkillOpt's validation gate is proactive: proposals are rejected before they're committed, before they're added to the history, and before they can mislead the proposer into treating a noise-accepted instruction as a genuine signal. The Kimi unblock prevents indefinite cycling; the validation gate prevents initial acceptance of noisy proposals. Both are needed; we have one.
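
The trigger condition itself is simple bookkeeping. A minimal sketch, assuming the counter resets on any keep (the post does not say whether it also resets after the unblock call fires); the StuckTracker class is a hypothetical name, not part of the harness.

```python
class StuckTracker:
    """Counts consecutive discards and signals when to ask for outside guidance."""

    def __init__(self, threshold: int = 6) -> None:
        self.threshold = threshold
        self.consecutive_discards = 0

    def record(self, kept: bool) -> bool:
        """Update the counter; return True when the unblock call should fire."""
        # Assumption: any keep resets the streak to zero.
        self.consecutive_discards = 0 if kept else self.consecutive_discards + 1
        return self.consecutive_discards >= self.threshold
```

When record returns True, the loop would call get_kimi_unblock_suggestion and pass the result into the next proposer prompt as {kimi_guidance}.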

We also added NAMED-SYSTEMS to the hard-banned list with an explicit note:

HARD_BANNED = """
  - Citing named tools, frameworks, or published benchmarks for each practice
    (NAMED-SYSTEMS angle: tried in exps 105-106; proposer cycled the identical
    proposal two experiments in a row; exp 104 showed zero score variance
    across 5 samples — instruction changes had no measurable effect)
"""

That's a post-hoc fix. The gate would have caught it prospectively, before the proposer had time to re-propose it.

Implementation note: A minimal validation gate for the harness doesn't require a separate held-out task suite, which is fortunate, because we don't have enough tasks for a true train/validation split. The pragmatic version: before committing, re-run the eval with a different random seed and require the proposal to clear baseline on both runs. This cuts the noise-acceptance rate without requiring new eval tasks, provided the two runs are close to independent. The cost is one additional eval run per proposal that clears the first run, not per experiment; at the current discard rate (>80%), fewer than one experiment in five reaches the second run, so the overhead is at most about one extra run per five experiments.
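
A sketch of that two-seed gate, under stated assumptions: evaluate is a hypothetical callable that runs the eval suite for an instruction with a given random seed and returns the composite score, and the function name, seed values, and delta default are placeholders rather than the harness's real names or settings.

```python
from typing import Callable


def passes_two_seed_gate(
    candidate: str,
    baseline_score: float,
    evaluate: Callable[[str, int], float],  # (instruction, seed) -> composite score
    delta: float = 0.0,
    seeds: tuple[int, int] = (0, 1),
) -> bool:
    """Accept only if the candidate clears baseline + delta on both seeds."""
    first, second = seeds
    if evaluate(candidate, first) <= baseline_score + delta:
        # Most proposals are discarded here, with no extra eval cost.
        return False
    # Confirmation run: only reached by proposals that passed the first run.
    return evaluate(candidate, second) > baseline_score + delta
```

Short-circuiting on the first run is what keeps the overhead proportional to the number of first-run passes rather than the number of experiments.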

I3 — Gap 2: Fast/Slow Epoch Structure

SkillOpt's epoch loop operates at two timescales. The fast timescale is within-epoch: the proposer sees per-instance rollout traces and generates a candidate revision. The slow timescale is cross-epoch: best_skill.md is only updated when a candidate clears the validation gate, which happens at most once per epoch, and often less frequently. The gap between timescales means the slow update is a genuine signal—it only fires when there's enough evidence across an epoch's batch to justify a change.

The autoresearch loop has one timescale. Every experiment is both a fast update (propose) and a slow update (commit if kept). There is no batch aggregation across experiments before proposing; the proposer sees a rolling window of recent eval results, not a structured epoch-batch summary. This makes the proposer reactive to last-run noise rather than responsive to epoch-level trends.

The meta-learning literature explains why two timescales help. The "Reusable Options via Gradient-based Meta Learning" paper (Harb et al., 2022, arXiv 2212.11726) demonstrates that temporal abstractions (options in hierarchical RL) that are learned and transferred across tasks require a fast adaptation loop (task-level) and a slow update loop (option-level). When the two loops are collapsed to one, options overfit to the current task and lose transferability. The analogy to the autoresearch loop is direct: the "option" is the instruction strategy (e.g., "add implementation steps with parameter values"), and collapsing fast and slow updates means the proposer re-discovers the same strategy from scratch each time rather than building on a stable cross-experiment signal.

The meta-gradient RL literature provides a related structural insight. Flennerhag et al. (2022, arXiv 2211.10550) show that when the inner and outer optimization loops share the same discount factor—the same timescale—the outer update develops a systematic bias toward myopic policies. The debiasing fix is an additional "outer value function" that runs at the slower timescale and corrects for the mismatch. In autoresearch terms: using eval feedback from the last experiment to propose the next instruction conflates fast (per-experiment) and slow (per-strategy) signals, and the proposer develops a systematic bias toward last-run performance rather than epoch-level trends.

The Tutorial on Meta-Reinforcement Learning (Beck et al., 2023, arXiv 2301.08028) frames this as the core challenge of meta-RL: the inner loop adapts a policy to a task; the outer loop updates the meta-parameters that make inner-loop adaptation fast. The two loops must be kept structurally separate to avoid conflating task-specific adaptation with meta-level learning. This is exactly the separation SkillOpt enforces and the autoresearch loop doesn't.

Aspect	SkillOpt	autoresearch.py
Fast (within-epoch)	Per-instance rollout traces → proposer	Per-experiment eval feedback → proposer (same)
Slow (cross-epoch)	Validation-gated update of `best_skill.md`	Not implemented — every keep is both timescales
Proposer input	Epoch-batch summary + per-instance failures	Rolling window of recent experiments (last 5)
Commit criterion	Validation-gated improvement	delta > threshold on single eval run

Adding an epoch structure to autoresearch doesn't require rewriting the loop. A minimal version: define an epoch as N experiments (e.g., N=10). Within the epoch, the loop runs as normal—propose, eval, keep/discard. At epoch boundary, the proposer receives a summary of the epoch's keeps and discards (not just the last-5 window) and generates an epoch-level candidate. The epoch-level candidate is evaluated with a different random seed (the validation gate from Gap 1). If it clears, it becomes the new baseline for the next epoch. If it doesn't, the epoch-level baseline rolls back to the pre-epoch instruction. This gives the slow update its own signal without requiring a separate held-out task suite.
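
The epoch wrapper described above might look like the following. It is a sketch, not the harness's code: run_epoch, step, propose_from_epoch, and gate are hypothetical names, where step runs one ordinary experiment and reports whether its proposal was kept, and gate is the different-seed check from Gap 1.

```python
from typing import Callable


def run_epoch(
    pre_epoch: str,
    n_experiments: int,
    step: Callable[[str], tuple[str, bool]],  # current -> (proposal, kept?)
    propose_from_epoch: Callable[[str, list[dict]], str],  # (current, records) -> candidate
    gate: Callable[[str], bool],  # candidate -> passes validation?
) -> str:
    """Run one epoch and return the baseline for the next epoch."""
    current = pre_epoch
    records: list[dict] = []
    # Fast timescale: the loop runs as normal (propose, eval, keep/discard).
    for i in range(n_experiments):
        proposal, kept = step(current)
        records.append({"experiment": i, "proposal": proposal, "kept": kept})
        if kept:
            current = proposal
    # Slow timescale: one epoch-level candidate built from the whole
    # epoch's keeps and discards, not just a rolling window.
    candidate = propose_from_epoch(current, records)
    # Gated commit; on failure, roll back to the pre-epoch instruction.
    return candidate if gate(candidate) else pre_epoch
```

Note the rollback discards within-epoch keeps as well, which is the behaviour the paragraph above describes; whether that is too aggressive is a tuning question.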

Lit review note: The hierarchical multi-timescale sweep targeted LLM-specific fast/slow update schedules; it pulled the meta-RL literature instead. The MAML and Reptile comparison paper (arXiv 2310.06148) is a useful secondary source: it shows that MAML's inner-loop specialization produces less diverse features than fine-tuning when the test distribution shifts, which maps to the proposer overfitting problem—a proposer that adapts too fast to per-experiment signals loses the diversity needed to escape local optima. The epoch structure is a structural regularizer against this.

I4 — Gap 3: Skill as Persistent Artifact

SkillOpt writes best_skill.md. This is not a minor implementation detail. It means:

The best instruction survives a process crash and a machine restart.
It can be loaded by an agent that was never part of the training run.
It has a human-readable name that can be versioned, diffed, and rolled back independently of the training log.
Multiple skills can coexist as multiple files without coupling their update loops.

The autoresearch loop splits the best instruction's state across two places: the running Python variable baseline_score tracks the numeric score, and the committed git state tracks the instruction text. Neither is a named, self-describing artifact. If you want to know what SYNTH_INSTRUCTION looked like at experiment 87, you git log the file; if you want to load it in a new process, you read agent.py and extract the string manually. The instruction is implicitly versioned by git, not explicitly versioned as a skill artifact with its own identity.

The Procedural Knowledge Libraries paper (Kapoor et al., 2025, arXiv 2506.14715) makes this distinction precisely. Traditional knowledge management stores end products—the final instruction, the final model, the published paper. What it discards is the full process record: the hypotheses that were tried, the failures that informed the final state, the decisions that were made at each branch point. PKLs argue for capturing the full arc in a structured storage schema; the end product is one entry in a larger procedural record, not a standalone artifact.

For the autoresearch loop, autoresearch.tsv is an incomplete PKL. It records per-experiment scores, descriptions, and keep/discard status—but it doesn't record which instruction text produced which score in a way that's independently loadable. To reconstruct experiment 87's instruction, you need both the TSV (for metadata) and the git log (for text). The TSV is a log; it's not a skill artifact.

UI-Voyager (Yan et al., 2026, arXiv 2603.24533) provides a complementary framing from the mobile GUI agent literature. UI-Voyager builds a self-evolving skill library through two stages: Rejection Fine-Tuning (RFT) that learns from failed trajectories, and Group Relative Self-Distillation (GRSD) that refines learned skills against successful ones. The key architectural choice: skills are stored as named, loadable entries in a skill library, not as weights in a single model. When a skill fails on a new task instance, the skill entry—not the whole model—is the unit of update. This decomposition is what makes UI-Voyager achieve 81% pass@1 on AndroidWorld without retraining the base model.

The analogous decomposition for autoresearch: SYNTH_INSTRUCTION, SYNTH_COUNT, and SYNTH_PROSE are currently embedded in agent.py as Python string constants. They should be a named file—skills/synthesis.md or similar—loaded at runtime, versioned by a hash, and loadable by any process that needs the current best synthesis skill. The training loop writes the file on each validated keep; the agent reads it on startup. The git history for agent.py becomes irrelevant for skill tracking; the skill file has its own history.

The Agentic Skill Discovery paper (Wang et al., 2024, arXiv 2405.15019) shows that skills built incrementally from zero—starting with no predefined library and growing through LLM-generated task proposals—are more transferable than skills defined upfront, because each skill entry is validated against a specific set of successful trajectories rather than specified by hand. The autoresearch loop builds one skill incrementally; it doesn't grow a library. But the transferability argument applies: a skill entry backed by a validated set of trajectories is more reusable than a string constant that happens to have a high composite score on the last eval run.

Figure (I4): Instruction-in-Code vs. Skill-as-Artifact. Current state (left): instruction text lives inside agent.py, retrievable only via git log. Proposed state (right): a named skill file, written on each validated keep, loadable independently of the training loop.

The Minimal Implementation

The change is small. Add a skills/ directory. On each keep, write the accepted instructions to skills/synthesis.md with a YAML front matter block that records the experiment number, composite score, and the task IDs it was validated on:

---
experiment: 112
score: 8.91
tasks: [T_A, T_B, T_C, T_D, T_E]
validated: true
date: 2026-05-25
---

# Synthesis Instruction

[instruction text here]
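
The write side could be sketched as follows. This is an assumption-laden illustration, not the harness's implementation: write_synthesis_skill is a hypothetical name, the sha256 field is an extra that does not appear in the example above (added because the skill is meant to be "versioned by a hash"), and task IDs are assumed to be plain tokens that need no YAML quoting.

```python
import hashlib
import os
from datetime import date
from typing import Optional, Sequence


def write_synthesis_skill(
    instruction: str,
    experiment: int,
    score: float,
    tasks: Sequence[str],
    path: str = "skills/synthesis.md",
    today: Optional[date] = None,
) -> str:
    """Write the skill file with front matter; return a short content hash."""
    text = instruction.strip()
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    front = "\n".join([
        "---",
        f"experiment: {experiment}",
        f"score: {score}",
        f"tasks: [{', '.join(tasks)}]",
        "validated: true",
        f"date: {(today or date.today()).isoformat()}",
        f"sha256: {digest}",  # hypothetical extra field
        "---",
    ])
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        f.write(f"{front}\n\n# Synthesis Instruction\n\n{text}\n")
    # Rename into place so a crash mid-write leaves the old file intact.
    os.replace(tmp, path)
    return digest
```

Writing to a temporary file and renaming it is a small cost for the property the section cares about: the best instruction survives a process crash.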

In agent.py, replace the embedded string constant with a loader:

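import yaml  # PyYAML

def load_synthesis_skill(path: str = "skills/synthesis.md") -> dict:
    """Parse the YAML front matter and instruction body of a skill file."""
    with open(path, encoding="utf-8") as f:
        raw = f.read()
    # Split at the closing front-matter delimiter (first "\n---\n").
    front, sep, body = raw.partition("\n---\n")
    if not sep or not front.startswith("---"):
        raise ValueError(f"{path}: missing YAML front matter")
    # removeprefix (Python 3.9+) drops only the opening delimiter;
    # lstrip("---\n") would strip any run of '-' and newline characters.
    meta = yaml.safe_load(front.removeprefix("---")) or {}
    meta["instruction"] = body.strip()
    return meta

SYNTHESIS = load_synthesis_skill()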

The training loop now has a single write target. The agent has a single read source. The git history for agent.py is no longer the versioning mechanism for the skill; the history of skills/synthesis.md is. And if you want to roll back to experiment 87's instruction, you git log -- skills/synthesis.md and check out that hash directly, without grepping through agent.py's change history.

What the Literature Leaves Open

SkillOpt's validation gate uses a held-out set drawn from the same distribution as the training batch. For harnesses with small, fixed task suites (five tasks is typical), a train/validation split reduces the training signal to three or four tasks. Is a different-seed re-run on the same tasks a sound substitute for a true held-out set, or does it just sample the same noise distribution?
The fast/slow epoch structure works when the inner loop (fast) and outer loop (slow) have genuinely different update targets. In autoresearch, both loops update the same instruction text. Is there a meaningful decomposition of the instruction into a fast-updated component (e.g., task-specific phrasing) and a slow-updated component (e.g., structural strategy) that would make the two-timescale structure non-trivial?
The Kimi unblock is a reactive intervention: it fires after cycling is detected. SkillOpt's validation gate is proactive: it prevents noise-accepted proposals from entering the history. Is there a way to combine both—proactive gating and reactive diversification when the gate is consistently rejecting? Consistent rejection (no proposal clears the gate in N epochs) is a different failure mode from cycling (proposals accepted and discarded alternately), and they may need different interventions.
Skill files and PKLs capture the end state of a training run. The process—the sequence of discarded proposals, the reasons for each discard, the feedback that shifted the proposer's direction—is in autoresearch.tsv but not in skills/synthesis.md. For a PKL to be genuinely useful (Kapoor et al., 2025), both are needed. What is the right schema for a skill entry that includes enough process context to explain why the current instruction is what it is, without becoming a full log dump?
UI-Voyager's two-stage learning (RFT then GRSD) separates failure learning from success distillation. The autoresearch loop does neither explicitly—the proposer receives mixed feedback from kept and discarded experiments. Is there a benefit to structuring the proposer prompt as two stages: first, extract what failed and why; then, propose a change that addresses the failures while preserving what succeeded? This is a prompting change, not a loop change, and could be implemented without any of the structural changes above.
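
As a starting point for the last question, a two-stage proposer prompt could be assembled as below. Everything here is hypothetical: the function name, the prompt wording, and the record keys (id, score, description, kept), which are assumed to mirror the columns the post says autoresearch.tsv records.

```python
def build_two_stage_prompt(current: str, experiments: list[dict]) -> str:
    """Build a proposer prompt that separates failure analysis from the proposal."""

    def fmt(rows: list[dict]) -> str:
        lines = [
            f"- exp {e['id']} (score {e['score']}): {e['description']}"
            for e in rows
        ]
        return "\n".join(lines) or "- (none)"

    # Split the mixed feedback into the two groups the stages operate on.
    kept = [e for e in experiments if e["kept"]]
    discarded = [e for e in experiments if not e["kept"]]
    return (
        f"Current instruction:\n{current}\n\n"
        f"Discarded experiments:\n{fmt(discarded)}\n\n"
        f"Kept experiments:\n{fmt(kept)}\n\n"
        "Stage 1: For each discarded experiment, state what failed and why.\n"
        "Stage 2: Propose one change that addresses those failures while "
        "preserving what the kept experiments did well.\n"
    )
```

Because this only changes the string handed to the proposer, it can be tried without the gate, the epoch structure, or the skill file in place.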
