
Closing the SkillOpt Gaps: What Actually Shipped

A previous post compared the harness autoresearch loop to Microsoft’s SkillOpt (arXiv:2605.23904) and identified three concrete gaps: proactive validation gating before the full eval suite runs, a fast/slow epoch structure to amortize expensive evaluation, and a persistent skill artifact that is versioned separately from the optimization history. This post documents what was actually implemented for each gap, and what remains open.

| Gap | SkillOpt design | Status |
|---|---|---|
| Proactive validation gating | Screen proposals on held-out tasks before committing to full eval | Shipped (adapted) |
| Fast/slow epoch structure | Cheap fast gate; full eval only for survivors | Partial |
| Persistent skill artifact | Versioned best_skill.md separate from optimization history | Open |

Gap 1: Proactive validation gating — shipped as _validate_proposal()

SkillOpt’s validation gate tests a proposed skill update on held-out tasks before running the full evaluation suite, rejecting proposals that fail a cheap probe. The intent is to catch obviously broken instructions without spending 15–30 minutes on the full eval.

The harness implementation diverges from that design but addresses the same failure mode. Instead of a held-out task probe, _validate_proposal() applies two mechanical checks that target the proposal types known to fail most reliably:

1. Routing check. If the proposer changed an instruction that no active eval task exercises, the proposal is rejected immediately. This prevents the common failure where the model correctly identifies an improvement to SYNTH_INSTRUCTION_PROSE but the running task set only invokes SYNTH_INSTRUCTION—the change would never fire and the eval score would be unchanged regardless of its quality.

active_keys = _active_instruction_keys(task_ids)
changed_keys = {k for k in ("synth", "synth_count", "synth_prose")
                if result.get(k, "") != current.get(k, "")}
if changed_keys and changed_keys.isdisjoint(active_keys):
    return f"routing violation: changed {changed_keys} but active tasks use {active_keys}"

2. Ban-list pattern check. Three compiled regex patterns match the structural signatures of instruction families that historically produced attractor lock—proposals that passed the evaluator but induced identical outputs across diverse tasks. Any proposal whose synth_prose matches a banned pattern is rejected without running the eval suite.

_PROSE_BAN_PATTERNS = [
    ("narrative/continuous-flow",
     r"\b(cohesive\s+narrative|continuous\s+(flow|paragraph)|...)\b"),
    ("sequential/process-steps",
     r"\b(distinct\s+step\s+in\s+a\s+process|sequential\s+(step|depend|...))\b"),
    ("logical-chain",
     r"\b(logical\s+chain|chain.of.thought\s+narrative|cause.and.effect|...)\b"),
]
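The two checks above can be combined into one self-contained function. The following is a minimal sketch, not the harness's actual code: the task-to-key mapping `TASK_INSTRUCTION_KEY`, the task ids, and the function names without leading underscores are hypothetical, and the ban list is reduced to the visible alternatives of the first pattern only (the elided alternatives are not reproduced).

```python
import re

# Hypothetical mapping from eval task id to the instruction key it exercises.
TASK_INSTRUCTION_KEY = {
    "task_summary": "synth",
    "task_count": "synth_count",
    "task_essay": "synth_prose",
}

INSTRUCTION_KEYS = ("synth", "synth_count", "synth_prose")

# Illustrative subset of the ban list: one pattern, compiled once at import time.
PROSE_BAN_PATTERNS = [
    ("narrative/continuous-flow",
     re.compile(r"\b(cohesive\s+narrative|continuous\s+(flow|paragraph))\b",
                re.IGNORECASE)),
]


def active_instruction_keys(task_ids):
    """Return the set of instruction keys exercised by the given tasks."""
    return {TASK_INSTRUCTION_KEY[t] for t in task_ids}


def validate_proposal(result, current, task_ids):
    """Return a rejection reason, or None if the proposal passes both checks."""
    # Check 1: routing. Reject if every changed key is unused by the active tasks.
    active = active_instruction_keys(task_ids)
    changed = {k for k in INSTRUCTION_KEYS
               if result.get(k, "") != current.get(k, "")}
    if changed and changed.isdisjoint(active):
        return (f"routing violation: changed {sorted(changed)} "
                f"but active tasks use {sorted(active)}")

    # Check 2: ban list. Reject if the prose instruction matches a known signature.
    prose = result.get("synth_prose", "")
    for name, pattern in PROSE_BAN_PATTERNS:
        if pattern.search(prose):
            return f"banned pattern: {name}"
    return None
```

Returning a reason string rather than a boolean lets the caller log why a proposal was rejected and, if desired, feed that reason back to the proposer.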

The difference from SkillOpt’s design is that these checks are heuristic, not empirical. They do not run the proposal on a held-out sample—they pattern-match against known failure signatures accumulated over 100+ experiments. This is cheaper (no LLM calls, no eval runs) but narrower: a novel failure mode that does not match any banned pattern will not be caught. The check is a mechanical backstop, not a general validity gate.

A secondary benefit of the routing check: it surfaces when the proposer is confused about which instruction path the running tasks use. This confusion was a silent failure mode before the check was added—proposals that changed the wrong instruction always scored equal to baseline and were discarded, but the proposer would continue generating the same family of wrong-target proposals.

Gap 2: Fast/slow epoch structure — partial

SkillOpt uses a two-phase structure: a fast epoch runs a small subset of tasks to screen candidates, and the slow epoch runs the full suite only for candidates that survive the fast gate. This amortizes the cost of evaluation across proposals, concentrating compute on candidates most likely to advance the baseline.

The harness has a MINIBATCH_FLOOR check that is structurally similar: a 2-task quick eval runs before the full 3-task eval, and proposals scoring below an absolute floor (6.5) are discarded without the full run. This is a fast gate.

MINIBATCH_FLOOR = 6.5  # absolute floor — clearly broken instruction

quick = run_eval(task_ids=QUICK_TASKS)  # 2 tasks
if quick < MINIBATCH_FLOOR:
    print(f"  [screen] {quick:.3f} < floor {MINIBATCH_FLOOR} — skip full eval")
    discard()
    continue

full = run_eval(task_ids=ALL_TASKS)  # 3 tasks

What is missing from the SkillOpt design is the relative fast gate—using the fast epoch score as a rank signal, not just an absolute cutoff. SkillOpt’s fast gate selects the top-K candidates from a pool before the slow epoch runs; the harness evaluates one candidate at a time and applies only a floor check. In a regime where all candidates cluster above the floor (as happened after experiment 90, when the baseline stabilized at 8.2–8.6), the floor check fires rarely and the fast gate provides little cost savings.

The open design question: a relative fast gate would require evaluating multiple candidates in a batch before selecting survivors. The current loop is serial—propose, eval, keep/discard, repeat. Batching proposals would require holding several candidate instructions in memory simultaneously and deferring the keep/discard decision until all are scored, which changes the loop’s state machine substantially.
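To make the batching requirement concrete, here is a hedged sketch of what a relative fast gate could look like. It is not part of the harness: `select_survivors`, `run_batch`, the `top_k` parameter, and the callable-based eval interface are all assumptions introduced for illustration. The floor of 6.5 mirrors MINIBATCH_FLOOR.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def select_survivors(
    candidates: Sequence[str],
    quick_eval: Callable[[str], float],
    floor: float,
    top_k: int,
) -> List[Tuple[str, float]]:
    """Score every candidate on the fast gate and keep the top_k above the floor."""
    scored = [(c, quick_eval(c)) for c in candidates]
    above = [(c, s) for c, s in scored if s >= floor]
    # Rank by fast-gate score, best first; this is the relative signal.
    above.sort(key=lambda pair: pair[1], reverse=True)
    return above[:top_k]


def run_batch(
    candidates: Sequence[str],
    quick_eval: Callable[[str], float],
    full_eval: Callable[[str], float],
    baseline: float,
    floor: float = 6.5,
    top_k: int = 2,
) -> Optional[Tuple[str, float]]:
    """Run the full eval only on survivors; return the best one that beats baseline."""
    best = None
    for cand, _quick_score in select_survivors(candidates, quick_eval, floor, top_k):
        score = full_eval(cand)
        if score > baseline and (best is None or score > best[1]):
            best = (cand, score)
    # The keep/discard decision is deferred until every survivor has been scored.
    return best
```

With four candidates and `top_k=2`, the full eval runs twice instead of four times, even when all four clear the absolute floor. The cost is the deferred decision noted above: nothing is kept or discarded until the whole batch is scored.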

Gap 3: Persistent skill artifact — still open

SkillOpt maintains a best_skill.md file that is versioned separately from the optimization log. The skill artifact is what gets deployed; the optimization history is what the proposer reads. This separation means rollback is explicit (point to a prior artifact version) rather than implicit (grep git log for the last accepted commit).

In the harness, the current best instruction is a sentinel-delimited string inside agent.py:

# AUTORESEARCH:SYNTH_INSTRUCTION:BEGIN
SYNTH_INSTRUCTION = """
Write a structured synthesis...
"""
# AUTORESEARCH:SYNTH_INSTRUCTION:END

The autoresearch loop edits this block in-place, commits the change, and uses git checkout -- agent.py to roll back discards. The “best skill” is always whatever is in agent.py on the current branch; there is no separate artifact file that can be pinned, compared, or deployed independently.
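For readers unfamiliar with sentinel-delimited edits, the in-place rewrite can be sketched roughly as follows. This is an assumption about the mechanism, not the harness's actual code; `replace_sentinel_block` is a hypothetical helper that relies only on the BEGIN/END comment format shown above.

```python
import re


def replace_sentinel_block(source: str, name: str, new_text: str) -> str:
    """Replace the body between the BEGIN/END sentinels for `name`.

    Assumes new_text contains no triple double-quote sequence, which would
    terminate the generated string literal early.
    """
    begin = f"# AUTORESEARCH:{name}:BEGIN"
    end = f"# AUTORESEARCH:{name}:END"
    pattern = re.compile(
        rf"({re.escape(begin)}\n).*?(\n{re.escape(end)})",
        re.DOTALL,
    )
    body = f'{name} = """\n{new_text}\n"""'
    # A function replacement avoids backslash interpretation in new_text.
    new_source, count = pattern.subn(
        lambda m: m.group(1) + body + m.group(2), source, count=1
    )
    if count != 1:
        raise ValueError(f"sentinel block {name!r} not found")
    return new_source
```

Because the sentinel comments include the `:BEGIN` and `:END` suffixes, a block named SYNTH_INSTRUCTION does not collide with one named SYNTH_INSTRUCTION_PROSE.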

This creates two practical limitations. First, identifying the best instruction across a long experiment run requires scanning git log, which is not designed for this query. Second, there is no mechanism to revert to an instruction from 40 experiments ago without manually checking out that commit and cherry-picking the sentinel block—a workflow that does not survive branch changes.

The cleanest fix would be to extract the three sentinel blocks into a dedicated harness/synthesis_instructions.py file that autoresearch edits in isolation. The log JSONL already records every accepted instruction in synth, synth_count, and synth_prose fields, so the data for a proper artifact store exists—it just is not materialized into a queryable file.
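As a rough illustration of materializing that data, the sketch below walks a JSONL log and writes one versioned artifact file per accepted record plus an index. Only the synth, synth_count, and synth_prose field names come from the log described above; the `accepted` and `score` fields, the file naming scheme, and the `materialize_artifacts` function are assumptions made for the example.

```python
import json
from pathlib import Path

INSTRUCTION_KEYS = ("synth", "synth_count", "synth_prose")


def materialize_artifacts(log_path, out_dir):
    """Write one artifact per accepted log record and return the version index."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    index = []
    with open(log_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            rec = json.loads(line)
            # Assumed field: the log marks which experiments were accepted.
            if not rec.get("accepted"):
                continue
            version = len(index) + 1
            artifact = {k: rec.get(k, "") for k in INSTRUCTION_KEYS}
            path = out / f"skill_v{version:04d}.json"
            path.write_text(json.dumps(artifact, indent=2), encoding="utf-8")
            # Assumed field: the log records the eval score for each experiment.
            index.append({"version": version,
                          "score": rec.get("score"),
                          "file": path.name})
    (out / "index.json").write_text(json.dumps(index, indent=2), encoding="utf-8")
    return index
```

With an index like this, reverting to an instruction from 40 experiments ago becomes a lookup by version number rather than a search through git log.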

The Kimi unblock: an emergent replacement for Gap 4

SkillOpt’s design also included a fourth element beyond the three gaps above: a global exit when convergence was detected, that is, a hard stop requiring human intervention. The harness replaced this with a cloud model consultation (described in detail separately): when consecutive_discards ≥ KIMI_STUCK_THRESHOLD (default 6), the loop pauses and queries kimi-k2.5:cloud for a fresh proposal direction.

This is a meaningfully different design philosophy. The global exit treats convergence as a terminal state. The Kimi unblock treats it as a local minimum navigable by querying an oracle with a broader prior—a cloud model that has seen more instruction patterns than the local proposer. The tradeoff is that the loop never terminates cleanly: it continues running, potentially cycling through Kimi suggestions until a new local minimum is found or the operator kills the process.
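The stuck trigger described above can be sketched as a small loop. This is an illustration under stated assumptions rather than the harness's code: the `propose`, `consult_oracle`, and `evaluate` callables are hypothetical stand-ins, resetting the counter after a consultation is assumed, and the `max_experiments` bound exists only so the sketch terminates (the real loop, as noted, does not stop on its own).

```python
KIMI_STUCK_THRESHOLD = 6  # default from the post


def run_loop(propose, consult_oracle, evaluate, baseline, max_experiments):
    """Serial propose/eval loop with an oracle consultation when stuck.

    Returns the final baseline and the number of oracle consultations.
    """
    consecutive_discards = 0
    consultations = 0
    hint = None
    for _ in range(max_experiments):
        if consecutive_discards >= KIMI_STUCK_THRESHOLD:
            # Stuck: ask the oracle for a fresh direction instead of exiting.
            hint = consult_oracle()
            consultations += 1
            consecutive_discards = 0  # assumed: the counter restarts here
        candidate = propose(hint)
        score = evaluate(candidate)
        if score > baseline:
            baseline = score          # keep
            consecutive_discards = 0
            hint = None
        else:
            consecutive_discards += 1  # discard
    return baseline, consultations
```

A global exit would replace the consultation branch with a `break`; the difference in philosophy comes down to that one branch.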