May 24, 2026 • 14 min read • Agentic Harness Engineering

Seven Principles and a Moving Frontier

The harness roadmap has been rewritten twice. The North Star and the seven guiding principles have not changed once. Here is what that tells you about how to build self-improving systems.

Two snapshots of the same project, separated by roughly eight months of development. The first roadmap was a plan. The second is a plan plus a record — milestones checked off, new milestones discovered, and an ever-more-detailed picture of what Stage 4 actually requires. The text of the North Star goal is identical in both. The seven guiding principles are identical word-for-word. What changed is everything below them.

This post traces the evolution: what was there from the beginning, what each completed milestone revealed, what new milestones that forced, and what remains. It is a companion to the architecture series — less about how each component works and more about why the roadmap looks the way it does.

The Constants

The North Star has not changed:

"A locally-running swarm of specialized agents that iteratively improves its own harness, models, and capabilities — without human intervention beyond goal-setting and checkpoint approval. The loop closes continuously: run → preference data → fine-tune → hot-swap → benchmark → promote or revert. Human checkpoints at goal-setting and promotion; everything between is autonomous."

What follows from it — that the target hardware is a single RTX 5000 Ada (63.8GB VRAM), that the scaffolding is largely buildable before the models are capable enough to use it, that human checkpoints exist at goal-setting and promotion only — has also not changed. These are statements about what the system is for, not how to build it. They do not become obsolete when the implementation details shift.

The seven guiding principles are the same in both roadmaps. They are worth reading as a set, because they form a coherent system rather than a list of maxims:

Principle 1

Build for deletion. Every workaround exists because models can't yet handle it natively. Design so the workaround is trivially removable when the model improves.

Principle 2

Verify externally at every stage boundary. The model's self-report is not verification.

Principle 3

Add observability before adding features. Structured traces before new tools. Logging is not optional.

Principle 4

Evaluator and producer must be different models. Same-model evaluation is circular.

Principle 5

The harness is the product. The model is a commodity input. Reliability lives in the harness.

Principle 6

Every manual hand-off is a loop that hasn't closed yet. Each one is a target for automation.

Principle 7

Telemetry is what separates a critic from a scorer. Typed event traces tell you why — the self-improvement loop stalls without this signal.

Principles 1, 3, and 7 form a cluster about observability and impermanence: instrument everything, build to delete, distinguish scoring from understanding. Principles 2 and 4 form a cluster about verification: external checks, model separation. Principles 5 and 6 are the strategic posture: the harness accumulates value; manual hand-offs are technical debt.

Taken together, they answer every "why did you build it this way?" question in the architecture series. They also explain every new milestone that appeared in the second roadmap.

The Original Roadmap: Three Milestones and a Horizon

The first roadmap had a simple three-milestone structure leading to Stage 4:

0.1.0 PyPI release. Deck skill implementation, package_data for templates, bundled dashboard. Gate: oh --help and /introspect work from a pip install.
0.2.0 Dashboard completeness. Sessions, Artifacts, Analytics, Submit live streaming, MCP inspector. Gate: all views wired and showing real data.
0.3.0 Self-improvement loop. Plan approval, search result cache, chunked URL retrieval, autoresearch stall replan. Gate: eval-driven improvement runs without human intervention per iteration.
Stage 4 Autonomous swarm. Proposer/Executor/Critic loop, Git worktrees as isolation substrate, vLLM hot-swap, A2A protocol, Docker sandbox. Gate: the harness improves itself without human involvement between checkpoints.

This structure was right. The progression — get the package installable, make the dashboard complete, close the improvement loop, then build the swarm — is the correct order. What the original roadmap underspecified was the gap between 0.3.0 and Stage 4. That gap turned out to contain several entire milestones' worth of work.

The original skill inventory was also essentially correct: research, /lit-review, /annotate, /browser, /design, /build-page, /site, /deck, /transcribe, /recall, /introspect, /email, /orientation, /debug, /sync-wiki, /panel. These sixteen skills covered the breadth of what a research and production-support agent needs. The evolved roadmap added three more: /curate, /grill-me, and /onboarding — each discovered by hitting a real gap, not by speculative feature planning.

What Each Milestone Revealed

Each completed milestone exposed a constraint that was invisible before the milestone was done. This is not a failure of planning — it is how iterative systems work. The constraint could only be seen after the instrumentation to see it was in place.

0.1.0 → 0.2.0: The dashboard gap was real

Getting the harness installable as a proper package revealed that the original Flask/Jinja dashboard was not fit for production operation. The React/TypeScript rewrite was already planned, but the dashboard completeness milestone (all views wired, paginated, and showing real data) was underspecified. Sessions, Artifacts, and Analytics were placeholders; Submit had no live streaming. These gaps were found by using the installed package, not by reading the code.

Principle 3 applied: observability before features. The dashboard is the observability layer for the harness operator. Without it complete, the operator is flying blind during the self-improvement loop.

0.2.0 → 0.3.0: The search pipeline had hidden bottlenecks

With the dashboard complete and runs.jsonl being populated, two patterns became visible that were invisible before:

Both were added to 0.3.0 mid-milestone, because Principle 3 had done its job: the traces told the story.

0.3.0 → 0.3.5: Orchestration needed a policy, not a fixed fan-out

The original orchestration design was always-parallel: if the planner returned subtasks, they all ran concurrently. Running 0.3.0 at scale revealed that this was wrong for sequential-dependency tasks and wasteful for single-focus tasks. The planner — which already has full task and memory context — is the right place to make the orchestration decision. The 0.3.5 milestone added policy-driven orchestration: orchestration_style and allow_parallelism to the Plan dataclass, planner-produced rather than hardcoded.
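A minimal sketch of what planner-produced policy could look like. Only `Plan`, `orchestration_style`, and `allow_parallelism` come from the roadmap; the field values, the `subtasks` field, and the `run_plan` helper are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Plan:
    subtasks: List[str] = field(default_factory=list)
    # Assumed values: "sequential" or "parallel". The planner sets these;
    # the executor never hardcodes them.
    orchestration_style: str = "sequential"
    allow_parallelism: bool = False


def run_plan(plan: Plan, run_subtask: Callable[[str], str], max_workers: int = 4) -> List[str]:
    """Fan out only when the planner asked for it and there is more than one subtask."""
    wants_parallel = plan.orchestration_style == "parallel" and plan.allow_parallelism
    if wants_parallel and len(plan.subtasks) > 1:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # pool.map preserves input order in its results.
            return list(pool.map(run_subtask, plan.subtasks))
    # Sequential-dependency and single-focus tasks take this path.
    return [run_subtask(task) for task in plan.subtasks]
```

The point of the design is that the default is the conservative path: a plan that says nothing about parallelism runs sequentially.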

This milestone also added /grill-me and /onboarding — both discovered as real operational gaps. /grill-me emerged from noticing that the agent was making assumptions about ambiguous task strings rather than asking. /onboarding emerged from /grill-me: if the agent can interview a user to understand a task, it can also interview them once at setup to understand who they are.

0.3.5 → 0.3.6: The critic infrastructure was missing

The autoresearch loop was running and producing improvement signals. But it was producing scalar improvement signals — a composite score — not structured explanations of why a mutation worked or failed. Principle 7 names this exactly: telemetry is what separates a critic from a scorer.

Milestone 0.3.6 added the infrastructure for a real critic: logger policy fields (which orchestration style was chosen, what subtask results were, what tool failures occurred), a unified structured event protocol ([EVENT]<json> lines from every pipeline stage), and evaluator rotation to break evaluator monoculture. These are prerequisites for Stage 4 — without them, the Proposer/Executor/Critic loop has no signal richer than a scalar.
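The event protocol is simple enough to sketch. The `[EVENT]` prefix followed by JSON is from the roadmap; the `stage` and `kind` field names and both helper functions are hypothetical, chosen only to show the shape.

```python
import json
from typing import Iterable, Iterator

EVENT_PREFIX = "[EVENT]"


def format_event(stage: str, kind: str, **payload) -> str:
    """Render one event as a single '[EVENT]<json>' line (field names are assumed)."""
    return EVENT_PREFIX + json.dumps({"stage": stage, "kind": kind, **payload}, sort_keys=True)


def parse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded events from mixed log output, skipping everything else."""
    for line in lines:
        line = line.strip()
        if not line.startswith(EVENT_PREFIX):
            continue  # ordinary log line
        try:
            yield json.loads(line[len(EVENT_PREFIX):])
        except json.JSONDecodeError:
            continue  # malformed event: drop it rather than crash the consumer
```

Because every stage writes the same line format, a critic can consume one stream instead of parsing per-stage log conventions.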

This milestone also specified two new items that were not in the original roadmap at all:

0.3.6 → 0.3.7: The memory system was degrading silently

The ChromaDB semantic memory was working. Memories were accumulating. But with no quality signal and no user-facing window, low-quality memories were competing at equal weight with high-quality ones in retrieval. This was invisible until 0.3.6's observability additions made it visible.

Milestone 0.3.7 added the full memory observability stack: run_id provenance (every memory traceable to the run that produced it), RLHF quality signals (thumbs up/down adjusting a quality score), quality-weighted retrieval reranking (adjusted = cosine_similarity × max(0.2, 1 + quality × 0.15)), a Memory panel in the dashboard, and a UMAP+D3 ontology visualization on top of the existing ChromaDB embeddings.
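The reranking formula translates directly into code. The formula itself is from the roadmap; the function names and the candidate tuple layout are assumptions, and the range of `quality` is not stated in the source.

```python
from typing import List, Tuple


def adjusted_score(cosine_similarity: float, quality: float) -> float:
    """adjusted = cosine_similarity * max(0.2, 1 + quality * 0.15)."""
    # The 0.2 floor means heavily downvoted memories are damped, never zeroed out.
    return cosine_similarity * max(0.2, 1 + quality * 0.15)


def rerank(candidates: List[Tuple[str, float, float]]) -> List[Tuple[str, float]]:
    """Sort (memory_id, cosine_similarity, quality) candidates by adjusted score."""
    scored = [(memory_id, adjusted_score(sim, quality)) for memory_id, sim, quality in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With this weighting, a well-rated memory at similarity 0.7 can outrank a poorly rated one at 0.9, which is the behavior the quality signal is meant to produce.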

All of this shipped and was checked complete in a single milestone: the schema migration, 7 API endpoints, Memory panel, ontology graph, RLHF pruning, security panel, system panel, and test suite expansion to 419 tests. This is what "add observability before features" enables: once the observability infrastructure exists, the quality improvements it reveals can be addressed quickly.

The Milestone Map: Original vs. Evolved

[Figure: Roadmap Milestone Comparison. Legend: Shared = present in both roadmaps; New = added in the evolved roadmap; Done = completed.]

What Changed in Stage 4

Stage 4 — the Autonomous Swarm — was always the destination. Both roadmaps describe it with the same four components: Proposer/Executor/Critic loop, Git worktrees as isolation substrate, vLLM hot-swap, and Docker sandbox. What the evolved roadmap added was the implementation detail that comes from actually building toward those components.

Governance before autonomy

The original roadmap had no governance layer. The evolved roadmap added three files: constitution.md (hard constraints), ethos.md (evaluation standards), and cadence.md (operational loop). These are not agent system prompts — they are shared documents injected into every agent's bootstrap context, above the prompt cache boundary so vLLM prefix caching can reuse them across turns. The "democratic constitution" pattern closes the loop: Wiggum's failure patterns distill into proposed constitution diffs; human approves before merge.

Why does this belong in Stage 4 and not earlier? Because with a single agent, behavioral constraints live in the system prompt and only need to be consistent with themselves. With multiple autonomous agents, behavioral constraints need to be consistent across agents — which requires an external, shared document, not per-agent prompts.

Worktree coordination as a VFS substitute

The original roadmap described Git worktrees as isolation substrate for Proposer mutations. The evolved roadmap specified the directory convention that makes this work as a lightweight coordination protocol — no VFS, no distributed filesystem, no kernel driver:

worktrees/<branch>/
  tasks/      ← agent claims work by writing a file here (task_id.json)
  leases/     ← task_id.lock with {agent_id, claimed_at} — coordinator expires stale locks
  artifacts/  ← outputs staged here; never written directly to shared data/
  events/     ← append-only per-run event log for this agent

Workers only write inside their branch tree. Coordinator reads leases/ to detect abandonment and re-queue. Promotion is git merge --ff-only; cleanup is git worktree remove --force. This gives VFS-style coordination semantics without infrastructure overhead. The multi-machine scaling path — NFS mount → FUSE coordinator → distributed filesystem — is sequenced by complexity, not pre-specified.
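A sketch of the lease half of this convention, assuming a single machine and a local filesystem. The `leases/` directory, the `task_id.lock` name, and the `{agent_id, claimed_at}` payload are from the roadmap; `claim_task`, `expire_stale_leases`, and the TTL parameter are hypothetical.

```python
import json
import os
import time
from pathlib import Path
from typing import List, Optional


def claim_task(leases_dir: Path, task_id: str, agent_id: str, now: Optional[float] = None) -> bool:
    """Try to take the lease for task_id; return False if another agent holds it."""
    leases_dir.mkdir(parents=True, exist_ok=True)
    lock_path = leases_dir / f"{task_id}.lock"
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so only one claimant wins.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    claimed_at = time.time() if now is None else now
    with os.fdopen(fd, "w") as handle:
        json.dump({"agent_id": agent_id, "claimed_at": claimed_at}, handle)
    return True


def expire_stale_leases(leases_dir: Path, ttl_seconds: float, now: Optional[float] = None) -> List[str]:
    """Coordinator sweep: remove leases older than the TTL and return their task ids."""
    now = time.time() if now is None else now
    expired = []
    for lock_path in sorted(leases_dir.glob("*.lock")):
        try:
            lease = json.loads(lock_path.read_text())
        except (json.JSONDecodeError, OSError):
            continue  # half-written or already removed; look again next sweep
        if now - lease["claimed_at"] > ttl_seconds:
            lock_path.unlink()
            expired.append(lock_path.stem)
    return expired
```

The task ids returned by the sweep are the ones the coordinator would re-queue. Exclusive-create semantics are what the later NFS step in the scaling path would need to re-verify.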

A2A protocol: from sketch to specification

The original roadmap mentioned A2A as a "foundation" — the observation that the producer→evaluator→Wiggum loop is already an A2A pattern, just in-process rather than networked. The evolved roadmap turned this into a full specification: Agent Cards at /.well-known/agent.json, POST / task endpoints, agents.toml registry, planner-level delegate_to routing, parent_run_id for log correlation, and a comparison table of in-process vs. networked tradeoffs.

The key design constraint that emerged during specification: agents must not import from each other. All coordination happens over the wire. A researcher agent that deep-imports from a coder agent defeats isolation and reintroduces the in-process coupling A2A is meant to eliminate. This constraint is obvious in retrospect but was not stated in the original roadmap because it only becomes a real concern when you start writing the implementation.

Sub-agent minimal context mode

Not in the original roadmap at all. The insight: worker agents (orchestrator subtasks, Executor instances) don't need the full coordinator context — no memory recall, no self-update guidance, no evaluation rubric, no session history. A --minimal-context flag strips the system prompt to task + tools + constitution.md only. This reduces per-subtask token cost by 30–50% on research tasks. The pattern mirrors OpenClaw's promptMode=minimal: sub-agents get the equivalent of AGENTS.md + TOOLS.md only.

What the Added Milestones Have in Common

Every milestone that was added to the evolved roadmap can be traced to one of the seven principles:

| Milestone addition | Principle it instantiates |
| --- | --- |
| 0.3.5: Policy-driven orchestration | P6: the always-parallel fan-out was a manual heuristic, a loop that hadn't closed yet |
| 0.3.5: /grill-me + /onboarding | P6: agent guessing at ambiguous tasks was a manual hand-off |
| 0.3.6: Logger policy fields | P7: telemetry separates critic from scorer; without policy fields, runs.jsonl can't answer "did parallel execution outperform sequential?" |
| 0.3.6: Structured event protocol | P3: unified [EVENT]<json> makes every pipeline stage observable in real time |
| 0.3.6: COCOMO II analog | P3: observability of predicted cost before committing, not just actual cost after |
| 0.3.6: Evaluator rotation | P4: monoculture evaluator is circular; rotation breaks it and guards producer==evaluator collision |
| 0.3.7: Memory RLHF + ontology | P3 + P6: memory quality degrading silently was an observability failure; RLHF closes the quality feedback loop |
| Stage 4: Governance layer | P1 + P5: behavioral constraints belong in the harness, not per-agent prompts; designed to be updated as the system improves |
| Stage 4: Sub-agent minimal context | P1: context overhead that models don't need is a workaround for future models that can route with less |
| Stage 4: Harness ontology layer | P7: the Proposer can't be a critic without blast-radius context; code structure is a prerequisite for structured mutation |
| 0.4.0: Personal TTS model | P6: voice output going to an external API was a manual hand-off; fine-tuning on local recordings closes it |

None of the new milestones are scope creep. Each one closes a gap that became visible only after the preceding milestone instrumented it. This is what Principle 3 is for: observability before features means that the features you add are the ones that the data shows you need, not the ones you imagine you need.

Experimental Timeline

The narrative above describes what each milestone revealed. This section documents the numbers — the experimental results, failure modes, and quantitative findings that drove each implementation decision. All figures are drawn directly from harness-engineering/journal.md and the associated runs.jsonl corpus.

Sessions 1–2 • Early 2026 Evaluator Selection & CRD Baseline

The first challenge was finding an evaluator that could discriminate between good and bad outputs. Three evaluators were tried before settling on one that produced meaningful signal.

| Evaluator | Outcome | Disqualifying finding |
| --- | --- | --- |
| qwen2.5:72b | Rejected | Too slow: per-eval latency made iteration impractical |
| glm4:9b (loose) | Rejected | Rubber stamp: ceiling at 9/10 regardless of PASS_THRESHOLD |
| glm4:9b (tight) | Partial | Discriminates more but still not reliable on edge cases |
| Qwen3-Coder:30b | Adopted | |

Experiment 01 — a 9-run Completely Randomized Design (CRD) to establish baseline variance. All 9 runs passed. Task stability varied substantially:

| Task | Pass rate | CV | Interpretation |
| --- | --- | --- | --- |
| T_B | 9/9 (100%) | 13.2% | Most stable: reliable eval target |
| T_C | | 44.2% | Most variable: evaluator noise or task ambiguity |

Experiment 02 finding: H1 falsified. The glm4:9b evaluator produced a score ceiling at 9/10 regardless of which PASS_THRESHOLD was set. The threshold had no discriminating power: the same outputs received the same scores across threshold settings. This disqualified glm4:9b as a reliable evaluator and forced the switch to Qwen3-Coder:30b.

Experiment 03 — first run with Qwen3-Coder:30b as evaluator. The new evaluator was genuinely discriminating:

  • Overall pass rate dropped from 100% (Exp 01 with rubber-stamp evaluator) to 4/9 (44%)
  • T_A: 0/3 PASS — complete ceiling failure on the hardest task type
  • Revision loop was active in 8/9 runs — nearly all outputs required wiggum revision to reach passing score
  • Two distinct wiggum failure modes identified: regression (revision worsens score) and stagnation (revision makes no progress)

Experiment 04 — switched to 32B producer to test whether the T_A ceiling was an evaluator artifact or a producer capability limit:

  • Overall: 12/16 PASS (75%) vs. 4/9 (44%) with 7B producer
  • T_A ceiling broken: 4/4 PASS
  • T_B: flat depth dimension across all runs; the synthesis instruction, not the producer, was the bottleneck

Implementation decision: T_B's flat depth score (first appearance of this pattern) pointed to the synthesis instruction as the binding constraint, a finding that would drive the entire autoresearch program.

Sessions 2–3 • Early 2026 Search Pipeline Development

The agent's web search pipeline was the first component subjected to systematic ablation. Two architectural changes had measurable impact; a third revealed a latent bug.

Dual search implementation: moving from single-query to dual-search (initial broad query + targeted follow-up) produced the first large quality improvement:

| Metric | Before | After | Delta |
| --- | --- | --- | --- |
| Avg output bytes | | | +58% |
| Output lines | | | |
| First wiggum score (T_D) | 7.7 | 9.0 | +1.3 |
| Wiggum rounds required | 1.3 | 1.0 | −0.3 |

Count-check retry analysis — across 123 runs, the enumeration retry path had a measurable quality penalty:

| Condition | n | mean r1 |
| --- | --- | --- |
| Triggered count retry | 31 (25%) | 7.53 |
| No retry | 92 (75%) | 7.92 |

Root cause: the retry path was using SYNTH_INSTRUCTION_COUNT, a session-1-era instruction that had never gone through autoresearch optimization. It produced ~1,300-byte outputs vs. 5,000–7,000-byte outputs from the optimized SYNTH_INSTRUCTION. Fix: synthesize_with_count() now uses SYNTH_INSTRUCTION with the count constraint injected as a prefix. SYNTH_INSTRUCTION_COUNT is dead code, left in place inside its autoresearch sentinels. The −0.39 score penalty immediately disappeared.

Novelty scale compression — 55 runs with novelty scores revealed the search saturation signal was effectively binary:

| Novelty score | Frequency |
| --- | --- |
| 2 | 24 runs (44%) |
| 3 | 30 runs (55%) |
| 4 | 1 run (2%) |

Round 4 mean novelty score: 2.83 — borderline and still potentially useful. NOVELTY_THRESHOLD=3 stops on score=2, meaning some useful content in later rounds was being discarded. This finding informed the search rounds experiment (see Phase 3).

Search rounds vs. r1 — preliminary data from 138 scored runs:

| Search rounds | n | mean r1 |
| --- | --- | --- |
| 0 (cache hit) | 114 | 7.72 |
| 2 | 12 | 7.77 |
| 3 | 4 | 7.20 |
| 4 | 2 | 7.55 |
| 5 | 6 | 8.07 |

Confound note: the 5-round advantage (8.07) was observed on n=6, too small for confidence. A clean saturation-loop ablation (1 vs. 5 rounds) was confounded by the SYNTH_INSTRUCTION_COUNT bug: both 1-round and 5-round runs used the broken instruction and scored 6.9 identically. A clean rerun was scheduled after the bug fix.

2026-04-12 Full-Corpus Analysis: 211 Runs, 38 Traces

A deep analysis of runs.jsonl (211 total runs, 138 scored) and 38 Perfetto traces captured in the first month of operation. This analysis drove the entire roadmap revision for 0.3.5–0.3.7.

Time allocation — from 38 Perfetto traces (T_D/T_E runs and /annotate sessions):

| Stage | Mean % wall time | Implication |
| --- | --- | --- |
| synthesize | 53% | Dominant cost center; synthesis latency is the highest-leverage optimization target |
| wiggum (total) | 32% | |
| └ wiggum_revise | ~94% of wiggum time | Eval is nearly free; all cost is the revision LLM call |
| └ wiggum_eval | ~6% of wiggum time | |
| gather_research | 15% | Cheap enough that search cache ROI is bounded; synthesis is the priority |

Wiggum lift/regression distribution — across 138 scored runs:

| Outcome | Count | % | Mean magnitude |
| --- | --- | --- | --- |
| Lifted (r_final > r1) | 35 | 25% | +1.02 |
| Unchanged | 89 | 64% | |
| Regressed (r_final < r1) | 14 | 10% | −0.38 |

Within enumerated tasks specifically: 17 lifts, 10 regressions, 13 unchanged out of 40 multi-round runs. The 10% regression case motivated the best-round restoration fix: before returning FAIL at max rounds, restore the best-scoring round's content to disk if a later round scored lower. Also fixed a latent bug: wiggum's termination gate was using the MAX_ROUNDS global constant instead of the max_rounds local variable, meaning the env override was scoping the loop but not the early-exit check.
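Both fixes fit in one short loop. This is a sketch under assumptions, not the harness's actual wiggum code: `revise_until_pass`, its callbacks, and the default threshold are hypothetical, and content is returned rather than written to disk to keep the example self-contained.

```python
from typing import Callable, Tuple


def revise_until_pass(
    draft: str,
    evaluate: Callable[[str], float],
    revise: Callable[[str], str],
    pass_threshold: float = 8.0,  # assumed default
    max_rounds: int = 3,
) -> Tuple[str, float, bool]:
    """Return (content, score, passed); on FAIL, return the best round, not the last."""
    best_content, best_score = draft, evaluate(draft)
    content, score = best_content, best_score
    rounds = 1
    # Loop bound and early-exit check both use the local max_rounds, so an
    # override passed in by the caller governs the whole gate.
    while score < pass_threshold and rounds < max_rounds:
        content = revise(content)
        score = evaluate(content)
        rounds += 1
        if score > best_score:
            best_content, best_score = content, score
    if score >= pass_threshold:
        return content, score, True
    # Best-round restoration: a later, lower-scoring revision never wins.
    return best_content, best_score, False
```

A run whose revisions score 7.5, then 7.0, then 6.5 returns the 7.5 draft, so a regression can no longer make the final artifact worse than the first attempt.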

Dimension weakness profile — 104 scored runs, five dimensions:

| Dimension | Mean score | Weight | Priority |
| --- | --- | --- | --- |
| relevance | 8.88 | 0.20 | |
| structure | 8.54 | 0.10 | |
| completeness | 7.19 | 0.25 | Secondary |
| depth | 6.87 | 0.30 | Primary: highest weight × lowest (non-specificity) score |
| specificity | 6.61 | 0.15 | Weakest but low weight; lower leverage than depth |

Implementation decision: depth (weight 0.30) was identified as the highest-leverage autoresearch target. Specificity is weakest by score but lower-weight. A +1 on depth is worth +0.30 composite; a +1 on specificity is worth +0.15. This finding primed autoresearch session 4's PROPOSE_PROMPT explicitly for depth.
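The leverage argument can be checked in a few lines, assuming the composite is a plain weighted sum of the five dimensions (which the +0.30 and +0.15 figures imply). The weights and means are from the table; the function names are illustrative.

```python
WEIGHTS = {"relevance": 0.20, "structure": 0.10, "completeness": 0.25, "depth": 0.30, "specificity": 0.15}
MEANS = {"relevance": 8.88, "structure": 8.54, "completeness": 7.19, "depth": 6.87, "specificity": 6.61}


def composite(scores: dict) -> float:
    """Weighted sum of dimension scores (assumed form of the composite)."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)


def lift_from(dimension: str, delta: float = 1.0) -> float:
    """Composite gain from raising one dimension by delta, others held fixed."""
    bumped = dict(MEANS)
    bumped[dimension] += delta
    return composite(bumped) - composite(MEANS)


# Rank dimensions by how much a +1 improvement moves the composite.
leverage = sorted(WEIGHTS, key=lift_from, reverse=True)
```

Under a linear composite the gain from a +1 is just the weight, so ranking by leverage is ranking by weight; the mean scores matter only for judging how much headroom each dimension has left.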

Task type performance & pass rates:

| Task type | n | mean r1 | Pass rate |
| --- | --- | --- | --- |
| unknown (Apr 7, glm4:9b evaluator) | 17 | 8.76 | 100% |
| best_practices | 42 | 7.81 | 64% |
| enumerated | 66 | 7.56 | 52% |
| research (/annotate) | 13 | 6.91 | 0% |

Evaluator calibration note: The 100% pass rate on Apr 7 "unknown" runs is not a model quality signal; those runs used glm4:9b (lenient evaluator) and a 7B producer. The apparent drop to 64% best_practices pass rate reflects switching to Qwen3-Coder:30b evaluator, not performance degradation. Cross-evaluator score comparisons require a calibration run to establish conversion factors; this was identified here and scheduled as EXP-A.

MagenticOne architecture review (2026-04-12) — mapped against harness equivalents:

| MagenticOne | Harness equivalent | Gap |
| --- | --- | --- |
| Task Ledger | planner.py Plan + knowledge_state | Flat string vs. verified-facts / open-gaps structure |
| Stall → replan | NOVELTY_THRESHOLD + WIGGUM_MAX_ROUNDS | Per-stage detection, not pipeline-level |
| CLOSED_BOOK init prompt | Missing | Go straight to search without auditing prior knowledge |

Highest-value borrow identified: closed-book prior knowledge pass. Before gather_research(), ask the producer what it already knows and what gaps exist. Gap list seeds plan_query() so searches target actual unknowns rather than re-surfacing what the model already knows. Roadmapped as EXP-E.

Sessions 3–4 • Early–Mid 2026 Autoresearch Sessions: Proposer Attractors & Hang Modes

The autoresearch loop ran across multiple sessions targeting SYNTH_INSTRUCTION improvement. Each session revealed a distinct failure mode in the proposer's behavior.

Session 3 results:

  • Best result: exp 7, baseline 8.915 using "when NOT to use" framing + input boundary specification
  • Exp 9: confidence ratings approach. Result: −0.950 regression. The proposer had found a new attractor (confidence ratings) that moved in the wrong direction.

Proposer attractor pattern (first observation): the proposer repeatedly returned to the same class of instruction change (confidence ratings, format constraints, exact N practices) across experiments. Once an approach appeared in "Unexplored Angles," it would dominate successive proposals regardless of whether it had been tried and failed. Fix: move failed approaches from "Unexplored Angles" to a HARD-BANNED list. This pattern recurred in subsequent sessions and required increasingly aggressive hard-banning.

Session 4 failure modes: two hang modes discovered running autoresearch concurrently with fine-tuning:

  • Keep_alive hang: run_eval() was passing OLLAMA_KEEP_ALIVE=-1 to the eval subprocess. Producer and evaluator models stayed loaded indefinitely, blocking the proposer from loading on the next iteration. Fix: set OLLAMA_KEEP_ALIVE=120 in autoresearch.py's eval env — models release ~2 minutes after eval completes.
  • GPU contention: fine-tuning v2 (Qwen2.5-7B QLoRA on 63.8GB) consumed all available VRAM. Ollama couldn't load the 32B producer for eval at all. Not a code bug — a resource scheduling issue. Resolution: pause autoresearch until training completes.

Current session (T_B, delta 0.1, eval-n 3→5):

  • Baseline: 8.740 on T_B task
  • Exps 40–44: all discarded — proposer in format-churn attractor (word counts, narrative, exact N practices)
  • Infinite failure loop at exp 46: propose_instructions returned None (proposer hallucinating multi-line instructions caught by newline guard); continue returned to loop top without incrementing experiment counter. Fix: circuit breaker — consecutive_proposal_failures counter with MAX_PROPOSAL_FAILURES=5; logs as "failure" and advances experiment counter after threshold.
  • Exps 48–54: quantification attractor — proposer tried numerical metrics/thresholds 6 consecutive times, never moving depth dimension. Hard-banned. Proposer prompt rewritten with live dimension diagnostic data and depth-targeting HOW/implementation-steps directive.
  • Bimodal score distribution: scores cluster at 7.96 and 8.46 due to grounded dimension fluctuating between 6 and 8; depth=7 is the ceiling regardless.
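The circuit breaker from the exp 46 fix can be sketched as follows. `consecutive_proposal_failures` and `MAX_PROPOSAL_FAILURES=5` are from the journal; the loop shape, the `propose` and `evaluate` callbacks, and the result records are assumptions standing in for the real autoresearch loop.

```python
from typing import Callable, List, Optional

MAX_PROPOSAL_FAILURES = 5


def run_experiments(
    propose: Callable[[int], Optional[str]],
    evaluate: Callable[[str], float],
    n_experiments: int,
) -> List[dict]:
    """Run n experiments; a proposer that keeps returning None cannot stall the loop."""
    results: List[dict] = []
    exp = 0
    consecutive_proposal_failures = 0
    while exp < n_experiments:
        instruction = propose(exp)
        if instruction is None:
            consecutive_proposal_failures += 1
            if consecutive_proposal_failures < MAX_PROPOSAL_FAILURES:
                continue  # retry the same experiment slot
            # Threshold reached: record the failure and fall through to advance.
            results.append({"exp": exp, "status": "failure"})
        else:
            results.append({"exp": exp, "status": "scored", "score": evaluate(instruction)})
        consecutive_proposal_failures = 0
        exp += 1
    return results
```

The original bug is the `continue` without the counter: with no bound on retries, a proposer that fails deterministically for one slot loops forever.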
2026-04-13 • Session 5 Annotation Pipeline Overhaul & Fine-Tuning Initiation

Root cause of all annotation hangs: a single bug in skills.py. The else branch of _clean_pdf_text extended cleaned with short-run lines but never advanced i past the current non-short line, so any line longer than 2 characters caused an infinite loop. Every model tried (Qwen3-Coder, kimi-k2.5, pi-qwen-32b) appeared to hang; they were all waiting on stuck preprocessing. This bug predated all model testing in this session and masked every subsequent experiment. Lesson: before suspecting model behavior, audit preprocessing.

Format change: sentence-labeling → generative. The prior implementation labeled existing abstract sentences with topic/motivation brackets. This broke on abstracts that don't contain explicit topic sentences (e.g., Mistral 7B's abstract). Switched to fully generative format: model synthesizes 1-2 prose sentences per section from full paper content, matching the Nanda Annotated Abstract framework (8 bold section headers).

Wiggum annotation rubric: PASS_THRESHOLD raised from 8.0 to 9.0 for annotation path. Key finding from pre-fix evaluation: evaluator scored a known-bad annotation 9.3/10 and passed it. Root cause: annotation had correct structure but section content was direct quote from abstract, not synthesis. New rubric explicitly distinguishes "recitation" (scores 1–4) from "synthesis" (scores 7–10).

Fine-tuning v1 — Qwen2.5-7B-Instruct, QLoRA:

| Parameter | Value |
| --- | --- |
| Model | Qwen/Qwen2.5-7B-Instruct |
| LoRA rank / alpha | r=16 / alpha=32 |
| Target layers | q/k/v/o/gate/up/down projectors |
| Quantization | None (full bf16, 63.8GB VRAM) |
| Training examples | 121 gold (arxiv-fetched abstracts), 90/10 split |
| Hardware | NVIDIA RTX 5000 Ada Generation, 63.8GB VRAM |
| Infrastructure fixes required | bitsandbytes silent CPU offload on Windows; device_map="cuda:0" failure; max_seq_length API change (trl ≥ 0.13); warmup_ratio deprecated in trl v5.2; CP1252 encoding (-X utf8 required) |

Two arxiv paper corpora identified and annotated in parallel: arxiv_agentic_papers.md (300 papers) and arxiv_agentic_harness_engineering_papers.md (300 papers), producing expanded training data for fine-tuning round 2.

2026-04-13/14 • Sessions 6–11 Skill Expansion, Fine-tuning v2, & Corpus Infrastructure

Fine-tuning v2 — expanded dataset from corpus annotation:

| Source | Examples |
| --- | --- |
| Gold (human-curated) | 121 |
| arxiv_agentic_papers.md (agent-annotated) | 251 |
| arxiv_agentic_harness_engineering_papers.md (agent-annotated) | 299 |
| Skipped / failed | 50 |
| Total in finetune_dataset_v2.jsonl | 718 |

Training: 1,938 steps, 3 epochs, 646 train / 72 eval examples. Killed at step 1,237 (63.8% complete, epoch 1.91) by a Windows OS update reboot. Loss at interruption: ~0.56; mean token accuracy: ~0.84. Root cause of total loss: save_strategy="epoch" only writes checkpoints at epoch boundaries — the run was 55 steps short of completing epoch 2. Fix: save_strategy="steps" with save_steps=100 (~15 min interval), save_total_limit=3. Added --resume flag for auto-detection of latest checkpoint.
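The step arithmetic behind these figures is worth making explicit. The constants are from the run described above; the helper name is illustrative, and the sketch assumes checkpoints land exactly on multiples of `save_steps`.

```python
TOTAL_STEPS, EPOCHS, KILL_STEP = 1938, 3, 1237


def steps_since_last_checkpoint(kill_step: int, save_every: int) -> int:
    """Steps of work that must be redone if the run dies at kill_step."""
    return kill_step % save_every


steps_per_epoch = TOTAL_STEPS // EPOCHS                  # spacing of epoch-boundary checkpoints
percent_complete = round(KILL_STEP / TOTAL_STEPS * 100, 1)
epoch_at_kill = round(KILL_STEP / steps_per_epoch, 2)
steps_short_of_epoch_2 = 2 * steps_per_epoch - KILL_STEP

# With save_strategy="steps" and save_steps=100, the same interruption
# would have cost only the steps since the step-1200 checkpoint.
redo_with_save_steps_100 = steps_since_last_checkpoint(KILL_STEP, 100)
```

With step-based saving the worst case is 99 steps of rework, compared with up to a full epoch between checkpoints under epoch-based saving.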

Corpus infrastructure:

  • index_papers.py — bulk-loaded annotated corpus into ChromaDB: 739 papers indexed, 861 total observations in memory store. Each paper stored with task_type="paper" and fixed timestamp so papers sort before run observations in retrieval.
  • failure_patterns.py — aggregated wiggum issues across runs.jsonl: 645 issues extracted, 107 recurring clusters. Top findings: missing implementation notes (56×), unclosed code fences (31×), shallow cross-paper synthesis (24×), missing quantitative evidence (19×), overly broad conclusions (14×). These directly drove the next two SYNTH_INSTRUCTION updates.
  • semantic_scholar.py — Semantic Scholar Graph API integration for citation-graph enrichment: hub scores (in-corpus citation count), gap candidates (uncovered papers cited by corpus), within-corpus adjacency. SQLite cache with 30-day TTL.

Skills built (Sessions 6–11):

| Skill | Session | Key design finding |
| --- | --- | --- |
| /github | 6 | Commit message with llama3.2:3b: ~8s, accurate. With pi-qwen-32b: 620s, generic. Model matters for time-constrained subtasks. |
| /review | 7 | 4 anti-pattern rubric. llama3.2:3b + phi4:14b both hallucinated warnings on clean diffs. Qwen3-Coder:30b returned 0 warnings correctly. Wider rubrics amplify hallucination. |
| curator.py | 8 | 5-persona filter (Pragmatic Engineer, Academic Rigorist, Synthesis Thinker, Contrarian, Newcomer). Passes if mean ≥ 3.5 AND no score < 2. Contrarian persona is the critical differentiator for catching overclaiming. |
| DPO dataset | 9 | 3 cross-run pairs initially. Revision path requires content-per-round in wiggum_eval_log (added 2026-04-14, zero-cost observability change). Growth path: 10–20 autoresearch sessions on same task set. |
| /lit-review | 10 | 7-step pipeline: fetch → S2 enrich → curate → annotate+wiggum → cluster → synthesize → render. Checkpointed per paper. Each run generates DPO pairs, curated CSV, memory observations. |
| /recall | 11 | Semantic memory search over 862 accumulated observations (papers + run data). MSYS2 path mangling fix required: Git Bash converts /skill tokens to Windows absolute paths; parse_skills() detects and strips. |
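The curator's pass rule is compact enough to state in code. The persona names and both thresholds are from the table; the function name and the dict-of-scores input are assumptions, and the score scale is not stated in the source.

```python
from statistics import mean
from typing import Dict

PERSONAS = ("Pragmatic Engineer", "Academic Rigorist", "Synthesis Thinker", "Contrarian", "Newcomer")


def passes_curation(scores: Dict[str, float], min_mean: float = 3.5, floor: float = 2.0) -> bool:
    """Pass only if the panel mean is >= 3.5 AND no single persona scores below 2."""
    missing = [persona for persona in PERSONAS if persona not in scores]
    if missing:
        raise ValueError(f"missing persona scores: {missing}")
    values = [scores[persona] for persona in PERSONAS]
    return mean(values) >= min_mean and min(values) >= floor
```

The floor is what gives a single persona veto power: four scores of 5 and a Contrarian score of 1 average 4.2 but still fail.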

Dynamic keep_alive: keep_alive=60 was arbitrary and blocked Ollama concurrency when models ran longer. Replaced with two-stage system: Stage 1 heuristic from explicit_skills (90s for short calls, 300s+ base for research), Stage 2 refined from historical p90 of runs.jsonl durations (+20% buffer). Root cause of all prior concurrency blocking: OLLAMA_NUM_PARALLEL unset (default 1). Set to 4 via setx OLLAMA_NUM_PARALLEL 4 + Ollama restart.
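A rough sketch of the two-stage selection. The 90s and 300s bases, the p90, and the +20% buffer are from the journal; the set of "short" skills, the minimum-history cutoff, and the nearest-rank percentile are assumptions.

```python
import math
from typing import Sequence

SHORT_SKILLS = {"/github", "/recall"}  # hypothetical: skills treated as short calls


def heuristic_keep_alive(explicit_skills: Sequence[str]) -> int:
    """Stage 1: 90s when every requested skill is a short call, else the 300s research base."""
    if explicit_skills and all(skill in SHORT_SKILLS for skill in explicit_skills):
        return 90
    return 300


def p90(durations_s: Sequence[float]) -> float:
    """Nearest-rank 90th percentile (integer arithmetic avoids float rounding in the rank)."""
    ordered = sorted(durations_s)
    rank = (9 * len(ordered) + 9) // 10  # ceil(0.9 * n)
    return ordered[rank - 1]


def keep_alive_seconds(explicit_skills: Sequence[str], history_s: Sequence[float], min_history: int = 10) -> int:
    """Stage 2: once enough history exists, use historical p90 plus a 20% buffer."""
    if len(history_s) < min_history:
        return heuristic_keep_alive(explicit_skills)
    return math.ceil(round(p90(history_s) * 1.2, 3))
```

For a history of ten runs lasting 10s through 100s, the p90 is 90s and the refined value is 108s.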

2026-04-15 • Session 14 Token Accounting Fixes & vLLM Integration

Token accounting bugs corrected:

  • Planner tokens not counted: planner.py called ollama.chat() but never returned the response for logging. 5–10 LLM calls per run were absent from tokens_by_stage. Fix: make_plan() return type changed to tuple[Plan, object].
  • tok/s wrong denominator — dashboard was dividing total tokens by total_ms, which includes model cold-start load time. Fix: logger.py now accumulates eval_ms (generation only) and prompt_ms (prompt-eval only). Cold-start inflation no longer deflates displayed tok/s.
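The denominator fix is easiest to see with numbers. The `eval_ms` and `prompt_ms` names follow the logger fields named above; the `TokenRate` class and its methods are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TokenRate:
    """Accumulates per-call timings so tok/s can use generation time only."""
    tokens: int = 0
    total_ms: float = 0.0   # wall time, including cold-start model load
    eval_ms: float = 0.0    # generation only
    prompt_ms: float = 0.0  # prompt evaluation only

    def add_call(self, tokens: int, total_ms: float, eval_ms: float, prompt_ms: float) -> None:
        self.tokens += tokens
        self.total_ms += total_ms
        self.eval_ms += eval_ms
        self.prompt_ms += prompt_ms

    def naive_tok_per_s(self) -> float:
        """The old dashboard figure: deflated by load time in total_ms."""
        return self.tokens / (self.total_ms / 1000.0) if self.total_ms > 0 else 0.0

    def tok_per_s(self) -> float:
        """The corrected figure: generated tokens over generation time."""
        return self.tokens / (self.eval_ms / 1000.0) if self.eval_ms > 0 else 0.0
```

Two calls of 500 tokens and 10s generation each give 50 tok/s; if the first call also spent 18s loading the model, the naive figure drops to about 24 tok/s for identical generation speed.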

vLLM integration — rationale: Ollama serializes all ollama.chat() calls per model. ThreadPoolExecutor(max_workers=4) gives process concurrency but Ollama collapses it to a serial LLM queue — 3 of 4 parallel subtasks block waiting. Solution: inference.py unified backend shim routing to Ollama or vLLM based on INFERENCE_BACKEND env var.

WSL2 setup required: vLLM does not support native Windows pip installs. Key version pin and server flag:

  • transformers==4.49.0 — trl 0.7.3 installs 5.5.4 which removed all_special_tokens_extended from Qwen2Tokenizer, crashing tokenizer init
  • --enable-prefix-caching for vLLM — shared context (constitution, ethos) cached across turns

End-to-end validation: test_harness_vllm.bat passed — 376.6s, in=7,013 out=1,063 tokens. Planner, search, novelty, markitdown, security, synthesis, write, and memory stages all confirmed working via vLLM backend. runs.jsonl entry logged correctly.

2026-04-24 • Session 28 Multi-Model Benchmarking: Qwen3.6-35B

Model: Qwen3.6-35B-A3B-UD-IQ3_S.gguf — 13.7GB, fits 16GB VRAM with ~2.3GB headroom. Served via llama-server (native Windows, no WSL2 required).

Quant selection rationale for 16GB (RTX 5000 Ada, 16,375 MiB):

| Quant | Size | Verdict |
| --- | --- | --- |
| UD-IQ3_XXS | 13.2 GB | Max headroom, lowest quality. Rejected |
| UD-IQ3_S | 13.7 GB | Selected. Best quality that fits safely |
| UD-Q3_K_S | 15.4 GB | ~600 MB headroom. Too tight |
| UD-Q3_K_M | 16.6 GB | Exceeds VRAM. Rejected |

llama.cpp version gotcha: build b8914 crashes with key qwen35moe.rope.dimension_sections has wrong array length; expected 4, got 3. The GGUF stores 3 mrope sections [11, 11, 10] but b8914 expects 4. Fix: rebuild from latest main. This class of issue (model format expectations out of sync with checkpoint) will recur as new quantization schemes ship.

bench_model_compare.py bug: run_task_live() was looking up run records by task_type == "T_A" (bench IDs), but agent.py writes semantic types ("research", "best_practices", "enumerated") to runs.jsonl. Lookup always returned {} → 0% pass / NaN scores across all tasks. Fix: snapshot pre-count before subprocess call, find first new record for that model after run.
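The snapshot-then-diff lookup described in the fix can be sketched as below. This is a hedged reconstruction, not the code in bench_model_compare.py: the helper names and the `model` field name in runs.jsonl are assumptions, and the subprocess call itself is left to the caller.

```python
import json
from pathlib import Path


def read_runs(path: Path) -> list[dict]:
    """Load every JSON record from a runs.jsonl-style file (empty list if absent)."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def first_new_record(path: Path, pre_count: int, model: str) -> dict:
    """Return the first record appended after the snapshot for this model.

    Assumes records carry a "model" field; returns {} when nothing matches.
    """
    for record in read_runs(path)[pre_count:]:
        if record.get("model") == model:
            return record
    return {}
```

Usage follows the order given in the fix: take `pre_count = len(read_runs(path))` before launching the agent subprocess, then call `first_new_record(path, pre_count, model)` once it exits. Because the match is positional rather than keyed on task_type, it no longer matters that the bench uses IDs like "T_A" while agent.py writes semantic types.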

Initial benchmark results (n=1 per task, thinking mode off):

| Task | Score | Pass | KB output | Duration |
| --- | --- | --- | --- | --- |
| T_A: context engineering (top 5) | 7.8 | FAIL | 12.8 | 263s |
| T_B: cost management | 7.8 | FAIL | 6.4 | 209s |
| T_C: agent failure modes | 7.8 | FAIL | 5.8 | 230s |
| T_D: context window management | 7.8 | FAIL | 10.3 | 243s |
| T_E: prompt injection defense | 7.8 | FAIL | 6.2 | 193s |
| T_F: introspect (no web search) | n/a | PASS | 5.2 | 25s |
| T_G: file-based synthesis | 7.5 | FAIL | 4.7 | 240s |

Uniform 7.8 scores are an evaluator calibration artifact, not a model quality signal. Tasks T_A through T_E all returned the identical score regardless of output length (5.8–12.8 KB) and task type (best_practices, enumerated), which is statistically implausible. A calibration run (same outputs, both evaluators) is required before interpreting cross-model comparisons. This is the same evaluator-calibration problem identified in the April 12 analysis; it compounds every time a new evaluator or producer is introduced. Full head-to-head (qwen3.6-35b vs. qwen3-14b, thinking on) pending.
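A cheap guard against reading a degenerate evaluator as signal is to flag score sets that barely vary while the outputs clearly do. The sketch below is one possible check, not something the harness is stated to run: the function name and both thresholds are hypothetical, and it is no substitute for the calibration run described above.

```python
from statistics import pstdev


def looks_degenerate(scores: list[float],
                     sizes_kb: list[float],
                     max_score_spread: float = 0.05,
                     min_size_ratio: float = 1.5) -> bool:
    """Flag near-constant scores across outputs whose sizes differ substantially.

    Both thresholds are illustrative assumptions, not calibrated values.
    """
    if len(scores) < 3 or len(scores) != len(sizes_kb):
        return False  # too little data to say anything
    scores_flat = pstdev(scores) <= max_score_spread
    sizes_vary = max(sizes_kb) / min(sizes_kb) >= min_size_ratio
    return scores_flat and sizes_vary
```

Fed the T_A through T_E rows (five scores of 7.8 against outputs from 5.8 to 12.8 KB), the check fires. A flag like this only says the evaluator deserves a second look; it cannot say what the correct scores are.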

What Remains

The evolved roadmap has three remaining open items that are structurally different from everything checked off. One item, Nanda annotator integration, has closed since this post was drafted. Fine-tune v2 training completed; the checkpoint was converted to GGUF (Q4_K_M quant, 4.7 GB vs. v1's 8.1 GB full-precision), registered with Ollama as nanda-annotator-v2-q4km:latest, and compared head-to-head against v1 across 10 out-of-sample research synthesis tasks (Exp D). Mean Wiggum score: v1 9.870 (n=46 papers), v2-q4km 9.800 (n=43). The 0.07-point gap was not large enough to declare a winner on quality, but the v2-q4km checkpoint is 42% smaller, a meaningful operational improvement for a system where VRAM is the binding constraint. Both models are in Ollama; v2-q4km is the default annotator going forward.

Everything blocked on data accumulation or a preceding milestone is intentional. The roadmap was designed so that each stage gates the next on real signal, not on confidence. This is the principle in action: verify externally at every stage boundary.

The Pattern

The harness roadmap evolved in a predictable way: the North Star stayed fixed, the principles stayed fixed, and the tactical milestones multiplied as each completed one made the next constraint visible. The milestones that were added were not invented — they were revealed by the telemetry from the milestones that were done.

This is also a description of the system the roadmap is building. The autoresearch loop works the same way: the baseline stays fixed as the reference, the evaluator stays fixed as the arbiter, and the synthesis instruction evolves as each experiment makes the next constraint visible. The roadmap is a slower-running version of the same loop.

The invariant in both cases: the signal that tells you what to do next comes from the system you've already built, not from speculation about what the system will need. Build the instrumentation first. The features follow from the data.
