Observability and the Data Flywheel
Sections F and G close the loop: four observability patterns that make every run auditable and diagnosable, and three self-improvement patterns that extract training data from the pipeline as a byproduct of normal operation.
The prior nine posts covered the pipeline from entry point to output. This final post covers the two feedback loops that operate outside and around the pipeline: observability, which makes the pipeline's internal behavior visible and diagnosable, and self-improvement, which feeds the pipeline's outputs back into the models that generate them.
Section F contains four observability patterns. Section G contains three self-improvement patterns. Together they form the flywheel that turns a harness that runs well into a harness that runs better.
F1 — The RunTrace
A harness run that fails silently is worse than one that fails loudly. Silent failures produce PASS records in runs.jsonl while actual outputs are empty, truncated, or overwritten. The RunTrace pattern uses a Python context manager to guarantee that telemetry is recorded for every run, including runs that fail mid-execution.
with RunTrace(run_id, task, model_config) as trace:
    trace.enter_stage("plan")
    plan = planner.plan(task, memory_context)

    trace.enter_stage("research")
    all_results = []  # keep every query's results, not just the last one
    for query in plan.search_queries:
        results = search(query)
        novelty = memory.assess_novelty(results)
        trace.record_tool_call("search", query, results, novelty_score=novelty)
        all_results.append(results)

    trace.enter_stage("synthesize")
    output = agent.synthesize(plan, all_results, model_config)

    trace.enter_stage("evaluate")
    scores, feedback = wiggum.loop(output, task, model_config)

    trace.finalize("PASS", output, scores)  # writes to runs.jsonl + .trace.json
The explicit trace.finalize("PASS", ...) call covers the success path. If any stage raises before that line is reached, the context manager's __exit__ method still runs and finalizes the record itself. A crash in the synthesize stage therefore produces a trace record with status="ERROR" and the exception message, not a silently omitted one. Every run produces a runs.jsonl record and a Chrome Trace Event file.
The RunTrace context manager accumulates telemetry during execution and finalizes into both a JSONL record and a Chrome Trace Event file. The JSONL record feeds analytics; the trace file feeds Perfetto flame graph visualization.
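The guarantee described above can be sketched as follows. This is a minimal illustration, not the harness's actual class: the constructor signature, the `runs_path` parameter, and the record fields other than `run_id`, `task`, `final`, and `run_duration_s` are assumptions, and the Chrome trace output is omitted for brevity.

```python
import json
import time
from pathlib import Path


class RunTrace:
    """Illustrative sketch: guarantees one runs.jsonl record per run."""

    def __init__(self, run_id, task, runs_path="data/runs.jsonl"):
        self.run_id = run_id
        self.task = task
        self.runs_path = Path(runs_path)  # hypothetical parameter
        self.stages = []                  # (stage name, start time) pairs
        self.record = None                # set once finalize() has run

    def __enter__(self):
        self.started = time.time()
        return self

    def enter_stage(self, name):
        self.stages.append((name, time.time()))

    def finalize(self, status, error=None):
        if self.record is not None:       # idempotent: never write twice
            return
        self.record = {
            "run_id": self.run_id,
            "task": self.task,
            "final": status,
            "error": error,
            "stages": [name for name, _ in self.stages],
            "run_duration_s": round(time.time() - self.started, 3),
        }
        self.runs_path.parent.mkdir(parents=True, exist_ok=True)
        with self.runs_path.open("a", encoding="utf-8") as f:  # append only
            f.write(json.dumps(self.record) + "\n")

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None:
            # The body raised: record the failure instead of losing the run.
            self.finalize("ERROR", error=f"{exc_type.__name__}: {exc}")
        elif self.record is None:
            # Body ended without an explicit finalize(): flag it, do not guess PASS.
            self.finalize("ERROR", error="finalize() was never called")
        return False  # never swallow the exception
```

Returning False from `__exit__` keeps the failure loud for the caller while still leaving an ERROR record behind, which is the property the pattern is after.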
F2 — The JSONL Audit Log
Every RunTrace.finalize() call appends one JSON object to data/runs.jsonl. The file is never rewritten — only appended. The record schema captures everything needed for failure diagnosis and training data extraction:
{
  "run_id": "20260522T143027_a4f2b1c3",
  "task": "survey speculative decoding techniques in transformer inference",
  "producer_model": "qwen3:32b",
  "evaluator_model": "glm4:latest",
  "run_duration_s": 187.4,
  "input_tokens": 12840,
  "output_tokens": 3201,
  "wiggum_rounds": 2,
  "wiggum_scores": [6.87, 8.31],
  "wiggum_dimensions": {
    "relevance": [7.5, 8.0], "completeness": [7.0, 8.5],
    "depth": [7.0, 8.5], "specificity": [5.5, 8.0],
    "structure": [7.5, 8.0], "groundedness": [6.5, 8.5]
  },
  "final": "PASS",
  "tool_calls": [
    {"name": "search", "query": "speculative decoding methods survey",
     "urls": ["https://..."], "novelty_score": 8.4},
    ...
  ],
  "output_path": "outputs/20260522T143027_a4f2b1c3.md"
}
The plain JSONL format is the pattern's primary advantage. No database schema migrations. No server dependency. Full portability. A 1,000-run log is typically under 20 MB and loads into pandas in under a second:
import json

import pandas as pd

with open("data/runs.jsonl", encoding="utf-8") as f:
    runs = pd.DataFrame([json.loads(line) for line in f if line.strip()])

# Failure rate by model
print(runs.groupby("producer_model")["final"].value_counts(normalize=True))

# Mean first-round Specificity score (weakest dimension)
runs["spec_r1"] = runs["wiggum_dimensions"].apply(lambda d: d["specificity"][0])
print(runs.groupby("producer_model")["spec_r1"].mean().round(2))
F3 — The Chrome Trace Exporter
Latency is the hardest problem to diagnose from logs alone. A run that takes 240 seconds could be slow because synthesis is slow, or because the planner is cold-starting, or because the evaluator is running three revision rounds, or because search is returning slowly. All four look the same in a run_duration_s field.
The Chrome Trace Exporter emits a {"traceEvents": [...]} JSON file alongside every runs.jsonl entry. Each event has a stage name, a phase (B for begin, E for end), and a microsecond timestamp. Loading the file in ui.perfetto.dev (drag-and-drop, no installation) renders a flame graph where each stage is a colored block, and gaps between blocks are visible as white space.
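As a rough sketch of the file layout, the function below converts a list of stage timings into begin/end event pairs. The input shape (name, start, end in seconds) and the function names are assumptions made for illustration. The `name`, `ph`, `ts`, `pid`, and `tid` keys are standard fields of the Chrome Trace Event format.

```python
import json


def to_trace_events(stages, pid=1, tid=1):
    """Convert (name, start_s, end_s) tuples into Chrome Trace Event B/E pairs."""
    events = []
    for name, start_s, end_s in stages:
        for phase, t in (("B", start_s), ("E", end_s)):
            events.append({
                "name": name,
                "ph": phase,                      # B = begin, E = end
                "ts": int(round(t * 1_000_000)),  # seconds -> microseconds
                "pid": pid,
                "tid": tid,
            })
    return {"traceEvents": events}


def write_trace(stages, path):
    """Write the trace as a JSON file that can be dropped into ui.perfetto.dev."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(to_trace_events(stages), f)
```

Calling `write_trace([("plan", 0.0, 4.2), ("research", 4.2, 31.0)], "run.trace.json")` with made-up timings yields a file with two stage blocks on a single track.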
A planner block that is wider than the synthesis block usually indicates that the planner model is not staying warm in VRAM. A flat white gap between the end of synthesis and the start of evaluation usually indicates that the evaluator is loading cold. Both are diagnosable in under 30 seconds from the Perfetto flame graph and fixable with a single environment variable change.
F4 — The Dashboard API Layer
The JSONL audit log and Chrome traces provide the raw data. The dashboard API layer makes that data queryable from the UI without log parsing or file management. Three endpoint groups extend beyond the core runs/sessions views:
Memory management (/api/memories): Full CRUD over the memory store, plus a graph endpoint (/api/memories/graph) that returns the memory graph as a node-edge structure for visualization, and a prune-candidates endpoint that surfaces low-quality or redundant memories for manual review. Feedback ratings on individual memories flow back to the store's quality score, creating a lightweight human-in-the-loop quality signal alongside the automated Wiggum scores.
Security audit (/api/security/events, /api/security/summary): The data/security_events.jsonl log from the E-section patterns is exposed as a filterable API. Events can be sliced by severity ("block", "warn"), event type (injection scanner, AST guard, path sandbox, CDP guard), layer, or run ID. The summary endpoint returns aggregate counts per dimension for KPI cards, making it possible to see at a glance whether injection scanner blocks are concentrated on memory writes or synthesis writes, and whether a particular run produced an unusual number of block-severity events.
System configuration (/api/system/config, /api/system/files, /api/system/skills): The config endpoint returns non-secret environment variables, harness settings, and the three live SYNTH_INSTRUCTION values — the same values autoresearch modifies — without requiring a shell session. The files endpoint serves an explicit allowlist of governance files (AGENTS.md, ROADMAP.md, wiki knowledge files, user profile, user config), some of which are editable in-place from the dashboard. The skills endpoint returns the full skill registry, including which skills expose an auto-hook or a prompt override.
The combination of structured JSONL on disk and a queryable HTTP layer over it means the same data serves both offline analysis and live dashboard monitoring without duplication. The dashboard reads from the same files the agent writes to — no secondary database, no ETL step, no synchronization lag.
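To make the security audit slicing and summary concrete, here is a small sketch that works directly on the event log. The field names (`severity`, `event_type`, `layer`, `run_id`) and the example values are assumptions inferred from the dimensions listed above. The real event schema and the endpoint's query parameters are not shown in this post.

```python
import json
from collections import Counter


def load_events(path="data/security_events.jsonl"):
    """Read the append-only security log, one JSON object per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def filter_events(events, **criteria):
    """Keep events whose fields equal every given criterion."""
    return [e for e in events
            if all(e.get(key) == value for key, value in criteria.items())]


def summarize(events, dimensions=("severity", "event_type", "layer")):
    """Aggregate counts per dimension, for example to back KPI cards."""
    return {dim: dict(Counter(e.get(dim, "unknown") for e in events))
            for dim in dimensions}
```

For example, `summarize(filter_events(events, severity="block"))` would show whether block-severity events cluster on one layer, under the assumed field names.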
G1 — The Data Flywheel
The harness produces quality-labeled outputs as a byproduct of normal operation. Every PASS run has a Wiggum score above 8.0. Every run with multiple rounds has a scored round-1 draft and a scored final revision. This is exactly the structure that Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) training require — without any additional human annotation.
The flywheel completes when fine-tuned model output re-enters production, generating higher-quality runs that produce better training data for the next fine-tuning cycle.
import json
from pathlib import Path


def extract_sft_dataset(runs_path: str, min_score: float = 8.5) -> list[dict]:
    """Extract instruction-completion pairs for SFT training."""
    pairs = []
    with open(runs_path, encoding="utf-8") as f:
        for line in f:
            run = json.loads(line)
            if run["final"] != "PASS":
                continue
            if run["wiggum_scores"][-1] < min_score:
                continue
            output = Path(run["output_path"]).read_text(encoding="utf-8")
            pairs.append({"instruction": run["task"], "output": output,
                          "quality_score": run["wiggum_scores"][-1]})
    return pairs


def extract_dpo_dataset(runs_path: str) -> list[dict]:
    """Extract preference pairs from multi-round runs."""
    pairs = []
    with open(runs_path, encoding="utf-8") as f:
        for line in f:
            run = json.loads(line)
            scores = run.get("wiggum_scores") or []  # ERROR records may have none
            if len(scores) < 2:
                continue
            score_delta = scores[-1] - scores[0]
            if score_delta < 1.0:
                continue  # not enough quality gap for a meaningful preference signal
            draft_path = run.get("draft_path")
            if not draft_path:
                continue  # no saved round-1 draft, so no rejected example
            # Round 1 draft vs final revision = natural preference pair
            pairs.append({
                "prompt": run["task"],
                "chosen": Path(run["output_path"]).read_text(encoding="utf-8"),  # final revision
                "rejected": Path(draft_path).read_text(encoding="utf-8"),  # draft text, not its path
                "score_delta": score_delta,
            })
    return pairs
Academic grounding: Refined DPO (arXiv:2402.08005v1) demonstrates that synthetic preference pairs generated by automated self-critique—without human annotation—produce effective behavioral alignment across safety, robustness, and sycophancy reduction tasks. The harness’s extract_dpo_dataset() function applies the same principle: Wiggum-scored round-1 drafts and final revisions form natural preference pairs without requiring human labelers, as long as the evaluator itself remains well-calibrated.
The flywheel's quality depends directly on the calibration of the Wiggum evaluator. Evaluator drift — scores that inflate over time as the evaluator's behavior shifts — degrades label quality silently. The Evaluator Pool (A3) converts systematic drift into visible per-evaluator variance, which triggers recalibration against human-rated holdout sets.
G2 — The RL Rollout
The Data Flywheel extracts training data from existing production runs. The RL Rollout generates training data specifically for fine-tuning by running the Wiggum Loop as an online reward signal. For each task in a rollout set, the producer generates N candidates with varied sampling temperatures; the evaluator scores each; the composite score ranks the candidates, and the score margin between the best and worst becomes the reward attached to the resulting preference pair for RLHF-style training:
def rollout(tasks: list[str], config: ModelConfig,
            n_candidates: int = 4) -> list[dict]:
    if n_candidates < 2:
        raise ValueError("need at least two candidates to form a preference pair")
    # Spread temperatures evenly over 0.6-0.9 (n_candidates=4 gives 0.6, 0.7, 0.8, 0.9)
    temps = [round(0.6 + 0.3 * i / (n_candidates - 1), 2) for i in range(n_candidates)]
    preference_pairs = []
    for task in tasks:
        candidates = []
        for temp in temps:
            output = agent.synthesize(task, config, temperature=temp)
            scores, _ = wiggum.evaluate_once(output, task, config)
            candidates.append({"output": output, "score": scores["composite"],
                               "dimensions": scores})
        candidates.sort(key=lambda c: c["score"], reverse=True)
        margin = candidates[0]["score"] - candidates[-1]["score"]
        if margin > 0.5:  # skip tasks where best and worst are nearly tied
            preference_pairs.append({
                "prompt": task,
                "chosen": candidates[0]["output"],
                "rejected": candidates[-1]["output"],
                "reward": margin,
            })
    return preference_pairs
The evaluator serves as the reward model, eliminating the need to train a separate one. The risk is reward hacking: if the fine-tuned model learns to satisfy the rubric rather than to genuinely improve output quality, scores inflate while true quality stagnates. Periodic holdout evaluation against human raters is the correct diagnostic. A widening gap between Wiggum scores and human ratings is the signal that the flywheel has decoupled from reality.
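One way to operationalize that diagnostic is sketched below. The function names, the per-cycle list of gaps, and the 0.25 tolerance are illustrative assumptions, not values taken from the harness.

```python
from statistics import mean


def score_gap(wiggum_scores, human_scores):
    """Mean signed gap (evaluator minus human) on one holdout batch."""
    if len(wiggum_scores) != len(human_scores) or not wiggum_scores:
        raise ValueError("need two equal-length, non-empty score lists")
    return mean(w - h for w, h in zip(wiggum_scores, human_scores))


def gap_is_widening(gaps_per_cycle, tolerance=0.25):
    """True if the latest gap exceeds the first cycle's gap by more than tolerance."""
    if len(gaps_per_cycle) < 2:
        return False
    return gaps_per_cycle[-1] - gaps_per_cycle[0] > tolerance
```

Comparing against the first cycle instead of the previous one is a deliberate choice in this sketch: slow, steady inflation never trips a cycle-to-cycle check, but it does accumulate against the baseline.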
G3 — The Literature Review Pipeline
The final pattern is the most complex skill in the registry: a seven-stage pipeline that produces peer-review-quality literature surveys from a topic string and date range. It is implemented as a /lit-review skill (~880 lines) and is the highest-leverage entry point in the harness for academic use cases.
Seven stages from topic string to peer-review-quality survey. The multi-persona curation stage (Stage 3) is the most distinctive: five independent reviewer personas each score papers 1–5; papers with mean < 3.5 or any score < 2 are excluded.
Stage 3 — multi-persona curation — is the most distinctive feature of the pipeline. Five reviewer personas (Pragmatist, Rigorist, Synthesizer, Contrarian, Newcomer) each independently score every paper from 1 to 5. Papers with a mean score below 3.5 are excluded. Papers where any persona scores below 2 are excluded regardless of mean — a single-weak-link rule that prevents inclusion of clearly off-topic or methodologically unsound papers. Papers passing both criteria are included.
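The two exclusion criteria reduce to a few lines of code. The sketch below assumes each paper's scores arrive as a mapping from persona name to an integer score. The function name and the input shape are hypothetical; the thresholds are the ones stated above.

```python
from statistics import mean

PERSONAS = ("Pragmatist", "Rigorist", "Synthesizer", "Contrarian", "Newcomer")


def include_paper(scores, min_mean=3.5, min_single=2):
    """Apply both curation rules to one paper's persona scores (each 1 to 5)."""
    values = [scores[persona] for persona in PERSONAS]
    if any(v < min_single for v in values):
        return False  # single-weak-link rule: one very low score excludes the paper
    return mean(values) >= min_mean  # otherwise the mean must reach the threshold
```

A paper scored 5, 5, 5, 5, 1 has a mean of 4.2 but is still excluded by the weak-link rule, while one scored 4, 4, 4, 4, 2 (mean 3.6) passes both checks.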
The multi-persona approach serves two purposes. It simulates the diversity of perspectives that a real committee review would bring, and it makes the curation step tractable: a single model asked to curate 50 papers from a single perspective tends to be either too permissive or too selective, while five perspectives combined under a consensus rule produce calibrated inclusion rates across task domains.
The full review passes through the Wiggum Loop (C1) before final output. Runtime for a 50-paper review is typically 45–90 minutes on standard hardware.
Closing the Loop
The ten posts in this series cover the complete harness pattern catalog from substrate to flywheel. The dependency structure runs upward: inference patterns (A) enable context patterns (B), which feed verification (C), which operates within orchestration (D) and security (E), all producing telemetry consumed by observability (F) and training data extraction (G). The flywheel in G feeds back to A, completing the circuit.
The harness thesis, stated at the start of this series, is that for a fixed task domain, the quality of the scaffolding surrounding a language model matters more than the choice of model. The twenty-seven patterns across these ten posts are the operational substance of that claim: the scaffolding, implemented, documented, and tested across 1,500 production runs.
The full pattern catalog is expanding into a book — Agentic Harness Engineering: Full-Stack Open Source Scaffolding Using Python and TypeScript — with deeper implementation notes, consequences tables, and cross-references for all twenty-seven patterns. Each pattern documented here is a chapter in that catalog.
What the Literature Leaves Open
Several questions raised by this body of research remain unresolved — and bear directly on how the harness flywheel should be designed and governed:
- How quickly does DPO fine-tuning on harness-generated preference pairs degrade on out-of-distribution tasks — and does the harness need a held-out task battery to detect when flywheel-trained weights have overfit to the autoresearch distribution?
- When the reward signal used to generate preference pairs is itself a harness-evaluated score, how many flywheel cycles does it take before reward hacking emerges — where the producer learns to maximize Wiggum scores rather than actual output quality?
- What is the minimum number of preference pairs required before a DPO fine-tuning run produces a statistically measurable improvement in harness score distributions, and does that threshold vary by model size?
- During early flywheel cycles — when the producer is weak, outputs are low-quality, and preference pairs are noisy — does training on this data accelerate or impede eventual quality convergence, and how should the harness handle the bootstrapping period?