May 23, 2026 • 15 min read • Agentic Harness Engineering Series

The Telemetry Router

Every harness run appends one JSON record to data/runs.jsonl containing search queries the agent invented, knowledge gaps the planner declared, dimension-level evaluator feedback, and planner chain-of-thought. A single skill that reads those signals can simultaneously feed the autoresearch optimizer, the lit-review pipeline, and the data flywheel — with no additional inference cost.

Section G of this series described three self-improvement patterns — the Data Flywheel, the RL Rollout, and the Literature Review Pipeline — as if they were independent loops. In practice they share one input: the run record. Every harness execution writes the same append-only JSONL record regardless of which improvement loop might eventually consume it. The problem is that nothing currently reads runs.jsonl with the purpose of routing its signals to the right downstream system. That gap is exactly what the telemetry router skill fills.

This post describes what runs.jsonl actually contains, which signals within it are most useful for which downstream system, and what the routing skill looks like in practice.

J1 — What runs.jsonl Actually Contains

The harness logger writes a structured record for every run. The fields relevant to the routing skill fall into four categories:

Search behavior. The tool_calls array records every tool invocation with its name and query. Web search calls carry the query string the agent invented — not a user-supplied query, but one the agent generated to fill a gap it identified during synthesis. The plan object carries a parallel search_queries list: the queries the planner declared it would need before the research stage began.

# Simplified runs.jsonl record (selected fields)
{
  "run_id":    "a3f8c1d2",
  "task":      "top 3 context window management strategies for long-document RAG",
  "task_type": "technical_count",
  "timestamp": "2026-05-22T14:31:07Z",

  "plan": {
    "search_queries": [
      "LLM context window management strategies production 2025",
      "sliding window attention vs retrieval augmented generation tradeoffs",
      "token budget enforcement agentic pipeline"
    ],
    "knowledge_gaps": [
      "exact token limits for Qwen3-Coder vs Llama 3.3 in practice",
      "whether sliding window causes coherence degradation in long chains"
    ]
  },

  "tool_calls": [
    {"name": "web_search", "query": "sliding window attention LLM coherence degradation 2025"},
    {"name": "web_search", "query": "token budget enforcement harness agentic pipeline benchmark"}
  ],

  "planner_cot": "The task asks for a count-constrained list of strategies. ...",

  "wiggum_eval_log": [{
    "round": 1, "score": 7.8,
    "dims": {"relevance": 9, "completeness": 8, "depth": 6,
             "grounded": 8, "specificity": 7, "structure": 9},
    "issues": ["depth penalized: strategies listed without parameter-level detail"],
    "feedback": "the response identifies sliding window, retrieval chunking, and..."
  }],

  "final": "FAIL"
}

Knowledge gaps. The knowledge_gaps field in the plan is the most direct signal in the entire record. It represents what the planner explicitly declared it didn't know before researching — the questions the agent went looking for answers to. These are not user-supplied topics; they are the harness's own assessment of where its knowledge is insufficient.

Evaluation feedback. The wiggum_eval_log carries per-dimension scores, a list of identified issues, and a natural-language feedback string for each evaluation round. The dimension that scores lowest most consistently across task types tells you where the synthesis instruction fails structurally, which is precisely what autoresearch should target next.

Chain-of-thought traces. planner_cot and synth_cot are the raw reasoning strings from the planner and synthesizer respectively. The planner CoT contains tacit assumptions about task decomposition — which sub-questions it thought were load-bearing, which it treated as derivable, which it skipped. These are not visible in the final output and are rarely examined. They are the richest source of content for posts that document harness reasoning rather than harness architecture.
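Every function later in this post takes a list of run dicts as input. A minimal sketch of a loader for the append-only file follows. The name load_runs and its last parameter are assumptions for illustration, not part of the harness, and skipping unparseable lines is one possible policy rather than the logger's documented behavior.

```python
import json
from pathlib import Path
from typing import Optional


def load_runs(path: str = "data/runs.jsonl",
              last: Optional[int] = None) -> list[dict]:
    """Load run records from an append-only JSONL file.

    `last` (hypothetical) keeps only the N most recent records; because the
    file is append-only, the most recent records are at the end.
    """
    runs: list[dict] = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip a partially written or corrupt line
            if isinstance(record, dict):
                runs.append(record)
    if last is not None:
        runs = runs[-last:] if last > 0 else []
    return runs
```

Reading the whole file and slicing is the simplest approach; for a very large log, reading from the end of the file would avoid parsing records that are then discarded.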

J2 — Signal 1: Query Clusters as Lit-Review Seeds

Over hundreds of runs, the search_queries and tool_calls[*].query fields accumulate a corpus of questions the agent invented but may never have found satisfactory answers to. Runs that ended with final: "FAIL" or low depth scores are the most informative: they represent queries where the web search returned insufficient material and the agent synthesized from inadequate grounding.

Clustering these queries by semantic similarity surfaces topic areas where the agent repeatedly goes looking but comes back weak. Each cluster is a candidate lit-review topic — the agent's behavior is revealing a gap in the knowledge base that would improve future runs if filled.

import re
from collections import Counter

# Frequent four-plus-letter words that carry no topical signal.
STOPWORDS = {"with", "from", "that", "this", "using", "have", "been"}

def extract_query_clusters(runs: list[dict],
                           min_score_threshold: float = 8.5,
                           top_n: int = 10) -> list[dict]:
    """Return the N most-recurrent query topics from failing or weak runs."""
    weak_queries: list[str] = []
    for run in runs:
        # Last recorded score; a run with no scores counts as 0 (weak).
        score = (run.get("wiggum_scores") or [0])[-1]
        if score >= min_score_threshold:
            continue  # only mine runs that didn't fully satisfy the evaluator

        plan = run.get("plan") or {}
        weak_queries.extend(plan.get("search_queries", []))
        for tc in run.get("tool_calls", []):
            # Skip malformed tool calls that carry no query string.
            if tc.get("name") == "web_search" and tc.get("query"):
                weak_queries.append(tc["query"])

    # Single-word frequency as a cheap proxy for semantic clustering
    tokens = []
    for q in weak_queries:
        tokens.extend(re.findall(r'\b[a-z]{4,}\b', q.lower()))
    top_terms = Counter(tokens).most_common(top_n * 3)

    # Group queries by most frequent terms; each query joins at most one cluster
    clusters = []
    seen: set[str] = set()
    for term, count in top_terms:
        if term in STOPWORDS:
            continue
        matching = [q for q in weak_queries if term in q.lower() and q not in seen]
        if len(matching) >= 2:
            seen.update(matching)
            clusters.append({
                "term": term,
                "count": count,
                "example_queries": matching[:3],
                "suggested_lit_review": f'oh /lit-review "{term} agentic systems" --after 2024-06',
            })
        if len(clusters) >= top_n:
            break
    return clusters

The knowledge_gaps field requires even less processing. The planner already did the semantic work — it declared specific propositions it lacked evidence for. A simple aggregation of knowledge_gaps across low-scoring runs produces a ranked list of propositions the harness doesn't know how to address, each of which is a direct `/lit-review` prompt.
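The aggregation described above can be sketched as follows. The function name, the 8.5 cutoff (borrowed from extract_query_clusters), and the case and whitespace normalization are assumptions for illustration. Gaps that are worded differently but mean the same thing will not merge under this simple scheme.

```python
from collections import Counter


def aggregate_knowledge_gaps(runs: list[dict],
                             max_score: float = 8.5,
                             top_n: int = 10) -> list[tuple[str, int]]:
    """Rank planner-declared knowledge gaps by how many weak runs declared them."""
    counts: Counter = Counter()
    for run in runs:
        # Last recorded score; a run with no scores counts as 0 (weak).
        score = (run.get("wiggum_scores") or [0])[-1]
        if score >= max_score:
            continue
        gaps = (run.get("plan") or {}).get("knowledge_gaps") or []
        # Normalize case and whitespace, and count each gap once per run.
        normalized = {" ".join(str(gap).lower().split()) for gap in gaps}
        for key in normalized:
            if key:
                counts[key] += 1
    return counts.most_common(top_n)
```

Each returned pair is a normalized gap string and the number of weak runs that declared it, so the top entries can be pasted into a /lit-review prompt after a quick read-through.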

J3 — Signal 2: Per-Task-Type Score Distributions as Autoresearch Targets

The autoresearch loop's --tasks flag is currently set manually: the operator decides which eval tasks to optimize against. But runs.jsonl records task_type and per-dimension scores for every production run, not just eval suite runs. The harness therefore has a continuously updated picture of which task types it underperforms on in the wild, not just in the eval suite.

from collections import defaultdict

def weakness_map(runs: list[dict],
                 min_runs: int = 5) -> list[dict]:
    """Return task types ranked by mean depth score, ascending (worst first)."""
    depths_by_type: dict[str, list[float]] = defaultdict(list)
    runs_by_type: dict[str, int] = defaultdict(int)

    for run in runs:
        task_type = run.get("task_type", "unknown")
        if not task_type:
            continue
        # Pool depth scores from every evaluation round of this run.
        depths = [float(entry["dims"]["depth"])
                  for entry in run.get("wiggum_eval_log", [])
                  if (entry.get("dims") or {}).get("depth") is not None]
        if depths:
            depths_by_type[task_type].extend(depths)
            runs_by_type[task_type] += 1  # count runs, not evaluation rounds

    ranked = []
    for ttype, depths in depths_by_type.items():
        if runs_by_type[ttype] < min_runs:
            continue
        ranked.append({
            "task_type": ttype,
            "mean_depth": round(sum(depths) / len(depths), 2),
            "n_runs": runs_by_type[ttype],
            "worst_score": min(depths),
        })

    return sorted(ranked, key=lambda r: r["mean_depth"])

The output maps directly to autoresearch task selection. If task_type: "prose_best_practices" has the lowest mean depth score across 40 production runs, that is the task type autoresearch should target next, and the corresponding eval task (T_B or T_D, depending on task fingerprint) is the right --tasks argument. The skill creates a feedback loop between production behavior and optimization target selection, a loop that does not currently exist.

The dimension to sort by matters. Depth has the highest weight (0.25) in the composite score and is the dimension most sensitive to instruction quality — it's the one autoresearch has the most leverage on. Sorting the weakness map by mean depth, not composite score, gives autoresearch the most actionable target.
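Before committing to depth as the sort key, it can be worth checking that depth really is the weakest dimension for a given task type. The sketch below computes the mean of every evaluator dimension per task type from wiggum_eval_log. The function names are hypothetical. Unlike weakness_map, it groups records with a missing task_type under "unknown" instead of skipping them.

```python
from collections import defaultdict


def dimension_means(runs: list[dict]) -> dict[str, dict[str, float]]:
    """Mean score per evaluator dimension, grouped by task type."""
    totals: dict = defaultdict(lambda: defaultdict(list))
    for run in runs:
        task_type = run.get("task_type") or "unknown"
        for entry in run.get("wiggum_eval_log") or []:
            for dim, value in (entry.get("dims") or {}).items():
                if isinstance(value, (int, float)):
                    totals[task_type][dim].append(float(value))
    return {
        ttype: {dim: round(sum(vals) / len(vals), 2) for dim, vals in dims.items()}
        for ttype, dims in totals.items()
    }


def weakest_dimension(means: dict[str, float]) -> str:
    """Name of the dimension with the lowest mean score."""
    return min(means, key=means.get)
```

If weakest_dimension returns something other than depth for a task type, that is a hint that the depth-sorted weakness map may be pointing autoresearch at the wrong lever for that type.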

Figure (J2–J4). Telemetry Router: one read, three downstream loops. A single pass over runs.jsonl extracts three signal types and routes each to the downstream system best positioned to act on it.

J4 — Signal 3: Curated Pairs for the Data Flywheel

The data flywheel described in Section G requires preference pairs: a "chosen" output (high-scoring run on a given task) and a "rejected" output (low-scoring run on the same or comparable task). Currently there is no principled selection of which runs belong in training data — all runs could theoretically be included, but training on noisy low-quality pairs degrades rather than improves the model.

The telemetry router provides the curation step. For each task type, it identifies the top-scoring runs (chosen candidates) and the lowest-scoring runs that attempted the same task type (rejected candidates), filters for pairs where the score gap exceeds a threshold, and writes them in NeMo RL's DPO manifest format:

def build_flywheel_pairs(runs: list[dict],
                         min_gap: float = 1.5,
                         chosen_threshold: float = 9.0,
                         rejected_ceiling: float = 7.5) -> list[dict]:
    """Curate (chosen, rejected) pairs from runs.jsonl for DPO training."""
    from collections import defaultdict
    by_type: dict[str, list[dict]] = defaultdict(list)

    for run in runs:
        task_type = run.get("task_type")
        scores = run.get("wiggum_scores") or []
        content = run.get("final_content", "")  # synthesized output, first 16k chars
        task = run.get("task", "")
        # Skip unscored runs: defaulting them to 0 would mislabel them as "rejected".
        if not task_type or not content or not scores:
            continue
        by_type[task_type].append({"score": scores[-1], "task": task,
                                   "content": content, "run_id": run.get("run_id")})

    pairs = []
    for task_type, candidates in by_type.items():
        chosen = [c for c in candidates if c["score"] >= chosen_threshold]
        rejected = [c for c in candidates if c["score"] <= rejected_ceiling]
        for ch in chosen:
            for rej in rejected:
                if ch["score"] - rej["score"] >= min_gap:
                    pairs.append({
                        "task_type": task_type,
                        "prompt": ch["task"],
                        "chosen": ch["content"],
                        "rejected": rej["content"],
                        "score_gap": round(ch["score"] - rej["score"], 2),
                        "chosen_run_id": ch["run_id"],
                        "rejected_run_id": rej["run_id"],
                    })
    return pairs

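The min_gap threshold matters. DPO training on pairs with a score gap below ~1.5 provides weak signal: the model has no clear "chosen" direction to move toward. Pairs with gaps above 2.0 are the most useful; pairs where the gap is below 0.5 should be excluded entirely. With the default thresholds (chosen at or above 9.0, rejected at or below 7.5), every candidate pair already clears a 1.5 gap, so min_gap only takes effect when those thresholds are loosened. The router's filtering step is what makes the flywheel trainable rather than just large.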
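The gap bands named above can be made explicit with a small helper, which is useful for checking how a candidate pair set is distributed before training. This is a sketch: the tier names and the function names are hypothetical, and the band between 1.5 and 2.0 is labeled "usable" only because the text does not name it.

```python
from collections import Counter


def gap_tier(score_gap: float) -> str:
    """Bucket a preference pair by its chosen-minus-rejected score gap."""
    if score_gap < 0.5:
        return "exclude"   # too close to call; drop entirely
    if score_gap < 1.5:
        return "weak"      # below the default min_gap; weak training signal
    if score_gap <= 2.0:
        return "usable"
    return "strong"        # gaps above 2.0 are the most useful


def tier_counts(pairs: list[dict]) -> dict[str, int]:
    """Count pairs per tier, reading the `score_gap` key of each pair."""
    return dict(Counter(gap_tier(p["score_gap"]) for p in pairs))
```

A pair set dominated by the "usable" tier with few "strong" pairs suggests the chosen and rejected thresholds are too close together for the current score distribution.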

Academic grounding: DPO was introduced by Rafailov et al. (2023), and follow-up work (Tunstall et al., 2023; Xu et al., 2024) is the basis for the working assumption here that preference-pair quality matters more than quantity: a curated set of around 500 high-gap pairs is expected to outperform 5,000 random pairs from the same distribution. The harness generates on the order of 10–50 runs per day in active use; after 30–60 days of accumulation, the gap-filtered pair set from runs.jsonl should be large enough to produce a measurable improvement on held-out tasks.

J5 — The CoT Trace as a Content Source

The planner_cot field is the least-mined signal in runs.jsonl. It contains the planner's raw reasoning about how to decompose a task: which sub-questions it treated as load-bearing, which it considered derivable from other answers, which it skipped as out-of-scope. These decisions are not visible in the final output and are almost never examined.

They are, however, exactly the kind of content that makes useful blog posts: not "here is the architecture," but "here is why the planner made this particular choice in this particular situation." A planner that consistently skips certain sub-questions on prose tasks is revealing a structural assumption about what "completeness" means — one that might be worth making explicit and questioning.

The routing skill handles this by flagging runs where the planner CoT contains unusual patterns: tasks where more than N sub-questions were declared but fewer than N/2 were pursued, runs where the planner explicitly noted a knowledge gap but did not schedule a search query to fill it, or runs where the planner CoT is substantially longer than average (indicating the task triggered unusual deliberation). These flagged runs are candidates for manual review as blog post source material, not for automated downstream processing.

def flag_unusual_cot(runs: list[dict],
                     cot_length_percentile: float = 0.9) -> list[dict]:
    """Flag runs with anomalous planner CoT for manual review."""
    cot_lengths = sorted(len(r["planner_cot"]) for r in runs
                         if r.get("planner_cot"))
    if not cot_lengths:
        return []

    # Clamp the index so a percentile of 1.0 cannot run past the end of the list.
    idx = min(int(len(cot_lengths) * cot_length_percentile), len(cot_lengths) - 1)
    threshold = cot_lengths[idx]

    flagged = []
    for run in runs:
        cot = run.get("planner_cot") or ""
        plan = run.get("plan") or {}
        gaps = plan.get("knowledge_gaps") or []
        queries = plan.get("search_queries") or []

        reasons = []
        if len(cot) > threshold:
            reasons.append("unusually long CoT: complex task deliberation")
        # Coarse check: fewer planned searches than declared gaps.
        if len(gaps) > 2 and len(queries) < len(gaps):
            reasons.append(f"{len(gaps)} knowledge gaps declared but only "
                           f"{len(queries)} searches scheduled")

        if reasons:
            flagged.append({
                "run_id": run.get("run_id"),
                "task":   run.get("task", ""),
                # Last recorded score; 0 when the run has no scores.
                "score":  (run.get("wiggum_scores") or [0])[-1],
                "reasons": reasons,
                "cot_excerpt": cot[:400],
            })
    return flagged
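The count comparison in flag_unusual_cot cannot tell which gap went unaddressed. A rough per-gap check, sketched below, treats a gap as covered when some planned query shares at least two content words with it. The function names, the stopword list, and the two-word overlap threshold are all assumptions; word overlap is a crude stand-in for semantic matching and will miss paraphrases.

```python
import re

# Hypothetical stopword list; extend as needed.
_STOPWORDS = {"with", "from", "that", "this", "using", "have", "been"}


def _content_tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of four or more characters, minus stopwords."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower())
            if len(t) >= 4 and t not in _STOPWORDS}


def uncovered_gaps(run: dict, min_overlap: int = 2) -> list[str]:
    """Return declared knowledge gaps that no planned search query appears to target."""
    plan = run.get("plan") or {}
    query_tokens = [_content_tokens(q) for q in plan.get("search_queries") or []]
    uncovered = []
    for gap in plan.get("knowledge_gaps") or []:
        gap_tokens = _content_tokens(gap)
        # A gap counts as covered if any single query shares enough content words.
        if not any(len(gap_tokens & qt) >= min_overlap for qt in query_tokens):
            uncovered.append(gap)
    return uncovered
```

Runs where this returns a non-empty list are candidates for the "gap declared, no search scheduled" flag, with the specific gap attached for the reviewer.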

J6 — The Skill as a Harness Slash Command

The routing skill is a natural fit for the harness skill system: it reads a local file, produces structured output, and has no side effects beyond writing its own report files. Invoked as oh /telemetry-router --last 500, it reads the most recent 500 runs from data/runs.jsonl and writes three output files:

data/lit_review_seeds.md — ranked query clusters with a ready-to-run /lit-review command for each, plus aggregated knowledge gaps from low-scoring runs.
data/autoresearch_targets.md — weakness map sorted by mean depth score, with recommended --tasks arguments for the next autoresearch run and the specific evaluator issues driving each weakness.
data/flywheel_pairs.jsonl — curated DPO preference pairs, gap-filtered, in NeMo RL manifest format, ready to pass to uv run python examples/run_dpo.py.

Each output is self-contained: the lit review seeds file is readable as a planning document, the autoresearch targets file is paste-able into a terminal, and the flywheel pairs file is directly consumable by NeMo RL without transformation.
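Writing the pairs file can be sketched as a plain JSONL dump, one pair per line. The helper name write_jsonl is hypothetical, and the sketch serializes whatever keys each pair dict carries. It does not verify that those keys match what NeMo RL expects, which should be checked against the NeMo RL documentation. Writing to a temporary file and renaming it avoids leaving a half-written file if the process is interrupted.

```python
import json
from pathlib import Path


def write_jsonl(records: list[dict], path: str) -> int:
    """Write records as one JSON object per line; return the number written."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    tmp = out.with_suffix(out.suffix + ".tmp")
    count = 0
    with tmp.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    tmp.replace(out)  # swap the finished file into place
    return count
```

A typical call would be write_jsonl(pairs, "data/flywheel_pairs.jsonl"), where pairs is the list returned by build_flywheel_pairs.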

The routing skill has no inference cost — it reads a local JSONL file and applies Python logic. A full analysis of 500 runs completes in under two seconds. This makes it cheap to run frequently: after every autoresearch session, after a new batch of production runs, or on a cron schedule. The downstream systems it feeds (autoresearch, lit-review, NeMo RL) all have high compute costs — the router's value is precisely that it minimizes how often those systems need to run by giving them better-targeted inputs.

J7 — The Meta-Pattern: Operational Telemetry as a Compound Signal

The deeper principle is that any pipeline that records its own behavior generates training signal as a byproduct of operation. The harness doesn't need a separate data collection step; every production run is simultaneously a performance record (for autoresearch), a knowledge gap declaration (for lit-review), and a preference example (for the data flywheel). The bottleneck is not data collection — it's signal routing.

This is structurally similar to what RLHF practitioners call "online" vs. "offline" learning: offline methods collect data first and train later, while online methods learn from the data generated during deployment. The telemetry router makes the harness online in a weak sense — not in the sense of gradient updates during inference, but in the sense that every production run immediately generates actionable signals for all three improvement loops, and those signals are ready to consume as soon as the routing skill is invoked.

The compounding effect is what makes this valuable over time. An autoresearch run seeded by the weakness map improves the synthesis instruction. Better instructions produce higher-scoring runs. Higher-scoring runs produce better chosen/rejected pairs in the flywheel. Better flywheel pairs fine-tune the producer model. A fine-tuned producer produces outputs that search for different things — generating new query clusters for the lit-review pipeline. Each loop tightens the others.

What the Literature Leaves Open

When query clusters are used as lit-review seeds, the resulting literature may address the surface topic without addressing the specific sub-question that generated the cluster. How should the routing skill distinguish queries that signal a vocabulary gap (the agent doesn't know the right search terms) from queries that signal a knowledge gap (the right terms exist but the relevant literature hasn't been retrieved)?
DPO training on harness-curated pairs uses the Wiggum evaluator score as the ground truth for "chosen" vs. "rejected." When the Wiggum evaluator itself has blind spots (dimensions it consistently over- or under-weights), does the flywheel amplify those blind spots into the producer model, and how many flywheel cycles does it take before this becomes detectable?
The weakness map sorts task types by mean depth score across production runs. Production runs have a different distribution than eval suite runs: they cover the tasks users actually request, which may differ systematically from the eval tasks autoresearch optimizes against. Does optimizing against production weakness clusters improve eval suite scores, or does it improve production performance while leaving eval scores flat?
Planner CoT flagging identifies unusual deliberation patterns, but the unusual patterns are not labeled as good or bad — a long CoT might indicate good reasoning about a hard task or confused reasoning about an easy one. What annotation interface (human review, LLM-as-judge, or automated rubric) would most efficiently convert flagged runs into usable blog post raw material?
The telemetry router is a read-only skill with no inference cost, making it safe to run frequently. But its outputs are only useful if they are acted on: lit-review seeds that aren't run, autoresearch targets that aren't scheduled, and flywheel pairs that aren't trained on provide no value. What scheduling or alerting mechanism would ensure the router's outputs are consumed rather than accumulating as unused reports?
