The Harness Data Model

Six JSONL files, one entity hierarchy, and everything tokens_by_stage is actually measuring.

May 23, 2026 • Agentic Harness Engineering

K1 — The Entity Hierarchy

Every object the harness produces, from a single LLM message to a multi-session research project, lives somewhere in a four-level hierarchy:

Project ── projects.jsonl
  └── Session ── sessions.jsonl
      └── Run ── runs.jsonl (one record per completed run)
          ├── Artifact ── artifacts.jsonl (output files, traces, datasets)
          ├── Message ── messages.jsonl (every LLM prompt + response)
          ├── Plan ── plans.jsonl (planner output before search)
          └── Observation ── data/memory.db (ChromaDB + SQLite FTS5)

All IDs share one format: a UTC timestamp prefix followed by 12 hex chars from a random UUID.

# harness/schema.py
import uuid
from datetime import UTC, datetime  # datetime.UTC requires Python 3.11+

def make_id() -> str:
    ts = datetime.now(UTC).strftime("%Y%m%dT%H%M%SZ")
    return f"{ts}-{uuid.uuid4().hex[:12]}"

# 20260517T143022Z-a1b2c3d4e5f6
# └── timestamp ─┘ └─ suffix ─┘
#   sortable by     12 hex chars of a uuid4,
#   wall time       collision-resistant

The timestamp prefix makes every ID sortable by creation time, to one-second resolution, without a secondary sort column, and makes cross-file joins unambiguous: you always know when a record was created just by reading its ID. The run_id field threads through every downstream record: every Message, Artifact, Plan, and Observation written during a run carries the same run_id, enabling full-fidelity reconstruction of any run from the flat JSONL files alone.
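As a minimal sketch of what the ID format buys you, the snippet below recovers the creation time from an ID and shows that a plain string sort is already chronological. The helper name id_timestamp is hypothetical and is not part of harness/schema.py.

```python
from datetime import datetime, timezone

def id_timestamp(record_id: str) -> datetime:
    """Recover the creation time from the timestamp prefix of a harness ID."""
    prefix = record_id.split("-", 1)[0]            # e.g. "20260517T143022Z"
    parsed = datetime.strptime(prefix, "%Y%m%dT%H%M%SZ")
    return parsed.replace(tzinfo=timezone.utc)

ids = [
    "20260517T143110Z-b5c6d7e8f9a0",
    "20260501T000000Z-0123456789ab",
    "20260517T143022Z-a1b2c3d4e5f6",
]

# The prefix is fixed-width and zero-padded, so lexicographic order
# matches chronological order without parsing anything.
chronological = sorted(ids)
created = id_timestamp("20260517T143022Z-a1b2c3d4e5f6")
```

Two IDs minted within the same second fall back to the order of their random suffixes, so the sort is only meaningful down to one second.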

K2 — The Run Record

Every completed run appends exactly one JSON object to data/runs.jsonl. The record is the canonical audit trail for a run — the dashboard, telemetry router, autoresearch loop, and DPO curator all read from it. Here is the full top-level schema, annotated by concern:

{
  // ── Identity ───────────────────────────────────────────────
  "run_id":           "20260517T143022Z-a1b2c3d4e5f6",
  "session_id":       "20260517T140000Z-...",
  "project_id":       "20260501T000000Z-...",
  "parent_run_id":    "",          // set by orchestrator for subtasks
  "experiment_id":    "",          // set by autoresearch loop
  "treatment_level":  "",          // autoresearch synth variant label
  "task_id":          "",          // queue item ID (API path)

  // ── Task ───────────────────────────────────────────────────
  "timestamp":        "2026-05-17T14:30:22.000Z",
  "task":             "survey speculative decoding papers from 2025",
  "task_type":        "research",  // classified by Wiggum
  "producer_model":   "pi-qwen3.6",
  "evaluator_model":  "Qwen3-Coder:30b",

  // ── Timing ─────────────────────────────────────────────────
  "run_duration_s":   47.3,        // wall clock: start → finish() call
  "total_eval_ms":    31200.0,     // sum of LLM generation time across all stages
  "total_prompt_ms":  4100.0,      // sum of prefill time across all stages

  // ── Token accounting ───────────────────────────────────────
  "input_tokens":     14200,       // sum across all stages
  "output_tokens":    3800,        // sum across all stages (incl. thinking tokens)
  "total_tokens":     18000,
  "generation_tok_s": 121.8,       // output_tokens / (total_eval_ms / 1000)
  "total_thinking_chars": 4320,    // CoT chars across all stages

  "tokens_by_stage":  { ... },     // see K3

  // ── Research ───────────────────────────────────────────────
  "tool_calls":       [...],       // [{name, query, result_chars, urls}]
  "total_search_chars": 48200,
  "quality_floor_hit":  false,     // true if total_search_chars < 1800
  "memory_hits":      3,
  "memory_context_titles": [...],
  "injection_stripped": 0,
  "files_read":       [],
  "vision_images":    [],
  "plan":             { ... },     // planner output; see K5

  // ── Output ─────────────────────────────────────────────────
  "output_path":      "data/eval/survey_speculative_decoding_20260517T143022Z.md",
  "output_lines":     312,
  "output_bytes":     18420,
  "final_content":    "...",       // first 16 000 chars of the output
  "synth_forced":     false,       // true if synthesis was forced despite low search quality

  // ── Evaluation ─────────────────────────────────────────────
  "wiggum_rounds":    2,
  "wiggum_scores":    [6.8, 8.4],
  "wiggum_dims":      [{"relevance":7.0,"completeness":6.5,...}, ...],
  "wiggum_eval_log":  [...],       // full per-round eval trace
  "synth_cot":        [...],       // thinking blocks from synthesis calls
  "planner_cot":      [...],       // thinking blocks from planner call
  "trajectory":       [...],       // ordered steps: {seq, stage, thinking, tool, query}
  "final":            "PASS",

  // ── Orchestration ──────────────────────────────────────────
  "orchestrated":       false,
  "orchestration_style": null,
  "allow_parallelism":   null,
  "subtask_count":       0,
  "subtask_results":     [],

  // ── Value accounting ───────────────────────────────────────
  "tac_hours":  2.5,               // estimated human researcher-hours for same task
  "leverage":   18.3,              // (tac_s × quality_norm) / (runtime_s + cost_s)

  // ── Misc ───────────────────────────────────────────────────
  "screenshots":        [],
  "count_check_retry":  false,
  "validation_passed":  null
}

On leverage: when tac_hours is present (from the tac_estimate LLM call), leverage = (tac_s × quality/10) / (runtime_s + cost_equivalent_s), where tac_s = tac_hours × 3600. The result is a dimensionless ratio of human-equivalent output to machine time plus cost. Without TAC, a proxy is computed from wiggum_score × output_lines / runtime_hours.
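The TAC-based formula can be sketched as a small function. This is an illustration under stated assumptions, not the harness's own code: the function name and the choice of which score feeds quality are hypothetical, and the inputs are round illustrative numbers rather than values from the sample record, whose cost term is not shown.

```python
def leverage(tac_hours: float, quality: float, runtime_s: float,
             cost_equivalent_s: float = 0.0) -> float:
    """Human-equivalent seconds of output per second of machine time plus cost.

    quality is assumed to be a 0-10 score; how the harness converts
    monetary cost into cost_equivalent_s is not specified in the text.
    """
    tac_s = tac_hours * 3600.0        # estimated human researcher time, in seconds
    quality_norm = quality / 10.0     # scale the score to 0..1
    return (tac_s * quality_norm) / (runtime_s + cost_equivalent_s)

# Illustrative inputs: 1 human-hour, perfect score, 100 s runtime, 100 s cost-equivalent.
example = leverage(tac_hours=1.0, quality=10.0, runtime_s=100.0,
                   cost_equivalent_s=100.0)
```

Because the numerator and denominator are both in seconds, the ratio stays dimensionless whatever the cost conversion turns out to be.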

K3 — Token Accounting: tokens_by_stage

The top-level input_tokens / output_tokens tell you the run total. tokens_by_stage is the breakdown that makes those numbers actionable: it tells you which part of the pipeline is burning compute, whether the evaluator dominates synthesis, and — after the accounting fixes in this commit — how fast each model is actually generating tokens.

Each key in tokens_by_stage is a pipeline stage name. The stages produced by a typical research run:

Stage               Purpose
planner             Search query + gap generation
search_query        Per-query reformulation calls
compress_knowledge  Long-context summarisation
synth               Primary synthesis call
synth_count         Output length enforcement
tool_loop           Browser / file tool calls
wiggum_eval         Quality scoring (evaluator model)
wiggum_revise       Revision after eval feedback
summarize_eval      Context compression for evaluator
summarize_revise    Surgical compression for revision
memory_compress     Observation compression for store
tac_estimate        Human researcher time estimate
memory              Context-block injections
annotate            Lit-review paper annotation
cluster             Lit-review theme clustering
synthesize          Lit-review cross-cluster synthesis

Each stage entry has this schema:

"synth": {
  "input":          9400,     // tokens fed in (prompt tokens for this call)
  "output":         1820,     // tokens generated
  "calls":          1,        // number of LLM calls to this stage
  "total_ms":       14800.0,  // wall time: request dispatch → response received
  "eval_ms":        13200.0,  // generation time (first token → stream end)
  "prompt_ms":      1600.0,   // prefill time (request → first token, TTFT)
  "thinking_chars": 2840,     // chars of CoT in the response, if any
  "tok_s":          137.9     // output / (eval_ms / 1000) — generation speed
}

The timing split matters. total_ms = prompt_ms + eval_ms + overhead. eval_ms is the generation phase — the part that scales with output length and determines throughput. prompt_ms is prefill — the part that scales with context length and determines TTFT. A high prompt_ms with a low eval_ms means the bottleneck is context processing, not generation; that's a signal to compress or cache the prompt, not to switch to a faster sampler.
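To make the timing split concrete, here is a small sketch that takes one stage entry and reports the overhead, the generation speed, and which phase dominates. The function name stage_timing and the "prefill"/"generation" labels are hypothetical; the input is the synth entry shown above.

```python
def stage_timing(stage: dict) -> dict:
    """Split one tokens_by_stage entry into overhead, throughput, and bottleneck."""
    prompt_ms = stage.get("prompt_ms", 0.0)
    eval_ms = stage.get("eval_ms", 0.0)
    total_ms = stage.get("total_ms", 0.0)
    return {
        # total_ms = prompt_ms + eval_ms + overhead, so overhead is the remainder
        "overhead_ms": total_ms - prompt_ms - eval_ms,
        # generation speed; undefined when no generation time was recorded
        "tok_s": stage.get("output", 0) / (eval_ms / 1000.0) if eval_ms > 0 else None,
        # prefill-bound stages call for prompt compression or caching
        "bottleneck": "prefill" if prompt_ms > eval_ms else "generation",
    }

synth = {
    "input": 9400, "output": 1820, "calls": 1,
    "total_ms": 14800.0, "eval_ms": 13200.0, "prompt_ms": 1600.0,
    "thinking_chars": 2840,
}
report = stage_timing(synth)
```

For this entry the stage is generation-bound with no measurable overhead, and the computed speed rounds to the stored tok_s of 137.9.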

Previously invisible stages: before the accounting fix, two LLM calls contributed zero tokens to any run record. tac_estimate (the human-time estimation call) and memory_compress (the compression call that creates the stored observation) both made ollama.chat() calls whose response metadata (token counts and timings) was discarded immediately. For a model like glm4:9b on a 600-char excerpt prompt, memory_compress typically consumes 800–1200 input tokens per run; across 100 runs that is up to 120k tokens of invisible compute.

Run-Level Derived Metrics

The following fields are computed at trace.finish() time from the accumulated stage data and written to every run record going forward:

Field                 Type   Meaning
total_tokens          int    input_tokens + output_tokens
total_eval_ms         float  Sum of eval_ms across all stages (total generation time)
total_prompt_ms       float  Sum of prompt_ms across all stages (total prefill time)
generation_tok_s      float  output_tokens / (total_eval_ms / 1000), the run-level throughput
total_thinking_chars  int    Sum of thinking_chars across all stages

Subtracting (total_eval_ms + total_prompt_ms) / 1000 from run_duration_s gives the wall-clock time spent outside LLM calls: web search latency, disk I/O, embedding, ChromaDB queries, memory retrieval, and Python overhead. In the sample record from K2 that is 47.3 − 35.3 = 12.0 s, so LLM time is about 75% of the run. On a typical research run, LLM time accounts for 75–85% of run_duration_s; the remainder is dominated by search latency, mostly DuckDuckGo round-trips.
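The derivation of these run-level fields from stage data can be sketched as follows. This is an assumption-laden illustration rather than the body of trace.finish(): the function name, the non_llm_s and llm_share keys, and the two-stage sample (including the planner numbers) are all hypothetical.

```python
def derive_run_metrics(tokens_by_stage: dict, run_duration_s: float) -> dict:
    """Aggregate per-stage accounting into run-level totals."""
    stages = [s for s in tokens_by_stage.values() if isinstance(s, dict)]
    input_tokens = sum(s.get("input", 0) for s in stages)
    output_tokens = sum(s.get("output", 0) for s in stages)
    total_eval_ms = sum(s.get("eval_ms", 0.0) for s in stages)
    total_prompt_ms = sum(s.get("prompt_ms", 0.0) for s in stages)
    llm_s = (total_eval_ms + total_prompt_ms) / 1000.0
    return {
        "total_tokens": input_tokens + output_tokens,
        "total_eval_ms": total_eval_ms,
        "total_prompt_ms": total_prompt_ms,
        "generation_tok_s": (output_tokens / (total_eval_ms / 1000.0)
                             if total_eval_ms > 0 else None),
        "total_thinking_chars": sum(s.get("thinking_chars", 0) for s in stages),
        # Hypothetical extras: time outside LLM calls and the LLM share of the run
        "non_llm_s": run_duration_s - llm_s,
        "llm_share": llm_s / run_duration_s if run_duration_s > 0 else None,
    }

sample_stages = {
    "planner": {"input": 1000, "output": 200, "eval_ms": 2000.0,
                "prompt_ms": 500.0, "thinking_chars": 100},
    "synth":   {"input": 9400, "output": 1820, "eval_ms": 13200.0,
                "prompt_ms": 1600.0, "thinking_chars": 2840},
}
metrics = derive_run_metrics(sample_stages, run_duration_s=20.0)
```

Note that generation_tok_s is computed from the summed totals, not averaged from per-stage tok_s values, which would overweight short stages.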

K4 — The Message Log

data/messages.jsonl is the full conversation log — every prompt sent to an LLM and every response received, written as individual records in send order. It is the only file in the harness that gives you the complete reconstruction of what the model saw and said during a run.

// One record per LLM message turn
{
  "run_id":     "20260517T143022Z-a1b2c3d4e5f6",
  "session_id": "20260517T140000Z-...",
  "project_id": "20260501T000000Z-...",
  "seq":        4,                 // monotonically increasing within a run
  "role":       "user",            // system | user | assistant | context | tool
  "stage":      "synth",           // which pipeline stage produced this turn
  "content":    "You are a research synthesizer...\n\n# Task\n...",
  "cot":        null,              // thinking block (assistant turns only)
  "tool_calls": null,
  "tool_name":  null,
  "chars":      12840,
  "timestamp":  "2026-05-17T14:30:35.000Z"
}

The role field has five values. Four match the standard chat convention; the fifth is harness-specific:

Role       Meaning
system     System prompt: pipeline instructions, persona, output constraints
user       Prompt sent to the model (task, context, feedback, revision instructions)
assistant  Model response; the content field holds the answer, cot holds the thinking block
tool       Tool output injected as a message (search results, file contents)
context    Non-LLM injection logged for accounting only (memory blocks, research summaries). The chars field is used to estimate a context token count (chars / 4) that feeds the tokens_by_stage treemap.
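The chars / 4 estimate for context turns can be sketched like this. The function name is hypothetical, and rounding down with integer division is an assumption; the text only gives the ratio.

```python
def estimated_context_tokens(messages: list[dict]) -> dict[str, int]:
    """Estimate tokens injected by context turns, grouped by stage (chars / 4)."""
    chars_by_stage: dict[str, int] = {}
    for m in messages:
        if m.get("role") != "context":
            continue                      # only non-LLM injections are estimated
        stage = m.get("stage") or "unknown"
        chars_by_stage[stage] = chars_by_stage.get(stage, 0) + (m.get("chars") or 0)
    # Assumption: round down after summing, so small blocks are not lost one by one
    return {stage: chars // 4 for stage, chars in chars_by_stage.items()}

sample_messages = [
    {"role": "context", "stage": "memory", "chars": 4000},
    {"role": "context", "stage": "memory", "chars": 2001},
    {"role": "user",    "stage": "synth",  "chars": 12840},
]
context_tokens = estimated_context_tokens(sample_messages)
```

The estimate is deliberately rough: four characters per token is a common rule of thumb for English text, and real tokenizer counts will differ.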

The cot field matters for thinking-mode models. For Qwen3 with enable_thinking=True, every assistant turn carries a separate reasoning block that precedes the final answer. Storing cot separately from content means you can study the model's internal reasoning without it contaminating the answer text in downstream processing — and you can compute thinking-to-answer ratios per stage to identify where the model is spending its reasoning budget.

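# Reconstruct one run's full conversation from messages.jsonl
import pandas as pd

msgs = pd.read_json("data/messages.jsonl", lines=True)
run  = msgs[msgs.run_id == "20260517T143022Z-a1b2c3d4e5f6"].sort_values("seq")

for _, m in run.iterrows():
    # JSON nulls load as None or NaN, and NaN is truthy, so test types explicitly
    stage = m.stage if isinstance(m.stage, str) else ""
    chars = 0 if pd.isna(m.chars) else int(m.chars)
    print(f"[{m.seq:03d}] {m.role:10} {stage:15} {chars:6,}c")
    if isinstance(m.cot, str) and m.cot:
        print(f"          thinking: {len(m.cot):,} chars")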
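The thinking-to-answer ratio mentioned above can be computed per stage from message records. The sketch below works on plain dicts so it runs without the log files; the function name and the sample turns are hypothetical.

```python
from collections import defaultdict

def thinking_ratio_by_stage(messages: list[dict]) -> dict[str, float]:
    """Chars of cot divided by chars of content, summed over assistant turns per stage."""
    cot_chars: dict[str, int] = defaultdict(int)
    answer_chars: dict[str, int] = defaultdict(int)
    for m in messages:
        if m.get("role") != "assistant":
            continue                       # only assistant turns carry a cot block
        stage = m.get("stage") or "unknown"
        cot_chars[stage] += len(m.get("cot") or "")
        answer_chars[stage] += len(m.get("content") or "")
    # Skip stages with empty answers to avoid dividing by zero
    return {s: cot_chars[s] / answer_chars[s] for s in answer_chars if answer_chars[s] > 0}

sample_turns = [
    {"role": "user",      "stage": "synth",   "content": "prompt", "cot": None},
    {"role": "assistant", "stage": "synth",   "content": "b" * 100, "cot": "a" * 300},
    {"role": "assistant", "stage": "planner", "content": "c" * 50,  "cot": None},
]
ratios = thinking_ratio_by_stage(sample_turns)
```

A ratio well above 1 for a stage means the model is writing more reasoning than answer there, which is where a thinking budget would have the most effect.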

K5 — Plans and Artifacts

plans.jsonl

data/plans.jsonl records the planner's output before any search begins. One plan record is written per run (or per subtask in orchestrated runs). It captures the model's explicit belief state: what it already knows, what it doesn't, what it intends to search for, and how complex it judges the task.

{
  "plan_id":       "20260517T143022Z-plan-a1b2c3",
  "run_id":        "20260517T143022Z-a1b2c3d4e5f6",
  "task":          "survey speculative decoding papers from 2025",
  "plan_type":     "agent",       // agent | orchestrator
  "task_type":     "research",
  "complexity":    "medium",
  "search_queries": [
    "speculative decoding 2025 arxiv",
    "draft model verification throughput benchmark",
    "SpecTr EAGLE speculative sampling survey"
  ],
  "known_facts": [
    "Speculative decoding uses a small draft model + large verifier",
    "Token acceptance rate is the key throughput metric"
  ],
  "knowledge_gaps": [
    "How does EAGLE-2 compare to SpecTr on coding tasks?",
    "What's the VRAM overhead of keeping a draft model loaded?"
  ],
  "subtasks": [],
  "created_at": "2026-05-17T14:30:25.000Z"
}

knowledge_gaps is the signal the telemetry router extracts to seed literature reviews and autoresearch targeting. A gap is the model's own statement of what it believes is underspecified — which makes it a high-quality query seed, since the model generated it after reading the task and its own memory context, not from keyword heuristics.
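As a rough sketch of how gaps could be turned into query seeds, the function below collects knowledge_gaps from the most recent plans and drops case-insensitive duplicates. It is not the telemetry router's implementation; the function name, the limit parameter, and the deduplication rule are assumptions.

```python
def gap_seeds(plans: list[dict], limit: int = 10) -> list[str]:
    """Collect de-duplicated knowledge gaps from the `limit` most recent plans."""
    # ISO 8601 timestamps in one format sort correctly as plain strings
    recent = sorted(plans, key=lambda p: p.get("created_at", ""))[-limit:]
    seen: set[str] = set()
    seeds: list[str] = []
    for plan in recent:
        for gap in plan.get("knowledge_gaps") or []:
            key = gap.strip().lower()      # assumption: case-insensitive dedup
            if key and key not in seen:
                seen.add(key)
                seeds.append(gap.strip())
    return seeds

sample_plans = [
    {"created_at": "2026-05-17T14:30:25.000Z", "knowledge_gaps": ["Gap A", "Gap B"]},
    {"created_at": "2026-05-16T09:00:00.000Z", "knowledge_gaps": ["gap a ", "Gap C"]},
]
seeds = gap_seeds(sample_plans)
latest_only = gap_seeds(sample_plans, limit=1)
```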

artifacts.jsonl

Every file the harness writes during a run is registered in data/artifacts.jsonl:

{
  "artifact_id":  "20260517T143110Z-art-b5c6d7",
  "run_id":       "20260517T143022Z-a1b2c3d4e5f6",
  "type":         "output",   // output | trace | kg | annotation | dataset | lit_review
  "path":         "/abs/path/data/eval/survey_speculative_decoding_20260517T143022Z.md",
  "bytes":        18420,
  "lines":        312,
  "content_hash": null,
  "created_at":   "2026-05-17T14:31:10.000Z"
}

The type enum distinguishes the kind of artifact: output is the primary research document, trace is the Chrome Trace JSON for Perfetto, dataset is a generated fine-tuning file, lit_review is a lit-review output. The dashboard's Artifacts view groups by type; the DPO flywheel reads output artifacts for pair curation.

K6 — Sessions and the Project Layer

A session is a bounded execution context — one invocation of oh in REPL mode, or one API server instance. data/sessions.jsonl records two events per session: session_start (when the CLI launches) and session_end (when it exits or the API shuts down).

// session_start
{ "event": "session_start", "session_id": "...", "project_id": "...",
  "triggered_by": "cli", "started_at": "2026-05-17T14:00:00.000Z" }

// session_end
{ "event": "session_end", "session_id": "...", "project_id": "...",
  "ended_at": "2026-05-17T15:12:00.000Z", "runs": 4,
  "total_input_tokens": 58400, "total_output_tokens": 12800,
  "artifacts": 4, "duration_s": 4320.0 }

The project layer is the outer envelope. data/projects.jsonl is an append-only event log — creating or updating a project appends a new record rather than overwriting. The active project is resolved by checking HARNESS_PROJECT_ID, then the .harness-project dotfile, then the last active project in the log. Project-scoped aggregation is available via project_stats():

from harness.schema import project_stats, resolve_project_id

stats = project_stats(resolve_project_id())
# {
#   "runs": 47, "passes": 41, "pass_rate": 0.872,
#   "avg_score": 8.3,
#   "total_input_tokens": 682000, "total_output_tokens": 178400,
#   "artifacts": 47,
#   "artifact_types": {"output": 47, "trace": 47}
# }
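The resolution order described above (environment variable, then dotfile, then the last project in the log) can be sketched as below. This is not the body of harness.schema.resolve_project_id: the function name, its parameters, the assumption that the dotfile holds a bare ID, and the use of the last log record as "last active" are all hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Mapping, Optional

def resolve_project_id_sketch(env: Mapping[str, str], dotfile: Path,
                              project_log: list[dict]) -> Optional[str]:
    """Return the first project ID found: env var, then dotfile, then log."""
    from_env = env.get("HARNESS_PROJECT_ID", "").strip()
    if from_env:
        return from_env
    if dotfile.is_file():
        from_file = dotfile.read_text(encoding="utf-8").strip()
        if from_file:
            return from_file
    # Append-only log: the most recent record with an ID is the last one written
    for record in reversed(project_log):
        if record.get("project_id"):
            return record["project_id"]
    return None

log = [
    {"project_id": "20260401T000000Z-aaaaaaaaaaaa"},
    {"project_id": "20260501T000000Z-bbbbbbbbbbbb"},
]
missing = Path(tempfile.gettempdir()) / "no-such-dir-for-sketch" / ".harness-project"
from_env = resolve_project_id_sketch(
    {"HARNESS_PROJECT_ID": "20260601T000000Z-cccccccccccc"}, missing, log)
from_log = resolve_project_id_sketch({}, missing, log)
```

Passing the environment and log in as arguments keeps the sketch testable; a real implementation would more likely read os.environ and data/projects.jsonl directly.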

K7 — Querying the Logs

Because every file is append-only JSONL with consistent run_id keys, the whole data model can be queried with standard tools — no harness process needs to be running.
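Before reaching for external tools, note that the standard library is enough for a basic cross-file join. The sketch below reads JSONL tolerantly and joins plans to runs on run_id; it uses in-memory text so it runs without the data directory. The reader name is hypothetical, and skipping unparseable lines is an assumption about how to handle a partially written final line.

```python
import io
import json

def read_jsonl(lines):
    """Yield one dict per valid line, skipping blank and unparseable lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue                      # e.g. a truncated last line
        if isinstance(record, dict):
            yield record

runs_file = io.StringIO(
    '{"run_id": "r1", "final": "PASS"}\n'
    '\n'
    '{"run_id": "r2", "final": "FAIL"}\n'
    '{"run_id": "r3", "fin'               # simulated partial write
)
plans_file = io.StringIO('{"run_id": "r1", "knowledge_gaps": ["gap A"]}\n')

# Index runs by run_id, then look each plan up in that index
runs_by_id = {r["run_id"]: r for r in read_jsonl(runs_file)}
joined = [
    (p["run_id"], runs_by_id[p["run_id"]]["final"], p["knowledge_gaps"])
    for p in read_jsonl(plans_file)
    if p["run_id"] in runs_by_id
]
```

With real files, replace the StringIO objects with open("data/runs.jsonl", encoding="utf-8") and the matching plans path.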

jq

# Generation throughput for every PASS run (tok/s)
cat data/runs.jsonl \
  | jq 'select(.final=="PASS" and .generation_tok_s != null)
        | {task: .task[:60], tok_s: .generation_tok_s, model: .producer_model}' \
  | jq -s 'sort_by(-.tok_s)'

# Stages that consumed the most output tokens
cat data/runs.jsonl \
  | jq -r 'select(.final=="PASS") | .tokens_by_stage | to_entries[]
           | "\(.key)\t\(.value.output)\t\(.value.tok_s // "n/a")"' \
  | sort -t$'\t' -k2 -rn | head -20

# Knowledge gaps from the last 10 plans (lit-review seed candidates)
cat data/plans.jsonl \
  | jq -s 'sort_by(.created_at) | .[-10:][] | .knowledge_gaps[]'

pandas

import pandas as pd

runs = pd.read_json("data/runs.jsonl", lines=True)

# Mean tok/s by producer model
runs[runs.final == "PASS"].groupby("producer_model")["generation_tok_s"].mean()

# Thinking overhead: thinking chars as share of total output chars,
# estimating output chars as output_tokens * 4
runs["thinking_share"] = runs["total_thinking_chars"] / (runs["output_tokens"] * 4)
runs[["task", "thinking_share", "wiggum_scores"]].query("thinking_share > 0.5")

# Prefill dominance: stages where prompt_ms > eval_ms
stage_rows = []
for _, row in runs.iterrows():
    # Records without tokens_by_stage load as NaN (truthy), so check the type
    by_stage = row.tokens_by_stage if isinstance(row.tokens_by_stage, dict) else {}
    for stage, vals in by_stage.items():
        if isinstance(vals, dict):
            stage_rows.append({
                "run_id": row.run_id, "stage": stage,
                "prompt_ms": vals.get("prompt_ms", 0),
                "eval_ms":   vals.get("eval_ms", 0),
                "output":    vals.get("output", 0),
            })
# Explicit columns keep the filter below valid even when no rows were collected
stages = pd.DataFrame(stage_rows, columns=["run_id", "stage", "prompt_ms", "eval_ms", "output"])
stages[stages.prompt_ms > stages.eval_ms][["stage","prompt_ms","eval_ms"]]

DuckDB — cross-file joins

-- Install: pip install duckdb

-- Runs where the plan's knowledge gaps mention "latency"
-- joined to wiggum scores (cross-file join using run_id)
SELECT
    r.run_id,
    r.task,
    r.wiggum_scores[-1]  AS final_score,
    r.generation_tok_s,
    p.knowledge_gaps
FROM read_json_auto('data/runs.jsonl',   format='newline_delimited') r
JOIN read_json_auto('data/plans.jsonl',  format='newline_delimited') p
  ON r.run_id = p.run_id
WHERE array_to_string(p.knowledge_gaps, ' ') ILIKE '%latency%'
  AND r.final = 'PASS'
ORDER BY final_score DESC;

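-- Total tokens and mean throughput per session
-- (triggered_by lives on session_start, duration_s on session_end)
SELECT
    s.session_id,
    s.triggered_by,
    e.duration_s / 60          AS session_min,
    SUM(r.total_tokens)        AS tokens,
    AVG(r.generation_tok_s)    AS avg_tok_s,
    COUNT(r.run_id)            AS runs
FROM read_json_auto('data/sessions.jsonl', format='newline_delimited') s
LEFT JOIN read_json_auto('data/sessions.jsonl', format='newline_delimited') e
  ON e.session_id = s.session_id AND e.event = 'session_end'
JOIN read_json_auto('data/runs.jsonl',     format='newline_delimited') r
  ON r.session_id = s.session_id
WHERE s.event = 'session_start'
GROUP BY 1, 2, 3
ORDER BY tokens DESC;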

What the schema reveals about system design choices: The split between runs.jsonl (aggregated) and messages.jsonl (per-turn) mirrors the difference between a metrics database and an event log. runs.jsonl is optimised for analytics queries: pass rate, score distribution, tok/s benchmarks. messages.jsonl is optimised for replay and DPO pair construction: you can reconstruct exact prompt/response pairs, filter by stage, and export chosen/rejected pairs without touching the run record at all. The model is deliberately denormalised for read performance at analysis time, with run_id as the join key when cross-file queries are needed.
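Reconstructing prompt/response pairs from message records might look like the sketch below. The pairing rule (each assistant turn is matched with the most recent unpaired user turn in the same run) is an assumption, as is the function name; labelling pairs as chosen or rejected is left out because the text does not say how the curator decides.

```python
from typing import Optional

def prompt_response_pairs(messages: list[dict], stage: Optional[str] = None) -> list[dict]:
    """Pair each assistant turn with the preceding user turn of the same run."""
    pairs: list[dict] = []
    last_prompt: dict[str, dict] = {}     # run_id -> most recent unpaired user turn
    for m in sorted(messages, key=lambda m: (m["run_id"], m["seq"])):
        if stage is not None and m.get("stage") != stage:
            continue                      # optional filter by pipeline stage
        if m["role"] == "user":
            last_prompt[m["run_id"]] = m
        elif m["role"] == "assistant" and m["run_id"] in last_prompt:
            prompt = last_prompt.pop(m["run_id"])
            pairs.append({
                "run_id": m["run_id"], "stage": m.get("stage"),
                "prompt": prompt["content"], "response": m["content"],
            })
    return pairs

sample_log = [
    {"run_id": "A", "seq": 1, "role": "system",    "stage": "synth",       "content": "S"},
    {"run_id": "A", "seq": 2, "role": "user",      "stage": "synth",       "content": "P1"},
    {"run_id": "A", "seq": 3, "role": "assistant", "stage": "synth",       "content": "R1"},
    {"run_id": "A", "seq": 4, "role": "user",      "stage": "wiggum_eval", "content": "P2"},
    {"run_id": "A", "seq": 5, "role": "assistant", "stage": "wiggum_eval", "content": "R2"},
]
synth_pairs = prompt_response_pairs(sample_log, stage="synth")
all_pairs = prompt_response_pairs(sample_log)
```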