The Pipeline in Motion: Tracing a Task Through All Eleven Subsystems
The previous post established the central claim: harness quality dominates model selection for a
fixed task domain. This post is the evidence base for that claim—not statistics, but
mechanism. We trace a single research task from the moment the user presses Enter to the moment
a vetted markdown file lands in the output directory and a structured record lands in
runs.jsonl.
The task: python agent.py "survey speculative decoding techniques in transformer inference".
The run will pass the Wiggum quality gate in two evaluation rounds, taking approximately 210 seconds
on a single consumer GPU. Every subsystem it touches is named and explained below.
The Eleven Subsystems
The eleven subsystems are organized in three tiers. The entry tier (Entry Points, Orchestration, API Routes) handles how work arrives in the pipeline. The execution tier (Core Agent Loop, Inference, Research Tools, Skills, Memory) is where the actual work happens. The infrastructure tier (Security, Observability, Storage) provides guarantees and instrumentation that cut across all stages.
Stage 1: Entry Point (0–1s)
cli.py receives the task string and makes the first routing decision: does the
task begin with a / prefix? If so, it looks up the handler in the
_SKILLS registry. Our task does not, so it routes to the standard agent loop and
opens a RunTrace context manager. The run ID is stamped at entry:
20260522-a3f4b2c1—a UTC timestamp prefix plus UUID4 hex, sortable and
collision-resistant.
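The routing decision and the run ID format can be sketched in a few lines. This is a minimal illustration, not the harness's actual cli.py: the contents of `_SKILLS`, the `/summarize` entry, and the function names `make_run_id` and `route` are assumptions, and the ID layout is inferred from the example ID above (date prefix plus eight hex characters).

```python
import uuid
from datetime import datetime, timezone

# Hypothetical stand-in for the _SKILLS registry: maps "/name" to a handler.
_SKILLS = {"/summarize": lambda args: f"summarize:{args}"}


def make_run_id(now=None):
    """Build an ID shaped like 20260522-a3f4b2c1: UTC date plus 8 UUID4 hex chars."""
    now = now or datetime.now(timezone.utc)
    return f"{now:%Y%m%d}-{uuid.uuid4().hex[:8]}"


def route(task):
    """Return ("skill", handler) for /-prefixed tasks, else ("agent", None)."""
    if task.startswith("/"):
        name = task.split(maxsplit=1)[0]
        handler = _SKILLS.get(name)
        if handler is None:
            raise KeyError(f"unknown skill: {name}")
        return "skill", handler
    return "agent", None
```

Under these assumptions, the survey task in this post takes the second branch and falls through to the standard agent loop.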
The RunTrace is the post’s spine. Every subsequent operation appends to it:
stage transitions, LLM messages with token counts, tool calls with novelty scores, Wiggum scores
per round. When the run exits—successfully or not—finalize() writes
the complete record to data/runs.jsonl and a Chrome Trace Events file to
data/traces/.
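The property that matters here is that the record is written whether or not the run succeeds. A context manager gives that guarantee cheaply. The sketch below is an assumption about the shape of `RunTrace`, not its real implementation: the `log` method, the `status` field, and the record layout are hypothetical, and the Chrome Trace output is omitted.

```python
import json
import time
from pathlib import Path


class RunTrace:
    """Collects events for one run and appends a single JSON record on exit."""

    def __init__(self, run_id, log_path="data/runs.jsonl"):
        self.run_id = run_id
        self.log_path = Path(log_path)
        self.events = []
        self.status = "running"
        self.started = time.time()

    def __enter__(self):
        self.started = time.time()
        return self

    def log(self, kind, **fields):
        # Each event carries its offset in seconds from the start of the run.
        self.events.append({"kind": kind, "t": time.time() - self.started, **fields})

    def finalize(self):
        record = {"run_id": self.run_id, "status": self.status, "events": self.events}
        self.log_path.parent.mkdir(parents=True, exist_ok=True)
        with self.log_path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    def __exit__(self, exc_type, exc, tb):
        self.status = "ok" if exc_type is None else "error"
        self.finalize()
        return False  # do not swallow the exception; the caller still sees it
```

Because `__exit__` runs on both paths, a run that raises midway still leaves a partial record behind for diagnosis.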
Stage 2: Planning (1–16s)
The task string is intercepted before any search query is issued. planner.py sends
it to a fast 9B model with a structured extraction prompt requesting four fields:
- search_queries — targeted queries replacing the raw task string (e.g., “speculative decoding autoregressive inference speedup 2024”, “draft model verification parallel token generation”)
- known_facts — facts already in the memory store that don’t need re-fetching
- knowledge_gaps — what the pipeline does not yet know and must find
- subtasks — a decomposition hint for the orchestrator, if the task warrants parallel execution
The plan is written to data/plans.jsonl before any subtask executes. This matters
for crash recovery: if the pipeline fails during research, the plan record is already on disk and
the run can be diagnosed without re-running the planning step.
The raw task string is never issued as a search query; only the planner’s derived queries reach the search engine. This is a hard constraint, not a convention. Submitting “survey speculative decoding techniques” directly to a web search engine produces overview articles; the planner’s targeted queries produce primary sources. The 12–18 second planning latency is recovered within the first research round.
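Both rules in this stage, persist the plan before executing and never search on the raw task, are easy to enforce at the point where the plan is written. The following is a sketch under assumptions: the `Plan` dataclass mirrors the four fields listed above, but the `run_id` field, the `persist_plan` function, and the exact-match check are illustrative choices rather than the harness's code.

```python
import json
import os
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class Plan:
    run_id: str
    search_queries: list
    known_facts: list = field(default_factory=list)
    knowledge_gaps: list = field(default_factory=list)
    subtasks: list = field(default_factory=list)


def persist_plan(plan, raw_task, path="data/plans.jsonl"):
    """Validate the plan, then append it to disk before any subtask runs."""
    if not plan.search_queries:
        raise ValueError("plan has no search queries")
    normalized = raw_task.strip().lower()
    if any(q.strip().lower() == normalized for q in plan.search_queries):
        raise ValueError("raw task string must not be used as a search query")
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(plan)) + "\n")
        f.flush()
        os.fsync(f.fileno())  # make the record durable before research starts
```

An exact-match check is the weakest useful guard; a stricter variant might also reject queries that are near-duplicates of the task string.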
Academic grounding: NL2Plan (arXiv:2405.04215v2) demonstrates that structured decomposition of a natural-language task into formal planning artifacts—without expert input—outperforms submitting the raw task string directly to a planner across seven planning domains. The harness’s Planner-First pattern (B1) applies this same principle at the query-generation level: decompose first, search second, never conflate the two.
Stage 3: Research (16–65s)
agent.gather_research() issues the planner’s queries in rounds, applying
a novelty gate after each batch. The gate asks the planner model whether the batch adds
information not already represented in the accumulated knowledge state:
    Round 1: 5 queries → 5 result batches
      novelty scores: [7.2, 2.1, 6.8, 1.9, 5.4]
      gate (threshold 3.0): batches 2, 4 discarded
      3 batches pass → knowledge state updated
    Round 2: 2 targeted gap queries
      novelty scores: [6.1, 4.8]
      both pass → knowledge state updated
    Total context assembled: ~18,400 chars
    Redundant content gated: ~6,200 chars
The 6,200 characters of gated content would otherwise reach the synthesis prompt and dilute it.
Without the gate, the producer model encounters the same three papers cited by six different
sources and scores “comprehensive coverage” by repeating them; with it, the prompt
contains only non-redundant material. All results are cached in RESEARCH_CACHE keyed
by query string, so Wiggum revision rounds do not re-fetch already-retrieved content.
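The gate and the cache are both small mechanisms. The sketch below replays Round 1 from the log above with the scores supplied directly; in the pipeline they come from the planner model. The function names are hypothetical, and treating a score exactly equal to the threshold as passing is an assumption, since the post does not say which side the boundary falls on.

```python
NOVELTY_THRESHOLD = 3.0

# Keyed by query string, as described above.
RESEARCH_CACHE = {}


def apply_novelty_gate(batches, scores, threshold=NOVELTY_THRESHOLD):
    """Split batches into (kept, dropped) by novelty score."""
    if len(batches) != len(scores):
        raise ValueError("need exactly one score per batch")
    kept, dropped = [], []
    for batch, score in zip(batches, scores):
        (kept if score >= threshold else dropped).append(batch)
    return kept, dropped


def cached_search(query, search_fn):
    """Call search_fn only on a cache miss; revision rounds reuse the result."""
    if query not in RESEARCH_CACHE:
        RESEARCH_CACHE[query] = search_fn(query)
    return RESEARCH_CACHE[query]


round_1 = ["batch1", "batch2", "batch3", "batch4", "batch5"]
kept, dropped = apply_novelty_gate(round_1, [7.2, 2.1, 6.8, 1.9, 5.4])
print(kept)     # ['batch1', 'batch3', 'batch5']
print(dropped)  # ['batch2', 'batch4']
```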
Stage 4: Synthesis (65–145s)
The producer model—the largest model in the pipeline, 32B parameters or larger—receives a synthesis prompt containing: the task string, the planner’s known facts and gap list, the research context (novelty-gated), and the memory context (top four observations from the dual-backend store, retrieved by semantic + keyword search on the task). Generation takes 60–90 seconds.
Two synthesis behaviors are worth noting. First, the synthesis instruction is explicitly set to prose depth: the model is directed to produce substantive analytical writing, not stub explanations. (This was a discovered invariant: a “list the key points” instruction reliably produces shallow enumerations that fail the Depth and Specificity dimensions; a “write at the level of a competent technical reviewer” instruction does not.) Second, the producer runs with thinking mode disabled—chain-of-thought reasoning during synthesis consumes token budget without improving output quality and creates context overflow on long documents.
Stage 5: Wiggum Loop (145–215s)
The synthesis output enters the Wiggum Loop—covered in detail in the previous post. For this run:
    Round 1 → score 6.4 (Completeness: 5.5, Groundedness: 5.0 fail)
      Revision prompt: targets Completeness + Groundedness only
      Producer revises at temperature 0.3
    Round 2 → score 8.3 → PASS
The evaluator is selected deterministically from the pool by hashing the run ID. The same run, re-executed, always uses the same evaluator—reproducibility for debugging. Different runs are distributed across pool members—drift mitigation for analytics.
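Deterministic selection by hashing can be sketched as follows. The choice of SHA-256 and the pool member names are assumptions; the post only says the run ID is hashed. A cryptographic digest is used here rather than Python's built-in `hash()`, because string hashing in Python is salted per process and would not give the same answer when a run is re-executed.

```python
import hashlib


def pick_evaluator(run_id, pool):
    """Map a run ID to one pool member, stable across processes and machines."""
    if not pool:
        raise ValueError("evaluator pool is empty")
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(pool)
    return pool[index]


# Hypothetical pool; the real member names are not given in the post.
POOL = ["evaluator-a", "evaluator-b", "evaluator-c"]
first = pick_evaluator("20260522-a3f4b2c1", POOL)
again = pick_evaluator("20260522-a3f4b2c1", POOL)
print(first == again)  # True
```

One consequence of the modulo step is that changing the pool size remaps existing run IDs, so a replay is only reproducible against the same pool.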
Stage 6: Persist (215–220s)
Passing outputs go to three destinations simultaneously:
- Filesystem — outputs/20260522-183042-speculative_decoding.md
- Memory store — memory.compress_and_store() writes a compressed observation (narrative summary + fact list) to both ChromaDB (embedded with all-MiniLM-L6-v2) and SQLite FTS5 (indexed on task, title, and narrative). A prompt injection scan gates the write: web-fetched content that passed through synthesis could carry payload into future runs.
- Audit log — RunTrace.finalize() appends one JSON record to data/runs.jsonl and writes a Chrome Trace Events file.
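The gate in front of the memory write can be pictured as a scan that must come back clean before either backend is touched. This is only a sketch: the patterns are illustrative examples, not the harness's scanner, and the function names and the list of writer callables (standing in for the ChromaDB and SQLite writes) are hypothetical.

```python
import re

# Illustrative patterns only; a real scanner needs a broader, maintained list.
_SUSPICIOUS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"you are now\b", re.I),
    re.compile(r"<\s*/?\s*system\s*>", re.I),
]


def scan_for_injection(text):
    """Return the patterns that matched; an empty list means the text looks clean."""
    return [p.pattern for p in _SUSPICIOUS if p.search(text)]


def store_if_clean(observation, writers):
    """Write to every backend only if the scan finds nothing."""
    if scan_for_injection(observation):
        return False
    for write in writers:
        write(observation)
    return True
```

Scanning once and writing to both backends afterwards keeps the two stores consistent: an observation is either in both or in neither.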
The Six JSONL Files
Every stage of this run has been writing structured data to append-only JSONL files. By the time the run exits, six files carry its complete lifecycle:
projects.jsonl
Project-level context: name, path, creation timestamp. Groups runs by working directory.
sessions.jsonl
Session envelope: start time, model stack, environment snapshot. Multiple runs share one session.
plans.jsonl
Pre-execution plan record: queries, known facts, knowledge gaps, subtask decomposition. Written before any subtask executes.
runs.jsonl
The primary artifact: full run record with Wiggum scores, token counts, tool calls, novelty scores, output path, PASS/FAIL.
artifacts.jsonl
Per-file artifact registry: every temp file, output, and intermediate document produced or consumed by the run.
messages.jsonl
Full message log: every LLM exchange with role, stage, content, token count, and optional chain-of-thought text.
The JSONL format is the critical design decision. No database server, no schema migrations, full portability. A 1,500-run log is typically under 20 MB and loads into pandas in under a second. The files are append-only—a failed run still writes its partial record, making failure diagnosis possible without re-running.
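Reading these files back needs nothing beyond the standard library. The sketch below assumes a `verdict` field holding PASS or FAIL; the real key name in runs.jsonl is not given in this post. It also skips lines that fail to parse, on the assumption that a crash mid-write can leave a truncated final line.

```python
import json


def load_jsonl(path):
    """Load every well-formed record, skipping blank or truncated lines."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                rows.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # likely a partial line left by an interrupted write
    return rows


def pass_rate(rows):
    """Fraction of decided runs that passed, or None if none are decided."""
    decided = [r for r in rows if r.get("verdict") in ("PASS", "FAIL")]
    if not decided:
        return None
    return sum(r["verdict"] == "PASS" for r in decided) / len(decided)
```

For analysis in pandas, `pd.read_json(path, lines=True)` reads the same format directly, though it does not tolerate a truncated line.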
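Stage Timeline
[Figure: stage timeline. Entry Point 0–1s; Planning 1–16s; Research 16–65s; Synthesis 65–145s; Wiggum Loop 145–215s; Persist 215–220s.]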
The Keep-Alive Budget
One detail invisible in the timeline but critical to its shape: VRAM residency management. Without it, every stage transition that changes the active model would incur a 30–60 second cold-start load. The pipeline avoids this by pre-allocating model residency before the first inference call and managing it through the run.
The planner model (9B, fast) loads before the planning stage and releases when planning completes—its VRAM is needed for the producer’s context window during synthesis. The producer (32B) loads before synthesis and stays warm through revision. The evaluator loads before the first Wiggum round. On constrained hardware (8 GB VRAM), the three models load sequentially rather than simultaneously; on standard hardware (24 GB), all three stay warm throughout the run.
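The choice between keeping all models warm and loading them one at a time reduces to a budget check. The sketch below is an assumption about how such a check might look, not the harness's residency manager: the function name, the `headroom_gb` reserve, and the footprint figures in the usage lines are placeholders, not measurements of the 9B, 32B, or evaluator models.

```python
def plan_residency(footprints_gb, vram_gb, headroom_gb=1.0):
    """Return "concurrent" if all models fit at once, else "sequential".

    footprints_gb maps a model role to its estimated resident size in GB.
    headroom_gb is reserved for context and runtime overhead.
    """
    budget = vram_gb - headroom_gb
    if max(footprints_gb.values()) > budget:
        raise ValueError("largest model does not fit in the VRAM budget")
    if sum(footprints_gb.values()) <= budget:
        return "concurrent"
    return "sequential"


# Placeholder footprints for illustration only.
sizes = {"planner": 2.0, "producer": 5.0, "evaluator": 2.0}
print(plan_residency(sizes, vram_gb=24))  # concurrent
print(plan_residency(sizes, vram_gb=8))   # sequential
```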
The Chrome Trace file for this run, loaded in ui.perfetto.dev, shows two evaluation
blocks of roughly equal width—the signature of a two-round pass. A three-round run shows
a visually wider third block: the producer is generating longer revisions as it works against
dimensions that are harder to fix than the evaluator’s feedback implies.
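A Chrome Trace Events file is plain JSON, so producing one takes very little code. The sketch below emits one complete event (`"ph": "X"`) per stage, with `ts` and `dur` in microseconds as the format requires. The stage boundaries are taken from the section headings of this post; the function name and the single pid/tid are assumptions, and the real trace also breaks Stage 5 into per-round blocks, which this sketch does not.

```python
import json

# (name, start_seconds, end_seconds), from the stage headings in this post.
STAGES = [
    ("entry", 0, 1),
    ("planning", 1, 16),
    ("research", 16, 65),
    ("synthesis", 65, 145),
    ("wiggum", 145, 215),
    ("persist", 215, 220),
]


def to_trace_events(stages, pid=1, tid=1):
    """Build a Chrome Trace Events document with one complete event per stage."""
    events = []
    for name, start_s, end_s in stages:
        events.append({
            "name": name,
            "ph": "X",                                   # complete event: start + duration
            "ts": int(start_s * 1_000_000),              # microseconds
            "dur": int((end_s - start_s) * 1_000_000),   # microseconds
            "pid": pid,
            "tid": tid,
        })
    return {"traceEvents": events}


trace_json = json.dumps(to_trace_events(STAGES), indent=2)
```

Writing `trace_json` to a `.json` file gives something a trace viewer can open, with each stage drawn as a block whose width is its duration.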
What the Subsystem Map Reveals
Looking at the full architecture diagram, one pattern stands out: Security and Observability are the only subsystems that touch every other subsystem. Security checks fire on every external input (web content through the injection scanner, file paths through the path sandbox, agent-generated code through the AST guard, browser URLs through the CDP guard). Observability instruments every stage transition, every LLM call, every tool invocation.
This is not incidental. Security and observability are the two subsystems whose failure modes are most expensive: a security miss has cascading effects across future runs (an injected memory observation poisons everything that retrieves it), and an observability miss means the failure that does occur is undiagnosable. Both subsystems justify their cross-cutting implementation.
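Of the four checks listed above, the path sandbox is the simplest to illustrate. The sketch below shows one common way to implement such a check with `pathlib`; the function name and the error type are assumptions, and the harness's own guard may work differently. Resolving both paths first means `..` segments and symlinks are collapsed before the comparison.

```python
from pathlib import Path


def resolve_in_sandbox(root, user_path):
    """Resolve user_path under root, refusing anything that escapes it."""
    root = Path(root).resolve()
    # Joining with an absolute user_path discards root, so absolute paths
    # outside the sandbox are caught by the containment check below.
    candidate = (root / user_path).resolve()
    if candidate != root and root not in candidate.parents:
        raise PermissionError(f"path escapes sandbox: {user_path}")
    return candidate
```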
The next post catalogs what goes wrong when any of the other nine subsystems fails.
What the Literature Leaves Open
Several questions raised by this body of research remain unresolved — and bear directly on how the harness pipeline should be instrumented and refined:
- Are there runtime signals observable during Stage 2 planning — query decomposition time, sub-query count, entity overlap between sub-queries — that reliably predict whether the downstream failure will be a planning failure (F2) or a retrieval failure (F1), and can the harness reroute before committing to a search strategy?
- The Novelty Gate filters redundant single-hop retrievals effectively, but does it systematically miss multi-hop reasoning chains where each individual hop looks novel while the combined inference path was already traversed in a prior round?
- At what point does planning overhead — the 12–18 second latency tax of Stage 2 — become unjustifiable relative to retrieval quality gains, and is there a query-complexity threshold below which skipping the planner produces equivalent output at lower cost?
- Could the JSONL audit log produced by a full pipeline trace — with per-stage latency, retrieval hit rate, Novelty Gate decisions, and Wiggum scores — be used to train a lightweight planner that learns which decomposition strategies succeed on which query types?
- When the harness operates in a cache-sparse regime (early in a research session, few prior retrievals) versus a cache-dense regime (late session, large memory store), should the retrieval strategy change — and how should the pipeline detect which regime it is in?