
The Pipeline in Motion: Tracing a Task Through All Eleven Subsystems

The previous post established the central claim: harness quality dominates model selection for a fixed task domain. This post is the evidence base for that claim—not statistics, but mechanism. We trace a single research task from the moment the user presses Enter to the moment a vetted markdown file lands in the output directory and a structured record lands in runs.jsonl.

The task: python agent.py "survey speculative decoding techniques in transformer inference". The run will pass the Wiggum quality gate in two evaluation rounds, taking approximately 210 seconds on a single consumer GPU. Every subsystem it touches is named and explained below.

The Eleven Subsystems

Harness Architecture — Eleven Subsystems
Data flows left-to-right and top-to-bottom. Security wraps all external inputs; Observability instruments all stages. The agent loop sits at the center, coordinating Inference, Research Tools, Skills, and Memory.

The eleven subsystems are organized in three tiers. The entry tier (Entry Points, Orchestration, API Routes) handles how work arrives in the pipeline. The execution tier (Core Agent Loop, Inference, Research Tools, Skills, Memory) is where the actual work happens. The infrastructure tier (Security, Observability, Storage) provides guarantees and instrumentation that cut across all stages.

Stage 1: Entry Point (0–1s)

cli.py receives the task string and makes the first routing decision: does the task begin with a / prefix? If so, it looks up the handler in the _SKILLS registry. Our task does not, so it routes to the standard agent loop and opens a RunTrace context manager. The run ID is stamped at entry: 20260522-a3f4b2c1—a UTC timestamp prefix plus UUID4 hex, sortable and collision-resistant.
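The routing decision and run ID stamp can be sketched in a few lines. This is a minimal illustration, not the actual cli.py: the contents of _SKILLS, the route() and new_run_id() helpers, and the choice to keep the first eight hex characters of the UUID4 (inferred from the example ID above) are all assumptions.

```python
import uuid
from datetime import datetime, timezone

# Hypothetical registry: slash command -> handler. The real _SKILLS contents are not shown in the post.
_SKILLS = {"/summarize": lambda arg: f"summarize:{arg}"}


def new_run_id(now=None):
    """UTC date prefix plus 8 hex chars of a UUID4 (length assumed from the example ID)."""
    now = now or datetime.now(timezone.utc)
    return f"{now:%Y%m%d}-{uuid.uuid4().hex[:8]}"


def route(task):
    """Send '/'-prefixed tasks to a skill handler, everything else to the agent loop."""
    if task.startswith("/"):
        name, _, arg = task.partition(" ")
        handler = _SKILLS.get(name)
        if handler is None:
            raise KeyError(f"unknown skill: {name}")
        return ("skill", handler(arg))
    return ("agent", task)
```

Because the date prefix comes first, a plain lexicographic sort of run IDs groups runs by day, and the random suffix keeps IDs distinct within a day.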

The RunTrace is the post’s spine. Every subsequent operation appends to it: stage transitions, LLM messages with token counts, tool calls with novelty scores, Wiggum scores per round. When the run exits—successfully or not—finalize() writes the complete record to data/runs.jsonl and a Chrome Trace Events file to data/traces/.
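The "writes on exit, successfully or not" behavior maps naturally onto a Python context manager. The sketch below assumes a simplified record shape (run_id, status, duration_s, events) and a log() method; those names are illustrative, and the Chrome Trace Events output is omitted.

```python
import json
import time
from pathlib import Path


class RunTrace:
    """Minimal sketch: collects events and appends one JSON record when the block exits."""

    def __init__(self, run_id, log_path):
        self.run_id = run_id
        self.log_path = Path(log_path)
        self.events = []
        self.status = "running"

    def __enter__(self):
        self.started = time.time()
        return self

    def log(self, kind, **fields):
        # Hypothetical event shape: a kind label, a relative timestamp, and free-form fields.
        self.events.append({"kind": kind, "t": time.time() - self.started, **fields})

    def finalize(self):
        record = {
            "run_id": self.run_id,
            "status": self.status,
            "duration_s": time.time() - self.started,
            "events": self.events,
        }
        self.log_path.parent.mkdir(parents=True, exist_ok=True)
        with self.log_path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    def __exit__(self, exc_type, exc, tb):
        # Runs on both normal exit and exceptions, so failed runs still leave a record.
        self.status = "failed" if exc_type else "completed"
        self.finalize()
        return False  # never swallow the exception
```

Returning False from __exit__ lets the original exception propagate after the partial record is on disk.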

Stage 2: Planning (1–16s)

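The task string is intercepted before any search query is issued. planner.py sends it to a fast 9B model with a structured extraction prompt requesting four fields, the same ones recorded in plans.jsonl: search queries, known facts, knowledge gaps, and a subtask decomposition.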

The plan is written to data/plans.jsonl before any subtask executes. This matters for crash recovery: if the pipeline fails during research, the plan record is already on disk and the run can be diagnosed without re-running the planning step.

The raw task string is never issued as a search query; only the planner’s targeted queries reach the search engine. This is a hard constraint, not a convention. Submitting “survey speculative decoding techniques” directly to a web search engine produces overview articles; the planner’s targeted queries produce primary sources. The 12–18 second planning latency is recovered within the first research round.
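Both rules in this stage, persist the plan before anything executes and never search on the raw task string, can be enforced at a single choke point. The following is a sketch under assumptions: the Plan dataclass, its field names (taken from the plans.jsonl description later in this post), and persist_plan() are hypothetical, not the harness's actual planner.py.

```python
import json
from dataclasses import dataclass, asdict, field
from pathlib import Path


@dataclass
class Plan:
    # Field names are assumptions modeled on the plans.jsonl description.
    task: str
    queries: list
    known_facts: list = field(default_factory=list)
    knowledge_gaps: list = field(default_factory=list)
    subtasks: list = field(default_factory=list)


def persist_plan(plan, path):
    """Validate the plan, then append it to the JSONL file before any subtask runs."""
    if not plan.queries:
        raise ValueError("planner returned no queries")
    normalized_task = plan.task.strip().lower()
    if any(q.strip().lower() == normalized_task for q in plan.queries):
        # Hard constraint: the raw task string is never a search query.
        raise ValueError("raw task string must not be used as a search query")
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(plan)) + "\n")
    return plan
```

Raising instead of silently falling back keeps a degenerate plan from reaching the research stage unnoticed.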

Academic grounding: NL2Plan (arXiv:2405.04215v2) demonstrates that structured decomposition of a natural-language task into formal planning artifacts—without expert input—outperforms submitting the raw task string directly to a planner across seven planning domains. The harness’s Planner-First pattern (B1) applies this same principle at the query-generation level: decompose first, search second, never conflate the two.

Stage 3: Research (16–65s)

agent.gather_research() issues the planner’s queries in rounds, applying a novelty gate after each batch. The gate asks the planner model whether the batch adds information not already represented in the accumulated knowledge state:

Round 1: 5 queries → 5 result batches
  novelty scores: [7.2, 2.1, 6.8, 1.9, 5.4]
  gate (threshold 3.0): batches 2, 4 discarded
  3 batches pass → knowledge state updated

Round 2: 2 targeted gap queries
  novelty scores: [6.1, 4.8]
  both pass → knowledge state updated

Total context assembled: ~18,400 chars
Redundant content gated: ~6,200 chars

The 6,200 characters of gated content would otherwise reach the synthesis prompt and dilute it. Without the gate, the producer model encounters the same three papers cited by six different sources and scores “comprehensive coverage” by repeating them; with it, the prompt contains only non-redundant material. All results are cached in RESEARCH_CACHE keyed by query string, so Wiggum revision rounds do not re-fetch already-retrieved content.
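The gate and the cache are both small. In this sketch the scores are passed in directly (in the pipeline they come from the planner model), and the helper names novelty_gate() and cached_search(), along with the choice to treat a score equal to the threshold as passing, are assumptions. The example reuses the Round 1 scores from the log above.

```python
NOVELTY_THRESHOLD = 3.0
RESEARCH_CACHE = {}  # query string -> fetched results


def novelty_gate(batches, scores, threshold=NOVELTY_THRESHOLD):
    """Split batches into kept and dropped by novelty score (>= threshold passes; assumed)."""
    kept, dropped = [], []
    for batch, score in zip(batches, scores):
        (kept if score >= threshold else dropped).append(batch)
    return kept, dropped


def cached_search(query, fetch):
    """Fetch a query at most once; later rounds reuse the cached result."""
    if query not in RESEARCH_CACHE:
        RESEARCH_CACHE[query] = fetch(query)
    return RESEARCH_CACHE[query]


# Round 1 from the log: batches 2 and 4 fall below the 3.0 threshold.
round1 = ["batch1", "batch2", "batch3", "batch4", "batch5"]
kept, dropped = novelty_gate(round1, [7.2, 2.1, 6.8, 1.9, 5.4])
```

With these inputs, kept holds batches 1, 3, and 5, matching the "3 batches pass" line in the log.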

Stage 4: Synthesis (65–145s)

The producer model—the largest model in the pipeline, 32B parameters or larger—receives a synthesis prompt containing: the task string, the planner’s known facts and gap list, the research context (novelty-gated), and the memory context (top four observations from the dual-backend store, retrieved by semantic + keyword search on the task). Generation takes 60–90 seconds.

Two synthesis behaviors are worth noting. First, the synthesis instruction is explicitly set to prose depth: the model is directed to produce substantive analytical writing, not stub explanations. (This was a discovered invariant: a “list the key points” instruction reliably produces shallow enumerations that fail the Depth and Specificity dimensions; a “write at the level of a competent technical reviewer” instruction does not.) Second, the producer runs with thinking mode disabled—chain-of-thought reasoning during synthesis consumes token budget without improving output quality and creates context overflow on long documents.

Stage 5: Wiggum Loop (145–215s)

The synthesis output enters the Wiggum Loop—covered in detail in the previous post. For this run:

Round 1 → score 6.4 (Completeness: 5.5, Groundedness: 5.0 fail)
  Revision prompt: targets Completeness + Groundedness only
  Producer revises at temperature 0.3
Round 2 → score 8.3 → PASS

The evaluator is selected deterministically from the pool by hashing the run ID. The same run, re-executed, always uses the same evaluator—reproducibility for debugging. Different runs are distributed across pool members—drift mitigation for analytics.
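One way to get this property is to hash the run ID with a stable digest and index into the pool. The sketch below uses SHA-256 from hashlib; the post does not say which hash the harness uses, and the pool member names are placeholders. Python's built-in hash() would not work here, because string hashes are salted per process by default.

```python
import hashlib


def pick_evaluator(run_id, pool):
    """Map a run ID to one pool member, identically across processes and machines."""
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(pool)
    return pool[index]


# Placeholder pool; the actual evaluator models are not named in this post.
EVALUATOR_POOL = ["evaluator-a", "evaluator-b", "evaluator-c"]
chosen = pick_evaluator("20260522-a3f4b2c1", EVALUATOR_POOL)
```

Note that the mapping depends on pool order and size, so adding or removing a member changes which evaluator a replayed run ID selects.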

Stage 6: Persist (215–220s)

Passing outputs go to three destinations simultaneously:

  1. Filesystem: outputs/20260522-183042-speculative_decoding.md
  2. Memory store: memory.compress_and_store() writes a compressed observation (narrative summary + fact list) to both ChromaDB (embedded with all-MiniLM-L6-v2) and SQLite FTS5 (indexed on task, title, and narrative). A prompt injection scan gates the write, because web-fetched content that passed through synthesis could carry payload into future runs.
  3. Audit log: RunTrace.finalize() appends one JSON record to data/runs.jsonl and writes a Chrome Trace Events file.

The Six JSONL Files

Every stage of this run has been writing structured data to append-only JSONL files. By the time the run exits, six files carry its complete lifecycle:

projects.jsonl

Project-level context: name, path, creation timestamp. Groups runs by working directory.

sessions.jsonl

Session envelope: start time, model stack, environment snapshot. Multiple runs share one session.

plans.jsonl

Pre-execution plan record: queries, known facts, knowledge gaps, subtask decomposition. Written before any subtask executes.

runs.jsonl

The primary artifact: full run record with Wiggum scores, token counts, tool calls, novelty scores, output path, PASS/FAIL.

artifacts.jsonl

Per-file artifact registry: every temp file, output, and intermediate document produced or consumed by the run.

messages.jsonl

Full message log: every LLM exchange with role, stage, content, token count, and optional chain-of-thought text.

The JSONL format is the critical design decision. No database server, no schema migrations, full portability. A 1,500-run log is typically under 20 MB and loads into pandas in under a second. The files are append-only—a failed run still writes its partial record, making failure diagnosis possible without re-running.
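Reading such a log needs nothing beyond the standard library. The sketch below skips blank lines and any line that fails to parse, which covers a record torn by a crash mid-write; the "verdict" field name is an assumption standing in for the PASS/FAIL value the run record carries.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Load every well-formed record; skip blank or truncated lines."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # e.g. a final line cut short by a crash
    return records


def pass_rate(records):
    """Fraction of records whose (assumed) 'verdict' field is PASS."""
    verdicts = [r["verdict"] for r in records if "verdict" in r]
    if not verdicts:
        return 0.0
    return sum(v == "PASS" for v in verdicts) / len(verdicts)
```

For analysis in pandas, the same file can be loaded with pandas.read_json(path, lines=True), though that call is stricter about malformed lines than the reader above.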

Stage Timeline

Run Stage Timeline — Typical Two-Round Pass (~210s)
GPU-bound stages (synthesis, evaluation, revision) dominate wall-clock time. Planning latency is recovered in the first research round. VRAM keep-alive eliminates the cold-start gap between stages.

The Keep-Alive Budget

One detail invisible in the timeline but critical to its shape: VRAM residency management. Without it, every stage transition that changes the active model would incur a 30–60 second cold-start load. The pipeline avoids this by pre-allocating model residency before the first inference call and managing it through the run.

The planner model (9B, fast) loads before the planning stage and releases when planning completes—its VRAM is needed for the producer’s context window during synthesis. The producer (32B) loads before synthesis and stays warm through revision. The evaluator loads before the first Wiggum round. On constrained hardware (8 GB VRAM), the three models load sequentially rather than simultaneously; on standard hardware (24 GB), all three stay warm throughout the run.

The Chrome Trace file for this run, loaded in ui.perfetto.dev, shows two evaluation blocks of roughly equal width—the signature of a two-round pass. A three-round run shows a visually wider third block: the producer is generating longer revisions as it works against dimensions that are harder to fix than the evaluator’s feedback implies.
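For readers who have not seen the format, a Chrome Trace Events file is a JSON object with a traceEvents list, where a "complete" event (ph set to "X") carries a start time ts and a duration dur, both in microseconds. The sketch below builds one event per stage from the stage boundaries in the section headings above; it is a simplified illustration, and the real trace presumably records per-call events as well.

```python
import json

# Stage boundaries in seconds, taken from the stage headings in this post.
STAGES = [
    ("entry", 0, 1),
    ("planning", 1, 16),
    ("research", 16, 65),
    ("synthesis", 65, 145),
    ("wiggum_loop", 145, 215),
    ("persist", 215, 220),
]


def to_trace_events(stages, pid=1, tid=1):
    """Convert (name, start_s, end_s) tuples into Chrome Trace 'complete' events."""
    events = []
    for name, start_s, end_s in stages:
        events.append({
            "name": name,
            "ph": "X",                                   # complete event: start + duration
            "ts": int(start_s * 1_000_000),              # microseconds
            "dur": int((end_s - start_s) * 1_000_000),   # microseconds
            "pid": pid,
            "tid": tid,
        })
    return {"traceEvents": events}


trace = to_trace_events(STAGES)
trace_json = json.dumps(trace)
```

Writing trace_json to a .json file and opening it in ui.perfetto.dev renders each stage as a block whose width is proportional to its duration.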

What the Subsystem Map Reveals

Looking at the full architecture diagram, one pattern stands out: Security and Observability are the only subsystems that touch every other subsystem. Security checks fire on every external input (web content through the injection scanner, file paths through the path sandbox, agent-generated code through the AST guard, browser URLs through the CDP guard). Observability instruments every stage transition, every LLM call, every tool invocation.
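Of the four checks, the path sandbox is the simplest to illustrate. A common approach, shown here as a sketch and not as the harness's actual guard, is to resolve the candidate path and confirm it still sits under the sandbox root. The function name resolve_in_sandbox() is hypothetical.

```python
from pathlib import Path


def resolve_in_sandbox(root, candidate):
    """Return the resolved path if it stays under root, else raise PermissionError."""
    root = Path(root).resolve()
    # Joining with an absolute candidate discards root, so "/etc/passwd" is caught below too.
    target = (root / candidate).resolve()
    try:
        target.relative_to(root)
    except ValueError:
        raise PermissionError(f"path escapes sandbox: {candidate}") from None
    return target
```

Resolving before comparing is the important step: it collapses ".." segments and follows symlinks, so a path that merely looks like it starts with the root cannot slip through.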

This is not incidental. Security and observability are the two subsystems whose failure modes are most expensive: a security miss has cascading effects across future runs (an injected memory observation poisons everything that retrieves it), and an observability miss means the failure that does occur is undiagnosable. Both subsystems justify their cross-cutting implementation.

The next post catalogs what goes wrong when any of the other nine subsystems fails.

Up next in series

A Failure Taxonomy for Agentic Systems

Six failure classes derived from 1,500 logged runs, with frequency data, representative records, and the patterns that reduced each class’s rate.


What the Literature Leaves Open

Several questions raised by this body of research remain unresolved — and bear directly on how the harness pipeline should be instrumented and refined:
