May 20, 2026 • 7 min read • Agentic Harness Engineering

The Runs View: Pipeline Stage Visualization and Live Run Monitoring

1,442 runs, filterable by PASS / FAIL / ERROR, with a live pipeline stage map and a detail panel that breaks down every output into score, clusters, paper count, and token spend.

Every research task the harness executes is a run. The Runs view is the closest thing to a real-time control room for watching that execution: a scrollable list on the left, a pipeline stage diagram in the center, and a full detail panel that opens when you click any completed run. As of this writing the list holds 1,442 entries — lit reviews, best-practices syntheses, OSINT lookups — and the view handles all of them the same way.

Runs view showing live pipeline stage visualization and run list with PASS/ERROR badges

A live /lit-review run mid-execution. TASK and SEARCH stages are marked done; MEMORY and PLAN are active. The left sidebar shows the full run queue with timestamp and status badges.

The Run List

The left sidebar is a reverse-chronological list of every run the harness has executed. Four filter tabs — ALL PASS FAIL ERROR — let you narrow to a specific outcome. A free-text filter at the top searches by task string, which is useful when you want to find a specific lit review topic out of hundreds of entries.

Each row shows the timestamp, the full task string (truncated), elapsed time, and the outcome badge. A pulsing green dot marks the currently running job — the live run header above the pipeline diagram also repeats the full task string so you know exactly what the harness is working on at any moment.

The list shown here was captured while a batch of eight lit-review runs was executing. Several returned ERROR (inference timeout during a heavy concurrent load), while the SLM vs. LLM run completed with PASS at 1,435 seconds — just under 24 minutes for a full 80-paper survey synthesis.

The Pipeline Stage Diagram

When a run is active, the center panel renders a horizontal stage map that updates in real time as each stage transitions. The harness pipeline has six named stages:

TASKParse the incoming task string, resolve skill routes, and validate flags.
MEMORYQuery the ChromaDB observation store for relevant prior context. Injects summaries into the planning prompt.
PLANGenerate a structured research plan: search queries, source priorities, depth targets.
SEARCHExecute queries against arXiv and Semantic Scholar. Fetch, deduplicate, and rank candidate papers.
SYNTHESISAnnotate, cluster, and synthesize fetched papers into the output document using SYNTH_INSTRUCTION.
EVALScore the output on a 0–10 scale using the Wiggum evaluator. Pass threshold is 8.0.

Each card shows its status inline: done (checkmark, green border), active (pulsing, cyan border), or queued (dim, no border). If a stage errors out, the card turns red and the run moves to ERROR status — but the cards that completed successfully stay green, so you can diagnose exactly where execution broke.

The Detail Panel

Clicking any completed run in the list swaps the center panel to the run detail view. The OVERVIEW tab shows the core output metrics, while the COMPUTE tab breaks down resource usage by stage.

Run detail panel for a PASS run showing score 8.00, 275 papers, 3 clusters, token counts, and paper cards

Detail panel for the SLM vs. LLM lit review: score 8.00, 275 papers fetched, 3 clusters, 111,627 input tokens, 12,573 output tokens, 14.1 KB output file.

OVERVIEW Metrics

Score

Wiggum evaluator score on a 0–10 scale. 8.0 is the PASS threshold. Displayed prominently as "Global: X.XX" with color coding — green for pass, red for fail.

Papers

Total papers fetched by the SEARCH stage. For --max-fetch 80 runs the ceiling is 80, but deduplication and Semantic Scholar enrichment can bring this higher or lower.

Clusters

Number of thematic clusters the synthesis stage identified. For the SLM vs. LLM run: 3 clusters out of 275 papers, labeled "Efficiency and Sustai…" and "Attention Mechanisms…"

Token Counts

Input and output tokens for the full run. The 275-paper lit review consumed 111,627 input tokens and produced 12,573 output tokens — roughly a 9:1 compression ratio.

Duration

Wall-clock time from task submission to output file written. The SLM run took 3,408.7 seconds — dominated by the 80-paper fetch and multi-pass synthesis.

Output Size

File size of the final markdown report. 14.1 KB for a 275-paper synthesis is compact — the harness prioritizes information density over length.

Paper Cards

Below the metrics grid, the detail panel renders a card for each annotated paper: title, abstract excerpt, arXiv ID, and the cluster it was assigned to. Scrolling through them gives you a ground-truth view of what the synthesis was actually working from — useful for catching irrelevant papers that inflated the fetch count or checking whether a key paper was included.

The COMPUTE Tab (Lit Review Runs)

The second tab on a lit-review run detail panel is labeled Compute. It replaces the pipeline DAG with a per-stage token and timing table, showing exactly how the inference budget was distributed across the run's LLM calls.

COMPUTE tab showing per-stage token breakdown: Curate 275 calls 110k tokens 55m, Annotate 736 tokens 56.9s, Cluster 356 tokens 4.6s, Synthesize 841 tokens 12.6s

COMPUTE tab for the SLM vs. LLM lit review: 280 total LLM calls consuming 56.3M token-milliseconds. A purple bar spans the full width, proportional to wall time, showing that Curate (275 individual paper reads) dominated.

The table columns are:

STAGENamed pipeline stage — Curate, Annotate, Cluster, Synthesize, or Eval.
CALLSNumber of LLM inference calls made by this stage. Curate has one call per paper (275 for this run).
INPUT TOKTotal input tokens consumed across all calls in this stage. Curate: 110,430 — one abstract per call.
OUTPUT TOKTotal output tokens generated. Synthesis produced only 436 tokens from 841 input — tight compression.
AVG IN/CALLAverage input tokens per LLM call. Curate averages 402 tokens/call (one paper abstract). Cluster: 356 for the full cluster prompt.
TIMEWall-clock time for the stage. Curate took 55.0m; Synthesize took only 12.6s despite being the most "intelligent" stage.

The Curate stage dominates both token spend and wall time on any large fetch: 275 papers × ~402 tokens each = 110,430 input tokens and 55 minutes. The synthesis stage that actually writes the output report consumed less than 1% of that time. This ratio is typical — the harness spends most of its budget reading, not writing.

The Context Window Tab (Best Practices & Research Runs)

For non-lit-review run types — best_practices, research, OSINT — the second tab is labeled Context Window instead of Compute. Rather than a per-stage call table, it renders a treemap visualization of how the model's context window was filled.

Context Window tab for a best_practices PASS run showing treemap of SYNTH COUNT 21.7%, SYNTHESIS 21.6%, COMPRESS 9.3%, SEARCH 5.9%, unused 41.5% — fill 19,180 / 32,768 tokens at 58.5%

Context Window tab for a best_practices PASS run on prompt injection defense. Model: pi-qwen-32b, inferred limit 32,768 tokens. Fill: 19,180 / 32,768 • 58.5%. The large unused region (41.5%) shows this run had headroom — no context pressure.

The Fill Gauge

At the top of the tab, a single horizontal gauge reads X / limit TOK • Y%. The limit is inferred from the model name — pi-qwen-32b maps to 32,768 tokens; larger models map to 128k or higher. The gauge color shifts from cyan (under 70%) to amber (70–90%) to red (over 90%), giving an immediate visual signal of context pressure without reading any numbers.

The Treemap

Below the gauge, a horizontal treemap divides the full context limit into colored blocks — one per pipeline stage — sized proportionally to that stage's input token count. Stages that inject pre-built context (memory summaries, compressed prior outputs) are rendered with a dashed border rather than a solid one, labeled "context injection · estimated tokens" in the detail panel.

Context Window treemap for a best_practices run showing SYNTHESIS, RESEARCH CTX, EVAL, PLANNER, MEMORY CTX, memory_compress, tac_estimate, Sys prompt segments with fill 9,221/32,768 at 28.1%

A cost-envelope best_practices run with lighter context fill (28.1%). More stages are visible here: RESEARCH CTX and MEMORY CTX appear as dashed-border context injection blocks, alongside the active inference stages (SYNTHESIS, EVAL, PLANNER).

The legend at the bottom lists every visible stage with its percentage. Stages that are too narrow to label in the treemap still appear in the legend, so no token spend is invisible.

Clicking a Stage: The Detail Panel

Clicking any block in the treemap selects it and opens a detail panel below. For active inference stages (solid border), the panel shows:

Input / Output Tokens

Exact token counts for this stage's LLM calls. The SYNTHESIS stage for the prompt injection run: 7,077 input, 902 output — an 8:1 compression.

LLM Calls

How many separate inference calls this stage made. Best-practices synthesis typically uses 1 call; multi-pass research tasks may use 3–6.

% of Window

This stage's share of the model's context limit. SYNTHESIS at 21.60% means nearly a quarter of the 32k context was consumed by the single synthesis call.

Total / Prompt / Eval Time

Wall time split into prompt processing (filling the context) and eval time (generating the output). Useful for diagnosing whether slowness is in reading or writing.

Context Window with SYNTHESIS segment selected showing detail panel: 7,077 input tokens, 902 output, 1 LLM call, 21.60% of window, 327.68s total time

SYNTHESIS selected. 7,077 input tokens, 902 output, 1 call, 21.60% of the 32k context window, 327.68s total time. The VIEW INPUT / OUTPUT button expands the raw prompt and response when the run's message log is available.

Context injection stages (dashed border) show only "Est. tokens" and "% of window" — no call count or timing, because they don't make LLM calls. They represent memory summaries, compressed context, or system prompt text that was injected into the prompt but not generated by a model call in this run.

Why Context Window Fill Matters

A run that fills 95% of the context window is a run that's close to truncating its own prompt. When a model hits its context limit mid-synthesis, it silently drops the oldest tokens — usually the task instructions or retrieved context, not the most recent text. The result is a synthesis that appears to complete but is working from an incomplete prompt. The fill gauge makes this failure mode visible before you read the output.

Conversely, a run at 28% fill (like the cost-envelope example) has clear headroom — you could add more memory context, longer search results, or a more detailed system prompt without risking truncation. The treemap turns that headroom into a design decision rather than a guess.

Diagnosing Errors

The captured session shows several ERROR runs from the intelligence-per-watt lit review batch. When a run errors, the pipeline stage diagram freezes at the failed stage rather than clearing. Clicking an ERROR run opens its detail panel showing the stage that failed and any error message the harness surfaced. The most common cause during batch runs is inference timeout: the Ollama server is shared across all concurrent jobs, and when eight heavy synthesis tasks compete for GPU time, later stages hit their timeout windows.

The --resume flag on the batch runner script skips runs whose output files already exist, so recovering from an error batch is a single re-run: completed outputs are preserved and only the failed tasks re-execute.

What the Runs View Tells You

The runs list is the answer to "what did the harness actually do today." The pipeline diagram is the answer to "why is this run taking so long." The OVERVIEW tab answers "was this output worth the compute." The COMPUTE tab answers "which stage consumed the most budget." The Context Window tab answers "how close was this run to running out of context." Together they close the full feedback loop — execution, quality, efficiency, and headroom — without leaving the dashboard or grepping through log files.

The next view in the series — Analytics — aggregates all 1,442 runs into score trends, daily token spend, and a distribution histogram that makes the 8.0 pass-threshold behavior immediately visible.