May 30, 2026 • 6 min read • Agentic Harness Engineering

The Explorer View: Per-Run Pipeline DAG Inspector

The Explorer renders every completed run as a clickable directed acyclic graph — Task, Memory, Plan, Search, Synthesis, Eval, Output. Clicking any node opens a detail inspector with token counts, Wiggum dimension scores, evaluator reasoning, and an inline RLHF feedback panel.

The Runs view shows a table of completed runs. The Explorer shows what happened inside one. Each run record in runs.jsonl carries enough structured data — planned queries, tool calls, per-stage token counts, Wiggum eval rounds with dimension scores and issues, chain-of-thought fragments — to reconstruct the pipeline execution as a graph. The Explorer reads that data and builds the DAG client-side, with no additional API calls beyond the initial runs load.

Layout: columns as pipeline stages

The DAG is column-based rather than force-directed. Each column corresponds to a pipeline stage, and nodes within a column stack vertically. This keeps the visual reading order left-to-right, matching the temporal order of execution, and avoids the layout instability of spring-based graphs when node counts vary between runs.

The column sequence is always: Task → Memory → Plan (if present) → Search × N → Synthesis → Eval × rounds → Output. Search and Eval columns grow vertically: a run with five tool calls gets five Search nodes stacked in a single column, all connected to the next stage. The SVG viewport is computed from the actual column count and maximum column height, so short runs and deep runs both fill the canvas cleanly.
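As a rough illustration of the column ordering and viewport sizing described above, here is a minimal Python sketch. The Explorer itself does this client-side; the layout constants and the function names `build_columns` and `viewport` are hypothetical, and only the field names `plan`, `tool_calls`, and `wiggum_eval_log` come from the run record.

```python
from typing import Any

# Hypothetical layout constants; the real Explorer's values are not stated in this post.
COL_WIDTH = 180
ROW_HEIGHT = 70
PADDING = 40


def build_columns(run: dict[str, Any]) -> list[tuple[str, int]]:
    """Return (stage, node_count) pairs in left-to-right order."""
    columns = [("Task", 1), ("Memory", 1)]
    if run.get("plan"):  # the Plan column appears only when a plan exists
        columns.append(("Plan", 1))
    searches = len(run.get("tool_calls") or [])
    if searches:
        columns.append(("Search", searches))
    columns.append(("Synthesis", 1))
    rounds = len(run.get("wiggum_eval_log") or [])
    if rounds:
        columns.append(("Eval", rounds))
    columns.append(("Output", 1))
    return columns


def viewport(columns: list[tuple[str, int]]) -> tuple[int, int]:
    """Size the canvas from the column count and the tallest column."""
    tallest = max(count for _, count in columns)
    width = 2 * PADDING + len(columns) * COL_WIDTH
    height = 2 * PADDING + tallest * ROW_HEIGHT
    return width, height
```

Under these assumed constants, a run with a plan, five tool calls, and two eval rounds yields seven columns, and the height is driven by the five-node Search column.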

Node types

Task

The entry point. Shows what was asked, which model ran it, total wall-clock time, and whether it was part of an orchestrated multi-task sequence.

Memory

What the pipeline already knew before searching. Lists the injected ChromaDB observations by title — a non-empty Memory node means this run was faster and cheaper than a cold start.

Plan

The two planner queries that seeded search, plus any notes injected into synthesis. A "gaps: []" entry here means the memory store covered the task fully and no searches ran.

Search

One node per web search round. Shows the exact query, how many characters of results it returned, and how many tokens it consumed. This is useful for spotting wide or redundant search rounds.

Synthesis

The production pass. Token counts, wall-clock time, output file size, and an inline preview of the generated report — expandable on click without leaving the Explorer.

Eval

One node per Wiggum revision round. Six colored dimension bars (relevance, completeness, depth, specificity, structure, grounded) make the score breakdown readable at a glance. Issues and evaluator reasoning are in the inspector panel.

Output

The terminal node. PASS/FAIL/ERROR badge, the output file path, final composite score, leverage ratio, and a second inline preview, so you can compare the synthesis draft and the final output side by side.

The node inspector

Clicking a node opens a 340px inspector panel on the right side of the canvas. The inspector shows all structured data for that stage, organized into labeled rows. The Eval node inspector is the most information-dense: dimension scores render as colored progress bars (each dimension gets a fixed color — relevance blue, completeness green, depth purple, specificity cyan, structure orange, grounded amber), the issues list scrolls independently, and the evaluator's chain-of-thought reasoning is shown in a bordered block when present.

The Synthesis node includes a collapsible output preview that fetches the markdown file content from GET /api/runs/{run_id}/content on first expand. This avoids loading large output files until the user requests them — runs can produce multi-thousand-word documents that would slow the initial page load if eagerly fetched.
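The fetch-on-first-expand behavior can be sketched as a small cache around the request. This is an illustration in Python rather than the Explorer's actual client code; `LazyPreview` and the injected `fetch` callable are hypothetical names, and only the endpoint path comes from the text above.

```python
from typing import Callable, Optional


class LazyPreview:
    """Fetch a run's output on first expand, then serve it from memory."""

    def __init__(self, run_id: str, fetch: Callable[[str], str]):
        self.run_id = run_id
        # Hypothetical: `fetch` performs the HTTP GET and returns the body text.
        self._fetch = fetch
        self._content: Optional[str] = None

    def expand(self) -> str:
        # Only the first expand triggers a request; later expands reuse the result.
        if self._content is None:
            self._content = self._fetch(f"/api/runs/{self.run_id}/content")
        return self._content
```

Injecting the fetch function keeps the sketch free of any particular HTTP library and makes the single-request behavior easy to check.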

RLHF feedback at every stage

Every node inspector includes an RLHF feedback panel below the stage data. Thumbs-up or thumbs-down ratings are submitted to POST /api/feedback with a node_id that identifies which pipeline stage the rating applies to — not just the run as a whole. This makes it possible to distinguish "the search was good but the synthesis was weak" from "the plan had wrong queries" as separate training signals. Ratings are persisted to data/feedback.jsonl alongside run records.

When the inspector opens for a node that already has feedback, the existing rating and comment are loaded from GET /api/feedback/{run_id} and pre-populated. The panel shows a checkmark when the rating has been saved.
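A minimal sketch of per-node feedback storage, assuming one JSON object per line in the feedback file. The record shape is an assumption: `run_id`, `node_id`, rating, and comment are named in the text, but the `"up"`/`"down"` values, the rule that the latest rating wins, and the function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Any


def save_feedback(path: Path, run_id: str, node_id: str,
                  rating: str, comment: str = "") -> dict[str, Any]:
    """Append one rating for one pipeline stage of one run."""
    if rating not in ("up", "down"):  # assumed encoding of thumbs-up/down
        raise ValueError(f"unknown rating: {rating!r}")
    record = {"run_id": run_id, "node_id": node_id,
              "rating": rating, "comment": comment}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def load_feedback(path: Path, run_id: str) -> dict[str, dict[str, Any]]:
    """Return the latest rating per node_id for a run, for pre-populating panels."""
    latest: dict[str, dict[str, Any]] = {}
    if not path.exists():
        return latest
    with path.open(encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            record = json.loads(line)
            if record.get("run_id") == run_id:
                latest[record["node_id"]] = record  # later lines overwrite earlier ones
    return latest
```

Keying the result by `node_id` is what lets a weak synthesis and a good search show up as two separate signals for the same run.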

Building the DAG from run data

The DAG is constructed entirely from the fields already present in a RunRecord. No separate trace format or instrumentation is required — the existing structured fields map directly to node types:

tool_calls[] → one Search node per non-execution tool call (Python execution tools are filtered out)
plan.search_queries → fallback Search nodes when tool_calls is empty (plan-only runs)
wiggum_eval_log[] → one Eval node per round with full dimension data (preferred over the scalar wiggum_scores[] fallback)
tokens_by_stage → per-stage token counts surfaced in the relevant inspector panel
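The field-to-node mapping above, with its two fallbacks, might look roughly like this in Python. The top-level field names come from the list; the inner keys (`name`, `query`, `dimensions`, `issues`) and the `EXECUTION_TOOLS` set are assumptions made for illustration.

```python
from typing import Any

# Hypothetical tool name; the post only says Python execution tools are filtered out.
EXECUTION_TOOLS = {"python_exec"}


def search_nodes(run: dict[str, Any]) -> list[dict[str, Any]]:
    calls = run.get("tool_calls") or []
    if calls:
        # One Search node per non-execution tool call.
        return [{"type": "search", "query": c.get("query", "")}
                for c in calls if c.get("name") not in EXECUTION_TOOLS]
    # Fallback for plan-only runs: tool_calls is empty, so use the planned queries.
    queries = (run.get("plan") or {}).get("search_queries") or []
    return [{"type": "search", "query": q, "planned_only": True} for q in queries]


def eval_nodes(run: dict[str, Any]) -> list[dict[str, Any]]:
    log = run.get("wiggum_eval_log") or []
    if log:
        # Preferred source: full per-round dimension data and issues.
        return [{"type": "eval", "round": i + 1,
                 "dimensions": entry.get("dimensions", {}),
                 "issues": entry.get("issues", [])}
                for i, entry in enumerate(log)]
    # Fallback: scalar scores only, so no dimension bars or issue text.
    return [{"type": "eval", "round": i + 1, "score": s,
             "dimensions": None, "issues": None}
            for i, s in enumerate(run.get("wiggum_scores") or [])]
```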

Edges connect every node in column i to every node in column i+1 — a many-to-many fan-out/fan-in pattern. This is accurate for the actual pipeline: all search results merge into a single synthesis context, and all Wiggum rounds feed into the same output decision.
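The edge rule is a Cartesian product between adjacent columns. A short Python sketch, with hypothetical node IDs and a hypothetical `connect` helper:

```python
from itertools import product


def connect(columns: list[list[str]]) -> list[tuple[str, str]]:
    """Link every node in column i to every node in column i+1."""
    edges: list[tuple[str, str]] = []
    for left, right in zip(columns, columns[1:]):
        edges.extend(product(left, right))
    return edges


# Hypothetical node IDs for a run with three searches and two eval rounds.
columns = [["task"], ["memory"], ["s1", "s2", "s3"], ["synth"], ["e1", "e2"], ["out"]]
edges = connect(columns)
```

Here the three Search nodes fan in to the single Synthesis node, and both Eval nodes fan in to Output, for 1 + 3 + 3 + 2 + 2 = 11 edges in total.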

The Explorer defaults to the most recent run that has tool calls or Wiggum scores — blank runs and stub records are skipped. Runs that completed before the wiggum_eval_log field was added fall back to the scalar wiggum_scores array, which still renders Eval nodes but without dimension bars or issue text.
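The default-selection rule could be sketched as follows, assuming runs are held in file order with the oldest first (an assumption; the post does not state the ordering). The function name `default_run` is hypothetical.

```python
from typing import Any, Optional


def default_run(runs: list[dict[str, Any]]) -> Optional[dict[str, Any]]:
    """Pick the most recent run that has tool calls or Wiggum scores."""
    for run in reversed(runs):  # assumed oldest-first, so walk backwards
        if run.get("tool_calls") or run.get("wiggum_scores"):
            return run
    return None  # every record was blank or a stub
```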

Subtask trees: orchestrated runs

When the orchestrator decomposes a complex task into parallel subtasks, each subtask is a separate run written to runs.jsonl with a parent_run_id field pointing to the parent run. The Explorer surfaces this relationship via GET /api/runs/{run_id}/children, which returns all runs whose parent_run_id matches the requested ID.

In practice: select a run in the Explorer, and if it has children (runs whose parent_run_id equals the selected run’s ID), the inspector panel shows a “Subtasks” row listing each child run’s task string, final status, and Wiggum score. This makes it possible to diagnose orchestrator decompositions at a glance: if a parent PASS contains a child FAIL, the failing subtask narrows the problem immediately.

The children endpoint scans the same runs.jsonl file used by all other endpoints and applies the same stub-record filter. Children are returned newest-first. The parent run itself is not included in the response — only its direct children.

The parent_run_id field is set by orchestrator.py when it dispatches subtasks. Runs produced by the main agent pipeline (non-orchestrated) have no parent_run_id, and the children endpoint returns an empty array for them. Autoresearch eval runs use a separate run_id scheme and are also not linked via this field.
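The scan behind the children endpoint might look like the sketch below. It covers only the filtering logic, not the HTTP layer. The stub test and the `timestamp` field used for ordering are assumptions, since the post does not define either; `parent_run_id` is the only field taken from the text.

```python
import json
from typing import Any, Iterable


def is_stub(run: dict[str, Any]) -> bool:
    # Hypothetical stub test: the post does not define the real filter.
    return not run.get("task")


def children_of(lines: Iterable[str], parent_id: str) -> list[dict[str, Any]]:
    """Scan runs.jsonl lines and return the direct children, newest first."""
    children = []
    for line in lines:
        if not line.strip():
            continue
        run = json.loads(line)
        if is_stub(run):
            continue
        if run.get("parent_run_id") == parent_id:
            children.append(run)
    # "timestamp" is an assumed field; ISO 8601 strings sort chronologically.
    children.sort(key=lambda r: r.get("timestamp", ""), reverse=True)
    return children
```

Because the match is on `parent_run_id` alone, the parent never appears in its own result, and a run that nothing points to yields an empty list.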
