May 31, 2026 • 6 min read • Agentic Harness Engineering

The Diagnostic Skills: /debug and /troubleshoot

Two skills for diagnosing harness failures without manually reading run logs. /debug loads the last matching ERROR or FAIL runs, reads their trace event sequences, maps task_type to relevant source anchors, and returns a structured Diagnosis / Evidence / Fix. /troubleshoot combines that diagnosis with project state and delivers a four-part report — Issue, Root cause, Fix, Next task — in a single LLM call.

Harness failures come in two categories. Code crashes (final: ERROR) mean an exception escaped the pipeline — wrong argument, missing file, model timeout. Quality failures (final: FAIL) mean the run completed but the Wiggum evaluator scored it below PASS_THRESHOLD. The fix for each is different: code crashes need a diff, quality failures need a prompt or configuration change. Both skills detect which kind of failure occurred and adjust their guidance accordingly.

/debug: targeted failure diagnosis

Invoked as /debug [filter] where filter is any of: a task_type, the literal strings ERROR or FAIL, a partial run ID, or a model name. With no filter, it defaults to the most recent failures of any kind.

# Diagnose the most recent failure
python agent.py "/debug"

# Diagnose failures (ERROR or FAIL) from research runs
python agent.py "/debug research"

# Diagnose failures from a specific model
python agent.py "/debug qwen3.6-35b"

The implementation loads runs.jsonl, filters to ERROR and FAIL records (skipping meta-runs — debug, suggest, orientation, re-orient — to avoid circular diagnosis), and selects the last two matching runs. Two runs rather than one enables pattern detection: if the same issue repeats across consecutive failures it's more likely a systematic problem than noise.
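
The selection step described above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual code: `select_failures` and `matches` are hypothetical names, the `run_id` and `model` field names are assumptions about the runs.jsonl schema, and the matching rule for partial IDs and model names is a guess (plain substring match). Only `final`, `task_type`, and the meta-run list come from the text.

```python
import json

# Meta-run task types named in the text; skipped to avoid circular diagnosis.
META_TASK_TYPES = {"debug", "suggest", "orientation", "re-orient"}
FAILURE_FINALS = {"ERROR", "FAIL"}


def matches(run, flt):
    """Hypothetical filter rule: final, task_type, or substring of run_id/model."""
    if flt in FAILURE_FINALS:
        return run.get("final") == flt
    if run.get("task_type") == flt:
        return True
    # Assumed field names: "run_id" and "model".
    return flt in str(run.get("run_id", "")) or flt in str(run.get("model", ""))


def select_failures(lines, flt=None, limit=2):
    """Return the last `limit` failed, non-meta runs that match `flt`."""
    candidates = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        run = json.loads(line)
        if run.get("final") not in FAILURE_FINALS:
            continue  # passing runs are never candidates
        if run.get("task_type") in META_TASK_TYPES:
            continue  # meta-runs are dropped before selection
        if flt and not matches(run, flt):
            continue
        candidates.append(run)
    # runs.jsonl is assumed to be append-only, so the tail is the most recent.
    return candidates[-limit:]
```

With an open file handle, `select_failures(open("runs.jsonl"), "research")` would return at most two records, oldest first.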

Context assembly

For each target run the skill assembles a structured context block from the run record and its trace event sequence.

After the run blocks, it appends relevant source code. A _SOURCE_MAP dict maps each task_type to a list of (file, anchor) pairs most likely to contain the bug:

_SOURCE_MAP = {
    "research":       [("agent.py",           "SYNTH_INSTRUCTION"),
                       ("wiggum.py",           "def loop")],
    "enumerated":     [("agent.py",           "SYNTH_INSTRUCTION_COUNT"),
                       ("wiggum.py",           "def loop")],
    "annotate":       [("skills/__init__.py", "run_annotate_standalone")],
    "orientation":    [("orientation_skill.py", "def build_orientation")],
    "re-orient":      [("agent.py",           "_handle_reorient")],
    # ...
}

The skill extracts 1,800 characters starting from 100 characters before each anchor — enough to include the function signature and the most relevant logic without overwhelming the context window.
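
The windowing rule is simple enough to sketch directly. The function name `extract_excerpt` and the behaviour when an anchor is missing (returning an empty string) are assumptions made for illustration; the 100-character lead and 1,800-character length come from the description above.

```python
def extract_excerpt(source, anchor, lead=100, length=1800):
    """Return up to `length` characters beginning `lead` characters before `anchor`."""
    idx = source.find(anchor)
    if idx == -1:
        return ""  # assumed behaviour: no anchor, no excerpt
    # Clamp at the start of the file when the anchor sits near the top.
    start = max(0, idx - lead)
    return source[start:start + length]
```

Starting slightly before the anchor is what picks up decorators, comments, or the `def` line when the anchor is a name inside the signature.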

Error type branching

The synthesis prompt changes based on whether any target run has final: ERROR:

Code crash (ERROR)

  • Prompt asks for exact lines to change as a minimal diff or replacement block
  • Trace event sequence is the primary diagnostic — the failing stage and its error field identify the crash site

Quality failure (FAIL)

  • Prompt asks for specific changes to synthesis instruction, prompt, or harness config
  • If a SYNTH_INSTRUCTION change is recommended, the full replacement string is requested
  • Wiggum dimension scores and issues are the primary signal
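
A rough sketch of that branch, assuming each run is a dict with a `final` field. The request strings below paraphrase the bullets and are illustrative only, since the skill's real prompt text is not shown here; `build_fix_request` is a hypothetical name.

```python
# Illustrative wording only; not the skill's actual prompt text.
CRASH_REQUEST = (
    "Give the exact lines to change as a minimal diff or replacement block. "
    "Use the trace event sequence to locate the failing stage."
)
QUALITY_REQUEST = (
    "Give specific changes to the synthesis instruction, prompt, or harness config. "
    "If you recommend changing SYNTH_INSTRUCTION, provide the full replacement string."
)


def build_fix_request(runs):
    """Return (failure_kind, request_text) for the target runs."""
    # A single ERROR among the target runs switches the whole prompt to crash mode.
    has_crash = any(run.get("final") == "ERROR" for run in runs)
    if has_crash:
        return "code_crash", CRASH_REQUEST
    return "quality_failure", QUALITY_REQUEST
```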

The structured output format is enforced in the prompt:

**Diagnosis:** <one sentence root cause>

**Evidence:** <2-3 specific observations from the run data above>

**Fix:**
<concrete code change or config change, ready to apply>

/troubleshoot: diagnosis + next step in one call

/troubleshoot is a higher-order skill that loads the same failure context as /debug — last two ERROR/FAIL runs, trace events, source excerpts — and adds the full project state context used by /suggest: orientation cache, git log, and autoresearch progress. Both are assembled into a single prompt and resolved in one LLM call.

# Full troubleshoot — diagnose + recommend next step
python agent.py "/troubleshoot"

# Filter to annotation failures only
python agent.py "/troubleshoot annotate"

When no ERROR or FAIL runs are found, /troubleshoot falls back to suggest-only mode — it skips the diagnosis section and returns just a next-step recommendation. This makes it safe to run speculatively: if everything is passing, it still returns something useful.

The output format is four parts:

**Issue:** <one sentence description of the failure>

**Root cause:** <2-3 sentences of analysis>

**Fix:** <concrete change ready to apply>

**Next task after fix:** `<runnable command>`

/troubleshoot uses temperature: 0.1, lower than the research pipeline's default. Diagnostic responses benefit from low variance: the same run logs should produce essentially the same root cause analysis. The structured output format also helps, because the model is constrained to four labeled sections rather than free-form prose.
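
One practical consequence of labeled sections is that a caller can split the reply mechanically. The sketch below is an assumption about how a consumer might do that, not part of the skills themselves; `parse_report` is a hypothetical helper, and it relies only on the bold `**Label:**` markers shown in the formats above.

```python
import re

# Matches a bold label at the start of a line, e.g. "**Root cause:** ".
SECTION_RE = re.compile(r"^\*\*(?P<label>[^*:]+):\*\*[ \t]*", re.MULTILINE)


def parse_report(text):
    """Split a reply into {label: body} using the `**Label:**` markers."""
    sections = {}
    found = list(SECTION_RE.finditer(text))
    for i, m in enumerate(found):
        # Each body runs until the next label, or to the end of the reply.
        end = found[i + 1].start() if i + 1 < len(found) else len(text)
        sections[m.group("label").strip()] = text[m.end():end].strip()
    return sections
```

A reply with a missing label simply yields a dict without that key, which gives the caller a cheap check that the model followed the format.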

Both skills skip meta-runs (debug, suggest, orientation, re-orient) when scanning for failures. Without this guard, a failed /debug run would recursively diagnose itself — a degenerate loop with no useful output. The filter is applied before candidate selection, not after.

The live task indicator: LIVE_TASK_FILE

Two files in data/ track what is currently executing. live_run.json is written by RunTrace.set_stage() on every pipeline stage transition and holds a snapshot of the in-progress run. live_task.json is written by the task queue executor and holds only one thing: the item_id of the queue item currently executing.

# tasks.py, inside the task thread
LIVE_TASK_FILE.write_text(json.dumps({"item_id": item.item_id}), encoding="utf-8")
try:
    run_agent(item.task, ...)
finally:
    LIVE_TASK_FILE.unlink(missing_ok=True)  # always cleaned up on exit

GET /api/runs/live merges both files: it reads the stage snapshot from live_run.json and, if live_task.json exists, attaches the item_id to the response. This lets the dashboard correlate a live run snapshot (which has stage and token data but no queue identity) with the queue item that launched it (which has a user-visible task string but no run data yet).
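
The merge itself might look something like this. It is a sketch under assumptions: `live_snapshot` and `read_json` are hypothetical names, the snapshot is assumed to be a JSON object, and returning `None` when no run is live is a guess about the endpoint's behaviour.

```python
import json
from pathlib import Path


def read_json(path):
    """Return parsed JSON from `path`, or None if it is missing or unreadable."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return None


def live_snapshot(data_dir):
    """Merge live_run.json with the item_id from live_task.json, if present."""
    data_dir = Path(data_dir)
    snapshot = read_json(data_dir / "live_run.json")
    if snapshot is None:
        return None  # assumed: nothing to report when no run is in progress
    task = read_json(data_dir / "live_task.json")
    if task and "item_id" in task:
        # Copy rather than mutate, so the stage snapshot stays as read.
        snapshot = {**snapshot, "item_id": task["item_id"]}
    return snapshot
```

Tolerating a half-written or missing file matters here because both files are rewritten while the endpoint may be polling them.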

For /debug specifically, the presence of LIVE_TASK_FILE is a signal to skip the currently running item when scanning for failure candidates, since diagnosing a still-running task would read incomplete trace data. During normal operation the file's write/unlink lifecycle is reliable enough for this purpose: absence of the file means no task is executing, presence means one is.

If the process is killed mid-task without reaching the finally block, live_task.json will persist as a stale file. On next startup the file is not cleaned up automatically, so GET /api/runs/live will attempt to merge a stale item_id into the live snapshot. The safe fix is to delete the file (rm data/live_task.json) before restarting after a crash.
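
The manual cleanup could also be automated with a small startup hook. This is a hypothetical addition, not something the harness currently does; `clear_stale_live_task` is an invented name, and the approach is only safe under the assumption that a single process owns data/ and the hook runs before the task executor starts.

```python
from pathlib import Path


def clear_stale_live_task(data_dir="data"):
    """Remove a leftover live_task.json; return True if one was found."""
    path = Path(data_dir) / "live_task.json"
    existed = path.exists()
    # missing_ok avoids an error if the file vanishes between the check and the unlink.
    path.unlink(missing_ok=True)
    return existed
```

Returning whether a file was found lets the caller log that the previous session ended uncleanly.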