June 1, 2026 • 5 min read • Agentic Harness Engineering

The Developer Utility Skills: /scratchpad and /test-harness

Two explicit-only skills for development workflows. /scratchpad forces the Python tool loop on for any task type and injects a synthesis instruction that forbids LLM estimation — all numbers in the output must come from actual code execution. /test-harness runs the pytest suite, saves structured output for downstream agent retrieval, and parses the summary line into a structured result.

/scratchpad: grounding synthesis in executed code

The research pipeline skips the Python tool loop for research and best_practices task types — searching and synthesizing prose doesn't require code execution. But some research tasks need exact values: "how many runs passed last week?", "what's the distribution of Wiggum scores?", "how many tokens did the autoresearch loop consume?". Without code, the model estimates. With /scratchpad, it computes.

# Compute exact stats from autoresearch.tsv before synthesizing
python agent.py "/scratchpad Read autoresearch.tsv and report the score trajectory since experiment 40"

# Any task type — /scratchpad forces the tool loop regardless
python agent.py "/scratchpad How many PASS runs did qwen3.6-35b produce in the last 7 days?"

/scratchpad is a pre_synthesis hook. It doesn't run code itself — it changes two things about the pipeline:

  1. Forces the tool loop. The condition plan.task_type in ("research", "best_practices") and not _scratchpad_active decides whether the Python tool loop is skipped: when it is true, the loop does not run. With /scratchpad active, not _scratchpad_active is False, so the condition never holds and the loop runs regardless of task type.
  2. Injects a strict synthesis instruction. The SYNTH_INSTRUCTION_SCRATCHPAD prompt is prepended to the synthesis context:
IMPORTANT: This task requires concrete, deterministic values.
Do NOT estimate or approximate any numbers — use only exact values
from the Code execution results block above.
If a value was computed and saved via save_result(), treat it as ground truth.
Reference computed values directly (e.g. 'exactly 42 runs matched' not 'approximately 40').
If a required value was not computed, state that explicitly rather than guessing.

The synthesis model sees the code execution output in the context window and this instruction forbids it from hedging or paraphrasing the numbers. If the code didn't compute something, the model must say so explicitly rather than approximating.
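The two pipeline changes can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: the function names should_skip_tool_loop and build_synthesis_context are hypothetical, and the instruction text is abbreviated to its first three lines.

```python
# Hypothetical sketch: only SYNTH_INSTRUCTION_SCRATCHPAD and the two task
# types come from the article; every other name is illustrative.
SYNTH_INSTRUCTION_SCRATCHPAD = (
    "IMPORTANT: This task requires concrete, deterministic values.\n"
    "Do NOT estimate or approximate any numbers - use only exact values\n"
    "from the Code execution results block above."
)

SKIP_TOOL_LOOP_TYPES = ("research", "best_practices")


def should_skip_tool_loop(task_type: str, scratchpad_active: bool) -> bool:
    """Mirror the gate: skip only for prose task types when /scratchpad is off."""
    return task_type in SKIP_TOOL_LOOP_TYPES and not scratchpad_active


def build_synthesis_context(base_context: str, scratchpad_active: bool) -> str:
    """Prepend the strict instruction when /scratchpad is active."""
    if not scratchpad_active:
        return base_context
    return f"{SYNTH_INSTRUCTION_SCRATCHPAD}\n\n{base_context}"
```

With this shape, should_skip_tool_loop("research", True) is False, so a research task still gets code execution once /scratchpad is on.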

Prior result injection

When /scratchpad is active, the pipeline also loads previous scratch results for the same topic from agent-workspace/scratch/. scratch_helpers.load_recent(n=3, task_hint=task[:60]) retrieves the three most recent result files whose names are semantically similar to the current task. These are prepended to the code context as "Prior scratchpad results" — allowing the model to reference previously computed values without recomputing them.
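To make the retrieval step concrete, here is a rough sketch of a loader with the same signature shape. It is an assumption-heavy stand-in: the real scratch_helpers.load_recent is not shown in this post, and the word-overlap match below is a simple substitute for whatever similarity measure it actually uses. The name load_recent_sketch is hypothetical.

```python
import re
from pathlib import Path


def load_recent_sketch(scratch_dir: Path, n: int = 3, task_hint: str = "") -> list[str]:
    """Return the text of up to n newest files whose names share a word with task_hint.

    Word overlap is an illustrative stand-in for the real matching logic.
    """
    hint_words = set(re.findall(r"[a-z0-9]+", task_hint.lower()))
    # Newest files first, judged by modification time.
    candidates = sorted(
        (p for p in scratch_dir.glob("*") if p.is_file()),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    matches: list[str] = []
    for path in candidates:
        name_words = set(re.findall(r"[a-z0-9]+", path.stem.lower()))
        # An empty hint matches everything; otherwise require a shared word.
        if not hint_words or hint_words & name_words:
            matches.append(path.read_text(encoding="utf-8"))
        if len(matches) == n:
            break
    return matches
```

A call such as load_recent_sketch(Path("agent-workspace/scratch"), n=3, task_hint=task[:60]) would then return at most three file bodies, newest first.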

Prior results are passed through strip_injection_candidates() before injection — the same security scanner that protects the research context from prompt injection. This matters because prior scratch files could contain attacker-controlled content if a previous run fetched data from an untrusted source.

/test-harness: running the test suite from inside the agent loop

# Fast suite (skips the ~14-min Ollama integration test)
python agent.py "/test-harness"

# Full suite including slow integration tests
python agent.py "/test-harness --full"

# Keyword filter (passed through to pytest -k)
python agent.py "/test-harness -k oversearch"
python agent.py "/test-harness -k wiggum"

The skill runs python run_tests.py (with --fast appended unless --full is present) via subprocess.run() with capture_output=True. Any flags beyond --full and --fast are passed through to pytest — including -k <keyword> for test name filtering.
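The flag handling described above might look roughly like the following. This is a sketch under assumptions: build_test_command and run_suite are hypothetical names, --full is assumed to be forwarded to run_tests.py unchanged, and sys.executable stands in for the literal python.

```python
import shlex
import subprocess
import sys


def build_test_command(skill_args: str) -> list[str]:
    """Turn the text after /test-harness into an argv list for run_tests.py."""
    args = shlex.split(skill_args)
    # Default to the fast suite unless the caller asked for the full one.
    if "--full" not in args and "--fast" not in args:
        args.append("--fast")
    return [sys.executable, "run_tests.py", *args]


def run_suite(skill_args: str) -> subprocess.CompletedProcess:
    """Run the suite and capture stdout and stderr as text."""
    return subprocess.run(build_test_command(skill_args), capture_output=True, text=True)
```

For example, build_test_command("-k oversearch") yields the interpreter path followed by run_tests.py, -k, oversearch, and --fast.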

Output is saved to two locations: agent-workspace/test-results/latest.txt (always overwritten) and a timestamped copy such as agent-workspace/test-results/20260602_143021.txt. The latest.txt path is stable, so the agent can read it when asked to interpret test results without re-running the suite.
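A minimal sketch of the two-file save, assuming the timestamp format is %Y%m%d_%H%M%S (inferred from the example filename) and using a hypothetical helper name, save_test_output:

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def save_test_output(output: str, results_dir: Path, now: Optional[datetime] = None) -> Path:
    """Write output to a timestamped file and to latest.txt; return the timestamped path."""
    results_dir.mkdir(parents=True, exist_ok=True)
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    stamped = results_dir / f"{stamp}.txt"
    stamped.write_text(output, encoding="utf-8")
    # latest.txt is overwritten on every run so readers have one stable path.
    (results_dir / "latest.txt").write_text(output, encoding="utf-8")
    return stamped
```

The now parameter exists only to make the helper easy to test; a real caller would omit it.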

The skill parses the pytest summary line with a regex that handles deselected, xfailed, xpassed, and warning counts:

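import re

# Match pytest's final summary line. Runs over a minute append "(H:MM:SS)" after the seconds.
summary_m = re.search(r"=+ (.+?) in ([\d.]+)s(?: \([\d:]+\))? =+", output)
# "===== 22 passed in 14.30s =====" → group(1) == "22 passed", group(2) == "14.30"
body = summary_m.group(1) if summary_m else ""

# Pull each count out of the summary body, defaulting to 0 when the label is absent
passed   = int(m.group(1)) if (m := re.search(r"(\d+) passed",  body)) else 0
failed   = int(m.group(1)) if (m := re.search(r"(\d+) failed",  body)) else 0
warnings = int(m.group(1)) if (m := re.search(r"(\d+) warning", body)) else 0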

It also extracts all lines beginning with FAILED from the output for quick reporting. The return dict includes passed, failed, warnings, duration_s, summary, failed_lines, output_path, fast, and returncode — structured enough for /debug to consume without parsing raw text.
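Putting the parsing steps together, the return value could be assembled along these lines. Treat it as a sketch: parse_pytest_output is a hypothetical name, the dict keys follow the list above, and the summary regex here allows for the "(H:MM:SS)" suffix that pytest prints after the seconds on runs longer than a minute.

```python
import re


def parse_pytest_output(output: str, returncode: int, fast: bool, output_path: str) -> dict:
    """Build a structured result from raw pytest output (illustrative only)."""
    summary_m = re.search(r"=+ (.+?) in ([\d.]+)s(?: \([\d:]+\))? =+", output)
    body = summary_m.group(1) if summary_m else ""

    def count(label: str) -> int:
        # "(\d+) passed" does not match "2 xpassed" because of the "x" after the space.
        m = re.search(rf"(\d+) {label}", body)
        return int(m.group(1)) if m else 0

    return {
        "passed": count("passed"),
        "failed": count("failed"),
        "warnings": count("warning"),
        "duration_s": float(summary_m.group(2)) if summary_m else None,
        "summary": body,
        "failed_lines": [ln for ln in output.splitlines() if ln.startswith("FAILED")],
        "output_path": output_path,
        "fast": fast,
        "returncode": returncode,
    }
```

A consumer such as /debug can then branch on result["failed"] or iterate result["failed_lines"] without touching the raw text.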

The fast suite skips tests marked with the slow pytest marker, primarily the end-to-end Ollama integration test that runs a full research pipeline against a live model. On an RTX 5000 Ada that test takes 10–15 minutes. Run /test-harness --full only when changing inference routing or the tool loop, where the integration test is the relevant gate.

Both skills are explicit-only — /scratchpad has no auto-activation predicate so it never fires on research tasks where the model's estimates are acceptable. The opt-in design is intentional: forcing code execution for every task that mentions a number would add unnecessary latency and risk execution errors on tasks where approximate values are fine.