June 1, 2026 • 5 min read • Agentic Harness Engineering

The Developer Utility Skills: /scratchpad and /test-harness

Two explicit-only skills for development workflows. /scratchpad forces the Python tool loop on for any task type and injects a synthesis instruction that forbids LLM estimation — all numbers in the output must come from actual code execution. /test-harness runs the pytest suite, saves structured output for downstream agent retrieval, and parses the summary line into a structured result.

/scratchpad: grounding synthesis in executed code

The research pipeline skips the Python tool loop for research and best_practices task types — searching and synthesizing prose doesn't require code execution. But some research tasks need exact values: "how many runs passed last week?", "what's the distribution of Wiggum scores?", "how many tokens did the autoresearch loop consume?". Without code, the model estimates. With /scratchpad, it computes.

# Compute exact stats from runs.jsonl before synthesizing
python agent.py "/scratchpad Read autoresearch.tsv and report the score trajectory since experiment 40"

# Any task type — /scratchpad forces the tool loop regardless
python agent.py "/scratchpad How many PASS runs did qwen3.6-35b produce in the last 7 days?"

/scratchpad is a pre_synthesis hook that modifies two aspects of the pipeline rather than running code directly:

1
Forces the tool loop

The check plan.task_type in ("research", "best_practices") and not _scratchpad_active gates the Python tool loop. When /scratchpad is active, that gate opens regardless of task type — every synthesis turn gets access to code execution.

2
Injects a strict synthesis instruction

SYNTH_INSTRUCTION_SCRATCHPAD is prepended to the synthesis context, replacing the default prompt. It forbids the model from hedging or paraphrasing computed values — if a number wasn't computed, it must say so explicitly rather than approximating.

IMPORTANT: This task requires concrete, deterministic values.
Do NOT estimate or approximate any numbers — use only exact values
from the Code execution results block above.
If a value was computed and saved via save_result(), treat it as ground truth.
Reference computed values directly (e.g. 'exactly 42 runs matched' not 'approximately 40').
If a required value was not computed, state that explicitly rather than guessing.

Prior result injection

When /scratchpad is active, the pipeline also loads previous scratch results for the same topic from agent-workspace/scratch/. scratch_helpers.load_recent(n=3, task_hint=task[:60]) retrieves the three most recent result files whose names are semantically similar to the current task. These are prepended to the code context as "Prior scratchpad results" — allowing the model to reference previously computed values without recomputing them.

Prior results are passed through strip_injection_candidates() before injection — the same security scanner that protects the research context from prompt injection. This matters because prior scratch files could contain attacker-controlled content if a previous run fetched data from an untrusted source.

/test-harness: running the test suite from inside the agent loop

# Fast suite (skips the ~14-min Ollama integration test)
python agent.py "/test-harness"

# Full suite including slow integration tests
python agent.py "/test-harness --full"

# Keyword filter (passed through to pytest -k)
python agent.py "/test-harness -k oversearch"
python agent.py "/test-harness -k wiggum"

The skill runs python run_tests.py (with --fast appended unless --full is present) via subprocess.run() with capture_output=True. Any flags beyond --full and --fast are passed through to pytest — including -k <keyword> for test name filtering.

Output is saved to two locations: agent-workspace/test-results/latest.txt (always overwritten) and a timestamped copy agent-workspace/test-results/20260602_143021.txt. The latest.txt path is stable — it's what the agent reads when asked to interpret test results without re-running the suite.

The skill parses the pytest summary line with a regex that handles deselected, xfailed, xpassed, and warning counts:

summary_m = re.search(r"=+ (.+?) in ([\d.]+)s =+", output)
# → "22 passed, 0 failed in 14.3s"

passed   = int(m.group(1)) if (m := re.search(r"(\d+) passed",  body)) else 0
failed   = int(m.group(1)) if (m := re.search(r"(\d+) failed",  body)) else 0
warnings = int(m.group(1)) if (m := re.search(r"(\d+) warning", body)) else 0

It also extracts all lines beginning with FAILED from the output for quick reporting. The return dict includes passed, failed, warnings, duration_s, summary, failed_lines, output_path, fast, and returncode — structured enough for /debug to consume without parsing raw text.

The fast suite skips tests marked with the slow pytest marker — primarily the end-to-end Ollama integration test that runs a full research pipeline against a live model. On an RTX 5000 Ada this takes 10–15 minutes. Run /test-harness --full only when changing inference routing or the tool loop, where the integration test is the relevant gate.

Both skills are explicit-only — /scratchpad has no auto-activation predicate so it never fires on research tasks where the model's estimates are acceptable. The opt-in design is intentional: forcing code execution for every task that mentions a number would add unnecessary latency and risk execution errors on tasks where approximate values are fine.