The Developer Utility Skills: /scratchpad and /test-harness
Two explicit-only skills for development workflows. /scratchpad forces the Python tool loop on for any task type and injects a synthesis instruction that forbids LLM estimation — all numbers in the output must come from actual code execution. /test-harness runs the pytest suite, saves structured output for downstream agent retrieval, and parses the summary line into a structured result.
/scratchpad: grounding synthesis in executed code
The research pipeline skips the Python tool loop for research and best_practices task types — searching and synthesizing prose doesn't require code execution. But some research tasks need exact values: "how many runs passed last week?", "what's the distribution of Wiggum scores?", "how many tokens did the autoresearch loop consume?". Without code, the model estimates. With /scratchpad, it computes.
# Compute exact stats from runs.jsonl before synthesizing
python agent.py "/scratchpad Read autoresearch.tsv and report the score trajectory since experiment 40"
# Any task type — /scratchpad forces the tool loop regardless
python agent.py "/scratchpad How many PASS runs did qwen3.6-35b produce in the last 7 days?"
/scratchpad is a pre_synthesis hook that modifies two aspects of the pipeline rather than running code directly:
The check plan.task_type in ("research", "best_practices") and not _scratchpad_active gates the Python tool loop. When /scratchpad is active, that gate opens regardless of task type — every synthesis turn gets access to code execution.
SYNTH_INSTRUCTION_SCRATCHPAD is prepended to the synthesis context, replacing the default prompt. It forbids the model from hedging or paraphrasing computed values — if a number wasn't computed, it must say so explicitly rather than approximating.
IMPORTANT: This task requires concrete, deterministic values.
Do NOT estimate or approximate any numbers — use only exact values
from the Code execution results block above.
If a value was computed and saved via save_result(), treat it as ground truth.
Reference computed values directly (e.g. 'exactly 42 runs matched' not 'approximately 40').
If a required value was not computed, state that explicitly rather than guessing.
Prior result injection
When /scratchpad is active, the pipeline also loads previous scratch results for the same topic from agent-workspace/scratch/. scratch_helpers.load_recent(n=3, task_hint=task[:60]) retrieves the three most recent result files whose names are semantically similar to the current task. These are prepended to the code context as "Prior scratchpad results" — allowing the model to reference previously computed values without recomputing them.
Prior results are passed through strip_injection_candidates() before injection — the same security scanner that protects the research context from prompt injection. This matters because prior scratch files could contain attacker-controlled content if a previous run fetched data from an untrusted source.
/test-harness: running the test suite from inside the agent loop
# Fast suite (skips the ~14-min Ollama integration test)
python agent.py "/test-harness"
# Full suite including slow integration tests
python agent.py "/test-harness --full"
# Keyword filter (passed through to pytest -k)
python agent.py "/test-harness -k oversearch"
python agent.py "/test-harness -k wiggum"
The skill runs python run_tests.py (with --fast appended unless --full is present) via subprocess.run() with capture_output=True. Any flags beyond --full and --fast are passed through to pytest — including -k <keyword> for test name filtering.
Output is saved to two locations: agent-workspace/test-results/latest.txt (always overwritten) and a timestamped copy agent-workspace/test-results/20260602_143021.txt. The latest.txt path is stable — it's what the agent reads when asked to interpret test results without re-running the suite.
The skill parses the pytest summary line with a regex that handles deselected, xfailed, xpassed, and warning counts:
summary_m = re.search(r"=+ (.+?) in ([\d.]+)s =+", output)
# → "22 passed, 0 failed in 14.3s"
passed = int(m.group(1)) if (m := re.search(r"(\d+) passed", body)) else 0
failed = int(m.group(1)) if (m := re.search(r"(\d+) failed", body)) else 0
warnings = int(m.group(1)) if (m := re.search(r"(\d+) warning", body)) else 0
It also extracts all lines beginning with FAILED from the output for quick reporting. The return dict includes passed, failed, warnings, duration_s, summary, failed_lines, output_path, fast, and returncode — structured enough for /debug to consume without parsing raw text.
The fast suite skips tests marked with the slow pytest marker — primarily the end-to-end Ollama integration test that runs a full research pipeline against a live model. On an RTX 5000 Ada this takes 10–15 minutes. Run /test-harness --full only when changing inference routing or the tool loop, where the integration test is the relevant gate.
Both skills are explicit-only — /scratchpad has no auto-activation predicate so it never fires on research tasks where the model's estimates are acceptable. The opt-in design is intentional: forcing code execution for every task that mentions a number would add unnecessary latency and risk execution errors on tasks where approximate values are fine.