May 28, 2026 • 8 min read • Agentic Harness Engineering

Inside agent.py: The Three-Turn Research Pipeline

agent.py is the central file in the harness — the orchestrator that sequences vision preprocessing, novelty-gated web search, LLM synthesis, and the Wiggum evaluation loop from a single task string.

Every research run, eval task, skill invocation, and autoresearch experiment passes through agent.py. It's not a large framework — it's a tight pipeline with clear stages and explicit constants. Understanding this file is understanding how the harness works.

Three turns

Turn 1

Gather

Detect image paths in task string
Extract vision context via vision.py if images found
Run planner (make_plan) to get targeted queries
Execute novelty-gated web search rounds (2–5)
Optionally enrich top-N URLs for full page content

Turn 2

Synthesize

Build synthesis prompt with memory context + planner notes
Inject SYNTH_INSTRUCTION (technical/count/prose variant)
Call producer model (default: qwen3.6-35b)
Write output directly to the path specified in the task
Log token counts, model, duration to runs.jsonl

Turn 3

Verify

Run Wiggum evaluation on the written file
On FAIL: extract issue, revise synthesis
Loop until PASS or MAX_WIGGUM_ROUNDS
Store output in memory store on PASS
Skip entirely if --no-wiggum flag set
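
The three turns can be read as one small control loop. The sketch below is not the code in agent.py: the stage callables (`gather`, `synthesize`, `evaluate`, `revise`) are hypothetical stand-ins passed in as arguments, and the value of `MAX_WIGGUM_ROUNDS` is an assumption, since the post does not state it.

```python
from typing import Callable, Optional, Tuple

MAX_WIGGUM_ROUNDS = 3  # assumed value; the post names the constant but not its value


def run_pipeline(
    task: str,
    gather: Callable[[str], str],
    synthesize: Callable[[str, str], str],
    evaluate: Callable[[str], Tuple[bool, Optional[str]]],
    revise: Callable[[str, Optional[str]], str],
    use_wiggum: bool = True,
) -> Tuple[str, Optional[bool]]:
    """Run gather -> synthesize -> verify and return (output, verdict).

    verdict is True on PASS, False if the round cap was reached,
    and None when verification was skipped.
    """
    context = gather(task)              # Turn 1: vision, planning, search
    draft = synthesize(task, context)   # Turn 2: producer model call

    if not use_wiggum:                  # mirrors the --no-wiggum flag
        return draft, None

    for _ in range(MAX_WIGGUM_ROUNDS):  # Turn 3: evaluate, revise, repeat
        passed, issue = evaluate(draft)
        if passed:
            return draft, True
        draft = revise(draft, issue)
    return draft, False
```

Passing the stages in as callables keeps the control flow testable without a model or a network connection.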

Novelty gating

The search loop in Turn 1 doesn't just run N fixed rounds — it uses novelty scoring to decide whether another search round would add meaningful information. After each round, the new results are scored 0–10 against the accumulated knowledge state. If the score falls below NOVELTY_THRESHOLD, the loop stops early.

Constant	Value	Meaning
`SEARCHES_PER_TASK`	2	Minimum rounds before novelty gating kicks in
`MAX_SEARCH_ROUNDS`	5	Hard cap regardless of novelty scores
`NOVELTY_THRESHOLD`	3	0–10 score below which a round is skipped
`NOVELTY_EPSILON`	0.15	ε-greedy pass-through: 15% of sub-threshold rounds still run
`SEARCH_QUALITY_FLOOR`	1800 chars	If total merged content is below this, run one more round
`MAX_RESULTS_PER_SEARCH`	5	DDGS results per query
`URL_ENRICH_COUNT`	2	Top-N URLs to fetch for full-page enrichment
`URL_ENRICH_MAX_CHARS`	8000	Per-URL character cap to prevent context bloat

The epsilon term prevents the loop from permanently abandoning a search direction based on one unlucky round. A 15% pass-through means sub-threshold queries occasionally run, which catches cases where the novelty scorer itself was wrong about the marginal value of a search direction. /deep disables this gate entirely, forcing all 5 rounds regardless of scores — used by mine_knowledge.py for authoritative reference mining.
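
One way to read the table and the epsilon rule together is as a single "run another round?" decision. The following is a minimal sketch under stated assumptions, not the loop in agent.py: the function name `should_continue`, the order of the checks, and the injectable `rng` parameter are all illustrative.

```python
import random
from typing import Callable

# Values taken from the constants table above.
SEARCHES_PER_TASK = 2
MAX_SEARCH_ROUNDS = 5
NOVELTY_THRESHOLD = 3
NOVELTY_EPSILON = 0.15
SEARCH_QUALITY_FLOOR = 1800  # characters of merged content


def should_continue(
    rounds_done: int,
    novelty: float,
    total_chars: int,
    deep: bool = False,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide whether to run another search round (check order is assumed)."""
    if rounds_done >= MAX_SEARCH_ROUNDS:
        return False  # hard cap always wins
    if deep:
        return True   # /deep forces every round up to the cap
    if rounds_done < SEARCHES_PER_TASK:
        return True   # gating has not kicked in yet
    if total_chars < SEARCH_QUALITY_FLOOR:
        return True   # too little content gathered so far
    if novelty >= NOVELTY_THRESHOLD:
        return True   # last round still added new information
    # Sub-threshold: let a small fraction through anyway (epsilon-greedy).
    return rng() < NOVELTY_EPSILON
```

Injecting `rng` makes the epsilon branch deterministic in tests, for example `should_continue(3, 1, 5000, rng=lambda: 0.1)`.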

SYNTH_INSTRUCTION: the autoresearch target

The synthesis prompt always ends with SYNTH_INSTRUCTION. This is the string that autoresearch.py reads, mutates, and writes back between experiments — it's the only variable the autoresearch loop changes between runs.

There are five instruction variants, selected based on task classification:

SYNTH_INSTRUCTION — technical tasks (keyword match against ~45 technical terms: "api", "deploy", "cuda", "docker", etc.)
SYNTH_INSTRUCTION_COUNT — "top N" / enumerated tasks (matched by planner's task_type == "enumerated")
SYNTH_INSTRUCTION_PROSE — non-technical analysis tasks (data analysis, file-reading tasks where explicit coding keywords are absent)
SYNTH_INSTRUCTION_RESEARCH — economic/policy research tasks (matched by _is_research_task() — verbs like "analyze", "assess", "what drove", "how has"). Enforces citation discipline: every empirical claim must be anchored to retrieved context with inline citations ([FRED:SERIES:DATE], [BEA:...], Beige Book passages). Prevents vague characterizations — requires verbatim figures and named districts.
SYNTH_INSTRUCTION_TRADING — trading thesis tasks (matched by _is_trading_task() — keywords like "alpaca", "long thesis", "trade setup"). Enforces a structured thesis format with [THESIS:...] citation anchors, price anchoring from live [YF:ticker:snapshot] context, and a portfolio summary section.

The prose variant exists because data-analysis tasks — "read autoresearch.tsv and report score trajectories" — shouldn't be prompted to include code examples and API signatures. The research and trading variants enforce source-citation discipline appropriate to their domains. The _is_technical_task(), _is_research_task(), and _is_trading_task() classifiers are checked in priority order: trading → technical → research → prose.
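
The priority order can be sketched as a chain of checks. This is an illustration only: the keyword lists contain just the examples quoted above (the real technical list has about 45 terms), the helper `_mentions` and the function `select_variant` are hypothetical, and the position of the enumerated check is an assumption because the post does not say where it falls in the order.

```python
import re

# Example keywords quoted in the post; the real lists are longer.
TRADING_KEYWORDS = ["alpaca", "long thesis", "trade setup"]
TECHNICAL_KEYWORDS = ["api", "deploy", "cuda", "docker"]
RESEARCH_KEYWORDS = ["analyze", "assess", "what drove", "how has"]


def _mentions(task: str, keywords: list) -> bool:
    """True if any keyword appears as a whole word or phrase in the task."""
    text = task.lower()
    return any(re.search(rf"\b{re.escape(kw)}\b", text) for kw in keywords)


def select_variant(task: str, task_type: str = "") -> str:
    """Return the name of the instruction constant to use for this task."""
    if task_type == "enumerated":   # assumed to be checked first
        return "SYNTH_INSTRUCTION_COUNT"
    if _mentions(task, TRADING_KEYWORDS):
        return "SYNTH_INSTRUCTION_TRADING"
    if _mentions(task, TECHNICAL_KEYWORDS):
        return "SYNTH_INSTRUCTION"
    if _mentions(task, RESEARCH_KEYWORDS):
        return "SYNTH_INSTRUCTION_RESEARCH"
    return "SYNTH_INSTRUCTION_PROSE"
```

Whole-word matching avoids false positives such as "api" inside "capital"; whether the harness matches on word boundaries or plain substrings is not stated in the post.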

The instruction strings are guarded by # AUTORESEARCH:SYNTH_INSTRUCTION:BEGIN/END sentinels. The autoresearch loop writes new values between the sentinels using regex substitution, so renaming the sentinels or moving them to a different line would break the experiment loop.
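
A sentinel-guarded rewrite of this kind might look like the sketch below. It assumes the BEGIN and END markers each sit on their own line and are spelled as shown; the function name `replace_between_sentinels` is hypothetical and this is not the code in autoresearch.py.

```python
import re

# Assumed expansion of the "BEGIN/END" shorthand used in the post.
BEGIN = "# AUTORESEARCH:SYNTH_INSTRUCTION:BEGIN"
END = "# AUTORESEARCH:SYNTH_INSTRUCTION:END"

_BLOCK = re.compile(
    rf"({re.escape(BEGIN)}\n).*?({re.escape(END)})",
    re.DOTALL,
)


def replace_between_sentinels(source: str, new_body: str) -> str:
    """Swap the text between the sentinels, leaving the sentinels in place."""
    body = new_body.rstrip("\n") + "\n"
    # A function replacement avoids backslash escapes in new_body being
    # interpreted as regex group references.
    new_source, count = _BLOCK.subn(
        lambda m: m.group(1) + body + m.group(2), source, count=1
    )
    if count != 1:
        raise ValueError("sentinel pair not found; refusing to write")
    return new_source
```

Failing loudly when the sentinels are missing is a deliberate choice here: a silent no-op would let an experiment run with an unmutated instruction.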

Thinking model detection

Some Ollama models (Qwen3, QwQ) default to a chain-of-thought "thinking" mode that produces reasoning tokens before the actual response. This consumes the model's num_predict budget before any output appears — on an 8192-token limit, a thinking run can exhaust the budget entirely and produce no response body.

_synth_options() detects thinking models by name and sets think=False by default. An override via HARNESS_PRODUCER_THINK=1 re-enables thinking and doubles num_predict to 16,384 to accommodate the reasoning token overhead.

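import os

# Model-name markers for Ollama models that emit reasoning tokens by default.
_THINKING_MODELS = {"qwen3", "qwq"}

def _synth_options(producer_model: str) -> dict:
    # _is_thinking_model() is defined elsewhere in agent.py and is not shown here.
    opts = {"temperature": 0.1, "num_predict": 8192}
    think_override = os.environ.get("HARNESS_PRODUCER_THINK", "")
    if think_override == "1":
        # Explicit opt-in: enable thinking and double the budget for reasoning tokens.
        opts["think"] = True
        opts["num_predict"] = 16384
    elif _is_thinking_model(producer_model):
        # Default for known thinking models: suppress reasoning output.
        opts["think"] = False
    return opts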

Keep-alive estimation

Ollama unloads a model from GPU memory after a period of inactivity. If the keep-alive is too short, the model reloads mid-pipeline — adding 30–90 seconds of latency between synthesis and evaluation. If it's too long, GPU memory stays committed to a model that won't be used again for hours.

_estimate_keep_alive() computes a per-run value from the last 100 entries in runs.jsonl:

Read last 100 runs.jsonl entries → Filter by task_type → 90th-percentile duration × 1.2 → keep_alive seconds

If fewer than 5 matching runs exist, it falls back to all task types. If history is entirely absent, a skill-aware heuristic applies: 90 seconds for short skills (github, email), indefinite (-1) for lit-review, 300–450 seconds for standard research, more for /deep runs. The OLLAMA_KEEP_ALIVE env var overrides everything — set it to -1 to keep the model loaded permanently, which is useful during autoresearch experiments where back-to-back eval runs would otherwise trigger repeated cold reloads.
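
The history-based part of the estimate can be sketched as follows. The record keys `task_type` and `duration`, the nearest-rank percentile method, and the function name `estimate_keep_alive` are assumptions; the skill-aware fallback and the `OLLAMA_KEEP_ALIVE` override are left out.

```python
from typing import Optional


def estimate_keep_alive(
    runs: list,
    task_type: str,
    window: int = 100,
    min_matches: int = 5,
    margin: float = 1.2,
) -> Optional[int]:
    """Return keep-alive seconds from run history, or None if there is none.

    `runs` is a list of dicts parsed from runs.jsonl, oldest first.
    """
    recent = runs[-window:]
    durations = [
        r["duration"] for r in recent
        if r.get("task_type") == task_type and "duration" in r
    ]
    if len(durations) < min_matches:
        # Too few matching runs: fall back to every task type.
        durations = [r["duration"] for r in recent if "duration" in r]
    if not durations:
        return None  # caller applies the skill-aware heuristic instead

    ordered = sorted(durations)
    rank = (9 * len(ordered) + 9) // 10  # ceil(0.9 * n), nearest-rank p90
    return round(ordered[rank - 1] * margin)
```

With ten matching runs lasting 10, 20, ..., 100 seconds, the 90th-percentile value is 90 and the result is 108 seconds.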

Python code execution tool

The synthesis step has access to a run_python tool — a sandboxed Python executor with a 10-second timeout. When the model requests code execution (for data processing, computation, or analysis), the agent runs the code via subprocess, captures stdout/stderr, and feeds the output back into the next synthesis turn. The tool is available for up to 3 rounds (PYTHON_TOOL_ROUNDS) before the loop ends. Security checks from security.py run on the code before execution to catch obvious injection attempts.
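
The subprocess-with-timeout mechanics can be sketched with the standard library alone. This shows only the execution and capture step: it provides no real sandboxing and omits the security.py checks, and the return shape (a dict with `ok`, `stdout`, `stderr`) is an assumption.

```python
import subprocess
import sys

PYTHON_TOOL_TIMEOUT = 10.0  # seconds, per the post


def run_python(code: str, timeout: float = PYTHON_TOOL_TIMEOUT) -> dict:
    """Run `code` in a child interpreter and capture its output."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "stderr": f"timed out after {timeout} s"}
    return {
        "ok": proc.returncode == 0,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }
```

Returning the timeout as an ordinary result rather than raising lets the caller feed the failure back to the model as tool output on the next round.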

Skill dispatch

Before the research loop begins, parse_skills() checks the task string for slash commands (/github, /email, /lit-review, etc.). If a standalone skill is detected, execution branches immediately to that skill and skips the research pipeline entirely. auto_activate() activates implicit skills based on task content — an image path in the task auto-activates the vision skill even without an explicit /vision command. For the full skill registry, see The op.py CLI.
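
A sketch of the two steps, assuming slash commands are whitespace-delimited tokens and image paths are recognized by file extension. The `KNOWN_SKILLS` set is an illustrative subset, and both function signatures are guesses rather than the harness's actual interface.

```python
import re

KNOWN_SKILLS = {"github", "email", "lit-review", "vision", "deep"}  # illustrative subset

# A slash command is "/name" standing alone, so "/tmp/chart.png" does not match.
_SLASH = re.compile(r"(?<!\S)/([a-z][a-z0-9-]*)(?=\s|$)")
_IMAGE = re.compile(r"\S+\.(?:png|jpe?g|gif|webp)\b", re.IGNORECASE)


def parse_skills(task: str):
    """Return (skills found, task text with those commands removed)."""
    skills = [m.group(1) for m in _SLASH.finditer(task) if m.group(1) in KNOWN_SKILLS]
    stripped = _SLASH.sub(
        lambda m: "" if m.group(1) in KNOWN_SKILLS else m.group(0), task
    )
    return skills, " ".join(stripped.split())


def auto_activate(task: str, skills: list) -> list:
    """Add the vision skill when the task mentions an image path."""
    if _IMAGE.search(task) and "vision" not in skills:
        return skills + ["vision"]
    return skills
```

For example, `parse_skills("/github summarize open PRs")` yields the `github` skill and the remaining text, while a task containing `/tmp/chart.png` yields no explicit skill and relies on `auto_activate` to add `vision`.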

The producer model defaults to HARNESS_PRODUCER_MODEL → qwen3.6-35b. On startup, agent.py probes the vLLM /models endpoint and falls back to whatever model is currently loaded if the configured model isn't available — preventing 404 errors when switching model configs without a server restart.
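
The fallback decision itself is small once the list of served models is known. The sketch below takes that list as an argument rather than querying the endpoint; the function name `resolve_producer_model` and the choice of the first served model as the fallback are assumptions.

```python
import os
from typing import Mapping, Optional, Sequence

DEFAULT_PRODUCER_MODEL = "qwen3.6-35b"


def resolve_producer_model(
    available: Sequence[str],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick the configured model if it is served, else fall back."""
    env = os.environ if env is None else env
    configured = env.get("HARNESS_PRODUCER_MODEL") or DEFAULT_PRODUCER_MODEL
    if configured in available:
        return configured
    if available:
        return available[0]  # assumed: use whatever the server has loaded
    return configured        # nothing reported; let the request surface the error
```

Keeping the network probe separate from this decision means the fallback logic can be tested with plain lists.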
