May 28, 2026 • 8 min read • Agentic Harness Engineering

Inside agent.py: The Three-Turn Research Pipeline

agent.py is the central file in the harness — the orchestrator that sequences vision preprocessing, novelty-gated web search, LLM synthesis, and the Wiggum evaluation loop from a single task string.

Every research run, eval task, skill invocation, and autoresearch experiment passes through agent.py. It's not a large framework — it's a tight pipeline with clear stages and explicit constants. Understanding this file is understanding how the harness works.

Three turns

Turn 1: Gather

  • Detect image paths in task string
  • Extract vision context via vision.py if images found
  • Run planner (make_plan) to get targeted queries
  • Execute novelty-gated web search rounds (2–5)
  • Optionally enrich top-N URLs for full page content

Turn 2: Synthesize

  • Build synthesis prompt with memory context + planner notes
  • Inject SYNTH_INSTRUCTION (technical/count/prose variant)
  • Call producer model (default: qwen3.6-35b)
  • Write output directly to the path specified in the task
  • Log token counts, model, duration to runs.jsonl

Turn 3: Verify

  • Run Wiggum evaluation on the written file
  • On FAIL: extract issue, revise synthesis
  • Loop until PASS or MAX_WIGGUM_ROUNDS
  • Store output in memory store on PASS
  • Skip entirely if --no-wiggum flag set
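
The Turn 3 loop can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in agent.py: `evaluate` and `revise` are hypothetical callables standing in for the Wiggum evaluator and the revision call, and the value 3 for MAX_WIGGUM_ROUNDS is assumed, since only the constant's name is given here.

```python
from typing import Callable, Optional, Tuple

MAX_WIGGUM_ROUNDS = 3  # assumed value; only the constant's name is documented


def verify_loop(
    draft: str,
    evaluate: Callable[[str], Tuple[bool, Optional[str]]],
    revise: Callable[[str, str], str],
    max_rounds: int = MAX_WIGGUM_ROUNDS,
) -> Tuple[str, bool, int]:
    """Evaluate a draft, revising on FAIL until PASS or the round cap.

    `evaluate` and `revise` are hypothetical stand-ins for the evaluator
    and the revision step. Returns (final_draft, passed, rounds_used).
    """
    for round_no in range(1, max_rounds + 1):
        passed, issue = evaluate(draft)
        if passed:
            return draft, True, round_no
        # Only revise if another evaluation round remains to check the result.
        if round_no < max_rounds:
            draft = revise(draft, issue or "unspecified issue")
    return draft, False, max_rounds


# Toy evaluator: passes once the draft contains a citation marker.
def toy_evaluate(text: str) -> Tuple[bool, Optional[str]]:
    ok = "[1]" in text
    return ok, None if ok else "missing citation"


def toy_revise(text: str, issue: str) -> str:
    return text + " [1]"


print(verify_loop("Draft body.", toy_evaluate, toy_revise))
```

The round cap is what keeps a persistently failing draft from looping forever; in this sketch the caller decides what to do with an output that never passed.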

Novelty gating

The search loop in Turn 1 doesn't just run N fixed rounds — it uses novelty scoring to decide whether another search round would add meaningful information. After each round, the new results are scored 0–10 against the accumulated knowledge state. If the score falls below NOVELTY_THRESHOLD, the loop stops early.

Constant                 Value        Meaning
SEARCHES_PER_TASK        2            Minimum rounds before novelty gating kicks in
MAX_SEARCH_ROUNDS        5            Hard cap regardless of novelty scores
NOVELTY_THRESHOLD        3            0–10 score below which a round is skipped
NOVELTY_EPSILON          0.15         ε-greedy pass-through: 15% of sub-threshold rounds still run
SEARCH_QUALITY_FLOOR     1800 chars   If total merged content is below this, run one more round
MAX_RESULTS_PER_SEARCH   5            DDGS results per query
URL_ENRICH_COUNT         2            Top-N URLs to fetch for full-page enrichment
URL_ENRICH_MAX_CHARS     8000         Per-URL character cap to prevent context bloat

The epsilon term prevents the loop from permanently abandoning a search direction based on one unlucky round. A 15% pass-through means sub-threshold queries occasionally run, which catches cases where the novelty scorer itself was wrong about the marginal value of a search direction. /deep disables this gate entirely, forcing all 5 rounds regardless of scores — used by mine_knowledge.py for authoritative reference mining.
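
One way the constants above could combine into a single "run another round?" decision is sketched below. The constant values come from the table; the function name, its parameters, and the order in which the checks are applied are assumptions made for illustration, and the real loop may interleave them differently.

```python
import random
from typing import Optional

SEARCHES_PER_TASK = 2
MAX_SEARCH_ROUNDS = 5
NOVELTY_THRESHOLD = 3
NOVELTY_EPSILON = 0.15
SEARCH_QUALITY_FLOOR = 1800  # characters of merged content


def should_run_another_round(
    rounds_done: int,
    last_novelty: float,
    merged_chars: int,
    deep: bool = False,
    rng: Optional[random.Random] = None,
) -> bool:
    """Hypothetical gate: decide whether to start one more search round."""
    rng = rng or random
    if rounds_done >= MAX_SEARCH_ROUNDS:
        return False  # hard cap always wins
    if deep:
        return True  # /deep disables the gate and runs every round
    if rounds_done < SEARCHES_PER_TASK:
        return True  # minimum rounds before gating applies
    if merged_chars < SEARCH_QUALITY_FLOOR:
        return True  # too little content gathered so far
    if last_novelty >= NOVELTY_THRESHOLD:
        return True  # the last round still added new information
    # Sub-threshold: epsilon-greedy pass-through, roughly 15% of the time.
    return rng.random() < NOVELTY_EPSILON


print(should_run_another_round(rounds_done=3, last_novelty=7, merged_chars=5000))
```

Passing a seeded `random.Random` as `rng` makes the epsilon branch reproducible, which is convenient when replaying a run.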

SYNTH_INSTRUCTION: the autoresearch target

The synthesis prompt always ends with SYNTH_INSTRUCTION. This is the string that autoresearch.py reads, mutates, and writes back between experiments — it's the only variable the autoresearch loop changes between runs.

There are three instruction variants, selected based on task classification.

The prose variant exists because data-analysis tasks — "read autoresearch.tsv and report score trajectories" — shouldn't be prompted to include code examples and API signatures. The research and trading variants enforce source-citation discipline appropriate to their domains. The _is_technical_task(), _is_research_task(), and _is_trading_task() classifiers are checked in priority order: trading → technical → research → prose.
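
The priority order described above amounts to a first-match dispatch with prose as the fallthrough. The sketch below shows only that ordering: the `toy_is_*` predicates and their keyword lists are invented for illustration and are not the harness's `_is_*_task()` classifiers, whose logic is not shown here.

```python
from typing import Callable, List, Tuple


# Hypothetical keyword predicates, used only to demonstrate the ordering.
def toy_is_trading(task: str) -> bool:
    return any(word in task.lower() for word in ("backtest", "ticker", "portfolio"))


def toy_is_technical(task: str) -> bool:
    return any(word in task.lower() for word in ("api", "library", "implement"))


def toy_is_research(task: str) -> bool:
    return any(word in task.lower() for word in ("survey", "literature", "compare"))


# Checked top to bottom; the first predicate that matches wins.
PRIORITY: List[Tuple[str, Callable[[str], bool]]] = [
    ("trading", toy_is_trading),
    ("technical", toy_is_technical),
    ("research", toy_is_research),
]


def select_variant(task: str) -> str:
    """Return the first matching label in priority order, else 'prose'."""
    for label, predicate in PRIORITY:
        if predicate(task):
            return label
    return "prose"


print(select_variant("read autoresearch.tsv and report score trajectories"))
```

Because the list is ordered, a task that matches more than one predicate resolves to the earlier label, which is the behavior the priority order implies.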

All three instruction strings are guarded by # AUTORESEARCH:SYNTH_INSTRUCTION:BEGIN/END sentinels. The autoresearch loop writes new values between the sentinels using regex substitution — renaming them or moving them to a different line would break the experiment loop.
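
A minimal sketch of sentinel-guarded substitution follows. It assumes a single BEGIN/END pair, each on its own comment line, and the helper name `replace_between_sentinels` is hypothetical; the actual regex used by autoresearch.py is not shown in this article.

```python
import re

BEGIN = "# AUTORESEARCH:SYNTH_INSTRUCTION:BEGIN"
END = "# AUTORESEARCH:SYNTH_INSTRUCTION:END"

# Capture the BEGIN line, the body between the sentinels, and the END line.
_BLOCK_RE = re.compile(
    rf"({re.escape(BEGIN)}\n)(.*?)({re.escape(END)})",
    re.DOTALL,
)


def replace_between_sentinels(source: str, new_body: str) -> str:
    """Swap the text between the sentinels, leaving the sentinels in place."""
    if not new_body.endswith("\n"):
        new_body += "\n"
    # A function replacement avoids backslash-escape surprises in new_body.
    updated, count = _BLOCK_RE.subn(
        lambda m: m.group(1) + new_body + m.group(3), source, count=1
    )
    if count != 1:
        raise ValueError("sentinel pair not found; was it renamed or moved?")
    return updated


example = (
    "X = 1\n"
    f"{BEGIN}\n"
    'SYNTH_INSTRUCTION = "old"\n'
    f"{END}\n"
    "Y = 2\n"
)
print(replace_between_sentinels(example, 'SYNTH_INSTRUCTION = "new"'))
```

Raising when the pattern does not match makes a renamed sentinel fail loudly instead of silently leaving the old instruction in place.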

Thinking model detection

Some Ollama models (Qwen3, QwQ) default to a chain-of-thought "thinking" mode that produces reasoning tokens before the actual response. This consumes the model's num_predict budget before any output appears — on an 8192-token limit, a thinking run can exhaust the budget entirely and produce no response body.

_synth_options() detects thinking models by name and sets think=False by default. An override via HARNESS_PRODUCER_THINK=1 re-enables thinking and doubles num_predict to 16,384 to accommodate the reasoning token overhead.

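import os

# Name fragments of models that default to chain-of-thought "thinking" output.
_THINKING_MODELS = {"qwen3", "qwq"}

def _is_thinking_model(model: str) -> bool:
    # Assumed helper, not shown in the original listing: substring match on the name.
    return any(name in model.lower() for name in _THINKING_MODELS)

def _synth_options(producer_model: str) -> dict:
    opts = {"temperature": 0.1, "num_predict": 8192}
    think_override = os.environ.get("HARNESS_PRODUCER_THINK", "")
    if think_override == "1":
        # Explicit opt-in: enable thinking and double the budget for reasoning tokens.
        opts["think"] = True
        opts["num_predict"] = 16384
    elif _is_thinking_model(producer_model):
        # Default for thinking models: suppress reasoning so the budget goes to the answer.
        opts["think"] = False
    return opts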

Keep-alive estimation

Ollama unloads a model from GPU memory after a period of inactivity. If the keep-alive is too short, the model reloads mid-pipeline — adding 30–90 seconds of latency between synthesis and evaluation. If it's too long, GPU memory stays committed to a model that won't be used again for hours.

_estimate_keep_alive() computes a per-run value from the last 100 entries in runs.jsonl:

Read last 100 runs.jsonl entries → Filter by task_type → 90th-percentile duration × 1.2 → keep_alive seconds

If fewer than 5 matching runs exist, it falls back to all task types. If history is entirely absent, a skill-aware heuristic applies: 90 seconds for short skills (github, email), indefinite (-1) for lit-review, 300–450 seconds for standard research, more for /deep runs. The OLLAMA_KEEP_ALIVE env var overrides everything — set it to -1 to keep the model loaded permanently, which is useful during autoresearch experiments where back-to-back eval runs would otherwise trigger repeated cold reloads.
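
The history-based part of this estimate can be sketched as below. Several details are assumptions: the `duration` key name, the nearest-rank percentile method, rounding to whole seconds, and the single `fallback_seconds` value that stands in for the skill-aware heuristic. The OLLAMA_KEEP_ALIVE override is left out to keep the example short.

```python
import json
import math
from typing import Iterable, List

HISTORY_WINDOW = 100     # last N log entries considered
MIN_MATCHING_RUNS = 5    # below this, fall back to all task types
SAFETY_FACTOR = 1.2


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile (an assumed method, chosen for simplicity)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def estimate_keep_alive(
    log_lines: Iterable[str], task_type: str, fallback_seconds: int = 300
) -> int:
    """Hypothetical estimator: p90 of recent run durations times 1.2."""
    lines = [line for line in log_lines if line.strip()][-HISTORY_WINDOW:]
    entries = [json.loads(line) for line in lines]
    durations = [
        e["duration"] for e in entries
        if e.get("task_type") == task_type and "duration" in e
    ]
    if len(durations) < MIN_MATCHING_RUNS:
        # Too little history for this task type: use every task type instead.
        durations = [e["duration"] for e in entries if "duration" in e]
    if not durations:
        return fallback_seconds  # stand-in for the skill-aware heuristic
    return round(percentile(durations, 90) * SAFETY_FACTOR)


history = [json.dumps({"task_type": "research", "duration": d}) for d in range(10, 110, 10)]
print(estimate_keep_alive(history, "research"))
```

Using a high percentile rather than the mean biases the estimate toward slow runs, so a typical run finishes well inside the keep-alive window.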

Python code execution tool

The synthesis step has access to a run_python tool — a sandboxed Python executor with a 10-second timeout. When the model requests code execution (for data processing, computation, or analysis), the agent runs the code via subprocess, captures stdout/stderr, and feeds the output back into the next synthesis turn. The tool is available for up to 3 rounds (PYTHON_TOOL_ROUNDS) before the loop ends. Security checks from security.py run on the code before execution to catch obvious injection attempts.
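
The execute-and-capture step might look roughly like this. It is a sketch, not the harness's run_python: the return shape is invented, the security.py checks are omitted, and running a child interpreter with a timeout is not a real sandbox on its own.

```python
import subprocess
import sys

PYTHON_TOOL_TIMEOUT = 10  # seconds, as described above


def run_python(code: str, timeout: float = PYTHON_TOOL_TIMEOUT) -> dict:
    """Run code in a child interpreter and capture its output.

    Hypothetical return shape: ok, stdout, stderr, returncode.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising on timeout.
        return {
            "ok": False,
            "stdout": "",
            "stderr": f"timed out after {timeout}s",
            "returncode": None,
        }
    return {
        "ok": proc.returncode == 0,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "returncode": proc.returncode,
    }


result = run_python("print(6 * 7)")
print(result["stdout"].strip())
```

Returning stderr alongside stdout matters for the feedback loop: a traceback fed back into the next synthesis turn gives the model a chance to correct its own code within the remaining rounds.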

Skill dispatch

Before the research loop begins, parse_skills() checks the task string for slash commands (/github, /email, /lit-review, etc.). If a standalone skill is detected, execution branches immediately to that skill and skips the research pipeline entirely. auto_activate() activates implicit skills based on task content — an image path in the task auto-activates the vision skill even without an explicit /vision command. For the full skill registry, see The op.py CLI.
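
As a rough illustration of the two steps, the sketch below extracts slash commands and then auto-activates vision when an image path is present. The function names, regular expressions, and image extension list are all assumptions; the real parse_skills() and auto_activate() signatures are not shown in this article.

```python
import re
from typing import List, Tuple

# A slash command is a token such as "/github" or "/lit-review" standing alone.
_SLASH_RE = re.compile(r"(?<!\S)/([a-z][a-z0-9-]*)(?=\s|$)")
# Assumed set of image extensions that trigger the vision skill.
_IMAGE_RE = re.compile(r"\S+\.(?:png|jpe?g|gif|webp)\b", re.IGNORECASE)


def sketch_parse_skills(task: str) -> Tuple[List[str], str]:
    """Return (explicit skill names, task text with the commands removed)."""
    skills = _SLASH_RE.findall(task)
    remaining = " ".join(_SLASH_RE.sub("", task).split())
    return skills, remaining


def sketch_auto_activate(task: str, skills: List[str]) -> List[str]:
    """Add implicit skills based on task content (here, only vision)."""
    active = list(skills)
    if "vision" not in active and _IMAGE_RE.search(task):
        active.append("vision")
    return active


skills, text = sketch_parse_skills("/github summarize open PRs")
print(skills, text)
print(sketch_auto_activate("describe /tmp/chart.png", []))
```

The lookahead in `_SLASH_RE` is what keeps a file path such as /tmp/chart.png from being mistaken for a command: the token must end at whitespace or at the end of the string.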

The producer model is read from HARNESS_PRODUCER_MODEL and defaults to qwen3.6-35b. On startup, agent.py probes the vLLM /models endpoint and falls back to whatever model is currently loaded if the configured model isn't available, which prevents 404 errors when switching model configs without a server restart.
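
The fallback decision can be sketched without any network access. The example assumes the endpoint returns an OpenAI-style model list (`{"data": [{"id": ...}]}`), which is what vLLM's OpenAI-compatible server returns; the helper names and the choice of the first listed model as the fallback are hypothetical.

```python
import json
import os
from typing import List, Mapping

DEFAULT_PRODUCER_MODEL = "qwen3.6-35b"


def parse_model_ids(payload: str) -> List[str]:
    """Extract model ids from an OpenAI-style model list response body."""
    data = json.loads(payload).get("data", [])
    return [item["id"] for item in data if "id" in item]


def resolve_producer_model(
    available: List[str], env: Mapping[str, str] = os.environ
) -> str:
    """Prefer the configured model; otherwise use whatever is loaded."""
    configured = env.get("HARNESS_PRODUCER_MODEL", DEFAULT_PRODUCER_MODEL)
    if configured in available or not available:
        # Nothing better is known, so keep the configured name.
        return configured
    return available[0]  # assumed policy: first model the server reports


payload = '{"object": "list", "data": [{"id": "llama-3-8b"}]}'
print(resolve_producer_model(parse_model_ids(payload), env={}))
```

Keeping the HTTP call separate from the decision makes the fallback policy easy to test with canned payloads.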