Inside agent.py: The Three-Turn Research Pipeline
agent.py is the central file in the harness — the orchestrator that sequences vision preprocessing, novelty-gated web search, LLM synthesis, and the Wiggum evaluation loop from a single task string.
Every research run, eval task, skill invocation, and autoresearch experiment passes through agent.py. It's not a large framework — it's a tight pipeline with clear stages and explicit constants. Understanding this file is understanding how the harness works.
Three turns
Gather
- Detect image paths in task string
- Extract vision context via vision.py if images found
- Run planner (make_plan) to get targeted queries
- Execute novelty-gated web search rounds (2–5)
- Optionally enrich top-N URLs for full page content
Synthesize
- Build synthesis prompt with memory context + planner notes
- Inject SYNTH_INSTRUCTION (technical/count/prose variant)
- Call producer model (default: qwen3.6-35b)
- Write output directly to the path specified in the task
- Log token counts, model, duration to runs.jsonl
Verify
- Run Wiggum evaluation on the written file
- On FAIL: extract issue, revise synthesis
- Loop until PASS or MAX_WIGGUM_ROUNDS
- Store output in memory store on PASS
- Skip entirely if --no-wiggum flag set
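The Verify turn can be sketched as a bounded evaluate-and-revise loop. This is a minimal illustration, not the code in agent.py: the `evaluate` and `revise` callables are hypothetical stand-ins for the Wiggum evaluator and the revision call, and the value used for MAX_WIGGUM_ROUNDS is an assumption, since the real value is not stated here.

```python
from typing import Callable, Tuple

MAX_WIGGUM_ROUNDS = 3  # assumed value for illustration; the real constant lives in agent.py


def verify_loop(
    draft: str,
    evaluate: Callable[[str], Tuple[bool, str]],  # hypothetical: returns (passed, issue)
    revise: Callable[[str, str], str],            # hypothetical: returns a revised draft
    max_rounds: int = MAX_WIGGUM_ROUNDS,
) -> Tuple[str, bool, int]:
    """Evaluate a draft, revising on FAIL, until PASS or the round cap is hit."""
    for round_no in range(1, max_rounds + 1):
        passed, issue = evaluate(draft)
        if passed:
            return draft, True, round_no
        # On FAIL, feed the extracted issue into a revision pass,
        # unless this was the final allowed round.
        if round_no < max_rounds:
            draft = revise(draft, issue)
    return draft, False, max_rounds
```

With a toy evaluator that passes once the draft contains the word "cited" and a reviser that appends it, the loop passes on the second round.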
Novelty gating
The search loop in Turn 1 doesn't just run N fixed rounds — it uses novelty scoring to decide whether another search round would add meaningful information. After each round, the new results are scored 0–10 against the accumulated knowledge state. If the score falls below NOVELTY_THRESHOLD, the loop stops early.
| Constant | Value | Meaning |
|---|---|---|
| SEARCHES_PER_TASK | 2 | Minimum rounds before novelty gating kicks in |
| MAX_SEARCH_ROUNDS | 5 | Hard cap regardless of novelty scores |
| NOVELTY_THRESHOLD | 3 | 0–10 score below which a round is skipped |
| NOVELTY_EPSILON | 0.15 | ε-greedy pass-through: 15% of sub-threshold rounds still run |
| SEARCH_QUALITY_FLOOR | 1800 chars | If total merged content is below this, run one more round |
| MAX_RESULTS_PER_SEARCH | 5 | DDGS results per query |
| URL_ENRICH_COUNT | 2 | Top-N URLs to fetch for full-page enrichment |
| URL_ENRICH_MAX_CHARS | 8000 | Per-URL character cap to prevent context bloat |
The epsilon term prevents the loop from permanently abandoning a search direction based on one unlucky round. A 15% pass-through means sub-threshold queries occasionally run, which catches cases where the novelty scorer itself was wrong about the marginal value of a search direction. /deep disables this gate entirely, forcing all 5 rounds regardless of scores — used by mine_knowledge.py for authoritative reference mining.
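The gating decision can be sketched as a single predicate over the constants in the table. This is an illustrative sketch only: the function name, its parameters, and the order in which the checks are applied are assumptions, and the actual loop in agent.py may combine them differently.

```python
import random

# Values taken from the constants table above.
SEARCHES_PER_TASK = 2
MAX_SEARCH_ROUNDS = 5
NOVELTY_THRESHOLD = 3
NOVELTY_EPSILON = 0.15
SEARCH_QUALITY_FLOOR = 1800


def should_run_another_round(rounds_done, last_novelty, total_chars,
                             deep=False, rng=random.random):
    """Hypothetical helper: decide whether to run one more search round."""
    if rounds_done >= MAX_SEARCH_ROUNDS:
        return False  # hard cap, regardless of novelty
    if deep:
        return True   # /deep disables the gate and forces all rounds
    if rounds_done < SEARCHES_PER_TASK:
        return True   # minimum rounds before gating applies
    if total_chars < SEARCH_QUALITY_FLOOR:
        return True   # merged content is too thin, so search again
    if last_novelty >= NOVELTY_THRESHOLD:
        return True   # the last round still added new information
    # Sub-threshold: epsilon-greedy pass-through lets some rounds run anyway.
    return rng() < NOVELTY_EPSILON
```

Passing `rng` as a parameter keeps the epsilon branch deterministic in tests, for example `rng=lambda: 0.1` to force a pass-through.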
SYNTH_INSTRUCTION: the autoresearch target
The synthesis prompt always ends with SYNTH_INSTRUCTION. This is the string that autoresearch.py reads, mutates, and writes back between experiments — it's the only variable the autoresearch loop changes between runs.
There are five instruction variants, selected based on task classification:
- SYNTH_INSTRUCTION: technical tasks (keyword match against ~45 technical terms: "api", "deploy", "cuda", "docker", etc.)
- SYNTH_INSTRUCTION_COUNT: "top N" / enumerated tasks (matched by the planner's task_type == "enumerated")
- SYNTH_INSTRUCTION_PROSE: non-technical analysis tasks (data analysis, file-reading tasks where explicit coding keywords are absent)
- SYNTH_INSTRUCTION_RESEARCH: economic/policy research tasks (matched by _is_research_task(), which looks for verbs and phrases like "analyze", "assess", "what drove", "how has"). Enforces citation discipline: every empirical claim must be anchored to retrieved context with inline citations ([FRED:SERIES:DATE], [BEA:...], Beige Book passages). Prevents vague characterizations by requiring verbatim figures and named districts.
- SYNTH_INSTRUCTION_TRADING: trading thesis tasks (matched by _is_trading_task(), which looks for keywords like "alpaca", "long thesis", "trade setup"). Enforces a structured thesis format with [THESIS:...] citation anchors, price anchoring from live [YF:ticker:snapshot] context, and a portfolio summary section.
The prose variant exists because data-analysis tasks — "read autoresearch.tsv and report score trajectories" — shouldn't be prompted to include code examples and API signatures. The research and trading variants enforce source-citation discipline appropriate to their domains. The _is_technical_task(), _is_research_task(), and _is_trading_task() classifiers are checked in priority order: trading → technical → research → prose.
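The priority order described above can be sketched as a chain of checks. The keyword tuples below contain only the examples quoted in this article (the real technical list has roughly 45 terms), and the naive substring matching and the function names are assumptions for illustration. The count variant is left out because its position in the priority order is not stated.

```python
# Example terms quoted in the article; the real lists in agent.py are longer.
TRADING_TERMS = ("alpaca", "long thesis", "trade setup")
TECHNICAL_TERMS = ("api", "deploy", "cuda", "docker")
RESEARCH_TERMS = ("analyze", "assess", "what drove", "how has")


def _matches(task: str, terms) -> bool:
    # Naive case-insensitive substring match (assumption); a real classifier
    # would likely match on word boundaries to avoid hits like "rapid" -> "api".
    text = task.lower()
    return any(term in text for term in terms)


def select_instruction_variant(task: str) -> str:
    """Hypothetical selector: trading -> technical -> research -> prose."""
    if _matches(task, TRADING_TERMS):
        return "SYNTH_INSTRUCTION_TRADING"
    if _matches(task, TECHNICAL_TERMS):
        return "SYNTH_INSTRUCTION"
    if _matches(task, RESEARCH_TERMS):
        return "SYNTH_INSTRUCTION_RESEARCH"
    return "SYNTH_INSTRUCTION_PROSE"
```

Because technical is checked before research, a task such as "Analyze the docker deploy pipeline" resolves to the technical variant even though it also contains a research verb.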
The instruction strings are guarded by # AUTORESEARCH:SYNTH_INSTRUCTION:BEGIN/END sentinels. The autoresearch loop writes new values between the sentinels using regex substitution, so renaming the sentinels or moving them to a different line would break the experiment loop.
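A sentinel-based rewrite of this kind might look like the sketch below. It assumes each sentinel sits on its own comment line with the guarded assignment in between; the function name and the exact pattern are illustrative and are not taken from autoresearch.py.

```python
import re

# Assumed sentinel layout: one comment line each, with the guarded text between.
BEGIN = "# AUTORESEARCH:SYNTH_INSTRUCTION:BEGIN"
END = "# AUTORESEARCH:SYNTH_INSTRUCTION:END"

_BLOCK = re.compile(
    rf"({re.escape(BEGIN)}\n)(.*?)(\n{re.escape(END)})",
    re.DOTALL,  # let .*? span multiple lines between the sentinels
)


def replace_between_sentinels(source: str, new_body: str) -> str:
    """Hypothetical helper: swap the text between the sentinels for new_body."""
    if not _BLOCK.search(source):
        raise ValueError("sentinels not found; were they renamed or moved?")
    # A function replacement avoids backslashes in new_body being
    # interpreted as regex escape sequences.
    return _BLOCK.sub(lambda m: m.group(1) + new_body + m.group(3), source, count=1)
```

The explicit ValueError shows why the sentinels are fragile: if either line is renamed, the pattern stops matching and there is nothing to substitute.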
Thinking model detection
Some Ollama models (Qwen3, QwQ) default to a chain-of-thought "thinking" mode that produces reasoning tokens before the actual response. This consumes the model's num_predict budget before any output appears — on an 8192-token limit, a thinking run can exhaust the budget entirely and produce no response body.
_synth_options() detects thinking models by name and sets think=False by default. An override via HARNESS_PRODUCER_THINK=1 re-enables thinking and doubles num_predict to 16,384 to accommodate the reasoning token overhead.
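import os

# Model-name fragments that identify chain-of-thought ("thinking") models.
_THINKING_MODELS = {"qwen3", "qwq"}

def _synth_options(producer_model: str) -> dict:
    # _is_thinking_model() is defined elsewhere in agent.py (not shown here).
    opts = {"temperature": 0.1, "num_predict": 8192}
    think_override = os.environ.get("HARNESS_PRODUCER_THINK", "")
    if think_override == "1":
        # Explicit opt-in: enable thinking and double the token budget.
        opts["think"] = True
        opts["num_predict"] = 16384
    elif _is_thinking_model(producer_model):
        # Thinking-capable model with no override: turn thinking off.
        opts["think"] = False
    return opts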
Keep-alive estimation
Ollama unloads a model from GPU memory after a period of inactivity. If the keep-alive is too short, the model reloads mid-pipeline — adding 30–90 seconds of latency between synthesis and evaluation. If it's too long, GPU memory stays committed to a model that won't be used again for hours.
_estimate_keep_alive() computes a per-run value from the last 100 entries in runs.jsonl, using the runs whose task type matches the current one.
If fewer than 5 matching runs exist, it falls back to all task types. If history is entirely absent, a skill-aware heuristic applies: 90 seconds for short skills (github, email), indefinite (-1) for lit-review, 300–450 seconds for standard research, more for /deep runs. The OLLAMA_KEEP_ALIVE env var overrides everything — set it to -1 to keep the model loaded permanently, which is useful during autoresearch experiments where back-to-back eval runs would otherwise trigger repeated cold reloads.
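The lookup and fallback logic can be sketched as follows. The record field names (`task_type`, `duration`), the single numeric fallback standing in for the skill-aware heuristic, and above all the rule of 1.5 times the median duration are assumptions made for illustration. The article does not state how the estimate is derived from the matching runs.

```python
import json
import os
from pathlib import Path
from statistics import median


def estimate_keep_alive(runs_path, task_type, env=None, fallback=300):
    """Hypothetical sketch of a history-based keep-alive estimate (seconds)."""
    env = os.environ if env is None else env
    if "OLLAMA_KEEP_ALIVE" in env:
        # The env var overrides everything; returned as-is (for example "-1").
        return env["OLLAMA_KEEP_ALIVE"]

    path = Path(runs_path)
    if not path.exists():
        return fallback  # no history: stand-in for the skill-aware heuristic

    # Only the last 100 log entries are considered.
    lines = path.read_text().splitlines()[-100:]
    runs = [json.loads(line) for line in lines if line.strip()]

    # Prefer runs of the same task type; with fewer than 5, use all task types.
    matching = [r for r in runs if r.get("task_type") == task_type]
    pool = matching if len(matching) >= 5 else runs

    durations = [r["duration"] for r in pool if "duration" in r]
    if not durations:
        return fallback
    # Assumed rule: median run duration plus a 50% margin.
    return int(median(durations) * 1.5)
```

Passing `env={}` in tests isolates the function from the real environment, and setting OLLAMA_KEEP_ALIVE to -1 short-circuits the history lookup entirely.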
Python code execution tool
The synthesis step has access to a run_python tool — a sandboxed Python executor with a 10-second timeout. When the model requests code execution (for data processing, computation, or analysis), the agent runs the code via subprocess, captures stdout/stderr, and feeds the output back into the next synthesis turn. The tool is available for up to 3 rounds (PYTHON_TOOL_ROUNDS) before the loop ends. Security checks from security.py run on the code before execution to catch obvious injection attempts.
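The execution step might look like the sketch below, which runs the code in a child interpreter with a timeout and captures both output streams. The return shape is an assumption, and the security.py checks are omitted. A bare subprocess is not a sandbox on its own, so this shows only the timeout-and-capture mechanics.

```python
import subprocess
import sys

PYTHON_TOOL_TIMEOUT_S = 10  # the 10-second timeout mentioned above


def run_python(code: str, timeout: float = PYTHON_TOOL_TIMEOUT_S) -> dict:
    """Illustrative executor: run code in a child interpreter and capture output."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,  # collect stdout and stderr
            text=True,            # decode bytes to str
            timeout=timeout,      # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return {"stdout": "", "stderr": f"timed out after {timeout}s", "returncode": None}
    return {"stdout": proc.stdout, "stderr": proc.stderr, "returncode": proc.returncode}
```

The captured stdout and stderr are what would be fed back into the next synthesis turn, with the caller limiting the exchange to PYTHON_TOOL_ROUNDS iterations.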
Skill dispatch
Before the research loop begins, parse_skills() checks the task string for slash commands (/github, /email, /lit-review, etc.). If a standalone skill is detected, execution branches immediately to that skill and skips the research pipeline entirely. auto_activate() activates implicit skills based on task content — an image path in the task auto-activates the vision skill even without an explicit /vision command. For the full skill registry, see The op.py CLI.
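Slash-command detection of this kind can be sketched with a single regular expression. The function name and pattern below are hypothetical and are not the real parse_skills(). The pattern only accepts a command that stands alone as a whitespace-delimited token, so image paths and URLs in the task string are not mistaken for skills.

```python
import re

# A slash command must start a token and end it: "/deep" matches,
# while "/home/user/cat.png" or "https://example.com/github" do not.
_SLASH_COMMAND = re.compile(r"(?<!\S)/([a-z][a-z0-9-]*)(?!\S)")


def find_slash_commands(task: str) -> list:
    """Hypothetical helper: return skill names in order of appearance."""
    return _SLASH_COMMAND.findall(task)
```

A dispatcher could then branch to the first standalone skill found and treat the remaining commands, such as /deep, as modifiers of the research pipeline.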
The producer model is read from HARNESS_PRODUCER_MODEL and defaults to qwen3.6-35b. On startup, agent.py probes the vLLM /models endpoint and falls back to whatever model is currently loaded if the configured model isn't available, which prevents 404 errors when switching model configs without a server restart.
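The fallback decision itself is small once the list of served models is known. In the sketch below the network probe is left out: `available` is assumed to be the list of model names already parsed from the /models response, and both function names are hypothetical.

```python
import os

DEFAULT_PRODUCER_MODEL = "qwen3.6-35b"


def configured_producer_model(env=None) -> str:
    """Read the producer model from the environment, with the documented default."""
    env = os.environ if env is None else env
    return env.get("HARNESS_PRODUCER_MODEL", DEFAULT_PRODUCER_MODEL)


def resolve_producer_model(configured: str, available: list) -> str:
    """Hypothetical helper: keep the configured model if served, else fall back."""
    if configured in available:
        return configured
    if available:
        return available[0]  # whatever the server currently has loaded
    return configured        # nothing reported, so keep the configured name
```

Keeping the probe separate from the decision makes the fallback easy to test without a running server.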