May 27, 2026 • 6 min read • Agentic Harness Engineering

The Subagent Demo Suite: Orchestrating Multi-Task Research Portfolios

Two orchestration scripts show the harness producing a coherent portfolio across five or six related research tasks, either sequentially through the Flask server queue or in parallel through the MCP HTTP server. In v2, the run ends with deterministic HTML rendering of all outputs.

A single research task demonstrates the agent loop; a portfolio of related tasks demonstrates orchestration. The subagent demos are reproducible examples of both: hand them to someone evaluating the harness for the first time, or run them after significant changes to the agent loop to verify end-to-end behavior at scale.

v1 vs v2: what changed

subagent_demo.py (v1)

6 tasks covering "The State of Agentic AI in 2026" — public literature synthesis
5 research reports + 1 landing page task (LLM-rendered HTML with JS expand/collapse)
Sequential only — Flask queue at localhost:8765
Waits for idle by polling /api/runs and /api/queue
Outputs to subagent-test/

subagent_demo_v2.py (v2)

5 tasks grounded in actual repo files — not literature, but self-analysis
Reads autoresearch.tsv, runs.jsonl, wiki docs, source code
Sequential or parallel — --parallel fires tasks through MCP HTTP server
Landing page rendered deterministically via render_html.py, no LLM involved
Outputs to subagent-test-v2/

The shift from v1 to v2 reflects a deeper lesson from running the harness: a public literature synthesis can look plausible while being shallow, because vague summaries are hard to check against anything. Grounded self-analysis is verifiable. A task that asks the agent to read autoresearch.tsv and report score trajectories can be checked against the actual data.

The v2 task set

Each v2 task specifies exactly which files to read, making outputs auditable:

autoresearch-analysis — read autoresearch.tsv, analyze the experiment history: what was kept vs discarded, score trajectories, what types of changes improved scores
eval-score-trends — read runs.jsonl, extract Wiggum scores per task and per dimension, identify consistently weak dimensions and patterns in evaluator feedback
harness-architecture — read wiki/architecture.md, wiki/pipeline.md, wiki/skills.md, and skills.py, synthesize an accurate characterization that cites actual function names and design decisions
new-eval-tasks — read wiki/eval-framework.md and autoresearch.tsv, propose T_F through T_H that stress-test failure modes the current suite misses
synthesis-instruction-history — trace the evolution of SYNTH_INSTRUCTION from agent.py, wiki/synthesis-instructions.md, and autoresearch_program.md
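To make the "auditable" point concrete, here is a minimal sketch of how the v2 task set could be held as data. The task names and file lists come from the list above; the DemoTask class, the prompt wording in build_prompt, and the cited_files helper are hypothetical and are not taken from subagent_demo_v2.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DemoTask:
    """Hypothetical container: one research task and the files it must read."""
    name: str
    files: tuple
    goal: str


# Names and file lists mirror the v2 task set; goals are abbreviated.
V2_TASKS = (
    DemoTask("autoresearch-analysis", ("autoresearch.tsv",),
             "analyze the experiment history"),
    DemoTask("eval-score-trends", ("runs.jsonl",),
             "extract Wiggum scores per task and per dimension"),
    DemoTask("harness-architecture",
             ("wiki/architecture.md", "wiki/pipeline.md",
              "wiki/skills.md", "skills.py"),
             "characterize the harness, citing actual function names"),
    DemoTask("new-eval-tasks", ("wiki/eval-framework.md", "autoresearch.tsv"),
             "propose T_F through T_H"),
    DemoTask("synthesis-instruction-history",
             ("agent.py", "wiki/synthesis-instructions.md",
              "autoresearch_program.md"),
             "trace the evolution of SYNTH_INSTRUCTION"),
)


def build_prompt(task):
    """Assumed prompt shape: name the files first, then state the goal."""
    return f"Read {', '.join(task.files)}. Then {task.goal}."


def cited_files(task, report_text):
    """Crude audit: which of the required files does the report mention?"""
    return [f for f in task.files if f in report_text]
```

Because each task carries its own file list, a reviewer can compare a report against exactly the inputs it was supposed to use.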

Sequential mode: Flask queue

POST /api/queue × N tasks → poll /api/runs + /api/queue → finish() when idle

All tasks are enqueued upfront, and the server processes them one at a time in FIFO order. The demo polls every 5 seconds and prints nothing between polls, because the dashboard at localhost:8765 is the status view. Pressing Ctrl-C stops the watcher but leaves the tasks running on the server. On completion, finish() lists all output files and calls render_html.py to build the landing page.

python subagent_demo_v2.py              # sequential (default)
python subagent_demo_v2.py --sequential # explicit
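The enqueue-then-poll flow can be sketched roughly as follows. This is an illustration under stated assumptions, not the demo's actual code: the request payload shape ({"task": ...}), the response shapes of /api/runs and /api/queue, and the "running" status value are all guesses, and http_json, is_idle, and run_sequential are hypothetical names.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8765"  # Flask server address from the post


def http_json(method, path, payload=None):
    """Send a JSON request to the server and decode the JSON reply."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(
        BASE_URL + path, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode() or "null")


def is_idle(runs, queue):
    # ASSUMPTION: /api/runs returns a list of dicts with a "status" field
    # and /api/queue returns a list of pending items.
    running = [r for r in runs if r.get("status") == "running"]
    return not running and not queue


def run_sequential(tasks, fetch=http_json, poll_seconds=5.0, sleep=time.sleep):
    """Enqueue every task upfront, then poll until the server is idle."""
    for task in tasks:
        fetch("POST", "/api/queue", {"task": task})  # payload shape assumed
    polls = 0
    try:
        while True:
            polls += 1
            if is_idle(fetch("GET", "/api/runs"), fetch("GET", "/api/queue")):
                return polls
            sleep(poll_seconds)
    except KeyboardInterrupt:
        # Only the watcher stops; enqueued tasks keep running on the server.
        return polls
```

Passing fetch and sleep as parameters keeps the loop testable without a live server.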

Parallel mode: MCP HTTP server

The --parallel flag routes tasks through the MCP streamable-HTTP server instead of the Flask queue. Each task opens its own MCP session, sends an initialize request followed by a notifications/initialized notification, calls tools/call with run_task, and waits for the SSE response. On the client side, a ThreadPoolExecutor sized by --workers submits tasks concurrently; on the server side, MCP_MAX_CONCURRENCY caps how many run at once.

# Start the MCP server with concurrency limit
MCP_MAX_CONCURRENCY=3 python mcp_server.py --http

# Run the 5 tasks in parallel, up to 3 at a time
python subagent_demo_v2.py --parallel --workers 3
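For readers unfamiliar with the MCP handshake, the three JSON-RPC messages each task sends look roughly like this. The method names follow the MCP specification; the client name, the protocol version string chosen here, and the single "task" argument to run_task are assumptions, and the SSE parser is deliberately simplified (it expects one single-line data field per event).

```python
import itertools
import json

_ids = itertools.count(1)  # JSON-RPC request ids, unique within this process


def initialize_request(client_name="subagent-demo"):
    """First message of a session: negotiate version and capabilities."""
    return {
        "jsonrpc": "2.0", "id": next(_ids), "method": "initialize",
        "params": {
            "protocolVersion": "2025-03-26",  # assumed; use what the server supports
            "capabilities": {},
            "clientInfo": {"name": client_name, "version": "0.0.0"},
        },
    }


def initialized_notification():
    """Notifications carry no id, so the server sends no reply to this one."""
    return {"jsonrpc": "2.0", "method": "notifications/initialized"}


def run_task_call(task_text):
    # ASSUMPTION: the run_task tool accepts a single "task" argument.
    return {
        "jsonrpc": "2.0", "id": next(_ids), "method": "tools/call",
        "params": {"name": "run_task", "arguments": {"task": task_text}},
    }


def parse_sse_result(body):
    """Return the last JSON message found in the 'data:' lines of an SSE body."""
    message = None
    for line in body.splitlines():
        if line.startswith("data:"):
            message = json.loads(line[len("data:"):].strip())
    return message
```

Over streamable HTTP, each message is POSTed to the server's MCP endpoint, and the session id returned with the initialize response is echoed back on later requests.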

Parallel mode is GPU-constrained in practice. Running 3 tasks concurrently means 3 simultaneous model inference calls, which saturates a single consumer GPU. The --workers flag lets you tune concurrency to match available hardware — 2 workers on a 24 GB card, 3–4 on a 48 GB card with smaller models.
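The client-side fan-out is a standard ThreadPoolExecutor pattern. A minimal sketch, assuming a hypothetical run_one callable that performs one full MCP session and returns the task's result:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_parallel(tasks, run_one, workers=3):
    """Run run_one(task) for every task with at most `workers` in flight.

    Returns {task: ("ok", result)} or {task: ("error", message)} so that one
    failed task does not abort the rest of the portfolio.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(run_one, task): task for task in tasks}
        for future in as_completed(futures):
            task = futures[future]
            try:
                results[task] = ("ok", future.result())
            except Exception as exc:  # record the failure and keep going
                results[task] = ("error", str(exc))
    return results
```

With five tasks and workers=2, the pool starts two sessions immediately and begins the next one as soon as a slot frees up, which is the knob to turn when the GPU is the bottleneck.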

Deterministic landing page rendering

In v1, the landing page was the sixth task: an LLM call that read the five research reports and generated HTML with inline CSS and a JavaScript expand/collapse interaction. This worked, but the output was unpredictable in structure and occasionally hallucinated synopses that didn't match the actual report content.

In v2, render_html.py reads all .md files in the output directory, extracts the first paragraph of each as a synopsis, and builds a static HTML landing page with a consistent card grid layout. No LLM call, no hallucinated synopses, reproducible structure every time. The trade-off is that the landing page looks the same regardless of what the research tasks found — but accuracy beats aesthetics here.
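A deterministic renderer of this kind fits in a few lines. The sketch below is not render_html.py itself: the function names, the HTML skeleton, and the choice to skip heading lines when looking for the first paragraph are assumptions made for illustration.

```python
import html
from pathlib import Path


def first_paragraph(markdown_text):
    """Return the first paragraph that is not a heading, joined onto one line."""
    for block in markdown_text.split("\n\n"):
        lines = [ln.strip() for ln in block.strip().splitlines() if ln.strip()]
        lines = [ln for ln in lines if not ln.startswith("#")]  # assumed rule
        if lines:
            return " ".join(lines)
    return ""


def render_landing_page(output_dir):
    """Build one card per .md report; the same inputs always give the same HTML."""
    cards = []
    for path in sorted(Path(output_dir).glob("*.md")):  # sorted for stable order
        synopsis = first_paragraph(path.read_text(encoding="utf-8"))
        cards.append(
            f'<article class="card"><h2>{html.escape(path.stem)}</h2>'
            f"<p>{html.escape(synopsis)}</p></article>"
        )
    return (
        "<!doctype html><html><body>"
        '<main class="grid">' + "".join(cards) + "</main></body></html>"
    )
```

Since every synopsis is copied from the report rather than generated, it cannot drift from what the report actually says.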

Usage

# Prerequisites: server must be running
python server.py

# Run v1 (public literature synthesis)
python subagent_demo.py

# Run v2 self-analysis, sequential
python subagent_demo_v2.py

# Run v2 self-analysis, parallel via MCP
MCP_MAX_CONCURRENCY=3 python mcp_server.py --http &
python subagent_demo_v2.py --parallel

Both demos check server reachability before enqueueing and print a clear error with the startup command if the server isn't running. For the MCP parallel mode, the server check opens a full initialize/initialized handshake to verify the MCP protocol is responding, not just that the port is open.
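The HTTP half of that check can be approximated as below. This is a sketch with hypothetical function names, not the demos' own code; the MCP variant would go further and complete the initialize/initialized handshake rather than stopping at a plain HTTP response.

```python
import urllib.error
import urllib.request


def server_reachable(url, timeout=3.0, opener=urllib.request.urlopen):
    """True if anything answers HTTP at url, even with an error status."""
    try:
        with opener(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # the server responded, just not with a 2xx status
    except (urllib.error.URLError, OSError):
        return False  # refused, timed out, or unresolvable


def require_server(url, start_command, **kwargs):
    """Print the startup command and return False when the server is down."""
    if server_reachable(url, **kwargs):
        return True
    print(f"Cannot reach {url}. Start it with: {start_command}")
    return False
```

For the Flask queue this would be called as require_server("http://localhost:8765", "python server.py") before any task is enqueued.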
