The Subagent Demo Suite: Orchestrating Multi-Task Research Portfolios
Two orchestration scripts show the harness producing a coherent portfolio across five to six related research tasks — sequential throughput via the Flask server queue, or parallel execution via the MCP HTTP server — followed by deterministic HTML rendering of all outputs.
A single research task demonstrates the agent loop; a portfolio of related tasks demonstrates orchestration. The subagent demos are reproducible examples of both: hand them to someone evaluating the harness for the first time, or run them after significant changes to the agent loop to verify end-to-end behavior at scale.
v1 vs v2: what changed
subagent_demo.py (v1)
- 6 tasks covering "The State of Agentic AI in 2026" — public literature synthesis
- 5 research reports + 1 landing page task (LLM-rendered HTML with JS expand/collapse)
- Sequential only: Flask queue at localhost:8765
- Waits for idle by polling /api/runs and /api/queue
- Outputs to subagent-test/
subagent_demo_v2.py (v2)
- 5 tasks grounded in actual repo files — not literature, but self-analysis
- Reads autoresearch.tsv, runs.jsonl, wiki docs, source code
- Sequential or parallel: --parallel fires tasks through the MCP HTTP server
- Landing page rendered deterministically via render_html.py, no LLM involved
- Outputs to subagent-test-v2/
The shift from v1 to v2 reflects a deeper lesson from running the harness: public literature synthesis is easy to fake (vague summaries are hard to detect as shallow), but grounded self-analysis is verifiable. A task that asks the agent to read autoresearch.tsv and report score trajectories can be checked against the actual data.
The v2 task set
Each v2 task specifies exactly which files to read, making outputs auditable:
- autoresearch-analysis: read autoresearch.tsv and analyze the experiment history: what was kept vs discarded, score trajectories, what types of changes improved scores
- eval-score-trends: read runs.jsonl, extract Wiggum scores per task and per dimension, identify consistently weak dimensions and patterns in evaluator feedback
- harness-architecture: read wiki/architecture.md, wiki/pipeline.md, wiki/skills.md, and skills.py, then synthesize an accurate characterization that cites actual function names and design decisions
- new-eval-tasks: read wiki/eval-framework.md and autoresearch.tsv, then propose T_F through T_H that stress-test failure modes the current suite misses
- synthesis-instruction-history: trace the evolution of SYNTH_INSTRUCTION from agent.py, wiki/synthesis-instructions.md, and autoresearch_program.md
Sequential mode: Flask queue
All tasks are enqueued upfront. The server processes them one at a time in FIFO order. The demo polls every 5 seconds and prints nothing between polls — the dashboard at localhost:8765 is the status view. A Ctrl-C stops the watcher but leaves the tasks running in the server. On completion, finish() lists all output files and calls render_html.py to build the landing page.
python subagent_demo_v2.py # sequential (default)
python subagent_demo_v2.py --sequential # explicit
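The wait-for-idle behavior described above can be sketched as a small polling loop. This is an illustrative sketch, not the demo's actual code: `wait_until_idle` and the `is_idle` callable are hypothetical names, and `is_idle` stands in for whatever check the script performs against /api/runs and /api/queue (their response format is not shown here).

```python
import time
from typing import Callable


def wait_until_idle(
    is_idle: Callable[[], bool],
    poll_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Block until is_idle() returns True; return the number of polls made."""
    polls = 0
    try:
        while True:
            polls += 1
            if is_idle():
                return polls
            # Stay silent between polls; the dashboard is the status view.
            sleep(poll_seconds)
    except KeyboardInterrupt:
        # Ctrl-C only stops this watcher; queued tasks keep running server-side.
        print("Watcher stopped; tasks continue in the server.")
        return polls
```

Passing `sleep` as a parameter keeps the loop testable without real delays; in normal use the default `time.sleep` applies.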
Parallel mode: MCP HTTP server
The --parallel flag routes tasks through the MCP streamable-HTTP server instead of the Flask queue. Each task opens its own MCP session, sends an initialize request followed by a notifications/initialized notification, calls tools/call with run_task, and waits for the SSE response. On the client side, a ThreadPoolExecutor sized by --workers submits tasks concurrently; on the server side, MCP_MAX_CONCURRENCY caps how many run at once.
# Start the MCP server with concurrency limit
MCP_MAX_CONCURRENCY=3 python mcp_server.py --http
# Run the 5 tasks, up to 3 at a time
python subagent_demo_v2.py --parallel --workers 3
Parallel mode is GPU-constrained in practice. Running 3 tasks concurrently means 3 simultaneous model inference calls, which saturates a single consumer GPU. The --workers flag lets you tune concurrency to match available hardware — 2 workers on a 24 GB card, 3–4 on a 48 GB card with smaller models.
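The per-task session and the worker pool can be sketched as follows. This is a minimal sketch under stated assumptions, not the script's implementation: `session_messages`, `run_parallel`, and `run_session` are hypothetical names, the protocol version string and client info are placeholders, and the `task` argument name passed to run_task is assumed. The transport (HTTP POST plus SSE read) is left to the injected `run_session` callable.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


def session_messages(task: str) -> list[dict]:
    """Build the JSON-RPC messages for one MCP session, in send order."""
    return [
        {"jsonrpc": "2.0", "id": 1, "method": "initialize",
         "params": {
             "protocolVersion": "2025-03-26",  # assumed protocol revision
             "capabilities": {},
             "clientInfo": {"name": "subagent-demo", "version": "0"},  # placeholder
         }},
        # Notifications carry no "id" and expect no response.
        {"jsonrpc": "2.0", "method": "notifications/initialized"},
        {"jsonrpc": "2.0", "id": 2, "method": "tools/call",
         "params": {"name": "run_task",
                    "arguments": {"task": task}}},  # argument name is an assumption
    ]


def run_parallel(
    tasks: list[str],
    run_session: Callable[[list[dict]], Any],
    workers: int = 3,
) -> list[Any]:
    """Run one session per task on a thread pool; results keep task order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda t: run_session(session_messages(t)), tasks))
```

With five tasks and `workers=3`, at most three sessions are in flight at any moment; the remaining two start as earlier ones finish.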
Deterministic landing page rendering
In v1, the landing page was the sixth task: an LLM call that read the five research reports and generated HTML with inline CSS and a JavaScript expand/collapse interaction. This worked, but the output was unpredictable in structure and occasionally hallucinated synopses that didn't match the actual report content.
In v2, render_html.py reads all .md files in the output directory, extracts the first paragraph of each as a synopsis, and builds a static HTML landing page with a consistent card grid layout. No LLM call, no hallucinated synopses, reproducible structure every time. The trade-off is that the landing page looks the same regardless of what the research tasks found — but accuracy beats aesthetics here.
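A deterministic renderer of this kind can be sketched in a few lines. This is not the contents of render_html.py: `first_paragraph` and `render_landing` are hypothetical names, the markup and class names are placeholders, and skipping heading lines when picking the synopsis is an assumption about what "first paragraph" means.

```python
import html
from pathlib import Path


def first_paragraph(markdown: str) -> str:
    """Return the first non-heading paragraph, collapsed to one line."""
    for block in markdown.split("\n\n"):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        # Assumption: heading lines are not part of the synopsis.
        lines = [ln for ln in lines if not ln.startswith("#")]
        if lines:
            return " ".join(lines)
    return ""


def render_landing(out_dir: Path) -> str:
    """Build a static card grid from every .md file in out_dir."""
    cards = []
    for path in sorted(out_dir.glob("*.md")):  # sorted for stable ordering
        synopsis = first_paragraph(path.read_text(encoding="utf-8"))
        cards.append(
            f'<article class="card"><h2>{html.escape(path.stem)}</h2>'
            f"<p>{html.escape(synopsis)}</p></article>"
        )
    return '<main class="grid">\n' + "\n".join(cards) + "\n</main>"
```

Because the output depends only on the files on disk and their sorted order, running it twice over the same directory yields byte-identical HTML.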
Usage
# Prerequisites: server must be running
python server.py
# Run v1 (public literature synthesis)
python subagent_demo.py
# Run v2 self-analysis, sequential
python subagent_demo_v2.py
# Run v2 self-analysis, parallel via MCP
MCP_MAX_CONCURRENCY=3 python mcp_server.py --http &
python subagent_demo_v2.py --parallel
Both demos check server reachability before enqueueing and print a clear error with the startup command if the server isn't running. For the MCP parallel mode, the check performs a full initialize/initialized handshake to verify that the MCP protocol is responding, not just that the port is open.
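The HTTP-level part of such a pre-flight check could look like the sketch below. It is an assumption-laden illustration: `server_reachable` is a hypothetical helper, and it only confirms that something answers HTTP at the given URL. The MCP check described above goes further by completing the handshake.

```python
import urllib.error
import urllib.request


def server_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET to `url` receives any HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered with a non-2xx status; it is still reachable.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return False
```

A caller might use it as `if not server_reachable("http://localhost:8765/api/queue"): print("Start the server first: python server.py")`, then exit before enqueueing anything.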