Agentic System Design Patterns

Foundation Patterns

Two patterns that define the core evaluation architecture. Everything else depends on these being in place first.

C1

Producer-Evaluator Separation

The Wiggum Loop

Evaluate synthesis output with a model categorically different from the one that produced it.

Problem

A model that evaluates its own output scores it 0.9 points higher on average than an independent evaluator, with the gap largest on groundedness — the dimension where self-generated errors are hardest to detect. Self-evaluation is not a quality check; it is a confidence amplifier.

Solution

Assign evaluation to a separate model that has never seen the synthesis context. The evaluator runs the same dimensional rubric independently; the producer receives only the score and per-dimension feedback, not the evaluator's reasoning trace. A second, inner loop (Wiggum) performs targeted revision on low-scoring dimensions.

Structure

Outer loop (Ralph): producer generates output → evaluator scores it → revision if below threshold
Inner loop (Wiggum): per-dimension feedback → targeted revision prompt → re-score on changed dimensions only
Guard: producer and evaluator must be different model checkpoints; the shim enforces this at call time
Score threshold: configurable composite floor before output is accepted
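
The structure above can be sketched as a single function. This is a minimal illustration under stated assumptions, not the harness's actual code: `run_with_evaluator`, the `produce` / `revise` / `evaluate` callables, and the defaults `dim_floor=7.0` and `max_revisions=2` are all hypothetical names and values chosen for the example. The composite score threshold is omitted for brevity.

```python
from typing import Callable, Dict, Tuple

Scores = Dict[str, float]


def run_with_evaluator(
    task: str,
    produce: Callable[[str], str],
    revise: Callable[[str, Scores], str],
    evaluate: Callable[[str], Scores],
    producer_model: str,
    evaluator_model: str,
    dim_floor: float = 7.0,   # assumed per-dimension floor
    max_revisions: int = 2,   # assumed cap on inner-loop passes
) -> Tuple[str, Scores]:
    # Guard: refuse to run if the evaluator is the producer.
    if producer_model == evaluator_model:
        raise ValueError("producer and evaluator must be different models")

    # Outer loop: produce once, score once.
    output = produce(task)
    scores = evaluate(output)

    # Inner loop: revise only the weak dimensions, then re-score only those.
    for _ in range(max_revisions):
        weak = {d: s for d, s in scores.items() if s < dim_floor}
        if not weak:
            break
        output = revise(output, weak)
        rescored = evaluate(output)
        scores.update({d: rescored[d] for d in weak})
    return output, scores
```

Passing the models as plain names keeps the guard testable without loading anything; a real shim would compare resolved checkpoints instead.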

A2

Model Role Separation

Assign distinct, separately-configured model instances to each pipeline role rather than routing all calls to a single model.

Problem

A single model used for planning, synthesis, evaluation, and instruction proposal accumulates conflicting optimization pressures. A model chosen to be a good producer is rarely the best evaluator; a fast planner is rarely the most accurate synthesizer. Role conflation also makes it impossible to upgrade one role without re-validating all others.

Solution

Define named roles (PLANNER_MODEL, PRODUCER_MODEL, EVALUATOR_MODEL, PROPOSER_MODEL) as independently configurable constants. Each role's model is selected based on the role's requirements — speed for planning, accuracy for evaluation, instruction-following for synthesis — and can be swapped without touching the others.

Structure

PLANNER_MODEL — small/fast; produces subtask graph
PRODUCER_MODEL — primary synthesis capability
EVALUATOR_MODEL — must differ from PRODUCER_MODEL (enforced by shim)
PROPOSER_MODEL — instruction optimization; typically the strongest available model

Substrate Patterns

Infrastructure beneath the pipeline. These patterns govern how models are loaded, called, and kept available without exhausting hardware resources.

A1

Inference Shim

Provide a uniform call interface over heterogeneous inference backends so that backend selection is a runtime configuration decision, not a code change.

Problem

Code that calls a specific backend (Ollama, OpenAI-compatible API, vLLM) is tightly coupled to that backend's request schema, error codes, and streaming protocol. Switching backends or running hybrid local/cloud configurations requires changes throughout the codebase.

Solution

A thin adapter layer normalizes request and response shape across backends. Callers pass a model name and a list of messages; the shim resolves the backend, handles retries and streaming, and returns a normalized response object. Backend is selected by the model name prefix (e.g., atla/ routes to the API, no prefix routes to Ollama).

Structure

call(model, messages, **opts) — unified entry point
Backend resolver — maps model name → OllamaBackend or APIBackend
Response normalizer — {"content": str, "tokens": {...}} regardless of backend
Retry wrapper — handles transient failures with exponential backoff
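
A minimal sketch of the four pieces above, assuming each backend is a plain callable. The registry, `register_backend`, the choice of `ConnectionError` as the transient failure, and the retry defaults are assumptions for illustration; real backends would wrap Ollama or an HTTP API client.

```python
import time
from typing import Callable, Dict, List

Message = Dict[str, str]
Backend = Callable[..., dict]

# Hypothetical registry: model-name prefix -> backend. "" is the default.
_BACKENDS: Dict[str, Backend] = {}


def register_backend(prefix: str, backend: Backend) -> None:
    _BACKENDS[prefix] = backend


def resolve_backend(model: str) -> Backend:
    # "atla/selene" -> prefix "atla/"; a name without "/" uses the default.
    prefix = model.split("/", 1)[0] + "/" if "/" in model else ""
    return _BACKENDS.get(prefix, _BACKENDS[""])


def call(model: str, messages: List[Message], retries: int = 3,
         base_delay: float = 0.5, sleep=time.sleep, **opts) -> dict:
    backend = resolve_backend(model)
    for attempt in range(retries):
        try:
            raw = backend(model, messages, **opts)
            # Normalize: same shape regardless of backend.
            return {"content": str(raw["content"]),
                    "tokens": dict(raw.get("tokens", {}))}
        except ConnectionError:
            if attempt == retries - 1:
                raise
            # Exponential backoff: 0.5s, 1s, 2s, ...
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter exists only so the backoff schedule can be observed in tests without waiting.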

A3

Evaluator Pool

Maintain a pool of warm evaluator instances to eliminate cold-start latency on each evaluation pass.

Problem

A 30B evaluator model takes 8–12 seconds to load from disk. Running the Wiggum loop after every synthesis pass makes evaluation the dominant latency term — not synthesis. Cold-loading the evaluator on each pass makes the inner loop unusable in production.

Solution

Keep one or more evaluator instances loaded and available. The shim routes evaluation calls to the pool; under concurrent load, calls round-robin across warm instances. Pool size is bounded by VRAM budget; the Keep-Alive Budget pattern governs eviction.

Structure

Pool size config — bounded by EVALUATOR_POOL_SIZE
Instance registry — tracks loaded model handles
Round-robin dispatcher — distributes concurrent eval calls
Health check — detects and replaces stalled instances
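
The round-robin dispatcher and instance replacement can be sketched as follows. The class and method names are hypothetical, and the instances are treated as opaque handles; loading, VRAM accounting, and the health probe itself are left out.

```python
from threading import Lock
from typing import Any, Iterable, List


class EvaluatorPool:
    """Round-robin over warm evaluator handles (illustrative sketch)."""

    def __init__(self, instances: Iterable[Any]):
        self._instances: List[Any] = list(instances)
        if not self._instances:
            raise ValueError("pool needs at least one instance")
        self._next = 0
        self._lock = Lock()

    def acquire(self) -> Any:
        # The lock makes the counter safe under concurrent eval calls.
        with self._lock:
            inst = self._instances[self._next % len(self._instances)]
            self._next += 1
            return inst

    def replace(self, stalled: Any, fresh: Any) -> None:
        # Called by a health check after it has loaded a replacement.
        with self._lock:
            self._instances[self._instances.index(stalled)] = fresh
```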

A4

Keep-Alive Budget

Assign each model role a time-to-live in VRAM proportional to its usage frequency, evicting idle instances before they starve active pipelines.

Problem

Running a planner, producer, and evaluator simultaneously can exhaust GPU VRAM on consumer hardware. A model loaded for a planning call and left resident consumes VRAM that is needed for the evaluator during the Wiggum loop, causing OOM errors or forced unloads at the worst moment.

Solution

Each role is assigned a keep-alive TTL based on how often it is called. The planner (called once per task) has a short TTL; the evaluator (called multiple times per Wiggum loop) has a longer one. An LRU monitor evicts instances whose TTL has expired before new loads are attempted.

Structure

Per-role TTL config — PLANNER_KEEPALIVE, EVALUATOR_KEEPALIVE, etc.
VRAM monitor — polls available memory before each load
LRU eviction — unloads least-recently-used instance when budget is exceeded
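
A sketch of the TTL and LRU logic, under a simplifying assumption: the VRAM budget is approximated by a maximum number of resident models (`max_resident`) rather than by polling free memory. The class name and the injected `clock` are hypothetical.

```python
import time
from collections import OrderedDict
from typing import Callable, Dict, List


class KeepAliveBudget:
    def __init__(self, ttl_by_role: Dict[str, float], max_resident: int,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_by_role
        self.max_resident = max_resident
        self.clock = clock
        # role -> last-use time, ordered least-recently-used first
        self._last_used: "OrderedDict[str, float]" = OrderedDict()

    def touch(self, role: str) -> List[str]:
        """Record a call to `role`; return the roles evicted to make room."""
        now = self.clock()
        # 1. Evict every other role whose TTL has expired.
        evicted = [r for r, t in self._last_used.items()
                   if r != role and now - t > self.ttl[r]]
        for r in evicted:
            del self._last_used[r]
        # 2. Mark this role as most recently used.
        self._last_used[role] = now
        self._last_used.move_to_end(role)
        # 3. Still over budget: unload least-recently-used instances.
        while len(self._last_used) > self.max_resident:
            lru, _ = self._last_used.popitem(last=False)
            evicted.append(lru)
        return evicted
```

A caller would unload each returned role from the backend before loading the requested one.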

Context Engineering Patterns

What reaches the model matters as much as model capability. These five patterns govern retrieval scope, memory architecture, and the shape of context before it enters the prompt.

B1

Planner-First

Generate an explicit subtask decomposition before any retrieval begins, so that each search call is scoped to a declared information need.

Problem

An agent handed a complex task without a plan retrieves breadth-first and synthesizes superficially, because it doesn't know what it doesn't know until synthesis fails. Unscoped retrieval wastes search rounds on tangential content and leaves critical gaps unfilled.

Solution

A small, fast planner model produces a structured subtask list (and optionally a dependency graph) before any search call. Each subtask declares its information need explicitly; the retrieval stage is then scoped to that need, and the Novelty Gate operates per-subtask rather than globally.

Structure

Planner prompt — system message specifying decomposition schema
Subtask list — [{id, task, depends_on, search_queries}]
Scope injector — prepends subtask context to each retrieval call
Memory check — queries existing context before issuing new searches

B2

Novelty Gate

Admit search results only if they contribute information not already present in the accumulated context, measured by n-gram overlap against existing content.

Problem

Repeated search rounds on the same topic surface near-duplicate content. Synthesis quality plateaus while token costs grow; the model attends to redundant context rather than integrating new signal. A pipeline without a novelty check will always run the maximum number of search rounds.

Solution

Score each incoming result against the current context accumulation using n-gram overlap or TF-IDF similarity. Admit the result only if its novelty score exceeds a threshold. If the accumulated context already exceeds a quality floor (total chars), skip further retrieval entirely regardless of round count.

Structure

SEARCH_QUALITY_FLOOR — char count threshold for early exit
Novelty scorer — n-gram overlap against merged_context
Admission gate — filters result list before context append
MAX_SEARCH_ROUNDS — hard cap regardless of novelty
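
The scorer and gate can be sketched with word trigrams. The novelty definition used here (fraction of a result's trigrams not already in the context), the 0.5 threshold, and the 8000-character floor are assumptions for illustration; the source does not specify them.

```python
from typing import List, Set, Tuple


def ngrams(text: str, n: int = 3) -> Set[tuple]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novelty(candidate: str, context: str, n: int = 3) -> float:
    """Fraction of the candidate's n-grams that the context lacks."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    return len(cand - ngrams(context, n)) / len(cand)


def admit(results: List[str], merged_context: str,
          threshold: float = 0.5,
          quality_floor: int = 8000) -> Tuple[List[str], str]:
    admitted = []
    for result in results:
        # Early exit: enough context has already accumulated.
        if len(merged_context) >= quality_floor:
            break
        if novelty(result, merged_context) > threshold:
            admitted.append(result)
            # Later results are scored against the updated context.
            merged_context += "\n" + result
    return admitted, merged_context
```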

B3

Dual-Backend Memory Store

Serve structured exact-match queries from a relational store and semantic similarity queries from a vector index, unified behind a single retrieval interface.

Problem

Exact-match lookup (task ID, file path, session ID) is cheap and reliable in a relational store but is unavailable in a pure vector database. Semantic similarity search over concepts and entities is only possible in a vector store. Choosing one backend forces a suboptimal tradeoff on every query.

Solution

DuckDB handles structured queries against the JSONL schema; Chroma handles embedding-based similarity search. A unified Memory interface inspects the query type and routes to the appropriate backend, joining results when necessary.

Structure

DuckDB shard — JSONL → columnar; SQL queries over runs.jsonl schema
Chroma index — embedding vectors for document-level semantic search
Query router — inspects query shape to select backend
Unified Memory.get_context(task) interface

B4

Semantic Chunker

Split documents at semantic boundaries — headings, paragraph breaks, sentence endings — rather than at fixed character offsets.

Problem

Fixed-window chunking breaks mid-sentence and mid-concept with no regard for document structure. The resulting fragments are incoherent as context inputs and produce poor retrieval recall, because the embedding for a broken chunk reflects noise rather than the concept at the boundary.

Solution

Detect heading, paragraph, and sentence boundaries using structural heuristics; chunk at those boundaries. Maintain a configurable overlap (typically one sentence) at chunk edges to preserve cross-boundary context for downstream retrieval.

Structure

Boundary detector — regex + DOM-aware heuristics for headings and paragraphs
Overlap config — CHUNK_OVERLAP in tokens or sentences
Chunk normalizer — strips boilerplate, normalizes whitespace
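
A minimal sketch of boundary-aware chunking for plain text. It assumes Markdown-style headings (lines starting with `#`), blank-line paragraph breaks, and a naive sentence splitter; `max_chars` and the function names are hypothetical, and DOM-aware detection is out of scope here.

```python
import re
from typing import List


def split_sentences(paragraph: str) -> List[str]:
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def chunk(text: str, max_chars: int = 400,
          overlap_sentences: int = 1) -> List[str]:
    chunks: List[str] = []
    current: List[str] = []
    for block in re.split(r"\n\s*\n", text):
        block = " ".join(block.split())  # normalize whitespace
        if not block:
            continue
        is_heading = block.startswith("#")
        units = [block] if is_heading else split_sentences(block)
        for unit in units:
            too_long = len(" ".join(current + [unit])) > max_chars
            if current and (is_heading or too_long):
                chunks.append(" ".join(current))
                # A heading starts a clean chunk; otherwise carry overlap.
                if is_heading or overlap_sentences == 0:
                    current = []
                else:
                    current = current[-overlap_sentences:]
            current.append(unit)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Chunks always end on a sentence or heading boundary, so a chunk can slightly exceed `max_chars` when the carried-over sentence plus the next one are long.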

B5

Vision Bridge

Include browser screenshots as image tokens in the prompt for tasks where visual layout or rendered output carries information that DOM text does not.

Problem

Many relevant documents — dashboards, rendered markdown, interactive web apps — lose critical context when scraped as raw DOM text. Layout, chart data, and visual hierarchy all inform interpretation but are absent from the text extraction.

Solution

A Playwright-based bridge navigates to the URL, waits for the page to render, captures a screenshot at a declared viewport, and injects the image into the prompt as a base64 image token. The agent can reference visual context alongside extracted text.

Structure

Playwright driver — headless Chrome via CDP
Screenshot scheduler — triggered by vision: true task flag
Image token injector — base64 in multimodal message slot
CDP Guard (E4) wraps the driver to restrict command scope

Verification Patterns

Three patterns that close the quality loop: how to measure output quality across dimensions, how to compress context without destroying structure, and how to select among multiple synthesis attempts.

C2

Dimensional Rubric

Score output on multiple independently-weighted dimensions rather than a single quality scalar, so that revision prompts can target the specific dimensions that failed.

Problem

A scalar score of 7/10 carries no information about what failed. "The output needs improvement" is not a revision prompt. An output that scores 7 on completeness and 7 on depth requires a different fix than one that scores 9 on completeness and 5 on depth.

Solution

Define five or six named dimensions with explicit weights. The evaluator scores each dimension independently using a structured prompt; the composite is a weighted sum. The Wiggum inner loop uses the per-dimension scores to generate targeted revision instructions for only the dimensions below their threshold.

Structure

Dimensions: relevance (0.20), completeness (0.20), depth (0.25), grounded (0.15), specificity (0.10), structure (0.10)
Composite: sum(score[d] * weight[d] for d in dims)
Per-dimension thresholds — configurable floor per dimension
Revision selector — identifies lowest-scoring dimensions for targeted prompts
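
As a worked example of the weighted sum and the revision selector, using the weights listed above. The per-dimension floor of 7.0 is an assumed default, and the function names are hypothetical.

```python
from typing import Dict, List

WEIGHTS: Dict[str, float] = {
    "relevance": 0.20, "completeness": 0.20, "depth": 0.25,
    "grounded": 0.15, "specificity": 0.10, "structure": 0.10,
}


def composite(scores: Dict[str, float]) -> float:
    # Weighted sum; the weights above add up to 1.0.
    return sum(scores[d] * WEIGHTS[d] for d in WEIGHTS)


def dims_to_revise(scores: Dict[str, float],
                   floors: Dict[str, float],
                   default_floor: float = 7.0) -> List[str]:
    # Dimensions below their floor, weakest first.
    failing = [d for d in WEIGHTS if scores[d] < floors.get(d, default_floor)]
    return sorted(failing, key=lambda d: scores[d])


scores = {"relevance": 9, "completeness": 9, "depth": 5,
          "grounded": 6, "specificity": 9, "structure": 9}
print(round(composite(scores), 2))   # 7.55
print(dims_to_revise(scores, {}))    # ['depth', 'grounded']
```

The composite of 7.55 looks acceptable as a scalar, while the per-dimension view shows that depth and groundedness are the ones needing revision.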

C3

Surgical Compressor

Reduce context size by scoring and selectively removing low-relevance segments, preserving document structure rather than truncating uniformly.

Problem

Uniform truncation discards content at arbitrary boundaries — often the tail of a document that contains conclusions and caveats, which are exactly the segments that affect groundedness scores. Long context also degrades synthesis quality through attention dilution: every token competes for attention, and dense context dilutes the signal from the most relevant segments.

Solution

Score each segment for relevance to the current synthesis task. Remove segments below the relevance threshold, lowest score first, stopping as soon as the context fits within the target token budget. Preserve the document's heading hierarchy regardless of which segments are removed.

Structure

Segment scorer — relevance of each chunk to task description
Budget target — MAX_CONTEXT_TOKENS config
Structure-preserving removal — keeps headings that introduce retained segments
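
A sketch of the removal loop, assuming relevance scores and token counts have already been computed per segment. The `Segment` type, the 0.5 threshold, and the flat one-level heading model are simplifications introduced for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    heading: str      # heading that introduces this segment
    text: str
    relevance: float  # assumed precomputed against the task description
    tokens: int


def compress(segments: List[Segment], max_tokens: int,
             min_relevance: float = 0.5) -> str:
    keep = set(range(len(segments)))
    total = sum(s.tokens for s in segments)
    # Visit segments from least to most relevant.
    for i in sorted(keep, key=lambda i: segments[i].relevance):
        # Stop once within budget, or once only relevant segments remain.
        if total <= max_tokens or segments[i].relevance >= min_relevance:
            break
        keep.discard(i)
        total -= segments[i].tokens
    # Rebuild in document order, emitting each heading once,
    # and only if at least one of its segments survived.
    out, last_heading = [], None
    for i, seg in enumerate(segments):
        if i in keep:
            if seg.heading != last_heading:
                out.append(seg.heading)
                last_heading = seg.heading
            out.append(seg.text)
    return "\n".join(out)
```

Note that this version never removes a segment at or above the threshold, so the result can still exceed the budget when everything is relevant.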

C4

ReAct Comparator

Generate multiple synthesis traces and select the highest-scoring one, rather than accepting the first attempt.

Problem

Synthesis quality is non-deterministic. A single pass sometimes produces an output one or two points below what a second attempt on identical context would produce. Committing to the first output without comparison surrenders recoverable quality variance.

Solution

Run N synthesis passes, either in parallel (with the Evaluator Pool providing capacity) or sequentially. Score each with the evaluator. Return the highest-scoring trace. N is typically 2–3; beyond that, the marginal quality gain does not justify the latency and token cost.

Structure

SYNTH_PASSES — configurable N, default 1 (comparator activates when N > 1)
Parallel executor — thread pool bounded by Evaluator Pool size
Score comparator — select argmax(scores)
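
The comparator reduces to a parallel map followed by an argmax. In this sketch `synthesize` and `score` are hypothetical callables standing in for the producer and evaluator; `max_workers` plays the role of the Evaluator Pool bound.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple


def best_of_n(synthesize: Callable[[str, int], str],
              score: Callable[[str], float],
              context: str, n: int = 2,
              max_workers: int = 2) -> Tuple[str, float]:
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Run N synthesis passes on identical context.
        traces = list(pool.map(lambda i: synthesize(context, i), range(n)))
        # Score every trace with the evaluator.
        scores = list(pool.map(score, traces))
    # argmax over scores; ties go to the earliest pass.
    best = max(range(n), key=scores.__getitem__)
    return traces[best], scores[best]
```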

Orchestration Patterns

Five patterns for coordinating multiple agents, isolating their state, routing tasks across instances, and extending the pipeline through composable skill handlers.

D1

DAG Orchestrator

Decompose a complex task into a directed acyclic graph and execute independent subtasks concurrently, dispatching each newly-unblocked task as its upstream dependencies resolve.

Problem

Sequential subtask execution serializes latency that is inherently parallelizable. A research task with five independent subtopics of similar size takes roughly 5× as long as any single subtask when run serially. Dependencies exist between some subtasks but not all; running them all serially ignores the available parallelism.

Solution

Kahn's topological sort identifies the initial set of independent subtasks. They are dispatched to a thread pool. As each completes, its result is recorded and any newly-unblocked downstream tasks are immediately dispatched. The cycle check runs before any threads launch — a graph with a cycle fails loudly before any compute is committed.

Structure

Task decomposer — planner output: [{id, task, depends_on}]
Cycle detector — Kahn's algorithm, O(V+E), runs pre-launch
Thread pool — ThreadPoolExecutor(max_workers=SUBTASK_MAX_WORKERS)
Dependency tracker — _ready(subtasks, results) returns newly-unblocked tasks
Result assembler — cross-reference synthesis across subtask outputs
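
A compact sketch of the cycle check and the dispatch loop. It deviates from the signature above in one respect: `_ready` here takes a third `started` argument so that in-flight tasks are not dispatched twice. The `run` callable and `max_workers` default are assumptions; result assembly is left to the caller.

```python
from collections import deque
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def check_acyclic(subtasks):
    """Kahn's algorithm: raise if the graph cannot be fully ordered."""
    indeg = {t["id"]: len(t["depends_on"]) for t in subtasks}
    children = {t["id"]: [] for t in subtasks}
    for t in subtasks:
        for dep in t["depends_on"]:
            children[dep].append(t["id"])
    queue = deque(i for i, d in indeg.items() if d == 0)
    seen = 0
    while queue:
        node = queue.popleft()
        seen += 1
        for child in children[node]:
            indeg[child] -= 1
            if indeg[child] == 0:
                queue.append(child)
    if seen != len(subtasks):
        raise ValueError("dependency graph contains a cycle")


def _ready(subtasks, results, started):
    return [t for t in subtasks if t["id"] not in started
            and all(d in results for d in t["depends_on"])]


def run_dag(subtasks, run, max_workers=4):
    check_acyclic(subtasks)  # fail before any thread launches
    results, started, pending = {}, set(), {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while len(results) < len(subtasks):
            for t in _ready(subtasks, results, started):
                started.add(t["id"])
                # Each task gets a snapshot of the results so far.
                pending[pool.submit(run, t, dict(results))] = t["id"]
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                results[pending.pop(fut)] = fut.result()
    return results
```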

D2

Worktree Context

Give each concurrent subtask an isolated filesystem view using a Git worktree, avoiding race conditions on shared output paths without container overhead.

Problem

Concurrent subtasks writing to the same output directory create race conditions. The straightforward fix — Docker containers — is too heavyweight for short-lived subtasks and adds startup latency that defeats the concurrency benefit.

Solution

Each subtask is assigned its own Git worktree: a lightweight filesystem-level checkout at a dedicated branch. The subtask writes to its worktree path; results are read back by the orchestrator after the thread completes. The worktree is removed on cleanup, leaving no filesystem state.

Structure

Worktree manager — git worktree add / remove lifecycle
Branch-per-task — isolated branch prevents cross-subtask writes
Cleanup hook — worktree removed on subtask completion or failure

D3

MCP Dispatch Router

Route subtasks to remote harness instances by capability match over the Model Context Protocol, enabling cross-machine specialization without tight coupling between orchestrator and workers.

Problem

A local harness has VRAM, CPU, and model availability constraints. Some subtasks benefit from remote instances with different hardware (GPU vs CPU), different model weights, or higher parallelism capacity. Hard-coding remote endpoints creates brittle coupling.

Solution

An MCP server exposes a /dispatch endpoint. The orchestrator announces a subtask's required capabilities; the router selects a registered remote instance by capability match and forwards the task. Results are returned over the same MCP channel.

Structure

MCP server — FastAPI app with /dispatch and /register endpoints
Instance registry — {instance_id: {url, capabilities, last_seen}}
Capability matcher — selects instance by required model, VRAM, task type
Result aggregator — collects async responses; timeout and retry handling

D4

Skill Registry

Extend agent capabilities through a slash-command dispatch system — prefix matching against a registry of named handlers — without modifying the core pipeline.

Problem

Hard-coding specialized behaviors (literature review, browser navigation, email drafting, code execution) into the main agent loop creates a monolith. Adding a new capability requires changing core code, invalidating tests, and redeploying. Different deployments need different capability sets.

Solution

A registry maps slash-command prefixes to callable skill handlers. The entry point checks for a prefix match before routing to the standard pipeline. Skills are standalone modules that conform to a simple callable protocol; adding a skill requires only registering it in the skills dict.

Structure

_SKILLS: dict[str, Callable] — prefix → handler
Prefix matcher — next((k for k in _SKILLS if task.startswith(k)), None)
Skill protocol — handler(task: str, config: ModelConfig) -> str
Fallback — unmatched input routes to standard pipeline
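
A sketch of the registry and dispatch path. Two choices here are assumptions beyond the description above: a plain `dict` stands in for `ModelConfig`, and prefixes are tried longest-first so that `/review-deep` is not shadowed by `/review`. The `skill` decorator and the handlers are hypothetical.

```python
from typing import Callable, Dict, Optional

Handler = Callable[[str, dict], str]
_SKILLS: Dict[str, Handler] = {}


def skill(prefix: str) -> Callable[[Handler], Handler]:
    """Register a handler under a slash-command prefix."""
    def register(fn: Handler) -> Handler:
        _SKILLS[prefix] = fn
        return fn
    return register


def match_prefix(task: str) -> Optional[str]:
    # Longest prefix first, so overlapping commands resolve predictably.
    ordered = sorted(_SKILLS, key=len, reverse=True)
    return next((k for k in ordered if task.startswith(k)), None)


def dispatch(task: str, config: dict, standard_pipeline: Handler) -> str:
    prefix = match_prefix(task)
    if prefix is None:
        return standard_pipeline(task, config)  # fallback
    return _SKILLS[prefix](task[len(prefix):].strip(), config)


@skill("/review")
def review(task: str, config: dict) -> str:
    return "review:" + task


@skill("/review-deep")
def review_deep(task: str, config: dict) -> str:
    return "deep:" + task
```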

D5

Agent Channel

Provide typed inter-agent communication over a shared message bus so that subtask results and status events propagate without filesystem polling.

Problem

Agents spawned as concurrent subtasks need to communicate results back to the orchestrator without polling a shared file. Filesystem-based handoff is subject to race conditions and provides no mechanism for partial progress, status updates, or error signaling during execution.

Solution

A typed message channel with producer/consumer semantics. Agents publish result and status events as they complete pipeline stages; the orchestrator subscribes and advances the dependency graph in response to events rather than polling for file existence.

Structure

Message schema — {agent_id, event_type, payload, timestamp}
Publisher — called by subtask agent at stage completion
Subscriber — orchestrator main loop consumes events, updates dependency state
Backpressure — bounded queue with configurable max depth
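
For agents running as threads in one process, the channel can be sketched on top of `queue.Queue`, whose `maxsize` provides the bounded-queue backpressure. The `Event` and `AgentChannel` names are hypothetical; a cross-process or cross-machine bus would need a different transport.

```python
import queue
import time
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class Event:
    agent_id: str
    event_type: str   # e.g. "status", "result", "error"
    payload: Any
    timestamp: float = field(default_factory=time.time)


class AgentChannel:
    def __init__(self, max_depth: int = 100):
        self._q: "queue.Queue[Event]" = queue.Queue(maxsize=max_depth)

    def publish(self, event: Event, timeout: Optional[float] = None) -> None:
        # Blocks while the queue is full; raises queue.Full on timeout.
        self._q.put(event, timeout=timeout)

    def consume(self, timeout: Optional[float] = None) -> Event:
        # Blocks until an event arrives; raises queue.Empty on timeout.
        return self._q.get(timeout=timeout)
```

A slow orchestrator therefore slows publishers down instead of letting events pile up without bound.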

Security Patterns

Four patterns that constrain the agent's ability to execute dangerous code, traverse the filesystem, act on injected instructions, or abuse browser automation privileges. All implemented with Python stdlib and no external security dependencies.

E1

AST Guard

Block execution of code containing dangerous constructs by parsing the AST before any code runs, so the check inspects program structure rather than source text and resists string-level obfuscation.

Problem

An agent that generates and executes code can be prompted to produce os.system(), subprocess.call(), or __import__ chains. String-matching on source code is evadable through concatenation, encoding, or indirect imports. The check must operate on the parsed program structure, not on the raw source text.

Solution

Parse the submitted code with Python's ast module before execution. Walk the AST with a NodeVisitor that checks for banned call targets and import names. Reject the code if any banned node is found; the check is O(N) in AST size and adds <1ms to execution setup.

Structure

Banned node list — exec, eval, __import__, subprocess, os.system
ast.NodeVisitor subclass — walks Call and Import nodes
Rejection handler — raises SecurityError before exec() is called
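
A minimal visitor along these lines, using only the `ast` module. The split into banned call names and banned modules is an assumption made for the example. This sketch does not track aliases: `from os import system` followed by `system(...)`, or a `getattr` lookup, would pass, so a production guard needs a broader rule set.

```python
import ast

BANNED_CALLS = {"exec", "eval", "__import__", "os.system"}
BANNED_MODULES = {"subprocess"}


class SecurityError(Exception):
    pass


def _dotted(node):
    """Return 'a.b.c' for Name/Attribute chains, else None."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        base = _dotted(node.value)
        return f"{base}.{node.attr}" if base else None
    return None


class Guard(ast.NodeVisitor):
    def visit_Call(self, node):
        name = _dotted(node.func)
        if name in BANNED_CALLS:
            raise SecurityError(f"banned call: {name}")
        self.generic_visit(node)  # keep walking nested calls

    def visit_Import(self, node):
        for alias in node.names:
            if alias.name.split(".")[0] in BANNED_MODULES:
                raise SecurityError(f"banned import: {alias.name}")

    def visit_ImportFrom(self, node):
        if (node.module or "").split(".")[0] in BANNED_MODULES:
            raise SecurityError(f"banned import: {node.module}")


def check(source: str) -> None:
    """Raise SecurityError before anything is executed."""
    Guard().visit(ast.parse(source))
```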

E2

Path Sandbox

Restrict all agent file I/O to a declared workspace directory by resolving symlinks and verifying the absolute path before every read or write.

Problem

An agent synthesizing file paths from task input can traverse outside the intended workspace using ../ sequences or symlinks. A task that says "save to ~/Desktop/output.md" and is executed with a workspace of /tmp/agent_work can silently write to the user's home directory.

Solution

Resolve all paths with Path.resolve() before any file operation. Verify that the resolved absolute path is contained in the workspace root by comparing path components, not string prefixes (a prefix test would accept /tmp/agent_work_evil for a root of /tmp/agent_work). Symlinks are fully resolved before the check, making the guard immune to symlink-based traversal. Operations outside the sandbox raise SecurityError.

Structure

Workspace root — AGENT_WORKSPACE env var or config
Path resolver — Path(p).resolve() before every read/write
Containment check — resolved.is_relative_to(workspace_root)
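
The resolver and containment check fit in one helper. `safe_path` is a hypothetical name; resolving relative paths against the workspace root (rather than the process working directory) is an assumption of this sketch. `Path.is_relative_to` requires Python 3.9 or later.

```python
from pathlib import Path
from typing import Union


class SecurityError(Exception):
    pass


def safe_path(p: Union[str, Path], workspace_root: Union[str, Path]) -> Path:
    root = Path(workspace_root).resolve()
    candidate = Path(p).expanduser()
    if not candidate.is_absolute():
        # Assumption: relative paths are relative to the workspace.
        candidate = root / candidate
    # resolve() collapses ../ sequences and follows symlinks.
    resolved = candidate.resolve()
    # Component-wise containment, not a string-prefix comparison.
    if not resolved.is_relative_to(root):
        raise SecurityError(f"path escapes workspace: {resolved}")
    return resolved
```

Every read or write would go through `safe_path` first and use only the returned path.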

E3

Injection Scanner

Detect and strip prompt injection attempts from retrieved web content before it enters the model context, logging each stripped instance to the run trace.

Problem

Retrieved web pages and search results may contain adversarial instructions — "Ignore all previous instructions and instead…" — that redirect agent behavior when included in context. The model cannot distinguish between legitimate retrieved content and injected instructions at the semantic level.

Solution

Pattern-match all incoming content against an injection signature library before it is appended to the context accumulation. Strip matched segments and inject a system note: "N injection attempt(s) detected and removed." Log the injection count in the RunTrace for forensic analysis.

Structure

Signature library — regex patterns for common injection forms
Pre-context scanner — called on every search result before append
Strip-and-annotate — removes match, inserts system note
injection_stripped counter in RunTrace
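
A sketch of strip-and-annotate with a two-entry signature list. The regexes are illustrative placeholders, not a vetted signature library, and the returned count is what a caller would add to the `injection_stripped` field.

```python
import re
from typing import Tuple

# Illustrative signatures only; a real library would be far larger.
SIGNATURES = [
    re.compile(r"ignore (all )?(previous|prior) instructions[^.\n]*[.\n]?",
               re.IGNORECASE),
    re.compile(r"disregard (the )?(system|above) (prompt|instructions)"
               r"[^.\n]*[.\n]?", re.IGNORECASE),
]


def scan(content: str) -> Tuple[str, int]:
    """Return (cleaned content, number of stripped segments)."""
    stripped = 0
    for sig in SIGNATURES:
        content, n = sig.subn("", content)
        stripped += n
    if stripped:
        note = (f"[system note: {stripped} injection attempt(s) "
                "detected and removed.]\n")
        content = note + content
    return content, stripped
```

Each pattern removes the matched phrase through the end of its sentence, so the surrounding legitimate text is kept.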

E4

CDP Guard

Restrict Chrome DevTools Protocol commands to a declared whitelist, blocking all other CDP domains at the driver layer before they reach the browser.

Problem

An agent with unrestricted CDP access can exfiltrate browsing history, read cookies, inject scripts, and interact with browser state entirely outside its declared task scope. CDP is a broad attack surface: a standard research task never needs Network.getAllCookies, yet an unrestricted agent can call it.

Solution

Intercept CDP commands at the Playwright driver layer. Maintain a whitelist of permitted command domains (for example Page, which covers screenshot capture via Page.captureScreenshot, and Runtime). Commands outside the whitelist are blocked and logged; the agent receives a CDPSecurityError with the blocked command name.

Structure

Permitted domains whitelist — configurable per skill
CDP interceptor — wraps CDPSession.send() in Playwright
Block-and-log handler — CDPSecurityError + injection_stripped counter
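
The interceptor can be sketched as a thin wrapper around any object exposing `send(method, params)`, which is the shape of a Playwright CDP session. The wrapper class and its `blocked` list are hypothetical; wiring it into a specific driver is left out.

```python
from typing import Any, Dict, Iterable, List, Optional


class CDPSecurityError(Exception):
    def __init__(self, method: str):
        super().__init__(f"blocked CDP command: {method}")
        self.method = method


class GuardedCDPSession:
    """Forward only commands whose domain is on the whitelist."""

    def __init__(self, session: Any, allowed_domains: Iterable[str]):
        self._session = session
        self._allowed = frozenset(allowed_domains)
        self.blocked: List[str] = []  # log of rejected command names

    def send(self, method: str, params: Optional[Dict] = None) -> Any:
        # CDP methods are named "Domain.command".
        domain = method.split(".", 1)[0]
        if domain not in self._allowed:
            self.blocked.append(method)
            raise CDPSecurityError(method)
        return self._session.send(method, params or {})
```

Domain-level filtering is coarse: allowing Runtime also allows Runtime.evaluate, so a stricter variant would whitelist individual commands.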

Observability Patterns

Four patterns that make the pipeline visible: a structured execution trace, a queryable append log, a visual profiler, and a live dashboard API. Together they form the instrumentation layer that makes self-improvement possible.

F1

RunTrace

Accumulate a complete structured record of every pipeline stage into a single object during execution, then serialize it atomically to the audit log on completion.

Problem

Debugging multi-stage pipeline failures requires reconstructing what happened from scattered print statements and partially-written files. Post-hoc reconstruction is unreliable; the failure may have corrupted the very state needed to diagnose it. Logs that record only the final state hide intermediate failures.

Solution

A RunTrace dataclass is initialized at pipeline entry and threaded through all stages. Each stage appends its timing, token counts, model calls, tool results, and any errors to the trace. On pipeline exit — whether successful or failed — the trace is serialized atomically to runs.jsonl.

Structure

RunTrace dataclass — fields for every tracked pipeline attribute
Stage hooks — each stage receives and returns the trace
Atomic serializer — single append to runs.jsonl on completion
ID format — YYYYMMDDTHHMMSSZ-<8-char-hex>, globally unique
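
A pared-down sketch of the trace object and ID generator. The field set and the `record` helper are assumptions; the real dataclass tracks many more attributes. The ID follows the format above, with the 8 hex characters drawn from `secrets.token_hex(4)`.

```python
import secrets
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List


def new_run_id() -> str:
    # UTC timestamp plus 4 random bytes rendered as 8 hex characters.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{stamp}-{secrets.token_hex(4)}"


@dataclass
class RunTrace:
    task: str
    run_id: str = field(default_factory=new_run_id)
    stages: List[Dict[str, Any]] = field(default_factory=list)
    errors: List[str] = field(default_factory=list)

    def record(self, name: str, start: float, duration: float,
               **extra: Any) -> None:
        """Append one stage entry; times are in seconds."""
        self.stages.append(
            {"name": name, "start": start, "duration": duration, **extra})


trace = RunTrace(task="summarize paper")
trace.record("plan", start=0.0, duration=1.5, tokens=120)
record = asdict(trace)  # plain dict, ready for json.dumps
```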

F2

JSONL Audit Log

Maintain a tamper-evident, queryable history of all runs as an append-only JSONL file, exploiting line-addressable structure to enable SQL-level queries without a database server.

Problem

A relational database requires schema migrations, backups, connection pooling, and operational overhead inappropriate for a local research harness. Flat text logs are unstructured and cannot be queried. CSV exports lose nested structure. The pipeline needs both simplicity and query power.

Solution

Each run appends one JSON object per line to runs.jsonl. The file is human-readable with jq, queryable as a DataFrame with pandas, and directly ingestible by DuckDB as a columnar source for SQL analytics. The append-only contract makes the log tamper-evident by construction.

Structure

Append-only writer — open(path, "a") + json.dumps(trace) + "\n"
Schema versioning — schema_version field enables forward compatibility
DuckDB adapter — SELECT * FROM read_ndjson_auto('runs.jsonl')
jq patterns — documented in harness-data-model.html

F3

Chrome Trace Exporter

Convert RunTrace stage timestamps to Chrome Trace format so that pipeline timing can be inspected as a flame graph without any additional tooling.

Problem

Raw per-stage durations in a JSONL record are difficult to compare visually. Identifying whether planning, retrieval, synthesis, or evaluation is the latency bottleneck requires mentally summing timestamps across fields. The analysis needs to be visual and immediate, not a spreadsheet exercise.

Solution

Extract start and duration from each stage's timing fields. Serialize as Chrome Trace JSON events (ph: "X" complete events). The output file opens directly in chrome://tracing or Perfetto UI as a flame graph with no additional tooling, plugins, or server required.

Structure

Stage extractor — maps RunTrace timing fields to trace event dicts
Chrome Trace serializer — [{name, ph:"X", ts, dur, pid, tid}]
Output: trace_<run_id>.json — opens in chrome://tracing
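
The conversion is a one-to-one mapping from stage records to complete events. This sketch assumes each stage carries `start` and `duration` in seconds (hypothetical field names); Chrome Trace expects `ts` and `dur` in microseconds.

```python
import json
from typing import Any, Dict, List


def to_chrome_trace(stages: List[Dict[str, Any]],
                    pid: int = 1, tid: int = 1) -> List[Dict[str, Any]]:
    return [
        {
            "name": s["name"],
            "ph": "X",                              # complete event
            "ts": round(s["start"] * 1_000_000),    # seconds -> microseconds
            "dur": round(s["duration"] * 1_000_000),
            "pid": pid,
            "tid": tid,
        }
        for s in stages
    ]


stages = [
    {"name": "plan", "start": 0.0, "duration": 1.5},
    {"name": "synthesize", "start": 1.5, "duration": 4.0},
]
trace_json = json.dumps(to_chrome_trace(stages))
```

Writing `trace_json` to a file yields something the trace viewers named above can load as a JSON array of events.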

F4

Dashboard API Layer

Expose RunTrace data through a local REST API that maintains an in-memory index, decoupling the dashboard from direct JSONL file reads during active pipeline runs.

Problem

A dashboard that reads runs.jsonl directly re-parses the full file on every request. During a long pipeline run, concurrent reads create file contention and increasingly expensive scans as the log grows. A live dashboard needs low-latency reads against a growing dataset.

Solution

A FastAPI server maintains an in-memory index over runs.jsonl, updated by a tail-watcher on each new append. The dashboard queries the API for pre-aggregated metrics; the API responds from the index without touching the file. A WebSocket endpoint streams live updates as new runs complete.

Structure

FastAPI app — /runs, /runs/{id}, /metrics endpoints
JSONL tail-watcher — inotify / polling appends to in-memory index
Metric aggregator — pre-computes score distributions, latency percentiles
WebSocket — pushes new run summaries to connected dashboards

Self-Improvement Patterns

Five patterns that close the outer loop: converting production run data into training signal, optimizing the synthesis instruction autonomously, and detecting when the optimizer has converged or gone off the rails.

G1

Data Flywheel

Convert high-quality production run outputs into DPO training pairs automatically, so that each production task directly improves the model that will handle future tasks.

Problem

A static model processes production data but learns nothing from it. Improving quality requires manual annotation, which is expensive and slow. The harness already produces per-run quality scores and full output text — the raw materials for preference learning are available but unused.

Solution

Runs scoring above a threshold are promoted as DPO positive examples; runs on the same task scoring below a floor become negative examples. The pairing is automatic: same task prompt, different output quality. The DPO dataset grows with every production run; fine-tuning cycles consume it on a schedule.

Structure

Score threshold filter — wiggum_r1 >= FLYWHEEL_POSITIVE_FLOOR
DPO pair generator — matches positive/negative by task_id
Training data formatter — ShareGPT or Alpaca schema output
Fine-tuning trigger — scheduled or threshold-based
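
The pairing step can be sketched as a group-by on `task_id`. The record fields `task` and `output`, the two thresholds, and the prompt/chosen/rejected output shape are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def build_dpo_pairs(runs: List[dict],
                    positive_floor: float = 8.0,
                    negative_ceiling: float = 6.0) -> List[Dict[str, str]]:
    by_task = defaultdict(list)
    for run in runs:
        by_task[run["task_id"]].append(run)

    pairs = []
    for group in by_task.values():
        chosen = [r for r in group if r["wiggum_r1"] >= positive_floor]
        rejected = [r for r in group if r["wiggum_r1"] < negative_ceiling]
        # Same prompt, different quality: every chosen x rejected combination.
        for good in chosen:
            for bad in rejected:
                pairs.append({"prompt": good["task"],
                              "chosen": good["output"],
                              "rejected": bad["output"]})
    return pairs
```

Runs that fall between the two thresholds, and tasks with only one side represented, contribute no pairs.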

G2

RL Rollout

Treat each agent run trajectory as a reinforcement learning episode, using the final verifiable score as the reward signal for policy gradient updates.

Problem

DPO requires paired examples — a preferred and a dispreferred output for the same prompt. For tasks with verifiable correct answers (code execution, structured queries, fact-checkable outputs), RL is more sample-efficient: a scalar reward from a verifier is sufficient without paired data collection.

Solution

The RunTrace trajectory — the sequence of model calls, tool results, and intermediate states — is serialized as an RL episode. The final Wiggum score or a task-specific verifier provides the reward. NeMo RL processes the rollout buffer using GRPO or PPO, updating the producer model policy.

Structure

Trajectory extractor — RunTrace → (state, action, reward) sequence
Verifiable reward function — Wiggum score or task-specific check
NeMo RL integration — GRPO/PPO policy update
Rollout buffer — accumulates episodes until batch threshold

G3

Literature Review Pipeline

Continuously update the harness knowledge base by running the synthesis pipeline against curated arxiv seed queries on a schedule, without human curation at each cycle.

Problem

Keeping the knowledge base current with the research literature requires reading and summarizing new papers — a task that does not scale with manual effort. RAG without structured extraction produces shallow, inconsistent coverage; the harness needs structured findings, not raw abstracts.

Solution

A scheduled pipeline fetches new papers from arxiv seed queries, runs each through the full synthesis pipeline, extracts structured findings using the Wiggum-evaluated synthesis instruction, and indexes the output into the knowledge base. The pipeline uses the harness to improve the harness.

Structure

Seed query list — topic-specific arxiv search strings
Fetch scheduler — configurable cadence (daily / weekly)
Synthesis pipeline — full harness run per paper batch
Knowledge base indexer — structured output → DuckDB + Chroma

G4

Autoresearch Loop

Autonomous Instruction Optimizer

Autonomously improve the synthesis instruction by proposing changes, evaluating them against a held baseline, and accepting only changes that advance the composite score beyond a threshold.

Problem

Synthesis instruction quality directly determines output quality but is difficult to tune by hand: the space of possible instructions is vast, evaluation is noisy, and the connection between instruction wording and dimensional score is opaque. Manual iteration is slow and produces instructions optimized for legibility rather than evaluator reward.

Solution

A proposer LLM reads the current instruction, experiment history, hard-ban list, and evaluator feedback, then outputs a replacement. The replacement is evaluated against the current baseline; it is kept only if it advances the composite score by more than DELTA_THRESHOLD. The loop runs until a convergence condition is detected (see G5).

Structure

Proposer — strongest available model; reads PROPOSE_PROMPT with history injected
Eval suite — composite = 0.7 × wiggum_r1 + 0.3 × criteria_rate × 10
Advance/discard — keep if score > baseline + DELTA_THRESHOLD
Hard-ban list — grows with failed approach families
Experiment TSV — autoresearch.tsv audit trail
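
One iteration of the loop, with the composite formula from the eval suite above. The `state` dict, the `propose` / `evaluate` callables, and the default `delta_threshold=0.1` are hypothetical; hard-ban updates and TSV logging are omitted.

```python
from typing import Callable, List


def composite(wiggum_r1: float, criteria_rate: float) -> float:
    # criteria_rate is a 0..1 fraction, scaled to the 0..10 score range.
    return 0.7 * wiggum_r1 + 0.3 * criteria_rate * 10


def step(state: dict,
         propose: Callable[[str, List[tuple], List[str]], str],
         evaluate: Callable[[str], float],
         delta_threshold: float = 0.1) -> bool:
    """Run one propose/evaluate/decide cycle; return True on advance."""
    candidate = propose(state["instruction"], state["history"],
                        state["banned"])
    score = evaluate(candidate)
    kept = score > state["baseline"] + delta_threshold
    state["history"].append(
        (candidate, score, "advance" if kept else "discard"))
    if kept:
        # The candidate becomes the new baseline to beat.
        state["instruction"], state["baseline"] = candidate, score
    return kept
```

For example, a run with `wiggum_r1 = 8.0` and `criteria_rate = 0.9` has a composite of 0.7 × 8.0 + 0.3 × 0.9 × 10 = 8.3.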

G5

Convergence Detector

Monitor the autoresearch loop for attractor lock, approach-family collapse, baseline contamination, and stagnation without advances: four distinct signals that indicate the optimizer has stopped making genuine progress.

Problem

An autonomous optimizer with only score-based plateau detection will run indefinitely in a converged state: the proposer recycles the same approach family under slightly different names, the baseline may have been established from a single lucky evaluation run, and the hard-ban list can grow to prohibit the only instruction family that ever worked.

Solution

Four complementary detectors operate at different points in the loop. The semantic attractor guard (pre-eval) rejects proposals too similar to recent discards. The family entropy monitor detects when one approach family dominates the sliding window. The baseline re-estimation trigger periodically re-runs the baseline with eval-n=3 to detect lucky-sample inflation. The global convergence exit halts the loop after a configurable number of experiments with no advance.

Structure

Semantic attractor guard — TF-IDF cosine similarity against recent discard descriptions; threshold ~0.65
Family entropy monitor — Shannon entropy of last-10 family labels; fires when entropy < 1.0
Baseline re-estimation — re-run baseline with eval-n=3 after K consecutive discards
Global convergence exit — halt + structured report after N_max experiments with zero advances since M_last_advance
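
The family entropy monitor is the easiest of the four to show in isolation. This sketch assumes entropy is measured in bits (log base 2), which the threshold of 1.0 suggests but the text does not state; the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def family_entropy(labels: List[str]) -> float:
    """Shannon entropy, in bits, of the family label distribution."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_alarm(history: List[str], window: int = 10,
                  floor: float = 1.0) -> bool:
    # Only fire once a full window of experiments is available.
    recent = history[-window:]
    return len(recent) == window and family_entropy(recent) < floor
```

With this definition, a window split evenly between two families sits exactly at 1.0 bit and does not fire, while a window where one family takes nine of ten slots (about 0.47 bits) does.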
