May 25, 2026 • Agentic Harness Engineering Series • Pattern Catalog

Agentic System Design Patterns

A catalog of 27 named patterns for building reliable LLM-powered agents. Each pattern names a recurring design problem, explains why naive solutions fail, and specifies a reusable solution structure. Patterns are organized by the system concern they address; cross-references show how they compose.

27 named patterns · 8 categories · 11 series posts · 1,500+ production runs
Foundation

Foundation Patterns

Two patterns that define the core evaluation architecture. Everything else depends on these being in place first.

C1

Producer-Evaluator Separation

The Wiggum Loop

Evaluate synthesis output with a model categorically different from the one that produced it.

Problem

A model that evaluates its own output scores it 0.9 points higher on average than an independent evaluator, with the gap largest on groundedness — the dimension where self-generated errors are hardest to detect. Self-evaluation is not a quality check; it is a confidence amplifier.

Solution

Assign evaluation to a separate model that has never seen the synthesis context. The evaluator applies the dimensional rubric (C2) independently; the producer receives only the score and per-dimension feedback, not the evaluator's reasoning trace. A second, inner loop (Wiggum) performs targeted revision on low-scoring dimensions.

Structure
  • Outer loop (Ralph): producer generates output → evaluator scores it → revision if below threshold
  • Inner loop (Wiggum): per-dimension feedback → targeted revision prompt → re-score on changed dimensions only
  • Guard: producer and evaluator must be different model checkpoints; the shim enforces this at call time
  • Score threshold: configurable composite floor before output is accepted
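
The outer loop and its guard can be sketched as follows. This is a minimal illustration, not the harness's code: `ralph_loop`, `SameModelError`, and the `produce` / `evaluate` callables are hypothetical names, and the default threshold and round count are placeholders. Per-dimension re-scoring is left out for brevity.

```python
from typing import Callable, Optional, Tuple


class SameModelError(ValueError):
    """Producer and evaluator resolved to the same checkpoint."""


def ralph_loop(
    task: str,
    producer_model: str,
    evaluator_model: str,
    produce: Callable[[str, Optional[dict]], str],
    evaluate: Callable[[str], Tuple[float, dict]],
    threshold: float = 8.0,   # assumed composite floor
    max_rounds: int = 3,      # assumed revision cap
) -> Tuple[str, float]:
    # Guard: refuse to run when both roles point at the same checkpoint.
    if producer_model == evaluator_model:
        raise SameModelError("producer and evaluator must be different models")

    feedback: Optional[dict] = None
    best_output, best_score = "", float("-inf")
    for _ in range(max_rounds):
        # The producer sees only the previous per-dimension feedback.
        output = produce(task, feedback)
        score, feedback = evaluate(output)
        if score > best_score:
            best_output, best_score = output, score
        if score >= threshold:
            break
    return best_output, best_score
```

Passing the model calls in as plain callables keeps the loop independent of any particular backend.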
A2

Model Role Separation

Assign distinct, separately-configured model instances to each pipeline role rather than routing all calls to a single model.

Problem

A single model used for planning, synthesis, evaluation, and instruction proposal accumulates conflicting optimization pressures. A model chosen to be a good producer is rarely the best evaluator; a fast planner is rarely the most accurate synthesizer. Role conflation also makes it impossible to upgrade one role without re-validating all others.

Solution

Define named roles (PLANNER_MODEL, PRODUCER_MODEL, EVALUATOR_MODEL, PROPOSER_MODEL) as independently configurable constants. Each role's model is selected based on the role's requirements — speed for planning, accuracy for evaluation, instruction-following for synthesis — and can be swapped without touching the others.

Structure
  • PLANNER_MODEL — small/fast; produces subtask graph
  • PRODUCER_MODEL — primary synthesis capability
  • EVALUATOR_MODEL — must differ from PRODUCER_MODEL (enforced by shim)
  • PROPOSER_MODEL — instruction optimization; typically the strongest available model
Substrate

Substrate Patterns

Infrastructure beneath the pipeline. These patterns govern how models are loaded, called, and kept available without exhausting hardware resources.

A1

Inference Shim

Provide a uniform call interface over heterogeneous inference backends so that backend selection is a runtime configuration decision, not a code change.

Problem

Code that calls a specific backend (Ollama, OpenAI-compatible API, vLLM) is tightly coupled to that backend's request schema, error codes, and streaming protocol. Switching backends or running hybrid local/cloud configurations requires changes throughout the codebase.

Solution

A thin adapter layer normalizes request and response shape across backends. Callers pass a model name and a list of messages; the shim resolves the backend, handles retries and streaming, and returns a normalized response object. Backend is selected by the model name prefix (e.g., atla/ routes to the API, no prefix routes to Ollama).

Structure
  • call(model, messages, **opts) — unified entry point
  • Backend resolver — maps model name → OllamaBackend or APIBackend
  • Response normalizer — {"content": str, "tokens": {...}} regardless of backend
  • Retry wrapper — handles transient failures with exponential backoff
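
A rough sketch of the shim's three responsibilities: prefix-based backend resolution, response normalization, and retry with exponential backoff. The `backends` mapping, the `"default"` key, and the choice of `ConnectionError` as the transient failure type are assumptions made for illustration.

```python
import time
from typing import Callable, Dict, List

Backend = Callable[[str, List[dict]], dict]


def resolve_backend(model: str, backends: Dict[str, Backend]) -> Backend:
    # "atla/some-model" routes to backends["atla"]; names without a
    # registered prefix fall through to the default (local) backend.
    prefix, sep, _ = model.partition("/")
    if sep and prefix in backends:
        return backends[prefix]
    return backends["default"]


def call(model, messages, backends, retries=3, base_delay=0.5, sleep=time.sleep):
    if retries < 1:
        raise ValueError("retries must be >= 1")
    backend = resolve_backend(model, backends)
    for attempt in range(retries):
        try:
            raw = backend(model, messages)
            # Normalize to the same shape regardless of backend.
            return {
                "content": str(raw["content"]),
                "tokens": dict(raw.get("tokens", {})),
            }
        except ConnectionError:
            if attempt == retries - 1:
                raise
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable only so the backoff schedule can be checked without real delays.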
A3

Evaluator Pool

Maintain a pool of warm evaluator instances to eliminate cold-start latency on each evaluation pass.

Problem

A 30B evaluator model takes 8–12 seconds to load from disk. Running the Wiggum loop after every synthesis pass makes evaluation the dominant latency term — not synthesis. Cold-loading the evaluator on each pass makes the inner loop unusable in production.

Solution

Keep one or more evaluator instances loaded and available. The shim routes evaluation calls to the pool; under concurrent load, calls round-robin across warm instances. Pool size is bounded by VRAM budget; the Keep-Alive Budget pattern governs eviction.

Structure
  • Pool size config — bounded by EVALUATOR_POOL_SIZE
  • Instance registry — tracks loaded model handles
  • Round-robin dispatcher — distributes concurrent eval calls
  • Health check — detects and replaces stalled instances
A4

Keep-Alive Budget

Assign each model role a time-to-live in VRAM proportional to its usage frequency, evicting idle instances before they starve active pipelines.

Problem

Running a planner, producer, and evaluator simultaneously can exhaust GPU VRAM on consumer hardware. A model loaded for a planning call and left resident consumes VRAM that is needed for the evaluator during the Wiggum loop, causing OOM errors or forced unloads at the worst moment.

Solution

Each role is assigned a keep-alive TTL based on how often it is called. The planner (called once per task) has a short TTL; the evaluator (called multiple times per Wiggum loop) has a longer one. An LRU monitor evicts instances whose TTL has expired before new loads are attempted.

Structure
  • Per-role TTL config — PLANNER_KEEPALIVE, EVALUATOR_KEEPALIVE, etc.
  • VRAM monitor — polls available memory before each load
  • LRU eviction — unloads least-recently-used instance when budget is exceeded
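
One way to combine per-role TTLs with LRU eviction is sketched below. It is a simplification: the budget is expressed as a maximum number of resident models rather than measured VRAM, and `KeepAliveBudget` with its injectable `clock` is a hypothetical helper, not part of the harness.

```python
import time
from collections import OrderedDict


class KeepAliveBudget:
    def __init__(self, ttl_by_role, max_resident, clock=time.monotonic):
        self.ttl = dict(ttl_by_role)      # role -> seconds to stay resident
        self.max_resident = max_resident  # stand-in for a VRAM budget
        self.clock = clock
        self.resident = OrderedDict()     # role -> last use, oldest first

    def touch(self, role):
        """Record a call for `role`; return the roles evicted to make room."""
        now = self.clock()
        # 1. Drop instances whose TTL has expired.
        expired = [
            r for r, last in self.resident.items()
            if r != role and now - last > self.ttl[r]
        ]
        for r in expired:
            del self.resident[r]
        # 2. Mark this role as most recently used.
        self.resident[role] = now
        self.resident.move_to_end(role)
        # 3. If still over budget, evict least-recently-used instances.
        evicted = list(expired)
        while len(self.resident) > self.max_resident:
            victim, _ = self.resident.popitem(last=False)
            evicted.append(victim)
        return evicted
```

In a real deployment the returned roles would be handed to whatever unloads a model from the inference backend.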
Context

Context Engineering Patterns

What reaches the model matters as much as model capability. These five patterns govern retrieval scope, memory architecture, and the shape of context before it enters the prompt.

B1

Planner-First

Generate an explicit subtask decomposition before any retrieval begins, so that each search call is scoped to a declared information need.

Problem

An agent given a complex task directly retrieves breadth-first and synthesizes superficially — it doesn't know what it doesn't know until synthesis fails. Unscoped retrieval wastes search rounds on tangential content and leaves critical gaps unfilled.

Solution

A small, fast planner model produces a structured subtask list (and optionally a dependency graph) before any search call. Each subtask declares its information need explicitly; the retrieval stage is then scoped to that need, and the Novelty Gate operates per-subtask rather than globally.

Structure
  • Planner prompt — system message specifying decomposition schema
  • Subtask list — [{id, task, depends_on, search_queries}]
  • Scope injector — prepends subtask context to each retrieval call
  • Memory check — queries existing context before issuing new searches
B2

Novelty Gate

Admit search results only if they contribute information not already present in the accumulated context, measured by n-gram overlap against existing content.

Problem

Repeated search rounds on the same topic surface near-duplicate content. Synthesis quality plateaus while token costs grow; the model attends to redundant context rather than integrating new signal. A pipeline without a novelty check will always run the maximum number of search rounds.

Solution

Score each incoming result against the current context accumulation using n-gram overlap or TF-IDF similarity. Admit the result only if its novelty score exceeds a threshold. If the accumulated context already exceeds a quality floor (total chars), skip further retrieval entirely regardless of round count.

Structure
  • SEARCH_QUALITY_FLOOR — char count threshold for early exit
  • Novelty scorer — n-gram overlap against merged_context
  • Admission gate — filters result list before context append
  • MAX_SEARCH_ROUNDS — hard cap regardless of novelty
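
A minimal sketch of the gate using word trigram overlap. The novelty definition (share of a result's trigrams not already in the context), the threshold, and the floor value are illustrative assumptions; the function names are hypothetical.

```python
def ngrams(text, n=3):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def novelty(candidate, context, n=3):
    """Fraction of the candidate's n-grams that the context does not contain."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    return 1.0 - len(cand & ngrams(context, n)) / len(cand)


def admit(results, context, threshold=0.5, quality_floor=20_000):
    """Return (admitted results, updated context)."""
    admitted = []
    for result in results:
        # Early exit: enough context has already been accumulated.
        if len(context) >= quality_floor:
            break
        if novelty(result, context) > threshold:
            admitted.append(result)
            context = context + "\n" + result
    return admitted, context
```

Because the context grows as results are admitted, a second copy of the same result in one batch is rejected.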
B3

Dual-Backend Memory Store

Serve structured exact-match queries from a relational store and semantic similarity queries from a vector index, unified behind a single retrieval interface.

Problem

Exact-match lookup (task ID, file path, session ID) is cheap and reliable in a relational store but is unavailable in a pure vector database. Semantic similarity search over concepts and entities is only possible in a vector store. Choosing one backend forces a suboptimal tradeoff on every query.

Solution

DuckDB handles structured queries against the JSONL schema; Chroma handles embedding-based similarity search. A unified Memory interface inspects the query type and routes to the appropriate backend, joining results when necessary.

Structure
  • DuckDB shard — JSONL → columnar; SQL queries over runs.jsonl schema
  • Chroma index — embedding vectors for document-level semantic search
  • Query router — inspects query shape to select backend
  • Unified Memory.get_context(task) interface
B4

Semantic Chunker

Split documents at semantic boundaries — headings, paragraph breaks, sentence endings — rather than at fixed character offsets.

Problem

Fixed-window chunking breaks mid-sentence and mid-concept with no regard for document structure. The resulting fragments are incoherent as context inputs and produce poor retrieval recall, because the embedding for a broken chunk reflects noise rather than the concept at the boundary.

Solution

Detect heading, paragraph, and sentence boundaries using structural heuristics; chunk at those boundaries. Maintain a configurable overlap (typically one sentence) at chunk edges to preserve cross-boundary context for downstream retrieval.

Structure
  • Boundary detector — regex + DOM-aware heuristics for headings and paragraphs
  • Overlap config — CHUNK_OVERLAP in tokens or sentences
  • Chunk normalizer — strips boilerplate, normalizes whitespace
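
The boundary-and-overlap idea can be illustrated for plain text with Markdown-style headings. This sketch assumes blank lines separate paragraphs and that a line starting with `#` is a heading; it does not split paragraphs longer than `max_chars`, and the names and defaults are hypothetical.

```python
import re


def split_sentences(text):
    # Split after sentence-ending punctuation followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk(text, max_chars=800, overlap_sentences=1):
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    chunks, current = [], []
    for block in blocks:
        is_heading = block.startswith("#")
        size = sum(len(b) for b in current) + len(block)
        # Close the chunk at a heading, or when the size budget is exceeded.
        if current and (is_heading or size > max_chars):
            chunks.append("\n\n".join(current))
            tail = []
            if overlap_sentences and not is_heading:
                # Carry the last sentence(s) over as cross-boundary context.
                tail = split_sentences(current[-1])[-overlap_sentences:]
            current = [" ".join(tail)] if tail else []
        current.append(block)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

No overlap is carried across a heading, on the assumption that a new section starts a new concept.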
B5

Vision Bridge

Include browser screenshots as image tokens in the prompt for tasks where visual layout or rendered output carries information that DOM text does not.

Problem

Many relevant documents — dashboards, rendered markdown, interactive web apps — lose critical context when scraped as raw DOM text. Layout, chart data, and visual hierarchy all inform interpretation but are absent from the text extraction.

Solution

A Playwright-based bridge navigates to the URL, waits for the page to render, captures a screenshot at a declared viewport, and injects the image into the prompt as a base64 image token. The agent can reference visual context alongside extracted text.

Structure
  • Playwright driver — headless Chrome via CDP
  • Screenshot scheduler — triggered by vision: true task flag
  • Image token injector — base64 in multimodal message slot
  • CDP Guard (E4) wraps the driver to restrict command scope
Verification

Verification Patterns

Three patterns that close the quality loop: how to measure output quality across dimensions, how to compress context without destroying structure, and how to select among multiple synthesis attempts.

C2

Dimensional Rubric

Score output on multiple independently-weighted dimensions rather than a single quality scalar, so that revision prompts can target the specific dimensions that failed.

Problem

A scalar score of 7/10 carries no information about what failed. "The output needs improvement" is not a revision prompt. An output that scores 7 on coverage and 7 on depth requires a different fix than one that scores 9 on coverage and 5 on depth.

Solution

Define five or six named dimensions with explicit weights. The evaluator scores each dimension independently using a structured prompt; the composite is a weighted sum. The Wiggum inner loop uses the per-dimension scores to generate targeted revision instructions for only the dimensions below their threshold.

Structure
  • Dimensions: relevance (0.20), completeness (0.20), depth (0.25), grounded (0.15), specificity (0.10), structure (0.10)
  • Composite: sum(score[d] * weight[d] for d in dims)
  • Per-dimension thresholds — configurable floor per dimension
  • Revision selector — identifies lowest-scoring dimensions for targeted prompts
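
Using the weights listed above, the composite and the revision selector reduce to a few lines. The function names and the per-dimension floors passed in are hypothetical, and scores are assumed to be on a 0–10 scale.

```python
WEIGHTS = {
    "relevance": 0.20,
    "completeness": 0.20,
    "depth": 0.25,
    "grounded": 0.15,
    "specificity": 0.10,
    "structure": 0.10,
}


def composite(scores, weights=WEIGHTS):
    """Weighted sum of per-dimension scores."""
    return sum(scores[d] * weights[d] for d in weights)


def dimensions_to_revise(scores, floors):
    """Dimensions below their floor, lowest score first."""
    failing = [d for d in scores if scores[d] < floors.get(d, 0.0)]
    return sorted(failing, key=lambda d: scores[d])
```

For scores of 9, 9, 5, 8, 8, 8 in the order above, the composite is 1.8 + 1.8 + 1.25 + 1.2 + 0.8 + 0.8 = 7.65, and with a floor of 7 on every dimension only `depth` is selected for revision.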
C3

Surgical Compressor

Reduce context size by scoring and selectively removing low-relevance segments, preserving document structure rather than truncating uniformly.

Problem

Uniform truncation discards content at arbitrary boundaries — often the tail of a document that contains conclusions and caveats, which are exactly the segments that affect groundedness scores. Long context also degrades synthesis quality through attention dilution: every token competes for attention, and dense context dilutes the signal from the most relevant segments.

Solution

Score each segment for relevance to the current synthesis task. Remove segments below the relevance threshold in order of score, stopping when the context is within the target token budget. Preserve the document's heading hierarchy regardless of which segments are removed.

Structure
  • Segment scorer — relevance of each chunk to task description
  • Budget target — MAX_CONTEXT_TOKENS config
  • Structure-preserving removal — keeps headings that introduce retained segments
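
The removal rule can be sketched as below, assuming each segment arrives as a `(heading, text, relevance)` tuple that has already been scored. Token counts are approximated by whitespace-separated words, and `compress` is a hypothetical name.

```python
def compress(segments, max_tokens, threshold=0.5):
    """segments: list of (heading, text, relevance) in document order."""
    def tokens(s):
        return len(s.split())  # crude stand-in for a real tokenizer

    keep = [True] * len(segments)
    total = sum(tokens(h) + tokens(t) for h, t, _ in segments)

    # Only segments below the relevance threshold are removable,
    # and the least relevant ones go first.
    removable = sorted(
        (i for i, seg in enumerate(segments) if seg[2] < threshold),
        key=lambda i: segments[i][2],
    )
    for i in removable:
        if total <= max_tokens:
            break
        keep[i] = False
        total -= tokens(segments[i][0]) + tokens(segments[i][1])

    # Retained segments keep their headings and original order.
    return "\n\n".join(
        f"{h}\n{t}" for (h, t, _), kept in zip(segments, keep) if kept
    )
```

Segments at or above the threshold are never dropped, so the result can still exceed the budget when most of the document is relevant.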
C4

ReAct Comparator

Generate multiple synthesis traces and select the highest-scoring one, rather than accepting the first attempt.

Problem

Synthesis quality is non-deterministic. A single pass sometimes produces an output one or two points below what a second attempt on identical context would produce. Committing to the first output without comparison surrenders recoverable quality variance.

Solution

Run N synthesis passes, either in parallel (with the Evaluator Pool providing capacity) or sequentially. Score each with the evaluator. Return the highest-scoring trace. N is typically 2–3; beyond that, the marginal quality gain does not justify the latency and token cost.

Structure
  • SYNTH_PASSES — configurable N, default 1 (comparator activates when N > 1)
  • Parallel executor — thread pool bounded by Evaluator Pool size
  • Score comparator — select argmax(scores)
Orchestration

Orchestration Patterns

Five patterns for coordinating multiple agents, isolating their state, routing tasks across instances, and extending the pipeline through composable skill handlers.

D1

DAG Orchestrator

Decompose a complex task into a directed acyclic graph and execute independent subtasks concurrently, dispatching each newly-unblocked task as its upstream dependencies resolve.

Problem

Sequential subtask execution serializes latency that is inherently parallelizable. A research task with five independent subtopics of similar size takes roughly 5× as long as any single subtask when run serially. Dependencies exist between some subtasks but not all; running everything serially ignores the available parallelism.

Solution

Kahn's topological sort identifies the initial set of independent subtasks. They are dispatched to a thread pool. As each completes, its result is recorded and any newly-unblocked downstream tasks are immediately dispatched. The cycle check runs before any threads launch — a graph with a cycle fails loudly before any compute is committed.

Structure
  • Task decomposer — planner output: [{id, task, depends_on}]
  • Cycle detector — Kahn's algorithm, O(V+E), runs pre-launch
  • Thread pool — ThreadPoolExecutor(max_workers=SUBTASK_MAX_WORKERS)
  • Dependency tracker — _ready(subtasks, results) returns newly-unblocked tasks
  • Result assembler — cross-reference synthesis across subtask outputs
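
A compact sketch of the pre-launch cycle check and the dispatch loop, using the `[{id, task, depends_on}]` shape from the planner output. `check_acyclic`, `run_dag`, and the `worker(task, dep_results)` signature are assumptions for illustration.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def check_acyclic(subtasks):
    """Kahn's algorithm: raise if the dependency graph has a cycle."""
    indegree = {t["id"]: len(t["depends_on"]) for t in subtasks}
    children = {t["id"]: [] for t in subtasks}
    for t in subtasks:
        for dep in t["depends_on"]:
            children[dep].append(t["id"])
    queue = [tid for tid, deg in indegree.items() if deg == 0]
    visited = 0
    while queue:
        node = queue.pop()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    if visited != len(subtasks):
        raise ValueError("dependency graph contains a cycle")


def run_dag(subtasks, worker, max_workers=4):
    check_acyclic(subtasks)  # fail before any thread is launched
    results, futures = {}, {}
    pending = {t["id"]: t for t in subtasks}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending or futures:
            # Dispatch every task whose dependencies have all resolved.
            ready = [t for t in pending.values()
                     if all(d in results for d in t["depends_on"])]
            for t in ready:
                del pending[t["id"]]
                deps = {d: results[d] for d in t["depends_on"]}
                futures[pool.submit(worker, t, deps)] = t["id"]
            done, _ = wait(list(futures), return_when=FIRST_COMPLETED)
            for fut in done:
                results[futures.pop(fut)] = fut.result()
    return results
```

A dependency on an unknown task id surfaces as a `KeyError` during the cycle check, which is also before any work starts.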
D2

Worktree Context

Give each concurrent subtask an isolated filesystem view using a Git worktree, avoiding race conditions on shared output paths without container overhead.

Problem

Concurrent subtasks writing to the same output directory create race conditions. The straightforward fix — Docker containers — is too heavyweight for short-lived subtasks and adds startup latency that defeats the concurrency benefit.

Solution

Each subtask is assigned its own Git worktree: a lightweight filesystem-level checkout at a dedicated branch. The subtask writes to its worktree path; results are read back by the orchestrator after the thread completes. The worktree is removed on cleanup, leaving no filesystem state.

Structure
  • Worktree manager — git worktree add / remove lifecycle
  • Branch-per-task — isolated branch prevents cross-subtask writes
  • Cleanup hook — worktree removed on subtask completion or failure
D3

MCP Dispatch Router

Route subtasks to remote harness instances by capability match over the Model Context Protocol, enabling cross-machine specialization without tight coupling between orchestrator and workers.

Problem

A local harness has VRAM, CPU, and model availability constraints. Some subtasks benefit from remote instances with different hardware (GPU vs CPU), different model weights, or higher parallelism capacity. Hard-coding remote endpoints creates brittle coupling.

Solution

An MCP server exposes a /dispatch endpoint. The orchestrator announces a subtask's required capabilities; the router selects a registered remote instance by capability match and forwards the task. Results are returned over the same MCP channel.

Structure
  • MCP server — FastAPI app with /dispatch and /register endpoints
  • Instance registry — {instance_id: {url, capabilities, last_seen}}
  • Capability matcher — selects instance by required model, VRAM, task type
  • Result aggregator — collects async responses; timeout and retry handling
D4

Skill Registry

Extend agent capabilities through a slash-command dispatch system — prefix matching against a registry of named handlers — without modifying the core pipeline.

Problem

Hard-coding specialized behaviors (literature review, browser navigation, email drafting, code execution) into the main agent loop creates a monolith. Adding a new capability requires changing core code, invalidating tests, and redeploying. Different deployments need different capability sets.

Solution

A registry maps slash-command prefixes to callable skill handlers. The entry point checks for a prefix match before routing to the standard pipeline. Skills are standalone modules that conform to a simple callable protocol; adding a skill requires only registering it in the skills dict.

Structure
  • _SKILLS: dict[str, Callable] — prefix → handler
  • Prefix matcher — next((k for k in _SKILLS if task.startswith(k)), None)
  • Skill protocol — handler(task: str, config: ModelConfig) -> str
  • Fallback — unmatched input routes to standard pipeline
D5

Agent Channel

Provide typed inter-agent communication over a shared message bus so that subtask results and status events propagate without filesystem polling.

Problem

Agents spawned as concurrent subtasks need to communicate results back to the orchestrator without polling a shared file. Filesystem-based handoff is subject to race conditions and provides no mechanism for partial progress, status updates, or error signaling during execution.

Solution

A typed message channel with producer/consumer semantics. Agents publish result and status events as they complete pipeline stages; the orchestrator subscribes and advances the dependency graph in response to events rather than polling for file existence.

Structure
  • Message schema — {agent_id, event_type, payload, timestamp}
  • Publisher — called by subtask agent at stage completion
  • Subscriber — orchestrator main loop consumes events, updates dependency state
  • Backpressure — bounded queue with configurable max depth
Security

Security Patterns

Four patterns that constrain the agent's ability to execute dangerous code, traverse the filesystem, act on injected instructions, or abuse browser automation privileges. All implemented with Python stdlib and no external security dependencies.

E1

AST Guard

Block execution of code containing dangerous constructs by parsing the AST before any code runs, making the check structurally immune to string-level obfuscation.

Problem

An agent that generates and executes code can be prompted to produce os.system(), subprocess.call(), or __import__ chains. String-matching on source code is evadable through concatenation, encoding, or indirect imports. The check must operate on the parsed program structure, not on the raw source text.

Solution

Parse the submitted code with Python's ast module before execution. Walk the AST with a NodeVisitor that checks for banned call targets and import names. Reject the code if any banned node is found; the check is O(N) in AST size and adds <1ms to execution setup.

Structure
  • Banned node list — exec, eval, __import__, subprocess, os.system
  • ast.NodeVisitor subclass — walks Call and Import nodes
  • Rejection handler — raises SecurityError before exec() is called
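
A minimal sketch of such a visitor is shown below. The banned sets mirror the list above, while the class and function names are hypothetical. It deliberately covers only direct calls and imports.

```python
import ast

BANNED_CALLS = {"exec", "eval", "__import__"}
BANNED_MODULES = {"subprocess"}
BANNED_ATTRS = {("os", "system")}


class SecurityError(Exception):
    pass


class Guard(ast.NodeVisitor):
    def visit_Import(self, node):
        for alias in node.names:
            if alias.name.split(".")[0] in BANNED_MODULES:
                raise SecurityError(f"banned import: {alias.name}")
        self.generic_visit(node)

    def visit_ImportFrom(self, node):
        if (node.module or "").split(".")[0] in BANNED_MODULES:
            raise SecurityError(f"banned import: {node.module}")
        self.generic_visit(node)

    def visit_Call(self, node):
        func = node.func
        # Bare calls such as eval(...) or __import__(...).
        if isinstance(func, ast.Name) and func.id in BANNED_CALLS:
            raise SecurityError(f"banned call: {func.id}")
        # Attribute calls such as os.system(...).
        if (isinstance(func, ast.Attribute)
                and isinstance(func.value, ast.Name)
                and (func.value.id, func.attr) in BANNED_ATTRS):
            raise SecurityError(f"banned call: {func.value.id}.{func.attr}")
        self.generic_visit(node)


def check(source: str) -> None:
    """Raise SecurityError if the source contains a banned construct."""
    Guard().visit(ast.parse(source))
```

This sketch would miss aliased forms such as `from os import system` followed by `system(...)`, or `getattr(os, "system")`; a production guard needs rules for those as well.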
E2

Path Sandbox

Restrict all agent file I/O to a declared workspace directory by resolving symlinks and verifying the absolute path before every read or write.

Problem

An agent synthesizing file paths from task input can traverse outside the intended workspace using ../ sequences or symlinks. A task that says "save to ~/Desktop/output.md" and is executed with a workspace of /tmp/agent_work can silently write to the user's home directory.

Solution

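Resolve all paths with Path.resolve() before any file operation. Verify that the resolved absolute path is contained in the workspace root, compared component by component rather than as a string prefix (a plain prefix test would accept /tmp/agent_work_evil for a workspace of /tmp/agent_work). Symlinks are fully resolved before the check, so symlink-based traversal is caught by the same test. Operations outside the sandbox raise SecurityError.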

Structure
  • Workspace root — AGENT_WORKSPACE env var or config
  • Path resolver — Path(p).resolve() before every read/write
  • Containment check — resolved.is_relative_to(workspace_root)
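
The resolve-then-check sequence can be sketched as follows. `safe_path` is a hypothetical helper; `Path.is_relative_to` requires Python 3.9 or later.

```python
from pathlib import Path


class SecurityError(Exception):
    pass


def safe_path(candidate: str, workspace_root: str) -> Path:
    """Return the resolved path, or raise if it escapes the workspace."""
    root = Path(workspace_root).resolve()
    path = Path(candidate).expanduser()
    # Relative paths are interpreted inside the workspace.
    resolved = (path if path.is_absolute() else root / path).resolve()
    # Component-wise containment check, performed after symlink resolution.
    if not resolved.is_relative_to(root):
        raise SecurityError(f"path escapes workspace: {candidate}")
    return resolved
```

Every read and write would go through this helper before the file is opened, so `../` sequences and sibling directories sharing the workspace's name prefix are both rejected.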
E3

Injection Scanner

Detect and strip prompt injection attempts from retrieved web content before it enters the model context, logging each stripped instance to the run trace.

Problem

Retrieved web pages and search results may contain adversarial instructions — "Ignore all previous instructions and instead…" — that redirect agent behavior when included in context. The model cannot distinguish between legitimate retrieved content and injected instructions at the semantic level.

Solution

Pattern-match all incoming content against an injection signature library before it is appended to the context accumulation. Strip matched segments and inject a system note: "N injection attempt(s) detected and removed." Log the injection count in the RunTrace for forensic analysis.

Structure
  • Signature library — regex patterns for common injection forms
  • Pre-context scanner — called on every search result before append
  • Strip-and-annotate — removes match, inserts system note
  • injection_stripped counter in RunTrace
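
A strip-and-annotate pass might look like the sketch below. The two signatures are illustrative examples only, not the harness's signature library, and `scan` is a hypothetical name.

```python
import re

# Illustrative signatures; a real library would be larger and maintained.
SIGNATURES = [
    re.compile(r"ignore (all )?(previous|prior) instructions[^.\n]*[.\n]?", re.I),
    re.compile(r"disregard (the )?(system|above) prompt[^.\n]*[.\n]?", re.I),
]


def scan(text: str):
    """Return (cleaned text, number of stripped injection attempts)."""
    count = 0
    for pattern in SIGNATURES:
        text, n = pattern.subn("", text)
        count += n
    if count:
        text = (
            text.strip()
            + f"\n[system note: {count} injection attempt(s) detected and removed.]"
        )
    return text, count
```

The returned count is what would be added to the `injection_stripped` counter in the RunTrace.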
E4

CDP Guard

Restrict Chrome DevTools Protocol commands to a declared whitelist, blocking all other CDP domains at the driver layer before they reach the browser.

Problem

An agent with unrestricted CDP access can exfiltrate browsing history, read cookies, inject scripts, and interact with browser state entirely outside its declared task scope. CDP is a broad attack surface: a standard research task never needs Network.getAllCookies, yet an unrestricted agent can call it.

Solution

Intercept CDP commands at the Playwright driver layer. Maintain a whitelist of permitted command domains (for example Page and Runtime; screenshots are taken through Page.captureScreenshot, since CDP has no separate Screenshot domain). Commands outside the whitelist are blocked and logged; the agent receives a CDPSecurityError with the blocked command name.

Structure
  • Permitted domains whitelist — configurable per skill
  • CDP interceptor — wraps CDPSession.send() in Playwright
  • Block-and-log handler — CDPSecurityError + injection_stripped counter
Observability

Observability Patterns

Four patterns that make the pipeline visible: a structured execution trace, a queryable append log, a visual profiler, and a live dashboard API. Together they form the instrumentation layer that makes self-improvement possible.

F1

RunTrace

Accumulate a complete structured record of every pipeline stage into a single object during execution, then serialize it atomically to the audit log on completion.

Problem

Debugging multi-stage pipeline failures requires reconstructing what happened from scattered print statements and partially-written files. Post-hoc reconstruction is unreliable; the failure may have corrupted the very state needed to diagnose it. Logs that record only the final state hide intermediate failures.

Solution

A RunTrace dataclass is initialized at pipeline entry and threaded through all stages. Each stage appends its timing, token counts, model calls, tool results, and any errors to the trace. On pipeline exit — whether successful or failed — the trace is serialized atomically to runs.jsonl.

Structure
  • RunTrace dataclass — fields for every tracked pipeline attribute
  • Stage hooks — each stage receives and returns the trace
  • Atomic serializer — single append to runs.jsonl on completion
  • ID format — YYYYMMDDTHHMMSSZ-<8-char-hex>, globally unique
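
A reduced sketch of the trace object, the ID format, and the single-append serializer. The field set here is far smaller than a real RunTrace, and `new_run_id`, `record`, and `append_trace` are hypothetical names. The random 8-hex suffix makes collisions very unlikely rather than impossible.

```python
import json
import secrets
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def new_run_id(now=None):
    """YYYYMMDDTHHMMSSZ-<8-char-hex>, using UTC time."""
    now = now or datetime.now(timezone.utc)
    return f"{now.strftime('%Y%m%dT%H%M%SZ')}-{secrets.token_hex(4)}"


@dataclass
class RunTrace:
    task: str
    run_id: str = field(default_factory=new_run_id)
    stages: list = field(default_factory=list)
    errors: list = field(default_factory=list)

    def record(self, stage, start_s, duration_s, **extra):
        # Each stage appends its own timing and any extra counters.
        self.stages.append(
            {"stage": stage, "start_s": start_s, "duration_s": duration_s, **extra}
        )


def append_trace(trace, path="runs.jsonl"):
    # One write call per run: the whole record lands as a single line.
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(trace)) + "\n")
```

Calling `append_trace` from a `try`/`finally` around the pipeline body is one way to make sure failed runs are recorded too.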
F2

JSONL Audit Log

Maintain a tamper-evident, queryable history of all runs as an append-only JSONL file, exploiting line-addressable structure to enable SQL-level queries without a database server.

Problem

A relational database requires schema migrations, backups, connection pooling, and operational overhead inappropriate for a local research harness. Flat text logs are unstructured and cannot be queried. CSV exports lose nested structure. The pipeline needs both simplicity and query power.

Solution

Each run appends one JSON object per line to runs.jsonl. The file is human-readable with jq, queryable as a DataFrame with pandas, and directly ingestible by DuckDB as a columnar source for SQL analytics. The append-only contract makes the log tamper-evident by construction.

Structure
  • Append-only writer — open(path, "a") + json.dumps(trace) + "\n"
  • Schema versioning — schema_version field enables forward compatibility
  • DuckDB adapter — SELECT * FROM read_ndjson_auto('runs.jsonl')
  • jq patterns — documented in harness-data-model.html
F3

Chrome Trace Exporter

Convert RunTrace stage timestamps to Chrome Trace format so that pipeline timing can be inspected as a flame graph without any additional tooling.

Problem

Raw per-stage durations in a JSONL record are difficult to compare visually. Identifying whether planning, retrieval, synthesis, or evaluation is the latency bottleneck requires mentally summing timestamps across fields. The analysis needs to be visual and immediate, not a spreadsheet exercise.

Solution

Extract start and duration from each stage's timing fields. Serialize as Chrome Trace JSON events (ph: "X" complete events). The output file opens directly in chrome://tracing or Perfetto UI as a flame graph with no additional tooling, plugins, or server required.

Structure
  • Stage extractor — maps RunTrace timing fields to trace event dicts
  • Chrome Trace serializer — [{name, ph:"X", ts, dur, pid, tid}]
  • Output: trace_<run_id>.json — opens in chrome://tracing
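
The conversion is small enough to sketch in full. The input shape (`stage`, `start_s`, `duration_s` in seconds) is an assumption about how timing is stored. The Chrome Trace format expects `ts` and `dur` in microseconds.

```python
import json


def to_chrome_trace(stages, pid=1, tid=1):
    """stages: list of {"stage": str, "start_s": float, "duration_s": float}."""
    return [
        {
            "name": s["stage"],
            "ph": "X",                           # complete event
            "ts": round(s["start_s"] * 1e6),     # start, microseconds
            "dur": round(s["duration_s"] * 1e6),  # duration, microseconds
            "pid": pid,
            "tid": tid,
        }
        for s in stages
    ]


def dump_chrome_trace(stages):
    """Serialize as the JSON array form accepted by trace viewers."""
    return json.dumps(to_chrome_trace(stages), indent=2)
```

Writing the returned string to `trace_<run_id>.json` produces a file that can be loaded into the trace viewer.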
F4

Dashboard API Layer

Expose RunTrace data through a local REST API that maintains an in-memory index, decoupling the dashboard from direct JSONL file reads during active pipeline runs.

Problem

A dashboard that reads runs.jsonl directly re-parses the full file on every request. During a long pipeline run, concurrent reads create file contention and increasingly expensive scans as the log grows. A live dashboard needs low-latency reads against a growing dataset.

Solution

A FastAPI server maintains an in-memory index over runs.jsonl, updated by a tail-watcher on each new append. The dashboard queries the API for pre-aggregated metrics; the API responds from the index without touching the file. A WebSocket endpoint streams live updates as new runs complete.

Structure
  • FastAPI app — /runs, /runs/{id}, /metrics endpoints
  • JSONL tail-watcher — inotify / polling appends to in-memory index
  • Metric aggregator — pre-computes score distributions, latency percentiles
  • WebSocket — pushes new run summaries to connected dashboards
Self-Improvement

Self-Improvement Patterns

Five patterns that close the outer loop: converting production run data into training signal, optimizing the synthesis instruction autonomously, and detecting when the optimizer has converged or gone off the rails.

G1

Data Flywheel

Convert high-quality production run outputs into DPO training pairs automatically, so that each production task directly improves the model that will handle future tasks.

Problem

A static model processes production data but learns nothing from it. Improving quality requires manual annotation, which is expensive and slow. The harness already produces per-run quality scores and full output text — the raw materials for preference learning are available but unused.

Solution

Runs scoring above a threshold are promoted as DPO positive examples; runs on the same task scoring below a floor become negative examples. The pairing is automatic: same task prompt, different output quality. The DPO dataset grows with every production run; fine-tuning cycles consume it on a schedule.

Structure
  • Score threshold filter — wiggum_r1 >= FLYWHEEL_POSITIVE_FLOOR
  • DPO pair generator — matches positive/negative by task_id
  • Training data formatter — ShareGPT or Alpaca schema output
  • Fine-tuning trigger — scheduled or threshold-based
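
The pairing step can be sketched like this. `task_id` and `wiggum_r1` come from the structure above; the `task` and `output` field names, both default thresholds, and the `prompt` / `chosen` / `rejected` output layout are assumptions (that layout is a common convention for preference data, and a formatter would map it to whichever schema the training stack expects).

```python
from collections import defaultdict


def dpo_pairs(runs, positive_floor=8.5, negative_ceiling=6.0):
    """Pair the best and worst qualifying run for each task."""
    by_task = defaultdict(list)
    for run in runs:
        by_task[run["task_id"]].append(run)

    pairs = []
    for group in by_task.values():
        chosen = [r for r in group if r["wiggum_r1"] >= positive_floor]
        rejected = [r for r in group if r["wiggum_r1"] < negative_ceiling]
        # A pair needs both a strong and a weak output for the same task.
        if chosen and rejected:
            best = max(chosen, key=lambda r: r["wiggum_r1"])
            worst = min(rejected, key=lambda r: r["wiggum_r1"])
            pairs.append({
                "prompt": best["task"],
                "chosen": best["output"],
                "rejected": worst["output"],
            })
    return pairs
```

Tasks with only high-scoring or only low-scoring runs contribute nothing until a counterpart appears.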
G2

RL Rollout

Treat each agent run trajectory as a reinforcement learning episode, using the final verifiable score as the reward signal for policy gradient updates.

Problem

DPO requires paired examples — a preferred and a dispreferred output for the same prompt. For tasks with verifiable correct answers (code execution, structured queries, fact-checkable outputs), RL is more sample-efficient: a scalar reward from a verifier is sufficient without paired data collection.

Solution

The RunTrace trajectory (the sequence of model calls, tool results, and intermediate states) is serialized as an RL episode. The final Wiggum score or a task-specific verifier provides the reward. NeMo RL processes the rollout buffer using GRPO or PPO, updating the producer model policy.

Structure
  • Trajectory extractor — RunTrace → (state, action, reward) sequence
  • Verifiable reward function — Wiggum score or task-specific check
  • NeMo RL integration — GRPO/PPO policy update
  • Rollout buffer — accumulates episodes until batch threshold
G3

Literature Review Pipeline

Continuously update the harness knowledge base by running the synthesis pipeline against curated arxiv seed queries on a schedule, without human curation at each cycle.

Problem

Keeping the knowledge base current with the research literature requires reading and summarizing new papers — a task that does not scale with manual effort. RAG without structured extraction produces shallow, inconsistent coverage; the harness needs structured findings, not raw abstracts.

Solution

A scheduled pipeline fetches new papers from arxiv seed queries, runs each through the full synthesis pipeline, extracts structured findings using the Wiggum-evaluated synthesis instruction, and indexes the output into the knowledge base. The pipeline uses the harness to improve the harness.

Structure
  • Seed query list — topic-specific arxiv search strings
  • Fetch scheduler — configurable cadence (daily / weekly)
  • Synthesis pipeline — full harness run per paper batch
  • Knowledge base indexer — structured output → DuckDB + Chroma
G4

Autoresearch Loop

Autonomous Instruction Optimizer

Autonomously improve the synthesis instruction by proposing changes, evaluating them against a held baseline, and accepting only changes that advance the composite score beyond a threshold.

Problem

Synthesis instruction quality directly determines output quality but is difficult to tune by hand: the space of possible instructions is vast, evaluation is noisy, and the connection between instruction wording and dimensional score is opaque. Manual iteration is slow and produces instructions optimized for legibility rather than evaluator reward.

Solution

A proposer LLM reads the current instruction, experiment history, hard-ban list, and evaluator feedback, then outputs a replacement. The replacement is evaluated against the current baseline; it is kept only if it advances the composite score by more than DELTA_THRESHOLD. The loop runs until a convergence condition is detected (see G5).

Structure
  • Proposer — strongest available model; reads PROPOSE_PROMPT with history injected
  • Eval suite — composite = 0.7 × wiggum_r1 + 0.3 × criteria_rate × 10
  • Advance/discard — keep if score > baseline + DELTA_THRESHOLD
  • Hard-ban list — grows with failed approach families
  • Experiment TSV — autoresearch.tsv audit trail
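
The composite and the advance/discard rule translate directly into code. The sketch assumes `criteria_rate` is a fraction in [0, 1] (so multiplying by 10 puts it on the same scale as `wiggum_r1`); the default `delta_threshold` is a placeholder.

```python
def composite(wiggum_r1: float, criteria_rate: float) -> float:
    """composite = 0.7 × wiggum_r1 + 0.3 × criteria_rate × 10"""
    return 0.7 * wiggum_r1 + 0.3 * criteria_rate * 10


def decide(candidate: float, baseline: float, delta_threshold: float = 0.1) -> str:
    # Keep the proposal only if it beats the baseline by a clear margin.
    return "advance" if candidate > baseline + delta_threshold else "discard"
```

For example, a proposal with `wiggum_r1 = 8.0` and `criteria_rate = 0.9` scores 5.6 + 2.7 = 8.3; against a baseline of 8.25 with a threshold of 0.1 it is discarded, because 8.3 does not exceed 8.35.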
G5

Convergence Detector

Monitor the autoresearch loop for semantic attractor lock, approach-family collapse, baseline contamination, and prolonged stagnation: four distinct signals that indicate the optimizer has stopped making genuine progress.

Problem

An autonomous optimizer with only score-based plateau detection will run indefinitely in a converged state: the proposer recycles the same approach family under slightly different names, the baseline may have been established from a single lucky evaluation run, and the hard-ban list can grow to prohibit the only instruction family that ever worked.

Solution

Four complementary detectors operate at different points in the loop. The semantic attractor guard (pre-eval) rejects proposals too similar to recent discards. The family entropy monitor detects when one approach family dominates the sliding window. The baseline re-estimation trigger periodically re-runs the baseline with eval-n=3 to detect lucky-sample inflation. The global convergence exit halts the loop after a configurable number of experiments with no advance.

Structure
  • Semantic attractor guard — TF-IDF cosine similarity against recent discard descriptions; threshold ~0.65
  • Family entropy monitor — Shannon entropy of last-10 family labels; fires when entropy < 1.0
  • Baseline re-estimation — re-run baseline with eval-n=3 after K consecutive discards
  • Global convergence exit — halt + structured report after N_max experiments with zero advances since M_last_advance
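
Of the four detectors, the family entropy monitor is the easiest to show in isolation. The sketch assumes entropy is measured in bits (base 2), which the text above does not specify, and that the alarm stays silent until the window is full.

```python
import math
from collections import Counter


def family_entropy(labels):
    """Shannon entropy (bits) of the approach-family label distribution."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def entropy_alarm(history, window=10, floor=1.0):
    """True when one family dominates the most recent `window` experiments."""
    recent = history[-window:]
    return len(recent) == window and family_entropy(recent) < floor
```

With a floor of 1.0 bit, an even split between two families sits exactly at the boundary and does not fire; nine experiments from one family and one from another (about 0.47 bits) does.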

Each pattern in this catalog is documented in one of the eleven series posts. Pattern IDs (A1–G5) correspond to the section lettering used in the posts; the section letter indicates which pipeline concern the pattern addresses, and the number indicates the order in which it is introduced within that section.