May 22, 2026 • 16 min read • Agentic Harness Engineering Series

Inference Patterns: The Substrate Layer

Four patterns governing how language model calls are issued, routed, and kept warm. Everything else in the pipeline depends on getting this substrate right.

Every agentic pipeline eventually touches a language model. The questions are mundane on the surface — which backend? which model? how long does it stay loaded? — but the answers echo through every downstream component. Get the inference substrate wrong and you inherit compounding latency, self-evaluation bias, and VRAM thrashing in the patterns that depend on it. Get it right and the rest of the catalog snaps into place cleanly.

Section A of the pattern catalog covers four patterns that together form the substrate layer: The Inference Shim (A1), which hides backend diversity behind a single interface; Model Role Separation (A2), which enforces that no model evaluates its own output; The Evaluator Pool (A3), which converts systematic per-model scoring bias into measurable variance; and The Keep-Alive Budget (A4), which eliminates the cold-start gaps that dominate pipeline latency traces.

These four patterns are prerequisite infrastructure — they do not produce visible outputs, they prevent invisible failures. None of them are optional once the pipeline reaches production scale.

A1 — The Inference Shim

The fastest way to accumulate technical debt in an agentic system is to scatter backend-specific inference calls throughout the codebase. Three months of development later, the agent module is importing ollama directly, the evaluator is calling openai.ChatCompletion, and the planner is using a subprocess call to llama-cli. Switching any one backend requires a surgical search across a dozen files.

The Inference Shim closes that off at the start. harness/inference.py exposes exactly one public function:

def chat(model: str, messages: list[dict]) -> OllamaLike:
    endpoint = HARNESS_ENDPOINTS.get(model)
    if endpoint["type"] == "ollama":
        return _call_ollama(endpoint["url"], model, messages)
    elif endpoint["type"] == "vllm":
        return _call_vllm(endpoint["url"], model, messages)
    elif endpoint["type"] == "openai":
        return _call_openai(endpoint["url"], model, messages)
    raise ValueError(f"unknown backend type: {endpoint['type']}")

The return type — OllamaLike — is a minimal dataclass that normalizes every backend's response envelope to expose .message.content and .message.role. The agent, evaluator, and planner never import backend SDKs. Every component receives the same interface regardless of where the model is actually running.

A1 — Inference Shim: Routing Architecture

All calling code issues a single chat(model, messages) call. The shim routes to the appropriate backend and normalizes the response.

HARNESS_ENDPOINTS is a JSON file (or HARNESS_ENDPOINTS environment variable) mapping model name tags to backend URLs and types. Swapping a model from Ollama to vLLM is a one-line config change. The shim adds one function-call layer of overhead — negligible against inference latency, which starts at several hundred milliseconds even for small models.

A2 — Model Role Separation

If the Inference Shim is infrastructure, Model Role Separation is policy. The rule is simple: within a single run, no model evaluates its own output. The harness enforces this structurally through the ModelConfig dataclass:

@dataclass
class ModelConfig:
    producer: str
    evaluator: str
    planner: str

    def __post_init__(self):
        if self.producer == self.evaluator:
            raise ConfigurationError(
                "producer and evaluator must be different models; "
                "self-evaluation bias measured at +0.9 points mean inflation"
            )

The three roles are non-overlapping by design. The producer is the largest available model, optimized for long-form synthesis — coherent multi-section documents that integrate dozens of sources. The evaluator is always a different model family, not just a different model — cross-family selection (e.g., GLM4 evaluating Qwen3 output) reduces shared training distribution bias at the cost of some calibration variance. The planner is the smallest fast model available, since planning latency affects perceived responsiveness: a 9B model returning a structured plan in 12 seconds outperforms a 70B model returning one in 45 seconds.

A2 — Three-Role Architecture: ModelConfig

The three roles use separate model instances and never share a model within the same run. The evaluator is always a different family from the producer.

Why does the self-evaluation bias number matter? Across 1,500 logged runs, the mean Wiggum score when the same model evaluates its own synthesis is 0.9 points higher than cross-model evaluation on identical outputs, measured using blind reruns with evaluator identity withheld from scoring analytics. A pipeline that uses the same model as both producer and evaluator will systematically pass outputs it should reject. The ConfigurationError at construction time is the most important error in the codebase.

A3 — The Evaluator Pool

Role separation prevents self-evaluation, but it doesn't prevent per-model bias. A single evaluator model, used exclusively across thousands of runs, accumulates systematic scoring patterns that are invisible until you run calibration experiments. One model consistently over-scores Groundedness by 0.4 points. Another inflates Completeness for outputs that cite many sources regardless of whether those sources support the claims. These biases are not bugs in the models — they reflect calibration differences in how each model was trained to use the rubric.

The Evaluator Pool converts systematic per-model bias into measurable variance. The implementation is small:

HARNESS_EVALUATOR_POOL = os.getenv(
    "HARNESS_EVALUATOR_POOL", "glm4:latest,mistral:latest"
).split(",")

def select_evaluator(seed: str) -> str:
    idx = int(hashlib.md5(seed.encode()).hexdigest(), 16) % len(HARNESS_EVALUATOR_POOL)
    return HARNESS_EVALUATOR_POOL[idx]

The seed is typically the run ID — a UTC timestamp plus UUID4 hex prefix. Hash-based selection means the same run ID always selects the same evaluator (reproducibility is preserved), but different runs are distributed across the pool. The key insight: a bias that inflates one dimension by 0.5 points for all runs is invisible to analytics because it looks like the baseline. The same bias distributed across four evaluators shows up as evaluator-correlated score variance — a pattern that is diagnosable, quantifiable, and correctable.

A3 — Evaluator Pool: Hash-Based Rotation

Run IDs hash deterministically to evaluator models. The same run always uses the same evaluator. Different runs distribute across the pool, converting systematic bias into measurable variance.

To diagnose evaluator drift, group runs.jsonl by evaluator_model and compare dimensional score distributions:

import pandas as pd, json

runs = pd.DataFrame(
    json.loads(line) for line in open("data/runs.jsonl")
)
dims = ["relevance","completeness","depth","specificity","structure","groundedness"]

# per-evaluator mean per dimension — look for columns that diverge
pivot = (
    runs.explode("wiggum_dimensions")
        .assign(**{d: lambda r, d=d: r.wiggum_dimensions.str[d] for d in dims})
        .groupby("evaluator_model")[dims].mean()
)
print(pivot.round(2))

A4 — The Keep-Alive Budget

Cold-start latency is the silent killer of pipeline throughput. Loading a 7B model into VRAM takes 5–10 seconds. Loading a 30B model takes 20–45 seconds. In a pipeline that uses three separate model roles — planner, producer, evaluator — an unmanaged warm cache means every stage transition potentially pays a cold-start penalty. On a standard research run, that adds up to 60–90 seconds of pure overhead per run, visible as flat gaps in the Chrome Trace flame graph between stage blocks.

Ollama's keep_alive parameter controls VRAM residency. Set to -1, a model stays loaded indefinitely. The _estimate_keep_alive() function in harness/agent.py sets it adaptively based on how soon the model will be called again:

def _estimate_keep_alive(role: str, task_type: str) -> int:
    """Returns keep_alive seconds. -1 = indefinite."""
    if role == "planner":
        # Planner finishes before synthesis; release VRAM for producer
        return 120 if task_type == "orchestrated" else 60
    if role == "producer":
        # Producer stays warm through evaluation (revision may follow)
        return -1
    if role == "evaluator":
        # Evaluator stays warm through all revision rounds
        return -1
    return 300  # default
A4 — Keep-Alive Budget: VRAM Residency Timeline

Standard 24 GB VRAM budget. Planner releases after planning phase; producer and evaluator stay warm through the Wiggum Loop's revision rounds. Cold-start overhead drops from ~75 s to ~8 s.

The three-model sequencing strategy applies when VRAM is constrained (e.g., 8–16 GB). Each model loads, runs its stage, then releases — planner first, then producer, then evaluator. Each transition costs 5–15 seconds instead of 30–60 for a cold start because the model weights are still in system RAM (paged out of VRAM) and don't need to be re-read from disk. On 24+ GB VRAM, all three models stay warm simultaneously and the Keep-Alive Budget reduces to managing when to release the planner to free headroom for the producer's context window.

VRAM Tier Strategy Cold-Start Overhead Notes
8 GB Sequential load/unload per stage ~45–75 s per run Each transition still benefits from system RAM cache
16 GB Producer + evaluator warm; planner sequential ~15–25 s per run Planner loads once, plans, then releases before synthesis
24 GB All three roles warm simultaneously ~5–10 s per run Planner released after planning to free headroom for producer KV cache
40+ GB All three warm + vision model for image tasks <5 s per run Vision Bridge (B5) adds a fourth model slot to the budget

The diagnostic: open a run's Chrome Trace file in Perfetto UI (ui.perfetto.dev). Cold-start overhead appears as a flat white block at the start of each stage block in the flame graph — no computation, just model loading. If the block between "plan complete" and "research start" is wider than 10 seconds, the planner is not staying warm across the planning call. If the block between "synthesis complete" and "eval start" is wider than 15 seconds, the evaluator is not pre-loaded. Both are fixable with one environment variable change.

How the Four Patterns Compose

The four inference patterns form a dependency chain. The Inference Shim is the foundation — every model call in every other pattern goes through it. Model Role Separation partitions those calls into three non-overlapping functional slots, preventing self-evaluation bias at the structural level. The Evaluator Pool rotates which model fills the evaluator slot across runs, converting systematic bias into observable variance. The Keep-Alive Budget manages VRAM residency so that all three model slots stay warm without starving each other.

A2 depends on A1 (role-separated calls still go through the shim). A3 depends on A2 (pool selection provides the evaluator model name to ModelConfig). A4 depends on all three (it manages residency for the producer, evaluator, and planner that A2 defines, using the backend-specific keep_alive parameter that A1 abstracts).

The consequence is that inference substrate errors compound upward. A misconfigured HARNESS_ENDPOINTS breaks every call in every pattern. An accidental same-model assignment in ModelConfig biases evaluation scores for every run until it's caught. An unmanaged keep-alive drains throughput and shows up in all latency metrics. The patterns are boring infrastructure. Their absence is not.

The next post covers Section B — Context Engineering Patterns — which address what information reaches the synthesis model before a single token of output is produced. With the inference substrate in place, context quality becomes the primary lever on output quality.

← Previous 3 · Failure Taxonomy Next → 5 · Context Engineering