May 22, 2026 • 18 min read • Agentic Harness Engineering Series

Context Engineering: What Reaches the Model

Five patterns governing the research half of the pipeline: how tasks are decomposed into targeted queries, how redundant content is filtered before synthesis, and how memory persists knowledge across runs.

A language model synthesizes what it is given. The evaluator scores the synthesis. The evaluator's feedback guides revision. But none of that matters if what is given to the model at synthesis time is wrong: incomplete, redundant, stale, or irrelevant. The failure taxonomy showed that retrieval failures are the most common class (38%) and that planning failures are the third most common (22%). Both happen before a single synthesis token is generated.

Section B of the pattern catalog addresses the research half of the pipeline: five patterns that govern what information reaches the synthesis model. The Planner-First (B1) intercepts every task and converts it into targeted search queries. The Novelty Gate (B2) discards result batches that add no new information. The Dual-Backend Memory Store (B3) combines vector search and full-text retrieval to inject relevant prior knowledge. The Semantic Chunker (B4) extracts the most relevant sections from documents too large to fit the context window. The Vision Bridge (B5) converts image paths in the task into structured text before any other processing.

These patterns run before synthesis. Their combined effect: the synthesis model receives a context window that is targeted, non-redundant, enriched by prior runs, and bounded to fit available context. The quality ceiling for synthesis is determined here.

B1 — The Planner-First

The simplest possible research loop issues the task string directly as a search query. This works for narrow factual lookups. It fails for every research task that requires judgment about what to look for — which is most of them. The raw task "survey speculative decoding techniques in transformer inference" is a reasonable human prompt and a poor search query. What the pipeline actually needs is: what aspect of speculative decoding? Which deployment context? Which model families? What does the agent already know?

The Planner-First intercepts every task before any search query is issued and produces four outputs:

from dataclasses import dataclass

@dataclass
class Plan:
    search_queries: list[str]   # concrete targeted queries, never the raw task
    known_facts: list[str]      # already in memory — skip re-fetching
    knowledge_gaps: list[str]   # what must be found
    subtasks: list[str]         # decomposition hint for orchestrator

The agent loop uses plan.search_queries exclusively. The raw task string never hits a search API. A typical plan converts one vague task into three to five targeted queries that each address a specific knowledge gap — methods, benchmarks, deployment considerations, limitations.
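The rule that the raw task never becomes a query can be enforced where the planner's output is parsed. The following is a minimal sketch, assuming the planner model returns JSON with the same four keys as Plan; the parse_plan helper and its max_queries parameter are hypothetical and not part of the harness as described.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Plan:
    search_queries: list[str] = field(default_factory=list)
    known_facts: list[str] = field(default_factory=list)
    knowledge_gaps: list[str] = field(default_factory=list)
    subtasks: list[str] = field(default_factory=list)


def parse_plan(raw_json: str, task: str, max_queries: int = 5) -> Plan:
    """Build a Plan from planner JSON (assumed format), rejecting the raw task as a query."""
    data = json.loads(raw_json)
    task_norm = task.strip().lower()
    queries: list[str] = []
    for q in data.get("search_queries", []):
        q = q.strip()
        # Drop empties, duplicates, and any query that merely echoes the task.
        if q and q.lower() != task_norm and q not in queries:
            queries.append(q)
    if not queries:
        raise ValueError("planner produced no usable search queries")
    return Plan(
        search_queries=queries[:max_queries],
        known_facts=list(data.get("known_facts", [])),
        knowledge_gaps=list(data.get("knowledge_gaps", [])),
        subtasks=list(data.get("subtasks", [])),
    )
```

Raising on an empty query list, rather than silently falling back to the task string, keeps a planner failure visible instead of turning it into a weak search.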

B1 — Planner-First: Task Decomposition Flow

The planner converts one task string into targeted search queries and a knowledge gap list. The raw task string never reaches a search API.

Logged run comparisons show that planner-guided search reduces the number of Wiggum Loop rounds needed to pass by an average of 0.8 and cuts total token consumption by ~22%. Planning latency — 12–18 seconds for a 9B model — is the cost. That cost is dominated by model load time on cold starts; a warm planner (see Keep-Alive Budget, A4) typically returns the plan in under 5 seconds.

B2 — The Novelty Gate

The Planner-First generates better queries. It doesn't prevent result batches from being redundant with each other. A pipeline that runs five search rounds and accumulates all results before synthesis will almost always present the model with a context window where the same core concepts appear four to six times, rephrased but not substantively different. The evaluator's Completeness dimension is fooled by this: high citation count reads as comprehensive coverage, and the composite score inflates even though no new information was added.

The Novelty Gate scores each batch against the accumulated knowledge state before that batch reaches the synthesis prompt:

async def gather_research(task: str, plan: Plan) -> list[SearchResult]:
    knowledge_state = plan.known_facts.copy()
    passing_results = []

    for query in plan.search_queries:
        batch = await search(query)
        novelty = await memory.assess_novelty(batch, knowledge_state)

        if novelty < NOVELTY_THRESHOLD:  # default: 3.0
            trace.record_tool_call(query, novelty_score=novelty, discarded=True)
            continue

        passing_results.extend(batch)
        knowledge_state.extend([r.snippet for r in batch])

    return passing_results

B2 — Novelty Gate: Incremental Saturation Scoring

Each result batch is scored against accumulated knowledge. Batches below the novelty threshold are discarded before synthesis. Scoring is incremental: the threshold itself stays fixed, but later batches are compared against a larger knowledge state and so face a higher effective bar.

Two implementation details matter. First, scoring is incremental: the threshold is constant, but the knowledge state is updated after each passing batch, so later batches that overlap with earlier passing content score lower. Early batches retrieve foundational content; later batches need to introduce something genuinely new to pass. Second, RESEARCH_CACHE=1 caches results keyed by query string so Wiggum revision rounds don't re-fetch content that was already retrieved and assessed in round one.
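To make the incremental effect concrete, here is a toy lexical scorer. It is a sketch under stated assumptions, not the harness's memory.assess_novelty: the function name, the token-overlap heuristic, and the 0 to 10 scale are all assumptions chosen so the numbers are comparable to a threshold of 3.0.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a snippet."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def assess_novelty_lexical(batch: list[str], knowledge_state: list[str]) -> float:
    """Score 0-10: share of the batch's distinct tokens not yet in knowledge_state."""
    batch_tokens: set[str] = set()
    for snippet in batch:
        batch_tokens |= _tokens(snippet)
    if not batch_tokens:
        return 0.0
    known: set[str] = set()
    for fact in knowledge_state:
        known |= _tokens(fact)
    unseen = batch_tokens - known
    return 10.0 * len(unseen) / len(batch_tokens)


state: list[str] = []
first = ["draft models propose tokens"]
print(assess_novelty_lexical(first, state))   # 10.0: nothing is known yet
state.extend(first)
print(assess_novelty_lexical(["draft models verify tokens"], state))  # 2.5: one new token in four
```

The second batch would fall below a 3.0 threshold even though it would have scored 10.0 had it arrived first, which is the sense in which later batches face a higher bar.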

Novelty scores for each search round are recorded in runs.jsonl tool call records. A run where every round scores above 7.0 suggests the planner is generating maximally diverse queries. A run where rounds 2–5 all score below 3.5 suggests either the topic is narrow (expected) or the planner is generating near-duplicate queries (a planning failure pattern).
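The two diagnostic readings above can be expressed as a small check over one run's per-round scores. This is an illustrative sketch: diagnose_novelty and its return labels are hypothetical, and the scores are assumed to have already been pulled out of the tool call records as a list of floats in round order.

```python
def diagnose_novelty(scores: list[float]) -> str:
    """Classify one run's per-round novelty scores using the cutoffs quoted above."""
    if not scores:
        return "no-rounds"
    if all(s > 7.0 for s in scores):
        return "diverse-queries"
    later = scores[1:]  # rounds 2 onward
    if later and all(s < 3.5 for s in later):
        # Either a genuinely narrow topic or near-duplicate planner queries.
        return "narrow-topic-or-duplicate-queries"
    return "mixed"


print(diagnose_novelty([8.1, 7.4, 7.9]))       # diverse-queries
print(diagnose_novelty([8.1, 3.1, 2.0, 1.2]))  # narrow-topic-or-duplicate-queries
```

Telling the narrow-topic case from the duplicate-query case still requires looking at the queries themselves; the scores alone only flag the run for review.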

B3 — The Dual-Backend Memory Store

Research pipelines that run in isolation are inefficient. The same foundational concepts get re-retrieved on every run. The Dual-Backend Memory Store makes prior runs available as retrievable context, injecting the four most relevant prior observations into the synthesis prompt before any new search occurs.

Neither vector search nor full-text search alone is sufficient for knowledge retrieval. Vector search finds semantically related content even when no words overlap — useful for concept-level similarity. Full-text search finds exact keyword matches — indispensable for proper nouns, model names, and technical terms that don't embed semantically. The two backends have complementary blind spots, which is why the store uses both:

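def get_context(task: str, top_k: int = 4) -> list[Observation]:
    # Stage 1: semantic similarity via ChromaDB
    chroma_hits = chroma_collection.query(
        query_texts=[task],
        n_results=top_k * 3,  # oversample for re-ranking
        include=["metadatas", "distances"],
    )
    candidates = []
    for meta, distance in zip(chroma_hits["metadatas"][0], chroma_hits["distances"][0]):
        obs = Observation(**meta)
        # Assumes the collection uses cosine distance, so similarity = 1 - distance
        obs.similarity = 1.0 - distance
        candidates.append(obs)

    # Stage 2: keyword re-ranking via SQLite FTS5
    fts_scores = fts_db.execute(
        "SELECT rowid, rank FROM observations_fts WHERE observations_fts MATCH ?",
        [fts_query(task)]
    ).fetchall()
    # FTS5 rank is negative for matches (more negative = better), so abs() makes larger = better
    fts_map = {row[0]: abs(row[1]) for row in fts_scores}

    for obs in candidates:
        # Candidates with no keyword hit keep their semantic score only
        obs.score = obs.similarity + 0.4 * fts_map.get(obs.rowid, 0)

    return sorted(candidates, key=lambda o: o.score, reverse=True)[:top_k]

B3 — Dual-Backend Memory Store: Hybrid Retrieval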

ChromaDB handles semantic similarity; SQLite FTS5 handles exact keyword matching. Both are queried at retrieval time and results are re-ranked using a combined score.
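The retrieval code calls fts_query(task) without showing it. A free-text task string generally cannot be passed to an FTS5 MATCH clause as-is, because characters such as hyphens and dots are parsed as query syntax. One plausible shape for that helper is sketched below; the body, the max_terms parameter, and the choice to OR the terms together are assumptions rather than the harness's actual implementation.

```python
import re


def fts_query(task: str, max_terms: int = 12) -> str:
    """Turn free text into an FTS5 MATCH expression (hypothetical helper)."""
    terms: list[str] = []
    for tok in re.findall(r"[A-Za-z0-9][A-Za-z0-9._-]*", task):
        tok = tok.strip("._-").lower()
        # Skip single characters and repeated terms.
        if len(tok) > 1 and tok not in terms:
            terms.append(tok)
    # Double-quoting each term makes FTS5 treat it as a string literal,
    # so '-' and '.' inside model names are not parsed as operators.
    return " OR ".join(f'"{t}"' for t in terms[:max_terms])


print(fts_query("Survey llama3.2-vision decoding, decoding!"))
# "survey" OR "llama3.2-vision" OR "decoding"
```

When the task contains no usable terms the helper returns an empty string, and a caller would likely want to skip the FTS stage in that case rather than run a MATCH with an empty expression.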

Every completed run writes an observation to both backends. The write path includes an injection scan (see Injection Scanner, E3) before any content reaches the memory store, because an observation carrying injected instructions would affect every future run that retrieves it. A novelty check at write time prevents storing observations that are redundant with the existing store.

The practical effect: the second time the pipeline runs a task on a topic it has seen before, the synthesis prompt arrives enriched with structured summaries of prior findings. The planner's known_facts field draws from memory context, allowing it to skip re-fetching foundational content. Over hundreds of runs on a topic domain, the pipeline accumulates a navigable knowledge base without any explicit knowledge base management.

B4 — The Semantic Chunker

The Novelty Gate and memory retrieval ensure the right documents are selected. The Semantic Chunker ensures the right sections of those documents are extracted. A 40,000-character academic paper cannot fit in a 32K context window alongside memory context, task instructions, and output formatting. Naive truncation discards the conclusion and results sections — often the most relevant parts — in favor of the introduction.

The Chunker auto-detects document structure and applies the appropriate extraction mode:

import re

def extract(content: str, task: str, budget: int = 12_000) -> str:
    headings = re.findall(r'^#{2,3}\s+.+', content, re.MULTILINE)

    if len(headings) >= 3:
        # Structured document: section-priority extraction
        return _structured_extract(content, budget)
    else:
        # Unstructured: embedding-based chunk retrieval
        return _semantic_extract(content, task, budget)

SECTION_PRIORITY = [
    "abstract", "conclusion", "introduction",
    "results", "findings", "methods", "discussion", "other"
]

For structured documents (academic papers, technical reports), sections are extracted in priority order — Abstract first, then Conclusion, Introduction, Results, Methods — until the 12,000-character budget is exhausted. This is the right priority order for research synthesis: the abstract states the contribution, the conclusion evaluates it, the introduction contextualizes it. Methods sections matter less for most synthesis tasks.
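The section-priority mode might look like the sketch below. It is an assumption-laden illustration, not the harness's _structured_extract: it presumes markdown-style level-2 and level-3 headings (matching the detection regex), labels a section by substring match against the priority names, and keeps whole sections rather than truncating them.

```python
import re

SECTION_PRIORITY = ["abstract", "conclusion", "introduction",
                    "results", "findings", "methods", "discussion", "other"]


def structured_extract(content: str, budget: int = 12_000) -> str:
    """Keep whole sections in priority order until the character budget is spent."""
    # Split so that each part starts at a level-2 or level-3 markdown heading.
    parts = [p for p in re.split(r"(?m)^(?=#{2,3}\s)", content) if p.strip()]
    sections = []
    for position, part in enumerate(parts):
        heading = part.splitlines()[0].lstrip("#").strip().lower()
        label = next((name for name in SECTION_PRIORITY if name in heading), "other")
        sections.append((SECTION_PRIORITY.index(label), position, part.strip()))

    picked, used = [], 0
    for _rank, position, text in sorted(sections):
        cost = len(text) + (2 if picked else 0)  # 2 chars for the joining blank line
        if used + cost > budget:
            continue  # too large for the remaining budget; a smaller section may still fit
        picked.append((position, text))
        used += cost
    # Restore original document order so the excerpt reads naturally.
    return "\n\n".join(text for _, text in sorted(picked))
```

Selecting by priority but emitting in document order is a design choice of this sketch; it keeps the excerpt readable while still guaranteeing that the conclusion survives when the methods section does not.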

For unstructured documents (blog posts, documentation pages, news articles), the document is split into overlapping 600-character chunks with 80-character overlap, each chunk is embedded with sentence-transformers, and the top-K chunks ranked by cosine similarity to the task string are retained. The overlap prevents sentence-boundary artifacts from severing related content.
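The chunk-and-rank mode can be sketched as follows. To stay self-contained, the example swaps in a toy bag-of-words embedding where the harness uses sentence-transformers; the function names, the top_k default, and the "..." separator are likewise assumptions. The chunk size and overlap are the values quoted above.

```python
import math
import re
from collections import Counter
from typing import Callable

Vector = dict[str, float]


def chunk_text(text: str, size: int = 600, overlap: int = 80) -> list[str]:
    """Split text into character windows that share `overlap` characters."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    if not text:
        return []
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def bow_embed(text: str) -> Vector:
    """Toy bag-of-words embedding, standing in for a sentence-transformers model."""
    return dict(Counter(re.findall(r"[a-z0-9]+", text.lower())))


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def semantic_extract(content: str, task: str, top_k: int = 3,
                     embed: Callable[[str], Vector] = bow_embed) -> str:
    """Keep the top_k chunks most similar to the task, in document order."""
    chunks = chunk_text(content)
    task_vec = embed(task)
    ranked = sorted(range(len(chunks)),
                    key=lambda i: cosine(embed(chunks[i]), task_vec), reverse=True)
    keep = sorted(ranked[:top_k])
    return "\n...\n".join(chunks[i] for i in keep)
```

With a 600-character chunk and 80-character overlap the window advances 520 characters at a time, so a 1,000-character document yields two chunks of 600 and 480 characters.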

B4 — Semantic Chunker: Dual-Mode Extraction

Structured documents use section-priority extraction. Unstructured documents use embedding-based chunk retrieval. Both modes operate within a 12,000-character context budget.

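The 12,000-character budget was chosen empirically. Added to the task string (~500 chars), memory context (~3,000 chars), and synthesis instructions (~1,500 chars), it brings the prompt to roughly 17K characters, which leaves a safety margin of ~15K characters for the model's output within a 32K context window. This accounting treats the 32K window as a character count; if the window is measured in tokens, the same prompt typically occupies less of it, so the margin is conservative. The budget is configurable via HARNESS_CHUNK_BUDGET.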
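The arithmetic behind that margin is simple enough to write down, which also shows what happens when HARNESS_CHUNK_BUDGET is raised. The output_margin function is a hypothetical helper; the component sizes are the approximate figures quoted above, and the window is treated as characters to match them.

```python
def output_margin(chunk_budget: int = 12_000, window: int = 32_000) -> int:
    """Characters left for model output after the fixed prompt parts and the document."""
    fixed = {
        "task_string": 500,
        "memory_context": 3_000,
        "synthesis_instructions": 1_500,
    }
    return window - (sum(fixed.values()) + chunk_budget)


print(output_margin())        # 15000
print(output_margin(16_000))  # 11000
```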

B5 — The Vision Bridge

The four patterns above handle text-only pipelines. A common case breaks that assumption: the task string contains an image path. Architecture diagrams, benchmark plots, scanned documents, and annotated screenshots are all legitimate research artifacts that should inform synthesis but cannot be processed by text-only synthesis models.

The Vision Bridge resolves this pre-pipeline, before the Planner-First runs:

import re
from pathlib import Path

def inject_vision_context(task: str) -> str:
    """Replace image paths in task with structured text descriptions."""
    image_exts = {'.png', '.jpg', '.jpeg', '.gif', '.bmp', '.webp'}

    def replace_path(match):
        path = match.group(0)
        if Path(path).suffix.lower() not in image_exts:
            return path  # not an image
        if not Path(path).exists():
            return path  # file not found
        description = _extract_with_vision_model(path)
        return f"[Image: {description}]"

    return re.sub(r'\S+\.\w+', replace_path, task)

The vision model (llama3.2-vision, 11B parameters) receives a structured extraction prompt that requests all visible text in the image, data points from any charts or tables, a layout description, and identified objects. The description replaces the image path in the task context. Every downstream pipeline stage sees only text and is unaware that the task originally referenced an image.

Two constraints are enforced by design. First, the Vision Bridge only processes local file paths, not remote URLs. This keeps credentials or tokens embedded in URLs away from the vision model and means the bridge never fetches an address supplied in the task (the SSRF guard listed in the summary table). Second, the vision model must be accounted for in the Keep-Alive Budget (A4): it is a fourth model slot that competes with the planner, producer, and evaluator for VRAM.

How the Five Patterns Compose

The Section B patterns form a pipeline that runs entirely before the synthesis model is called. The Vision Bridge runs first, converting any image content to text. The Planner-First runs next, converting the task (now text-only) into targeted queries and a knowledge gap list. The Dual-Backend Memory Store is queried in parallel with planning, returning prior observations that populate known_facts. Research rounds run, with each batch scored by the Novelty Gate. Retrieved documents are passed through the Semantic Chunker before being appended to the synthesis prompt.
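The one concurrent step in that sequence, memory retrieval running alongside planning, can be sketched with asyncio.gather. Both coroutines here are stubs with hypothetical names and return values; only the concurrency pattern and the merge into known_facts are the point.

```python
import asyncio


async def plan_task(task: str) -> dict:
    """Stub planner (hypothetical): returns targeted queries for the task."""
    await asyncio.sleep(0)
    return {"search_queries": [f"{task} benchmarks", f"{task} limitations"]}


async def recall(task: str, top_k: int = 4) -> list[str]:
    """Stub memory lookup (hypothetical): returns prior observations."""
    await asyncio.sleep(0)
    return [f"prior observation about {task}"][:top_k]


async def prepare_context(task: str) -> dict:
    # Planning and memory retrieval do not depend on each other,
    # so they can run concurrently before the first search round.
    plan, prior = await asyncio.gather(plan_task(task), recall(task))
    plan["known_facts"] = prior
    return plan


context = asyncio.run(prepare_context("speculative decoding"))
print(context["known_facts"])  # ['prior observation about speculative decoding']
```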

The quality of what reaches the synthesis model is the ceiling for what synthesis can produce. The Wiggum Loop (C1) raises outputs toward that ceiling through iterative revision — but it cannot add information that was never retrieved, fix planning errors that generated the wrong queries, or recover from a context window stuffed with redundant content. Context engineering is upstream of evaluation, and upstream problems are harder to fix.

| Pattern | Failure it prevents | Key config |
|---|---|---|
| B1 Planner-First | Planning failures (22%): wrong queries, missed task core | Planner model size; planning latency budget |
| B2 Novelty Gate | Synthesis failures from redundant context; false Completeness inflation | NOVELTY_THRESHOLD (default 3.0); RESEARCH_CACHE |
| B3 Dual-Backend Memory | Retrieval failures (38%): re-fetching known content; missing prior knowledge | ChromaDB embedding model; FTS5 re-ranking weight |
| B4 Semantic Chunker | Context overflow; naive truncation discarding relevant sections | HARNESS_CHUNK_BUDGET (default 12,000 chars) |
| B5 Vision Bridge | Image paths blocking text-only pipeline; unprocessed visual artifacts | Vision model; only local paths processed (SSRF guard) |

The next post covers Section C — Verification Patterns — including the Wiggum Loop in full detail, the Dimensional Rubric, the Surgical Compressor, and the ReAct Comparator. With the inference substrate (Section A) and context quality (Section B) in place, verification is where quality is measured, scored, and improved.
