May 22, 2026 • 18 min read • Agentic Harness Engineering Series

Context Engineering: What Reaches the Model

Five patterns governing the research half of the pipeline: how tasks are decomposed into targeted queries, how redundant content is filtered before synthesis, and how memory persists knowledge across runs.

A language model synthesizes what it is given. The evaluator scores the synthesis. The evaluator's feedback guides revision. But none of that matters if what the model is given at synthesis time is wrong: incomplete, redundant, stale, or irrelevant. The failure taxonomy showed that retrieval failures are the most common class (38%) and that planning failures are the third most common (22%). Both happen before a single synthesis token is generated.

Section B of the pattern catalog addresses the research half of the pipeline: five patterns that govern what information reaches the synthesis model. The Planner-First (B1) intercepts every task and converts it into targeted search queries. The Novelty Gate (B2) discards result batches that add no new information. The Dual-Backend Memory Store (B3) combines vector search and full-text retrieval to inject relevant prior knowledge. The Semantic Chunker (B4) extracts the most relevant sections from documents too large to fit the context window. The Vision Bridge (B5) converts image paths in the task into structured text before any other processing.

These patterns run before synthesis. Their combined effect: the synthesis model receives a context window that is targeted, non-redundant, enriched by prior runs, and bounded to fit available context. The quality ceiling for synthesis is determined here.

B1 — The Planner-First

The simplest possible research loop issues the task string directly as a search query. This works for narrow factual lookups. It fails for every research task that requires judgment about what to look for — which is most of them. The raw task "survey speculative decoding techniques in transformer inference" is a reasonable human prompt and a poor search query. What the pipeline actually needs is: what aspect of speculative decoding? Which deployment context? Which model families? What does the agent already know?

The Planner-First intercepts every task before any search query is issued and produces four outputs:

from dataclasses import dataclass

@dataclass
class Plan:
    search_queries: list[str]   # concrete targeted queries, never the raw task
    known_facts: list[str]      # already in memory — skip re-fetching
    knowledge_gaps: list[str]   # what must be found
    subtasks: list[str]         # decomposition hint for orchestrator

The agent loop uses plan.search_queries exclusively. The raw task string never hits a search API. A typical plan converts one vague task into three to five targeted queries that each address a specific knowledge gap — methods, benchmarks, deployment considerations, limitations.
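One way to enforce the rule that the raw task never reaches a search API is to filter it out when the planner's output is parsed. The sketch below assumes the planner returns JSON with the same four field names; `parse_plan` is a hypothetical helper, and `Plan` is repeated with defaults only so the example stands alone.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Plan:
    search_queries: list[str] = field(default_factory=list)
    known_facts: list[str] = field(default_factory=list)
    knowledge_gaps: list[str] = field(default_factory=list)
    subtasks: list[str] = field(default_factory=list)


def parse_plan(raw_json: str, task: str) -> Plan:
    """Build a Plan from planner JSON (assumed format), dropping echoed tasks."""
    data = json.loads(raw_json)
    normalized_task = task.strip().lower()
    # Keep only non-empty queries that differ from the raw task string.
    queries = [
        q.strip()
        for q in data.get("search_queries", [])
        if q.strip() and q.strip().lower() != normalized_task
    ]
    if not queries:
        raise ValueError("planner produced no targeted queries")
    return Plan(
        search_queries=queries,
        known_facts=list(data.get("known_facts", [])),
        knowledge_gaps=list(data.get("knowledge_gaps", [])),
        subtasks=list(data.get("subtasks", [])),
    )
```

Raising on an empty query list, rather than falling back to the task string, keeps the failure visible as a planning failure instead of silently degrading to the naive loop.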

B1 — Planner-First: Task Decomposition Flow

The planner converts one task string into targeted search queries and a knowledge gap list. The raw task string never reaches a search API.

Logged run comparisons show that planner-guided search reduces the number of Wiggum Loop rounds needed to pass by an average of 0.8 and cuts total token consumption by ~22%. Planning latency — 12–18 seconds for a 9B model — is the cost. That cost is dominated by model load time on cold starts; a warm planner (see Keep-Alive Budget, A4) typically returns the plan in under 5 seconds.

B2 — The Novelty Gate

The Planner-First generates better queries. It doesn't prevent result batches from being redundant with each other. A pipeline that runs five search rounds and accumulates all results before synthesis will almost always present the model with a context window where the same core concepts appear four to six times, rephrased but not substantively different. The evaluator's Completeness dimension is fooled by this: high citation count reads as comprehensive coverage, and the composite score inflates even though no new information was added.

The Novelty Gate scores each batch against the accumulated knowledge state before that batch reaches the synthesis prompt:

async def gather_research(task: str, plan: Plan) -> list[SearchResult]:
    knowledge_state = plan.known_facts.copy()
    passing_results = []

    for query in plan.search_queries:
        batch = await search(query)
        novelty = await memory.assess_novelty(batch, knowledge_state)

        if novelty < NOVELTY_THRESHOLD:  # default: 3.0
            trace.record_tool_call(query, novelty_score=novelty, discarded=True)
            continue

        passing_results.extend(batch)
        knowledge_state.extend([r.snippet for r in batch])

    return passing_results

B2 — Novelty Gate: Incremental Saturation Scoring

Each result batch is scored against accumulated knowledge. Batches below the novelty threshold are discarded before synthesis. The threshold itself is fixed, but because the knowledge state grows with every passing batch, later batches effectively face a higher bar.

Two implementation details matter. First, the threshold applies incrementally: after each passing batch, the knowledge state is updated. Later batches that overlap with earlier passing content get penalized. Early batches retrieve foundational content; later batches need to introduce something genuinely new to pass. Second, RESEARCH_CACHE=1 caches results keyed by query string so Wiggum revision rounds don't re-fetch content that was already retrieved and assessed in round one.
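The incremental behavior can be illustrated with a toy scorer. The article does not say how `memory.assess_novelty` computes its score, so the version below is a stand-in: it measures the share of a batch's distinct tokens not yet present in the knowledge state and scales it to 0-10 (the scale is an assumption). `gate` is a hypothetical synchronous analogue of `gather_research`.

```python
import re

NOVELTY_THRESHOLD = 3.0  # default named in the article; the 0-10 scale is assumed


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def assess_novelty(batch: list[str], knowledge_state: list[str]) -> float:
    """Toy score: 10 * fraction of the batch's distinct tokens not seen before."""
    seen: set[str] = set()
    for fact in knowledge_state:
        seen |= _tokens(fact)
    batch_tokens: set[str] = set()
    for snippet in batch:
        batch_tokens |= _tokens(snippet)
    if not batch_tokens:
        return 0.0
    return 10.0 * len(batch_tokens - seen) / len(batch_tokens)


def gate(batches: list[list[str]], known_facts: list[str]) -> list[str]:
    """Keep batches that clear the threshold; grow the state after each pass."""
    state = list(known_facts)
    kept: list[str] = []
    for batch in batches:
        if assess_novelty(batch, state) < NOVELTY_THRESHOLD:
            continue  # discarded: adds too little beyond what is already known
        kept.extend(batch)
        state.extend(batch)
    return kept
```

Because `state` only grows when a batch passes, a rephrased repeat of an earlier batch scores low and is dropped, while a batch on a new subtopic still clears the same fixed threshold.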

Novelty scores for each search round are recorded in runs.jsonl tool call records. A run where every round scores above 7.0 suggests the planner is generating highly diverse queries. A run where rounds 2–5 all score below 3.5 suggests either the topic is narrow (expected) or the planner is generating near-duplicate queries (a planning failure pattern).
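The two reading rules above can be written down as a small check. This is a sketch only: `diagnose_novelty` and its labels are hypothetical, and it takes a plain list of per-round scores rather than parsing runs.jsonl, whose record layout is not shown here.

```python
def diagnose_novelty(scores: list[float]) -> str:
    """Classify a run from its per-round novelty scores (round 1 first)."""
    if scores and all(s > 7.0 for s in scores):
        return "diverse"      # every round brought in substantially new content
    later_rounds = scores[1:5]  # rounds 2-5
    if later_rounds and all(s < 3.5 for s in later_rounds):
        return "saturated"    # narrow topic or near-duplicate queries
    return "mixed"
```

A "saturated" result does not distinguish a narrow topic from a planning failure; that still requires looking at the queries themselves.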

B3 — The Dual-Backend Memory Store

Research pipelines that run in isolation are inefficient. The same foundational concepts get re-retrieved on every run. The Dual-Backend Memory Store makes prior runs available as retrievable context, injecting the four most relevant prior observations into the synthesis prompt before any new search occurs.

Neither vector search nor full-text search alone is sufficient for knowledge retrieval. Vector search finds semantically related content even when no words overlap, which is useful for concept-level similarity. Full-text search finds exact keyword matches, which is indispensable for proper nouns, model names, and technical terms that embeddings represent poorly. The two backends have complementary blind spots, which is why the store uses both:

def get_context(task: str, top_k: int = 4) -> list[Observation]:
    # Stage 1: semantic similarity via ChromaDB
    chroma_hits = chroma_collection.query(
        query_texts=[task],
        n_results=top_k * 3  # oversample for re-ranking
    )
    # ChromaDB returns distances, not similarities; convert per candidate
    # (assumes the collection is configured for cosine distance)
    candidates = []
    for meta, dist in zip(chroma_hits["metadatas"][0], chroma_hits["distances"][0]):
        obs = Observation(**meta)
        obs.similarity = 1.0 - dist
        candidates.append(obs)

    # Stage 2: keyword re-ranking via SQLite FTS5
    fts_scores = fts_db.execute(
        "SELECT rowid, rank FROM observations_fts WHERE observations_fts MATCH ?",
        [fts_query(task)]
    ).fetchall()
    # FTS5 rank is more negative for better matches, so use its magnitude
    fts_map = {row[0]: abs(row[1]) for row in fts_scores}

    for obs in candidates:
        obs.score = obs.similarity + 0.4 * fts_map.get(obs.rowid, 0)

    return sorted(candidates, key=lambda o: o.score, reverse=True)[:top_k]

B3 — Dual-Backend Memory Store: Hybrid Retrieval

ChromaDB handles semantic similarity; SQLite FTS5 handles exact keyword matching. Both are queried at retrieval time and results are re-ranked using a combined score.
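The combined score is easiest to see with the two backends stubbed out. The sketch below keeps only the re-ranking arithmetic from `get_context`; `Candidate` and `rerank` are hypothetical names, and the similarity and keyword scores are made-up inputs rather than real ChromaDB or FTS5 output.

```python
from dataclasses import dataclass

FTS_WEIGHT = 0.4  # keyword weight used in the article's get_context


@dataclass
class Candidate:
    rowid: int
    text: str
    similarity: float  # semantic score from the vector backend
    score: float = 0.0


def rerank(candidates: list[Candidate], fts_scores: dict[int, float],
           top_k: int = 4) -> list[Candidate]:
    """Combine semantic similarity with a weighted keyword score, keep top_k."""
    for c in candidates:
        # Candidates with no keyword hit contribute 0 from the FTS side.
        c.score = c.similarity + FTS_WEIGHT * fts_scores.get(c.rowid, 0.0)
    return sorted(candidates, key=lambda c: c.score, reverse=True)[:top_k]


candidates = [
    Candidate(1, "general notes on decoding speed", 0.80),
    Candidate(2, "observation naming a specific model", 0.70),
    Candidate(3, "loosely related background", 0.60),
]
top = rerank(candidates, fts_scores={2: 0.5, 3: 0.1}, top_k=2)
```

Here the second candidate overtakes the first because an exact keyword hit adds 0.2 to its score, which is the behavior the hybrid design is after for model names and other exact terms.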

Every completed run writes an observation to both backends. The write path includes an injection scan (see Injection Scanner, E3) before any content reaches the memory store — injected observations have persistent effects across all future runs that retrieve them. A novelty check at write time prevents storing observations that are redundant with the existing store.
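The ordering of the write path (scan first, then novelty check, then write to both backends) can be sketched as follows. Everything here is a placeholder: `looks_injected` is a toy stand-in for the Injection Scanner (E3), `is_redundant` is a toy write-time novelty check, and plain lists stand in for ChromaDB and the FTS5 table.

```python
def looks_injected(text: str) -> bool:
    """Toy stand-in for the Injection Scanner; the real checks are not shown."""
    return "ignore previous instructions" in text.lower()


def is_redundant(text: str, stored: list[str]) -> bool:
    """Toy novelty check: exact match after case and whitespace normalization."""
    key = " ".join(text.lower().split())
    return any(" ".join(s.lower().split()) == key for s in stored)


def write_observation(text: str, vector_store: list[str], fts_store: list[str],
                      scan=looks_injected, redundant=is_redundant) -> bool:
    """Return True if the observation was written to both stores."""
    if scan(text):
        return False  # never persist content flagged as injected
    if redundant(text, vector_store):
        return False  # nothing new to add to the store
    vector_store.append(text)
    fts_store.append(text)
    return True
```

Running the scan before anything else matters because a rejected observation should leave no trace in either backend.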

The practical effect: the second time the pipeline runs a task on a topic it has seen before, the synthesis prompt arrives enriched with structured summaries of prior findings. The planner's known_facts field draws from memory context, allowing it to skip re-fetching foundational content. Over hundreds of runs on a topic domain, the pipeline accumulates a navigable knowledge base without any explicit knowledge base management.

B4 — The Semantic Chunker

The Novelty Gate and memory retrieval ensure the right documents are selected. The Semantic Chunker ensures the right sections of those documents are extracted. A 40,000-character academic paper cannot fit in a 32K context window alongside memory context, task instructions, and output formatting. Naive truncation discards the conclusion and results sections — often the most relevant parts — in favor of the introduction.

The Chunker auto-detects document structure and applies the appropriate extraction mode:

import re

def extract(content: str, task: str, budget: int = 12_000) -> str:
    headings = re.findall(r'^#{2,3}\s+.+', content, re.MULTILINE)

    if len(headings) >= 3:
        # Structured document: section-priority extraction
        return _structured_extract(content, budget)
    else:
        # Unstructured: embedding-based chunk retrieval
        return _semantic_extract(content, task, budget)

SECTION_PRIORITY = [
    "abstract", "conclusion", "introduction",
    "results", "findings", "methods", "discussion", "other"
]

For structured documents (academic papers, technical reports), sections are extracted in the order given by SECTION_PRIORITY (Abstract first, then Conclusion, Introduction, Results, Findings, Methods, Discussion, and finally everything else) until the 12,000-character budget is exhausted. The ordering reflects what research synthesis needs most: the abstract states the contribution, the conclusion evaluates it, the introduction contextualizes it. Methods sections matter less for most synthesis tasks.
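A minimal sketch of section-priority extraction, assuming markdown-style `##`/`###` headings as in `extract`. The body of `_structured_extract` is not shown in the article, so `structured_extract` below is one plausible reading: label each section by the first priority name found in its heading, then emit sections in priority order until the budget runs out.

```python
import re

SECTION_PRIORITY = [
    "abstract", "conclusion", "introduction",
    "results", "findings", "methods", "discussion", "other",
]


def structured_extract(content: str, budget: int = 12_000) -> str:
    # re.split with a capture group yields [preamble, heading, body, heading, body, ...]
    parts = re.split(r"^(#{2,3}\s+.+)$", content, flags=re.MULTILINE)
    sections = []
    for i in range(1, len(parts) - 1, 2):
        heading, body = parts[i], parts[i + 1]
        title = heading.lstrip("#").strip().lower()
        label = next((n for n in SECTION_PRIORITY[:-1] if n in title), "other")
        # Sort key: priority rank first, then original position as tie-breaker.
        sections.append((SECTION_PRIORITY.index(label), i, heading + body))

    out, used = [], 0
    for _, _, block in sorted(sections):
        remaining = budget - used
        if remaining <= 0:
            break
        piece = block[:remaining]  # simplification: the last block may be cut mid-text
        out.append(piece)
        used += len(piece)
    return "".join(out)
```

Text before the first heading is ignored in this sketch, and a real implementation would likely cut at a paragraph boundary rather than mid-block.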

For unstructured documents (blog posts, documentation pages, news articles), the document is split into overlapping 600-character chunks with 80-character overlap, each chunk is embedded with sentence-transformers, and the top-K chunks ranked by cosine similarity to the task string are retained. The overlap prevents sentence-boundary artifacts from severing related content.
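The chunk-and-rank step can be sketched without any model dependency. The chunk size and overlap below are the article's figures; the embedding is a deliberate stand-in (a bag-of-words count vector instead of sentence-transformers), and `chunk`, `embed`, `cosine`, and `top_chunks` are hypothetical names.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 600, overlap: int = 80) -> list[str]:
    """Split text into windows of `size` chars that share `overlap` chars."""
    if not text:
        return []
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def embed(text: str) -> Counter:
    # Stand-in embedding: token counts. A real pipeline would use a sentence encoder.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_chunks(text: str, task: str, k: int = 3,
               size: int = 600, overlap: int = 80) -> list[str]:
    """Return the k chunks most similar to the task string."""
    query = embed(task)
    ranked = sorted(chunk(text, size, overlap),
                    key=lambda c: cosine(embed(c), query), reverse=True)
    return ranked[:k]
```

With a 600-character window and 80-character overlap the window advances 520 characters at a time, so a sentence that straddles one boundary appears intact in at least one chunk as long as it is shorter than the overlap.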

B4 — Semantic Chunker: Dual-Mode Extraction

Structured documents use section-priority extraction. Unstructured documents use embedding-based chunk retrieval. Both modes operate within a 12,000-character context budget.

The 12,000-character budget was chosen empirically. Counting the 32K context window in characters, as these figures do, the task string (~500 chars), memory context (~3,000 chars), synthesis instructions (~1,500 chars), and extracted document content (12,000 chars) total roughly 17,000 characters, which leaves a safety margin of ~15K characters for the model's output. The budget is configurable via HARNESS_CHUNK_BUDGET.
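The budget arithmetic is short enough to check directly. The sketch assumes the 32K window is counted as 32,000 characters, which is what the quoted figures imply; the component sizes are the approximate values from the paragraph above, and `output_margin` is a hypothetical helper.

```python
import os

CONTEXT_WINDOW_CHARS = 32_000  # assumption: "32K" read as characters

FIXED_COMPONENTS = {
    "task_string": 500,
    "memory_context": 3_000,
    "synthesis_instructions": 1_500,
}


def output_margin(chunk_budget: int) -> int:
    """Characters left for model output after all prompt components."""
    used = sum(FIXED_COMPONENTS.values()) + chunk_budget
    return CONTEXT_WINDOW_CHARS - used


# HARNESS_CHUNK_BUDGET is the override named in the article; 12,000 is its default.
chunk_budget = int(os.environ.get("HARNESS_CHUNK_BUDGET", "12000"))
margin = output_margin(chunk_budget)
```

Raising the chunk budget trades output headroom for document coverage one for one, so doubling it to 24,000 would leave only about 3,000 characters for the synthesis itself.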

Every assembled block — whether from section extraction or chunk retrieval — carries an inline provenance tag in the format [source:file.pdf | url:https://... | p.3 | ¶12 | §Abstract | @4,567]. These tags are embedded as header lines before each extracted block so the synthesis model can cite specific passages rather than paraphrasing the document as a whole. The tag includes the source filename, web URL (if applicable), estimated page number, paragraph index, section label (structured mode only), and character offset in the original document.
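A formatter for the tag might look like the sketch below. The field order and separators follow the format quoted above; `provenance_tag` and its parameter names are hypothetical, and the page and paragraph numbers are passed in because the article does not describe how they are estimated.

```python
from typing import Optional


def provenance_tag(source: str, page: int, paragraph: int, offset: int,
                   url: Optional[str] = None, section: Optional[str] = None) -> str:
    """Build a header line identifying where an extracted block came from."""
    parts = [f"source:{source}"]
    if url:
        parts.append(f"url:{url}")        # only for web-sourced documents
    parts.append(f"p.{page}")
    parts.append(f"¶{paragraph}")
    if section:
        parts.append(f"§{section}")       # structured mode only
    parts.append(f"@{offset:,}")          # character offset with thousands separator
    return "[" + " | ".join(parts) + "]"


tag = provenance_tag("file.pdf", page=3, paragraph=12, offset=4567, section="Abstract")
```

Optional fields are omitted rather than left blank, so a tag from unstructured mode simply has no section segment.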

The unstructured retrieval path uses a chromadb.EphemeralClient() — an in-memory ChromaDB instance created per document and discarded after extraction. This avoids polluting the persistent memory store with document chunks that are relevant only to a single retrieval operation. If ChromaDB is unavailable (not installed or import error), the chunker falls back to a head-plus-tail truncation: 60% of the budget from the document start and 40% from the end, which preserves the introduction and conclusion at the cost of the middle.
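The head-plus-tail fallback is simple enough to sketch in full. The 60/40 split is the article's; `head_tail_truncate` is a hypothetical name, and integer arithmetic is used so the two slices always sum to exactly the budget.

```python
def head_tail_truncate(content: str, budget: int = 12_000,
                       head_percent: int = 60) -> str:
    """Keep the first 60% of the budget from the start and the rest from the end."""
    if len(content) <= budget:
        return content  # already fits; nothing to drop
    head_len = budget * head_percent // 100
    tail_len = budget - head_len
    # Slice the tail by explicit index so tail_len == 0 yields an empty string.
    return content[:head_len] + content[len(content) - tail_len:]
```

The middle of the document is lost entirely in this mode, which is why it is a fallback rather than the default path.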

B5 — The Vision Bridge

The four patterns above handle text-only pipelines. A common case breaks that assumption: the task string contains an image path. Architecture diagrams, benchmark plots, scanned documents, and annotated screenshots are all legitimate research artifacts that should inform synthesis but cannot be processed by text-only synthesis models.

The Vision Bridge resolves this pre-pipeline, before the Planner-First runs:

import re
from pathlib import Path

def inject_vision_context(task: str) -> str:
    """Replace image paths in task with structured text descriptions."""
    image_exts = {'.png', '.jpg', '.jpeg', '.gif', '.bmp', '.webp'}

    def replace_path(match):
        path = match.group(0)
        if Path(path).suffix.lower() not in image_exts:
            return path  # not an image
        if not Path(path).exists():
            return path  # file not found
        description = _extract_with_vision_model(path)
        return f"[Image: {description}]"

    return re.sub(r'\S+\.\w+', replace_path, task)

The vision model (llama3.2-vision, 11B parameters) receives a structured extraction prompt that requests: all visible text in the image, data points from any charts or tables, a layout description, and identified objects. The description replaces the image path in the task context. Downstream pipeline stages see only text and are unaware that the task originally referenced an image.

Two constraints are enforced by design. First, the Vision Bridge only processes local file paths, not remote URLs. This keeps the harness from fetching arbitrary remote resources and keeps credentials or tokens embedded in URLs from being passed to the vision model. Second, the vision model must be accounted for in the Keep-Alive Budget (A4): it is a fourth model slot that competes with the planner, producer, and evaluator for VRAM.
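The local-only constraint can be made explicit with a guard that runs before any file is opened. This is a sketch under assumptions: `is_local_image` is a hypothetical helper, and treating any token containing `://` (or starting with `data:`) as remote is one simple rule, not necessarily the one the harness uses.

```python
from pathlib import Path

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".webp"}


def is_local_image(token: str) -> bool:
    """True only for existing local files with a known image extension."""
    lowered = token.lower()
    if "://" in lowered or lowered.startswith("data:"):
        return False  # remote or inline URL: never handed to the vision model
    path = Path(token)
    if path.suffix.lower() not in IMAGE_EXTS:
        return False
    return path.is_file()
```

Checking the URL form first means a token such as `https://host/plot.png?token=...` is rejected on its shape alone, before any filesystem or network access is attempted.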

How the Five Patterns Compose

The Section B patterns form a pipeline that runs entirely before the synthesis model is called. The Vision Bridge runs first, converting any image content to text. The Planner-First runs next, converting the task (now text-only) into targeted queries and a knowledge gap list. The Dual-Backend Memory Store is queried in parallel with planning, returning prior observations that populate known_facts. Research rounds run, with each batch scored by the Novelty Gate. Retrieved documents are passed through the Semantic Chunker before being appended to the synthesis prompt.
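The ordering described above can be sketched as one coroutine with each pattern passed in as a callable. All names here (`build_context` and its parameters) are hypothetical, and the stages are reduced to their inputs and outputs; the point is only the sequence and the parallel planner and memory step.

```python
import asyncio


async def build_context(task, *, vision, plan, recall, search, is_novel, chunk):
    """Assemble synthesis context in the order B5, then B1 with B3, B2, B4."""
    text_task = vision(task)  # B5: image paths become text first

    # B1 and B3 run concurrently: queries from the planner, prior
    # observations from the memory store.
    queries, prior = await asyncio.gather(plan(text_task), recall(text_task))

    context = list(prior)
    state = list(prior)
    for query in queries:
        batch = await search(query)
        if not is_novel(batch, state):  # B2: drop batches that add nothing new
            continue
        state.extend(batch)
        # B4: only the extracted portion of each document reaches synthesis.
        context.extend(chunk(doc, text_task) for doc in batch)
    return context
```

Passing the stages in as arguments keeps the sketch testable with stubs and makes the dependency order explicit: nothing downstream of the Vision Bridge ever sees the original task string.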

The quality of what reaches the synthesis model is the ceiling for what synthesis can produce. The Wiggum Loop (C1) raises outputs toward that ceiling through iterative revision — but it cannot add information that was never retrieved, fix planning errors that generated the wrong queries, or recover from a context window stuffed with redundant content. Context engineering is upstream of evaluation, and upstream problems are harder to fix.

Pattern	Failure it prevents	Key config
B1 Planner-First	Planning failures (22%): wrong queries, missed task core	Planner model size; planning latency budget
B2 Novelty Gate	Synthesis failures from redundant context; false Completeness inflation	`NOVELTY_THRESHOLD` (default 3.0); `RESEARCH_CACHE`
B3 Dual-Backend Memory	Retrieval failures (38%): re-fetching known content; missing prior knowledge	ChromaDB embedding model; FTS5 re-ranking weight
B4 Semantic Chunker	Context overflow; naive truncation discarding relevant sections	`HARNESS_CHUNK_BUDGET` (default 12,000 chars)
B5 Vision Bridge	Image paths blocking text-only pipeline; unprocessed visual artifacts	Vision model; only local paths processed (SSRF guard)

The next post covers Section C — Verification Patterns — including the Wiggum Loop in full detail, the Dimensional Rubric, the Surgical Compressor, and the ReAct Comparator. With the inference substrate (Section A) and context quality (Section B) in place, verification is where quality is measured, scored, and improved.
