May 26, 2026 • 5 min read • Agentic Harness Engineering

Mining a Ground-Truth Knowledge Base for the Eval Suite

Five deep research runs with the novelty gate disabled build an authoritative reference document for each eval task. The eval suite injects these documents as file_context, giving the producer verified facts instead of search snippets alone.

The eval suite runs standardized tasks to measure the harness's output quality. But the producer model only has access to what it can find in a few search rounds. For tasks with specific technical requirements — "include real library names, version numbers, actual API signatures" — DDGS snippets often fall short on depth and grounding. mine_knowledge.py solves this by pre-building a reference document for each eval task using the harness's own deep research mode.

The five eval tasks

T_A

Context engineering techniques

Top 5 techniques in production LLM agents. Real library names, API signatures, working code examples, concrete production trade-offs.

T_B

Cost envelope management

Best practices for token budget management. Real pricing, caching APIs, model routing patterns, budget enforcement code, and monitoring tooling.

T_C

Multi-agent failure modes

The three most common failure modes. Real incident examples, detection patterns, mitigation code, specific framework behaviors in LangGraph, AutoGen, and CrewAI.

T_D

Context window management

Top 3 strategies with benchmarked trade-offs. Chunking library APIs, working RAG code, and summarization examples using real models.

T_E

Prompt injection defense

Best practices with working code. Covers Rebuff, Guardrails AI, OWASP LLM Top 10 references, and production detection patterns.

Each task description is written to be source-specific: "include real library names, version numbers, actual API signatures." This is the grounding level the eval rubric rewards — outputs that cite specific tools and working code rather than describing patterns abstractly.

Deep research mode

Each mining run invokes agent.py as a subprocess with two key switches, the --no-wiggum flag and the /deep prefix on the task:

agent.py + --no-wiggum + /deep <task> → knowledge_base/T_X.md

/deep forces the research loop to run all MAX_SEARCH_ROUNDS rounds and disables the novelty gate. In normal operation, the loop stops early when new results score below the novelty threshold. With /deep, every round runs regardless, so the loop ends at the hard cap rather than when diminishing returns are detected. For topics like "prompt injection defense", this surfaces sources that a standard research run would stop before reaching.
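The difference between the gated loop and the /deep loop can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the values of MAX_SEARCH_ROUNDS and NOVELTY_THRESHOLD, the fraction_unseen scoring function, and the search_round callback are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

MAX_SEARCH_ROUNDS = 5      # assumed value; the post names the constant only
NOVELTY_THRESHOLD = 0.3    # assumed value


def fraction_unseen(results: List[str], seen: List[str]) -> float:
    """Hypothetical novelty score: share of results not seen in earlier rounds."""
    if not results:
        return 0.0
    return sum(1 for r in results if r not in seen) / len(results)


def research_loop(
    search_round: Callable[[int], List[str]],
    deep: bool = False,
) -> Tuple[List[str], int]:
    """Run search rounds; return the unique results and the number of rounds run."""
    seen: List[str] = []
    rounds_run = 0
    for round_index in range(MAX_SEARCH_ROUNDS):
        results = search_round(round_index)
        rounds_run += 1
        novelty = fraction_unseen(results, seen)
        for result in results:
            if result not in seen:
                seen.append(result)
        # The novelty gate: stop early on diminishing returns, unless deep mode is on.
        if not deep and novelty < NOVELTY_THRESHOLD:
            break
    return seen, rounds_run
```

With a search callback whose second round returns nothing new, the gated loop stops after two rounds, while deep=True runs to the cap and still picks up a source that only appears in round three.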

--no-wiggum skips the quality evaluation loop. Mining is a one-shot operation that doesn't need iterative revision — it's generating reference material, not a final deliverable. Skipping Wiggum reduces the mining time by 50–70% per task.
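As a rough sketch, the command for one mining run might be assembled like this. The --no-wiggum flag, the /deep prefix, and the knowledge_base/T_X.md path come from the post; everything else is an assumption, in particular that agent.py accepts the prefixed task as a single positional argument and that the caller writes the output file itself.

```python
import sys
from pathlib import Path
from typing import List

KB_DIR = Path("knowledge_base")  # directory name taken from the post


def build_mining_command(task_prompt: str) -> List[str]:
    """Build the argv list for one mining run.

    Assumption: agent.py takes the "/deep <task>" text as one positional argument.
    """
    return [sys.executable, "agent.py", "--no-wiggum", f"/deep {task_prompt}"]


def kb_path(task_id: str) -> Path:
    """Return where the mined reference document for a task would be stored."""
    return KB_DIR / f"{task_id}.md"
```

Passing the command as a list rather than a shell string avoids quoting problems when the task description contains quotes or commas.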

Injection at eval time

When the eval suite runs a task for which a knowledge base file exists, it injects the file as file_context into the producer's prompt. The effect is significant: instead of relying on whatever DDGS returns in 2–3 search rounds, the producer has a pre-verified reference document with specific API names, real code, and authoritative sources already assembled.
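The injection step might look something like the sketch below. The function names and the prompt layout are hypothetical; the post only states that the file is injected as file_context when it exists.

```python
from pathlib import Path
from typing import Optional


def load_file_context(task_id: str, kb_dir: Path = Path("knowledge_base")) -> Optional[str]:
    """Return the mined reference document for a task, or None if it was never mined."""
    path = kb_dir / f"{task_id}.md"
    if path.is_file():
        return path.read_text(encoding="utf-8")
    return None


def build_producer_prompt(task_description: str, file_context: Optional[str] = None) -> str:
    """Prepend the reference document to the task when one is available.

    The section labels below are illustrative, not the harness's real prompt format.
    """
    if file_context is None:
        return task_description
    return (
        "Reference material (file_context):\n"
        f"{file_context}\n\n"
        "Task:\n"
        f"{task_description}"
    )
```

A task without a KB file falls back to the bare task description, so the producer relies on live search alone in that case.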

This creates a fair comparison baseline. An autoresearch experiment that beats the control by improving the synthesis instruction is only meaningful if the producer had access to the same raw information both times. The knowledge base makes the information ceiling stable across experiments.

Status and maintenance

# Show which KB files exist and their age
python mine_knowledge.py --status

T_A    T_A.md             18432  2026-05-31 14:22
T_B    T_B.md             21104  2026-05-31 14:38
T_C    T_C.md             19856  2026-05-31 15:01
T_D    T_D.md             17280  2026-05-31 15:19
T_E    (not mined)            --

# Mine a single task (T_E is missing above)
python mine_knowledge.py T_E

# Re-mine everything — refresh stale KB files
python mine_knowledge.py

KB files go stale naturally: API signatures change, libraries release new versions, frameworks add features. The --status display shows a timestamp for each file, so you can spot KB files that are months old and might be citing deprecated APIs. There's no automated staleness check; the decision to re-mine is manual, based on how much the topic landscape has shifted.
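If you wanted a helper to shortlist re-mining candidates, a sketch along these lines could work. It is not part of mine_knowledge.py as described here: the function names, the 90-day cutoff, and the use of file modification time as the age are all assumptions, and the final decision would remain manual.

```python
import time
from pathlib import Path
from typing import List, Optional, Tuple

TASK_IDS = ["T_A", "T_B", "T_C", "T_D", "T_E"]  # task IDs from the post

# (task_id, size_bytes, age_days); size and age are None when the file is missing
StatusRow = Tuple[str, Optional[int], Optional[float]]


def kb_status(kb_dir: Path, now: Optional[float] = None) -> List[StatusRow]:
    """Collect size and age for each task's KB file, using mtime as the age source."""
    current = time.time() if now is None else now
    rows: List[StatusRow] = []
    for task_id in TASK_IDS:
        path = kb_dir / f"{task_id}.md"
        if path.is_file():
            stat = path.stat()
            rows.append((task_id, stat.st_size, (current - stat.st_mtime) / 86400))
        else:
            rows.append((task_id, None, None))
    return rows


def remine_candidates(rows: List[StatusRow], max_age_days: float = 90.0) -> List[str]:
    """List tasks that were never mined or are older than the (assumed) cutoff."""
    return [task_id for task_id, _size, age in rows if age is None or age > max_age_days]
```

Age alone is a weak signal: a fast-moving topic can go stale within weeks, while a stable one may stay accurate well past any fixed cutoff.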

Mining is the only harness operation that spawns agent.py as a subprocess rather than calling it as a module. This isolation is intentional: deep research runs are long (10–20 minutes per task) and might exhaust memory or get stuck in runaway loops. Running in a subprocess means a mining failure doesn't crash the parent process and can be detected by a non-zero return code.
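The isolation pattern described above can be sketched with the standard library's subprocess.run. The run_isolated name and the timeout are assumptions added for illustration; the post mentions only detection via the return code.

```python
import subprocess
import sys
from typing import List, Optional


def run_isolated(command: List[str], timeout_seconds: Optional[float] = None) -> bool:
    """Run a command in a child process and report success.

    A crash or non-zero exit in the child is reported as False instead of
    propagating into the parent. The timeout is an assumed extra guard
    against runaway loops.
    """
    try:
        completed = subprocess.run(command, timeout=timeout_seconds, check=False)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left running.
        return False
    return completed.returncode == 0
```

A caller could loop over the task IDs, call run_isolated for each, and carry on with the remaining tasks when one run fails.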

Relation to the eval suite

The knowledge base is one of three context enrichment paths the eval suite can use. The others are the memory store (observations from past runs on similar topics) and live research (DDGS calls during the eval run itself). When all three are available, the producer's context window contains verified reference material, historical observations, and fresh search results — a significantly richer grounding than search alone. For the full eval suite design, see The Regression Harness.
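One way to picture the three paths coming together is a small assembly function like the one below. The function name, the section labels, and the ordering are hypothetical; the post does not describe how the harness actually lays out the producer's context.

```python
from typing import List, Optional


def assemble_context(
    kb_document: Optional[str],
    memory_observations: List[str],
    search_snippets: List[str],
) -> str:
    """Join whichever enrichment sources are available into one context string."""
    sections: List[str] = []
    if kb_document:
        sections.append("Verified reference (knowledge base):\n" + kb_document)
    if memory_observations:
        bullets = "\n".join(f"- {item}" for item in memory_observations)
        sections.append("Observations from past runs:\n" + bullets)
    if search_snippets:
        bullets = "\n".join(f"- {item}" for item in search_snippets)
        sections.append("Fresh search results:\n" + bullets)
    return "\n\n".join(sections)
```

Each source is optional, so the same function covers a run with search alone and a run with all three paths available.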
