Mining a Ground-Truth Knowledge Base for the Eval Suite
Five deep research runs, each with the novelty gate disabled, build authoritative reference documents. The eval suite injects them as file_context when it runs, so the producer works from verified facts instead of relying on search snippets alone.
The eval suite runs standardized tasks to measure the harness's output quality. But the producer model only has access to what it can find in a few search rounds. For tasks with specific technical requirements — "include real library names, version numbers, actual API signatures" — DDGS snippets often fall short on depth and grounding. mine_knowledge.py solves this by pre-building a reference document for each eval task using the harness's own deep research mode.
The five eval tasks
Context engineering techniques
Top 5 techniques in production LLM agents. Real library names, API signatures, working code examples, concrete production trade-offs.
Cost envelope management
Best practices for token budget management. Real pricing, caching APIs, model routing patterns, budget enforcement code, and monitoring tooling.
Multi-agent failure modes
The three most common failure modes. Real incident examples, detection patterns, mitigation code, specific framework behaviors in LangGraph, AutoGen, and CrewAI.
Context window management
Top 3 strategies with benchmarked trade-offs. Chunking library APIs, working RAG code, and summarization examples using real models.
Prompt injection defense
Best practices with working code. Covers Rebuff, Guardrails AI, OWASP LLM Top 10 references, and production detection patterns.
Each task description is written to be source-specific: "include real library names, version numbers, actual API signatures." This is the grounding level the eval rubric rewards — outputs that cite specific tools and working code rather than describing patterns abstractly.
Deep research mode
Each mining run invokes agent.py as a subprocess with two key flags:
/deep forces the research loop to run all MAX_SEARCH_ROUNDS rounds and disables the novelty gate. In normal operation, the loop stops early when new results score below the novelty threshold. With /deep, every round runs regardless: the loop saturates at the hard cap rather than stopping when diminishing returns are detected. For topics like "prompt injection defense", this surfaces sources that a standard research run would stop short of.
--no-wiggum skips the quality evaluation loop. Mining is a one-shot operation that doesn't need iterative revision — it's generating reference material, not a final deliverable. Skipping Wiggum reduces the mining time by 50–70% per task.
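The difference between a standard run and a /deep run can be sketched as follows. This is a minimal illustration, not the harness's actual loop: the function names, the default cap of 5 rounds, and the threshold of 0.3 are all assumptions made for the example. Only the MAX_SEARCH_ROUNDS name and the novelty-gate behavior come from the description above.

```python
from typing import Callable, List, Tuple

# Hypothetical values; the real cap and threshold live in the harness.
MAX_SEARCH_ROUNDS = 5
NOVELTY_THRESHOLD = 0.3


def research_loop(
    search_round: Callable[[int], List[str]],
    novelty: Callable[[List[str], List[str]], float],
    deep: bool = False,
) -> Tuple[List[str], int]:
    """Run search rounds and return (collected results, rounds actually run).

    In standard mode the novelty gate ends the loop once a round scores
    below NOVELTY_THRESHOLD. In deep mode the gate is skipped, so the loop
    always runs until the MAX_SEARCH_ROUNDS hard cap.
    """
    collected: List[str] = []
    rounds_run = 0
    for round_index in range(MAX_SEARCH_ROUNDS):
        results = search_round(round_index)
        score = novelty(results, collected)
        collected.extend(results)
        rounds_run += 1
        # Novelty gate: only consulted when deep mode is off.
        if not deep and score < NOVELTY_THRESHOLD:
            break
    return collected, rounds_run


# Toy stand-ins for the example: the search repeats itself from round 1 on.
def repeating_search(round_index: int) -> List[str]:
    return [f"source-{min(round_index, 1)}"]


def fraction_unseen(results: List[str], seen: List[str]) -> float:
    if not results:
        return 0.0
    return sum(r not in seen for r in results) / len(results)


_, standard_rounds = research_loop(repeating_search, fraction_unseen)
_, deep_rounds = research_loop(repeating_search, fraction_unseen, deep=True)
```

With these toy inputs the standard run stops at the first round that returns nothing new, while the deep run continues to the cap.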
Injection at eval time
When the eval suite runs a task for which a knowledge base file exists, it injects the file as file_context into the producer's prompt. The effect is significant: instead of relying on whatever DDGS returns in 2–3 search rounds, the producer has a pre-verified reference document with specific API names, real code, and authoritative sources already assembled.
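A rough sketch of that lookup-and-inject step is shown below. The `<task_id>.md` naming matches the file names in the --status output, but the directory argument, the function names, and the prompt layout are assumptions for illustration rather than the eval suite's real code.

```python
from pathlib import Path
from typing import Optional


def load_kb(task_id: str, kb_dir: Path) -> Optional[str]:
    """Return the knowledge base text for a task, or None if it was never mined."""
    kb_file = kb_dir / f"{task_id}.md"
    if not kb_file.is_file():
        return None
    return kb_file.read_text(encoding="utf-8")


def build_producer_prompt(task: str, file_context: Optional[str]) -> str:
    """Prepend the reference document when one exists; otherwise pass the task through."""
    if file_context is None:
        return task
    # Hypothetical layout: the real prompt template may differ.
    return f"Reference material (file_context):\n{file_context}\n\nTask:\n{task}"
```

A task with no mined file simply falls back to the plain task prompt, so the producer depends on live search for that task only.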
This creates a fair comparison baseline. An autoresearch experiment that beats the control by improving the synthesis instruction is only meaningful if the producer had access to the same raw information both times. The knowledge base makes the information ceiling stable across experiments.
Status and maintenance
# Show which KB files exist and their age
python mine_knowledge.py --status
T_A T_A.md 18432 2026-05-31 14:22
T_B T_B.md 21104 2026-05-31 14:38
T_C T_C.md 19856 2026-05-31 15:01
T_D T_D.md 17280 2026-05-31 15:19
T_E (not mined) --
# Mine a single task (T_E is missing above)
python mine_knowledge.py T_E
# Re-mine everything — refresh stale KB files
python mine_knowledge.py
KB files go stale naturally: API signatures change, libraries release new versions, frameworks add features. The --status display shows each file's modification time so you can spot KB files that are months old and might be citing deprecated APIs. There's no automated staleness check. The decision to re-mine is manual, based on how much the topic landscape has shifted.
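If a rough automated hint were wanted alongside the manual review, file modification times are enough to flag candidates. The sketch below is not part of mine_knowledge.py; the function names, the directory argument, and the 90-day default are assumptions chosen for the example.

```python
import time
from pathlib import Path
from typing import Iterable, List, Optional

SECONDS_PER_DAY = 86_400


def kb_age_days(kb_file: Path, now: Optional[float] = None) -> Optional[float]:
    """Return the file's age in days from its mtime, or None if it does not exist."""
    if not kb_file.is_file():
        return None
    current = time.time() if now is None else now
    return (current - kb_file.stat().st_mtime) / SECONDS_PER_DAY


def stale_tasks(
    task_ids: Iterable[str],
    kb_dir: Path,
    max_age_days: float = 90.0,  # hypothetical cutoff
    now: Optional[float] = None,
) -> List[str]:
    """Return task IDs whose KB file is missing or older than max_age_days."""
    flagged: List[str] = []
    for task_id in task_ids:
        age = kb_age_days(kb_dir / f"{task_id}.md", now)
        if age is None or age > max_age_days:
            flagged.append(task_id)
    return flagged
```

Age alone says nothing about whether the cited APIs actually changed, so a list like this would only narrow down which files are worth a manual look.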
Mining is the only harness operation that spawns agent.py as a subprocess rather than calling it as a module. This isolation is intentional: deep research runs are long (10–20 minutes per task) and might exhaust memory or produce runaway loops. Running in a subprocess means a mining failure doesn't crash the parent process and can be detected from a non-zero return code.
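The isolation pattern can be sketched with the standard library's subprocess module. How the real script passes /deep and --no-wiggum to agent.py is not shown in this document, so the argument layout in `build_mining_command` is an assumption, as are both function names and the optional timeout.

```python
import subprocess
import sys
from typing import List, Optional


def build_mining_command(task_prompt: str, agent_path: str = "agent.py") -> List[str]:
    """Assemble the child command line.

    Assumption: /deep is a prefix on the task prompt and --no-wiggum is a
    regular CLI flag. The real invocation may be shaped differently.
    """
    return [sys.executable, agent_path, f"/deep {task_prompt}", "--no-wiggum"]


def run_isolated(command: List[str], timeout_s: Optional[float] = None) -> bool:
    """Run a command in a child process and report success.

    A crash, a non-zero exit code, or a timeout in the child is reported as
    False; none of them raise in the parent process.
    """
    try:
        completed = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left running.
        return False
    return completed.returncode == 0
```

A caller would loop over the tasks, call `run_isolated(build_mining_command(prompt))` for each, and record which ones returned False for a later retry.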
Relation to the eval suite
The knowledge base is one of three context enrichment paths the eval suite can use. The others are the memory store (observations from past runs on similar topics) and live research (DDGS calls during the eval run itself). When all three are available, the producer's context window contains verified reference material, historical observations, and fresh search results — a significantly richer grounding than search alone. For the full eval suite design, see The Regression Harness.