Knowledge Graphs and Structured Extraction: Beyond the Vector Store
The harness's Dual-Backend Memory Store uses vector similarity for semantic retrieval and SQLite for structured lookup. Both treat knowledge as a flat store of independent items. The knowledge graph literature argues for a third representation: a connected graph where the relationships between items carry as much signal as the items themselves. This post reviews the evidence and maps it to the memory architecture.
The Flat-Store Problem
Vector similarity retrieval treats each document chunk as an independent embedding. A query retrieves the k nearest neighbors in embedding space. The retrieved chunks are then concatenated and passed to the producer. The core limitation: inter-chunk relationships are discarded. Two chunks from the same document that discuss related concepts are retrieved independently; their connection — the fact that claim A from chunk 1 is the prerequisite for claim B from chunk 2 — is invisible to the retrieval step.
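A minimal sketch of flat top-k retrieval makes the limitation concrete. The toy 2-d embeddings, the chunk ids, and the `top_k` helper are assumptions made for illustration, not the harness's actual code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunks, k=2):
    """Return the ids of the k chunks nearest to the query.

    Each chunk is scored on its own; nothing records that one chunk
    depends on another.
    """
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [c["id"] for c in ranked[:k]]

# Toy embeddings (made up). doc1#2 holds claim B, which depends on claim A in doc1#1.
chunks = [
    {"id": "doc1#1", "vec": [1.0, 0.0]},
    {"id": "doc1#2", "vec": [0.0, 1.0]},
    {"id": "doc2#1", "vec": [0.9, 0.1]},
]

result = top_k([1.0, 0.0], chunks, k=2)
print(result)
```

In this toy setup the query retrieves `doc1#1` and `doc2#1`. The dependent chunk `doc1#2` is never surfaced, because its link to `doc1#1` is not represented anywhere in the store.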
This matters when the task requires integrating information across multiple parts of a document rather than finding the single most relevant chunk. The harness's T_B task (best practices for cost envelope management) is exactly this kind of task: good best-practices synthesis requires connecting multiple principles that are spread across different sections of different documents. The evaluator's depth_r1 score of 6.0–6.1 across all four experiments is consistent with a retrieval layer that fails to surface the connections needed for deep synthesis, although the score alone does not isolate retrieval as the cause.
Document Semantic Graphs
The document semantic graph paper (arXiv:2206.07296v2) directly addresses this problem in the context of knowledge selection for grounded dialogue systems. The key insight: background knowledge documents contain internal semantic connections among sentences that existing sentence-ranking approaches ignore. By converting documents into semantic graphs — nodes are sentences/concepts, edges are semantic relationships — and performing joint sentence-level and concept-level selection, the system can retrieve not just relevant sentences but relevant clusters of connected sentences.
The approach uses a multi-task learning framework that jointly optimizes sentence-level ranking (which sentences are relevant?) and concept-level selection (which concepts connect the relevant sentences?). On the Holl-E dataset, it outperforms sentence-selection baselines on both knowledge selection accuracy and end-to-end response generation quality. On the WoW (Wizard of Wikipedia) dataset, the method generalizes to unseen topics, which suggests that the graph structure captures transferable retrieval patterns rather than task-specific features.
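The idea of retrieving clusters of connected sentences can be illustrated with a hand-rolled sketch. This is not the paper's learned multi-task model: it assumes concepts have already been extracted per sentence, and it replaces learned concept selection with a simple one-hop expansion over shared concepts. All names and data are hypothetical.

```python
from collections import defaultdict

def build_concept_index(sentences):
    """Map each concept to the set of sentence ids that mention it."""
    index = defaultdict(set)
    for sid, concepts in sentences.items():
        for concept in concepts:
            index[concept].add(sid)
    return index

def expand_selection(seed_ids, sentences, index):
    """Add every sentence that shares at least one concept with a seed sentence."""
    selected = set(seed_ids)
    for sid in seed_ids:
        for concept in sentences[sid]:
            selected |= index[concept]
    return sorted(selected)

# Sentence id -> concepts mentioned (toy data, assumed pre-extracted)
sentences = {
    "s1": {"cost envelope", "token budget"},
    "s2": {"token budget", "compression"},
    "s3": {"evaluator"},
}

index = build_concept_index(sentences)
cluster = expand_selection(["s1"], sentences, index)
print(cluster)
```

Starting from the seed `s1`, the expansion pulls in `s2` through the shared concept "token budget" and leaves the unrelated `s3` out.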
Knowledge-Aware Conversation with Augmented Graphs
The knowledge-aware chatting machine paper (arXiv:1903.10245v4) addresses a complementary problem: how to fuse structured knowledge (graph triples: subject, predicate, object) with unstructured knowledge (document text) for response generation. The key observation: graph paths narrow down vertex candidates — they constrain the space of relevant facts — while document texts provide the rich, expressive content needed for high-quality synthesis. Neither representation alone is sufficient.
The augmented knowledge graph combines triples from structured KG sources with text excerpts from documents, linked by shared concepts and named entities. A knowledge selector first narrows the graph using path traversal; the knowledge-aware response generator then synthesizes a response using both the selected graph paths and the linked text excerpts. The explainability benefit: the reasoning path through the graph is a transparent record of which facts and relationships contributed to the response.
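A rough sketch of the augmented-graph idea: triples form the graph, text excerpts hang off entities, and a breadth-first search stands in for the knowledge selector. The triples, excerpts, and function names are illustrative assumptions; the paper's selector is a learned component, not a plain BFS.

```python
from collections import deque

# (subject, predicate, object) triples; toy data
triples = [
    ("novelty_gate", "filters", "retrieval_results"),
    ("retrieval_results", "feed", "producer"),
    ("producer", "scored_by", "evaluator"),
]
# Text excerpts linked to entities; placeholder strings
excerpts = {
    "novelty_gate": "Excerpt describing the Novelty Gate ...",
    "producer": "Excerpt describing the producer ...",
}

def build_adjacency(triples):
    """Index triples by subject for forward traversal."""
    adj = {}
    for subj, pred, obj in triples:
        adj.setdefault(subj, []).append((pred, obj))
    return adj

def find_path(adj, start, goal):
    """Breadth-first search; returns the triples on a shortest path, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for pred, nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, pred, nxt)]))
    return None

def excerpts_on_path(path, excerpts):
    """Collect the excerpts linked to entities that appear on the path."""
    nodes = []
    for subj, _, obj in path:
        for node in (subj, obj):
            if node not in nodes:
                nodes.append(node)
    return {n: excerpts[n] for n in nodes if n in excerpts}

adj = build_adjacency(triples)
path = find_path(adj, "novelty_gate", "producer")
linked = excerpts_on_path(path, excerpts)
```

The returned `path` doubles as the explanation record: it lists exactly which relationships connected the starting entity to the target, and `linked` holds the text that a generator would draw on.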
For the harness, this architecture suggests a three-layer retrieval stack:
- Graph layer: Structured triples (from ontology extraction or entity-relationship parsing) constrain the retrieval to the relevant conceptual neighborhood
- Document layer: Dense vector retrieval within the constrained neighborhood provides the text excerpts needed for synthesis
- Metadata layer: SQLite structured lookup provides exact matches (paper ID, date, topic tags) for precise factual claims
The current Dual-Backend Memory Store (B3) implements the document and metadata layers. The graph layer is absent. Adding it would require an entity extraction and relationship parsing step at ingestion time (feasible with the MeXtract-style lightweight extraction models described below) and a graph traversal query interface at retrieval time.
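The three-layer stack described above can be sketched end to end with the standard library. Everything here is an assumption for illustration: the concept graph, the toy embeddings, the `papers` table schema, and the `retrieve` function are hypothetical and do not reflect the harness's real interfaces.

```python
import math
import sqlite3

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Graph layer: concept -> neighboring concepts (toy data)
graph = {"cost envelope": {"token budget", "early stopping"}}

# Document layer: chunks tagged with a concept and a toy embedding
chunks = [
    {"id": 1, "concept": "token budget", "vec": [0.9, 0.1]},
    {"id": 2, "concept": "early stopping", "vec": [0.2, 0.8]},
    {"id": 3, "concept": "chemistry", "vec": [1.0, 0.0]},
]

# Metadata layer: exact lookup in SQLite (hypothetical schema)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE papers (chunk_id INTEGER PRIMARY KEY, paper_id TEXT, year INTEGER)")
db.executemany(
    "INSERT INTO papers VALUES (?, ?, ?)",
    [(1, "paper-a", 2024), (2, "paper-b", 2023), (3, "paper-c", 2022)],
)

def retrieve(seed_concept, query_vec, k=1):
    # 1. Graph layer: restrict candidates to the seed's conceptual neighborhood.
    neighborhood = {seed_concept} | graph.get(seed_concept, set())
    candidates = [c for c in chunks if c["concept"] in neighborhood]
    # 2. Document layer: dense ranking inside the neighborhood only.
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)[:k]
    # 3. Metadata layer: attach exact fields from SQLite.
    results = []
    for c in ranked:
        paper_id, year = db.execute(
            "SELECT paper_id, year FROM papers WHERE chunk_id = ?", (c["id"],)
        ).fetchone()
        results.append({"chunk_id": c["id"], "paper_id": paper_id, "year": year})
    return results

hits = retrieve("cost envelope", [1.0, 0.0], k=1)
print(hits)
```

Chunk 3 is the nearest neighbor to the query in embedding space, but it falls outside the "cost envelope" neighborhood and is never ranked. That is the behavioral difference the graph layer is meant to introduce.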
Multi-Modal Extraction: OpenChemIE
OpenChemIE (arXiv:2404.01462v1) addresses the hard case of structured extraction: chemistry reaction data distributed across text, tables, and figures within a single document. The pipeline operates in two stages: extract information from each modality independently (text extractor, table parser, figure analyzer), then integrate the modality-level results into a unified list of reactions.
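The two-stage shape of such a pipeline (extract per modality, then integrate) can be sketched as follows. This does not use or imitate OpenChemIE's API: the extractors are stand-in lambdas that read pre-parsed toy records, and the `reaction_id` join key and the "first modality wins" merge rule are assumptions chosen for simplicity.

```python
def integrate(per_modality):
    """Stage 2: merge modality-level records that refer to the same reaction."""
    merged = {}
    for modality, records in per_modality.items():
        for rec in records:
            entry = merged.setdefault(rec["reaction_id"], {"sources": []})
            entry["sources"].append(modality)
            for key, value in rec.items():
                # The first modality to supply a field wins (a simplifying assumption).
                entry.setdefault(key, value)
    return list(merged.values())

# Stand-in extractors: a real system would run a model per modality.
extractors = {
    "text": lambda doc: doc["text"],
    "table": lambda doc: doc["table"],
    "figure": lambda doc: doc["figure"],
}

# Toy document with placeholder values, already split by modality
doc = {
    "text": [{"reaction_id": "r1", "solvent": "solvent-x"}],
    "table": [
        {"reaction_id": "r1", "yield_pct": 82},
        {"reaction_id": "r2", "yield_pct": 40},
    ],
    "figure": [{"reaction_id": "r2", "product": "product-y"}],
}

# Stage 1: run each extractor independently.
per_modality = {name: extract(doc) for name, extract in extractors.items()}
# Stage 2: integrate into a unified list of reactions.
reactions = integrate(per_modality)
```

Each merged reaction keeps a `sources` list, so downstream consumers can see which modalities contributed to a record.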
Performance metrics: F1 = 69.5% on a dataset annotated with R-groups (the hardest chemistry extraction task, involving complex substituent patterns); direct comparison accuracy = 64.3% against Reaxys, a commercial chemistry database maintained by expert annotators. These numbers are far below the precision expected of a professional chemist — but the value proposition is scale and automation, not perfection. OpenChemIE can process thousands of papers per hour; human experts cannot.
The harness's Literature Review Pipeline (D5) currently extracts structured annotations from fetched papers using LLM-generated summaries: topic, motivation, contribution, evidence, limitations. OpenChemIE's approach — domain-specific multi-modal extraction with specialized models for each modality — would improve the fidelity of the structured extraction step for technical documents with figures and tables. An open question: whether the improvement in extraction fidelity translates to improvement in synthesis quality, given that T_B synthesis quality appears to be instruction-bounded rather than retrieval-bounded.
Lightweight Metadata Extraction: MeXtract
MeXtract (arXiv:2510.06889v1) introduces a family of lightweight language models (0.5B to 3B parameters, fine-tuned from Qwen 2.5) for metadata extraction from scientific papers. On the MOLE benchmark, MeXtract achieves state-of-the-art performance and effectively transfers to unseen schemas — evidence that a small, fine-tuned extraction model generalizes beyond its training distribution.
The case for lightweight extraction models in the harness is straightforward: metadata extraction does not require the deep reasoning that the producer needs for synthesis. A 0.5B MeXtract-style model extracting author, date, venue, topic tags, and key claims from fetched papers would add structured retrieval hooks to every document without consuming the VRAM budget needed for the producer and evaluator. This is the SLM efficiency insight from Post 15 applied to the extraction role: match the model to the role's binding constraint.
| Model | Parameters | Task | Performance | Harness role |
|---|---|---|---|---|
| MeXtract (0.5B) | 0.5B | Metadata extraction (MOLE benchmark) | State-of-the-art; transfers to unseen schemas | Document ingestion: extract structured metadata for SQLite store |
| MeXtract (3B) | 3B | Metadata extraction + relationship extraction | Higher fidelity on complex schemas | Graph layer construction: extract entity triples for graph store |
| OpenChemIE pipeline | Multi-model | Cross-modal chemistry reaction extraction | F1=69.5%, accuracy=64.3% vs Reaxys | Domain-specific: chemistry papers with figures/tables |
| glm4:9b (current planner) | 9B | Task decomposition + compression | Fast JSON-mode; no extraction fine-tuning | Current extraction proxy; over-parameterized for pure metadata extraction |
Graph Retrieval and the Novelty Gate
The harness's Novelty Gate (B4) filters redundant retrieval results using embedding similarity: if a retrieved chunk is above a cosine similarity threshold to already-cached content, it is discarded. This works for exact and near-duplicate filtering but fails for conceptual redundancy — two chunks that use different vocabulary to express the same claim will have low embedding similarity and both pass the Novelty Gate, consuming context budget unnecessarily.
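The embedding-based gate and its blind spot can be shown in a few lines. The threshold value of 0.9, the toy vectors, and the `is_novel` function are assumptions for illustration; the harness's actual threshold is not stated here.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def is_novel(candidate_vec, cached_vecs, threshold=0.9):
    """Pass a chunk only if it is below the threshold against every cached chunk."""
    return all(cosine(candidate_vec, v) < threshold for v in cached_vecs)

cached = [[1.0, 0.0]]

near_duplicate = [1.0, 0.05]   # almost the same direction as the cached chunk
paraphrase = [0.3, 0.95]       # same claim, different vocabulary (toy assumption)

dup_passes = is_novel(near_duplicate, cached)
paraphrase_passes = is_novel(paraphrase, cached)
print(dup_passes, paraphrase_passes)
```

The near-duplicate is rejected, but the paraphrase passes: the gate has no way to know that the second vector encodes the same claim.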
A graph-structured Novelty Gate would check conceptual overlap in the knowledge graph rather than embedding space. If the incoming document's extracted entity triples overlap significantly with triples already in the graph store, it can be flagged as conceptually redundant even if its embedding is distant. This is the graph-based deduplication strategy that the knowledge-aware conversation paper implicitly uses — the graph structure serves as a normalized representation of semantic content that is more robust to surface-level variation than embedding similarity.
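One way to sketch the graph-structured check is as a containment ratio over extracted triples. The 0.8 threshold, the lowercase normalization, and the function names are assumptions. The approach only works if the extractor maps different surface wordings to the same canonical entities and predicates, which this sketch does not attempt.

```python
def normalize(triple):
    """Canonicalize a triple by trimming and lowercasing (a deliberately thin step)."""
    return tuple(part.strip().lower() for part in triple)

def triple_overlap(incoming, store):
    """Fraction of the incoming document's triples already present in the store."""
    inc = {normalize(t) for t in incoming}
    if not inc:
        return 0.0
    known = {normalize(t) for t in store}
    return len(inc & known) / len(inc)

def is_conceptually_redundant(incoming, store, threshold=0.8):
    return triple_overlap(incoming, store) >= threshold

# Triples already in the graph store (toy data)
store = [
    ("novelty gate", "filters", "retrieval results"),
    ("producer", "scored by", "evaluator"),
    ("planner", "decomposes", "task"),
]

# Same facts, different surface casing
repeat_doc = [
    ("Novelty Gate", "filters", "Retrieval Results"),
    ("Producer", "scored by", "Evaluator"),
]
# Half known, half new
mixed_doc = [
    ("planner", "decomposes", "task"),
    ("graph layer", "constrains", "retrieval"),
]

repeat_redundant = is_conceptually_redundant(repeat_doc, store)
mixed_redundant = is_conceptually_redundant(mixed_doc, store)
```

Containment is used instead of a symmetric measure such as Jaccard so that a short document fully covered by a large store is still flagged.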
The Literature Review Pipeline as Structured Extraction
The harness's Literature Review Pipeline (D5) already performs a form of structured extraction. Each fetched paper is processed by an LLM to produce an annotated entry: topic, motivation, contribution, evidence, limitations, narrow impact, broad impact. This structured format is stored in Markdown files that serve as the primary knowledge base for synthesis tasks.
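The annotated entry can be pictured as a small record type rendered to Markdown. The seven field names come from the list above; the `Annotation` class, the `to_markdown` helper, and the exact Markdown layout are assumptions, since the harness's real file format is not shown here.

```python
from dataclasses import dataclass, fields

@dataclass
class Annotation:
    topic: str
    motivation: str
    contribution: str
    evidence: str
    limitations: str
    narrow_impact: str
    broad_impact: str

def to_markdown(paper_id, ann):
    """Render one annotated entry as a Markdown section (layout is hypothetical)."""
    lines = [f"## {paper_id}"]
    for f in fields(ann):
        label = f.name.replace("_", " ").capitalize()
        lines.append(f"- **{label}:** {getattr(ann, f.name)}")
    return "\n".join(lines)

# Placeholder values; a real entry would be filled in by the annotating model.
entry = Annotation(
    topic="metadata extraction",
    motivation="(placeholder)",
    contribution="(placeholder)",
    evidence="(placeholder)",
    limitations="(placeholder)",
    narrow_impact="(placeholder)",
    broad_impact="(placeholder)",
)

md = to_markdown("arXiv:2510.06889v1", entry)
print(md)
```

Keeping the schema in one typed record makes it straightforward to swap the model that fills it in without touching the storage format.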
MeXtract's approach — fine-tuning a small model specifically for metadata extraction with transferability to unseen schemas — suggests that the harness's current LLM-generated annotation step could be replaced or augmented with a lighter, faster extraction model. The key question is whether the LLM's annotation quality — which includes inference and synthesis ("what is the contribution?") rather than pure extraction ("what does the abstract say the contribution is?") — is necessary for downstream synthesis quality, or whether lightweight extraction would suffice.
The distinction matters because annotation currently runs at the speed of the planner model (glm4:9b, a 9B-parameter model). A 0.5B extraction model could be roughly 10-20× faster, which would allow real-time annotation of search results during the retrieval phase rather than as a separate preprocessing step. The tradeoff: MeXtract-style models are trained on surface extraction, not on inferring contributions or limitations that are not explicitly stated in the abstract.
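One possible way to act on this tradeoff is to route fields by type: surface fields go to the small extractor and inferential fields stay with the planner. The field split, the `annotate` function, and the stub models below are all hypothetical. Which fields count as "surface" is itself an open design choice.

```python
# Assumed split: fields usually stated verbatim vs. fields that need inference.
SURFACE_FIELDS = ["author", "date", "venue", "topic_tags"]
INFERENTIAL_FIELDS = ["contribution", "limitations"]

def annotate(paper_text, small_model, planner_model):
    """Fill surface fields with the small model, inferential ones with the planner.

    Both models are callables of the form model(text, field_names) -> dict.
    """
    record = dict(small_model(paper_text, SURFACE_FIELDS))
    record.update(planner_model(paper_text, INFERENTIAL_FIELDS))
    return record

def make_stub(name):
    """Build a stand-in model that labels each field with the model that produced it."""
    def model(paper_text, field_names):
        return {f: f"<{name} output for {f}>" for f in field_names}
    return model

entry = annotate("full paper text ...", make_stub("small"), make_stub("planner"))
print(entry["venue"])
print(entry["limitations"])
```

With this split, the fast model can annotate search results as they arrive, while the slower planner is called only for the fields that need inference.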
Design Implications for the Memory Architecture
| # | Implication | Source | Current gap |
|---|---|---|---|
| 1 | Add a graph layer to the Dual-Backend Memory Store; convert ingested documents to entity-relationship triples at ingestion time | arXiv:2206.07296v2, arXiv:1903.10245v4 | Current memory is a flat vector store + SQLite; no cross-document relationship encoding |
| 2 | Implement graph-constrained retrieval: use graph path traversal to narrow the candidate set before dense vector retrieval | arXiv:1903.10245v4 | Current retrieval is unconstrained nearest-neighbor; no concept-level pre-filtering |
| 3 | Use a lightweight extraction model (0.5B–3B parameters) for structured metadata extraction at ingestion; reserve the planner model for inferential annotation | arXiv:2510.06889v1 | Current: all annotation done by glm4:9b; over-parameterized for surface extraction tasks |
| 4 | Upgrade the Novelty Gate to check conceptual overlap in the graph store, not just embedding similarity | arXiv:2206.07296v2 | Current Novelty Gate uses cosine similarity; misses semantic equivalence with surface variation |
| 5 | For domain-specific corpora with figures and tables, deploy multi-modal extraction pipelines; graph construction can draw on figure-derived data not present in text | arXiv:2404.01462v1 | Current extraction is text-only; figures and tables are ignored |
What the Literature Leaves Open
Several questions raised by this body of research remain unresolved — and bear directly on how the harness memory architecture should evolve:
- How well do current extraction models handle implicit relationships — those inferable from context but not stated as subject-predicate-object triples — and what recall penalty does the harness pay by ignoring them?
- At what graph density does graph-constrained retrieval begin to outperform flat embedding retrieval for multi-hop queries, and does the harness's current vector-first architecture cross that threshold on a mature knowledge store?
- Can a 0.5B–3B model generalize schema structures across scientific subdomains without fine-tuning, or does the harness need domain-specific extraction checkpoints to avoid systematic entity-type errors?
- When the graph store is partially invalidated — because a source document is retracted or superseded — how should the Novelty Gate distinguish genuinely new information from information that merely looks novel because its prior representation was removed?
- What are the failure modes of concept-level deduplication in research corpora where the same idea is expressed across dozens of papers with varying terminology, and how much graph bloat does the current cosine-similarity gate accumulate over a full autoresearch run?