Knowledge Graphs and Structured Extraction: Beyond the Vector Store

May 1, 2026 • 15 min read

The harness's Dual-Backend Memory Store uses vector similarity for semantic retrieval and SQLite for structured lookup. Both treat knowledge as a flat store of independent items. The knowledge graph literature argues for a third representation: a connected graph where the relationships between items carry as much signal as the items themselves. This post reviews the evidence and maps it to the memory architecture.

The Flat-Store Problem

Vector similarity retrieval treats each document chunk as an independent embedding. A query retrieves the k nearest neighbors in embedding space. The retrieved chunks are then concatenated and passed to the producer. The core limitation: inter-chunk relationships are discarded. Two chunks from the same document that discuss related concepts are retrieved independently; their connection — the fact that claim A from chunk 1 is the prerequisite for claim B from chunk 2 — is invisible to the retrieval step.
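A minimal sketch of the flat retrieval step described above, assuming toy 2-d embeddings and a hypothetical `prerequisite_for` field (neither comes from the harness). Each chunk is scored on its own, so the stored link between the two claims never influences the ranking.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunks, k=2):
    """Rank chunks by similarity to the query; every chunk is scored independently."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [c["id"] for c in ranked[:k]]


# Toy data, invented for illustration. "prerequisite_for" is a hypothetical
# relationship field that flat retrieval never reads.
chunks = [
    {"id": "claim_A", "vec": [1.0, 0.0], "prerequisite_for": "claim_B"},
    {"id": "claim_B", "vec": [0.0, 1.0]},
    {"id": "near_A", "vec": [0.9, 0.1]},
]

result = top_k([1.0, 0.05], chunks, k=2)
print(result)  # ['claim_A', 'near_A']
```

In this toy case claim_B is left out even though claim_A depends on it, because only distance to the query decides what is returned.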

This matters when the task requires integrating information across multiple parts of a document rather than finding the single most relevant chunk. The harness's T_B task (best practices for cost envelope management) is exactly this kind of task: good best-practices synthesis requires connecting multiple principles that are spread across different sections of different documents. The evaluator's depth_r1 score of 6.0–6.1 across all four experiments is consistent with a retrieval layer that fails to surface the connections deep synthesis needs, although a flat score alone does not isolate retrieval as the cause.

The retrieval ceiling hypothesis: If depth_r1 stays at 6.0–6.1 across all four experiments regardless of producer model, and instruction optimization did not move it either, then the bottleneck may be upstream of the synthesis stage, in the retrieval step that determines what information reaches the producer. A graph-structured retrieval layer could surface cross-chunk relationships that flat vector retrieval misses.

Document Semantic Graphs

The document semantic graph paper (arXiv:2206.07296v2) directly addresses this problem in the context of knowledge selection for grounded dialogue systems. The key insight: background knowledge documents contain internal semantic connections among sentences that existing sentence-ranking approaches ignore. By converting documents into semantic graphs — nodes are sentences/concepts, edges are semantic relationships — and performing joint sentence-level and concept-level selection, the system can retrieve not just relevant sentences but relevant clusters of connected sentences.

The approach uses a multi-task learning framework that jointly optimizes sentence-level ranking (which sentences are relevant?) and concept-level selection (which concepts connect the relevant sentences?). On the HollE dataset, this outperforms sentence selection baselines on both knowledge selection accuracy and end-to-end response generation quality. On the WoW (Wizard of Wikipedia) dataset, the method generalizes to unseen topics — evidence that the graph structure captures transferable retrieval patterns rather than task-specific features.
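The idea of selecting connected clusters rather than isolated sentences can be illustrated with a much simpler stand-in than the paper's learned multi-task model. The sketch below assumes each sentence already comes with a set of concept labels (the sentence ids and concepts are invented) and expands a seed selection to every sentence that shares a concept with it.

```python
from collections import defaultdict


def build_concept_index(sentences):
    """Map each concept to the set of sentence ids that mention it."""
    index = defaultdict(set)
    for sid, concepts in sentences.items():
        for concept in concepts:
            index[concept].add(sid)
    return index


def expand_selection(seed_ids, sentences, index):
    """Add every sentence that shares at least one concept with a seed sentence."""
    selected = set(seed_ids)
    for sid in seed_ids:
        for concept in sentences[sid]:
            selected |= index[concept]
    return sorted(selected)


# Hypothetical sentence -> concept annotations.
sentences = {
    "s1": {"cost envelope", "token budget"},
    "s2": {"token budget", "early stopping"},
    "s3": {"vector store"},
}

index = build_concept_index(sentences)
cluster = expand_selection(["s1"], sentences, index)
print(cluster)  # ['s1', 's2']
```

Here s2 is pulled in through the shared "token budget" concept even if it would not rank highly on its own. The paper learns both selections jointly, whereas this sketch uses a fixed one-hop rule.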

The unseen-topic generalization result is particularly significant for the harness. The research cache is populated by prior runs, but novel queries retrieve from a sparse cache. A graph-structured retrieval layer that generalizes to unseen topics would reduce the quality penalty for novel queries — where the flat vector store's performance degrades most severely due to sparse nearest-neighbor matches.

Knowledge-Aware Conversation with Augmented Graphs

The knowledge-aware chatting machine paper (arXiv:1903.10245v4) addresses a complementary problem: how to fuse structured knowledge (graph triples: subject, predicate, object) with unstructured knowledge (document text) for response generation. The key observation: graph paths narrow down vertex candidates — they constrain the space of relevant facts — while document texts provide the rich, expressive content needed for high-quality synthesis. Neither representation alone is sufficient.

The augmented knowledge graph combines triples from structured KG sources with text excerpts from documents, linked by shared concepts and named entities. A knowledge selector first narrows the graph using path traversal; the knowledge-aware response generator then synthesizes a response using both the selected graph paths and the linked text excerpts. The explainability benefit: the reasoning path through the graph is a transparent record of which facts and relationships contributed to the response.

For the harness, this architecture suggests a three-layer retrieval stack:

1. Graph layer: Structured triples (from ontology extraction or entity-relationship parsing) constrain the retrieval to the relevant conceptual neighborhood.
2. Document layer: Dense vector retrieval within the constrained neighborhood provides the text excerpts needed for synthesis.
3. Metadata layer: SQLite structured lookup provides exact matches (paper ID, date, topic tags) for precise factual claims.

The current Dual-Backend Memory Store (B3) implements layers 2 and 3. Layer 1, the graph layer, is absent. Adding it would require an entity extraction and relationship parsing step at ingestion time (plausibly feasible with the MeXtract-style lightweight extraction models described below, though those are evaluated on metadata extraction rather than relation extraction) and a graph traversal query interface at retrieval time.
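One way the three layers could compose is sketched below. Everything here is an assumption made for illustration: the adjacency-dict graph, the one-concept-per-chunk tagging, the `meta` table schema, and the function names are not the harness's actual interfaces.

```python
import math
import sqlite3
from collections import deque


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def neighborhood(graph, start, max_hops=1):
    """Layer 1: breadth-first traversal to collect concepts within max_hops."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen


def retrieve(query_vec, query_concept, graph, chunks, db, min_year, k=2):
    concepts = neighborhood(graph, query_concept)
    # Layer 2: dense ranking, restricted to chunks inside the neighborhood.
    candidates = [c for c in chunks if c["concept"] in concepts]
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    # Layer 3: exact metadata match via SQLite.
    rows = db.execute("SELECT chunk_id FROM meta WHERE year >= ?", (min_year,))
    allowed = {row[0] for row in rows}
    return [c["id"] for c in ranked if c["id"] in allowed][:k]


# Toy data, invented for illustration.
graph = {"cost envelope": ["token budget"], "token budget": ["cost envelope"]}
chunks = [
    {"id": "c1", "concept": "cost envelope", "vec": [1.0, 0.0]},
    {"id": "c2", "concept": "token budget", "vec": [0.6, 0.8]},
    {"id": "c3", "concept": "reaction yield", "vec": [1.0, 0.01]},
    {"id": "c4", "concept": "token budget", "vec": [0.9, 0.1]},
]
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE meta (chunk_id TEXT PRIMARY KEY, year INTEGER)")
db.executemany(
    "INSERT INTO meta VALUES (?, ?)",
    [("c1", 2024), ("c2", 2023), ("c3", 2025), ("c4", 2019)],
)

hits = retrieve([1.0, 0.0], "cost envelope", graph, chunks, db, min_year=2022)
print(hits)  # ['c1', 'c2']
```

In this toy run, c3 is nearly identical to the query in embedding space but sits outside the concept neighborhood, so the graph layer drops it before ranking. c4 survives ranking but fails the metadata filter.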

Fig. 1 — Current Dual-Backend Memory Store (left) vs. proposed three-layer architecture with graph augmentation (right). The graph layer constrains vector retrieval to the relevant conceptual neighborhood, reducing retrieval noise on complex multi-concept queries.

Multi-Modal Extraction: OpenChemIE

OpenChemIE (arXiv:2404.01462v1) addresses the hard case of structured extraction: chemistry reaction data distributed across text, tables, and figures within a single document. The pipeline operates in two stages: extract information from each modality independently (text extractor, table parser, figure analyzer), then integrate the modality-level results into a unified list of reactions.
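The two-stage shape (per-modality extraction, then integration) can be sketched generically. This is not OpenChemIE's API or its integration logic: the sketch assumes each modality extractor has already produced partial records, and that records describing the same reaction share a hypothetical `reaction_id`. In practice, establishing that alignment is the hard part.

```python
def integrate(per_modality):
    """Stage 2: merge partial records that share a reaction id into one record each."""
    merged = {}
    for modality, records in per_modality.items():
        for rec in records:
            entry = merged.setdefault(rec["reaction_id"], {"sources": []})
            entry["sources"].append(modality)
            for key, value in rec.items():
                if key != "reaction_id":
                    # The first modality to report a field wins (an arbitrary policy).
                    entry.setdefault(key, value)
    return merged


# Stage 1 outputs, invented for illustration: one list of partial records per modality.
per_modality = {
    "text": [{"reaction_id": "r1", "solvent": "THF"}],
    "table": [
        {"reaction_id": "r1", "yield_pct": 82},
        {"reaction_id": "r2", "yield_pct": 40},
    ],
    "figure": [{"reaction_id": "r2", "product": "compound 3b"}],
}

reactions = integrate(per_modality)
print(reactions["r1"])  # {'sources': ['text', 'table'], 'solvent': 'THF', 'yield_pct': 82}
```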

Performance metrics: F1 = 69.5% on a dataset annotated with R-groups (a demanding extraction setting involving complex substituent patterns); direct comparison accuracy = 64.3% against Reaxys, a commercial chemistry database maintained by expert annotators. These numbers are well below the precision expected of a professional chemist, but the value proposition is scale and automation, not perfection: an automated pipeline can process far more papers than manual curation can.

The harness's Literature Review Pipeline (D5) currently extracts structured annotations from fetched papers using LLM-generated summaries: topic, motivation, contribution, evidence, limitations. OpenChemIE's approach (domain-specific multi-modal extraction with specialized models for each modality) could improve the fidelity of the structured extraction step for technical documents with figures and tables. An open question: whether better extraction fidelity translates into better synthesis quality. It would only do so if T_B synthesis is retrieval-bounded, as the retrieval ceiling hypothesis proposes, rather than limited by the producer or its instructions.

Lightweight Metadata Extraction: MeXtract

MeXtract (arXiv:2510.06889v1) introduces a family of lightweight language models (0.5B to 3B parameters, fine-tuned from Qwen 2.5) for metadata extraction from scientific papers. On the MOLE benchmark, MeXtract achieves state-of-the-art performance and effectively transfers to unseen schemas — evidence that a small, fine-tuned extraction model generalizes beyond its training distribution.

The case for lightweight extraction models in the harness is straightforward: metadata extraction does not require the deep reasoning that the producer needs for synthesis. A 0.5B MeXtract-style model extracting author, date, venue, topic tags, and key claims from fetched papers would add structured retrieval hooks to every document without consuming the VRAM budget needed for the producer and evaluator. This is the SLM efficiency insight from Post 15 applied to the extraction role: match the model to the role's binding constraint.

Model	Parameters	Task	Performance	Harness role
MeXtract (0.5B)	0.5B	Metadata extraction (MOLE benchmark)	State-of-the-art; transfers to unseen schemas	Document ingestion: extract structured metadata for SQLite store
MeXtract (3B)	3B	Metadata extraction + relationship extraction	Higher fidelity on complex schemas	Graph layer construction: extract entity triples for graph store
OpenChemIE pipeline	Multi-model	Cross-modal chemistry reaction extraction	F1=69.5%, accuracy=64.3% vs Reaxys	Domain-specific: chemistry papers with figures/tables
glm4:9b (current planner)	9B	Task decomposition + compression	Fast JSON-mode; no extraction fine-tuning	Current extraction proxy; over-parameterized for pure metadata extraction

The efficiency argument: The harness currently uses glm4:9b — a 9B parameter model — to handle planning, compression, and ad hoc extraction tasks. A dedicated MeXtract-style 0.5B model would handle structured metadata extraction at 18× lower parameter cost, freeing glm4:9b for tasks requiring higher-order reasoning. This is the same role-separation logic that motivated the three-model architecture in experiment-04.

Graph Retrieval and the Novelty Gate

The harness's Novelty Gate (B4) filters redundant retrieval results using embedding similarity: if a retrieved chunk is above a cosine similarity threshold to already-cached content, it is discarded. This works for exact and near-duplicate filtering but can fail for conceptual redundancy: two chunks that express the same claim in sufficiently different vocabulary or framing can fall below the threshold, so both pass the Novelty Gate and consume context budget unnecessarily.

A graph-structured Novelty Gate would check conceptual overlap in the knowledge graph rather than embedding space. If the incoming document's extracted entity triples overlap significantly with triples already in the graph store, it can be flagged as conceptually redundant even if its embedding is distant. This is the graph-based deduplication strategy that the knowledge-aware conversation paper implicitly uses — the graph structure serves as a normalized representation of semantic content that is more robust to surface-level variation than embedding similarity.
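The two gates can be contrasted in a few lines. This is a sketch under stated assumptions: the thresholds (0.9 cosine, 0.5 triple overlap), the function names, and the toy vectors and triples are all illustrative, and exact string matching on triples stands in for whatever entity normalization a real graph store would need.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embedding_gate(vec, cached_vecs, threshold=0.9):
    """Novel if no cached embedding exceeds the cosine threshold."""
    return all(cosine(vec, cached) <= threshold for cached in cached_vecs)


def triple_gate(triples, graph_triples, max_overlap=0.5):
    """Novel if at most max_overlap of the incoming triples are already in the graph."""
    if not triples:
        return True
    overlap = len(triples & graph_triples) / len(triples)
    return overlap <= max_overlap


# Toy case: a paraphrase whose embedding is far from the cached chunk,
# but whose extracted triples are already in the graph store.
cached_vecs = [[1.0, 0.0]]
incoming_vec = [0.3, 0.95]
graph_triples = {
    ("novelty gate", "filters", "redundant chunks"),
    ("novelty gate", "uses", "cosine similarity"),
}
incoming_triples = set(graph_triples)

passes_embedding = embedding_gate(incoming_vec, cached_vecs)
passes_graph = triple_gate(incoming_triples, graph_triples)
print(passes_embedding, passes_graph)  # True False
```

The triple gate is only as good as the extraction and normalization behind it. If the same entity is extracted under two different surface forms, exact matching misses the overlap.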

The Literature Review Pipeline as Structured Extraction

The harness's Literature Review Pipeline (D5) already performs a form of structured extraction. Each fetched paper is processed by an LLM to produce an annotated entry: topic, motivation, contribution, evidence, limitations, narrow impact, broad impact. This structured format is stored in Markdown files that serve as the primary knowledge base for synthesis tasks.

MeXtract's approach — fine-tuning a small model specifically for metadata extraction with transferability to unseen schemas — suggests that the harness's current LLM-generated annotation step could be replaced or augmented with a lighter, faster extraction model. The key question is whether the LLM's annotation quality — which includes inference and synthesis ("what is the contribution?") rather than pure extraction ("what does the abstract say the contribution is?") — is necessary for downstream synthesis quality, or whether lightweight extraction would suffice.

The distinction matters because annotation currently runs at the speed of the 9B-parameter planner model (glm4:9b). A 0.5B extraction model should be substantially faster (the 10-20× figure is an estimate), which could enable real-time annotation of search results during the retrieval phase rather than as a separate preprocessing step. The tradeoff: MeXtract-style models are trained on surface extraction, not on inferring contributions or limitations that are not explicitly stated in the abstract.

The hybrid approach: Use MeXtract-style extraction for objective metadata (author, date, venue, entity triples, explicit claims) and reserve the planner model for inferential annotation (contribution classification, limitation inference). This separates the cheap extraction step from the expensive inference step, allowing the expensive step to run asynchronously without blocking the retrieval pipeline.
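The split between a cheap synchronous step and an expensive asynchronous step might look like the following. Both model calls are stand-ins: `extract_metadata` and `infer_annotation` are hypothetical names, and the sleep merely simulates planner latency.

```python
import asyncio


def extract_metadata(paper):
    """Stand-in for a small extraction model: copies fields stated in the paper."""
    return {"title": paper["title"], "year": paper["year"]}


async def infer_annotation(paper):
    """Stand-in for the planner model: slower, inferential annotation."""
    await asyncio.sleep(0.01)  # simulated model latency
    return {"title": paper["title"], "contribution": "inferred: " + paper["abstract"][:20]}


async def ingest(papers):
    store, pending = [], []
    for paper in papers:
        # Cheap step: metadata is in the store before any inference finishes.
        store.append(extract_metadata(paper))
        # Expensive step: scheduled as a background task, not awaited here.
        pending.append(asyncio.create_task(infer_annotation(paper)))
    # Retrieval could already query `store` at this point.
    annotations = await asyncio.gather(*pending)
    return store, annotations


# Toy inputs, invented for illustration.
papers = [
    {"title": "Paper A", "year": 2024, "abstract": "We propose a graph layer for retrieval."},
    {"title": "Paper B", "year": 2025, "abstract": "We fine-tune small extraction models."},
]

metadata, annotations = asyncio.run(ingest(papers))
print(metadata[0])  # {'title': 'Paper A', 'year': 2024}
```

A real pipeline would presumably not wait for the annotations inside `ingest` at all and would write them back to the store as they complete. The `gather` call is there only so the example terminates with both results in hand.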

Design Implications for the Memory Architecture

#	Implication	Source	Current gap
1	Add a graph layer to the Dual-Backend Memory Store; convert ingested documents to entity-relationship triples at ingestion time	arXiv:2206.07296v2, arXiv:1903.10245v4	Current memory is a flat vector store + SQLite; no cross-document relationship encoding
2	Implement graph-constrained retrieval: use graph path traversal to narrow the candidate set before dense vector retrieval	arXiv:1903.10245v4	Current retrieval is unconstrained nearest-neighbor; no concept-level pre-filtering
3	Use a lightweight extraction model (0.5B–3B parameters) for structured metadata extraction at ingestion; reserve the planner model for inferential annotation	arXiv:2510.06889v1	Current: all annotation done by glm4:9b; over-parameterized for surface extraction tasks
4	Upgrade the Novelty Gate to check conceptual overlap in the graph store, not just embedding similarity	arXiv:2206.07296v2	Current Novelty Gate uses cosine similarity; misses semantic equivalence with surface variation
5	For domain-specific corpora with figures and tables, deploy multi-modal extraction pipelines; graph construction can draw on figure-derived data not present in text	arXiv:2404.01462v1	Current extraction is text-only; figures and tables are ignored

What the Literature Leaves Open

Several questions raised by this body of research remain unresolved — and bear directly on how the harness memory architecture should evolve:

How well do current extraction models handle implicit relationships — those inferable from context but not stated as subject-predicate-object triples — and what recall penalty does the harness pay by ignoring them?
At what graph density does graph-constrained retrieval begin to outperform flat embedding retrieval for multi-hop queries, and does the harness's current vector-first architecture cross that threshold on a mature knowledge store?
Can a 0.5B–3B model generalize schema structures across scientific subdomains without fine-tuning, or does the harness need domain-specific extraction checkpoints to avoid systematic entity-type errors?
When the graph store is partially invalidated — because a source document is retracted or superseded — how should the Novelty Gate distinguish genuinely new information from information that merely looks novel because its prior representation was removed?
What are the failure modes of concept-level deduplication in research corpora where the same idea is expressed across dozens of papers with varying terminology, and how much graph bloat does the current cosine-similarity gate accumulate over a full autoresearch run?
