May 23, 2026 • 15 min read

Structured Knowledge Queries: Ontology, SPARQL, and Grounded Verification

Post 18 showed that vector stores can be augmented with graph layers. This post goes one level deeper: six papers that tackle structured query generation, low-resource ontology extraction, and KB-grounded verification — each addressing a different gap between how LLMs reason and how knowledge graphs are actually queried.

Series context. The harness's Dual-Backend Memory Store (Post 6) currently relies on semantic vector similarity. Posts 18 and 22 together make the case for a graph layer. Where Post 18 argued for document-level graphs, this post covers what happens when you try to write SPARQL against them: LLMs get syntax right and semantics wrong, zero-shot exploration reaches state-of-the-art results on Wikidata without any fine-tuning, and KB-grounded verification chains give prescription auditing a level of traceability that ungrounded LLMs cannot provide.

The LLM-KG Integration Gap

Knowledge graphs encode facts with machine-queryable structure. Language models encode facts with parametric weights. The combination is appealing in theory — KGs provide grounding and traceability; LLMs provide the natural-language interface. The integration is hard in practice because the query layer requires formal SPARQL, and LLMs were not trained to write semantically correct SPARQL at scale.

This post covers six papers that each address a different slice of that gap:

Paper	Gap addressed	Key result
ODA [2404.07677]	LLMs ignore KG reasoning potential	+12.87% / +8.9% accuracy over baselines
ETLCH [2509.08381]	Structured extraction requires expensive models	1B-param LoRA beats baselines on 100–1000 samples
LLM-KG-Bench [2409.05925]	SPARQL capability is unmeasured	Syntax: fine. Semantic-create: still hard.
Agentic SPARQL [2603.06582]	Single-endpoint KGQA misses federation	MCP-powered federated query across distributed KGs
GRASP [2507.08107]	Fine-tuning SPARQL generation is dataset-specific	Zero-shot SOTA on Wikidata; near best few-shot on Freebase
PharmGraph-Auditor [2603.10891]	LLMs lack traceability for safety-critical verification	HPKB + KB-grounded CoV enables traceable prescription auditing

1. ODA: Recursive Observation Handles Knowledge Explosion

The Observation-Driven Agent (ODA, arXiv:2404.07677v2) starts from a diagnosis that still holds: when you connect an LLM to a KG, the LLM's tendency is to ignore what the KG offers and rely on parametric memory instead. Even when you inject KG facts into the prompt, the LLM selects only the first few hops before context pressure forces it to stop. The resulting answers look grounded but are often missing the third-order facts that change the answer.

ODA's architecture is a three-phase cycle: observe → act → reflect. In the observe phase, the agent executes a global traversal of the KG centered on the query entities. Rather than pulling a fixed-depth neighborhood, it uses a recursive observation mechanism that expands selectively: when a retrieved node contains concepts relevant to the current question state, the mechanism recurses into that node's neighborhood before returning. Nodes that don't contribute to the current reasoning thread are pruned. This controls the combinatorial explosion that makes naive KG traversal unworkable at scale.

The act phase executes operations against the pruned subgraph. The reflect phase checks whether the current answer is consistent with the accumulated observations, or whether a new traversal arc is needed. The cycle repeats until consistency is achieved.
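The selective expansion described above can be sketched in a few lines. This is a toy illustration, not ODA's implementation: the graph, the entity names, and the `is_relevant` predicate (a stand-in for the LLM's relevance judgment) are all assumptions made for the example.

```python
from typing import Callable, Optional

# Toy KG as adjacency lists of (relation, object) pairs. All names are made up.
KG = {
    "film_x": [("directed_by", "person_a"), ("released_in", "year_1999")],
    "person_a": [("born_in", "city_b"), ("spouse", "person_c")],
    "city_b": [("located_in", "country_d")],
    "person_c": [("born_in", "city_e")],
}

Triple = tuple[str, str, str]


def observe(
    entity: str,
    is_relevant: Callable[[str], bool],
    max_depth: int = 3,
    depth: int = 0,
    seen: Optional[set] = None,
) -> list[Triple]:
    """Collect triples reachable from `entity`, recursing only through relevant edges."""
    seen = seen if seen is not None else set()
    if depth >= max_depth or entity in seen:
        return []
    seen.add(entity)
    triples: list[Triple] = []
    for relation, obj in KG.get(entity, []):
        if not is_relevant(relation):
            continue  # prune: this edge does not serve the current question
        triples.append((entity, relation, obj))
        # Recurse into the neighbor before returning, as in the observe phase.
        triples.extend(observe(obj, is_relevant, max_depth, depth + 1, seen))
    return triples


# Question: "In which country was the director of film_x born?"
wanted = {"directed_by", "born_in", "located_in"}
observations = observe("film_x", lambda relation: relation in wanted)
for triple in observations:
    print(triple)
```

The `spouse` and `released_in` edges never enter the context, which is the point: the prompt carries the three-hop chain the question needs and nothing else.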

Results. Extensive experiments across multiple KGQA benchmarks give ODA accuracy improvements of 12.87% and 8.9% over existing baselines. Both numbers represent state-of-the-art at time of publication. The recursive pruning mechanism is the critical ingredient — without it, knowledge explosion degrades performance as the context window fills with irrelevant triples.

The harness implication is specific: the Dual-Backend Memory Store currently retrieves flat lists of semantically similar chunks. ODA's observe–act–reflect cycle is a planner-level pattern that could sit above retrieval — deciding which KG neighborhood to expand at each planning step rather than issuing a single vector query. The recursive observation mechanism is architecturally similar to the harness's existing ReAct Comparator, which also loops until an internal consistency criterion is met.

2. ETLCH: Billion-Parameter LoRA Beats the Big Models on Low Data

Structured information extraction from text — converting unstructured documents into JSON objects, knowledge graph triples, or named-entity spans — has historically been owned by either hand-engineered pipelines or very large instruction-tuned models. The assumption is that small models can't handle multi-schema outputs reliably without massive training data.

ETLCH (arXiv:2509.08381v1) challenges that assumption directly. The work fine-tunes a one-billion-parameter LLaMA-based model with low-rank adaptation (LoRA) on three structured extraction tasks simultaneously:

JSON extraction: convert free-form text into schema-conforming JSON objects
Knowledge graph extraction: identify entity-relation-entity triples from documents
Named entity recognition: label entity spans with ontology-grounded types

The low-resource constraint is the headline: training set sizes range from 100 to 1,000 samples per task. At every data scale tested, ETLCH outperforms strong baselines on most evaluation metrics, including 7B+ parameter baselines that are not LoRA-constrained.

Why this works. LoRA's rank constraint prevents catastrophic forgetting across the three tasks. The billion-parameter base provides sufficient representational capacity for multi-schema outputs. And 100–1000 samples is enough to teach the model what the target schema looks like — the hard part is schema adherence, not extraction itself. The finding argues that for domain-specific structured outputs (financial compliance, legal document analytics, multilingual KB construction), a well-tuned small model is a practical choice at a fraction of the compute cost of frontier models.
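To make the rank constraint concrete, here is a back-of-envelope count of trainable parameters for one adapted weight matrix. The formula is standard LoRA arithmetic; the matrix size and rank are assumptions chosen for illustration and are not taken from the ETLCH paper.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """LoRA learns B (d_out x rank) and A (rank x d_in) instead of a full d_out x d_in update."""
    return rank * (d_out + d_in)


d_out = d_in = 2048  # assumed projection size, for illustration only
rank = 8             # assumed LoRA rank, for illustration only

full = d_out * d_in
lora = lora_trainable_params(d_out, d_in, rank)
fraction = lora / full

print(f"full update: {full:,} params; LoRA update: {lora:,} params ({fraction:.2%})")
# prints: full update: 4,194,304 params; LoRA update: 32,768 params (0.78%)
```

Under these assumed sizes, each adapted matrix trains under one percent of the weights a full update would touch, which is why a few hundred samples can be enough to steer output format without overwriting what the base model already knows.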

For the harness, this is the extraction analogue of the SLM efficiency finding from Post 15: model size is not the primary lever for structured extraction quality once you have domain-specific LoRA fine-tuning. The Literature Review Pipeline's annotation step could be upgraded with an ETLCH-style extraction layer that converts fetched paper abstracts directly into structured JSONL records for the knowledge store.

3. LLM-KG-Bench: Syntax Is Easy, Semantics Remains Hard

Before deciding how to use LLMs to generate SPARQL, it helps to know what they can actually do. LLM-KG-Bench (arXiv:2409.05925v2) fills that measurement gap by implementing a benchmarking framework that evaluates GPT, Gemini, and Claude models across four capability dimensions:

Syntax correction: fix a malformed SPARQL query
Semantic read: generate a SELECT query that correctly retrieves specified facts
Semantic create: generate queries that update or insert triples
KG-prompted: generate queries when relevant schema fragments are included in context

The findings reveal a consistent asymmetry. Syntax correction is essentially solved by the best current models. Semantic read is feasible on simple patterns but degrades with query complexity. Semantic create — requiring the model to reason about update semantics, graph naming conventions, and constraint satisfaction simultaneously — remains difficult across all tested models.

The practical implication. If your harness needs to query a KG at runtime, relying on a vanilla LLM to generate arbitrary SPARQL is fragile. The failure mode is not syntax errors (those are easy to catch and retry) — it's semantically plausible but factually wrong queries that silently return incorrect results. Defensive architectures should constrain the query space: template-based SPARQL with LLM-filled slots, schema-prompted context injection, or the exploration-first approach that GRASP takes.
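A minimal sketch of the first defensive option, template-based SPARQL with LLM-filled slots, might look like the following. The template registry, the `ex:` vocabulary, and the slot validators are hypothetical; the idea is only that the model supplies slot values and never writes query structure.

```python
import re

# Hypothetical template registry. The `ex:` predicates are placeholders.
TEMPLATES = {
    "papers_by_author": (
        "PREFIX ex: <http://example.org/>\n"
        "SELECT ?paper WHERE {{ ?paper ex:author ex:{author} . }} LIMIT {limit}"
    ),
}

# Each slot accepts only a narrow pattern, so a model-supplied value
# cannot smuggle extra triple patterns or update clauses into the query.
SLOT_VALIDATORS = {
    "author": re.compile(r"[A-Za-z_][A-Za-z0-9_]*"),
    "limit": re.compile(r"[1-9][0-9]{0,2}"),
}


def fill_template(name: str, slots: dict[str, str]) -> str:
    """Validate every slot value, then substitute it into the named template."""
    template = TEMPLATES[name]
    for key, value in slots.items():
        validator = SLOT_VALIDATORS.get(key)
        if validator is None or not validator.fullmatch(value):
            raise ValueError(f"rejected slot {key!r}: {value!r}")
    return template.format(**slots)


# In a real pipeline the slot values would come from the LLM's structured output.
query = fill_template("papers_by_author", {"author": "ada_lovelace", "limit": "10"})
print(query)
```

The trade-off is coverage: templates only answer the question shapes you anticipated, which is where schema-prompted or exploration-first approaches take over.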

4. Agentic SPARQL: MCP-Powered Federated Query Generation

Single-endpoint SPARQL over a local knowledge graph is a largely solved problem. The harder case is federated querying: a question whose answer requires joining facts across multiple distributed knowledge graphs at different endpoints, potentially with different schemas.

Agentic SPARQL (arXiv:2603.06582v2) addresses federated KGQA by combining SPARQL federation with an agentic architecture. The key design choice is exposing SPARQL federation capabilities as MCP (Model Context Protocol) tools that the agent can invoke during reasoning. Rather than generating one monolithic federated query, the agent issues targeted sub-queries to individual endpoints and synthesizes results across calls.

This makes federation robust to schema heterogeneity: the agent can inspect the schema of each endpoint independently, adapt query patterns per endpoint, and aggregate results without requiring a unified global schema. The evaluation uses the Federated KGQA Benchmark, which tests across multiple distributed endpoints with realistic cross-graph questions.

The harness connection is immediate: the MCP Dispatch Router (Post 8) already routes subtasks to remote instances. A SPARQL MCP server that exposes endpoint-specific query tools would integrate directly into that architecture, with the agent querying `sparql://wikidata`, `sparql://dbpedia`, and domain-specific endpoints through the same routing layer it already uses for other remote tools.
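As a rough sketch of that routing idea, the snippet below registers two stub endpoints under `sparql://` names and joins their results in the agent rather than in one monolithic federated query. Everything here is hypothetical: the stubs return canned rows instead of calling a real SPARQL service, and the registry is a plain dictionary, not the MCP SDK or the paper's tool set.

```python
from typing import Callable

Row = dict[str, str]


# Stub endpoints standing in for remote SPARQL services. Data is illustrative only.
def query_graph_a(city: str) -> list[Row]:
    data = {"city_x": [{"city": "city_x", "country": "country_y"}]}
    return data.get(city, [])


def query_graph_b(country: str) -> list[Row]:
    data = {"country_y": [{"country": "country_y", "currency": "currency_z"}]}
    return data.get(country, [])


# Hypothetical tool registry: one entry per endpoint, addressed by name.
TOOLS: dict[str, Callable[[str], list[Row]]] = {
    "sparql://graph_a": query_graph_a,
    "sparql://graph_b": query_graph_b,
}


def call_tool(uri: str, argument: str) -> list[Row]:
    """Dispatch a sub-query to the endpoint registered under `uri`."""
    return TOOLS[uri](argument)


def currency_of_city(city: str) -> list[Row]:
    """Answer a cross-graph question with targeted sub-queries joined client-side."""
    results: list[Row] = []
    for row_a in call_tool("sparql://graph_a", city):
        for row_b in call_tool("sparql://graph_b", row_a["country"]):
            results.append({**row_a, **row_b})
    return results


print(currency_of_city("city_x"))
```

Because each endpoint is queried on its own terms, neither graph needs to know the other's schema; the join key (`country` here) is the only shared assumption.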

5. GRASP: Zero-Shot SPARQL via Exploratory IRI Search

Fine-tuning LLMs to generate SPARQL for a specific knowledge graph works, but it has a well-known failure mode: the fine-tuned model memorizes the endpoint's entity URIs and relation IRIs. When the endpoint schema changes, or when you want to query a different graph, you retrain.

GRASP (arXiv:2507.08107v2) takes a different approach entirely. Rather than generating a complete SPARQL query from scratch, GRASP uses the LLM to explore the knowledge graph: it iteratively executes targeted SPARQL queries to find relevant IRIs and literals, then uses those discovered IRIs to build the final answer query. The process is zero-shot — no fine-tuning, no dataset-specific training.

GRASP's iterative IRI exploration: the LLM issues search queries to discover valid IRIs, then constructs the final answer query from confirmed identifiers.

Results. On Wikidata, GRASP achieves state-of-the-art results in a zero-shot setting across multiple benchmarks. On Freebase, it performs close to the best few-shot methods despite having no training signal. On less commonly evaluated knowledge graphs, it performs well overall — suggesting the exploration strategy generalizes across endpoint types, not just the graphs it was developed on.

The insight is that LLMs are better at exploratory search than at memorizing IRI namespaces. If you let the model discover the right identifiers before committing them to a query, you avoid the namespace-memorization failure mode while retaining the semantic reasoning that makes LLMs useful for natural-language question answering.
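The explore-then-commit idea can be illustrated with a toy version. This is a sketch under stated assumptions, not GRASP's implementation: a small in-memory label index stands in for the search queries GRASP issues against a live endpoint, and the function names are invented for the example. The two Wikidata identifiers are real (`Q183` is Germany, `P36` is "capital").

```python
from typing import Optional

# Toy label index standing in for exploratory search against a live endpoint.
LABEL_INDEX = {
    "entity": {"germany": "wd:Q183"},
    "property": {"capital": "wdt:P36"},
}


def search_iri(kind: str, label: str) -> Optional[str]:
    """Return a confirmed IRI for `label`, or None if exploration finds nothing."""
    return LABEL_INDEX[kind].get(label.lower())


def build_query(subject_label: str, property_label: str) -> str:
    """Build the final query only from identifiers that exploration confirmed."""
    subject = search_iri("entity", subject_label)
    prop = search_iri("property", property_label)
    if subject is None or prop is None:
        # Refuse to guess: an invented IRI yields a valid query with wrong results.
        raise LookupError("no confirmed IRI; explore further instead of guessing")
    return (
        "PREFIX wd: <http://www.wikidata.org/entity/>\n"
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
        f"SELECT ?answer WHERE {{ {subject} {prop} ?answer . }}"
    )


print(build_query("Germany", "capital"))
```

The `LookupError` branch is the important part: when a label cannot be resolved, the loop should search again rather than let the model fill in an identifier from memory.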

For the harness, GRASP's pattern is a natural fit: the Planner-First component (Post 6) already decomposes tasks into targeted sub-queries. A SPARQL-capable planner using GRASP-style IRI exploration could handle structured knowledge retrieval without requiring a pre-built index of the endpoint's vocabulary.

6. PharmGraph-Auditor: When Grounding Is a Safety Requirement

The previous five papers are about capability — making LLMs better at querying and populating knowledge graphs. PharmGraph-Auditor (arXiv:2603.10891v1) is about a domain where capability is insufficient without traceability: prescription verification.

Direct application of LLMs to prescription auditing fails for three specific reasons: factual unreliability (hallucinated drug interactions), lack of traceability (no audit trail linking decisions to sources), and weakness in complex relational reasoning (drug-drug interactions are graph problems, not text problems). A model that says "this combination is safe" without a source citation is not usable in a clinical setting.

PharmGraph-Auditor's architecture addresses all three with a Hybrid Pharmaceutical Knowledge Base (HPKB) implemented under the Virtual Knowledge Graph paradigm. The HPKB unifies:

Relational constraints: contraindications, dosage limits, allergy interactions encoded in a relational schema
Graph-based topological reasoning: drug-drug interaction networks, metabolic pathway graphs, pharmacokinetic relationships

Schema co-evolution is handled by an Iterative Schema Refinement algorithm that updates the HPKB schema as new medical texts are ingested, without requiring manual schema redesign. The LLM's role is transformed from answer generator to reasoning engine over KB-grounded facts via a KB-grounded Chain of Verification (CoV): each claim in a prescription audit trace is verified against the HPKB before it contributes to the final verdict.

Why this matters beyond pharma. KB-grounded CoV is a general pattern, not a pharmaceutical-specific one. Any domain where answer claims must be traceable to authoritative sources — legal reasoning, financial compliance, scientific fact-checking — can apply the same architecture. The chain-of-verification structure is also directly compatible with the harness's ReAct loop: each verification step is a tool call that returns a grounded fact or a contradiction signal.
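A stripped-down version of the verification step might look like this. It is a sketch of the general pattern only: the knowledge base is a dictionary, the drug names and source identifiers are placeholders rather than medical facts, and the `grounded`/`unsupported` statuses are an assumed simplification of whatever signals the paper's HPKB returns.

```python
from dataclasses import dataclass
from typing import Optional

Claim = tuple[str, str, str]

# Hypothetical KB: each fact maps to the source that supports it.
# Drug names and source ids are placeholders, not medical facts.
KB: dict[Claim, str] = {
    ("drug_a", "interacts_with", "drug_b"): "source_doc_17",
    ("drug_a", "max_daily_mg", "400"): "source_doc_03",
}


@dataclass
class Verification:
    claim: Claim
    status: str            # "grounded" or "unsupported"
    source: Optional[str]  # audit-trail pointer, None when unsupported


def verify(claim: Claim) -> Verification:
    """One verification step: look the claim up and attach its source."""
    source = KB.get(claim)
    status = "grounded" if source is not None else "unsupported"
    return Verification(claim, status, source)


def audit(claims: list[Claim]) -> tuple[str, list[Verification]]:
    """Pass only if every claim in the trace is grounded in the KB."""
    trace = [verify(claim) for claim in claims]
    all_grounded = all(v.status == "grounded" for v in trace)
    return ("pass" if all_grounded else "needs_review"), trace


verdict, trace = audit([
    ("drug_a", "interacts_with", "drug_b"),
    ("drug_a", "safe_with", "drug_c"),  # not in the KB, so it cannot be asserted
])
print(verdict)
for step in trace:
    print(step)
```

Every entry in `trace` carries either a source pointer or an explicit unsupported flag, so the final verdict can always be traced back claim by claim.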

Capability coverage across the six papers: which papers address zero-shot generalization, structured extraction, KG integration, SPARQL query generation, and grounded verification.

Reading the Cluster as a Progression

The six papers form a logical progression from measurement to mechanism to application:

LLM-KG-Bench establishes the baseline: LLMs can fix SPARQL syntax but often generate semantically wrong queries. This is the gap the other papers respond to.
ODA responds at the agent level: don't generate SPARQL at all — use a structured observe-act-reflect loop that lets the LLM reason over KG facts without writing formal queries.
GRASP responds at the query level: let the LLM discover IRIs via exploration before committing to a final query, eliminating the namespace-memorization failure mode.
Agentic SPARQL extends query generation to federated multi-endpoint settings via MCP tool exposure, making the architecture composable with existing agentic infrastructure.
ETLCH closes the data-population loop: once you have query generation working, you need efficient extraction to keep the KG populated from new documents.
PharmGraph-Auditor shows what a safety-critical deployment looks like: KB-grounded verification with full traceability, iterative schema refinement, and a hybrid relational+graph storage model.

The architectural takeaway. A mature harness memory layer looks less like a vector store and more like the PharmGraph-Auditor HPKB: structured schema for typed facts, graph topology for relational reasoning, and KB-grounded verification before any fact enters the output. The GRASP and Agentic SPARQL patterns handle query-time access; ETLCH handles ingestion-time extraction; ODA handles planning-time KG integration. These aren't competing designs — they occupy different positions in a layered architecture.

Harness Integration Points

Paper	Harness component	Integration pattern
ODA	Planner-First + Memory Store	Replace flat retrieval with recursive KG traversal during observe phase
ETLCH	Literature Review Pipeline	Swap annotation step for 1B LoRA extractor: abstract → structured JSONL
LLM-KG-Bench	Evaluation Framework	Benchmark any SPARQL generation component before deploying to production
Agentic SPARQL	MCP Dispatch Router	Register SPARQL endpoints as MCP tools; route KG queries through existing dispatcher
GRASP	Planner-First query generation	Replace static SPARQL templates with IRI-exploration loop over target endpoint
PharmGraph-Auditor	Verification + Observability	Add KB-grounded CoV to Wiggum evaluator for traceable fact-level verification
