Structured Knowledge Queries: Ontology, SPARQL, and Grounded Verification
Post 18 showed that vector stores can be augmented with graph layers. This post goes one level deeper: six papers that tackle structured query generation, low-resource ontology extraction, and KB-grounded verification — each addressing a different gap between how LLMs reason and how knowledge graphs are actually queried.
The LLM-KG Integration Gap
Knowledge graphs encode facts with machine-queryable structure. Language models encode facts with parametric weights. The combination is appealing in theory — KGs provide grounding and traceability; LLMs provide the natural-language interface. The integration is hard in practice because the query layer requires formal SPARQL, and LLMs were not trained to write semantically correct SPARQL at scale.
This post covers six papers that each address a different slice of that gap:
| Paper | Gap addressed | Key result |
|---|---|---|
| ODA [2404.07677] | LLMs ignore KG reasoning potential | +12.87% / +8.9% accuracy over baselines |
| ETLCH [2509.08381] | Structured extraction requires expensive models | 1B-param LoRA beats baselines on 100–1000 samples |
| LLM-KG-Bench [2409.05925] | SPARQL capability is unmeasured | Syntax: fine. Semantic-create: still hard. |
| Agentic SPARQL [2603.06582] | Single-endpoint KGQA misses federation | MCP-powered federated query across distributed KGs |
| GRASP [2507.08107] | Fine-tuning SPARQL generation is dataset-specific | Zero-shot SOTA on Wikidata; near best few-shot on Freebase |
| PharmGraph-Auditor [2603.10891] | LLMs lack traceability for safety-critical verification | HPKB + KB-grounded CoV enables traceable prescription auditing |
1. ODA: Recursive Observation Handles Knowledge Explosion
The Observation-Driven Agent (ODA, arXiv:2404.07677v2) starts from a diagnosis that still holds: when you connect an LLM to a KG, the LLM's tendency is to ignore what the KG offers and rely on parametric memory instead. Even when you inject KG facts into the prompt, the LLM selects only the first few hops before context pressure forces it to stop. The resulting answers look grounded but are often missing the third-order facts that change the answer.
ODA's architecture is a three-phase cycle: observe → act → reflect. In the observe phase, the agent executes a global traversal of the KG centered on the query entities. Rather than pulling a fixed-depth neighborhood, it uses a recursive observation mechanism that expands selectively: when a retrieved node contains concepts relevant to the current question state, the mechanism recurses into that node's neighborhood before returning. Nodes that don't contribute to the current reasoning thread are pruned. This controls the combinatorial explosion that makes naive KG traversal unworkable at scale.
The act phase executes operations against the pruned subgraph. The reflect phase checks whether the current answer is consistent with the accumulated observations, or whether a new traversal arc is needed. The cycle repeats until consistency is achieved.
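The cycle above can be sketched on a toy graph. This is a minimal illustration, not ODA's implementation: the triples, the `relevant_relations` filter standing in for the LLM's relevance judgment, and every function name are assumptions made for the example.

```python
from collections import defaultdict

# Toy KG as (subject, relation, object) triples; purely illustrative data.
TRIPLES = [
    ("Marie_Curie", "spouse", "Pierre_Curie"),
    ("Marie_Curie", "born_in", "Warsaw"),
    ("Pierre_Curie", "award", "Nobel_Prize_Physics"),
    ("Warsaw", "capital_of", "Poland"),
    ("Nobel_Prize_Physics", "field", "Physics"),
]


def build_index(triples):
    """Index outgoing edges by subject for cheap neighborhood lookups."""
    index = defaultdict(list)
    for subject, relation, obj in triples:
        index[subject].append((relation, obj))
    return index


def observe(index, entity, relevant_relations, max_depth, seen=None):
    """Recurse only through edges judged relevant; everything else is pruned."""
    seen = set() if seen is None else seen
    if max_depth == 0 or entity in seen:
        return []
    seen.add(entity)
    observations = []
    for relation, obj in index.get(entity, []):
        if relation not in relevant_relations:
            continue  # pruned: does not contribute to the current reasoning thread
        observations.append((entity, relation, obj))
        observations.extend(
            observe(index, obj, relevant_relations, max_depth - 1, seen)
        )
    return observations


def observe_act_reflect(index, entity, relevant_relations, target_relation, max_rounds=3):
    """Widen the traversal one hop per round until an answer is found."""
    observations = []
    for depth in range(1, max_rounds + 1):
        observations = observe(index, entity, relevant_relations, depth)  # observe
        candidates = [o for (_, r, o) in observations if r == target_relation]  # act
        if candidates:  # reflect: stop once the observations support an answer
            return candidates[0], observations
    return None, observations
```

For the question "which award did Marie Curie's spouse win?", calling `observe_act_reflect(build_index(TRIPLES), "Marie_Curie", {"spouse", "award"}, "award")` needs two rounds: the first hop only reaches the spouse, the second reaches the award, and the `born_in` branch is never expanded.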
The harness implication is specific: the Dual-Backend Memory Store currently retrieves flat lists of semantically similar chunks. ODA's observe–act–reflect cycle is a planner-level pattern that could sit above retrieval — deciding which KG neighborhood to expand at each planning step rather than issuing a single vector query. The recursive observation mechanism is architecturally similar to the harness's existing ReAct Comparator, which also loops until an internal consistency criterion is met.
2. ETLCH: Billion-Parameter LoRA Beats the Big Models on Low Data
Structured information extraction from text — converting unstructured documents into JSON objects, knowledge graph triples, or named-entity spans — has historically been owned by either hand-engineered pipelines or very large instruction-tuned models. The assumption is that small models can't handle multi-schema outputs reliably without massive training data.
ETLCH (arXiv:2509.08381v1) challenges that assumption directly. The work fine-tunes a one-billion-parameter LLaMA-based model with low-rank adaptation (LoRA) on three structured extraction tasks simultaneously:
- JSON extraction: convert free-form text into schema-conforming JSON objects
- Knowledge graph extraction: identify entity-relation-entity triples from documents
- Named entity recognition: label entity spans with ontology-grounded types
The low-resource constraint is the headline: training set sizes range from 100 to 1,000 samples per task. At every data scale tested, ETLCH outperforms strong baselines across most evaluation metrics, including baselines that use 7B+ parameter models without LoRA constraint.
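Part of why a few hundred samples can be enough is how little LoRA actually trains. For a frozen weight matrix of shape `d_out × d_in`, a rank-`r` adapter adds two small matrices with `r * (d_out + d_in)` trainable parameters in total. The sketch below just does that arithmetic; the 2048-wide layer and rank 8 are illustrative assumptions, not ETLCH's reported configuration.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Adapter parameters relative to the frozen full-rank weight matrix."""
    return lora_trainable_params(d_out, d_in, rank) / (d_out * d_in)


# Hypothetical layer: a 2048 x 2048 projection adapted at rank 8.
adapter_params = lora_trainable_params(2048, 2048, 8)
fraction = trainable_fraction(2048, 2048, 8)
print(adapter_params, fraction)
```

Under these assumed dimensions the adapter holds 32,768 parameters, under 1% of the 4.2M in the frozen matrix, which is the kind of capacity a 100 to 1,000 sample training set can plausibly fit.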
For the harness, this is the extraction analogue of the SLM efficiency finding from Post 15: model size is not the primary lever for structured extraction quality once you have domain-specific LoRA fine-tuning. The Literature Review Pipeline's annotation step could be upgraded with an ETLCH-style extraction layer that converts fetched paper abstracts directly into structured JSONL records for the knowledge store.
3. LLM-KG-Bench: Syntax Is Easy, Semantics Remains Hard
Before deciding how to use LLMs to generate SPARQL, it helps to know what they can actually do. LLM-KG-Bench (arXiv:2409.05925v2) fills that measurement gap by implementing a benchmarking framework that evaluates GPT, Gemini, and Claude models across four capability dimensions:
- Syntax correction: fix a malformed SPARQL query
- Semantic read: generate a SELECT query that correctly retrieves specified facts
- Semantic create: generate queries that update or insert triples
- KG-prompted: generate queries when relevant schema fragments are included in context
The findings reveal a consistent asymmetry. Syntax correction is essentially solved by the best current models. Semantic read is feasible on simple patterns but degrades with query complexity. Semantic create — requiring the model to reason about update semantics, graph naming conventions, and constraint satisfaction simultaneously — remains difficult across all tested models.
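The gap between syntax and semantics is easy to see once you score a query by what it returns rather than whether it parses. The function below is one plausible way to do that, comparing result rows as sets and reporting F1. It is an assumption for illustration and not necessarily the scoring LLM-KG-Bench itself uses.

```python
def result_set_f1(expected_rows, actual_rows) -> float:
    """F1 between the rows a reference query returns and the rows a generated query returns."""
    expected, actual = set(expected_rows), set(actual_rows)
    if not expected and not actual:
        return 1.0  # both empty: the generated query agrees with the reference
    true_positives = len(expected & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(actual)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


# Rows are tuples of bound values; the data is made up for the example.
reference = [("alice",), ("bob",)]
generated = [("alice",), ("carol",)]  # a query that parses fine but matches the wrong pattern
print(result_set_f1(reference, generated))
```

A generated query can be perfectly valid SPARQL and still score 0.5 here, which is the failure a syntax-only check would miss.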
4. Agentic SPARQL: MCP-Powered Federated Query Generation
Single-endpoint SPARQL over a local knowledge graph is largely a solved problem. The harder case is federated querying: a question whose answer requires joining facts across multiple distributed knowledge graphs at different endpoints, potentially with different schemas.
Agentic SPARQL (arXiv:2603.06582v2) addresses federated KGQA by combining SPARQL federation with an agentic architecture. The key design choice is exposing SPARQL federation capabilities as MCP (Model Context Protocol) tools that the agent can invoke during reasoning. Rather than generating one monolithic federated query, the agent issues targeted sub-queries to individual endpoints and synthesizes results across calls.
This makes federation robust to schema heterogeneity: the agent can inspect the schema of each endpoint independently, adapt query patterns per endpoint, and aggregate results without requiring a unified global schema. The evaluation uses the Federated KGQA Benchmark, which tests across multiple distributed endpoints with realistic cross-graph questions.
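A minimal sketch of the sub-query-then-synthesize pattern, with in-memory tables standing in for remote endpoints. The endpoint names, schemas, data, and helper functions are all hypothetical; a real deployment would issue SPARQL over HTTP through MCP tools rather than filter Python lists.

```python
# Two stand-in "endpoints" with deliberately different schemas.
CITY_ENDPOINT = [
    {"city": "Lyon", "country_code": "FR"},
    {"city": "Zurich", "country_code": "CH"},
]
COUNTRY_ENDPOINT = [
    {"iso": "FR", "currency": "EUR"},
    {"iso": "CH", "currency": "CHF"},
]


def make_tool(rows):
    """Wrap a table as a callable tool that filters rows by exact field match."""
    def tool(**filters):
        return [r for r in rows if all(r.get(k) == v for k, v in filters.items())]
    return tool


# The agent sees one tool per endpoint instead of one global schema.
TOOLS = {"cities": make_tool(CITY_ENDPOINT), "countries": make_tool(COUNTRY_ENDPOINT)}


def currency_of_city(city: str):
    """Issue one targeted sub-query per endpoint, then join the results client-side."""
    answers = []
    for city_row in TOOLS["cities"](city=city):
        # Per-endpoint adaptation: this endpoint calls the join key `iso`.
        for country_row in TOOLS["countries"](iso=city_row["country_code"]):
            answers.append({"city": city, "currency": country_row["currency"]})
    return answers


print(currency_of_city("Zurich"))
```

The join key is named `country_code` on one side and `iso` on the other, and nothing forces the two endpoints to agree: the mapping lives in the agent's per-endpoint call, which is the property the schema-heterogeneity argument relies on.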
The harness connection is immediate: the MCP Dispatch Router (Post 8) already routes subtasks to remote instances. A SPARQL MCP server that exposes endpoint-specific query tools would integrate directly into that architecture: the agent queries `sparql://wikidata`, `sparql://dbpedia`, and domain-specific endpoints through the same routing layer it already uses for other remote tools.
5. GRASP: Zero-Shot SPARQL via Exploratory IRI Search
Fine-tuning LLMs to generate SPARQL for a specific knowledge graph works, but it has a well-known failure mode: the fine-tuned model memorizes the endpoint's entity URIs and relation IRIs. When the endpoint schema changes, or when you want to query a different graph, you retrain.
GRASP (arXiv:2507.08107v2) takes a different approach entirely. Rather than generating a complete SPARQL query from scratch, GRASP uses the LLM to explore the knowledge graph: it iteratively executes targeted SPARQL queries to find relevant IRIs and literals, then uses those discovered IRIs to build the final answer query. The process is zero-shot — no fine-tuning, no dataset-specific training.
The insight is that LLMs are better at exploratory search than at memorizing IRI namespaces. If you let the model discover the right identifiers before committing them to a query, you avoid the namespace-memorization failure mode while retaining the semantic reasoning that makes LLMs useful for natural-language question answering.
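The explore-then-commit idea can be sketched in a few lines. This is not GRASP's code: the in-memory `LABEL_INDEX` stands in for the search queries an agent would run against a live endpoint, the function names are invented for the example, and picking the first hit replaces the LLM's choice among candidates. The two Wikidata IRIs shown are real (Q42 is Douglas Adams, P19 is place of birth).

```python
# Stand-in for label search against an endpoint: label -> IRI.
LABEL_INDEX = {
    "douglas adams": "http://www.wikidata.org/entity/Q42",
    "place of birth": "http://www.wikidata.org/prop/direct/P19",
    "date of birth": "http://www.wikidata.org/prop/direct/P569",
}


def search_iris(keyword: str):
    """Exploration step: return (label, IRI) candidates whose label contains the keyword."""
    keyword = keyword.lower()
    return [(label, iri) for label, iri in LABEL_INDEX.items() if keyword in label]


def build_query(entity_keyword: str, property_keyword: str):
    """Commit step: only IRIs that were actually discovered are written into the query."""
    entity_hits = search_iris(entity_keyword)
    property_hits = search_iris(property_keyword)
    if not entity_hits or not property_hits:
        return None  # nothing discovered, so no IRI is guessed
    entity_iri = entity_hits[0][1]
    property_iri = property_hits[0][1]
    return f"SELECT ?value WHERE {{ <{entity_iri}> <{property_iri}> ?value }}"


print(build_query("adams", "place of birth"))
```

The important behavior is the `None` branch: when exploration finds no matching identifier, the sketch declines to emit a query instead of inventing an IRI from memory.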
For the harness, GRASP's pattern is a natural fit: the Planner-First component (Post 6) already decomposes tasks into targeted sub-queries. A SPARQL-capable planner using GRASP-style IRI exploration could handle structured knowledge retrieval without requiring a pre-built index of the endpoint's vocabulary.
6. PharmGraph-Auditor: When Grounding Is a Safety Requirement
The previous five papers are about capability — making LLMs better at querying and populating knowledge graphs. PharmGraph-Auditor (arXiv:2603.10891v1) is about a domain where capability is insufficient without traceability: prescription verification.
Direct application of LLMs to prescription auditing fails for three specific reasons: factual unreliability (hallucinated drug interactions), lack of traceability (no audit trail linking decisions to sources), and weakness in complex relational reasoning (drug-drug interactions are graph problems, not text problems). A model that says "this combination is safe" without a source citation is not usable in a clinical setting.
PharmGraph-Auditor's architecture addresses all three with a Hybrid Pharmaceutical Knowledge Base (HPKB) implemented under the Virtual Knowledge Graph paradigm. The HPKB unifies:
- Relational constraints: contraindications, dosage limits, allergy interactions encoded in a relational schema
- Graph-based topological reasoning: drug-drug interaction networks, metabolic pathway graphs, pharmacokinetic relationships
Schema co-evolution is handled by an Iterative Schema Refinement algorithm that updates the HPKB schema as new medical texts are ingested, without requiring manual schema redesign. The LLM's role is transformed from answer generator to reasoning engine over KB-grounded facts via a KB-grounded Chain of Verification (CoV): each claim in a prescription audit trace is verified against the HPKB before it contributes to the final verdict.
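A KB-grounded verification step might look roughly like the sketch below. It is an illustration under loud assumptions: the toy knowledge base, the placeholder drug names, the source identifiers, and the reject-when-unknown policy are all invented here and are not taken from PharmGraph-Auditor.

```python
from dataclasses import dataclass
from typing import Optional

# Toy KB: unordered drug pair -> (status, source id). Placeholder names, not medical data.
TOY_KB = {
    frozenset({"drug_a", "drug_b"}): ("contraindicated", "toy-kb/interactions#1"),
}


@dataclass
class Verdict:
    claim: tuple
    supported: bool
    source: Optional[str]  # audit-trail pointer; None when the KB has no entry


def verify_claim(drug_1: str, drug_2: str, asserted_status: str) -> Verdict:
    """Check one model-generated claim against the KB and record where the evidence lives."""
    claim = (drug_1, drug_2, asserted_status)
    entry = TOY_KB.get(frozenset({drug_1, drug_2}))
    if entry is None:
        return Verdict(claim, False, None)  # assumed policy: ungrounded claims are rejected
    status, source = entry
    return Verdict(claim, status == asserted_status, source)


def audit(claims):
    """A trace passes only if every claim in it is supported by a KB entry."""
    verdicts = [verify_claim(*claim) for claim in claims]
    return all(v.supported for v in verdicts), verdicts


passed, trace = audit([("drug_b", "drug_a", "contraindicated")])
print(passed, trace[0].source)
```

Each verdict carries its `source`, so a reviewer can follow any accepted claim back to a KB row, and a claim the KB cannot ground fails the audit rather than passing on the model's say-so.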
Reading the Cluster as a Progression
The six papers form a logical progression from measurement to mechanism to application:
- LLM-KG-Bench establishes the baseline: LLMs can fix SPARQL syntax and handle simple read queries, but accuracy degrades as queries grow more complex and semantic-create queries remain hard. This is the gap the other papers respond to.
- ODA responds at the agent level: don't generate SPARQL at all — use a structured observe-act-reflect loop that lets the LLM reason over KG facts without writing formal queries.
- GRASP responds at the query level: let the LLM discover IRIs via exploration before committing to a final query, eliminating the namespace-memorization failure mode.
- Agentic SPARQL extends query generation to federated multi-endpoint settings via MCP tool exposure, making the architecture composable with existing agentic infrastructure.
- ETLCH closes the data-population loop: once you have query generation working, you need efficient extraction to keep the KG populated from new documents.
- PharmGraph-Auditor shows what a safety-critical deployment looks like: KB-grounded verification with full traceability, iterative schema refinement, and a hybrid relational+graph storage model.
Harness Integration Points
| Paper | Harness component | Integration pattern |
|---|---|---|
| ODA | Planner-First + Memory Store | Replace flat retrieval with recursive KG traversal during observe phase |
| ETLCH | Literature Review Pipeline | Swap annotation step for 1B LoRA extractor: abstract → structured JSONL |
| LLM-KG-Bench | Evaluation Framework | Benchmark any SPARQL generation component before deploying to production |
| Agentic SPARQL | MCP Dispatch Router | Register SPARQL endpoints as MCP tools; route KG queries through existing dispatcher |
| GRASP | Planner-First query generation | Replace static SPARQL templates with IRI-exploration loop over target endpoint |
| PharmGraph-Auditor | Verification + Observability | Add KB-grounded CoV to Wiggum evaluator for traceable fact-level verification |