Memory as Infrastructure: Quality-Weighted Retrieval and the Ontology Graph
Every harness run starts cold: the agent has no knowledge of the 200 research tasks it completed last month. The memory layer changes this — a dual-store system that compresses each run into an observation, ranks past observations by semantic similarity blended with output quality, and surfaces them as a context prefix before synthesis. This post covers the architecture, the ranking formula, the RLHF feedback loop, and the UMAP visualization that makes the knowledge base navigable.
The baseline agentic pipeline has a clear architectural problem: every run is independent. The agent searches the web, synthesizes a document, receives an evaluator score, and then the session ends. The next run starts with no knowledge of what was researched before: no awareness that an adjacent task was completed two days ago with a 9.1 score, no recall of the specific claims the evaluator praised or flagged. The pipeline is stateless by default, and statelessness is the enemy of improvement.
The memory system in harness/memory.py addresses this with two complementary ideas. First: every completed run is compressed into a structured observation and stored persistently. Second: before each new synthesis, the most relevant past observations are retrieved and injected into the synthesis prompt as context. The agent doesn't remember in the way a human does, but it can read its own history.
The Dual-Store Architecture
The core tension in agent memory design is between retrieval quality and retrieval reliability. Pure semantic search via vector embeddings gives high-quality similarity matching but degrades silently — if the embedding model changes, or ChromaDB's HNSW index drifts, or a library update shifts the embedding space, the system fails without any obvious signal. Pure keyword search (FTS5) is more reliable but misses synonyms and topically adjacent queries that semantic search handles well. The harness uses both, with a clear fallback hierarchy.
SQLite + FTS5
Source of truth for all observations. Schema: id, timestamp, task, task_type, title, narrative, facts (JSON array), output_path, final_score, final, run_id, quality (integer ±3), tags. FTS5 virtual table (observations_fts) indexes task, title, narrative, and facts with content='observations' for row-join retrieval. WAL mode enabled for concurrent read/write safety. The FTS5 path is the fallback when ChromaDB is unavailable, and the source of ground truth for the auto-migration that keeps ChromaDB in sync.
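A minimal sketch of that layout, using only Python's standard sqlite3 module. The column types, the function name open_store, and the insert trigger are assumptions for illustration; the source only names the columns, the FTS5 table, the content='observations' option, and WAL mode.

```python
import sqlite3

def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the observations table plus an external-content FTS5 index."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")  # ignored for in-memory databases
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS observations (
            id INTEGER PRIMARY KEY,
            timestamp TEXT, task TEXT, task_type TEXT, title TEXT,
            narrative TEXT, facts TEXT,      -- facts: JSON array stored as text
            output_path TEXT, final_score REAL, final TEXT, run_id TEXT,
            quality INTEGER DEFAULT 0, tags TEXT
        );
        CREATE VIRTUAL TABLE IF NOT EXISTS observations_fts USING fts5(
            task, title, narrative, facts,
            content='observations', content_rowid='id'
        );
        -- Assumed: a trigger keeps the index in step with new rows.
        CREATE TRIGGER IF NOT EXISTS observations_ai
        AFTER INSERT ON observations BEGIN
            INSERT INTO observations_fts(rowid, task, title, narrative, facts)
            VALUES (new.id, new.task, new.title, new.narrative, new.facts);
        END;
        """
    )
    return conn
```

With an external-content table the FTS5 index stores only the tokens, so a match is joined back to observations by rowid to fetch the full row, which is the row-join retrieval the schema description refers to.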
ChromaDB + sentence-transformers
Persistent ChromaDB collection with cosine similarity (hnsw:space: cosine). Embedding model: all-MiniLM-L6-v2 (~22M parameters, 384-dimensional output, runs locally on CPU or CUDA, no API key required). Each observation is embedded as the concatenation of its title, narrative, and facts strings. At init time, the store compares SQLite count to ChromaDB count and backfills any missing observations in 50-row batches, so a ChromaDB wipe or backend migration doesn't silently leave the vector index stale.
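The backfill step can be sketched without either database. The version below is an illustration under assumptions: it diffs ids rather than counts, and add_batch is a hypothetical callable standing in for whatever embeds and writes a batch to the vector store.

```python
from typing import Callable, Iterable, Sequence

BATCH_SIZE = 50  # batch size stated in the text

def backfill_missing(
    sqlite_ids: Iterable[int],
    vector_ids: Iterable[int],
    add_batch: Callable[[Sequence[int]], None],
    batch_size: int = BATCH_SIZE,
) -> int:
    """Send ids present in SQLite but absent from the vector store, in batches.

    Returns the number of ids that were backfilled.
    """
    present = set(vector_ids)
    missing = sorted(i for i in set(sqlite_ids) if i not in present)
    for start in range(0, len(missing), batch_size):
        add_batch(missing[start:start + batch_size])
    return len(missing)
```

Diffing ids is slightly stricter than comparing counts: it also catches the case where both stores hold the same number of rows but not the same rows.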
The split between the two stores is deliberate: SQLite holds the full row including quality scores and feedback history, while ChromaDB holds only the embedding and minimal metadata. The retrieval path queries ChromaDB for the top candidates, then joins against SQLite to get the quality-weighted ranking fields. This keeps the vector index lean and ensures that quality adjustments made through the feedback UI are reflected immediately without needing to re-embed anything.
The Compression Pipeline
After every completed run, compress_and_store() is called with the full context: task string, task type, web search queries issued, output content, line/byte counts, Wiggum scores, and evaluator-flagged issues. The compression model (qwen3:8b by default, overridable via COMPRESS_MODEL) receives a structured prompt and is required to respond in a fixed three-field format:
Title: <one-line summary of what was researched and produced, max 80 chars>
Narrative: <2-3 sentences: what was researched, what the output contains, anything notable>
Facts: <JSON array of 3-5 specific factual strings worth remembering>
The compression prompt passes the first 600 characters of the output content as an excerpt — enough for the model to characterize what was produced without passing the full document. Search queries (up to 4) and the final Wiggum score are also included, so the model can describe what information was sought and how well the synthesis was received. Wiggum-flagged issues are appended to the facts array with a [wiggum] prefix, so downstream retrieval can surface the evaluator's specific objections alongside the observation itself.
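A parser for the three-field reply might look like the sketch below. The function names, the tolerance for fence lines, and the choice to fall back to an empty facts list on bad JSON are assumptions, not the harness's actual behavior.

```python
import json
import re

FENCE = "`" * 3  # markdown fence marker a model may wrap its reply in
FIELD = re.compile(r"^(Title|Narrative|Facts):\s*(.*)$")

def parse_compression(text: str) -> dict:
    """Parse a Title/Narrative/Facts reply; raise ValueError if a field is missing."""
    fields = {}
    current = None
    for line in text.strip().splitlines():
        if line.strip().startswith(FENCE):
            continue  # drop fence lines
        match = FIELD.match(line.strip())
        if match:
            current = match.group(1).lower()
            fields[current] = match.group(2)
        elif current:
            fields[current] += "\n" + line  # continuation of a multi-line field
    missing = {"title", "narrative", "facts"} - fields.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    try:
        facts = json.loads(fields["facts"])
    except json.JSONDecodeError:
        facts = []  # assumption: degrade to no facts rather than reject the run
    if not isinstance(facts, list):
        facts = []
    return {
        "title": fields["title"].strip()[:80],  # 80-char cap from the prompt format
        "narrative": fields["narrative"].strip(),
        "facts": [str(f) for f in facts],
    }

def with_wiggum_issues(facts: list, issues: list) -> list:
    """Append evaluator-flagged issues using the [wiggum] prefix."""
    return facts + [f"[wiggum] {issue}" for issue in issues]
```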
Before any observation is written to SQLite or ChromaDB, the title, narrative, and facts are passed through scan_for_injection(). Web-fetched and synthesized content can carry adversarial instructions designed to be injected into future sessions as trusted memory context — the injection scan blocks these writes before they reach the store.
Run-ID Provenance
Every observation written since the run_id column was introduced carries a run_id linking it to the RunTrace record in data/runs.jsonl that produced it. Because the target is a JSONL file, this is a logical reference rather than a foreign key SQLite can enforce. This bidirectional linkage is the basis for memory observability: from the Memory panel, a user can click any observation and navigate directly to the run that created it, seeing the full synthesis content, token counts, Wiggum score, and evaluator feedback that the compressed observation was derived from.
The reverse direction is equally useful. RunTrace.log_memory_hits(count, titles) records which observations were injected before synthesis, with their titles. After a run completes, data/runs.jsonl contains memory_hits and memory_context_titles — so the runs view can show, for any run, which past observations influenced it. This is the provenance chain: observation → run → synthesis → evaluation → new observation.
The run_id column was added via ALTER TABLE observations ADD COLUMN run_id TEXT at init time, with the existing rows left null. New runs populate it through compress_and_store(..., run_id=self.run_id). The provenance chain is therefore partial for observations written before the column was added, but complete going forward.
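An init-time migration like this needs to be safe to run on every start. One common way to do that, sketched here with a hypothetical helper name, is to check PRAGMA table_info before issuing the ALTER TABLE.

```python
import sqlite3

def ensure_run_id_column(conn: sqlite3.Connection) -> bool:
    """Add observations.run_id if it is absent. Returns True if it was added."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = {row[1] for row in conn.execute("PRAGMA table_info(observations)")}
    if "run_id" in columns:
        return False
    conn.execute("ALTER TABLE observations ADD COLUMN run_id TEXT")
    return True
```

Rows that existed before the migration read back with run_id as NULL, which matches the partial provenance described above.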
Quality-Weighted Retrieval
The default retrieval path uses ChromaDB to over-fetch 12 candidates (SEMANTIC_CANDIDATES = 12), then re-ranks them before returning the top 4 (MAX_CONTEXT_OBSERVATIONS = 4). The re-ranking formula blends cosine similarity with the observation's final score and its accumulated quality signal:
sim is the cosine similarity: 1.0 - (distance / 2.0), converting ChromaDB's cosine distance (range [0, 2]) to a [0, 1] similarity. qual_weight is derived from the observation's Wiggum final score, with an extra 0.5 penalty factor for scores below 7.0 and a neutral default when the score is missing:
if raw_score is not None and raw_score < 7.0:
    qual = (raw_score / 10.0) * 0.5  # penalise low-quality runs
else:
    qual = (raw_score or 5.0) / 10.0  # neutral default if score missing
A run that scored 6.0 gets a quality weight of 0.30 rather than 0.60 — it can still be retrieved if it's highly semantically relevant, but it won't displace a higher-quality observation of similar relevance. The rationale is that a low-scoring observation, even on a topically adjacent task, may reflect a research failure that shouldn't anchor the current synthesis.
q_adjust is the accumulated human-feedback multiplier: max(0.2, 1.0 + quality * 0.15). A quality score of +3 (the maximum, reached after three net upvotes) multiplies the rank by 1.45. A quality score of −3 multiplies it by 0.55: the observation is still retrievable but needs significantly higher semantic similarity to displace a neutral one. Because quality is clamped to [−3, +3], the multiplier never drops below 0.55 in practice; the 0.2 floor is a safeguard that keeps a negatively-rated observation from being zeroed out even if the clamp were widened.
| Quality | q_adjust | Effect on rank |
|---|---|---|
| +3 (max upvotes) | 1.45 | Retrieval strongly preferred |
| +1 | 1.15 | Mild preference boost |
| 0 (default) | 1.00 | Rank driven by sim + score only |
| −1 | 0.85 | Mild suppression |
| −3 (max downvotes) | 0.55 | Strongly deprioritized |
After ranking, the results are deduplicated by title — if two observations have the same title (e.g., the same task was run twice), only the highest-ranked one is kept. This prevents the context injection from repeating near-identical observations that would waste the 4-slot budget.
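The pieces described in this section (similarity conversion, quality weight, feedback multiplier, title deduplication, top-4 cut) can be put together as follows. This is a sketch, not the harness code: the Candidate type and function names are hypothetical, and the blend weights W_SIM and W_QUAL are placeholders, since the text here does not state how sim and qual_weight are combined.

```python
from dataclasses import dataclass
from typing import List, Optional

SEMANTIC_CANDIDATES = 12       # over-fetch size, from the text
MAX_CONTEXT_OBSERVATIONS = 4   # context slots, from the text
W_SIM, W_QUAL = 0.7, 0.3       # HYPOTHETICAL blend weights, not from the source

@dataclass
class Candidate:
    title: str
    distance: float            # ChromaDB cosine distance, in [0, 2]
    final_score: Optional[float]
    quality: int = 0           # accumulated feedback, in [-3, +3]

def qual_weight(raw_score: Optional[float]) -> float:
    if raw_score is not None and raw_score < 7.0:
        return (raw_score / 10.0) * 0.5   # penalise low-quality runs
    return (raw_score or 5.0) / 10.0      # neutral default if score missing

def q_adjust(quality: int) -> float:
    return max(0.2, 1.0 + quality * 0.15)

def rank(c: Candidate) -> float:
    sim = 1.0 - (c.distance / 2.0)        # cosine distance -> [0, 1] similarity
    return (W_SIM * sim + W_QUAL * qual_weight(c.final_score)) * q_adjust(c.quality)

def rerank(candidates: List[Candidate]) -> List[Candidate]:
    """Rank the over-fetched candidates, keep the best per title, return the top 4."""
    best_by_title = {}
    ordered = sorted(candidates[:SEMANTIC_CANDIDATES], key=rank, reverse=True)
    for c in ordered:
        best_by_title.setdefault(c.title, c)  # first seen is the highest ranked
    return list(best_by_title.values())[:MAX_CONTEXT_OBSERVATIONS]
```

Deduplicating after sorting means the surviving copy of a repeated title is always the higher-ranked one, so a rerun of the same task cannot occupy two of the four slots.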
If ChromaDB is unavailable (import error, corrupted index, dimension mismatch from a model change), the system falls through to _search_fts(), which runs a BM25-ranked FTS5 query built from the non-stopword content terms of the task string — up to 8 terms, deduped and OR-joined. The FTS5 fallback is slower and less precise but never fails silently.
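The query construction for the fallback path can be sketched as below. The stopword list, the tokenizer regex, and the function name are assumptions; the 8-term cap, deduplication, and OR-join come from the description above.

```python
import re

# Hypothetical stopword list; the harness's actual list is not shown in the text.
STOPWORDS = frozenset(
    {"a", "an", "and", "the", "of", "to", "in", "on", "for", "with", "is", "are", "what", "how"}
)
MAX_TERMS = 8  # cap stated in the text

def build_fts_query(task: str) -> str:
    """Build an OR-joined FTS5 MATCH expression from a task string."""
    terms = []
    for word in re.findall(r"[a-z0-9]+", task.lower()):
        if word in STOPWORDS or len(word) < 2 or word in terms:
            continue
        terms.append(word)
        if len(terms) == MAX_TERMS:
            break
    # Quote each term so it is treated as a plain string, never as FTS5 syntax.
    return " OR ".join(f'"{t}"' for t in terms)
```

The resulting string is meant to be passed as the parameter of a MATCH clause, with results ordered by FTS5's bm25() ranking. An empty return value means the task had no usable terms and the caller should skip the query.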
RLHF Quality Signals
The quality score is the mechanism by which human judgment about retrieval quality propagates back into the ranking. The Memory panel exposes thumbs-up/thumbs-down buttons on every observation detail view. Each click calls POST /api/memories/{id}/feedback, which increments or decrements the quality column by 1 (clamped to [−3, +3]) and appends a row to the memory_feedback_log table with the rating, an optional comment, and a timestamp.
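The update behind that endpoint amounts to a clamped increment plus a log insert in one transaction. The sketch below assumes column names for memory_feedback_log (memory_id, rating, comment, timestamp) and a helper name, apply_feedback, that are not confirmed by the source.

```python
import sqlite3
from datetime import datetime, timezone

def apply_feedback(conn: sqlite3.Connection, memory_id: int, rating: int, comment: str = "") -> int:
    """Apply a +1/-1 rating, clamp quality to [-3, +3], log it, return new quality."""
    if rating not in (1, -1):
        raise ValueError("rating must be +1 or -1")
    with conn:  # commit both statements together, or roll both back
        cur = conn.execute(
            "UPDATE observations SET quality = MAX(-3, MIN(3, quality + ?)) WHERE id = ?",
            (rating, memory_id),
        )
        if cur.rowcount == 0:
            raise KeyError(memory_id)
        conn.execute(
            "INSERT INTO memory_feedback_log (memory_id, rating, comment, timestamp) "
            "VALUES (?, ?, ?, ?)",
            (memory_id, rating, comment, datetime.now(timezone.utc).isoformat()),
        )
    row = conn.execute("SELECT quality FROM observations WHERE id = ?", (memory_id,)).fetchone()
    return row[0]
```

Note that the log keeps every click even when the clamp absorbs it, so the preference dataset stays complete after quality has saturated at ±3.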
The feedback log is not just for the quality adjustment. It is a labeled preference dataset: each row is a (memory_id, rating, comment) triple that records a human judgment about whether a specific compressed observation was useful or misleading. Over time, these labels can be used for fine-tuning the compression model — to preferentially generate observations of the kind that accumulate positive ratings, and avoid the patterns that accumulate negative ones. The log is a byproduct of normal workflow that accrues training signal at no additional cost.
Connection to DPO: The memory feedback log is structurally analogous to the preference pairs in scripts/hf_datasets/dpo.jsonl. Both record a human preference signal over LLM-generated text, although the log stores pointwise ratings per observation, so chosen/rejected pairs would have to be constructed from it before DPO training. The compression model's output (title + narrative + facts) is the kind of compact, structured generation that DPO fine-tuning can improve most directly, because the correct format is well-defined and the quality signal is explicit.
The UMAP Ontology Graph
The graph view in the Memory panel renders the knowledge base as a 2D scatter plot of UMAP-projected embeddings, with KMeans cluster coloring. The implementation lives entirely in MemoryStore.get_graph() and runs on demand when the graph tab is opened.
The projection uses umap.UMAP(n_components=2, random_state=42, low_memory=True) over the raw ChromaDB embeddings, with no intermediate dimensionality reduction step. For a store of N observations, KMeans uses k = min(8, max(1, N // 3)) clusters, capped at 8 to avoid micro-clusters at high observation counts. The cluster count scales with corpus size, so a small store (fewer than 24 observations) gets proportionally fewer topic groupings, for example k = 3 at N = 9, rather than 8 near-singleton clusters.
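To make the cluster-count rule concrete, here it is as a standalone function with a few worked values. The function name is hypothetical; the formula is the one quoted above.

```python
def cluster_count(n_observations: int) -> int:
    """KMeans k for a store of n observations: one cluster per 3 rows, between 1 and 8."""
    return min(8, max(1, n_observations // 3))

# Worked values: k grows with the store until it reaches the cap at N = 24.
for n in (2, 9, 20, 24, 600):
    print(n, cluster_count(n))
```

The inner max(1, ...) matters for stores with fewer than three observations, where N // 3 would otherwise give k = 0, a value KMeans does not accept.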
Each node in the returned JSON carries: 2D coordinates, cluster label, observation title, task type, and quality score. The frontend renders this as an interactive D3 scatter plot with zoom/pan, brush selection (for batch-selecting a region to review or prune), and per-node color coding by quality score — green nodes are positively-rated, red nodes negatively-rated, neutral nodes are dim. The graph makes two patterns visible that the list view does not: topic drift over time (later observations clustering away from earlier ones as the research domain shifts) and quality clustering (whether low-scoring observations are randomly distributed or concentrated in a particular topic region).
| UMAP parameter | Value | Rationale |
|---|---|---|
| n_components | 2 | 2D coordinates for D3 scatter rendering |
| random_state | 42 | Reproducible layout across refreshes |
| low_memory | True | Avoids duplicating the embedding array in RAM |
| KMeans k | min(8, max(1, N // 3)) | Scales with corpus, caps at 8 colors |
| Max nodes | 600 (configurable) | Most-recent 600 by SQLite id; prevents layout slowdown |
The graph endpoint is the most computationally expensive operation in the API — UMAP on 600 embeddings of dimension 384 (~1–3 seconds on CPU). It is intentionally not cached server-side, since the point is to reflect the current store state including recent quality updates. The frontend requests it once on tab open and caches the result for the session.
The Memory Panel
The dashboard Memory view (dashboard/src/views/Memory.tsx) exposes three tabs: memories, review, and graph. The memories tab is a paginated, filterable list with search (FTS5-backed via the API), task type filter, quality range filter, and final verdict filter. Selecting an observation opens a detail pane with the full narrative, facts array, run ID (linked to the runs view), and thumbs-up/thumbs-down controls.
The review tab shows prune candidates: observations with quality ≤ −2, or observations with both final_score < 6 and quality < 0. These are the observations most likely to be misleading retrieval context: either net-downvoted at least twice, or combining a low evaluator score with at least one net downvote. The tab makes it easy to bulk-delete low-signal observations before they are retrieved again and degrade synthesis quality.
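The prune rule is a two-branch predicate. The sketch below uses a hypothetical function name and assumes that an observation with no recorded final_score does not qualify under the second branch; the source does not say how missing scores are handled.

```python
from typing import Optional

def is_prune_candidate(quality: int, final_score: Optional[float]) -> bool:
    """True if an observation should appear in the review tab."""
    if quality <= -2:
        return True  # net-downvoted at least twice
    # Assumption: a missing score never triggers the low-score branch.
    return final_score is not None and final_score < 6 and quality < 0
```

For example, a 5.5-scoring observation with a single downvote qualifies, while the same observation with no feedback does not.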
The graph tab renders the UMAP scatter plot via D3, with zoom behavior attached to the SVG viewport and a brush overlay for rectangular selection. Clicking a node shows the observation title and quality score in a tooltip; double-clicking navigates to the detail pane. The cluster legend on the right maps cluster indices to colors. The clusters carry no semantic labels: the indices come from KMeans and mean nothing on their own, so the user has to read the observation titles within a cluster to see what it covers.
Test Coverage
The memory module has 57 tests in tests/test_memory.py, covering: initialization and schema migration, compression prompt parsing (including malformed responses and markdown-fenced output from the compression model), FTS5 query construction and stopword filtering, ChromaDB backfill from SQLite, quality-weighted ranking with explicit score and q_adjust scenarios, feedback application and clamping, prune candidate selection, injection scan blocking, the UMAP graph generation path (mocked to avoid requiring a GPU), and the store_direct() bulk import path used by the literature-review ingestion pipeline.
Known gap: The assess_novelty() function — which scores how much new web search results add beyond the current knowledge state, using an ephemeral ChromaDB collection for per-query similarity — has no tests for the fallback heuristic path. The word-overlap heuristic is a rough approximation that could return misleading novelty scores when the research context and new results share domain vocabulary but differ in substance. A test suite for this path would clarify the edge cases.
What the Literature Identifies
Three relevant results from the harness’s own lit-review corpus bear on the memory design decisions made here.
MemoryOS (2506.06326) proposes a three-tier memory architecture (short-term, mid-term, and long-term personal memory) analogous to operating system page management, with explicit promotion and eviction policies based on access frequency and recency. The harness memory store is flat: all observations are peers, which means a highly relevant observation from six months ago competes on equal footing with a recent one, modulo quality score. A recency decay factor, even a mild one, would shift the retrieval distribution toward more current observations without discarding older ones.
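One way such a decay could look is an exponential half-life with a floor, applied as an extra multiplier on the rank. Everything in this sketch is hypothetical: the function name, the 180-day half-life, and the 0.5 floor are illustrative values, not settings from the harness or from MemoryOS.

```python
def recency_factor(age_days: float, half_life_days: float = 180.0, floor: float = 0.5) -> float:
    """Multiplier in [floor, 1.0] that halves its excess over the floor every half-life."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    decay = 0.5 ** (age_days / half_life_days)
    return floor + (1.0 - floor) * decay
```

With these values a fresh observation keeps its full rank, a six-month-old one is scaled by 0.75, and very old ones approach 0.5 without ever dropping out, which is the "mild" behavior the paragraph above has in mind.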
HippoRAG (2405.14831) argues that episodic memory for RAG should model the hippocampal-neocortical binding mechanism, storing not just the content of observations but the relational graph between them. Two observations that both cover “LLM inference cost optimization” but from different angles (one on batching, one on quantization) would be linked in the graph and retrievable as a pair when either is queried. The current store retrieves them independently and deduplicates by title, which discards the relational signal entirely.
PromptCache (2311.04934) addresses the inference cost of prepending memory context: a 4-observation context prefix adds ~2,400 tokens to every synthesis prompt. PromptCache shows that KV-cache reuse for repetitive prefix structures can reduce this cost to near-zero at repeated query time. The harness currently pays full prompt cost for every memory injection; a prefix cache would be materially valuable at high run frequency.