RAG Architecture — Main Street Independent

Persistent AI memory as a pipeline, not a bigger context window: conversations become atomic, provenance-weighted, retrievable units through extraction, chunking, and a selection funnel.

The design claim: persistent AI memory is not a bigger context window — it is a pipeline that turns a conversation archive into atomic, provenance-weighted, retrievable units, and a retrieval path that optimizes for precision, not just recall. A model has no memory between calls. The vault supplies the durable store (specified in The Vault); this is the mechanism that fills the store from conversation and reads it back as purpose-built context. Two pieces are distinctive and worth specifying at the level a competent engineer could rebuild them: the conversation-to-atomic processing that produces the corpus, and the selection funnel that defends retrieval against high-confidence off-topic matches.

The retrieval-augmented-generation literature optimizes a single loop: chunk a corpus, embed it, retrieve the top-k by vector similarity, inject. That loop has two failure surfaces this architecture is built against. On the way in, naive chunking and summarization destroy the atomic units of meaning the retriever needs. On the way out, vector similarity plus provenance can both reward the wrong chunk — a vocabulary-rich note that matches on shared words while being topically unrelated. The architecture answers the first with semantic extraction and hierarchical chunking, and the second with a relevance gate inserted ahead of provenance ranking.

The corpus side: conversation to atomic note

Semantic extraction replaces summarization. Summarization delegates editorial judgment to the model — you decide what matters — and the model’s idea of what matters is exactly the wrong filter for a memory system, because the discarded detail is what a future query will need. Semantic extraction is a deterministic reformatting instead: find every discrete unit of meaning and render it in minimum viable language, with editorial filtering explicitly prohibited. The metaphor is removing packing foam to show what is in the box, not guessing at the box’s contents.

The extraction operator controls four variables: the unit (a distinct claim, decision, question, constraint, fact, or action item); the compression rule (each unit a single declarative sentence of fifteen words or fewer, units never combined); the prohibition (no evaluation of importance, no omission, repetitions collapsed but no idea dropped); and a forced categorical output — Decisions, Open Questions, Constraints, Facts, Ideas, Action Items, Topics, plus a PROVISIONAL/EXPLORATORY category that exists to solve a specific failure mode: extraction silently tightening a vague exploratory idea into a confident declarative claim. Temperature is pinned at 0–0.1 to keep the operation deterministic. The distinction from summarization is the whole point — a structural instruction, not a judgment call.

Hierarchical semantic chunking replaces fixed-token chunking. Fixed 512-token chunks truncate mid-sentence and split related ideas across boundaries, so the retriever pulls partial context. The replacement aligns chunk boundaries with semantic boundaries defined by heading structure (H1–H4): the retrieval unit is “one complete level of one concept,” not a token count. The governing-idea test is concrete — can a single sentence of fifteen words or fewer describe what all the bullets under a heading are sub-points of? A soft ceiling of two-to-four bullets per level flags when multiple concepts have been packed into one. Token counting happens after structural retrieval, not before: the pipeline pulls the structural unit, then counts tokens for budget decisions, so retrieval stays semantically clean while the context budget stays enforced.

Three-tier compression turns a session into memory. Each session compresses three ways: atomic extraction (individual claims as candidate notes — the primary corpus-population mechanism), a 200–400-word session summary (indexed at session granularity), and, for multi-session work, a thread map of the arc of thinking. One rule is load-bearing: decisions are extracted from the user’s prompts, not the model’s responses. The model’s output is raw material for atomic extraction; the user’s prompts are where the evolution of actual thinking is recorded, so a decision is credited only when the user committed to it.

The engram lifecycle and the quality gate. An extracted unit enters as provisional. It becomes a kept atomic note — an engram — only after passing the Note Quality Gate’s criteria and the user’s keeping it; that act of keeping is what makes it the user’s adopted thinking, and it is why a kept engram carries full retrieval weight regardless of whether the user or the model first typed it (the provenance rationale lives in The Vault). Curation is continuous rather than gated at the door: the Engram Cleaning Framework sweeps the corpus, collapses near-duplicate notes into one canonical note, and handles contradiction by supersession — the newer note gets a supersedes relationship and the older is archived out of default retrieval but kept as a record. At corpus scale (over a hundred thousand atomics in active use), continuous cleaning is the only tractable curation model; pre-elevation gating does not scale.

The retrieval side: scoring, decomposition, and the selection funnel

Scoring. The ranker scores every candidate chunk as similarity × provenance_weight × recency. The provenance weights are the vault’s (engram 1.0, resource 0.8, chat/transcript 0.6, web 0.1; the external-web tiers classified at fetch time) and are specified in The Vault; recency decay applies only to the conversational tiers. Two retrieval pathways stay architecturally separate: internal RAG over the local vector index (millisecond retrieval, provenance pre-scored at ingestion) and external RAG over live web (seconds per source, provenance inferred at fetch time). A lightweight router decides internal-only versus internal-plus-live per prompt.

Multi-query decomposition. A prompt is decomposed into three or four targeted query variations aimed at different neighborhoods of the corpus — concept, methodology, session-continuity, supplemental-authority — run in parallel, deduplicated, and merged into a richer package than any single query produces.

The selection funnel — the precision stage. Plain top-k retrieval optimizes recall and has no precision stage, and that is a structural defect, not a tuning problem: a vocabulary-rich note can match a prompt on shared words while being topically unrelated, and because it is often a high-provenance engram at weight 1.0, it does not merely slip into the package — it ranks at the top. The documented case is a Hubble’s-Law cosmology note contaminating a Japanese-garden art analysis on shared emptiness / negative-space vocabulary. Similarity rewards it (the match is high) and provenance rewards it (it is a kept engram), so neither existing signal can exclude it, and downstream adversarial review cannot catch a package-level contamination that reads as internally coherent.

The funnel inserts a third signal at the ranker chokepoint, through which every vault lane passes:

Retrieve wider — pull ~15 candidates rather than the production 5, because calibration showed relevant chunks running past the old positional cutoff.
Similarity floor — drop candidates below a low absolute threshold (~0.40). This is only a cheap noise backstop; legitimate matches on this corpus bottom out near 0.45, and contamination arrives at high similarity, so the floor cannot be the real defense.
Fit-gate — a small fast model reads each surviving candidate against the prompt’s actual intent and returns keep/drop with a reason. This is the load-bearing precision step — the relevance judgment neither similarity nor provenance carries.
Provenance rank, gate-first — the survivors are sorted by provenance after the gate has removed off-topic chunks, so a high-provenance off-topic chunk can no longer float to the top. Ordering matters: provenance dominance is correct among relevant chunks and wrong before relevance is established.
Provenance-preserving dedup — collapse exact duplicates, keeping the highest-provenance copy.
Token-cap backstop — the context-budget cap remains but should never fire; a 15-candidate package is a few thousand tokens against a ~58k cap, so package size is governed by the gate, not the cap.

The key design decision was rejecting the two obvious re-weighting fixes. Spreading the similarity range empowers the contaminant (its defining trait is high similarity); compressing the provenance gaps does nothing for the worst case (same-tier engram-versus-engram contamination, where provenance plays no role). The resolution is a separate pass/fail relevance check, not a better blend of the two numbers already present. This implements, via an LLM fit-judgment rather than a classic BM25-plus-cross-encoder reranker, the long-standing requirement for reranking on top of vector similarity.

Context-budget assembly. The selected chunks are assigned relevance tiers — Core (never cut), Structural (cut to first level before Core), Peripheral (cut first) — and assembled against a fixed budget in a deterministic order, using no more than 75–80% of the window so a quarter is reserved for the model’s output.

Retrieval inside the pipeline

Adversarial RAG divergence. In the Gear-4 parallel pipeline (specified in The Adversarial Pipeline), the depth and breadth analysts assemble their RAG packages independently. The evidence bases may diverge, producing a wider combined base than either model would assemble alone, and the divergence is signal — when two independently-retrieved analyses still converge, confidence rises; when they diverge, the disagreement is genuine rather than an artifact of a shared package.

Supplemental retrieval and extraction escalation. Retrieval is not only a front-loaded step. Mid-pipeline, a model can trigger supplemental retrieval when it finds it needs more than the initial package carried — the recall counterpart to the funnel’s precision. And on the web stream specifically, a bounded escalation handles the case where a high-trust source’s snippet is too thin to carry the needed detail: a cheap deterministic selector picks snippets that are thin and high-trust and from a not-yet-fetched domain (capped at a few fetches), fetches the full pages through a three-tier httpx → browser → extraction cascade, chunks them, runs them through the same relevance fit-gate, and folds the kept passages into the web-context block beside the snippets. The three failure modes are kept distinct: a search failure (source not found) escalates discovery, an extraction failure (found but thin) triggers the fetch, and an interaction failure (content behind JavaScript) is absorbed by the fetch cascade’s browser tier.

Open problems

The precision gate is itself a model call. The fit-gate is the load-bearing defense against contamination, and it is a small-model keep/drop judgment, run fail-open (an error keeps all candidates rather than blanking the context). Its quality depends on the intent altitude it judges against: in calibration, judging against a too-narrow article angle over-dropped relevant material (3 of 10 kept) while judging against the subject domain kept the right breadth (9 of 10). The gate is only as good as the intent description it is handed, and choosing that altitude is a per-surface calibration, not a solved default.
The selection funnel is still behind a flag. It runs gated, off by default; until it is the production default, the precision defense is opt-in, and the recall-only path remains live.
Coverage is uneven across pipelines. The vault lanes share one funnel, but project pipelines that retrieve through separate code (MSI’s article generator among them) do not inherit it; each extension is its own workstream, and a surface without the gate has only similarity and provenance — the two signals that cannot exclude the contaminant.
Retrieval quality rides on index integrity, which is upstream of retrieval. A migration once zipped record ids against storage-order payloads, welding each id to another row’s content; retrieval then surfaced correct content under wrong labels. The runtime gate defended against the symptom, but the lesson stands: the cleanest retrieval logic cannot compensate for a corrupted index, and index-integrity checks belong in the ingestion path, not only in retrieval.
Reranking by LLM trades cost for the absence of a trained reranker. Using a model call per lane as the reranker is portable and needs no trained cross-encoder, but it adds a model call to every retrieval and inherits the model’s latency and failure modes. Whether a trained reranker would be cheaper and more stable at scale is untested.

None of these blocks the architecture from running in production. Each marks a seam where the current mechanism is a defensible approximation rather than a finished answer.