---
title: The Vault
section: Ora — Foundation arguments
status: review
description: Ora's persistent, provenance-weighted state — the substrate that turns scattered conversations into a memory that compounds with use.
authors:
  - The Ora Foundation
downloads:
  md: /papers/white/the-vault.md
license: https://creativecommons.org/publicdomain/zero/1.0/
---

# The Vault

## What the vault is, architecturally

The Second Brain paper covers what the vault enables — past projects informing present projects, patterns becoming visible across years, the substrate compounding with use. This paper covers the architectural form that makes those properties operative: the schema, the retrieval, the dual representation, the curation framework.

The vault is the user's local knowledge substrate. It holds every input, intermediate result, and decision the user has produced or collected: conversations, notes, references, source documents, frameworks, modes, project matrices, drafts in progress, work the user has finished and wants to return to.

The vault is a directory of markdown files on the user's filesystem. The files have a defined structure. The structure is documented, public, and durable. The user can read the files with any text editor. The user can search them with any search tool. The user can copy them, back them up, version-control them, transform them, take them to a different system.

This is the architectural commitment the rest of the architecture rests on. Reliability engineering needs persistent state to operate over; the vault is the state. Frameworks need a substrate against which retrieval-augmented generation runs; the vault is the substrate. The AHI commitment that the user's accumulated work compounds rather than resets every session needs a place where the accumulation lives durably; the vault is the place.

## The schema

Every file in the vault carries YAML frontmatter that classifies it. The schema is documented in `Reference — Ora YAML Schema` and enforced by the linter. The schema's governing rule is that a property belongs in the schema if and only if it serves a search, navigation, or retrieval task — either human navigation (Bases queries, property filters) or machine retrieval (the RAG pipeline). If a property serves neither, it does not belong.

### Core properties

Three properties are present on every file:

**`nexus`** — links the note to projects and passions defined in the matrix system. The bridge between the note layer (individual files) and the conceptual layer (project / passion definitions). Empty nexus means domain-general; populated nexus means the note serves one or more specific projects or passions.

**`type`** — operational role within the vault. One of twelve mutually exclusive values: `engram` (curated atomic notes), `resource` (external library content + chunked source-document material), `chat` (conversation artifacts), `transcript` (whisper-transcribed audio/video), `web` (web content saved into vault), `framework` (executable formalized specifications), `mode` (analytical thinking instructions), `reference` (vault-internal documentation and registries), `paper` (white-paper treatment of a framework or system), `working` (in-progress drafts), `matrix` (navigation hubs — projects, passions, MOCs), `supervision` (vault administration). Type determines provenance weighting in retrieval and decay eligibility.

**`tags`** — thematic and overlapping classification cutting across the type schema. Properties own architecture; tags own ideas. Tags are drawn from a controlled vocabulary: format/structure (`atomic`, `molecular`, `compound`, `process`, `glossary`), purpose (`framework/instruction`, `framework/builder`, `position`), status (`archived`, `incubating`, `private`, `superseded`), the provenance marker `ai-derived`, and user-extensible domain tags.

Two additional properties are linter-managed: `date created` and `date modified`, drawn from the filesystem.
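
The core-property rules can be sketched as a small check over one file's parsed frontmatter. This is an illustrative sketch, not the Ora linter: the function name `check_core_properties` is hypothetical, and representing `nexus` and `tags` as lists is an assumption this paper does not confirm.

```python
# Hypothetical validator for the three core properties; not the actual Ora linter.
VAULT_TYPES = {
    "engram", "resource", "chat", "transcript", "web", "framework",
    "mode", "reference", "paper", "working", "matrix", "supervision",
}


def check_core_properties(frontmatter: dict) -> list[str]:
    """Return a list of problems found in one file's parsed frontmatter."""
    problems = []
    # All three core properties must be present on every file.
    for key in ("nexus", "type", "tags"):
        if key not in frontmatter:
            problems.append(f"missing core property: {key}")
    # `type` must be exactly one of the twelve mutually exclusive values.
    note_type = frontmatter.get("type")
    if note_type is not None and note_type not in VAULT_TYPES:
        problems.append(f"unknown type: {note_type!r}")
    # Assumption: nexus and tags are stored as lists (an empty nexus is valid).
    for key in ("nexus", "tags"):
        value = frontmatter.get(key)
        if value is not None and not isinstance(value, list):
            problems.append(f"{key} must be a list")
    return problems


print(check_core_properties({"nexus": [], "type": "engram", "tags": ["atomic"]}))
print(check_core_properties({"type": "blog", "tags": "atomic"}))
```

The first call reports no problems; the second reports a missing `nexus`, an unknown `type`, and a malformed `tags`.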

### Standard and conditional properties

Other properties apply when relevant. `relationships` carries typed connections to other vault notes — supports, contradicts, qualifies, extends, supersedes, analogous-to, derived-from, enables, requires, produces, precedes, parent, child. The relationship graph is queryable; the system traverses it during retrieval.

`subtype` classifies atomic notes by claim type — fact, process_principle, definition, causal_claim, analogy, evaluative.

`source_*` properties identify the source document for notes produced through the document processing pipeline — `source_file`, `source_format`, `source_path`, `processed_date`, `chunk_index`, `total_chunks`, `source_document` (wikilinks to the source resource notes the atomic was extracted from).

`writing` is set on files that are part of fiction or book production — `general`, `ideation`, `theme`, `character`, `setting`, `outline`, `prose`, `master`, `archive`.

`project_type` is set on matrix files to declare the matrix's classification — `project`, `operation`, `passion`, `incubator`, plus optional domain types like `book`, `knowledge`, `workflow`, `fiction`.

The schema is open to extension when extensions serve search, navigation, or retrieval. The schema rejects extensions that don't.

## Provenance hierarchy

The retrieval engine scores chunks as `similarity × type_weight × recency_factor`. Type weights derive from the schema's provenance hierarchy.

| Tier | Type | Weight |
|---|---|---:|
| P1 | `engram` (user-side or AI-side — authorship no longer modifies the weight) | **1.0** |
| P2 | `resource` | **0.8** |
| P3 | `chat`, `transcript` | **0.6** |
| P4 | `web` | **0.1** |

Types not retrieved (framework, mode, reference, working, matrix, supervision, paper) serve other purposes — orchestrator loading, navigation, admin, authorship — and the retrieval engine never queries them as-is. The two retrievable folders are `Engrams/` and `Resources/`; everything else is canonical-lookup or navigation surface.
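
The scoring rule and the weight table above can be written out as a few lines of Python. This is a minimal sketch: the weights and the formula come from this section, while the function name `score_chunk` and the choice to return `None` for non-retrieved types are illustrative assumptions.

```python
# Weights from the provenance table; non-retrieved types are simply absent.
TYPE_WEIGHTS = {
    "engram": 1.0,      # P1
    "resource": 0.8,    # P2
    "chat": 0.6,        # P3
    "transcript": 0.6,  # P3
    "web": 0.1,         # P4
}


def score_chunk(similarity: float, note_type: str, recency_factor: float = 1.0):
    """Return similarity x type_weight x recency_factor, or None if not retrievable."""
    weight = TYPE_WEIGHTS.get(note_type)
    if weight is None:
        return None  # framework, mode, reference, etc. are never ranked
    return similarity * weight * recency_factor


# A chat chunk with the higher similarity still ranks below a kept engram.
print(score_chunk(0.5, "chat"))    # 0.3
print(score_chunk(0.4, "engram"))  # 0.4
print(score_chunk(0.9, "matrix"))  # None
```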

Reference files at vault root flow through document processing: chunks become `resource` (P2), and atomic distillations become `engram` (P1, without the `ai-derived` tag, since they are user-authored). The source reference file remains at vault root for canonical lookup.

The hierarchy encodes the AHI commitment, with a correction the schema made in rev 5.2: a kept engram weighs 1.0 whether the user or the AI originally typed it, because primacy follows *review-status*, not authorship. An engram exists only because the user kept it, and that act of keeping is the adoption that makes it the user's thinking. The user often asks precisely because they do not yet know, so the AI's answer is frequently the substance, and the extraction quality gate already removes the AI content the user pushed back on. The `ai-derived` tag persists as a provenance marker (the Engram Cleaning Framework reads it for contradiction reasoning) but no longer caps weight; the earlier `source-derived` modifier is retired. What the hierarchy still enforces is that curated vault content (P1–P2) outranks conversation (P3) and far outranks unvetted web (P4): the user's kept corpus is never silenced by a transient retrieval.

## Time decay

Conversation chunks accumulate in clusters around evolving topics. A chat from years ago about an inactive topic stays at full weight; a chat from yesterday about an actively-evolving topic begins to decay as fresher takes accumulate.

The decay function is `factor = max(FLOOR, 1.0 - (newer_count_in_cluster × DECAY_PER_NEWER))`. Decay applies only to types flagged decay-eligible: `chat`, `transcript`, `web`. Engrams (curated atomic notes the user has explicitly kept, whoever first typed them) and resources (curated source documents) do not decay; they retain their full weight regardless of age, because curation is the signal that the content remains current.
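
A worked sketch of the decay function follows. This paper names `FLOOR` and `DECAY_PER_NEWER` but does not give their values, so the numbers below are placeholders chosen only to make the arithmetic easy to follow; the function name `recency_factor` is likewise illustrative.

```python
# FLOOR and DECAY_PER_NEWER are named in the formula but not given values in
# this paper; the numbers below are placeholders for illustration only.
FLOOR = 0.25
DECAY_PER_NEWER = 0.125
DECAY_ELIGIBLE = {"chat", "transcript", "web"}


def recency_factor(note_type: str, newer_count_in_cluster: int) -> float:
    """Cluster-recency decay: only decay-eligible types lose weight."""
    if note_type not in DECAY_ELIGIBLE:
        return 1.0  # engrams and resources keep full weight at any age
    return max(FLOOR, 1.0 - newer_count_in_cluster * DECAY_PER_NEWER)


print(recency_factor("chat", 0))     # 1.0  (no newer chunks in the cluster)
print(recency_factor("chat", 2))     # 0.75
print(recency_factor("chat", 50))    # 0.25 (clamped at FLOOR)
print(recency_factor("engram", 50))  # 1.0  (not decay-eligible)
```

Because the input is the count of newer chunks in the same cluster rather than calendar age, an old chat on a dormant topic has a count of zero and keeps its full factor.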

The decay model means chat content stays useful for the period the cluster is alive, and is appropriately deprioritized once the cluster's discussion has been superseded by fresher work. The user does not have to manually retire stale chats; the decay handles the prioritization without removing the content from the vault.

## Tag-based filters

Three tags affect retrieval directly.

**`archived`** — chunks tagged `archived` are excluded from default retrieval entirely. They remain in the vault for historical reference and may be queried explicitly, but the standard RAG context-assembly path skips them. The signal is the user's judgment that the content is intentionally retired.

**`incubating`** — chunks tagged `incubating` are included in retrieval but surfaced with an explicit status flag in the assembled context. The consuming model knows the content is mid-review and not yet vetted. The user has not yet given the content full canonical status; the tag preserves its retrievability without falsely implying full curation.

**`private`** — chunks tagged `private` are filtered conditionally on the active conversation's mode. When the active conversation is not in private mode, private-tagged chunks are excluded. When the active conversation is in private mode, all chunks are visible, including private chunks from any other conversation. The filter is one-way: private content is invisible from outside a private context, fully visible from inside one.

The `ai-derived` tag is a provenance *marker*, not a filter and — since rev 5.2 — not a weight modifier: it records that an engram originated AI-side, for the Engram Cleaning Framework's contradiction reasoning, without lowering its retrieval weight or removing it from retrieval.
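
The three filters can be sketched as one pass over candidate chunks. This is an illustrative sketch under assumptions: chunks are modeled as plain dicts with a `tags` list, and the `status_flag` field is a hypothetical way of carrying the incubating status into the assembled context.

```python
def apply_tag_filters(chunks: list[dict], private_mode: bool) -> list[dict]:
    """Drop or flag chunks according to the archived / incubating / private tags."""
    visible = []
    for chunk in chunks:
        tags = set(chunk.get("tags", []))
        if "archived" in tags:
            continue  # excluded from default retrieval
        if "private" in tags and not private_mode:
            continue  # invisible outside a private conversation
        flagged = dict(chunk)
        # Hypothetical field: tells the consuming model the content is mid-review.
        flagged["status_flag"] = "incubating" if "incubating" in tags else None
        visible.append(flagged)
    return visible


chunks = [
    {"id": "a", "tags": ["atomic"]},
    {"id": "b", "tags": ["archived"]},
    {"id": "c", "tags": ["incubating"]},
    {"id": "d", "tags": ["private", "ai-derived"]},
]
print([c["id"] for c in apply_tag_filters(chunks, private_mode=False)])  # ['a', 'c']
print([c["id"] for c in apply_tag_filters(chunks, private_mode=True)])   # ['a', 'c', 'd']
```

Note that `ai-derived` appears nowhere in the filter logic, which matches its role as a marker only.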

## The dual representation

Every chunk emitted by the conversation pipeline is recorded twice.

**Vault YAML.** Canonical, human-readable, on-disk frontmatter. Drives Bases queries and Obsidian navigation. Lives in the user's filesystem. The user can read it, edit it, move it, version-control it.

**ChromaDB metadata.** A rich record of fields indexed alongside the embedding for retrieval-augmented generation. Drives ranking, filtering by topic / cluster / conversation, and cluster-recency decay. Lives in the user's local ChromaDB instance.

The two representations are kept in lockstep by the conversation pipeline. Any change to the vault file's chunk content triggers reindexing into ChromaDB with matching metadata. The vault is the source of truth; ChromaDB is the retrieval optimization. A user who exports the vault to a different system can rebuild the ChromaDB index from the canonical files; nothing important lives only in ChromaDB.
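
The source-of-truth relationship can be illustrated with a toy model. The sketch below makes several assumptions: a plain dict stands in for the index (no ChromaDB calls are shown), change detection by content hash is one plausible mechanism rather than the pipeline's documented one, and all function names are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a file's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def rebuild_index(vault_files: dict[str, str]) -> dict[str, str]:
    """Recreate the stand-in index from the canonical files alone."""
    return {path: content_hash(text) for path, text in vault_files.items()}


def stale_paths(vault_files: dict[str, str], index_hashes: dict[str, str]) -> list[str]:
    """Paths whose current content differs from what the index last recorded."""
    return sorted(
        path for path, text in vault_files.items()
        if index_hashes.get(path) != content_hash(text)
    )


vault = {"Engrams/a.md": "first note", "Engrams/b.md": "second note"}
index = rebuild_index(vault)
vault["Engrams/a.md"] = "first note, revised"
print(stale_paths(vault, index))                 # ['Engrams/a.md']
print(stale_paths(vault, rebuild_index(vault)))  # []
```

The point of the sketch is the direction of dependency: the index is always derivable from the files, never the reverse.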

The dual representation serves two different consumers. Human navigation reads the vault directly — Obsidian's property panels, Bases queries, file-system search, version control. Machine retrieval reads ChromaDB — the embedding-based similarity search, the metadata filters, the ranking pipeline. Neither consumer has to know the other exists; both are looking at the same underlying state through their appropriate access pattern.

## The Engram Cleaning Framework

Curation happens continuously. The Engram Cleaning Framework runs across the corpus, surfacing contradictions where two atomics make incompatible claims about the same subject and presenting each contradiction to the user for triage.

Three resolution paths:

**Changed mind.** Apply a `supersedes` relationship from the new atomic to the older one; tag the superseded atomic `archived` so it leaves default retrieval but remains as a record of the user's intellectual evolution.

**Hypocrisy / motivated-reasoning flag.** Surface the inconsistency for the user's reflection; no automatic resolution. The user gets to notice the inconsistency and decide what it means.

**Wrong.** Delete the atomic, or tag it `archived` if the user wants to retain the historical mistake as a record.
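
The "changed mind" path is the only one of the three with a mechanical effect on the corpus, and it can be sketched as follows. Notes are modeled as plain dicts; the shape of a `relationships` entry (`type` plus `target`) and the function name are assumptions for illustration, not the schema's documented form.

```python
def resolve_changed_mind(newer: dict, older: dict) -> None:
    """Record that `newer` supersedes `older` and retire `older` from default retrieval."""
    # Assumed relationship shape: a typed edge pointing at the older note's title.
    newer.setdefault("relationships", []).append(
        {"type": "supersedes", "target": older["title"]}
    )
    # Archive rather than delete, so the earlier position remains on record.
    tags = older.setdefault("tags", [])
    if "archived" not in tags:
        tags.append("archived")


old_note = {"title": "Estimates should be padded", "tags": ["atomic"]}
new_note = {"title": "Estimates should be ranges", "tags": ["atomic"]}
resolve_changed_mind(new_note, old_note)
print(new_note["relationships"])  # [{'type': 'supersedes', 'target': 'Estimates should be padded'}]
print(old_note["tags"])           # ['atomic', 'archived']
```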

The framework replaces the gatekeeping-at-the-door model of incubation-elevation that earlier versions of the schema used. The volume of atomic extraction (over a hundred thousand atomics in active corpora) makes human-in-the-loop pre-elevation impractical. Pollution prevention shifts from gatekeeping at the door to ongoing cleaning across the whole corpus. The vault accepts content quickly and surfaces contradictions over time; the user does the curation work in the cleaning queue rather than at the production interface.

A parallel sweep applies the same supersession mechanism to evolving news stories in `Resources/`. When a new news article supersedes an older one, the framework writes a `supersedes` relationship and `archived`-tags the older version. This keeps the news corpus current without losing the historical record.

## The web tier

External web sources are classified at retrieval time according to a four-stage cascade.

**Whitelist match.** URL pattern matches an entry in `Registry — Trusted Web Sources` at the high-provenance tier → classify as `whitelisted`, weight 0.8. Clearing a site onto the trusted list is itself a curation act, so a trusted-web hit weighs as curated reference (raised from 0.7 in rev 5.2).

**Page-specific override.** URL exactly matches a page-specific override → classify at the override's declared tier.

**Corroboration count.** Two or more unaffiliated domains in the current search result set carry the same finding → classify as `corroborated`, weight 0.3.

**Single source.** One non-farm source → classify as `single`, weight 0.15.

Patterns matching the exclusion list (link farms, content mills) classify as `excluded`, weight 0.0 — filtered out before ranking.
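
The cascade can be sketched as a single classification function. The weights and stage order come from this section; everything else is assumed for illustration: the registry contents are placeholder `.example` domains, glob-style matching via `fnmatch` is one possible pattern syntax, and checking the exclusion list first is an interpretation of "filtered out before ranking".

```python
from fnmatch import fnmatch

# Hypothetical registry contents; the real lists live in the vault's registries.
EXCLUDED = ["*.contentmill.example/*"]
WHITELIST = ["trusted.example/*"]
PAGE_OVERRIDES = {"blog.example/post-42": ("whitelisted", 0.8)}


def classify(url: str, corroborating_domains: int) -> tuple[str, float]:
    """Return (classification, weight) for one external result.

    `corroborating_domains` counts the unaffiliated domains in the current
    result set that carry the same finding, including this one.
    """
    if any(fnmatch(url, pattern) for pattern in EXCLUDED):
        return ("excluded", 0.0)   # assumed to run first, before ranking
    if any(fnmatch(url, pattern) for pattern in WHITELIST):
        return ("whitelisted", 0.8)
    if url in PAGE_OVERRIDES:
        return PAGE_OVERRIDES[url]  # the override declares its own tier
    if corroborating_domains >= 2:
        return ("corroborated", 0.3)
    return ("single", 0.15)


print(classify("trusted.example/docs/intro", 1))   # ('whitelisted', 0.8)
print(classify("news.example/story", 3))           # ('corroborated', 0.3)
print(classify("news.example/story", 1))           # ('single', 0.15)
print(classify("www.contentmill.example/top10", 5))  # ('excluded', 0.0)
```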

External sources are surfaced inline at retrieval. Nothing about external retrieval enters the vault unless the user manually saves it as `type: web` (which is then retrieved at vault P4, not at the external tier weight). The split between vault tier and external tier means the user's curated content always ranks above transient web fetches; the web is a supplement, not a substitute.

## The vault is the user's

Every architectural choice in the vault answers to one commitment: the user owns the vault.

The format is portable. Markdown for the body, YAML frontmatter for the metadata. Both are standard, decades-old, supported by every text editor and every version control system. A user who wants to migrate from Ora to a different cognitive-automation system, or away from cognitive automation entirely, takes the vault with them.

The schema is documented. The YAML schema is published; the conventions are described; the directory structure follows publicly documented patterns. A user who wants to write their own tools to operate against their vault can do so. A user who wants to migrate their vault into a different schema can write a transformation; the transformation is a tractable text-processing job, not a reverse-engineering project.

The pipeline is transparent. The dual representation is documented. The user can rebuild the ChromaDB index from the vault if anything goes wrong; nothing important lives only in the index.

The provenance is preserved. Every chunk carries the trace of where it came from. The user can audit any retrieval back to the source content that produced it.

The conversation history is human-readable. Conversations are stored as markdown files with timestamps, attribution, and topic metadata. A user who wants to read their conversation history without any tool can do so.

These are not promises layered on top of the architecture. They are properties of the architecture. A different architecture that wanted lock-in could not be built on the same substrate without abandoning the substrate's properties; the architecture is what it is because lock-in is structurally precluded.

## The summary

The vault is the architectural commitment that makes the rest of the architecture work. Persistent state for reliability engineering. Substrate for retrieval-augmented generation. Durable accumulation for AHI's compounding-over-time commitment. The schema is open and documented; the provenance hierarchy encodes the primacy of content the user has kept, whoever first typed it; the cleaning framework handles curation across the corpus rather than at the door; the dual representation serves human navigation and machine retrieval through the same underlying state. The vault is the user's because every architectural choice was made to make it so. The corpus belongs to the human, not the platform.