The design claim: provisioning is two phases, not one — a universal base that gives every machine a working AI through one orchestrator, and an additive local-capability phase that is gated on detected hardware and validates that a model will actually fit before it downloads it. The same architecture runs on a Chromebook-class laptop and a 128 GB Mac Studio; only Phase 2 differs, and it differs by what the hardware can carry, not by a different product. The two reliability-critical decisions are made before any model is downloaded: can this machine run a local model at all, and will this specific model fit in this machine’s memory. Getting either wrong produces the worst failure mode in provisioning — a long download that ends in a model that will not load.
There are two ways to provision a local-first AI system. The naive way installs a model first and discovers its problems at load time — wrong size for the RAM, missing companion files, a chat-template mismatch that garbles output. The reliable way evaluates the machine, validates the fit, and only then commits to the download. This document specifies the second: the phase structure, the hardware gate, the tiering, the fit validation, inference-engine selection, integrity verification, and the deployment profiles — as a mechanism a competent engineer could rebuild.
Phase 1 — the universal base
Phase 1 runs on every machine regardless of hardware, and at its end the user has a working browser-based AI at localhost:5000 with tool execution. Six layers: a Python environment; the workspace directory structure (the vault, the conversation store, the ChromaDB index, routing-config.json); the framework library (cloned from the repository); the orchestrator (boot.py, boot.md, mind.md, the tool implementations); the API-key framework (staged); and the universal chat server.
The architectural commitment Phase 1 establishes is the orchestrator-in-the-loop contract: the model never touches the machine directly. The model emits a tool request in a defined format; the orchestrator’s Python watches for it, executes the corresponding function, and injects the result back into the conversation. boot.md is the contract; the orchestrator is the party that honors it; a model without the orchestrator is issuing tool calls into silence. (This is the harness control loop specified in Harness Control Loop; provisioning is where it is established, at install time.) The practical consequence: every serious interaction flows through localhost:5000, where Python is in the loop, not through a raw model interface.
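The contract above can be sketched in a few lines. This is a minimal illustration, not the orchestrator's actual code: the `<tool_request>` wrapper, the JSON payload shape, the `TOOLS` registry, and `read_file_stub` are all hypothetical stand-ins, since the real request format is whatever boot.md defines.

```python
import json
import re
from typing import Callable, Dict, Optional

# Hypothetical wire format; the real one is defined by boot.md.
TOOL_REQUEST = re.compile(r"<tool_request>(.*?)</tool_request>", re.DOTALL)


def read_file_stub(path: str) -> str:
    """Stand-in tool; a real tool would actually touch the filesystem."""
    return f"(contents of {path})"


# Hypothetical registry mapping tool names to Python functions.
TOOLS: Dict[str, Callable[..., str]] = {"read_file": read_file_stub}


def handle_model_output(
    text: str, tools: Dict[str, Callable[..., str]] = TOOLS
) -> Optional[str]:
    """Return a tool result to inject into the conversation, or None if
    the model's output contains no tool request."""
    match = TOOL_REQUEST.search(text)
    if match is None:
        return None
    try:
        request = json.loads(match.group(1))
        name, args = request["name"], request.get("args", {})
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError) as exc:
        return f"tool_error: malformed request ({exc})"
    tool = tools.get(name)
    if tool is None:
        return f"tool_error: unknown tool {name!r}"
    # Only Python executes the call; the model sees just the returned text.
    return tool(**args)
```

The point of the sketch is the division of labor: the model only produces text, and nothing happens on the machine unless Python recognizes a request and runs the matching function.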
Phase 1 reaches the model through API routing — OpenRouter required, direct provider APIs optional. An earlier design routed through browser automation against the user’s logged-in subscription accounts; that dispatcher was retired (install Chunk 1) because the reliability and cost-management properties of APIs proved decisive over the “use what you already pay for” advantage of browser sessions. The orchestrator’s Python-in-the-loop contract is unchanged; only the path from orchestrator to model differs.
The hardware-evaluation gate
After Phase 1, a brief check decides whether Phase 2 runs at all. The rule is explicit: if total RAM is below 8 GB, stop at Phase 1; if RAM is 8 GB or more and free disk is 5 GB or more, proceed to Phase 2. Stopping is not a failure state — it is Tier 0, the same architecture on different hardware. A Tier-0 machine has commercial-AI intelligence reaching its local files through the orchestrator, every framework installed, and every methodology working identically; it simply has no local model. The gate is designed so the low end is a complete system, not a degraded one.
This is the first reliability decision, and it is made before anything is downloaded. A machine that cannot run a local model is told so and given a complete cloud-routed system instead of a half-installed local one.
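The gate rule can be written as a small pure function. This is a sketch under stated assumptions: the names `GateDecision` and `evaluate_gate` are hypothetical, and the text states only the proceed condition for disk, so treating "enough RAM but under 5 GB free disk" as a stop at Phase 1 is an assumption rather than documented behavior.

```python
from dataclasses import dataclass

MIN_RAM_GB = 8.0        # from the gate rule: below this, stop at Phase 1
MIN_FREE_DISK_GB = 5.0  # from the gate rule: needed to proceed to Phase 2


@dataclass(frozen=True)
class GateDecision:
    run_phase_2: bool
    reason: str


def evaluate_gate(total_ram_gb: float, free_disk_gb: float) -> GateDecision:
    """Decide whether Phase 2 runs, before anything is downloaded."""
    if total_ram_gb < MIN_RAM_GB:
        return GateDecision(False, "Tier 0: total RAM below 8 GB; cloud-routed system is complete")
    if free_disk_gb < MIN_FREE_DISK_GB:
        # Assumption: the source does not say what happens here; this
        # sketch stays at Phase 1 rather than risk a failed download.
        return GateDecision(False, "free disk below 5 GB; staying at Phase 1")
    return GateDecision(True, "RAM and disk meet the Phase 2 minimums")
```

Returning a reason alongside the boolean keeps the stop case reportable as a named outcome, not a silent skip.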
Hardware tiers — behavior matched to capability
Phase 2 tiers its behavior to detected hardware. Detection (Phase 2, Layer 1) reads OS, total and available RAM, disk, processor, and GPU. The tiers:
- Tier 0 (cloud-only, < 8 GB): Phase 1 system; commercial AI through the orchestrator. No local model.
- Scout (≈ 8 GB): a small local model as the primary endpoint. Capable of basic work; flagged for the tool-reliability limit below.
- Workhorse (≈ 8–64 GB): a mid-size local model; solid general use and analysis.
- Sovereign (64 GB+): large or multiple local models. This tier unlocks the Model Switcher (Phase 2, Layer 9), which configures which model fills each pipeline slot; it is the install-time entry point to the adversarial-diversity choice that The Adversarial Pipeline depends on.
The tiers determine which model class auto-selection targets and which capabilities install; the architecture does not change across them. Tiering is additive: Phase 2 registers a local endpoint alongside the Phase-1 system rather than replacing it, so a local model is a new endpoint in the same chat server, never a separate product.
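The additive property can be illustrated with a small sketch. The schema here is hypothetical: the source names routing-config.json but does not give its layout, so the `endpoints` mapping, the `kind`/`engine`/`model` keys, and the function name are assumptions made for illustration.

```python
import copy
from typing import Any, Dict


def register_local_endpoint(
    routing_config: Dict[str, Any], name: str, engine: str, model: str
) -> Dict[str, Any]:
    """Return a new config with a local endpoint added alongside the
    existing ones. Nothing from Phase 1 is removed or overwritten."""
    updated = copy.deepcopy(routing_config)
    endpoints = updated.setdefault("endpoints", {})
    if name in endpoints:
        # Refuse to clobber an existing endpoint: registration is additive.
        raise ValueError(f"endpoint {name!r} already registered")
    endpoints[name] = {"kind": "local", "engine": engine, "model": model}
    return updated
```

Working on a copy means a failed Phase 2 can simply discard the result and leave the Phase-1 configuration exactly as it was.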
Validating model fit before download
The second reliability decision — will this model fit — is made before the download commits, and it is made from the model’s parameter count and quantization level, never from its file size. The fit math (RAM ≈ parameters × per-quantization factor + overhead, against a budget of 75% of system RAM; the MoE total-vs-active distinction; the dense-vs-MoE capability-per-gigabyte frontier) is specified in Model Selection and is not repeated here. What matters architecturally is when it runs: at selection, against the detected hardware, so a model that cannot fit is rejected or downsized before the user waits for gigabytes that will not load.
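As a rough illustration of the shape of that check (the authoritative factors live in Model Selection), the sketch below uses nominal bytes-per-parameter values derived from bit width and a placeholder overhead. Both `BYTES_PER_PARAM` and the 2 GB default overhead are assumptions, not the documented per-quantization factors; real quantization formats also store scales and metadata, so actual factors run somewhat higher.

```python
# Assumed nominal factors: bits per weight divided by 8.
BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}
RAM_BUDGET_FRACTION = 0.75  # from the text: budget is 75% of system RAM


def estimate_ram_gb(params_billion: float, quant: str, overhead_gb: float = 2.0) -> float:
    """RAM ~= parameters x per-quantization factor + overhead.

    For a mixture-of-experts model, pass the TOTAL parameter count:
    every expert must be resident even if only some are active per token.
    """
    return params_billion * BYTES_PER_PARAM[quant] + overhead_gb


def fits(params_billion: float, quant: str, system_ram_gb: float,
         overhead_gb: float = 2.0) -> bool:
    """True if the estimate stays within the RAM budget for this machine."""
    estimate = estimate_ram_gb(params_billion, quant, overhead_gb)
    return estimate <= RAM_BUDGET_FRACTION * system_ram_gb
```

Under these assumed factors, an 8B model at 4-bit estimates to 6 GB and fits a 16 GB machine (budget 12 GB), while a 70B model at 4-bit estimates to 37 GB and is rejected on a 32 GB machine (budget 24 GB) before any download starts.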
Selection itself is script-driven and transparent. The install script auto-populates model configurations using the Pareto-frontier + intelligence-floor + cost-sort algorithm (again, Model Selection) rather than a hand-picked default — so the model chosen for a machine is the best fit on the cost/capability frontier for that hardware, derived rather than guessed.
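The three named steps can be sketched generically. This is not the install script's code: the `Candidate` fields, the single scalar `cost` and `capability` scores, and the order in which the steps are applied are assumptions, and the real definitions are those in Model Selection.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    name: str
    cost: float        # hypothetical scalar: lower is better
    capability: float  # hypothetical score: higher is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on both axes and better on one."""
    no_worse = a.cost <= b.cost and a.capability >= b.capability
    strictly_better = a.cost < b.cost or a.capability > b.capability
    return no_worse and strictly_better


def select(candidates: List[Candidate], floor: float) -> List[Candidate]:
    """Pareto frontier, then intelligence floor, then sort by cost."""
    frontier = [c for c in candidates
                if not any(dominates(other, c) for other in candidates)]
    eligible = [c for c in frontier if c.capability >= floor]
    return sorted(eligible, key=lambda c: c.cost)
```

The first element of the result is the cheapest candidate that clears the floor and is not beaten on both axes by any other, which is what "derived rather than guessed" amounts to in practice.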
Inference engine and integrity verification
The inference engine is selected by architecture: MLX on Apple Silicon (the safetensors/MLX path), Ollama elsewhere (the GGUF path), with vllm-mlx available for the multi-model Sovereign case. Engine selection follows from the format the validated model is available in; a model not available in the machine’s format triggers a search for a converted version rather than a failed install.
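A minimal sketch of the architecture check, using the standard-library `platform` module. The function name and the `multi_model` flag are hypothetical, and treating vllm-mlx as an explicit opt-in is an assumption; the text says only that it is available for the multi-model Sovereign case.

```python
import platform
from typing import Optional


def select_engine(system: Optional[str] = None,
                  machine: Optional[str] = None,
                  multi_model: bool = False) -> str:
    """Pick an inference engine from OS and CPU architecture.

    Arguments default to the running machine; they are parameters so the
    decision can be exercised without the matching hardware.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    # A native Python on Apple Silicon reports Darwin/arm64. A Python
    # running under Rosetta reports x86_64 and would be routed to Ollama.
    apple_silicon = system == "Darwin" and machine == "arm64"
    if apple_silicon:
        return "vllm-mlx" if multi_model else "mlx"
    return "ollama"
```

The returned engine also implies the weight format to look for: the safetensors/MLX path for `mlx` and `vllm-mlx`, the GGUF path for `ollama`.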
Download (Phase 2, Layer 5) is followed by integrity verification before the endpoint registers: the required companion files must be present (config for parameters, at least one tokenizer file, all weight shards referenced in the index), and the chat template is checked — its absence is a named, non-fatal warning rather than a silent default that garbles output later. Only a model that loads and passes a smoke round-trip registers as an endpoint; a model that fails the load invariant stops Phase 2 with a specific recovery declaration rather than leaving a broken endpoint in the routing config.
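The companion-file check might look like the sketch below for a safetensors download. It assumes the common Hugging Face repository layout (`config.json`, a tokenizer file, `model.safetensors.index.json` with a `weight_map`, and a `chat_template` key in `tokenizer_config.json`); the installer's actual file list is not given in this document, and some repositories ship the template elsewhere, which this sketch would report as missing. `verify_model_dir` is a hypothetical name.

```python
import json
from pathlib import Path
from typing import List, Tuple

# Assumed tokenizer file names; at least one must be present.
TOKENIZER_FILES = ("tokenizer.json", "tokenizer.model", "tokenizer_config.json")


def verify_model_dir(model_dir: Path) -> Tuple[List[str], List[str]]:
    """Return (errors, warnings). Errors block endpoint registration;
    warnings are reported but non-fatal."""
    errors: List[str] = []
    warnings: List[str] = []

    if not (model_dir / "config.json").is_file():
        errors.append("missing config.json")
    if not any((model_dir / name).is_file() for name in TOKENIZER_FILES):
        errors.append("no tokenizer file found")

    # Sharded checkpoints list every shard in the index's weight_map.
    index = model_dir / "model.safetensors.index.json"
    if index.is_file():
        weight_map = json.loads(index.read_text()).get("weight_map", {})
        for shard in sorted(set(weight_map.values())):
            if not (model_dir / shard).is_file():
                errors.append(f"missing weight shard {shard}")

    # A missing chat template is named, not silently defaulted.
    tok_cfg = model_dir / "tokenizer_config.json"
    has_template = tok_cfg.is_file() and "chat_template" in json.loads(tok_cfg.read_text())
    if not has_template:
        warnings.append("chat template not found; output may be garbled")

    return errors, warnings
```

A file check of this kind is only the first half of the invariant; the load and smoke round-trip described above still decide whether the endpoint registers.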
The provisioning architecture names its failure modes so the installer can detect and report them rather than producing a confidently-broken system:
- RAM-from-file-size — estimating memory from the download size instead of from parameters × quantization. The defining error this architecture is built against; fit is always computed from parameters.
- XET pointer — a repository whose listed files are kilobyte pointers, not the real weights, defeating size-based estimation (another reason fit is parameter-based).
- Chat-template mismatch — a missing or wrong template producing garbled output; checked and surfaced at install, not discovered in use.
- Small-model tool reliability (the Local Model Tool Reliability Trap) — below roughly 13B at 4-bit, tool-call reliability degrades; a Scout-tier selection ships with this caveat attached rather than implied.
A Missing-Information Declaration and a Recovery Declaration are mandatory outputs: any hardware characteristic assumed, any metadata estimated, any test skipped, and any unresolved failure with its specific next action are stated explicitly. A response that acknowledges a limitation is preferred over one that implies more capability than was installed.
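One way to make the two declarations mandatory is to give them a fixed structure that always renders every section, even when empty. The class and field names below are hypothetical; only the four categories of content come from the paragraph above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UnresolvedFailure:
    description: str
    next_action: str  # every unresolved failure carries a specific next step


@dataclass
class InstallDeclarations:
    assumed_hardware: List[str] = field(default_factory=list)
    estimated_metadata: List[str] = field(default_factory=list)
    skipped_tests: List[str] = field(default_factory=list)
    unresolved: List[UnresolvedFailure] = field(default_factory=list)

    def render(self) -> str:
        """Render both declarations; empty sections say 'none' explicitly."""
        lines = ["Missing-Information Declaration"]
        sections = (
            ("assumed hardware", self.assumed_hardware),
            ("estimated metadata", self.estimated_metadata),
            ("skipped tests", self.skipped_tests),
        )
        for label, items in sections:
            lines.append(f"  {label}: {'; '.join(items) if items else 'none'}")
        lines.append("Recovery Declaration")
        if not self.unresolved:
            lines.append("  unresolved failures: none")
        for failure in self.unresolved:
            lines.append(f"  {failure.description} -> next action: {failure.next_action}")
        return "\n".join(lines)
```

Printing "none" for an empty section distinguishes "nothing was assumed" from "nobody checked", which is the distinction the declarations exist to preserve.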
Deployment profiles
The post-overhaul installer is script-driven (scripts/install.py) and prompts for a deployment profile that shapes what Phase 2 provisions:
- Solo — a single local (MLX) worker, no API worker pool; the privacy-focused local default, and the only profile fully enabled today.
- Hybrid — a local worker plus a small API worker pool for failover.
- Organization — pure API workers, no local model, scaling to many concurrent processes (an interim API-only server path exists today; the full profile is gated on the concurrency-architecture work).
The profile is a provisioning-shape decision layered on top of the tiering: tiering says what the hardware can run; the profile says what mix of local and API workers to stand up for the deployment’s purpose. A single-user laptop and an organizational inference node run the same installer with different profiles.
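The profile-to-worker mapping can be sketched as a lookup. The pool sizes (`hybrid_pool`, `org_pool`) are hypothetical defaults chosen for illustration; the source specifies only which kinds of worker each profile includes, not how many. The sketch also ignores the current gating of Hybrid and Organization.

```python
from dataclasses import dataclass
from enum import Enum


class Profile(Enum):
    SOLO = "solo"
    HYBRID = "hybrid"
    ORGANIZATION = "organization"


@dataclass(frozen=True)
class WorkerPlan:
    local_workers: int
    api_workers: int


def plan_workers(profile: Profile, hybrid_pool: int = 2, org_pool: int = 8) -> WorkerPlan:
    """Translate a deployment profile into a mix of local and API workers."""
    if profile is Profile.SOLO:
        return WorkerPlan(local_workers=1, api_workers=0)   # single local worker, no pool
    if profile is Profile.HYBRID:
        return WorkerPlan(local_workers=1, api_workers=hybrid_pool)  # local plus failover pool
    if profile is Profile.ORGANIZATION:
        return WorkerPlan(local_workers=0, api_workers=org_pool)     # pure API workers
    raise ValueError(f"unknown profile: {profile!r}")
```

Keeping the plan separate from hardware tiering mirrors the layering described above: the tier bounds what can run locally, and the profile decides how much of that capacity the deployment actually uses.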
Open problems
- The installer documentation is mid-reconciliation. The operational layers (~/ora/installer/) still cite the retired pre-customization first-boot document as their “canonical source,” and the Model-Switcher layer still describes the pre-configuration bucket/slot model that the configuration architecture replaced. The provisioning architecture is current; the layer files have residual drift, and bringing them fully current is tracked as its own install-overhaul workstream.
- Two profiles are scaffolded, not shipped. Hybrid and Organization are designed and partially stubbed but gated on the concurrency-architecture work; only Solo is fully enabled, with an interim API-only server path standing in for Organization. The architecture admits all three; the implementation does not yet deliver all three.
- Fit validation trusts model metadata. Parameter count and quantization are read from the repository’s published config; a model whose config misreports them (or omits them, forcing an estimate) can still pass validation and fail at load. The estimate is conservative, but it is an estimate.
- The tool-reliability floor is a heuristic, not a measured per-model property. “≈ 13B at 4-bit” is a rule of thumb; a small model that happens to call tools reliably is still flagged, and a larger one that calls them badly is not. The installer warns by size, not by a tested capability.
- First-boot assumes a coding-agent or scripted execution environment. The install runs as a script or under a coding agent with terminal access; it cannot provision from a plain chat interface, which bounds who can run it unattended.
None of these blocks the installer from taking a bare machine to a working system. Each marks a seam where the current mechanism is a working approximation rather than a finished answer.