---
title: Model Selection
section: Ora — System papers
status: review
description: "How Ora picks a model for each step: install-time hardware fit, a cost-versus-capability frontier with preset intelligence floors and cost ceilings, and runtime gear-downgrade routing."
authors:
  - The Ora Foundation
downloads:
  md: /papers/white/model-selection.md
license: https://creativecommons.org/publicdomain/zero/1.0/
---

# Model Selection

A reliable automated process runs dozens of model calls to finish one task: cleanup, classification, retrieval planning, breadth analysis, depth analysis, adversarial evaluation, consolidation. Use one model for all of them and, whichever model you pick, it is wrong for some steps in one of two directions. A model strong enough for adversarial evaluation is wasted on input cleanup: you pay its price and its latency for a step that a small model finishes correctly. A model cheap enough for cleanup cannot carry the analysis, so it fails the step that matters. Per-step success rate is the reliability budget for the whole process, so a model too weak for its slot erodes the end-to-end success rate the architecture exists to protect, and a model too strong for its slot spends money and latency that buy nothing.

Model selection is the mechanism that puts the right model in each slot, automatically, and keeps the process running when the chosen model is unavailable. It does this across a frontier of cost against capability rather than by a fixed assignment, and that is the design claim worth stating plainly: **the system pays for frontier capability only at the steps where frontier capability changes the output.** Everywhere else it spends down the frontier toward free.

This is three problems, not one, and the architecture answers them in three layers:

- **Fit** — can this model physically run on this hardware? (Resolved once, at install.)
- **Selection** — given a catalog of hundreds of endpoints, which handful is worth considering for a slot, given what it costs and how capable it is? (The cost/capability frontier algorithm.)
- **Routing** — which model fills each slot *right now*, and what happens when it is offline, rate-limited, or constrained? (Runtime resolution with graceful degradation.)

Each layer solves a distinct failure mode. I will take them in order.

---

## Layer 1 — Hardware fit

**The design claim: estimate a model's memory requirement from its parameter count and quantization level, never from its file size.**

A model that does not fit in RAM does not run — or it runs by swapping to disk at a latency that makes it useless. So fit is a hard gate before anything else. The naive approach is to read the file size on the model repository and compare it to available memory. It is wrong often enough to be dangerous, for two reasons: a repository may store weights behind XET pointers (the listed files are kilobyte stubs, not the multi-gigabyte weights), and file size on disk does not predict resident memory once the weights are dequantized and the inference engine's working buffers are allocated.

The reliable estimate comes from two numbers the model always carries — its parameter count and its quantization level.

*Quantization* is the precision at which weights are stored. Lower precision trades a modest amount of quality for a large reduction in memory. The resident-memory estimate follows directly:

- **4-bit:** ≈ 0.5 GB per billion parameters
- **8-bit:** ≈ 1.0 GB per billion parameters
- **16-bit (unquantized half precision, FP16/BF16):** ≈ 2.0 GB per billion parameters
- plus ≈ 2 GB of overhead for the inference engine and working buffers.

The budget the estimate is checked against is `AVAILABLE_MODEL_RAM = total system RAM × 0.75` — the remaining quarter is left to the operating system and the application. A model whose estimate exceeds that budget is a hard block, not a warning.
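
The estimate and the budget check above can be sketched in a few lines. This is a minimal illustration that uses only the figures stated in this section; the function and constant names are hypothetical, not the installer's actual API.

```python
# GB of resident memory per billion parameters, keyed by quantization bits
# (figures from the list above).
GB_PER_BILLION = {4: 0.5, 8: 1.0, 16: 2.0}
ENGINE_OVERHEAD_GB = 2.0    # inference engine and working buffers
RAM_BUDGET_FRACTION = 0.75  # share of total system RAM available to models


def estimate_resident_gb(total_params_b: float, quant_bits: int) -> float:
    """Estimated resident memory in GB, from parameter count and quantization."""
    return total_params_b * GB_PER_BILLION[quant_bits] + ENGINE_OVERHEAD_GB


def available_model_ram(total_system_ram_gb: float) -> float:
    """AVAILABLE_MODEL_RAM: three quarters of total system RAM."""
    return total_system_ram_gb * RAM_BUDGET_FRACTION


def fits(total_params_b: float, quant_bits: int, total_system_ram_gb: float) -> bool:
    """Hard gate: True only when the estimate is within the budget."""
    estimate = estimate_resident_gb(total_params_b, quant_bits)
    return estimate <= available_model_ram(total_system_ram_gb)
```

For example, a 7B model at 4-bit estimates to 5.5 GB and fits a 16 GB machine (12 GB budget), while a 70B model at 4-bit estimates to 37 GB and is blocked on a 32 GB machine (24 GB budget).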

*Mixture-of-Experts (MoE)* models need a second number. An MoE model has a large total parameter count but activates only a fraction of it per token. The memory estimate must use the **total** count, because all weights load into RAM whether or not they fire on a given token, but the *capability* the model delivers tracks the **active** count. A 120B model that activates 4 of 128 experts has roughly 120 × 4/128 ≈ 4B active parameters (ignoring shared layers), so it reasons closer to a 4B dense model while occupying the RAM of a 120B one. The operative metric is active parameters per gigabyte of resident memory. At 4-bit, dense models deliver roughly 1.5–2.0 B/GB; an MoE model can deliver as little as 0.06 B/GB, which is what the 120B example works out to (≈ 3.75B active over ≈ 62 GB resident). So the fit layer prefers dense models for reasoning work at any tier where a useful-size dense model fits in budget, and reaches for MoE only when RAM is too tight for any dense model worth running.
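
The density metric can be checked with a short calculation. The sketch below assumes 4-bit weights and the 2 GB overhead from the estimate above, and approximates the MoE active count as total parameters times the fraction of experts activated, which ignores shared attention and embedding weights. The function name is hypothetical.

```python
def active_params_per_gb(active_params_b: float, total_params_b: float,
                         gb_per_billion: float = 0.5,
                         overhead_gb: float = 2.0) -> float:
    """Active parameters (billions) per GB of estimated resident memory."""
    resident_gb = total_params_b * gb_per_billion + overhead_gb
    return active_params_b / resident_gb


# Dense 30B at 4-bit: every parameter is active.
dense_30b = active_params_per_gb(active_params_b=30, total_params_b=30)

# 120B MoE activating 4 of 128 experts (assumption: active = total * 4/128).
moe_active_b = 120 * 4 / 128
moe_120b = active_params_per_gb(active_params_b=moe_active_b, total_params_b=120)
```

Under these assumptions the dense model lands inside the 1.5–2.0 B/GB band and the MoE model lands near 0.06 B/GB, a gap of more than an order of magnitude.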

Fit also sorts hardware into capability tiers, which become the default ceiling on what auto-selection will pick:

| Tier | `AVAILABLE_MODEL_RAM` | Dense model at 4-bit |
|---|---|---|
| 1 | 6–11 GB | 3–7B |
| 2 | 12–23 GB | 7–14B |
| 3 | 24–47 GB | 30–70B |
| 4 | 48–95 GB | 70B+ |
| 5 | 96+ GB | two large models concurrently |
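
A tier lookup over the table's lower bounds might look like the following sketch. Returning 0 for a machine below the Tier 1 range is an assumption made here for illustration; the table does not say how such machines are handled.

```python
import bisect

# Lower bound of AVAILABLE_MODEL_RAM (GB) for tiers 1 through 5, from the table.
TIER_LOWER_BOUNDS_GB = [6, 12, 24, 48, 96]


def hardware_tier(available_model_ram_gb: float) -> int:
    """Return the tier (1-5), or 0 when below the Tier 1 range (assumption)."""
    return bisect.bisect_right(TIER_LOWER_BOUNDS_GB, available_model_ram_gb)
```

A 16 GB machine has a 12 GB budget and lands in Tier 2; a 64 GB machine has a 48 GB budget and lands in Tier 4.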

**The tradeoff this layer carries forward is the tool-calling reliability floor.** Reliable tool calling — the model emitting a well-formed call the harness can execute — degrades below roughly 13B at 4-bit, or 7B at 8-bit. A Tier-1 machine can run a model that converses well and calls tools badly, and tool-call failure is a per-step reliability erosion that the upper layers cannot fully recover. Fit surfaces this rather than hiding it: a small-model selection ships with the tool-reliability caveat attached.

**Alternative considered and rejected: size-from-file.** It is simpler and needs no parameter parsing. It was rejected because it silently mis-estimates under quantization and breaks entirely on XET-pointer repositories — and the failure is a model that downloads, then will not load, after the user has waited for the download. A wrong answer that arrives late is worse than a slightly more expensive check up front.

---

## Layer 2 — The cost/capability frontier

This is the layer the paper is named for, and the one most worth getting right.

**The problem:** the catalog holds hundreds of candidate endpoints — at the time of writing, 358 models across 431 registered endpoints, spanning local weights and a dozen commercial providers. No human ranks those by hand, and re-ranking them by hand every time a provider ships a model does not scale. The system needs an algorithm that, given a slot and a budget posture, returns the few models worth using.

**The data the algorithm runs on.** Every catalog entry carries the fields the decision needs:

- `aa_intelligence_index` — a capability score on a 0–100 scale, sourced from the Artificial Analysis benchmark suite. This is the capability axis.
- `blended_per_m` — blended cost in dollars per million tokens. This is the cost axis.
- `latency_ttft_seconds` and `output_tokens_per_second` — time-to-first-token and throughput. Surfaced for the reader and the constraint checks; not, today, part of the optimization (see Open Problems).
- `parameters_b`, `size_bucket`, `is_free`, `vision_capable`, `context_window` — structural filters.

**The algorithm, in order:**

**1. Pareto filter.** Plot every candidate on the two axes, intelligence and cost. A model is *dominated* if some other model is at least as intelligent **and** at least as cheap, and strictly better on at least one of the two (so two models with identical scores and prices do not eliminate each other). Drop every dominated model. What remains is the efficient frontier: the set where buying more intelligence costs more, and saving money costs capability, with no free lunch left on the table. This is the step that makes the rest tractable, because it discards the models that are simply worse buys than something else in the catalog, which is most of them.

**2. Intelligence floor.** From the frontier, keep only models scoring at least `floor_pct` percent of the *top* intelligence in the bucket: `threshold = top_intelligence × floor_pct / 100`. The floor is how a slot says "do not go below this capability no matter how cheap." A high floor keeps only near-top models; a low floor admits cheaper, weaker ones.

**3. Cost ceiling.** Optionally drop models above a hard dollars-per-million ceiling. This bounds spend independent of the capability band.

**4. Loosening.** If fewer than the requested `top_n` models survive both bounds, relax them in a fixed order — drop the floor by 10 percentage points, then, if still short, double the ceiling — repeating until `top_n` pass. Every loosening event is recorded in the output metadata, so a configuration that could not be satisfied at its stated posture says so rather than silently returning too few options.

**5. Order and take.** Sort the survivors — cheapest-first when the posture is economy, intelligence-descending when the posture is maximum capability — and take `top_n`: one primary plus its fallbacks.
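
The five steps above can be sketched as one function. This is an illustrative reading of the algorithm, not the production selector: the `Model` class, the `select` signature, and the event strings are hypothetical, and the sketch assumes loosening stops once every frontier model passes (it cannot return more models than the frontier holds).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Model:
    id: str
    aa_intelligence_index: float  # capability axis, 0-100
    blended_per_m: float          # cost axis, dollars per million tokens


def pareto_frontier(models: List[Model]) -> List[Model]:
    """Step 1: drop every model that some other model dominates."""
    def dominates(a: Model, b: Model) -> bool:
        no_worse = (a.aa_intelligence_index >= b.aa_intelligence_index
                    and a.blended_per_m <= b.blended_per_m)
        better = (a.aa_intelligence_index > b.aa_intelligence_index
                  or a.blended_per_m < b.blended_per_m)
        return no_worse and better
    return [m for m in models if not any(dominates(o, m) for o in models)]


def apply_bounds(frontier: List[Model], top: float,
                 floor_pct: Optional[float], ceiling: Optional[float]) -> List[Model]:
    """Steps 2 and 3: intelligence floor, then optional cost ceiling."""
    kept = frontier
    if floor_pct is not None:
        kept = [m for m in kept if m.aa_intelligence_index >= top * floor_pct / 100]
    if ceiling is not None:
        kept = [m for m in kept if m.blended_per_m <= ceiling]
    return kept


def select(models: List[Model], top_n: int, floor_pct: Optional[float] = None,
           ceiling: Optional[float] = None, loosen: bool = False,
           cheapest_first: bool = True) -> Tuple[List[Model], List[str]]:
    if ceiling is not None and ceiling <= 0:
        raise ValueError("ceiling must be positive")
    frontier = pareto_frontier(models)
    if not frontier:
        return [], []
    top = max(m.aa_intelligence_index for m in frontier)
    target = min(top_n, len(frontier))  # assumption: stop when the frontier is exhausted
    events: List[str] = []
    kept = apply_bounds(frontier, top, floor_pct, ceiling)
    # Step 4: lower the floor by 10 points, then double the ceiling, and repeat.
    while loosen and len(kept) < target:
        if floor_pct:  # floor is set and still above zero
            floor_pct = max(0, floor_pct - 10)
            events.append(f"floor lowered to {floor_pct}%")
            kept = apply_bounds(frontier, top, floor_pct, ceiling)
            if len(kept) >= target:
                break
        if ceiling is not None:
            ceiling *= 2
            events.append(f"ceiling raised to {ceiling}")
            kept = apply_bounds(frontier, top, floor_pct, ceiling)
    # Step 5: order by posture and take top_n.
    if cheapest_first:
        kept = sorted(kept, key=lambda m: (m.blended_per_m, -m.aa_intelligence_index))
    else:
        kept = sorted(kept, key=lambda m: (-m.aa_intelligence_index, m.blended_per_m))
    return kept[:top_n], events


# Made-up catalog for illustration only; "c" and "e" are dominated.
catalog = [
    Model("a", 60, 10.0), Model("b", 50, 2.0), Model("c", 45, 3.0),
    Model("d", 30, 0.2), Model("e", 58, 12.0),
]
optimum, _ = select(catalog, top_n=3, floor_pct=80)
budget, budget_events = select(catalog, top_n=3, floor_pct=50, ceiling=5.0, loosen=True)
```

On this toy catalog the 80% floor keeps `b` and `a` and returns them cheapest-first. The budget posture starts with two survivors, lowers the floor once, doubles the ceiling once, and records both events.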

**The presets are named points on this frontier.** They are not separate code paths; they are parameter sets for the one algorithm, which is the property that makes them comprehensible:

| Preset | Floor | Cost ceiling | Loosening | Effect |
|---|---|---|---|---|
| **Premium** | none | none | no | Best-of-bucket by intelligence; cost ignored. |
| **Optimum** | 80% | none | no | Cheapest model scoring at least 80% of the bucket's top intelligence. The default; most configurations land here. |
| **Budget** | 50% | $5 / M | yes | Cheapest within a looser band under a hard cost cap; loosens before giving up. |
| **Free** | — | — | — | Free models only, Pareto-pruned, intelligence-descending. |

Optimum is the design center, and it states the whole thesis in one line: *take the efficient frontier, keep everything that scores at least 80% of the best, and buy the cheapest of those.* You give up at most the top 20% of measured capability and, in exchange, you usually pay a small fraction of the top price, because the frontier is steep at the top, where the last increments of benchmark score cost the most.
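
Because the presets are parameter sets rather than code paths, they can be written down as plain data. The sketch below is one possible encoding of the table; the `Preset` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Preset:
    floor_pct: Optional[float]      # None means no intelligence floor
    ceiling_per_m: Optional[float]  # None means no cost ceiling
    loosen: bool
    cheapest_first: bool            # False means intelligence-descending
    free_only: bool = False


PRESETS = {
    "premium": Preset(None, None, False, cheapest_first=False),
    "optimum": Preset(80, None, False, cheapest_first=True),
    "budget": Preset(50, 5.0, True, cheapest_first=True),
    "free": Preset(None, None, False, cheapest_first=False, free_only=True),
}


def floor_threshold(top_intelligence: float, preset: Preset) -> Optional[float]:
    """threshold = top_intelligence * floor_pct / 100, or None without a floor."""
    if preset.floor_pct is None:
        return None
    return top_intelligence * preset.floor_pct / 100
```

In a bucket whose best model scores 60, for instance, Optimum admits models scoring 48 or higher and Budget starts at 30 before any loosening.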

**Size buckets keep the frontier honest per slot.** Each slot declares a `size_bucket`. Utility slots — input cleanup, classification, retrieval planning — declare `small`; analysis slots declare `large`. The selector filters to the bucket before it runs the frontier, so a cleanup slot never reaches for a frontier reasoning model (overkill: full price and latency for no quality gain on a step a small model does correctly) and an analysis slot never drops to a tiny one (underkill: a quality risk on the step that carries the task). `top_n` is set per slot too — three for utility, four for the Gear-4 analysis pair — so each slot gets a primary and enough fallbacks to survive provider outages.

**Adversarial diversity is a selection constraint, not a separate stage.** The depth and breadth analysis slots in Gear 4 (the parallel adversarial pass, where two models work the same problem independently so each can catch the other's errors) are most valuable when the two models come from different training lineages, because shared lineage means shared blind spots. The selector enforces this by excluding depth's chosen primary from breadth's candidate set, which guarantees a different model and pushes breadth toward a different model family. (An earlier design made cross-company diversity its own pipeline stage, a notional "Gear 5." That was retired: diversity is cheaper and cleaner as a constraint on selection than as a stage in execution.)
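
The exclusion rule is small enough to show directly. In this sketch the ranked candidate lists are assumed to come from the selector already ordered, and the function name is hypothetical.

```python
from typing import List, Optional, Tuple


def pick_adversarial_pair(depth_ranked: List[str],
                          breadth_ranked: List[str]) -> Tuple[str, Optional[str]]:
    """Pick depth's primary, then pick breadth's primary from what remains."""
    depth_primary = depth_ranked[0]
    # Exclude depth's primary from breadth's candidate set.
    breadth_candidates = [m for m in breadth_ranked if m != depth_primary]
    breadth_primary = breadth_candidates[0] if breadth_candidates else None
    return depth_primary, breadth_primary
```

Note what the rule does and does not guarantee: the two primaries always have different model ids, but nothing in the exclusion itself checks provider or training lineage.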

**Alternatives considered and rejected:**

- *One model for every slot.* The status quo for most deployments. Rejected because it forces a wrong-direction error somewhere in every process: priced for analysis, it overpays on utility; priced for utility, it fails analysis.
- *Manual per-slot assignment.* Workable for five models; unworkable for 358, and it rots the moment a provider ships a new model or changes a price. The frontier algorithm re-derives the picks from the refreshed catalog with no human in the loop.
- *Rank by intelligence alone, ignore cost.* This is exactly the Premium preset — offered, because some users want it, but rejected as the *default* because it pays frontier rents at every slot, including the slots where a model at 80% of the top score produces identical output.

**The tradeoff.** The frontier is only as good as the capability metric, and `aa_intelligence_index` is a single external benchmark — one number standing in for "how good is this model." It lags new releases, it cannot see task-specific strength, and a published benchmark is a target that can be optimized against. The system buys a large reduction in selection effort and spend, and accepts a dependency on a metric it does not control. That dependency is the first entry in the open-problems list below.

---

## Layer 3 — Routing and graceful degradation

Selection produces, per slot, a primary and an ordered fallback chain. **Routing is what turns that chain into a running process when reality intervenes** — when the primary is offline, rate-limited, already busy, or blocked by a hardware constraint.

A *configuration* maps each pipeline *cell* — a slot at a given gear — to its primary and fallbacks. (*Gear* is the tier of analytical machinery applied to a task: Gear 3 is a sequential pass; Gear 4 is the parallel adversarial pass. Heavier work runs at a higher gear.) The configuration in use is derived from context, not chosen by the user per query: interactive work resolves against the `user-pipeline` configuration; background work — the `autonomous` and `agent` execution contexts — resolves against `background-default`. Interactive work reserves its endpoints, so a background process never starves the human at the keyboard.
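
As a rough sketch, the context-to-configuration rule is a lookup, and a cell is keyed by slot and gear. The context name `interactive` and the helper below are hypothetical; `autonomous`, `agent`, `user-pipeline`, and `background-default` are the names used above.

```python
from typing import Tuple

# Execution context -> configuration name. "interactive" is an assumed label
# for interactive work; the other names come from the text.
CONFIG_BY_CONTEXT = {
    "interactive": "user-pipeline",
    "autonomous": "background-default",
    "agent": "background-default",
}

# A cell is a slot at a given gear, for example ("depth", 4).
CellKey = Tuple[str, int]


def configuration_for(execution_context: str) -> str:
    """Derive the configuration from context; the user never picks it per query."""
    return CONFIG_BY_CONTEXT[execution_context]
```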

**Single-slot resolution** walks the chain and returns the first usable model:

```
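route(cell):
  for model in cell.chain:          # primary, then fallbacks, in order
    if model is STOP:          return HALT   # STOP marker: halt, do not fall past it
    if not model.enabled:      continue
    if not model.available:    continue   # offline / rate-limited / busy
    if violates_constraint(model): continue   # hard constraint, e.g. RAM or GPU
    return model
  return DOWNGRADE                  # chain exhausted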
```

**Process-level resolution** wraps that in a gear-downgrade cascade. If every model in a slot's chain is exhausted, the slot returns `DOWNGRADE` and the whole process drops a gear and retries — a parallel adversarial pass it cannot staff becomes a sequential pass it can:

```
execute(requested_gear):
  for gear in [requested_gear, requested_gear-1, ..., 1]:
    assignments = resolve every slot the gear needs via route()
    if any slot returned HALT:       return halted(status)
    if any slot returned DOWNGRADE:  continue        # try the next gear down
    return run(gear, assignments)
  return halted(status)
```
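
For readers who want to run the cascade, here is a minimal Python sketch of both levels of resolution. It assumes a `STOP` marker in a chain resolves to `HALT`, that gears without a configured cell set are skipped, and that an `Endpoint` carries only `enabled` and `available` flags. All names, including the `analysis` slot at Gear 3, are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

STOP = "STOP"            # chain marker: halt rather than fall past this point
DOWNGRADE = "DOWNGRADE"  # chain exhausted: try a lower gear
HALT = "HALT"            # STOP marker reached: stop the whole process


@dataclass
class Endpoint:
    id: str
    enabled: bool = True
    available: bool = True  # False when offline, rate-limited, or busy


def route(chain: List, violates_constraint: Callable = lambda m: False):
    """Return the first usable endpoint in the chain, or HALT / DOWNGRADE."""
    for entry in chain:
        if entry is STOP:
            return HALT
        if not entry.enabled or not entry.available:
            continue
        if violates_constraint(entry):
            continue
        return entry
    return DOWNGRADE


def execute(requested_gear: int, cells_by_gear: Dict[int, Dict[str, List]]) -> dict:
    """Resolve every slot at a gear; drop one gear whenever a slot is exhausted."""
    for gear in range(requested_gear, 0, -1):
        if gear not in cells_by_gear:  # assumption: unconfigured gears are skipped
            continue
        assignments = {slot: route(chain)
                       for slot, chain in cells_by_gear[gear].items()}
        if any(r is HALT for r in assignments.values()):
            return {"status": "halted", "gear": gear}
        if any(r is DOWNGRADE for r in assignments.values()):
            continue  # try the next gear down
        return {"status": "ran", "gear": gear,
                "assignments": {s: e.id for s, e in assignments.items()}}
    return {"status": "halted", "gear": None}


cells = {
    4: {"depth": [Endpoint("m1")],
        "breadth": [Endpoint("m2", available=False), Endpoint("m3", enabled=False)]},
    3: {"analysis": [Endpoint("m2", available=False), Endpoint("m1")]},
}
result = execute(4, cells)
```

In this example the breadth chain at Gear 4 is exhausted, so the process drops to Gear 3 and runs there on `m1`, and the result records the gear actually executed.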

This is the reliability payoff of the whole selection stack: **"my preferred model is unavailable" degrades to "the task ran at a lower gear," not "the task failed."** Degradation is bounded and visible — the run records the gear it actually executed and why it dropped — rather than silent.

Two classes of rule guard the resolution:

- **Hard constraints** block a model outright. Two large local models cannot run in parallel on one machine's GPU (the MLX Metal constraint), and a model whose estimated resident memory exceeds `AVAILABLE_MODEL_RAM` cannot load. A `STOP` marker in a chain makes resolution return `HALT`, which stops the process rather than letting it proceed on an unacceptable model.
- **Soft warnings** are surfaced but do not block: a frontier model sitting in a utility slot (overkill), a weak or free model in an analysis slot (underkill), and the same provider on both halves of the adversarial pair (reduced independence). These are advisory because the user may have a reason; they are surfaced because the default reading of a configuration should make its compromises legible.
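
The soft warnings can be pictured as a lint pass over a resolved configuration. The sketch below uses `size_bucket` as a stand-in for "frontier" and "weak", which is an assumption; the `Pick` class and its `provider` field are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Pick:
    slot: str
    slot_bucket: str   # bucket the slot declares: "small" or "large"
    model_bucket: str  # size_bucket of the chosen model
    is_free: bool
    provider: str      # hypothetical field naming the model's provider


def soft_warnings(picks: List[Pick],
                  adversarial_pair: Tuple[str, str] = ("depth", "breadth")) -> List[str]:
    """Return advisory warnings; nothing here blocks a configuration."""
    warnings = []
    for p in picks:
        if p.slot_bucket == "small" and p.model_bucket == "large":
            warnings.append(f"{p.slot}: overkill")
        if p.slot_bucket == "large" and (p.model_bucket == "small" or p.is_free):
            warnings.append(f"{p.slot}: underkill")
    by_slot = {p.slot: p for p in picks}
    first, second = (by_slot.get(s) for s in adversarial_pair)
    if first and second and first.provider == second.provider:
        warnings.append("adversarial pair: same provider, reduced independence")
    return warnings
```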

---

## Why the frontier framing matters

State the economic property without dressing it up: when selection routes each step to the cheapest model that clears the step's capability floor, frontier-model spend collapses to the few steps where frontier capability changes the answer. A thirty-call task does not pay thirty frontier prices; it pays a few, and spends the rest of its budget near the floor or at zero. The capability the task delivers is set by the slots that matter, which run at the top of the frontier, while the cost is set by the many slots that don't, which run at the bottom. That decoupling of delivered capability from total cost is the point of doing selection as a frontier search rather than a fixed choice.
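
A back-of-envelope calculation makes the decoupling concrete. Every number below is invented for illustration (call counts, tokens per call, and prices are not measurements from the catalog), and the helper name is hypothetical.

```python
def task_cost(calls):
    """Sum cost over (call_count, tokens_per_call, dollars_per_million) tuples."""
    return sum(n * tokens * price / 1_000_000 for n, tokens, price in calls)


# Thirty calls, all on a frontier model at an assumed $10 per million tokens.
all_frontier = task_cost([(30, 10_000, 10.0)])

# Same task routed: 4 calls at the assumed frontier price,
# 26 calls at an assumed $0.20 per million tokens.
routed = task_cost([(4, 10_000, 10.0), (26, 10_000, 0.20)])
```

With these made-up figures the routed task costs roughly $0.45 against $3.00, and almost all of the routed spend sits in the four frontier calls.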

---

## Open problems

Engineering honesty requires naming what the selection stack has not solved.

- **The capability metric is a single external benchmark.** `aa_intelligence_index` is one number from one source. It lags new models by days to weeks, it does not capture task-specific or mode-specific capability (a model strong on general reasoning may be weak at a particular analytical mode), and as a public benchmark it is gameable. There is no per-task capability model; a slot that needs a specific competence cannot yet express it to the selector beyond the global floor.
- **Dominance optimizes two axes; latency and throughput ride along unused.** The Pareto relation uses intelligence and blended cost only. Time-to-first-token and tokens-per-second are surfaced and checked but are not in the dominance computation, so a latency-sensitive interactive slot and a throughput-sensitive batch slot are selected against the same frontier when they want different ones. A multi-objective frontier is the obvious next step and is not built.
- **The intelligence floor is global, not per-mode.** One `floor_pct` governs a whole bucket. Some modes tolerate a lower floor than others; the system cannot currently set the floor per mode.
- **Diversity is enforced by exclusion, not by measured independence.** Breadth excludes depth's primary, which forces a different model *id* but not a different family, and two differently-named models trained on convergent data can still share blind spots. The system has no measure of genuine training-distribution independence, so the adversarial pair's independence is assumed from the difference in model identity rather than verified.
- **Availability is point-in-time.** `route()` checks whether an endpoint is reachable now; it does not model rate-limit recovery windows or predict which free-tier endpoint is about to throttle, so it can route to a model that fails on the next call. Predictive health, and modeling failure *correlation* across free endpoints, are unbuilt.

None of these blocks the architecture from doing its job — the selection stack runs in production every day — but each marks a place where the current mechanism is a defensible approximation rather than a finished answer.