---
title: Subtle-Calculation Errors in LLM Pipelines
section: Ora — Foundation arguments
status: review
description: "A failure class legacy programming languages largely avoided: how subtle calculation errors arise in multi-step LLM pipelines, and how to instrument and catch them."
authors:
  - The Ora Foundation
downloads:
  md: /papers/white/subtle-calculation-errors-in-llm-pipelines.md
license: https://creativecommons.org/publicdomain/zero/1.0/
---

# Subtle-Calculation Errors in LLM Pipelines

*A failure class that legacy programming languages largely avoided — and how to detect and correct it in a multi-step LLM pipeline. Methodology: instrument every package boundary, run probe prompts, examine the trace for what should have happened but didn't. The structural fix is an authorised non-confabulation path: when a step has insufficient information, it requests a supplement; the orchestrator runs the query and re-submits the entire package as a fresh stateless call. Reliability claims that rest on adversarial review break down silently when the source steps confabulate, because the wrong answers are internally coherent and review cannot catch them.*

## 1. The reliability claim and the threat

Ora's value proposition leads with reliability. A multi-step analytical process that completes end-to-end without human intervention, repeatedly, is visibly different from one that fails partway through. The advertised mechanism is adversarial review: a Breadth model evaluates a Depth model's analysis, both revise under cross-critique, a verifier checks the result against universal criteria, and a consolidator produces the irreducible corpus that the formatter places into the user-facing deliverable.

This works when the failures it catches are *visible* failures — refusal, clarification loops, output that violates the universal seven-section contract. The reliability layer treats these as health signals, retries, falls back, and surfaces explicit degradation when contingencies cascade.

The class of failure the adversarial layer cannot catch is **internally coherent wrongness**: outputs that match the contract, pass evaluation, survive verification, and arrive at the user looking complete — but contain confabulated facts, fabricated relationships, or invented citations. The verifier cannot detect them because the verifier has the same training distribution and the same evidence the analyst had. The Breadth/Depth split helps when the failure is style or framing; it does not help when both models confabulate from the same gap.

Without a structural fix, the reliability claim is an illusion: pipelines that fail silently look exactly like pipelines that succeed, until a user notices a specific factual error and traces back. By then the trust is gone.

## 2. The Excel analogy

In legacy programming languages, the typical bug halts execution. A division by zero, a null dereference, a type mismatch — they raise an exception, surface a stack trace, and demand attention. The cost of these bugs is high in interruption, low in stealth. A user who runs the program either gets output or gets an error; ambiguity is rare.

The category that broke this pattern was the Excel formula. A spreadsheet user who builds a calculation with a subtle reference error gets a number back. The number looks plausible. It propagates to dependent cells. It feeds a chart, a report, a decision. The error only surfaces when someone notices the answer is wrong — which may be never. Excel did not crash. It produced output. The output was confidently incorrect.

LLM pipelines have generalised this failure pattern. Every step is a function that takes a package and returns text. When the package is incomplete or when a step's training is thin on a specific fact, the function still returns text. The text is plausible, internally coherent, and indistinguishable from a successful output. Confabulation in LLMs is the Excel-formula-error of the agent era: confident output that is close to right but wrong, in ways that humans cannot detect without independent verification.

The reliability claim "we use adversarial review" is exactly the wrong defense against this class. Adversarial review compares two plausible outputs to each other. When both are plausible-but-wrong, review confirms agreement. Confirmation is not correctness.

## 3. Why this failure class is specific to LLM pipelines

Three properties combine to create the subtle-calculation class:

**Output is unbounded.** Unlike a function with a typed return value, an LLM call returns prose. There is no compile-time check that the prose is correct; no runtime exception when the prose is wrong. The output is whatever shape the model produced, including shapes that pass downstream parsers but carry wrong content.

**Errors are silent at the call site.** When a model encounters a fact it cannot verify, it does not raise an exception. It produces a guess. The guess is structurally indistinguishable from a verified fact. The next step in the pipeline consumes the guess as input and continues. The error is silent at the point of origin and silent at every subsequent step.

**Verification is from the same distribution.** Adversarial review uses another LLM with similar training. If the training corpus is thin on a fact, both the analyst and the evaluator are equally likely to confabulate plausibly. They will agree. The verification step concludes VERIFIED. The user sees a confident answer.

These three properties together produce a class of failure that legacy programming avoided through type systems, explicit error returns, and external validation. LLM pipelines have given up these defenses and gained generality; the cost is a new failure class that needs new defenses.

## 4. The diagnostic methodology

The methodology is not "test until you find bugs." That works for unbounded-output systems no better than testing a randomly-malformed spreadsheet. The methodology is **instrument every package boundary, run probe prompts, examine the trace for what should have happened but didn't**.

The steps:

### 4.1 Map every package boundary

A package boundary is any place in the pipeline where one step's output becomes another step's input. In Ora, the boundaries are:

- **Phase A input** — raw user prompt + conversation context (the latter may be empty)
- **Phase A output** — cleaned prompt + corrections log + inferred items (assume-mode assumptions)
- **Pre-routing input** — the operational-notation prompt from Phase A
- **Pre-routing output** — dispatched mode, completeness gaps, pending clarification
- **Step 2 input** — step-1 result
- **Step 2 output** — context package (mode text + conversation RAG + concept RAG + relationship RAG + budget signals + utilization)
- **Step 3 input** — context package + cleaned prompt
- **Steps 3–8** — each step's output is the next step's input under the gear-specific structure

For each boundary, three questions:
1. What is supposed to be in the package?
2. What happens if it is empty?
3. Does the next step have an authorised path other than confabulation?

### 4.2 Find the silent-fallback sites

The most common LLM-pipeline anti-pattern is `try: result = do_thing(); except: result = ""`. The exception is lost; the empty result is consumed by the next step as if `do_thing()` had legitimately returned nothing. The pipeline continues with degraded context, producing degraded output that looks the same as healthy output.

In Ora's `run_step2_context_assembly`, this pattern appeared four times — once for each of the four RAG queries (conversation-RAG via the ranker, conversation-RAG via the legacy path, concept-RAG via the ranker, concept-RAG via the legacy path). Each was wrapped in a bare `try/except` that mapped to an empty string with no logging. If ChromaDB went down, if the embedding model errored, if the vault index was stale — the package silently arrived empty and the analytical pipeline proceeded as if no vault content was relevant.

The silent-fallback inventory in Ora as of 2026-05-15:

| Site | What fails | What gets silently substituted |
|---|---|---|
| `run_step2_context_assembly` Phase 5.6 conv-RAG | ChromaDB or embedding error | empty string |
| `run_step2_context_assembly` legacy conv-RAG | knowledge_search error | empty string |
| `run_step2_context_assembly` Phase 5.6 concept-RAG | ChromaDB or embedding error | empty string |
| `run_step2_context_assembly` legacy concept-RAG | knowledge_search error | empty string |
| `run_step2_context_assembly` relationship-RAG | RAGEngine init or traversal error | empty string + stderr print |
| `run_gear4` Step 6 verifier (×2 streams) | verifier model exception | `"VERIFIED\n[Verification error, auto-pass: <e>]"` — failure → pass |
| `rag_engine.plan_retrieval` | (deliberate, not a failure) | keyword extraction instead of model-planned retrieval — the "planner model integration comes after the basic pipeline" comment was still load-bearing at audit time |
| Mode dispatch when `standard` catch-all fires | Stage 2 pending clarification | `mode = "standard"` with empty `mode_text` — the entire downstream pipeline runs without per-step guidance |

All of these are "Excel-formula errors": the pipeline keeps running, the output looks like normal output, and the failure is invisible without trace.

### 4.3 Build the forensic trace

The trace is not an after-the-fact debugging tool; it is a *primary* artifact. Without per-step inputs and outputs persisted to disk, the silent-fallback sites are invisible. Three properties make the trace useful:

**Per-turn directory structure.** Each pipeline turn lands in `~/ora/data/pipeline-traces/<conversation_id>/<utc-timestamp>/`. Each step writes one structured JSON file and one human-readable Markdown sibling. The structured file is for diffing across turns; the Markdown is for forensic reading.

**Failure logs as JSONL.** Silent fallbacks (RAG retrieval errors, supplement-request rejections, contingency-path entries) append to JSONL files in the same directory. The JSONL format is append-only — every event is preserved, every replay is reconstructible.

**Step-health summary.** Every turn ends with a `step-health.json` recording each step's pass/fail verdict and the list of contingency paths that fired. The summary file makes pattern detection across many turns a `jq` away.

The trace must be *defensive*: every disk operation wrapped in try/except, atomic writes via rename-on-write, failures printed to stderr but never breaking the pipeline. A trace that crashes the system it instruments is worse than no trace at all.

### 4.4 Run probe prompts and examine the trace

Probe prompts are not "good queries" or "bad queries"; they are queries designed to **exercise specific boundaries**. A short list:

- **Bypass-class prompt** ("Hello, what time is it?") — exercises Phase A + pre-routing Stage 1. Should bypass to direct response. If it doesn't, the bypass signal vocabulary is incomplete.
- **Vault-content-required prompt** ("Summarize the canonical changes to the RAG provenance system in the Ora vault") — should pull concept-RAG. If concept-RAG is empty in the trace but the model produces a confident summary, the model is confabulating from training. The trace confirms the failure unambiguously.
- **Specific-number prompt** ("What was Apple's Q3 2024 revenue?") — model has no tool for the answer. Should emit SUPPLEMENTAL RAG REQUEST or COVERAGE GAP. If it produces a number without trace evidence of retrieval, the number is invented.

Run each probe. Read the trace. The trace will tell you what happened. Compare against what should have happened. Each mismatch is a silent failure to name.

The first probe run on the live Ora pipeline (2026-05-15) exposed five silent failures:

1. **Bypass-vocabulary gap.** "Hello, what time is it right now?" matched zero Stage-1 bypass triggers. The prompt fell through to Stage 2, which treated it as an analytical query and emitted a confused disambiguation question ("are you trying to figure out who benefits / check the argument / decide what to do / understand why").
2. **Pending clarification swallowed.** Stage 2 produced a `pending_clarification` field but the system dispatched to `mode = "standard"` and proceeded anyway. The clarification ended up in `classification_reasoning`, and `server.py` is supposed to surface it via SSE; this user-CLI invocation did not. The question disappeared.
3. **Standard catch-all has empty mode text.** `load_mode("standard")` returned 0 characters. The mode that catches "everything else" has no `## DEPTH ANALYSIS GUIDANCE`, no `## CONSOLIDATION GUIDANCE`, no `## OUTPUT FORMAT GUIDANCE` — every per-step extraction returned empty.
4. **Empty RAG without explanation.** Both conversation-RAG and concept-RAG returned `""`. The trace cannot distinguish "vault has nothing relevant" from "ChromaDB returned an error that was caught and ignored" — because the prior `try/except` swallowed the error.
5. **Confabulated response.** The bypass response asserted "Friday, May 15, 2026 at 10:07:49 AM PDT" without any tool call, system-clock query, or external grounding. The model fabricated a plausible timestamp.

Each of these is a real failure visible because the trace recorded the package. None were visible before the trace was added.

## 5. The structural fix

Two fixes are required, not one:

### 5.1 Replace silent fallbacks with named failures

The four `try/except: result = ""` sites in `run_step2_context_assembly` were replaced with `try/except: result = ""; record_rag_failure(trace_dir, query_type, query, error)`. The fallback behaviour is identical — RAG failure still produces an empty string and the pipeline still proceeds — but the failure now lands in `rag-failures.jsonl` with the exception text, the query, and the timestamp. Subsequent runs can be examined for repeated failures, transient failures, vault-coverage gaps, embedding-model errors.

This is a small change with large consequences. Every silent fallback in the pipeline can be replaced with the same pattern: keep the graceful-degradation behaviour, but make the failure visible.

### 5.2 Authorise a non-confabulation path

The deeper fix is to give every analytical step an instruction other than "confabulate when uncertain." The **Supplemental RAG Protocol** is the structural answer:

- Every analyst, evaluator, reviser, verifier, and consolidator system prompt now ends with a universal protocol section.
- When the model encounters a factual claim, name, date, statistic, or relationship that the package does not support and that it cannot verify from training, it emits a `## SUPPLEMENTAL RAG REQUEST` block with three fields: the gap statement, the query terms, and why the gap matters.
- The orchestrator detects the block, runs the query against the vault knowledge collection, appends the result as a `## SUPPLEMENTAL RAG RESULT` message, and **re-submits the entire package as a fresh stateless call to the same endpoint**. The model sees its own prior request, the orchestrator's fetched result, and the original task — and re-runs its analysis with the new information.
- Cap: two supplements per step. After two, the model is instructed to emit a `## COVERAGE GAP` admission instead.
- Every request is logged to `supplemental-rag.jsonl` in the per-turn trace: the gap statement, the query terms, the supplement-result length, and whether resubmission resolved the gap.

The COVERAGE GAP is critical. A confident-looking confabulation is the worst outcome; an admission of what cannot be verified is the second-worst; a verified answer is the best. The cap forces the model toward second-worst rather than worst when retrieval fails to close the gap. Reviewing COVERAGE GAP frequency across many turns is the empirical signal of where the vault is thin and where the user might prioritise content creation.

### 5.3 The protocol does not guarantee compliance

The protocol authorises the path. It does not guarantee the model uses it. Models trained to produce plausible output have a strong default toward confabulation; this default is not erased by an instruction. Three failure modes are named explicitly in the spec:

- **The Confident Confabulator** — model produces a plausible answer without emitting a request. Detection: trace shows no supplement request, but the answer contains specific verifiable claims the package did not supply.
- **The Always-Requester** — model emits a request on every step, even when the package is adequate. Detection: high supplement frequency on prompts that should not need supplementation.
- **The Cap-Forced Confabulator** — model hits the 2-supplement cap and confabulates instead of emitting COVERAGE GAP. Detection: two supplements followed by an answer with specific unverifiable claims.

Each of these is observable in the trace. None is fixable by the protocol alone. The protocol is necessary, not sufficient — strengthened by training-time anti-confabulation discipline (RLHF on coverage-gap admission, fine-tuning on supplement-request examples) and by oversight dashboards that surface the metrics.

## 6. Generalizable recommendations

For any multi-step LLM pipeline making reliability claims:

**Instrument every package boundary.** Per-step input and output to disk, structured JSON beside human-readable Markdown, in per-turn directories. No silent transit — every boundary observable.

**Name your failure modes.** Build a vocabulary for the failure modes your pipeline can exhibit. Ora uses Budget Signals 0–6 (clean, compression warning, critical truncation, analytical floor breach, hardware constraint, RAG planner fallback, spawning constraint). Each named code is detectable and trackable; "an exception happened somewhere" is not.

**Replace silent fallbacks with named failures.** Every `try/except: result = ""` is a candidate. Keep the graceful-degradation behaviour, but log what was caught. The cost is one line; the visibility return is substantial.

**Authorise non-confabulation paths.** Every step where the model could confabulate needs an instruction telling it what to do *instead*. SUPPLEMENTAL RAG REQUEST is one such instruction; COVERAGE GAP is another. "Just don't confabulate" is not an instruction; it is a hope.

**Re-submit as fresh stateless calls.** Don't try to patch the context window incrementally. When new information arrives, re-submit the whole package. This honours the stateless-function model, keeps each call reproducible, and avoids the context-window drift that produces its own subtle failures.

**Cap retrieval depth, and instrument what hits the cap.** Two supplements per step is the cap in Ora because three would produce loops without proportionate benefit. The cap-hits are themselves a metric — they tell you where the vault is thin and where the model is reaching beyond what retrieval can give it.

**Per-step health summaries, not just full traces.** The full trace is for forensic reading. The `step-health.json` is for trend detection. Pattern: "the consolidator step is degraded on 12% of multi-stream turns" is far more actionable than 1000 individual traces.

**Verifier exceptions must not auto-PASS.** Treating "model crashed during verification" as "VERIFIED" is the kind of fail-soft that destroys reliability claims. Make verifier exceptions visible (logged to a contingency list) even when the cycle proceeds, so trend data shows where verification is genuinely happening and where it is being skipped.

## 6a. Second probe run — four more silent failures

The second probe run on 2026-05-15 (after the trace landed) sent an analytical prompt designed to route to a specific mode: *"Cui bono on the harness paradigm shift in commercial AI — who benefits when frontier labs reposition from selling models to selling harnesses, and who loses?"*. Expected behaviour: Stage 1 forwards to Stage 2, Stage 2 picks `cui-bono` mode on the strong "cui bono" signal, Gear 4 runs, the full 8-step adversarial pipeline executes through to a consolidated and formatted output.

Observed behaviour: bypass=True, dispatched=None, mode=`simple`, gear=2, single Haiku-API response. The adversarial pipeline never ran. The user got a clean-looking tradeoff analysis that *appeared* to be the output of an 8-step process.

Reading the trace exposed four additional silent failures, on top of the five from the first probe:

**Silent failure #6 — substring-collision in bypass detection.** The `STRONG_BYPASS_TRIGGERS` list included `"no analysis"` as a literal phrase, intended to catch user instructions like *"don't analyze this — just summarise"*. The signal-presence helper used substring matching for multi-word triggers with the comment *"low collision risk"*. The collision exists: `"no analysis"` matches inside `"cui bono analysis"` (`b[ono analysis]`). Every cui-bono prompt was routed to bypass. The pre-routing pipeline's matching algorithm had a structural false-positive on a specific common phrase.

The fix: word-boundary anchors on multi-word triggers. The bypass trigger now matches `\bno analysis\b` only — `"bono analysis"` no longer collides because `o` precedes `n` (boundary fails). The fix is one regex change; the failure class it eliminates is *every analytical mode whose name contains a word that, with a leading letter, spells a bypass trigger*. This is exactly the *Excel-formula error* class: the matching algorithm was subtly wrong, the wrong answer was internally coherent (bypass fires, response returns), and the failure was invisible until per-step input/output landed in the trace.

**Silent failure #7 — Phase A expansion poisoning the next stage.** Phase A's job is to expand the raw user prompt into rich operational notation — explicit constraints, named stakeholders, evaluation axes. The expansion does its job well. But the richer text gives Stage 1 *more substring to match against*. The original raw prompt `"Cui bono on the harness paradigm shift…"` would not have triggered `"no analysis"`. The expanded notation containing `"Structured cui bono analysis mapping stakeholder…"` did. Phase A's output is *strictly more verbose* than its input, so any substring-matching detector applied to Phase A output is strictly more likely to false-positive than the same detector applied to the raw prompt. This is a layering bug: each layer's input space is shaped by the previous layer's behaviour, and substring detectors don't compose cleanly across layers.

The structural fix: detectors that fire on the user's intent should run on the *raw* prompt before Phase A's expansion. Detectors that need the expanded form (territory dispatch, signal vocabulary lookup) run after. The current pre-routing pipeline does the bypass-detection step against post-Phase-A text, which is the wrong layer.

**Silent failure #8 — catch-all modes have no mode file.** When pre-routing falls back to the `simple` or `standard` catch-all modes, `load_mode("simple")` and `load_mode("standard")` return empty strings — the files do not exist in `~/ora/modes/`. The per-step section extractor returns empty for every section (DEPTH GUIDANCE, BREADTH GUIDANCE, EVALUATION CRITERIA, REVISION GUIDANCE, VERIFICATION CRITERIA, CONSOLIDATION GUIDANCE, OUTPUT FORMAT GUIDANCE). The pipeline runs with no mode-specific instructions at all.

The server (`server.py::_pipeline_stream`) has a guard that falls through to direct-stream when `step1["mode"]` is `simple` or `standard`. The orchestrator-CLI path (`run_pipeline` called directly) does not have this guard. The two paths produce different observable behaviour for the same prompt — server-side surfaces a direct response; CLI-side runs Gear 2 with empty instructions. Either way, the user thinks they got the analytical pipeline; they got something else.

**Silent failure #9 — Playwright session errors silently pass the verifier check.** During the Gear 4 probe run, the Step 6 verifier call to `chatgpt-browser` returned `"Playwright session error (chatgpt):"` followed by a session-expired error message. The orchestrator's `_run_model_with_tools` did not raise — it returned the error string as the model's output. The `_verifier_passed` helper checks for `VERIFIED` in the output and absence of `VERIFICATION FAILED`. Neither token appears in a Playwright error message — but `_verifier_passed` also has a fall-soft: when the output is shorter than 50 chars, it returns True ("garbled verifier — don't block"). Longer error messages don't hit this fall-soft, but the orchestrator's `try/except` wrapper around the future result substitutes `"VERIFIED\n[Verification error, auto-pass: <e>]"` when an *exception* fires. Between these two paths, virtually every verifier-side browser-session failure converts to PASS.

This is the same class as the RAG silent-fallback: a real failure (browser session expired, requires re-auth) is treated as a successful verification verdict. The trace exposes it because the verdict-raw is now persisted; the operational pipeline's behaviour (continue to step 7 with "verified" analysis) is unchanged. Without the trace, this failure is undetectable.

The structural fix: distinguish *verifier model errored* from *verifier passed*. The auto-pass-on-exception pattern should be replaced with auto-pass-on-exception-with-named-contingency: the step still proceeds, but the contingency name `step6-cycle-N-verifier-exception-auto-pass` lands in `step-health.json` so trend data shows where verification is genuinely happening and where it's being skipped. (This contingency name is already recorded in the trace as of 2026-05-15; what remains is to act on the metric.)

## 6b. Cross-cutting lesson: detector layering

The substring-collision bug, the Phase A expansion poisoning, and the missing catch-all mode files all share a structural feature: *each detector or fallback assumes properties of its input that the previous layer does not preserve*. The substring detector assumes "no analysis" is unique enough to be a low-collision phrase. The bypass-detection layer assumes the prompt has not been expanded. The catch-all path assumes someone wrote the mode file. None of these assumptions hold under the real composition of the pipeline.

The methodology recommendation that follows: **before adding a detector, write down the layer it runs against and the properties it assumes**. When the layer above it changes its output shape, the detector becomes a candidate for re-audit. The trace's per-step package serialisation makes this audit possible — without it, the assumption-violation is invisible.

## 6c. The full nine-failure catalogue + remediation status (2026-05-15 close-out)

The two probe runs and the subsequent fix-everything-in-this-thread pass produced a complete remediation table. All nine failures are closed in code; the trace makes regressions visible by design.

| # | Failure | Status | Fix |
|---|---|---|---|
| 1 | Bypass-vocabulary gap — `Hello, what time is it` matched zero Stage-1 triggers | **Closed** | New Stage 0 pre-Phase-A bypass check runs on the raw user prompt before Phase A's expansion can mask the trigger; `STRONG_BYPASS_TRIGGERS` list expanded with date variants, prior-conversation references, grammar/spelling fixes, explicit opt-outs (ora 384e392 + 5c4e98a) |
| 2 | Pending clarification silently swallowed — Stage 2 produced `pending_clarification`, system dispatched to `mode=standard` anyway | **Closed** | New `_best_guess_mode_from_matches` picks highest-confidence candidate from Stage 1 matches; `_PENDING_CLARIFICATION_FALLBACK_MODE = "deep-clarification"` covers the no-matches case. Original clarification text preserved in `pending_clarification_swallowed` trace field. `classification_confidence` becomes `best-guess` or `fallback` so downstream code can distinguish (ora 5c4e98a) |
| 3 | Standard catch-all has empty `mode_text` — `load_mode("standard")` returned 0 chars; analytical pipeline ran with empty step prompts | **Closed (by removal)** | The new pending-clarification path no longer routes to `standard`; if a future call still tries to load a missing mode file, `load_mode()` now logs a stderr warning so the failure becomes visible (ora 5c4e98a) |
| 4 | Empty RAG without explanation — trace recorded `chars: 0` with no way to distinguish index-empty from filtered-out from no-match | **Closed** | New `_diagnose_rag_emptiness` runs two cheap probes (`col.count()` + filtered raw query) when a RAG result is 0 chars without an exception. Adds `empty_diagnosis` field to `step2-context.json` with collection_total_count, raw_chunks_returned, filtered_chunks_returned, type_filter_applied, and a categorised `empty_reason` (`index_empty` / `no_match` / `filtered_out` / `ranker_truncation_or_filter_threshold`). First probe surfaced a real signal: cui-bono mode's type_filter drops every conversations-collection chunk because conversations carry `type: chat`, not `type: engram` (ora b79c395) |
| 5 | Confabulated response — bypass model fabricated "Friday, May 15, 2026 at 10:07:49 AM PDT" with no tool call | **Closed** | New `_UNIVERSAL_ANTI_CONFABULATION` directive injected into every model call via `build_system_prompt_for_gear` immediately after `boot.md`. States explicitly: "never invent specific facts you cannot verify"; lists the high-risk class (names, dates, statistics, citations, URLs, system state); names "the honest 'I don't know' beats the confident wrong answer". Analytical steps additionally carry the Supplemental RAG Protocol via `_assemble_step_prompt`. Post-fix probe: same prompt now produces *"I don't have access to the current time. **Gap:** I cannot verify the current time from my training or available context."* (ora ac185b5) |
| 6 | Substring-collision in bypass detection — `"no analysis"` matched inside `"cui bono analysis"` | **Closed** | Word-boundary anchors added to multi-word triggers in `_signal_present`. Verified 15/15 unit-test cases pass including the cui-bono case + edge cases like "monocultural analysis" (no longer collides) and the explicit "no analysis needed" (still bypasses correctly) (ora 99b83fb) |
| 7 | Phase A expansion poisoning later detectors | **Closed (structural)** | Bypass detection now runs on the raw prompt at Stage 0 *before* Phase A's expansion. Stage 1 retains the bypass scan as a defensive backup against the rare case where Phase A legitimately reveals a bypass-worthy element. Detector-layering bug class is structurally eliminated for bypass detection; the more general layering risk is documented as `§6b. Cross-cutting lesson: detector layering` (ora 384e392) |
| 8 | Catch-all modes have no mode file — `simple.md` / `standard.md` didn't exist | **Closed** | Created `modes/simple.md` (paired vault `Modes/simple.md`) — real bypass-direct-response mode file with Gear 1, explicit anti-confabulation discipline, and clear "this mode does not use the analytical pipeline" framing. `load_mode` now warns on missing files. `standard` is intentionally retired (the new clarification path no longer routes to it) (ora 5c4e98a + vault 8cfa866e41) |
| 9 | Playwright session errors silently pass the verifier check — auto-PASS-on-exception substituted `"VERIFIED\n[Verification error, auto-pass: ...]"` | **Closed** | Three-way verdict resolution per cycle: PASS / FAIL / BROKEN. New `_verifier_broken` detects exception substitutions, Playwright errors, rate-limit messages, and very-short non-verdict outputs. Real verdict tokens win over short-output flags (a 36-char "VERIFIED. Holds." is valid); known broken markers win over verdict tokens (legacy auto-pass shape now BROKEN). Loop logic: PASS or BROKEN unblocks; only FAIL re-revises. Per-cycle contingency `step6-cycleN-<slot>-verifier-BROKEN-not-verified` lands in `step-health.json` so trend data shows how often verification is actually performed (ora 7320b2e) |

**The trace makes regressions visible by design.** Every fix above includes its own corresponding trace field. If a future change reintroduces any of these failure classes, the trace will surface it on the next probe — the user does not have to remember to test for it.

## 6d. Sweep-discovered failures #10–#13 (closed-out 2026-05-15)

The 2026-05-15 sweep through code paths not exercised by the original probes surfaced four additional failure candidates. All four were closed in the same day's commits.

| # | Failure | Fix | Commit |
|---|---|---|---|
| 10 | Phase A INFERRED_ITEMS treated as facts downstream — assume-mode assumptions baked into operational_notation and used by Stage 1, Stage 2, Step 2 RAG queries as if user-stated | `build_system_prompt_for_gear` injects an explicit `## PHASE A ASSUMPTIONS (NOT USER-STATED FACTS)` block after the universal anti-confab directive when `inferred_items` is non-empty. The block instructs the model to "treat each as a working assumption" and "name it explicitly when your analysis depends on it so the user can correct the interpretation." `run_step2_context_assembly` threads `inferred_items` through `context_pkg`. | ora `2a2adcd` |
| 11 | Visual block suppression has no trace entry — `visual_adversarial.process_response` suppressed visuals with Critical findings and the diagnostics went to `context_pkg` ephemerally, not to disk | `_run_visual_hook` writes `step-visual-hook.json` to the per-turn trace with `visuals_seen`, `visuals_suppressed` count, full diagnostics dict, and a hook-exception entry if the hook itself raised. context_pkg attachment continues for SSE event surfacing; the trace now has a forensic record. | ora `2a2adcd` |
| 12 | Conversation chunk indexing failures (post-turn) silently caught — `server.py::_save_conversation` had `try: collection.add(...) except: pass`, so ChromaDB failures meant the conversation was un-retrievable via RAG and the user never found out | Replaced bare `pass` with structured logging to `~/ora/data/conversation-indexing-failures.jsonl` — timestamp, conversation_id, chunk_id, chunk_path, error type + message, tag. If the failure-log write itself fails, fall through to a stderr WARNING with the same diagnostic content. The conversation continues uninterrupted; the fix is observability, not behaviour. | ora `2a2adcd` |
| 13 | Friction reducer over-skip (potential) — Stage 2's high-confidence dispatch via `_select_dispatch_mode` could fire on weak signal evidence; no per-turn audit of how many signals actually supported the dispatch | `step1-pre-routing.json.signal_strength_summary` records total / strong / weak match counts, strong signals supporting the dispatched mode specifically, and a derived `single_signal_high_confidence` flag that fires when high-confidence dispatch was supported by exactly one strong signal. The markdown trace renders a "Signal strength summary (friction-reducer audit)" section with a ⚠️ flag when the condition triggers. Observability only — does not change Stage 2's behaviour. | ora `2a2adcd` |

**Updated total: 13 silent failures closed.** Each fix carries its own trace field. The pipeline trace is now the load-bearing instrument for detecting future regressions across all 13 classes without semantic comparison of model output.

- **Compliance metrics.** How often do models actually use the SUPPLEMENTAL RAG REQUEST path when they should? The trace gives the empirical answer; long-run data will tell.
- **Cross-model variation.** Models from different labs have different confabulation tendencies. The Ora pipeline uses Hermes-4-70B, Kimi-Dev-72B, Qwen-3.5-27B local plus browser-mediated Claude / ChatGPT / Gemini. Each may have a different supplement-request rate. The trace should make this measurable; analysing it is future work.
- **Vault coverage closure.** Each COVERAGE GAP and unresolved supplement is a content-gap signal. A monthly review of `supplemental-rag.jsonl` aggregated across turns should drive vault enrichment.
- **Verifier-side coverage.** Supplements are wired into analyst/evaluator/reviser/consolidator. The verifier still uses `_run_model_with_tools` directly and would not honour a supplement request. Extending coverage there is straightforward but unscoped at the time of this paper.
- **Beyond vault retrieval.** Some gaps are not in the vault and never will be (current weather, specific live data). The protocol assumes vault is the source; extending it to web retrieval with provenance ratings is a natural next step.

## 8. Coda — the Excel parallel matters

The Excel-formula error class persisted for decades. Auditing spreadsheets is still its own profession; subtle calculation errors in financial models have caused real-world losses in the billions. The reason the class survived is exactly the reason it survives in LLM pipelines: confident-looking output that is wrong is harder to detect than no output at all.

Ora's reliability claim is a real one, but only if the silent failures are visible — which means the trace exists, the silent fallbacks are named, the analytical steps have an authorised non-confabulation path, and the empirical record of where the pipeline reaches for information accumulates over time. With those four together, the claim survives scrutiny. Without them, it is a marketing assertion that the next user with a specific factual question will refute by accident.

This paper is the methodology record. The trace infrastructure landed 2026-05-15 (commit `28c8657`). The Supplemental RAG Protocol landed alongside. The catalog of silent failures uncovered during the first probe run is in §4.4. The next probe runs — on Gear 3 and Gear 4 prompts with the local 70B/72B models — will extend the catalog and validate the structural fixes against real adversarial-pipeline executions.

---

## Cross-references

- `Specification — Supplemental RAG Protocol` — full contract for the request/result format and resubmission behaviour
- `~/ora/orchestrator/pipeline_trace.py` — the trace module
- `~/ora/orchestrator/boot.py::_parse_supplemental_request` / `_fetch_supplement` / `_call_with_supplement` — the orchestrator-side implementation
- `Framework — Conversation Processing Pipeline` — the RAG storage layer this paper's mechanisms read from
- `Reference — Ora YAML Schema` — provenance weights applied to supplements

## Status

v1.0, drafted 2026-05-15 during the pipeline-RAG audit session. Methodology applied to the live Ora pipeline that day; trace + supplement infrastructure shipped in the same commit window. Future revisions will extend §4.4 with Gear 3 and Gear 4 probe-run findings and §7 with compliance metrics once aggregated trace data is available.
