The design claim: a model cannot verify its own output, so the verifier must be a different model, and the verification must be a structured, traced step rather than a request the same model answers about itself. The adversarial pipeline — Ora’s Gear 4 — is the mechanism that supplies the missing verifier: two models from different lineages work a problem independently, each attacks the other’s output, each revises under that attack, a verifier grades the result against an explicit gate, and the whole cascade runs under a per-step reliability layer that makes its own failures bounded and visible.
A model is a single forward pass over its inputs. The pass produces output. There is no internal verifier checking that output against an external standard, because the model has no external standard — it has its training distribution and its current input. Ask a model to “double-check its work” inside one prompt and what runs is another forward pass conditioned on the prior output plus the request to verify. The check is output of the same kind, from the same blind spots. It catches surface contradictions, because catching those is within the model’s capability. It does not catch the characteristic failures of model output — confident claims with no support, plausible reasoning that does not reach its conclusion, hallucinated specifics that fit the pattern but are not real. Those failures look right from inside the model that produced them. They look wrong from outside. The pipeline supplies the outside.
The cascade
Gear 4 runs as eight numbered steps. Steps 1 and 2 are the harness’s shared front end (cleanup, then context assembly); the adversarial cascade proper is Steps 3 through 8. Each step is a bounded model call assembled by the harness (the analysts never see the evaluator’s instructions, and the evaluator never sees the consolidator’s), and each step’s input and output are written to the per-turn forensic trace.
Front — consultation-augmented generation. Before the analysts write, the harness assembles a four-stream supplement: vault retrieval, conversation retrieval, the relationship graph, and live web. Approved sources bias the weighting; query intents are justification-gated; every retrieved chunk carries its provenance. The analysts write against retrieved, provenance-tagged context, not parametric memory alone — so a downstream verifier can check a claim against the source it was supposed to rest on.
Step 3 — parallel Depth and Breadth analysts. Two models, selected from different families for different blind-spot profiles, work the same problem independently and in parallel. Independence is the design point: each commits to its analysis before seeing the other’s, so the comparison in the next step measures genuine difference rather than one model anchoring on the other.
Step 4 — cross-evaluation, directional. Each model evaluates the other’s output against a universal seven-section evaluation contract — the breadth model critiques the depth stream, the depth model critiques the breadth stream. The instruction is to find what is wrong: unsupported claims, skipped inferential steps, confidence the evidence does not warrant. Not to summarize, not to refine. The evaluator is the model wearing its adversary’s hat, and it wears it against work it did not produce, which is the only configuration in which it has no social relationship with the output to protect.
Step 5 — parallel revisers. Each stream revises its own output in light of the cross-evaluation it received. Some criticisms are correct and the output changes; some are off-target and the output stands; the reviser records which, so the revision is itself inspectable rather than a silent rewrite.
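Steps 3 through 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not Ora's harness: the `Model` type, the `run_steps_3_to_5` function, and the prompt strings are all hypothetical, and the seven-section evaluation contract is reduced to a single instruction line.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical type: a model is any callable from prompt text to output text.
Model = Callable[[str], str]


def run_steps_3_to_5(problem: str, depth: Model, breadth: Model) -> dict[str, str]:
    """Run independent analysis, directional cross-evaluation, and revision."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Step 3: both analysts are submitted before either result is read,
        # so neither analysis can anchor on the other.
        d_fut = pool.submit(depth, f"ANALYZE (depth): {problem}")
        b_fut = pool.submit(breadth, f"ANALYZE (breadth): {problem}")
        d_out, b_out = d_fut.result(), b_fut.result()

        # Step 4: each model critiques the work it did NOT produce.
        d_crit_fut = pool.submit(breadth, f"FIND ERRORS IN: {d_out}")
        b_crit_fut = pool.submit(depth, f"FIND ERRORS IN: {b_out}")
        d_crit, b_crit = d_crit_fut.result(), b_crit_fut.result()

        # Step 5: each stream revises its own output under the critique it received.
        d_rev_fut = pool.submit(depth, f"REVISE: {d_out}\nCRITIQUE: {d_crit}")
        b_rev_fut = pool.submit(breadth, f"REVISE: {b_out}\nCRITIQUE: {b_crit}")
        return {"depth": d_rev_fut.result(), "breadth": b_rev_fut.result()}
```

The structural point is the ordering: results are only read after both calls in a step have been submitted, and the critique of each stream is always routed to the other model.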
Step 6 — cross-verification, up to two cycles. A verifier grades each revised stream against the mode’s verification criteria and returns a three-way verdict: PASS, FAIL, or BROKEN. FAIL triggers a re-revision of that stream and another verification pass, to a cap of two cycles. PASS and BROKEN both unblock the cycle — re-revising cannot fix a verifier that itself errored — but BROKEN is recorded as not verified, never as a silent PASS. (The auto-PASS-on-exception path that used to hide verifier failures was retired; a verifier that throws now lands an explicit BROKEN-not-verified marker in the health record, so the trend data reflects how often verification was actually performed.)
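The three-way verdict and the cycle cap can be sketched as a small loop. This is an illustration only: `Verdict`, `verify_stream`, and the health-record fields are hypothetical names, and it assumes that one "cycle" means one verification pass, which the text does not state explicitly.

```python
from enum import Enum
from typing import Callable


class Verdict(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    BROKEN = "BROKEN"


MAX_CYCLES = 2  # cap from the text; "cycle = one verification pass" is an assumption


def verify_stream(
    draft: str,
    verify: Callable[[str], Verdict],
    revise: Callable[[str], str],
    health: list[dict],
) -> tuple[str, Verdict]:
    """Verify a revised stream, re-revising on FAIL up to the cycle cap."""
    verdict = Verdict.BROKEN
    for cycle in range(1, MAX_CYCLES + 1):
        try:
            verdict = verify(draft)
        except Exception:
            # A verifier that throws is recorded as BROKEN, never as a silent PASS.
            verdict = Verdict.BROKEN
        health.append({
            "cycle": cycle,
            "verdict": verdict.value,
            # True only when a verification was actually performed.
            "verified": verdict is not Verdict.BROKEN,
        })
        if verdict is not Verdict.FAIL:
            return draft, verdict  # PASS and BROKEN both unblock the cycle
        if cycle < MAX_CYCLES:
            draft = revise(draft)  # FAIL: re-revise, then verify again
    return draft, verdict
```

Keeping `verified` separate from `verdict` is what lets trend data distinguish "passed" from "was never actually checked".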
Step 7 — consolidation. One stream consolidates the two verified analyses into a single irreducible output — not a concatenation, a synthesis that resolves their overlap and preserves their genuine disagreements.
Step 8 — final verification, load-bearing. A verifier grades the consolidated output. If it FAILs, one corrective revision of the consolidation runs. If it still FAILs, the output ships with a single-line warning header and an event is logged to the oversight queue. The final verifier is the last gate between the cascade and the user, and it is allowed to fail loudly rather than pass quietly.
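A minimal sketch of the final gate, assuming a boolean verifier for brevity. The `finalize` function, the `WARNING_HEADER` text, and the shape of the oversight event are hypothetical; the source only says that a single-line warning header is prepended and an event is logged.

```python
from typing import Callable

# Hypothetical wording; the text specifies a single-line header but not its content.
WARNING_HEADER = "[warning: final verification failed]"


def finalize(
    consolidated: str,
    verify: Callable[[str], bool],
    revise: Callable[[str], str],
    oversight_queue: list[dict],
) -> str:
    """Grade the consolidated output; allow one corrective revision, then fail loudly."""
    if verify(consolidated):
        return consolidated
    corrected = revise(consolidated)  # the single corrective revision
    if verify(corrected):
        return corrected
    # Still failing: ship with a visible warning and log the event.
    oversight_queue.append({"event": "final_verification_failed"})
    return f"{WARNING_HEADER}\n{corrected}"
```

The function never returns a failing output without the header, which is the "fail loudly rather than pass quietly" property.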
The per-step reliability layer
The cascade is adversarial; it is also wrapped, at every step, in a contingency layer that treats each model call as something that can degrade. This is the difference between a pipeline that is adversarial in principle and one that is reliable in production.
- Regenerate-on-unhealthy. Every step call runs through a retry that regenerates once when the output is unhealthy — a refusal, a clarification loop, a brief stub, a leaked tool call.
- Degraded-stream handling. If one analyst stream degrades past retry, the cascade continues but records the contingency, so the trace shows that cross-evaluation ran on half-healthy input rather than reporting a clean run. If both degrade, the cascade falls back to Gear 3 (sequential) with the healthier endpoint rather than pressing on with two broken streams.
- Consolidation fallback. If the consolidator degrades, the longer of the two revised streams ships under an explicit [degraded — consolidation failed] header, so the user gets output and is told it is degraded.
- Health bookkeeping. Every step writes a health record to step-health.json and the per-turn trace, so the cascade’s own failure rate is measurable over time, not just per run.
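The regenerate-once retry, the health record, and the consolidation fallback described in the list above might look like this. It is a sketch under loud assumptions: the health heuristics in `unhealthy_reason` (string prefixes, a 40-character stub threshold, a `<tool_call>` marker) are invented for illustration, clarification-loop detection is omitted, and the record fields are hypothetical. Only the degraded header text comes from the source.

```python
from typing import Callable, Optional

DEGRADED_HEADER = "[degraded — consolidation failed]"  # header text from the source


def unhealthy_reason(output: str, min_chars: int = 40) -> Optional[str]:
    """Return why an output looks degraded, or None. Heuristics are illustrative only."""
    text = output.strip()
    if text.lower().startswith(("i can't", "i cannot", "i'm unable")):
        return "refusal"
    if "<tool_call>" in text:  # hypothetical marker for a leaked tool call
        return "leaked_tool_call"
    if len(text) < min_chars:
        return "stub"
    return None


def call_with_regenerate(
    step: str, call: Callable[[], str], health: list[dict]
) -> tuple[str, bool]:
    """Run a step; regenerate once if unhealthy. Every attempt is recorded."""
    output = ""
    for attempt in (1, 2):  # original call plus at most one regeneration
        output = call()
        reason = unhealthy_reason(output)
        health.append(
            {"step": step, "attempt": attempt, "healthy": reason is None, "reason": reason}
        )
        if reason is None:
            return output, True
    return output, False  # degraded past retry; the caller decides the contingency


def consolidate_or_fallback(consolidated: str, healthy: bool, rev_a: str, rev_b: str) -> str:
    """Ship the consolidation, or the longer revised stream under the degraded header."""
    if healthy:
        return consolidated
    return f"{DEGRADED_HEADER}\n{max(rev_a, rev_b, key=len)}"
```

In this sketch the `health` list is what would be serialized to step-health.json and the per-turn trace; the file write is left out so the example stays side-effect free.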
The point of this layer is stated in reliability terms: the adversarial pipeline raises per-step quality, and the contingency layer bounds the cost of the pipeline’s own per-step failures. Without it, a degraded evaluator silently weakens the verification it was supposed to provide, and the run still reports success.
Why different models, not two instances
Two instances of the same model arguing catch some failures and miss the ones common to both — the shared training distribution is a shared blind-spot profile. Two different models share fewer: where one is prone to a particular confabulation, the other tends to flag it; where one is overconfident on a topic, the other tends to be cautious. The differences are productively asymmetric, and the cascade exploits them by routing the two analysis slots to different families. That routing is a model-selection constraint, not a pipeline stage — the breadth slot’s candidates exclude the depth slot’s chosen model so the second stream reaches for a different lineage. (An earlier design made cross-company diversity its own stage, a notional “Gear 5”; it was retired in favor of the selection constraint.)
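The selection constraint can be illustrated in a few lines. The function name and the model ids are hypothetical, and the sketch assumes the candidate list is already ordered by preference; it shows only the exclusion rule the text describes.

```python
def pick_breadth_model(candidates: list[str], depth_choice: str) -> str:
    """Pick the breadth-slot model from candidates, excluding the depth slot's choice."""
    eligible = [model_id for model_id in candidates if model_id != depth_choice]
    if not eligible:
        raise ValueError("no breadth candidate differs from the depth model")
    return eligible[0]  # assumes candidates are ordered by preference
```

For example, `pick_breadth_model(["family-a/model-1", "family-b/model-2"], "family-a/model-1")` returns `"family-b/model-2"`. Note that the rule compares model ids, not training data, so it constrains naming rather than measuring independence.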
What survives single-pass and does not survive structured challenge
Three failure patterns are the cascade’s reason for existing.
Confabulated specifics. A name, date, quotation, or statistic that fits the shape of the discussion but is not real. A single model emits these without flagging them as low-confidence; an evaluator asked to find errors catches them, because the evaluation frame turns the question from “does this read well” to “is this true”, and the consultation supplement gives it a source to check against.
Skipped reasoning. A chain that reaches its conclusion through an apparent sequence of inferences containing one step that does not follow. The model that produced the chain cannot see the gap, because the gap looks plausible from inside. A model reading the chain as an argument to grade surfaces it.
Misplaced confidence. A claim carrying the linguistic markers of certainty the evidence does not support. A single model’s defaults push toward declarative output; an evaluator marks the place where the tone outruns the support.
The audit trail
Because the cascade runs as discrete steps, each step’s output is preserved: which model produced the original analysis, which model challenged it, what the challenge said, how the revision handled it, what the verifier graded, what changed and what stood. The audit trail is the difference between trusting an output because it reads well and trusting it because its history is inspectable. For consequential work — legal drafting, medical summary, financial analysis — the trail is what makes the output usable without redoing it: the user spot-checks the steps where their own domain knowledge applies and accepts the rest on the strength of the recorded verification.
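One way to picture a per-step trace record is a small immutable data class serialized as JSON lines. Every name here (`TraceEntry`, its fields, `to_jsonl`, `entries_for_review`) is hypothetical; the source does not specify the trace format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TraceEntry:
    """One step of the cascade as it might be preserved in a trace (hypothetical schema)."""
    step: int          # cascade step number, e.g. 4 for cross-evaluation
    role: str          # e.g. "analyst", "evaluator", "reviser", "verifier"
    model: str         # which model produced this output
    input_text: str
    output_text: str


def to_jsonl(trace: list[TraceEntry]) -> str:
    """Serialize the trace as one JSON object per line."""
    return "\n".join(json.dumps(asdict(entry)) for entry in trace)


def entries_for_review(trace: list[TraceEntry], roles: set[str]) -> list[TraceEntry]:
    """Select the steps a reviewer wants to spot-check, e.g. only evaluator output."""
    return [entry for entry in trace if entry.role in roles]
```

A reviewer with domain knowledge could call `entries_for_review(trace, {"evaluator", "verifier"})` to read only the challenges and the grades, and accept the remaining steps on the recorded verification.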
Tradeoffs
The cost is real: Gear 4 runs two models across up to eight steps with as many as two verification cycles, so it spends multiples of a single model’s inference time and tokens on every query. The tradeoff is justified because the failures it catches (confabulated specifics, skipped reasoning, misplaced confidence) are the costliest errors in the system, and catching one structural error per consequential run more than pays for the lost throughput. This is also why Gear 4 is reserved for the deep-analysis modes that opt into it rather than applied to every prompt: the gear router sends trivial and factual inputs to cheaper gears, so the cascade’s cost lands only where its verification earns its price.
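A rough call count makes "multiples of a single model's inference" concrete. The arithmetic below is an estimate under assumptions, not a measured figure: it counts only Steps 3 through 8, treats one verification cycle as one verification pass, assumes the corrective revision in Step 8 is followed by a re-verification, and ignores the front end, retrieval, and any regenerate-on-unhealthy retries.

```python
STREAMS = 2
MAX_VERIFY_CYCLES = 2  # cap from the text; one cycle = one verification pass (assumed)


def cascade_model_calls(worst_case: bool) -> int:
    """Estimate model calls for Steps 3-8, excluding retries and the front end."""
    analysts = STREAMS      # Step 3
    cross_eval = STREAMS    # Step 4
    revisers = STREAMS      # Step 5
    if worst_case:
        # Step 6: per stream, two verification passes with one re-revision between them.
        verification = STREAMS * (MAX_VERIFY_CYCLES + (MAX_VERIFY_CYCLES - 1))
        final = 3           # Step 8: verify, corrective revision, re-verify
    else:
        verification = STREAMS  # Step 6: each stream passes on the first check
        final = 1               # Step 8: consolidated output passes first time
    consolidation = 1       # Step 7
    return analysts + cross_eval + revisers + verification + consolidation + final
```

Under these assumptions a clean run costs 10 model calls and a worst-case run costs 16, against a single call for a one-pass answer.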
The alternative considered and rejected was a single large model with an extended context window and explicit self-review instructions. It was tested. It produces more thorough self-evaluation with better prompting, and it still cannot generate genuine adversarial pressure against its own conclusions — it finds minor issues and misses structural ones, because the reviewer produced the output it is reviewing.
Open problems
- Shared blind spots survive. If both models lack information the correct answer requires, or share a training-distribution gap, the cascade produces confidently wrong output that passes verification. Diversity is enforced by excluding the other stream’s model id, which forces a different family but does not measure training-distribution independence — two differently-named models trained on convergent data can still fail together.
- Framework-level errors are out of scope. The cascade catches errors within a framework’s execution; it cannot catch errors in the framework’s design. If the framework asks the wrong question, both models answer the wrong question and cross-checking ratifies it.
- The pipeline cannot supply missing context. A claim that is wrong only against information the user holds but did not provide cannot be caught by models that also lack it.
- The reliability ceiling covers model misbehavior only. The per-step layer protects against refusals, clarification loops, stubs, and tool-call leaks. It does not protect against infrastructure failures such as API rate limits or provider outages. Raising the ceiling requires cross-provider fallback, circuit breakers, and result caching, none of which is built.
- BROKEN unblocks without recovery. A verifier that errors unblocks its cycle, because re-revision cannot fix the verifier — so a stream can ship without a genuine verification pass. The event is logged, not recovered.
- The two-cycle cap is a latency bound, not a convergence proof. A stream still failing after two correction cycles ships with a warning rather than blocking. The cap trades worst-case latency against the chance that a third cycle would have converged; it is a judgment call, not a guarantee.
None of these makes the cascade optional: it runs on every query in the deep-analysis modes that opt into it, and it is the part of the architecture that most visibly converts an invisible error rate into a bounded, inspectable one. Each open problem marks a place where the mechanism is a working approximation rather than a finished answer.