Tetlock Superforecasting

Why it matters

A real forecast isn’t a confident opinion — it’s a calibrated number you’d stake money on, and the people best at producing one win not by knowing more, but by how they build the number.

For example: asked whether a rival will ship a competing product by year-end, a pundit says “definitely” or “no chance.” A superforecaster says “about 65% — companies at this stage hit their ship dates roughly 60% of the time, nudged up a little for their recent hiring spree.” One of those is a feeling. The other is a base rate, an adjustment, and a number you could check.

What it reveals. Whether a forecast is actually calibrated — built from how often things like this happen, adjusted for what’s specific here — or just a confident-sounding guess dressed as analysis.
How it changes the read. You stop asking “what do I think will happen?” and start asking “out of all the times something like this came up, how often did it go this way — and what about this case moves it off that number?”
When to foreground it. Any question with a resolvable answer by a date — will it ship, will they win, will the number clear the threshold — where you want odds you could bet on rather than a story.
What you’d miss without it. That the loudest, most famous experts are often the worst forecasters, and that beating them takes method, not genius: start from the base rate, break the question up, move in small steps, give a number not a narrative.
Where it misleads. It only works on questions that actually resolve. Force it onto a vague or contested question (“will things get better?”) and the precise-looking number is false precision — the discipline can’t rescue a question that can’t be scored.

Realtime examples

See real, dated analyses where this discipline shaped the read on the news → Superforecasting on Main Street Independent

How to invoke it in Ora

You have a question about the future with a real answer coming — will it happen, by when — and you want odds you could act on, not a hedge.

Describe the question and the deadline, and ask:

“Forecast the probability that our main competitor launches a rival product before year-end — give me a calibrated estimate with the base rate and reasoning.”

Ora pins down what would count as yes or no, finds a reference class and states its base rate, adjusts for what’s specific to your case (showing the arithmetic), and hands back a probability range plus the signals that would move it.

One thing to know: the words “forecast,” “probability of,” “what are the odds,” “base rate,” or “Tetlock” are what route you here. A bare “what’s going to happen?” gets a clarifying question, because a forecast needs a resolvable outcome and a date, and the first thing Ora does is make you nail those down.

Say how you’d know whether the forecast came true — what observable fact, by what date. A question you can’t score isn’t a forecast yet, and Ora will push you to operationalize it before it commits to a number.

One thing Ora won’t do: hand you a single confident percentage to sound authoritative. It gives a range whose width is honest about the uncertainty, and it would rather widen the range than fake precision the evidence doesn’t support.

How it works

In the 1980s a young psychologist named Philip Tetlock started doing something almost no one bothers to do: he wrote down experts’ predictions and checked them. Over twenty years he collected roughly twenty-eight thousand forecasts from nearly three hundred professional pundits, analysts, and commentators — the people on television explaining what would happen next — and then he waited to see what actually happened.

The result became famous. The average expert, he found, was about as accurate as “a dart-throwing chimpanzee.” Worse, the more famous the expert, the less accurate they tended to be — the ones with bold, simple, TV-ready theories of how the world works did the poorest of all, because a single big idea makes you confident and confidence makes you wrong. The ones who did a little better were the opposite type: cautious, self-doubting, magpie thinkers who collected lots of small considerations and never trusted any one of them too far. Tetlock borrowed Isaiah Berlin’s labels — the hedgehog who knows one big thing, the fox who knows many small things — and the foxes won.

That could have been a cynical story about how nobody can predict anything. The second act is what makes it useful. The U.S. intelligence community ran a massive forecasting tournament, and Tetlock entered a team — not of spies or PhDs, but of ordinary volunteers: a retired computer programmer, a homemaker, a pharmacist. These amateurs, with nothing but the open internet, beat the professional intelligence analysts who had access to classified information, by a wide margin. A subset of the volunteers were so consistently accurate that Tetlock called them superforecasters, and the obvious question was: how?

The answer was almost disappointing. It wasn’t IQ, and it wasn’t secret information. It was method — a handful of habits anyone can copy. They started from the base rate: before looking at the specifics, they asked how often things like this happen in general (the “outside view”), and used that number as their anchor. They broke big questions into small ones they could actually estimate. They moved in small steps, updating a few points at a time as evidence came in rather than swinging wildly on the latest headline. They gave numbers, not stories — and finer-grained numbers than you’d expect, distinguishing 65% from 70% when the evidence justified it, instead of retreating to a vague “likely.” And they stayed humble and kept score, treating every forecast as a bet to be checked and every miss as something to learn from rather than explain away.
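Keeping score needs a scoring rule. The paragraph above does not name one, so the sketch below assumes the Brier score in its binary form: the mean squared error between stated probabilities and 0/1 outcomes, where lower is better. The function name and the sample track record are invented for illustration.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probabilities and 0/1 outcomes (lower is better)."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need two equal-length, non-empty lists")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical record of five resolved questions (1 = happened, 0 = did not).
outcomes = [1, 0, 1, 1, 0]

pundit = [1.0, 1.0, 1.0, 0.0, 0.0]        # "definitely" or "no chance" every time
forecaster = [0.7, 0.3, 0.65, 0.6, 0.2]   # graded probabilities
coin_flip = [0.5] * 5                     # always says 50%

pundit_score = brier_score(pundit, outcomes)
forecaster_score = brier_score(forecaster, outcomes)
coin_score = brier_score(coin_flip, outcomes)
```

On this made-up record the all-or-nothing pundit scores 0.40, worse than the 0.25 earned by always saying 50%, while the graded forecaster scores about 0.10. Two confident misses are enough to sink the pundit, which is the penalty for overconfidence that score-keeping makes visible.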

The deep move underneath all of it is the one most people skip: anchor on the outside view before the inside view. Faced with “will this startup succeed?” the untrained mind dives straight into the specifics — the brilliant founder, the hot market, the great demo — and talks itself into a number. The superforecaster first asks “what fraction of startups like this succeed?” — gets the depressing base rate — and only then lets the specifics nudge it up or down a little. The specifics feel like the whole story; the base rate is the thing that keeps you honest about how special this case really is. A good forecast is just that: a base rate you can defend, an adjustment you can explain, and a range wide enough to admit what you don’t know.
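The arithmetic in that last sentence can be sketched in a few lines of Python. This is a minimal illustration, not Ora's implementation: the function name `reference_class_forecast`, the additive nudges expressed in probability points, and the fixed `half_width` used for the range are all assumptions made for the example. The numbers reuse the rival-product case from the opening, a 60% base rate nudged up five points for the hiring spree.

```python
def reference_class_forecast(base_rate, adjustments, half_width=0.10):
    """Outside view first, then small, named inside-view nudges.

    base_rate   -- frequency of "yes" in the reference class, between 0 and 1
    adjustments -- dict mapping a named mechanism to a nudge in probability points
    half_width  -- assumed half-width of the reported range (illustrative only)
    """
    if not 0.0 <= base_rate <= 1.0:
        raise ValueError("base_rate must be between 0 and 1")
    estimate = base_rate + sum(adjustments.values())
    estimate = min(max(estimate, 0.01), 0.99)  # never claim certainty
    low = max(estimate - half_width, 0.0)
    high = min(estimate + half_width, 1.0)
    return estimate, (low, high)


estimate, (low, high) = reference_class_forecast(
    base_rate=0.60,
    adjustments={"recent hiring spree": 0.05},
)
print(f"{estimate:.0%} (range {low:.0%} to {high:.0%})")
```

Keeping each adjustment under a named key is the point of the design: anyone reading the forecast can see which mechanism moved the number and by how much, and can argue with that one term without discarding the base rate.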

Framework & implementation

This section uses Ora’s own terms for the parts of an analysis, so that if you open the actual mode and lens files they line up. Each is glossed in plain language on first use.

Pipeline execution

The superforecasting protocol is the foundational, required mental model of the Probabilistic Forecasting mode — it sits in the mode’s ANALYTICAL PERSPECTIVES block under “always loaded,” and the mode is this method made into an analysis. The mode runs at Gear 4, Ora’s most thorough setting: a Depth analyst and a Breadth analyst read the question independently, each critiques the other’s reading, both revise under that critique, and a consolidator merges what survives. The lens threads through those stages like this.

Detection. The lens engages on the cases in its Detection Signals — a question needing a probability over a future event with a resolvable outcome; diverging expert opinions that need aggregation discipline; a prior forecast that needs disciplined updating without over- or under-reacting. The precondition is the mode’s CQ1 (operational resolvability): a question that will be observably true or false by some date, not a vague one built to escape scoring.

The Depth and Breadth analysts. Two models read the question in parallel. The Depth analyst commits to one reading and runs the lens’s Application Steps: lock the resolution criteria, then run reference-class forecasting — name the class of comparable cases, state its base rate (the outside view), and adjust for the case’s specific inside-view drivers, each adjustment small and tied to a named mechanism rather than a feeling that “this case is special.” This serves the mode’s CQ2 (an explicit reference class with a base-rate number) and CQ3 (inside-view drivers held separate from the outside-view base rate, with the arithmetic of the adjustment shown). The Breadth analyst works the same question at the same time, surveying alternative reference classes (the mode wants at least two considered before one is locked) and scanning inside-view drivers across categories — mechanism, motivation, capacity, environment. Neither sees the other’s work.

Cross-adversarial evaluation. Each analyst’s reading is handed to the other to critique, which makes structural the discipline the lens calls “let others bring out the best in you.” The lens’s signature failures are caught here, keyed to its Critical Questions and the mode’s: anchoring on the case’s specifics with no reference class (inside-view dominance, which maps to the mode’s base-rate-neglect); collapsing the two views so the estimate can’t be reconstructed from base-rate-plus-adjustment (view-collapse); defaulting to round numbers like 25% or 50% when the evidence supports a finer grain (round-number anchoring); and over-reacting to vivid but low-diagnosticity evidence. The evaluator also presses the mode’s false-precision failure: a point estimate where the evidence only supports a range.
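Two of those failures are mechanical enough to check in code. The sketch below is a hypothetical audit helper, not part of Ora: the function names, the tolerance, and the sample numbers are assumptions, and the set of round values is taken from the ones this page lists under round-number anchoring.

```python
ROUND_VALUES = {0.10, 0.25, 0.50, 0.75, 0.90}


def is_reconstructible(base_rate, adjustments, estimate, tol=0.005):
    """True when the estimate equals base rate plus the sum of named adjustments."""
    return abs(base_rate + sum(adjustments.values()) - estimate) <= tol


def round_number_share(probabilities):
    """Fraction of forecasts sitting exactly on a conventional round value."""
    if not probabilities:
        return 0.0
    hits = sum(1 for p in probabilities if round(p, 2) in ROUND_VALUES)
    return hits / len(probabilities)


# A forecast whose arithmetic is shown, and one whose views have collapsed.
ok = is_reconstructible(0.60, {"recent hiring spree": 0.05}, 0.65)
collapsed = is_reconstructible(0.60, {}, 0.80)

# A hypothetical batch of past forecasts: four of the five land on round values.
share = round_number_share([0.25, 0.50, 0.50, 0.75, 0.62])
```

A high share is a tell rather than proof, since the evidence sometimes does support 50%. The reconstruction check is stricter: an estimate that cannot be rebuilt from its stated base rate and drivers has lost the separation between the two views.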

Revision and claim-check. The reviser addresses the fixes. Where the reading rests on a factual claim — the base rate of the reference class, a real historical frequency, a leading indicator’s current value — that claim is marked a flagged claim and sent to a web-search tool; it has to resolve against outside sources before the revised draft moves forward, because a forecast anchored on a made-up base rate is worse than no anchor at all.

Consolidation and output. The consolidator merges the two revised readings, and the formatter places them into the mode’s set sections, which are this method, section by section:

Resolution criteria locked. The operational question.
Reference class and base rate. The outside view, with the alternative classes considered.
Inside view drivers. The case-specific factors, each with direction and magnitude.
Outside view adjustment. The transparent arithmetic (base rate + drivers = estimate).
Probability estimate with range. The forecast, with a width that reflects real confidence.
Leading indicators and update triggers. The signals that would move the estimate.
Confidence in estimate. Calibration confidence and point confidence, kept distinct.

What the analysis will not assert. It gives a range, never a falsely precise point, and it refuses to forecast a question that can’t be operationally resolved (it routes that to clarification first). And it holds Tetlock’s famous “ten commandments” lightly — as heuristics that cultivate the disposition (base-rate anchoring, view-separation, small updates, range-not-point), not as a checklist applied mechanically, which is the mode’s standing caution against commandment-rigidity.

Origin and evidence

The protocol is Philip Tetlock’s, built on two decades of keeping score. Expert Political Judgment (2005) reported the result that made his name: across some 28,000 predictions from 284 experts, the average forecaster was barely better than chance — and the most confident, most famous “hedgehogs” (one big idea) were beaten by self-critical “foxes” (many small considerations), Isaiah Berlin’s distinction turned into an empirical finding. The constructive sequel came from the Good Judgment Project, the team Tetlock and Barbara Mellers ran in a multi-year geopolitical forecasting tournament sponsored by the U.S. intelligence community’s research arm (IARPA): their best amateur forecasters — superforecasters — outperformed trained intelligence analysts with access to classified data, and the project’s published findings (Mellers et al., Psychological Science, 2014) identified what drove the accuracy — not raw intelligence but reference-class thinking, granular probability estimates, frequent small updates, and active open-mindedness. Superforecasting (2015), with Dan Gardner, is the accessible distillation, including the “ten commandments.” The deepest root is older: Kahneman and Tversky’s outside view / reference-class idea, the insight that we systematically neglect base rates in favor of the vivid inside story — which the whole protocol is built to counteract.

Applications and common uses

Disciplined forecasting is a working tool wherever a future question has a real answer and someone has to act on the odds — used to produce a calibrated estimate and to audit someone else’s.

Intelligence and geopolitical analysis. The protocol’s proving ground: estimating whether an event happens by a date, with explicit reference classes and update triggers, in place of confident narrative assessments. It is now embedded in parts of the analytic tradecraft it once outperformed.
Business and strategy. Will a competitor ship, a deal close, a market clear a threshold — reference-class forecasting (how often do projects like this finish on time?) is the antidote to planning-fallacy optimism and to the boldest executive’s gut.
Public-health and risk forecasting. Outbreak trajectories, adoption curves, and tail risks are estimated as ranges anchored in comparable historical episodes, with named indicators that trigger revision as data arrives.
Investing and policy. Calibrated probability on resolvable questions — a rate decision, an election, a regulatory outcome — with the base rate stated and the inside-view adjustments auditable, beats both the permabull and the permabear, who are really just hedgehogs with a position.
Aggregation and forecasting tournaments. The wisdom-of-crowds finding the project formalized: averaging many independent calibrated forecasts, weighted toward the better-calibrated, beats almost any individual — which is why prediction markets and forecaster panels work.

In every case the payoff is the same discipline: a number anchored in how often things like this actually happen, adjusted in the open, stated as a range, and updated on pre-committed signals rather than the latest headline.
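The aggregation item in the list above can be sketched as a weighted average. This is an illustration under stated assumptions, not the Good Judgment Project's actual algorithm: the function name, the choice to weight each forecaster by the inverse of a past Brier score, and the small smoothing constant are all hypothetical.

```python
def aggregate(forecasts, past_brier_scores, smoothing=0.01):
    """Weighted mean of probabilities; a lower past Brier score earns more weight."""
    if len(forecasts) != len(past_brier_scores) or not forecasts:
        raise ValueError("need two equal-length, non-empty lists")
    weights = [1.0 / (score + smoothing) for score in past_brier_scores]
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, forecasts)) / total


# Three hypothetical forecasters answering the same question.
forecasts = [0.55, 0.70, 0.90]
past_brier_scores = [0.10, 0.15, 0.40]   # the third has the weakest track record

pooled = aggregate(forecasts, past_brier_scores)
simple_mean = sum(forecasts) / len(forecasts)
```

Here the pooled estimate lands below the simple mean because the most extreme forecast comes from the forecaster with the worst record. With equal track records the function reduces to a plain average.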

Failure modes and when not to use it

The lens’s characteristic ways of going wrong are catalogued in its Common Failure Modes:

Inside-view dominance. Anchoring on the case’s specifics — the brilliant founder, the unique situation — without ever consulting a reference class. The tell is a forecast you can’t decompose into “base rate plus adjustments.” Force a reference class and a base-rate number first.
Round-number anchoring. Defaulting to 10%, 25%, 50%, 75%, 90% regardless of the evidence. The tell is probabilities that cluster suspiciously on round values. Ask what evidence would justify 73% over 70%; if it exists, use it.
Overreaction to vivid evidence. Swinging the estimate on salient but low-diagnosticity news. The tell is a big move in response to evidence that doesn’t actually look different in worlds where the answer is yes vs. no. Update only in proportion to how diagnostic the evidence really is.
Hindsight bias on resolution. Once a forecast resolves, treating the outcome as having been obvious. The tell is a post-mortem that says “we should have known” without naming the evidence that was actually available in advance. Score the forecast against what was knowable then, not what’s clear now.
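The test for overreaction in the list above, whether evidence looks different in worlds where the answer is yes versus no, is a likelihood ratio. The sketch below applies Bayes' rule in odds form. The function name and all probabilities are invented for illustration, and in practice the two conditional probabilities are themselves judgment calls.

```python
def update(prior, p_evidence_if_yes, p_evidence_if_no):
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    if p_evidence_if_no <= 0.0:
        raise ValueError("p_evidence_if_no must be positive")
    likelihood_ratio = p_evidence_if_yes / p_evidence_if_no
    posterior_odds = prior / (1.0 - prior) * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


# A dramatic headline that is nearly as likely either way: ratio close to 1.
vivid_but_weak = update(0.65, p_evidence_if_yes=0.50, p_evidence_if_no=0.45)

# A dull signal that is three times likelier if the answer is yes.
dull_but_strong = update(0.65, p_evidence_if_yes=0.60, p_evidence_if_no=0.20)
```

From a 65% prior, the vivid headline justifies a move of only a couple of points, while the dull signal justifies a move to roughly 85%. The size of the update tracks the ratio, not how striking the news feels.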

When not to reach for it. When the question has no resolvable outcome — success is vaguely or contestably defined — the precise probability is theater; clarify the question first or switch to a narrative scenario tool. When you’re choosing among options now rather than estimating a future state, that’s a decision problem, not a forecast. And when what’s wanted is a story about how the future might unfold rather than a number, scenario planning is the better instrument — forcing a single probability on a question that calls for branching narratives loses the thing the asker actually needed.

Probabilistic Forecasting — the analysis this lens founds; turns a question about the future into a calibrated probability range.
Regression to the Mean — a core correction to the inside view: extreme results tend to be followed by less extreme ones, so an outlier should be forecast back toward the base rate.
Wisdom of Crowds — the aggregation finding the protocol rests on: averaging many independent estimates beats almost any single forecaster.
Base-Rate Neglect — the bias the whole method is built to defeat: reaching for the vivid inside story and ignoring how often the thing actually happens.