Why it matters

A real forecast isn’t a confident opinion — it’s a calibrated number you’d stake money on, and the people best at producing one win not by knowing more, but by how they build the number.

For example: asked whether a rival will ship a competing product by year-end, a pundit says “definitely” or “no chance.” A superforecaster says “about 65% — companies at this stage hit their ship dates roughly 60% of the time, nudged up a little for their recent hiring spree.” One of those is a feeling. The other is a base rate, an adjustment, and a number you could check.

  • What it reveals. Whether a forecast is actually calibrated — built from how often things like this happen, adjusted for what’s specific here — or just a confident-sounding guess dressed as analysis.
  • How it changes the read. You stop asking “what do I think will happen?” and start asking “out of all the times something like this came up, how often did it go this way — and what about this case moves it off that number?”
  • When to foreground it. Any question with a resolvable answer by a date — will it ship, will they win, will the number clear the threshold — where you want odds you could bet on rather than a story.
  • What you’d miss without it. That the loudest, most famous experts are often the worst forecasters, and that beating them takes method, not genius: start from the base rate, break the question up, move in small steps, give a number not a narrative.
  • Where it misleads. It only works on questions that actually resolve. Force it onto a vague or contested question (“will things get better?”) and the precise-looking number is false precision — the discipline can’t rescue a question that can’t be scored.

How it works

In the 1980s a young psychologist named Philip Tetlock started doing something almost no one bothers to do: he wrote down experts’ predictions and checked them. Over twenty years he collected roughly twenty-eight thousand forecasts from nearly three hundred professional pundits, analysts, and commentators — the people on television explaining what would happen next — and then he waited to see what actually happened.

The result became famous. The average expert, he found, was about as accurate as “a dart-throwing chimpanzee.” Worse, the more famous the expert, the less accurate they tended to be. The ones with bold, simple, TV-ready theories of how the world works did the poorest of all, because a single big idea breeds overconfidence, and overconfidence is exactly what gets punished when forecasts are scored. The ones who did a little better were the opposite type: cautious, self-doubting, magpie thinkers who collected lots of small considerations and never trusted any one of them too far. Tetlock borrowed Isaiah Berlin’s labels, the hedgehog who knows one big thing and the fox who knows many small things, and the foxes won.

That could have been a cynical story about how nobody can predict anything. The second act is what makes it useful. The research arm of the U.S. intelligence community ran a multi-year forecasting tournament, and Tetlock entered a team made up not of spies or PhDs but of ordinary volunteers: a retired computer programmer, a homemaker, a pharmacist. These amateurs, with nothing but the open internet, reportedly beat professional intelligence analysts who had access to classified information, and by a wide margin. A subset of the volunteers were so consistently accurate that Tetlock called them superforecasters, and the obvious question was: how?

The answer was almost disappointing. It wasn’t IQ, and it wasn’t secret information. It was method — a handful of habits anyone can copy. They started from the base rate: before looking at the specifics, they asked how often things like this happen in general (the “outside view”), and used that number as their anchor. They broke big questions into small ones they could actually estimate. They moved in small steps, updating a few points at a time as evidence came in rather than swinging wildly on the latest headline. They gave numbers, not stories — and finer-grained numbers than you’d expect, distinguishing 65% from 70% when the evidence justified it, instead of retreating to a vague “likely.” And they stayed humble and kept score, treating every forecast as a bet to be checked and every miss as something to learn from rather than explain away.
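
The “kept score” habit is easy to make concrete. The text above does not name a scoring rule, so the following is a minimal sketch that assumes the Brier score in its common binary form: the average squared gap between the probability you stated and what happened (1 for yes, 0 for no). The function name and the track record are made up for illustration.

```python
def brier_score(forecasts):
    """Mean squared gap between stated probabilities and 0/1 outcomes.

    `forecasts` is a list of (probability, outcome) pairs, where outcome is
    1 if the event happened and 0 if it did not. Lower is better:
    0.0 is perfect, and always answering 50% scores exactly 0.25.
    """
    if not forecasts:
        raise ValueError("need at least one resolved forecast")
    return sum((p - outcome) ** 2 for p, outcome in forecasts) / len(forecasts)


# Hypothetical track record: (stated probability, what actually happened).
resolved = [(0.65, 1), (0.80, 1), (0.30, 0), (0.90, 0), (0.10, 0)]

score = brier_score(resolved)
# Baseline: someone who refuses to commit and says 50% every time.
hedger = brier_score([(0.5, outcome) for _, outcome in resolved])
print(f"forecaster: {score:.3f}, always-50%: {hedger:.3f}")
```

On this invented record the forecaster scores about 0.21 against the fence-sitter's 0.25, and almost all of the damage comes from one confident miss (90% on something that did not happen). That is the point of scoring: a single overconfident call costs more than several cautious ones.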

The deep move underneath all of it is the one most people skip: anchor on the outside view before the inside view. Faced with “will this startup succeed?” the untrained mind dives straight into the specifics — the brilliant founder, the hot market, the great demo — and talks itself into a number. The superforecaster first asks “what fraction of startups like this succeed?” — gets the depressing base rate — and only then lets the specifics nudge it up or down a little. The specifics feel like the whole story; the base rate is the thing that keeps you honest about how special this case really is. A good forecast is just that: a base rate you can defend, an adjustment you can explain, and a range wide enough to admit what you don’t know.
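
The “base rate plus adjustments” recipe can be sketched in a few lines. This is an illustration under stated assumptions, not Tetlock’s procedure: the function name, the cap on any single nudge (`max_step`), and the clamp to 1% to 99% are all choices made here so the example stays honest about small steps. The numbers reuse the ship-date example from earlier.

```python
def anchored_forecast(base_rate, adjustments, max_step=0.10):
    """Start from a reference-class base rate and apply small, named nudges.

    base_rate:   fraction of comparable cases that went this way (0..1).
    adjustments: list of (reason, delta) pairs, delta in probability points.
    max_step:    illustrative cap on any single nudge (an assumption made
                 here), so no one vivid detail can swamp the base rate.
    """
    if not 0.0 <= base_rate <= 1.0:
        raise ValueError("base_rate must be between 0 and 1")
    estimate = base_rate
    for reason, delta in adjustments:
        if abs(delta) > max_step:
            raise ValueError(f"nudge for '{reason}' is too large: {delta:+.2f}")
        estimate += delta
    # Never claim certainty: keep the result inside 1%..99% (also an assumption).
    return min(max(estimate, 0.01), 0.99)


# Outside view first: companies at this stage ship on time about 60% of the time.
# Inside view second: one small, explainable nudge for the hiring spree.
forecast = anchored_forecast(0.60, [("recent hiring spree", +0.05)])
print(f"{forecast:.0%}")  # prints 65%
```

Because every adjustment carries a reason, the result can be audited later: someone can challenge the reference class or a specific nudge instead of arguing with a bare number.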

Framework & implementation

Origin and evidence

The protocol is Philip Tetlock’s, built on two decades of keeping score. Expert Political Judgment (2005) reported the result that made his name: across some 28,000 predictions from 284 experts, the average forecaster was barely better than chance, and the most confident, most famous “hedgehogs” (one big idea) were beaten by self-critical “foxes” (many small considerations). Isaiah Berlin’s distinction had become an empirical finding. The constructive sequel came from the Good Judgment Project, the team Tetlock and Barbara Mellers ran in a multi-year geopolitical forecasting tournament sponsored by IARPA, the research arm of the U.S. intelligence community. Their best amateur forecasters, the superforecasters, outperformed trained intelligence analysts with access to classified data. The project’s published findings (Mellers et al., Psychological Science, 2014) identified what drove the accuracy: not raw intelligence but reference-class thinking, granular probability estimates, frequent small updates, and active open-mindedness. Superforecasting (2015), with Dan Gardner, is the accessible distillation, including the “ten commandments.” The deepest root is older: Kahneman and Tversky’s outside view, or reference-class idea, the insight that we systematically neglect base rates in favor of the vivid inside story. The whole protocol is built to counteract that neglect.

Applications and common uses

Disciplined forecasting is a working tool wherever a future question has a real answer and someone has to act on the odds — used to produce a calibrated estimate and to audit someone else’s.

  • Intelligence and geopolitical analysis. The protocol’s proving ground: estimating whether an event happens by a date, with explicit reference classes and update triggers, in place of confident narrative assessments. It is now embedded in parts of the analytic tradecraft it once outperformed.
  • Business and strategy. Will a competitor ship, a deal close, a market clear a threshold — reference-class forecasting (how often do projects like this finish on time?) is the antidote to planning-fallacy optimism and to the boldest executive’s gut.
  • Public-health and risk forecasting. Outbreak trajectories, adoption curves, and tail risks are estimated as ranges anchored in comparable historical episodes, with named indicators that trigger revision as data arrives.
  • Investing and policy. Calibrated probability on resolvable questions — a rate decision, an election, a regulatory outcome — with the base rate stated and the inside-view adjustments auditable, beats both the permabull and the permabear, who are really just hedgehogs with a position.
  • Aggregation and forecasting tournaments. The wisdom-of-crowds finding the project formalized: averaging many independent calibrated forecasts, weighted toward the better-calibrated, beats almost any individual — which is why prediction markets and forecaster panels work.

In every case the payoff is the same discipline: a number anchored in how often things like this actually happen, adjusted in the open, stated as a range, and updated on pre-committed signals rather than the latest headline.
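
The aggregation idea, averaging many forecasts while leaning toward the better-calibrated, can be sketched as a weighted mean. The weighting scheme below (inverse of each forecaster’s past Brier score, with a small constant to avoid dividing by zero) is an assumption chosen for illustration; the text above does not specify how the weights were set, and the names and numbers are invented.

```python
def aggregate(forecasts, track_records, smoothing=0.01):
    """Pool individual probabilities into one, favoring better track records.

    forecasts:     {name: probability for the current question}
    track_records: {name: past Brier score, where lower means better}
    smoothing:     small constant so a perfect past score does not
                   produce a division by zero (illustrative choice).
    """
    if not forecasts:
        raise ValueError("need at least one forecast")
    weights = {name: 1.0 / (track_records[name] + smoothing) for name in forecasts}
    total = sum(weights.values())
    return sum(weights[name] * p for name, p in forecasts.items()) / total


# Hypothetical panel answering the same resolvable question.
forecasts = {"ana": 0.70, "ben": 0.55, "cy": 0.90}
track_records = {"ana": 0.15, "ben": 0.20, "cy": 0.40}  # cy has the worst record

pooled = aggregate(forecasts, track_records)
simple_mean = sum(forecasts.values()) / len(forecasts)
print(f"weighted: {pooled:.2f}, unweighted: {simple_mean:.2f}")
```

Here the pooled number lands below the plain average, because the most bullish panelist also has the weakest record and so counts for less. With equal track records the function reduces to a simple mean.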

Failure modes and when not to use it

The lens has four common failure modes, each with a tell and a fix:

  • Inside-view dominance. Anchoring on the case’s specifics — the brilliant founder, the unique situation — without ever consulting a reference class. The tell is a forecast you can’t decompose into “base rate plus adjustments.” Force a reference class and a base-rate number first.
  • Round-number anchoring. Defaulting to 10%, 25%, 50%, 75%, 90% regardless of the evidence. The tell is probabilities that cluster suspiciously on round values. Ask what evidence would justify 73% over 70%; if it exists, use it.
  • Overreaction to vivid evidence. Swinging the estimate on salient but low-diagnosticity news. The tell is a big move in response to evidence that doesn’t actually look different in worlds where the answer is yes vs. no. Update only in proportion to how diagnostic the evidence really is.
  • Hindsight bias on resolution. Once a forecast resolves, treating the outcome as having been obvious. The tell is a post-mortem that says “we should have known” without naming the evidence that was actually available in advance. Score the forecast against what was knowable then, not what’s clear now.
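
“Update only in proportion to how diagnostic the evidence really is” has a standard arithmetic form: Bayes’ rule written in odds. The sketch below assumes you can roughly estimate how likely the new evidence would be in a world where the answer is yes and in a world where it is no; the function name and the example probabilities are hypothetical.

```python
def update(prior, p_evidence_if_yes, p_evidence_if_no):
    """Revise a probability using Bayes' rule in odds form.

    The likelihood ratio measures diagnosticity: how much more likely the
    evidence is if the answer is yes than if it is no. A ratio near 1 means
    the evidence looks the same in both worlds and should barely move you.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    if p_evidence_if_no <= 0.0:
        raise ValueError("p_evidence_if_no must be positive")
    likelihood_ratio = p_evidence_if_yes / p_evidence_if_no
    posterior_odds = prior / (1.0 - prior) * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


prior = 0.60

# A vivid headline that would show up almost as often either way.
after_headline = update(prior, p_evidence_if_yes=0.50, p_evidence_if_no=0.45)

# Dull but diagnostic evidence: four times likelier if the answer is yes.
after_signal = update(prior, p_evidence_if_yes=0.80, p_evidence_if_no=0.20)

print(f"headline: {after_headline:.3f}, diagnostic signal: {after_signal:.3f}")
```

The headline moves the estimate from 60% to about 62.5%, while the diagnostic signal moves it to about 86%. How loud the news feels never enters the calculation; only the ratio does.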

When not to reach for it. When the question has no resolvable outcome — success is vaguely or contestably defined — the precise probability is theater; clarify the question first or switch to a narrative scenario tool. When you’re choosing among options now rather than estimating a future state, that’s a decision problem, not a forecast. And when what’s wanted is a story about how the future might unfold rather than a number, scenario planning is the better instrument — forcing a single probability on a question that calls for branching narratives loses the thing the asker actually needed.

Related concepts

  • Probabilistic Forecasting. The analysis built on this lens; it turns a question about the future into a calibrated probability range.
  • Regression to the Mean. A core correction to the inside view: extreme results tend to be followed by less extreme ones, so an outlier should be forecast back toward the base rate.
  • Wisdom of Crowds. The aggregation finding the protocol rests on: averaging many independent estimates beats almost any single forecaster.
  • Base-Rate Neglect. The bias the whole method is built to defeat: reaching for the vivid inside story and ignoring how often the thing actually happens.