Wisdom of Crowds

Why it matters

Average enough independent guesses and the errors cancel out — the crowd as a whole knows more than almost anyone in it, right up until the moment they start copying each other.

For example: ask one analyst how many units you’ll sell next year and you get one person’s blind spots baked into the number. Ask fifty analysts separately and average them, and the over-guessers and under-guessers quietly offset, leaving a central estimate tighter than almost any single one of them could produce — provided no one peeked at anyone else’s answer first. The catch is that last clause. Let them confer first and they converge on the room’s loudest opinion, the errors stop cancelling, and the wise crowd turns into a herd.

  • What it reveals. That a pile of independent estimates, averaged, is itself a powerful anchor — often more accurate than the single best expert — and that the spread of those estimates tells you how confident to be.
  • How it changes the read. You stop hunting for the one person who’s right and start asking “how do many independent estimates of this distribute, and where’s their center?” — treating the aggregate as the base rate to reason from.
  • When to foreground it. Whenever an estimate is needed, no single expert is clearly reliable, and you can gather several judgments that were made separately — a forecast, a sizing, a probability nobody can pin down alone.
  • What you’d miss without it. That the average of the many usually beats the loudest of the few — and that the width of the crowd’s spread is free information about how much the estimate should be trusted.
  • Where it misleads. The magic is independence, and it’s fragile. The instant the estimates are made socially — a meeting, a thread, a poll after the pundits weighed in — the errors correlate instead of cancelling, and a confident-looking average is just groupthink with a decimal point.

How to invoke it in Ora

You have something to estimate, no single expert you fully trust, and several judgments you can gather — and you want them combined into one calibrated number instead of a fight over whose gut is best.

Describe the estimate and the pool of estimators, and ask:

“Forecast using the wisdom of crowds: fifty analysts each independently estimate next year’s demand. Aggregate them into a calibrated forecast with a probability range.”

Ora treats the pool of independent estimates as the anchor: it locks what would count as the answer, builds the aggregate as a base-rate-style reference point, centers the forecast on it, and reads the spread of the crowd to set the width of the range — then hands back a probability with that range plus the signals that would move it.

One thing to know: the words forecast, wisdom of crowds, aggregate, and probability range are what route you here. A bare “what should we estimate?” gets a clarifying question — aggregation needs a resolvable outcome and a set of judgments to combine, and the first thing Ora does is make you name both.

Say where the estimates came from. The whole method rests on one condition — that the judgments were formed independently, each person estimating on their own. If they were made in a meeting or after everyone saw the same forecast, tell Ora; it will flag that the crowd may be a herd and that the tidy average is borrowed confidence, not cancelled error.

One thing Ora won’t do: hand you a single confident percentage to sound authoritative. It gives a range whose width reflects how widely the crowd actually disagreed, and it would rather widen the range than fake a precision the spread doesn’t support.

How it works

In the summer of 1906, the Victorian scientist Francis Galton went to a country fair in Plymouth and watched a weight-judging competition. A fat ox had been put on display, and for sixpence a ticket anyone could write down their guess of what the animal would weigh once it had been slaughtered and dressed. Butchers entered, and farmers, and a great many people who had never appraised a beast in their lives. Galton, who believed deeply that the common run of people were not very bright, expected the contest to prove it. He got hold of all the tickets afterward — 787 of them — and ran the numbers, fully expecting the crowd’s verdict to be a mess.

It was the opposite. He lined up every guess from lowest to highest and took the one in the middle. The crowd’s middle guess was 1,197 pounds. The ox, dressed, weighed 1,198. The crowd as a whole had missed by a single pound — closer than the cattle experts, closer than any individual butcher or farmer in the room. No one person at the fair knew the answer. The crowd, averaged, did. Galton had set out to measure the foolishness of the masses and accidentally measured something else entirely, and he had the honesty to publish it.

The reason it works is almost mechanical, and it’s the whole secret. Every guess is the truth plus an error — some people guess too high, some too low, for a thousand uncorrelated reasons. When you average a big pile of guesses, the high errors and the low errors run into each other and cancel, while the small thread of real signal that everyone is reaching toward survives the averaging and accumulates. The more independent guesses you pour in, the more completely the noise washes out and the more sharply the truth shows through. You don’t need a single person to be right. You need a lot of people to be wrong in different directions.

Which points straight at the condition the whole thing hangs on. The errors only cancel if they’re independent — if each person really did guess on their own. The instant people start looking at each other’s answers, the errors stop being random and different and start being the same: everyone drifts toward the first confident number, the loudest voice, the figure already on the screen. Now the mistakes don’t cancel, they pile up and reinforce, all leaning the same way. The crowd hasn’t gotten smarter by talking; it’s gotten correlated, and a correlated crowd is just one opinion wearing a lot of coats. This is the failure the journalist James Surowiecki put at the center of the idea when he gathered the modern evidence for it: a wise crowd needs its members to be diverse and, above all, independent; take that away and the same crowd that nailed the ox to the pound will stampede off a cliff together. Scott Page later showed mathematically why diversity does so much of the work — a group of people who make their errors in genuinely different ways is, provably, more accurate than a group of clones, however clever the clones.

So the move underneath this lens is the one most people get backwards. The instinct, faced with a hard estimate, is to find the single smartest person and defer to them. The better instinct is to gather many honest, separate guesses and average them — and to guard the one fragile thing that makes the average worth anything, which is that nobody saw anyone else’s answer first. The width of the disagreement, far from being a nuisance, is a gift: a tight cluster means the crowd is sure, a wide scatter means it isn’t, and either way you’ve learned how much to trust the number. Independence is what turns a mob into an oracle. Lose it, and you’ve still got a crowd — just not a wise one.

Framework & implementation

This section uses Ora’s own terms for the parts of an analysis, so that if you open the actual mode and lens files they line up. Each is glossed in plain language on first use.

Pipeline execution

The wisdom-of-crowds finding is one of the always-loaded mental models of the Probabilistic Forecasting mode — it sits in the mode’s ANALYTICAL PERSPECTIVES block under “always loaded,” available to every forecast the mode produces, because aggregation of independent estimates is the empirical foundation that forecast-averaging rests on. The mode runs at Gear 4, Ora’s most thorough setting: a Depth analyst and a Breadth analyst read the question independently, each critiques the other’s reading, both revise under that critique, and a consolidator merges what survives. The lens threads through those stages like this.

Detection. The lens engages on the cases in its Detection Signals — an estimate is needed and no single expert is clearly reliable; a forecasting system or estimation process is being designed; a group is converging on consensus too fast (the independence condition is under threat); or the analyst has to decide how much weight to give expert opinion versus an aggregate of non-expert estimates. The precondition is the mode’s CQ1 (operational resolvability): a question that will be observably true or false by some date, so the aggregate can eventually be scored, not a vague one built to escape scoring.

The Depth and Breadth analysts. Two models read the question in parallel. The Depth analyst commits to one reading and runs the lens’s Application Steps: collect estimates from as many independent sources as possible, check that each judgment was formed before its author saw the others, then aggregate — median for robustness to outliers, mean when the pool is well-behaved — and treat that aggregate as the forecast’s anchor. This serves the mode’s CQ2 directly: the crowd aggregate is the outside-view anchor, a base-rate-style reference point assembled from many estimates rather than looked up in a table. The Breadth analyst works the same question at the same time, pressuring the lens’s four conditions — diversity, independence, decentralization, aggregation — and asking whether this is a domain where crowd judgment actually performs well. Neither sees the other’s work, which is the method auditing itself: the analysts are run independently for exactly the reason the lens cares about.

Cross-adversarial evaluation. Each analyst’s reading is handed to the other to critique. The lens’s signature failures are caught here, keyed to its Common Failure Modes: an independence violation, where the estimates were formed in discussion and so cluster tighter than honest random error would allow (the tell that the crowd is really a herd); a diversity collapse, where every estimator shares one background and the errors correlate instead of cancelling; and a wrong-aggregation choice, where a mean is dragged around by a few extreme outliers that a median would have absorbed. The evaluator also presses the mode’s false-precision failure — a single point estimate where the spread of the crowd only supports a range. This is the lens’s tie to the mode’s CQ4: the width of the disagreement is the evidence for the width of the forecast, so collapsing it to a point throws away information the crowd handed you for free.

Revision and claim-check. The reviser addresses the fixes. Where the reading rests on a factual claim — the makeup of the estimator pool, the reported spread, a real historical frequency the aggregate is being checked against — that claim is marked a flagged claim and sent to a web-search tool; it has to resolve against outside sources before the revised draft moves forward, because an aggregate built on a misremembered pool is a confident number with nothing under it.

Consolidation and output. The consolidator merges the two revised readings, and the formatter places them into the mode’s set sections. The crowd aggregate, as the outside-view anchor, lands in Reference class and base rate — the heart of where this lens does its work, because an aggregate of many independent estimates is itself the most defensible base-rate-style anchor the forecast can stand on. The conditions checked and the case-specific factors land in Inside view drivers and Outside view adjustment (with the transparent arithmetic of any adjustment to the raw aggregate shown). The forecast itself, centered on the aggregate, becomes a Probability estimate with range whose width is set by how widely the crowd actually disagreed — the second place this lens is load-bearing, since averaging independent forecasts is what tightens the central estimate. The signals that would move it go in Leading indicators and update triggers; and, kept distinct, calibration confidence and point confidence go in Confidence in estimate — the third, because a wide independent pool is precisely what justifies confidence in the central number even when no single estimator was trusted.

What the analysis will not assert. It gives a range, never a falsely precise point, and the width of that range answers to the spread of the crowd, not to how authoritative the analyst wants to sound. It will not treat a large group as wise without checking the four conditions — and where independence was lost, it says so, and discounts the tidy-looking average rather than launder a herd’s consensus into a forecast.

Origin and evidence

The empirical seed is Francis Galton’s, and it is almost comically against type. In Vox Populi (Nature, 1907), Galton — a believer in the limited judgment of the common person — reported a weight-judging competition at a 1906 West of England fair where 787 visitors each paid to guess the dressed weight of an ox. The median of the 787 independent guesses was 1,197 pounds against a true weight of 1,198 — an error under one part in a thousand, and better than the individual experts present. The crowd’s middle estimate was, as Galton put it, more trustworthy than might have been expected. The modern formulation is James Surowiecki’s The Wisdom of Crowds (2004), which gathered the evidence across estimation, prediction markets, and group decisions and named the load-bearing precondition: a crowd is wise only when its members are diverse, independent, and decentralized, with a sound mechanism to aggregate them — and that the same crowd, allowed to influence one another, degrades into herding and information cascades. The mathematical underpinning of the diversity half is Scott Page’s The Difference (2007), which proved the “diversity-trumps-ability” and “crowd beats average” results — that under stated conditions a diverse pool of estimators is necessarily more accurate than the pool’s average individual, because independent, differently-structured errors cancel where correlated ones reinforce. Together they give the lens its shape: aggregation of independent error is the engine, and independence-plus-diversity is the fragile condition that runs it.

Applications and common uses

Aggregating independent estimates is a working tool wherever a number has to be produced, no single judge is reliable, and several separate judgments can be gathered — used both to build a calibrated estimate and to audit whether a group’s apparent consensus is real.

  • Forecasting and prediction markets. The canonical application: averaging many independent calibrated forecasts, often weighted toward the better-calibrated, reliably beats almost any single forecaster — the principle that makes prediction markets, forecaster panels, and polling aggregates work, and the foundation forecast aggregation rests on.
  • Estimation under uncertainty. Software teams use planning poker — every developer estimates privately, all reveal at once, the median anchors the number — precisely to keep the tech lead’s first guess from anchoring the room; the same private-then-reveal discipline sizes budgets, schedules, and risks.
  • Market and demand sizing. A pool of independent analyst estimates, aggregated, gives a central forecast with a defensible range, and the spread of the estimates is itself the read on how settled the question is — tight means confident, wide means genuinely contested.
  • Collective decisions and assessment. Juries, expert panels, and crowd-sourced judgments are stronger when members reason independently before conferring; the lens is also the defensive tool — a way to test whether a group converged because the evidence pointed one way or merely because everyone copied the first confident voice.
  • System and process design. When building any aggregation mechanism — a market, a panel, a survey — the lens specifies what to protect: collect estimates privately, recruit for diversity of information and reasoning, decentralize so no single source contaminates the pool, and choose an aggregation rule (usually the median) robust to outliers and gaming.

In every case the payoff is the same: a central estimate stronger than its best single member, a range whose width is honestly set by the crowd’s own disagreement, and a built-in test of the one thing that makes the number trustworthy — that the judgments were formed apart.

Failure modes and when not to use it

The lens’s characteristic ways of going wrong are catalogued in its Common Failure Modes:

  • Independence violation. Aggregating estimates that were formed in discussion — a meeting, a thread, a poll taken after the pundits spoke. The tell is estimates that cluster tighter than an honest independent-error model would predict; real independence leaves a spread. The correction is structural: enforce private estimation before any group discussion, and if it’s too late, treat the average as suspect rather than authoritative.
  • Diversity collapse. A pool whose members all share one background, training, or information source. The tell is errors that correlate strongly — everyone wrong in the same direction, so the averaging cancels nothing. The correction is to actively recruit estimators who reason and inform themselves differently.
  • Wrong-aggregation choice. Using a mean when a few extreme outliers distort it. The tell is an aggregate visibly dragged toward a handful of wild guesses. The correction is to switch to the median, or a trimmed mean for moderate cases, so the center reflects the body of the crowd rather than its tails.
  • Treating any crowd as wise. Assuming size alone confers wisdom, without checking the four conditions or whether the domain is one where crowd judgment performs well. The tell is a confident aggregate with no account of how the estimates were gathered. The correction is to check diversity, independence, decentralization, and aggregation explicitly before trusting the number.

When not to reach for it. When the estimates can’t be made independently — there’s only one source, or everyone has already conferred and can’t un-hear each other — the averaging cancels nothing and the lens has no engine to run; better to surface the dependence than disguise it. When the question has no resolvable outcome, the aggregate can never be scored, and a tidy crowd-average is false precision; clarify the question first. And when the task is genuinely one for a single domain expert — a specialist judgment a lay crowd has no signal about — pooling uninformed guesses just averages noise; the crowd is wise only where its members each hold a real fragment of the truth.

  • Probabilistic Forecasting — the analysis this lens informs; turns a question about the future into a calibrated probability range, anchored where possible on an aggregate of independent estimates.
  • Tetlock Superforecasting — the disciplined extension to expert forecaster pools: averaging many independent calibrated forecasts, weighted toward the better-calibrated, beats almost any single one.
  • Information Cascade — the failure mode in close-up: what happens to a crowd the moment independence breaks and each member starts copying the last, so errors compound instead of cancelling.
  • Prediction Markets — the institutional implementation: a standing mechanism for collecting and aggregating many independent estimates into a single live price.