---
name: Scatter Plot
status: draft
description: STATISTICAL family. One dot per observation on an x–y grid, so the relationship between two continuous variables becomes a shape the eye reads instantly — upward, downward, blob, curve, with clusters and outliers.
sources:
  - title: Tukey, John W. (1977), Exploratory Data Analysis, Addison-Wesley
    url: https://openlibrary.org/works/OL9572429W
  - title: "Anscombe, F. J. (1973), Graphs in Statistical Analysis, The American Statistician 27(1): 17–21"
    url: https://doi.org/10.1080/00031305.1973.10478966
---

# Scatter Plot

## Why it matters

A scatter plot puts one variable on the horizontal axis, a second on the vertical axis, and drops one dot for every observation, so the *relationship* between two things you measured stops being a number and becomes a shape you can see. Dots drifting up to the right mean the two move together; down to the right means one rises as the other falls; a shapeless blob means there is no visible relationship between them; a bend means the relationship changes as you go. The whole point is that your eye catches in a half-second what a correlation coefficient hides: clusters, gaps, a curve, the one stubborn outlier dragging the fit toward itself.

For example: a team reports that ad spend and revenue have a correlation of 0.6 and concludes spending more works. Plotted, the cloud tells a different story — two tight clusters, one of small careful campaigns and one of big ones, with almost no slope *inside* either group. The 0.6 came entirely from the gap *between* the clusters, not from spending more within them. The number said "keep spending." The picture said "you're looking at two different kinds of campaign." Only the scatter plot showed which.
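The arithmetic behind that example can be reproduced with a toy dataset. The sketch below is illustrative only: the numbers are constructed, not real campaign data, and the `pearson`, `cluster`, and `r_of` helpers are names assumed for the example. Each cluster is a symmetric square of points with zero internal correlation, and the gap between the cluster centers is chosen so that the pooled correlation comes out at 0.6.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Four points on a square: by symmetry there is no slope inside a cluster.
SQUARE = [(-1, -1), (1, -1), (-1, 1), (1, 1)]

def cluster(cx, cy):
    """Place the square of points around the center (cx, cy)."""
    return [(cx + dx, cy + dy) for dx, dy in SQUARE]

def r_of(points):
    xs, ys = zip(*points)
    return pearson(xs, ys)

gap = math.sqrt(6)            # chosen so the pooled r works out to 0.6
small = cluster(0.0, 0.0)     # stand-in for the small campaigns
large = cluster(gap, gap)     # stand-in for the large campaigns

r_small = r_of(small)             # correlation inside the first cluster
r_large = r_of(large)             # correlation inside the second cluster
r_pooled = r_of(small + large)    # correlation of everything together
```

Here the within-cluster correlations are zero (up to floating-point rounding) while the pooled value is 0.6, so the headline number is produced entirely by the distance between the two groups.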

- **What it shows.** How two continuous variables relate across every observation at once — direction, strength, and shape of the relationship, plus the individual points that break the pattern.
- **When to reach for it.** You have two numeric measurements per item and want to know whether, and how, they move together — before trusting any single summary statistic.
- **How to read it.** Read the overall tilt of the cloud first (up, down, flat, curved), then look for what violates it: outliers off on their own, clusters, fans that widen, gaps.
- **What you'd miss without it.** The structure a summary number averages away. Anscombe's quartet is four datasets with nearly identical means, variances, and correlations but four completely different pictures; only one is the straight-line relationship the statistics imply.
- **Where it misleads.** A clear upward cloud is correlation, not proof of cause; a lurking third variable can manufacture the trend or, in Simpson's paradox, reverse it within subgroups. The cloud shows that two things move together, never *why*.

## How to read it

Picture a grid. The horizontal axis carries one variable, the vertical axis the other, and each observation in the data becomes a single **dot** placed at its pair of values — its x reading across, its y reading up. There are no bars, no connecting lines, nothing but the points; the meaning lives entirely in *where they fall together*. With a handful of dots you see little. With a few hundred, a shape emerges from the crowd, and that shape is the finding.
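
As a concrete illustration of "one dot per observation", here is a minimal matplotlib sketch. The load and latency values are invented for the example and the variable names are assumptions; the only point is that `Axes.scatter` places one marker per (x, y) pair and draws no connecting lines.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so the sketch runs without a display
import matplotlib.pyplot as plt

# Invented observations: one (load, latency) pair per measurement.
load_rps = [10, 20, 30, 40, 50, 60, 70, 80]
latency_ms = [12, 14, 13, 17, 19, 24, 31, 45]

fig, ax = plt.subplots()
dots = ax.scatter(load_rps, latency_ms, s=30, alpha=0.8)  # one marker per pair
ax.set_xlabel("Load (requests/s)")
ax.set_ylabel("Latency (ms)")
ax.set_title("Latency against load (illustrative data)")
```

Calling `fig.savefig("scatter.png")` afterwards would write the chart to disk.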

Read the shape first. A cloud sloping up from lower-left to upper-right is a **positive correlation** — the two variables tend to rise together. A cloud sloping down from upper-left to lower-right is a **negative correlation** — one rises as the other falls. A round, directionless blob means **no linear relationship**: knowing one variable tells you nothing about the other. A cloud that bends — rising then leveling, or U-shaped — is a **nonlinear relationship**, the kind a single correlation number flattens into nonsense. And the eye catches more than the trend: **clusters** (the points clump into subgroups), **outliers** (a few points sit far from the rest), **gaps**, and **fans** (the spread widens as you move along an axis). Those exceptions are often the most interesting thing on the page.

To pin down the headline relationship, add a **trend line**: the straight (or smoothed) line fitted through the cloud, with an optional confidence band showing how uncertain the fit is. It turns "looks like it slopes up" into a stated slope. But the cardinal caution holds at every step: **correlation in the cloud is not causation**, and a tidy line can lull you. This is exactly what **Anscombe's quartet** was built to teach. Its four datasets have nearly identical means, variances, correlations, and regression lines, yet one is an ordinary noisy line, one a smooth curve, one a tight line thrown off by a single outlier, and one a vertical stack whose entire slope comes from one stray point. The summary statistics agree; the truth does not. The discipline of the scatter plot is simple and non-negotiable: you must *look*.
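
The quartet's claim is easy to check numerically. The sketch below hard-codes the four datasets as published in Anscombe (1973) and computes the mean of y, the Pearson correlation, and the least-squares trend line for each; the helper name `summarize` is an assumption for illustration. The four summaries agree to about two decimal places, and the differences only show up when the points are plotted.

```python
import math

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared x values for sets I-III
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(xs, ys):
    """Return (mean of y, Pearson r, slope, intercept) of the least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my, sxy / math.sqrt(sxx * syy), slope, my - slope * mx

summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (mean_y, r, slope, intercept) in summaries.items():
    print(f"{name:>3}: mean y = {mean_y:.2f}, r = {r:.3f}, "
          f"line: y = {intercept:.2f} + {slope:.3f}x")
```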

## When to use it

The scatter plot belongs to the **STATISTICAL family** of visual outputs — the ones that turn data into a picture you can reason about — and within it the scatter plot is the specialist in *two continuous variables and whether they covary*. Reach for it the moment your question is "how do these two measured quantities relate?" — revenue against headcount, dose against response, latency against load. It is the natural companion to correlation testing and regression: the picture you draw *before* fitting (does the data have the shape the model assumes?) and *after* (do the leftover residuals look like random scatter, or is there structure the model missed?). Knowing its three closest relatives is how you pick the right one:

- A **Distribution Plot** answers a different question entirely — the shape of *one* variable on its own (where it clumps, how it spreads, whether it's skewed or has two peaks). Use it when you have a single column of numbers, not a relationship between two.
- A **Time Series** is the special case where the x-axis is *time* and the points are usually joined by a line, so you read a value's rise and fall *over time* rather than two free variables against each other.
- A **Heatmap** trades dots for a colored grid: it shows a *matrix of magnitudes* across two categorical or binned dimensions, which is what you want when both axes are categories or when overplotting would bury a scatter under its own density.

Reach for a scatter plot when both variables are continuous, every observation gives you a pair, and the goal is to *see* the relationship — direction, shape, exceptions — rather than reduce it to one number. Skip it when you only have one variable (use a distribution plot), when time is the organizing axis (use a time series), or when the data is so dense that the cloud becomes an ink-blob (switch to a heatmap or a 2-D density estimate).
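
When the cloud is too dense to read, binning is one way to move from dots to a heatmap-style grid. The following is a minimal sketch using only the standard library; the function name `bin_points`, the bin width, and the sample points are all assumptions for illustration, and a real pipeline would more likely use a plotting library's 2-D histogram or hexbin facility.

```python
import math
from collections import Counter

def bin_points(points, bin_width):
    """Count points per square grid cell; keys are (column, row) cell indices."""
    counts = Counter()
    for x, y in points:
        cell = (math.floor(x / bin_width), math.floor(y / bin_width))
        counts[cell] += 1
    return counts

# Invented points: several would sit on top of each other in a plain scatter.
points = [
    (0.1, 0.2), (0.4, 0.3), (0.2, 0.9),
    (1.5, 1.1), (1.7, 1.8), (1.6, 1.2),
    (2.9, 0.1),
]
grid = bin_points(points, bin_width=1.0)
```

Each cell count then maps to a color intensity, so density is shown directly rather than hidden under overlapping markers.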

## How Ora builds it

Ora produces a scatter plot from a **semantic spec**: a structured description naming the **x variable** and **y variable** (each with its units), the set of **points** to plot, any optional **color encoding** (a third variable splitting the cloud into subgroups by hue) or **size encoding** (a fourth variable, producing a bubble plot), and whether to fit a **trend line** and confidence band. That spec is then rendered to an actual chart by a plotting engine (a matplotlib- or Vega-style backend), which lays out the axes, places the points, and draws any fitted line. Because a cloud of dots conveys nothing to a screen reader, the renderer also emits a text description covering the axis encodings and units, the headline relationship if there is one, the within-cloud heterogeneity and outliers, and any subgroup patterns. It also applies alpha-blending automatically once the points are dense enough to overplot.
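
To make the shape of such a spec concrete, here is a hypothetical sketch as Python dataclasses. None of the field names, the `describe` helper, or the `point_alpha` rule come from Ora itself; they are assumptions that mirror the elements listed above (x and y variables with units, points, optional color and size encodings, and a trend-line flag). In particular, the 500-point threshold for alpha-blending is an invented placeholder, and `describe` covers only the axis encodings, not the relationship or outlier summary.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Variable:
    name: str
    units: str

@dataclass
class ScatterSpec:  # hypothetical schema, not Ora's actual format
    x: Variable
    y: Variable
    points: list = field(default_factory=list)  # list of (x, y) pairs
    color_by: Optional[str] = None              # optional third variable (hue)
    size_by: Optional[str] = None               # optional fourth variable (bubble size)
    trend_line: bool = False
    confidence_band: bool = False

def point_alpha(n_points, dense_threshold=500):
    """Assumed rule: opaque markers for sparse clouds, fading as the count grows."""
    if n_points <= dense_threshold:
        return 1.0
    return max(0.1, dense_threshold / n_points)

def describe(spec):
    """Build a plain-text description of the encodings for screen readers."""
    parts = [
        f"Scatter plot of {spec.y.name} ({spec.y.units}) "
        f"against {spec.x.name} ({spec.x.units}), {len(spec.points)} points."
    ]
    if spec.color_by:
        parts.append(f"Color encodes {spec.color_by}.")
    if spec.size_by:
        parts.append(f"Marker size encodes {spec.size_by}.")
    if spec.trend_line:
        parts.append("A fitted trend line is drawn.")
    return " ".join(parts)

spec = ScatterSpec(
    x=Variable("load", "requests/s"),
    y=Variable("latency", "ms"),
    points=[(10, 12), (20, 14), (30, 13)],  # invented values
    color_by="region",
    trend_line=True,
)
text = describe(spec)
```

Keeping the description derived from the same spec as the chart means the two cannot drift apart, which is one plausible reason to generate both from a single structured source.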

The producing context is **data analysis**: exploratory work where you are hunting for relationships, generating hypotheses, and sanity-checking a model. When you ask Ora to "show me how X relates to Y," this is the artifact that does the showing — and its discipline is the discipline of looking twice, once for the headline pattern and once for everything that violates it.

The technique has a long pedigree. The modern scatter plot is usually traced to **John Herschel**, who in 1833 plotted the orbital data of double stars against time — the first published scatter plot in the modern sense. Later in the century **Francis Galton**, studying how the heights of parents and children relate, drew the clouds that led him to the ideas of correlation and regression, giving the scatter plot its statistical meaning. **John Tukey** then made it a cornerstone of *Exploratory Data Analysis* (1977), the tradition of looking hard at data before modeling it. And **Francis Anscombe's** 1973 paper *Graphs in Statistical Analysis* delivered the quartet — the four-dataset demonstration that summary statistics without a picture can lie — which remains the canonical lesson in why you plot before you conclude.

## Related

- **Distribution Plot** — the STATISTICAL-family member for *one* variable: its shape, spread, skew, and peaks, when you have a single column of numbers rather than a pair.
- **Time Series** — the scatter plot's time-axis special case, with points joined into a line to read a value's movement over time.
- **Heatmap** — a colored grid of magnitudes across two categorical or binned dimensions, the tool to reach for when both axes are categories or a scatter would overplot into an ink-blob.
- **Quadrant Matrix** — splits a two-axis space into four labeled zones for *positioning and sorting* items by two qualities, where the scatter plot leaves the raw cloud unpartitioned.

## Sources

- [Tukey, John W. (1977), Exploratory Data Analysis, Addison-Wesley](https://openlibrary.org/works/OL9572429W)
- [Anscombe, F. J. (1973), Graphs in Statistical Analysis, The American Statistician 27(1): 17–21](https://doi.org/10.1080/00031305.1973.10478966)
