Distribution Plot
Why it matters
A distribution plot draws the shape of a set of numbers — where the values pile up, where they thin out, whether they lean to one side, whether they have one peak or several, and how far the extremes reach. Instead of crushing a column of data down to a single average, it shows you the whole crowd at once. Its job is to reveal what a summary number quietly throws away, because the same average can sit on top of completely different shapes, and the shape is usually the part that matters.
For example: a team reports that their web service responds in 200 milliseconds “on average,” and everyone relaxes. Plotted as a distribution, the same numbers tell a different story: most requests finish in about 80 milliseconds, but a long tail drags out past four seconds, and that tail is where real users are timing out. The 200-millisecond average describes almost nobody. It is a balance point between two populations, the fast majority and the slow tail, that the mean glued together. The plot shows the tail the average erased.
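The arithmetic behind that example can be checked in a few lines. This is a minimal sketch with made-up numbers (97 fast requests and 3 slow ones are an illustrative assumption, not measurements from any real service), using only Python's standard `statistics` module.

```python
import statistics

# Illustrative numbers only (an assumption, not real measurements):
# 97 fast requests at 80 ms and 3 slow ones at 4,000 ms.
latencies_ms = [80] * 97 + [4000] * 3

mean_ms = statistics.mean(latencies_ms)
median_ms = statistics.median(latencies_ms)
# quantiles(n=100) returns the 99 cut points p1..p99; index 98 is p99.
p99_ms = statistics.quantiles(latencies_ms, n=100)[98]

print(f"mean   = {mean_ms:.1f} ms")
print(f"median = {median_ms:.1f} ms")
print(f"p99    = {p99_ms:.1f} ms")
```

With these inputs the mean lands near 200 ms even though no single request took anywhere close to that long: the median stays at 80 ms and the 99th percentile sits at 4,000 ms.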
- What it shows. The full shape of one variable — the range it covers, where it clusters, how it spreads, and how lopsided it is — laid out so the structure is visible at a glance.
- When to reach for it. Any time you’re about to summarize, compare, or model a quantitative variable and you haven’t yet looked at what it actually looks like.
- How to read it. Find the bulk (where most values sit), then read outward to the tails (how far the extremes reach), and count the peaks (one hump, or more than one).
- What you’d miss without it. Skew, hidden second peaks, fat tails, and outliers — exactly the features that make an average misleading and that decide which analysis is even valid.
- Where it misleads. The picture depends on a choice — how wide the bars are, how smooth the curve is — and the wrong choice can invent peaks that aren’t there or sand away ones that are.
How to read it
Start with the simplest form, the histogram. Slice the range of values into bins of equal width and draw a bar over each bin, with height equal to the count of values that fall in it. Reading left to right, the tall bars show you where the data crowds together and the short ones where it thins. The first three things to read off are the center (where the bulk sits), the spread (how wide the bars range before they peter out), and the shape: is it a single symmetric hump, or does one side trail off much further than the other?
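The slicing-and-counting step can be sketched directly. The function name `histogram` and the sample data below are illustrative assumptions; a plotting library would normally do this for you.

```python
def histogram(values, num_bins):
    """Return (edges, counts) for equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Degenerate case: all values identical, so use a single bin.
        return [lo, hi], [len(values)]
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # Clamp so the maximum value lands in the last bin, not past it.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts


data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
edges, counts = histogram(data, num_bins=4)

# Crude text rendering: one '#' per value in the bin.
for left, right, c in zip(edges, edges[1:], counts):
    print(f"{left:4.1f} to {right:4.1f}  {'#' * c}")
```

Here the bulk sits in the first two bins and a single value at 9 stretches the right side, which is the center, spread, and shape reading described above.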
Then learn the other three forms, because each trades detail for a different kind of clarity. The density curve is a histogram with the staircase smoothed into a continuous line: easier to compare across groups, at the cost of a smoothing choice that can blur or exaggerate. The box plot strips the shape down to five numbers: the median (the middle value), the two quartiles (the edges of the box holding the middle half of the data), and the two whisker ends (the reasonable range, conventionally the most extreme values within 1.5 box-lengths of the box), with dots for outliers beyond them. It is compact and great for comparing many groups side by side, but it hides whether each group has one peak or two. The violin puts the shape back: it’s a density curve mirrored into a symmetric blob, so a waist or a double-bulge that the box plot would conceal becomes visible again.
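The numbers a box plot draws can be computed as follows. This sketch assumes the common Tukey convention (whiskers reach the most extreme data points within 1.5 times the interquartile range of the box) and the "inclusive" quartile method; other tools may use slightly different definitions. The name `box_plot_stats` is hypothetical.

```python
import statistics


def box_plot_stats(values, whisker_k=1.5):
    """Median, quartiles, whisker ends, and outliers for one group.

    Assumes the Tukey convention: whiskers extend to the most extreme
    data points within whisker_k * IQR of the box; anything beyond is
    reported as an individual outlier.
    """
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lo_fence = q1 - whisker_k * iqr
    hi_fence = q3 + whisker_k * iqr
    inside = [v for v in data if lo_fence <= v <= hi_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < lo_fence or v > hi_fence],
    }


stats = box_plot_stats([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])
print(stats)
```

Note what the summary drops: nothing in these numbers says whether the values between the quartiles form one hump or two, which is the gap the violin fills.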
What all four are built to reveal is what a single number cannot. A mean can be identical for a tidy bell and for a lopsided pile with a long tail. Two peaks (bimodality) almost always mean two different populations got mixed together and should be pulled apart. A fat tail says rare-but-large events are more common than a normal curve would predict — the part that bankrupts you if you planned around the average. And a lone outlier can be a data-entry typo or the single most important point in the set. The standing lesson is the one Tukey built a discipline around: look at the data first, because the average can lie, and the shape is how you catch it lying.
When to use it
The distribution plot belongs to the STATISTICAL family of visual outputs — the ones that render quantitative data for inspection — and within it the plot is the foundational single-variable tool: the picture you draw to understand one column of numbers on its own before you summarize it, test it, or feed it to a model. This is the opening move of what John Tukey named exploratory data analysis — look at the shape before you assume one — and it places the plot next to two close relatives.
- A Scatter Plot is the two-variable cousin: where the distribution plot asks “what does this one variable look like?”, the scatter plot asks “how do these two variables move together?” Reach for it when you care about a relationship, not a single shape.
- A Time Series is the same variable laid out in time order: where the distribution plot deliberately ignores sequence to show the shape of the values, the time series keeps the order to show the trend. Reach for it when “when” matters, not just “how many.”
Reach for a distribution plot whenever you have a quantitative variable and you’re about to make a claim about it — its typical value, its variability, whether two groups differ — and you haven’t yet seen its shape. Skip it when the variable is already known to be simple and well-behaved (a five-point survey rating rarely needs a density curve), when the only thing at stake is a single agreed-on summary number, or when the data is purely categorical (counts of categories want a bar chart, which is a different thing). The distribution plot is the diagnostic prelude to analysis, not a substitute for it.
How Ora builds it
Ora produces a distribution plot from a semantic spec: a structured description of the variable and its units, the data itself (or a pointer to where it lives), which plot type to draw (histogram, density curve, box plot, violin, or several coordinated together), the binning or bandwidth choice that controls how fine the shape is resolved, and any annotations such as percentile markers or a flagged outlier. That spec is then rendered to an actual figure: a matplotlib- or Vega-style plot, with an accompanying text description of the shape (the number and location of peaks, the skew, the tail behavior, the key percentiles), since the visual alone isn’t accessible to a screen reader.
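To make the idea of a semantic spec concrete, here is a purely hypothetical sketch of what such a structure might look like. Every field name, default, and allowed value below is an illustrative assumption; none of it is Ora's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical sketch: field names and allowed values are illustrative
# assumptions, not Ora's actual spec schema.
PLOT_TYPES = {"histogram", "density", "box", "violin"}


@dataclass
class DistributionSpec:
    variable: str                        # name of the quantity being plotted
    units: str                           # e.g. "ms"
    values: list[float]                  # the data itself
    plot_types: list[str] = field(default_factory=lambda: ["histogram"])
    bin_rule: str = "freedman-diaconis"  # automatic rule used when bins is None
    bins: Optional[int] = None           # explicit override of the automatic rule
    percentile_markers: list[int] = field(default_factory=list)

    def __post_init__(self):
        unknown = set(self.plot_types) - PLOT_TYPES
        if unknown:
            raise ValueError(f"unknown plot types: {sorted(unknown)}")
        if self.bins is not None and self.bins < 1:
            raise ValueError("bins must be a positive integer")


spec = DistributionSpec(
    variable="response_time",
    units="ms",
    values=[80.0, 82.0, 79.0, 4100.0],
    plot_types=["histogram", "box"],
    percentile_markers=[50, 95, 99],
)
```

The point of the sketch is the separation of concerns: the spec records what to show and which choices were made, and a renderer turns it into a figure and a text description.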
The plot is the visual face of Ora’s quantitative-reporting and data-analysis work: when you hand over a column of numbers and ask “what does this actually look like? Show me the distribution,” that is the operation this artifact performs and how it shows its result. Critically, the binning is treated as a decision, not a default: too-wide bins hide a real second peak, too-narrow bins turn noise into fake structure, so the spec supports principled automatic rules (Freedman-Diaconis or Scott’s rule for bins, Silverman’s rule for density bandwidth) with explicit overrides on top.
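The three named rules have simple textbook forms, sketched below. This is an illustration of the standard formulas, not Ora's implementation; the function names are hypothetical, and the choice of sample standard deviation and "inclusive" quartiles is an assumption that other libraries may make differently.

```python
import math
import statistics


def _iqr(values):
    """Interquartile range using the 'inclusive' quartile method."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def freedman_diaconis_width(values):
    """Bin width h = 2 * IQR * n^(-1/3)."""
    return 2 * _iqr(values) * len(values) ** (-1 / 3)


def scott_width(values):
    """Bin width h ~= 3.49 * s * n^(-1/3), s = sample standard deviation."""
    return 3.49 * statistics.stdev(values) * len(values) ** (-1 / 3)


def silverman_bandwidth(values):
    """KDE bandwidth h = 0.9 * min(s, IQR / 1.34) * n^(-1/5)."""
    spread = min(statistics.stdev(values), _iqr(values) / 1.34)
    return 0.9 * spread * len(values) ** (-1 / 5)


def choose_bin_count(values, width=None):
    """Bin count for a width; an explicit width overrides the automatic rule."""
    h = width if width is not None else freedman_diaconis_width(values)
    if h <= 0:
        return 1  # no spread to resolve, fall back to a single bin
    return max(1, math.ceil((max(values) - min(values)) / h))


data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
auto_bins = choose_bin_count(data)             # Freedman-Diaconis default
manual_bins = choose_bin_count(data, width=4)  # explicit override
print(auto_bins, manual_bins)
```

Freedman-Diaconis leans on the interquartile range, so a single extreme value barely moves it, while Scott's rule uses the standard deviation and widens the bins when outliers are present. That difference is one reason to keep the override available.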
The forms trace back through the history of statistics. The histogram, which groups continuous measurements into bins and draws the counts as bars, runs back to Karl Pearson’s late-nineteenth-century formalization, building on the graphical tradition William Playfair pioneered a century earlier. The smooth density curve is the kernel density estimate developed by Murray Rosenblatt (1956) and Emanuel Parzen (1962). And the box plot is the invention of John W. Tukey, who introduced it as part of the Exploratory Data Analysis program that made “look at the shape first” the discipline these plots all serve.
Related
- Scatter Plot — the two-variable member of the STATISTICAL family: positions points by two coordinates to reveal a relationship, where the distribution plot reveals one variable’s shape.
- Time Series — the same single variable plotted in time order, trading the distribution’s shape view for the trend over time.
- Heatmap — extends quantitative shape into two dimensions, using color intensity across a grid to show where values concentrate in a matrix.
- Comparison Chart — the decision-facing companion: the distribution of outcomes is exactly the input its quantitative cells should summarize.