Distribution charts show how quantitative values are distributed along an axis from lowest to highest. By looking at the shape of the data, a user can identify characteristics such as the range of values, central tendency, shape and outliers. These charts can be used to answer questions such as “how many customers do we have per age group?” or “how many days late are our payments?”.
Here are some example charts you can use to answer the questions above.
A histogram groups a measurement into bins to show its rough distribution, so you can find the spread and peaks of the data values. As you can see in the histogram below, there is quite an even spread of values with a peak almost in the middle. Had we instead had outliers or peaks in different parts of our data, we would easily have seen them here.
The issue with a histogram is that it only gives you a rough distribution, as the chart very much depends on the bin size of your bars. Therefore, this chart should preferably be used if you know the width of your bins, such as age groups. If you don’t know and just want to see the spread, you can experiment by changing either the bin width or the number of bars. You could also use a calculation, such as Sturges’ formula, to find a reasonable starting number of bins.
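To make the bin calculation concrete, here is a minimal Python sketch of Sturges’ formula, which suggests ceil(log2(n)) + 1 bins for n values, together with a simple equal-width binning step. The function names and the sample ages are made up for illustration, and this is not how any particular charting tool sets its bins.

```python
import math


def sturges_bin_count(n_values):
    """Suggest a number of bins using Sturges' formula: ceil(log2(n)) + 1."""
    if n_values < 1:
        raise ValueError("need at least one value")
    return math.ceil(math.log2(n_values)) + 1


def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when all values are identical.
    width = (hi - lo) / n_bins or 1.0
    counts = [0] * n_bins
    for v in values:
        # The maximum value would land one past the end, so clamp it
        # into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical customer ages, invented for this example.
    ages = [19, 23, 25, 31, 34, 35, 38, 41, 42, 47, 52, 58, 63, 71]
    bins = sturges_bin_count(len(ages))
    print(bins, histogram_counts(ages, bins))
```

Treat the result as a starting point only. If your data has natural groups, such as age brackets, those usually make better bins than a calculated count.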
For more details on how a histogram works you can read my colleague's blog on “recipe for a histogram”.
If you want to look at the distribution of the actual values, you can use a distribution plot, also known as a strip plot, where each value is plotted as a point. In this chart you can see which data points are your lowest and highest. You can also see whether the points are evenly spread or whether there are clusters and gaps.
Below I have three charts that show:
- The average lead time by product, where it’s easy to see which product has the lowest and which has the highest lead time.
- The same data, but now I’ve added in product group so that I can see the spread of average lead time per product and product group.
- The same chart as the second one, but with jittering applied. Jittering is a method to spread out your values to make them easier to see when there are many values and they are on top of each other.
As you can see, when you have many values it’s hard to see anything other than the lowest and highest value. By grouping the values, we spread them out across different rows, and with jittering we avoid stacking them on top of each other. But if you still have many values and the data is hard to see, then you should probably use the box plot instead.
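If you are curious what jittering does under the hood, the idea can be sketched in a few lines of Python: each point keeps its measured value, and only its position on the other axis (the row it sits on) is nudged by a small random offset. The function name, the offset size and the fixed seed are my own assumptions for this sketch, and chart tools may spread the points differently.

```python
import random


def jitter(positions, amount=0.2, seed=0):
    """Return positions shifted by a random offset in [-amount, amount].

    Only the row position is jittered; the measured value itself
    (for example the lead time) must stay untouched.
    """
    rng = random.Random(seed)  # fixed seed so the layout is repeatable
    return [p + rng.uniform(-amount, amount) for p in positions]


if __name__ == "__main__":
    # Hypothetical example: five points that all sit on row 1.
    rows = [1, 1, 1, 1, 1]
    print(jitter(rows))
```

Keeping the offset well below the distance between rows stops the points from drifting into a neighbouring group.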
When you have many values and want to see their distribution to easily find outliers, a box plot might be what you’re seeking. The box plot has many different settings, but normally the box shows you the main group of values and the whiskers show you the spread. Depending on what threshold you have on the whiskers, you may see values that end up outside of the whiskers. These are the outliers. If no outliers are shown, the whiskers show the max and min value.
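As a rough illustration of how the box, whiskers and outliers relate, the Python sketch below uses one common setting: the box spans the first to the third quartile, and the whiskers reach the most extreme values that lie within 1.5 times the interquartile range of the box. That threshold, the quartile method and the function name are assumptions on my part, and your chart tool may use different defaults.

```python
import statistics


def box_plot_stats(values, whisker_factor=1.5):
    """Compute box, whisker and outlier values for a list of numbers.

    Assumes at least two values and the 1.5 x IQR whisker rule.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker_factor * iqr
    upper_fence = q3 + whisker_factor * iqr
    # Whiskers stop at the most extreme values still inside the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = sorted(v for v in values if v < lower_fence or v > upper_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


if __name__ == "__main__":
    # Hypothetical lead times in days, invented for this example.
    lead_times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
    print(box_plot_stats(lead_times))
```

With the made-up lead times above, 50 falls outside the upper whisker and is reported as an outlier. Remove it and the whiskers simply land on the min and max value.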
My colleagues have also written a blog on the box plot. If you check out “Recipe for a Box Plot”, you’ll get more info on how a box plot works.
That was all for today, and hopefully you’ll try out some of these charts with your data and find some interesting distributions!