Recipes for the Visualizations of Data Distributions

Published in Towards Data Science, Oct 22, 2019

Histograms, KDE plots, box(en) plots, violin plots and more…

As a budding data scientist, I realized that the first piece of code in a new project is always written to understand the distribution of one or several variables in the data set. Visualizing the distribution of a variable is the quickest way to grasp valuable properties such as frequencies, peaks, skewness, center, modality, and how values and outliers behave across the data range.

Excited to share what I learned, I wrote this blog post to summarize single-variable (univariate) distribution plots, drawing on several articles and pieces of documentation. I will provide the steps to draw each plot without going deep into theory, to keep the post simple.

I will start by explaining the functions used to visualize data distributions in Python with the Matplotlib and Seaborn libraries. The code behind the visualizations can be found in this notebook.

For the illustrations, I used the Gapminder life expectancy data; the cleaned version can be found in this GitHub repository.

The data set shows 142 countries’ life expectancy at birth, population and GDP per capita between the years 1952 and 2007. I will plot the life expectancy at birth using:

Histogram
Kernel Density Estimation and Distribution Plot
Box Plot
Boxen Plot
Violin Plot

Histograms are the simplest way to show how data is spread. Here is the recipe for making a histogram:

Create buckets (bins) by dividing your data range into equal-sized intervals. The number of intervals is the number of bins you have.
Record the count of the data points that fall into each bin.
Visualize each bucket side by side on the x-axis.
Show the counts on the y-axis, so that the height of each bar tells how many items there are in each bin.

And you have a brand-new histogram!
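
If you would like to see the recipe in code before reaching for a plotting library, here is a minimal sketch in plain Python. The histogram function and the sample values are made up for illustration (they are not the Gapminder data), and the sketch assumes the values are not all identical.

```python
def histogram(values, n_bins):
    """Return (edges, counts) for equal-width bins over the data range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    # n_bins equal-width bins need n_bins + 1 edges
    edges = [lo + i * width for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for v in values:
        # index of the bin that contains v; the maximum goes into the last bin
        i = min(int((v - lo) / width), n_bins - 1)
        counts[i] += 1
    return edges, counts


# made-up sample values, for illustration only
sample = [41.0, 43.5, 47.2, 52.8, 55.1, 58.9, 61.3, 64.0,
          66.7, 68.2, 70.4, 71.9, 72.5, 74.1, 78.6]
edges, counts = histogram(sample, n_bins=4)

# a text version of the chart: one row per bin, one '#' per data point
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:5.1f} - {right:5.1f} | {'#' * count}")
```

Each row is one bucket, and the number of '#' marks is the count that a real histogram would show as bar height.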

It is the easiest and most intuitive way. However, one drawback is that you have to decide on the number of bins.

In this graph, I chose 25 bins, which looked best after playing around with the bins parameter of the Matplotlib hist function.

# assumes df holds the cleaned Gapminder data (see the notebook)
import matplotlib.pyplot as plt

# set the histogram
plt.hist(df.life_expectancy,
         range=(df.life_expectancy.min(),
                df.life_expectancy.max() + 1),
         bins=25,
         alpha=0.5)
# set title and labels
plt.xlabel("Life Expectancy")
plt.ylabel("Count")
plt.title("Histogram of Life Expectancy between 1952 and 2007 in the World")
plt.show()

The number of bins can significantly change how your data distribution looks. With 5 bins, the same data looks like a totally different data set, right?
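
To see how much the picture depends on that choice without drawing anything, here is a small sketch that counts the same made-up sample with 2 bins and with 8 bins. The bin_counts helper and the values are my own assumptions for illustration.

```python
from collections import Counter


def bin_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    # bin index per value; the maximum is clamped into the last bin
    tally = Counter(min(int((v - lo) / width), n_bins - 1) for v in values)
    return [tally[i] for i in range(n_bins)]


# made-up sample with two clusters, one in the 40s and one in the 70s
sample = [42, 44, 45, 45, 46, 47, 48, 70, 71, 72, 72, 73, 74, 75, 76]

coarse = bin_counts(sample, 2)
fine = bin_counts(sample, 8)
print(coarse)
print(fine)
```

With 2 bins the sample looks almost evenly spread, while with 8 bins the empty gap between the two clusters becomes visible.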

If you don’t want to bother with choosing the number of bins, let’s jump to kernel density estimation and distribution plots.

Kernel Density Estimation (KDE) plots save you from the hassle of deciding on the bin size by smoothing the histogram. Follow the logic below to create a KDE plot:

Plot a Gaussian (normal) curve around each data point.
Sum the curves to create a density at each point.
Normalize the final curve so that the area under it equals 1, resulting in a probability density function.

Here is a visual example of those 3 steps:

[Figure: visual example of the three KDE steps]
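
The same three steps can also be sketched in plain Python. This is only an illustration under my own assumptions: the sample values are made up, the bandwidth (the width of each Gaussian) is hand-picked, and the function names are hypothetical. Seaborn chooses a bandwidth for you.

```python
import math


def gaussian(x, center, bandwidth):
    """Normal density with mean `center` and standard deviation `bandwidth`."""
    z = (x - center) / bandwidth
    return math.exp(-0.5 * z * z) / (bandwidth * math.sqrt(2 * math.pi))


def kde_at(x, data, bandwidth):
    # steps 1 and 2: one Gaussian around each data point, summed at x
    total = sum(gaussian(x, point, bandwidth) for point in data)
    # step 3: each Gaussian has area 1, so dividing by the number of
    # points brings the area under the summed curve back to 1
    return total / len(data)


sample = [48.0, 52.0, 61.0, 67.0, 70.0, 71.0, 73.0, 78.0]  # made-up values
bandwidth = 4.0  # hand-picked assumption

# evaluate the curve on a grid from 20 to 110 and approximate its area
step = 0.1
grid = [20.0 + i * step for i in range(901)]
density = [kde_at(x, sample, bandwidth) for x in grid]
area = sum(density) * step
print(round(area, 3))  # should be very close to 1
```

The density list is what a KDE plot draws on the y-axis, and the area check confirms the normalization in step 3.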

You will find the range of the data on the x-axis and the probability density function of the random variable on the y-axis. The probability density function is described in this article by Will Koehrsen as follows:

You may think of the y-axis on a density plot as a value only for relative comparisons between different categories.

Luckily, you don’t have to remember and apply all these steps manually. Seaborn’s KDE plot function completes all of them for you; just pass it a column of your data frame or a NumPy array!

import seaborn as sns

# set KDE plot, title and labels
# fill=True replaces shade=True, which is deprecated since Seaborn 0.11
ax = sns.kdeplot(df.life_expectancy, fill=True, color="b")
plt.title("KDE Plot of Life Expectancy between 1952 and 2007 in the World")
plt.ylabel("Density")

If you want to combine a histogram and a KDE plot, Seaborn offers the distribution plot, which draws the KDE plot and lets you turn the histogram on and off through the hist parameter of the function.

# set distribution plot, title and labels
# note: distplot is deprecated since Seaborn 0.11; histplot(..., kde=True) is its successor
ax = sns.distplot(df.life_expectancy, hist=True, color="b")
plt.title("Distribution Plot of Life Expectancy between 1952 and 2007 in the World")
plt.ylabel("Density")

KDE plots are also capable of showing distributions among different categories:

# create list of continents
continents = df["continent"].value_counts().index.tolist()
# set kde plot for each continent
for c in continents:
    subset = df[df["continent"] == c]
    sns.kdeplot(subset["life_expectancy"], label=c, linewidth=2)
# show the continent labels; called explicitly, since newer Seaborn versions may not add the legend
plt.legend()
# set title, x and y labels
plt.title("KDE Plot of Life Expectancy Among Continents Between 1952 and 2007")
plt.ylabel("Density")
plt.xlabel("Life Expectancy")

Although KDE plots and distribution plots involve more computation and mathematics than histograms, it is easier to understand the modality, symmetry, skewness and center of a distribution by looking at a continuous line. One disadvantage is that they lack information about summary statistics.

If you wish to show the summary statistics of your distribution visually, then let’s move on to box plots.

Box plots show data distributions with the five-number summary (minimum, first quartile Q1, median or second quartile Q2, third quartile Q3, maximum). Here are the steps to draw them:

Sort your data to determine the minimum, quartiles (first, second and third) and maximum.
Draw a box between the first and third quartile, then draw a vertical line in the box corresponding to the median.
Draw a horizontal line through the middle of the box, extending out of it on both sides to the minimum and the maximum. These lines will be your whiskers.
The ends of the whiskers mark the minimum and maximum of the data, excluding any outliers, which are drawn as little diamonds set apart from the whiskers.
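
The first step of the recipe, finding the five numbers, can be sketched with the statistics module from the Python standard library. Quartile conventions differ between tools; the "inclusive" method used here is one common choice, and the function name and sample values are made up for illustration.

```python
import statistics


def five_number_summary(values):
    """Return minimum, Q1, median, Q3 and maximum of the values."""
    data = sorted(values)
    # n=4 splits the data into quarters and returns the three cut points
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {"min": data[0], "q1": q1, "median": median,
            "q3": q3, "max": data[-1]}


sample = [67.0, 45.0, 72.0, 58.0, 79.0, 52.0, 70.0, 63.0, 75.0]  # made-up values
summary = five_number_summary(sample)
print(summary)
```

The box would span from q1 to q3 with a line at the median, and the whiskers would reach out to min and max.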

The steps to create a box plot manually are straightforward, but I prefer to get some support from Seaborn’s box plot function.

# set the box plot and title
sns.boxplot(x="life_expectancy", data=df, palette="Set3")
plt.title("Boxplot of Life Expectancy between 1952 and 2007 in the World")

There are several ways to calculate the length of the whiskers. By default, Seaborn’s box plot function extends them up to 1.5 times the interquartile range (IQR) beyond the first and third quartiles. Thus, any data point bigger than Q3+(1.5*IQR) or smaller than Q1-(1.5*IQR) will be drawn as an outlier. You can change the calculation of the whiskers by adjusting the whis parameter.
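
Here is a minimal sketch of that 1.5*IQR rule in plain Python. It is my own illustration and not Seaborn's code: the function name and the sample values are made up, and the whis argument simply mirrors the name of the Seaborn parameter.

```python
import statistics


def whiskers_and_outliers(values, whis=1.5):
    """Return (lower whisker end, upper whisker end, outliers)."""
    data = sorted(values)
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    # the whiskers end at the most extreme data points inside the fences
    return inside[0], inside[-1], outliers


sample = [23.0, 58.0, 60.0, 62.0, 63.0, 65.0, 66.0, 68.0, 70.0]  # made-up values
low, high, outliers = whiskers_and_outliers(sample)
print(low, high, outliers)
```

In this sample Q1 is 60 and Q3 is 66, so the fences sit at 51 and 75: the value 23 falls outside and would be drawn as an outlier, while the whiskers stop at 58 and 70.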

Like KDE plots, box plots are also suitable for visualizing the distributions among categories:

# set the box plot with the ordered continents and title
sns.boxplot(x="continent", y="life_expectancy", data=df,
            palette="Set3",
            order=["Africa", "Asia", "Americas", "Europe",
                   "Oceania"])
plt.title("Boxplot of Life Expectancy Among Continents Between 1952 and 2007")

Box plots tell the story of the summary statistics: the box and the whiskers show where half of the data lies and how wide the whole range is. On the other hand, they give little visibility into how the data behaves outside the box. That is the reason why some scientists published a paper about boxen plots, also known as extended box plots.

Boxen plots, also called letter value plots or extended box plots, might be the least used method for visualizing data distributions, yet they convey more information on large data sets.

To create a boxen plot, let’s first understand what a letter value summary is. A letter value summary is built by repeatedly determining the middle value of sorted data.

First, determine the middle value of all the data, and create two slices. Then, determine the median of each of those two slices, and repeat this process until a stopping criterion is reached or no more data is left to be separated.

The first middle value determined is the median. The middle values determined in the second iteration are called fourths, and the middle values determined in the third iteration are called eighths.
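
The letter value summary described above can be sketched in plain Python as follows. This is a simplified illustration under my own assumptions: the function name is hypothetical, the stopping criterion is just a fixed number of levels, and when a slice has an odd number of values its middle value is kept in both halves.

```python
import statistics


def letter_values(values, levels=3):
    """Return a list of (name, lower value, upper value), median first."""
    data = sorted(values)
    names = ["median", "fourths", "eighths", "sixteenths"]
    lower, upper = data, data
    result = []
    for name in names[:levels]:
        result.append((name, statistics.median(lower), statistics.median(upper)))
        # keep the outer half of each slice for the next iteration;
        # for an odd-sized slice the middle value stays in the half
        half = (len(lower) + 1) // 2
        lower, upper = lower[:half], upper[-half:]
    return result


sample = list(range(1, 16))  # made-up values 1..15
summary = letter_values(sample)
for name, low, high in summary:
    print(name, low, high)
```

A boxen plot would draw the widest box between the two fourths and then progressively narrower boxes out to the eighths, sixteenths and so on.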

Now let’s draw a box plot and visualize letter value summaries outside the box instead of whiskers. In other words, plot a box plot with extended box edges corresponding to the middle values of the slices (eighths, sixteenths and so on).

# set boxen plot and title
sns.boxenplot(x="life_expectancy", data=df, palette="Set3")
plt.title("Boxenplot of Life Expectancy between 1952 and 2007 in the World")

They are also effective in telling the data story for different categories:

# set boxen plot with ordered continents and title
sns.boxenplot(x="continent", y="life_expectancy", data=df,
              palette="Set3",
              order=["Africa", "Asia", "Americas", "Europe",
                     "Oceania"])
plt.title("Boxenplot of Life Expectancy Among Continents Between 1952 and 2007")

Boxen plots emerged to visualize larger data sets more effectively: they show how the data is spread outside of the main box and put more emphasis on the outliers, because outliers and data outside the IQR matter more in larger data sets.

Two perspectives give clues about a data distribution: its shape and its summary statistics. To explain a distribution from both perspectives at the same time, let’s learn to cook some violin plots.

Violin plots are the perfect combination of box plots and KDE plots. They deliver the summary statistics with the box plot inside and the shape of the distribution with the KDE plot on the sides.

It is my favorite plot because the data is expressed with all the details it has. Do you remember the life expectancy distribution shape and summary statistics we plotted earlier? Seaborn’s violin plot function will blend them for us now.

Et voilà !

# set violin plot and title
sns.violinplot(x="life_expectancy", data=df, palette="Set3")
plt.title("Violinplot of Life Expectancy between 1952 and 2007 in the World")

You can observe the peak of the data around 70 by looking at the distribution on the sides, and see that half of the data points lie between 50 and 70 by looking at the slim box inside.

These beautiful violins can be used to visualize data with categories, and you can express summary statistics with dots, dashed lines or lines if you wish, by changing the inner parameter.

The advantage is obvious: Visualize the shape of the distribution and summary statistics simultaneously!

Bonus points with violin plots: by setting the scale parameter to count, you can also show how many data points you have in each category, thus emphasizing the importance of each category. When I changed the scale, Africa and Asia expanded and Oceania shrank, which tells us there are fewer data points in Oceania and more in Africa and Asia.

# set the violin plot with different scale, inner parameter and title
# (Seaborn 0.13 and later call the scale parameter density_norm)
sns.violinplot(x="continent", y="life_expectancy", data=df,
               palette="Set3",
               order=["Africa", "Asia", "Americas", "Europe",
                      "Oceania"],
               inner=None, scale="count")
plt.title("Violinplot of Life Expectancy Among Continents Between 1952 and 2007")

So, these recipes about visualizing distributions explained the core idea behind each plot. There are plenty of options to show single-variable, or univariate, distributions.

Histograms, KDE plots and distribution plots explain the shape of the data very well. Additionally, distribution plots can combine histograms and KDE plots.

Box plots and boxen plots are best for communicating summary statistics, boxen plots work better on large data sets, and violin plots do it all.

They are all effective communicators, and each of them can be built quickly with the Seaborn library in Python. Your choice of visualization depends on your project (data set) and on what information you want to convey to your audience. If you are thrilled by this post and want to learn more, you can check the Seaborn and Matplotlib documentation.

Last but not least, this is my first contribution to Towards Data Science, and I hope you enjoyed reading it! I appreciate your constructive feedback and would like to hear your opinions about this blog post in the responses or on Twitter.