Real-world data is messy and often comes with missing values, which cause problems when it is time to analyze the data. Before starting any work on a dataset, you have to check it for missing values.
There are many ways to handle missing data. I will demonstrate them on a toy dataset that we will create together. In this post, we will answer the following questions:
- What is missing data and what are the types of missing data?
- How can we detect missing values?
- How can we handle missing values?
The following packages will be used in this tutorial. If you don’t have any of these, just pip install {package name}.
# This piece of code blocks the warning messages
import warnings
warnings.filterwarnings('ignore')
# Import libraries and check the versions
import pandas as pd
import sys
import missingno as msno
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import pandas_profiling
%matplotlib inline
print('Python version ' + sys.version)
print('Numpy version ' + np.__version__)
print('Pandas version ' + pd.__version__)
print('Matplotlib version ' + matplotlib.__version__ )
print('Missingno version ' + msno.__version__)
NumPy, pandas and matplotlib are commonly used in data science. In this post, we will also use two packages that you might not have on your system. However, they are easy to install. Just uncomment the package you are missing below (by removing the #) and run the cell. Once the installation finishes, go back to the previous cell and run the imports again to make sure you have everything installed.
# !pip install pandas_profiling
# !pip install missingno
First we will create a toy dataset that has some missing values.
data = {'name': ['Michael', 'Jessica', 'Sue', 'Jake', 'Amy', 'Tye'],
'gender': [None, 'F', np.nan, 'F', np.nan, 'M'],
'height': [123, 145, 100, np.nan, None, 150],
'weight': [10, np.nan, 30, np.nan, None, 20],
'age': [14, None, 29, np.nan, 52, 45],
}
df = pd.DataFrame(data, columns = ['name','gender', 'height', 'weight', 'age'])
df
What is missing data and what are the types of missing data?
Missing data is a value that was not recorded or is not available for a given variable in an observation. Notice that our toy dataset uses two markers for missing values: None and np.nan. None is Python's built-in null object, while NaN (Not a Number) is a special floating-point value defined by the IEEE 754 standard and recognized by many other systems.
Pandas treats both of these markers as missing. NumPy, on the other hand, propagates NaN through ordinary aggregations and offers NaN-aware variants of its functions to handle missing data. Let’s see an example below.
# create a numpy array that has a missing value
a = np.array([1,2,np.nan, 4])
a.dtype
# sum returns nan because NaN propagates through the calculation
np.sum(a)
# nansum ignores NaN and returns the expected result
np.nansum(a)
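To make the difference between the two markers more concrete, here is a small sketch. The variable names are only for illustration. It shows that NaN never compares equal to itself, that pandas converts None to NaN in a numeric column, and that isna() flags both markers.

```python
import numpy as np
import pandas as pd

# NaN is a float and never compares equal to itself,
# so `==` cannot be used to find missing values
nan_equals_itself = (np.nan == np.nan)

# In a numeric column pandas converts None to NaN and stores floats
numeric = pd.Series([1, None, 3])
numeric_dtype = str(numeric.dtype)

# In a text column both None and NaN are still reported as missing
text = pd.Series(['F', None, np.nan])

numeric_missing = numeric.isna().tolist()
text_missing = text.isna().tolist()

print(nan_equals_itself, numeric_dtype)
print(numeric_missing, text_missing)
```

This is why the detection methods in the next section rely on isnull() and notnull() rather than on equality checks.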
How can we detect missing values?
I will show three ways that I find useful to identify missing values in a dataset.
1- .info(), isnull() and notnull() are useful for detecting missing values:
# .info() is general information about a dataset
df.info()
# sum of the missing values in each column
df.isnull().sum()
# notnull() is the opposite of isnull()
df.notnull().sum()
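Counts are easier to judge when you also see them as a share of the column, and it often helps to look at missing values per row as well. The following is a minimal sketch on the same toy dataset. The names missing_share, missing_per_row and incomplete_rows are my own and not part of pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['Michael', 'Jessica', 'Sue', 'Jake', 'Amy', 'Tye'],
    'gender': [None, 'F', np.nan, 'F', np.nan, 'M'],
    'height': [123, 145, 100, np.nan, None, 150],
    'weight': [10, np.nan, 30, np.nan, None, 20],
    'age': [14, None, 29, np.nan, 52, 45],
})

# fraction of missing values in each column (mean of a boolean mask)
missing_share = df.isnull().mean()

# number of missing values in each row
missing_per_row = df.isnull().sum(axis=1)

# only the rows that contain at least one missing value
incomplete_rows = df[df.isnull().any(axis=1)]

print(missing_share)
print(missing_per_row.tolist())
print(incomplete_rows['name'].tolist())
```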
2- Missingno is a great package to quickly display missing values in a dataset. More examples and features can be found in its github repo.
msno.matrix(df.sample(6))
msno.bar(df.sample(6))
3- pandas_profiling is another package for missing data that gives a high level overview of the dataset as well as detailed information for each column in the dataset including the number of missing values.
pandas_profiling.ProfileReport(df)
How can we handle missing values?
The easiest way is to get rid of the rows or columns that have missing values. The pandas built-in method dropna() does that. Pandas does not allow deleting a single cell: either the entire row or the entire column has to be removed.
One thing to keep in mind is that dropna() has a parameter called inplace, which defaults to False. With the default, the original dataset is left untouched and a new object is returned. If inplace=True, the changes are applied to the dataset right away.
# original dataset has not changed
df.dropna()
# parameter axis=1 deletes the columns
df.dropna(axis = 1)
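Dropping every row with a gap is often too aggressive: on our toy dataset it leaves a single row. dropna() has a few parameters that make it more selective. The sketch below tries subset, thresh and how on the toy data. The variable names are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['Michael', 'Jessica', 'Sue', 'Jake', 'Amy', 'Tye'],
    'gender': [None, 'F', np.nan, 'F', np.nan, 'M'],
    'height': [123, 145, 100, np.nan, None, 150],
    'weight': [10, np.nan, 30, np.nan, None, 20],
    'age': [14, None, 29, np.nan, 52, 45],
})

# default: drop every row that has at least one missing value
complete_rows = df.dropna()

# subset: only look at the age column when deciding what to drop
has_age = df.dropna(subset=['age'])

# thresh: keep rows that have at least 4 non-missing values
mostly_complete = df.dropna(thresh=4)

# how='all': drop a row only if every value in it is missing
not_empty = df.dropna(how='all')

# axis=1: drop every column that has at least one missing value
complete_columns = df.dropna(axis=1)

print(len(complete_rows), len(has_age), len(mostly_complete), len(not_empty))
print(list(complete_columns.columns))
```

None of these calls modify df, because inplace is left at its default.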
In some cases you won’t want to lose any data in a dataset. In that case, use fillna(). How to fill the missing values is up to you. I will show a few ways below.
# fills all the missing values with the specified value; inplace is False
df['age'].fillna(0)
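fillna() also accepts a dictionary, so different columns can get different fill values in one call. This is a small sketch. The fill values 'unknown' and 0 are arbitrary placeholders chosen for the example, not a recommendation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['Michael', 'Jessica', 'Sue', 'Jake', 'Amy', 'Tye'],
    'gender': [None, 'F', np.nan, 'F', np.nan, 'M'],
    'height': [123, 145, 100, np.nan, None, 150],
    'age': [14, None, 29, np.nan, 52, 45],
})

# keys are column names, values are the fill value for that column;
# columns not listed (height here) are left as they are
filled = df.fillna({'gender': 'unknown', 'age': 0})

remaining = filled.isnull().sum()
print(filled)
print(remaining)
```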
ffill means forward-fill. Here the rows at index 2 and 4 are filled with the previous value, which is F. The first row has no previous row to copy data from, so it remains missing.
# returns a new Series; df itself is not changed
df['gender'].ffill()
To overcome this you can use bfill, which stands for back-fill. It works the opposite way of ffill and covers all the missing values in our gender column.
# assign the result back so that the change is applied to the dataset
df['gender'] = df['gender'].bfill()
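Forward-fill and back-fill can also be chained, so that a leading or trailing gap left by one is covered by the other. The sketch below compares back-fill alone with forward-fill followed by back-fill on the gender column. Note that the two give a different value for one row, so the choice is not neutral.

```python
import numpy as np
import pandas as pd

gender = pd.Series([None, 'F', np.nan, 'F', np.nan, 'M'])

# back-fill only: each gap takes the next known value
back = gender.bfill()

# forward-fill first, then back-fill whatever is still missing
# (here only the very first row, which has no previous value)
forward_then_back = gender.ffill().bfill()

print(back.tolist())
print(forward_then_back.tolist())
```

The row at index 4 becomes M with back-fill alone and F with the chained version, so it is worth checking which neighbor makes more sense for your data.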
A third option is to fill missing data with the mean of a column or of a group within it. For example, here we fill the missing values in the height column with the mean height of each gender. You could also use the median, the mode, etc.
df['height'] = df['height'].fillna(df.groupby('gender')['height'].transform('mean'))
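To see what the group-wise fill does, it helps to look at the intermediate result of transform('mean'). This is a sketch on a reduced copy of the toy data, assuming the gender column has already been back-filled as in the previous step.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'gender': ['F', 'F', 'F', 'F', 'M', 'M'],   # after back-fill
    'height': [123, 145, 100, np.nan, np.nan, 150],
})

# transform('mean') returns one value per row: the mean height of that
# row's gender group, computed from the non-missing heights only
group_mean = df.groupby('gender')['height'].transform('mean')

# fillna aligns on the index, so each gap gets its own group's mean
filled_height = df['height'].fillna(group_mean)

print(group_mean.round(2).tolist())
print(filled_height.round(2).tolist())
```

The mean of the three known F heights is (123 + 145 + 100) / 3, about 122.67, and the only known M height is 150, so those are the values that fill the two gaps.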
This time we will fill the weight column with the median of all values in that column.
df['weight'] = df['weight'].fillna(df['weight'].median())
# only age column has missing values
df.isnull().sum()
And lastly, we can use interpolation to fill missing data. This method fills each missing value based on the previous and the next known values. By default the interpolation is linear, which is what we use in our example.
df['age'] = df['age'].interpolate()
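As a worked example, here is what linear interpolation produces for the age column of the toy dataset. Each gap sits between two known values, so it is filled with their midpoint.

```python
import numpy as np
import pandas as pd

age = pd.Series([14, None, 29, np.nan, 52, 45])

# default method is 'linear': values are treated as equally spaced
interpolated = age.interpolate()

print(interpolated.tolist())
```

The gap between 14 and 29 becomes 21.5 and the gap between 29 and 52 becomes 40.5. Keep in mind that this only makes sense when the row order carries meaning, for example in a time series. In our toy data the rows are unrelated people, so this is purely a demonstration of the mechanics.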
After using the methods outlined above our toy dataset is finally complete and has no missing values.
df
Let’s verify that no data is missing anymore.
msno.matrix(df)
msno.bar(df)
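Besides the plots, the result can be checked numerically. The sketch below repeats the steps of this post in one place, using assignment instead of inplace=True, and then counts the remaining missing values. The name total_missing is just for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['Michael', 'Jessica', 'Sue', 'Jake', 'Amy', 'Tye'],
    'gender': [None, 'F', np.nan, 'F', np.nan, 'M'],
    'height': [123, 145, 100, np.nan, None, 150],
    'weight': [10, np.nan, 30, np.nan, None, 20],
    'age': [14, None, 29, np.nan, 52, 45],
})

# gender: back-fill
df['gender'] = df['gender'].bfill()

# height: mean height of each gender
df['height'] = df['height'].fillna(
    df.groupby('gender')['height'].transform('mean'))

# weight: median of the column
df['weight'] = df['weight'].fillna(df['weight'].median())

# age: linear interpolation
df['age'] = df['age'].interpolate()

# total number of missing values left in the whole dataset
total_missing = int(df.isnull().sum().sum())
print(total_missing)
```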
Further Learning
Most of the methods we have seen here take additional parameters. I recommend playing with the parameters, changing them and observing the results.
As always, all the material belonging to this post can be found and downloaded on my GitHub.
Pandas documentation for working with missing data.
Find more about Interpolation here.
Handling missing data by Jake VanderPlas
Missing Data In Pandas Dataframes by Chris Albon
How to Handle Missing Data with Python by Jason Brownlee