
Handling Missing Data


In the real world, data is messy and often comes with missing values, which cause problems when it is time to analyze it. Before starting any research on a dataset, the missing values have to be checked.

There are many ways to handle missing data. I will demonstrate them on a toy dataset that we will create together. Along the way, we will answer the following questions:

  • What is missing data and what are the types of missing data?
  • How can we detect missing values?
  • How can we handle missing values?

The following packages will be used in this tutorial. If you don’t have any of these, just pip install {package name}.

# This piece of code blocks the warning messages
import warnings 
warnings.filterwarnings('ignore')

# Import libraries and check the versions
import pandas as pd
import sys
import missingno as msno
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import pandas_profiling
%matplotlib inline


print('Python version ' + sys.version)
print('Numpy version ' + np.__version__)
print('Pandas version ' + pd.__version__)
print('Matplotlib version ' + matplotlib.__version__ )
print('Missingno version ' + msno.__version__)
>>> Python version 3.6.1 |Anaconda custom (x86_64)| (default, Mar 22 2017, 19:25:17) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
>>> Numpy version 1.12.1
>>> Pandas version 0.19.2
>>> Matplotlib version 2.0.0
>>> Missingno version 0.3.5

NumPy, pandas and matplotlib are commonly used in data science, but this post also uses two packages that you might not have on your system. They are easy to install: just uncomment the package you are missing below (by removing the #) and run the cell. Once installed, go back to the previous cell, re-run the imports, and make sure everything loads.

# !pip install pandas_profiling
# !pip install missingno

First we will create a toy dataset that has some missing values.

data = {'name': ['Michael', 'Jessica', 'Sue', 'Jake', 'Amy', 'Tye'],
        'gender':[None,'F',np.NaN,'F',np.NaN, 'M'],
        'height': [123, 145, 100 , np.NaN, None, 150],
        'weight': [10, np.NaN , 30, np.NaN, None, 20],
        'age': [14, None, 29 , np.NaN, 52, 45],
        }
df = pd.DataFrame(data, columns = ['name','gender', 'height', 'weight', 'age'])
df
name gender height weight age
0 Michael None 123.0 10.0 14.0
1 Jessica F 145.0 NaN NaN
2 Sue NaN 100.0 30.0 29.0
3 Jake F NaN NaN NaN
4 Amy NaN NaN NaN 52.0
5 Tye M 150.0 20.0 45.0

What is missing data and what are the types of missing data?

Missing data in a dataset is a value that is absent or undefined and therefore carries no information for computation. Notice that our toy dataset has two kinds of missing values: None and np.nan. The difference is that None is Python's own object for "no value", while NaN (Not a Number) is a special floating-point value defined by the IEEE 754 standard, which is why it is recognized by NumPy and most other numerical systems.

Pandas was conveniently built to treat both of these as missing values. NumPy, on the other hand, treats NaN as an ordinary float, so standard aggregations propagate it; it provides special NaN-aware functions for these cases. Let's see an example below.

# create a numpy array that has a missing value
a = np.array([1, 2, np.nan, 4])
# NaN is a float, so the whole array is upcast to float64
a.dtype
>>> dtype('float64')
# a regular sum propagates NaN instead of ignoring it
np.sum(a)
>>> nan
# use nansum to ignore NaN and get the expected result
np.nansum(a)
>>> 7.0
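Pandas, in contrast, converts None to NaN under the hood and skips missing values in its aggregations by default. Here is a minimal sketch to illustrate (a throwaway Series, not part of the toy dataset):

# pandas treats both None and np.nan as missing
s = pd.Series([1, None, np.nan, 4])
s.isnull().tolist()
>>> [False, True, True, False]
# unlike np.sum, pandas skips missing values by default
s.sum()
>>> 5.0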

How can we detect missing values?

I will show three ways that I find useful to identify missing values in a dataset.

1- .info(), isnull() and notnull() are useful for detecting missing values.

# .info() is general information about a dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 5 columns):
name      6 non-null object
gender    3 non-null object
height    4 non-null float64
weight    3 non-null float64
age       4 non-null float64
dtypes: float64(3), object(2)
memory usage: 320.0+ bytes
# sum of the missing values in each column
df.isnull().sum()
name      0
gender    3
height    2
weight    3
age       2
dtype: int64
# notnull() is the opposite of isnull()
df.notnull().sum()
name      6
gender    3
height    4
weight    3
age       4
dtype: int64
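Beyond per-column counts, it is often useful to pull out the rows that contain at least one missing value. A small sketch on the same df, where any(axis=1) flags rows with a missing entry:

# select the rows that have at least one missing value
df[df.isnull().any(axis=1)]
name gender height weight age
0 Michael None 123.0 10.0 14.0
1 Jessica F 145.0 NaN NaN
2 Sue NaN 100.0 30.0 29.0
3 Jake F NaN NaN NaN
4 Amy NaN NaN NaN 52.0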

2- Missingno is a great package for quickly visualizing missing values in a dataset. More examples and features can be found in its GitHub repo.

msno.matrix(df.sample(6))
[missingno nullity matrix of the sampled rows]
msno.bar(df.sample(6))
[missingno bar chart of non-missing counts per column]
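Missingno also offers a nullity correlation heatmap, which shows how strongly the presence of one column's values relates to another's. A quick sketch (available in recent versions of missingno; check your installed version if the call is unavailable):

# nullity correlation between the columns that have missing values
msno.heatmap(df)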

3- pandas_profiling is another useful package: it gives a high-level overview of the dataset as well as detailed information for each column, including the number of missing values.

pandas_profiling.ProfileReport(df)

Overview

Dataset info

Number of variables 5
Number of observations 6
Total Missing (%) 33.3%
Total size in memory 320.0 B
Average record size in memory 53.3 B

Variables types

Numeric 3
Categorical 1
Date 0
Text (Unique) 1
Rejected 0

Warnings

  • age has 2 / 33.3% missing values Missing
  • gender has 3 / 50.0% missing values Missing
  • height has 2 / 33.3% missing values Missing
  • weight has 3 / 50.0% missing values Missing

Variables

age
Numeric

Distinct count 5
Unique (%) 125.0%
Missing (%) 33.3%
Missing (n) 2
Infinite (%) 0.0%
Infinite (n) 0
Mean 35
Minimum 14
Maximum 52
Zeros (%) 0.0%

gender
Categorical

Distinct count 4
Unique (%) 133.3%
Missing (%) 50.0%
Missing (n) 3
F          2
M          1
(Missing)  3

height
Numeric

Distinct count 5
Unique (%) 125.0%
Missing (%) 33.3%
Missing (n) 2
Infinite (%) 0.0%
Infinite (n) 0
Mean 129.5
Minimum 100
Maximum 150
Zeros (%) 0.0%

name
Categorical, Unique

First 3 values: Amy, Jake, Sue
Last 3 values: Michael, Jessica, Tye

weight
Numeric

Distinct count 4
Unique (%) 133.3%
Missing (%) 50.0%
Missing (n) 3
Infinite (%) 0.0%
Infinite (n) 0
Mean 20
Minimum 10
Maximum 30
Zeros (%) 0.0%

Sample

name gender height weight age
0 Michael None 123.0 10.0 14.0
1 Jessica F 145.0 NaN NaN
2 Sue NaN 100.0 30.0 29.0
3 Jake F NaN NaN NaN
4 Amy NaN NaN NaN 52.0
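If you want to keep the report around, the ProfileReport object can also be written out as a standalone HTML file. The exact argument name differs between pandas_profiling versions, so treat this as a sketch (the filename is just an example):

# save the profiling report as an HTML file
profile = pandas_profiling.ProfileReport(df)
profile.to_file("missing_data_report.html")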

How can we handle missing values?

The easiest way is to get rid of the rows or columns that have missing values, and pandas' built-in dropna() does exactly that. Pandas does not allow deleting a single cell; either the entire row or the entire column has to be removed.

One thing to keep in mind is that dropna() has an inplace parameter that defaults to False, which protects the dataset from changes. With inplace=True, the changes are applied to the dataset right away.

# original dataset has not changed
df.dropna()
name gender height weight age
5 Tye M 150.0 20.0 45.0
# axis=1 drops the columns that contain missing values
df.dropna(axis = 1)
name
0 Michael
1 Jessica
2 Sue
3 Jake
4 Amy
5 Tye
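dropna() is more flexible than these two calls suggest: the how and thresh parameters control how aggressively rows are dropped. A quick sketch on the same df:

# how='all' drops only the rows where every value is missing (none here)
df.dropna(how='all')
# thresh=4 keeps rows that have at least 4 non-missing values
df.dropna(thresh=4)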

In some cases you won’t want to lose any data in a dataset. In that case, use fillna(). How to fill the missing values is up to you. I will show a few ways below.

# fills all the missing values with the specified value; inplace is False
df['age'].fillna(0)
0    14.0
1     0.0
2    29.0
3     0.0
4    52.0
5    45.0
Name: age, dtype: float64
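fillna() also accepts a dictionary mapping column names to fill values, so each column can be treated differently in one call. A small sketch (the fill values below are arbitrary choices, not a recommendation):

# fill each column with its own value; inplace defaults to False
df.fillna({'gender': 'Unknown', 'age': df['age'].mean(), 'weight': 0})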

ffill means forward-fill. Here rows 2 and 4 are filled with the previous non-missing value, which is F. The first row has no previous row to copy from, so it remains None.

# inplace = False
df['gender'].fillna(method='ffill')
0    None
1       F
2       F
3       F
4       F
5       M
Name: gender, dtype: object

To overcome this you can use method='bfill', which stands for back-fill. It works the opposite way of ffill and, in our case, covers all the missing values in the gender column.

# inplace is True, so the changes are applied to the dataset
df['gender'].fillna(method='bfill', inplace=True)

A third option is to fill missing data with a statistic computed from a relevant group of rows or columns. For example, below we fill the missing values in the height column with each gender's mean height; you could just as well use the median, the mode, etc.

df['height'].fillna(df.groupby('gender')['height'].transform('mean'), inplace=True)
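To make the line above less magical: transform('mean') returns a Series aligned with df's index, where every row carries the mean height of that row's gender group, and fillna only uses those values where height is missing. A quick look at what it returns for our toy dataset:

df.groupby('gender')['height'].transform('mean')
0    122.666667
1    122.666667
2    122.666667
3    122.666667
4    150.000000
5    150.000000
Name: height, dtype: float64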

This time we will fill the weight column with the median of all values in that column.

df['weight'].fillna(df['weight'].median(), inplace=True)
# only age column has missing values
df.isnull().sum()
name      0
gender    0
height    0
weight    0
age       2
dtype: int64

And lastly, we can use interpolation to fill missing data. This method fills a missing value based on the values before and after it; by default, interpolate() fills linearly, which is what happens in our example.

df['age'].interpolate(inplace=True)
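interpolate() also accepts other strategies via its method parameter (for example 'nearest' or 'polynomial', which require SciPy) and a limit on how many consecutive missing values to fill. A sketch only, since our age column is already filled at this point:

# nearest-neighbour interpolation instead of linear (requires SciPy)
df['age'].interpolate(method='nearest')
# fill at most one consecutive missing value
df['age'].interpolate(limit=1)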

After applying the methods outlined above, our toy dataset is finally complete and has no missing values.

df
name gender height weight age
0 Michael F 123.000000 10.0 14.0
1 Jessica F 145.000000 20.0 21.5
2 Sue F 100.000000 30.0 29.0
3 Jake F 122.666667 20.0 40.5
4 Amy M 150.000000 20.0 52.0
5 Tye M 150.000000 20.0 45.0

Let's confirm that there is no missing data anymore.

msno.matrix(df)
[missingno nullity matrix: no missing values remain]
msno.bar(df)
[missingno bar chart: every column is now complete]

Further Learning

Most of the methods we have seen here take additional parameters. I recommend playing with them, changing the values, and observing the results.

As always, all the material belonging to this post can be found and downloaded on my GitHub.

Pandas documentation for working with missing data.

Find more about Interpolation here.

Handling missing data by Jake VanderPlas

Missing Data In Pandas Dataframes by Chris Albon

How to Handle Missing Data with Python by Jason Brownlee
