Handling missing data with pandas df.dropna()

Einblick Content Team - April 19th, 2023

pandas dropna() is a function used to remove rows or columns with missing values (NaN) from a DataFrame. There are several interesting arguments you can leverage to tailor how missing data is handled. In this post, we’ll review the axis, how, thresh, and subset arguments. Open and fork the canvas below to see all of the code, or keep reading along.

Basic Syntax: drop rows, df.dropna(axis = 0, how = ‘any’, inplace = False)

The pandas dropna() function has 3 default argument values:

  • axis: default value is 0, if 0, function evaluates missing values by row, if 1, function evaluates missing values by column
  • how: default value is 'any', so any rows or columns with any missing values will be dropped
  • inplace: whether or not to alter the original DataFrame, default value is False

We're using data from Kaggle about Olympic athletes. The original shape of the data is as such:

df.shape

Output:

Out[22]: (271116, 15)

So when we use the default arguments for df.dropna(), we expect to get a DataFrame with fewer than 271,116 rows.

# Default: axis = 0, how = 'any'
df2 = df.dropna()

df2.info()

Output:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 206165 entries, 0 to 271115
Data columns (total 15 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   ID      206165 non-null  Int32  
 1   Name    206165 non-null  string 
 2   Sex     206165 non-null  string 
 3   Age     206165 non-null  Int32  
 4   Height  206165 non-null  Int32  
 5   Weight  206165 non-null  Float32
 6   Team    206165 non-null  string 
 7   NOC     206165 non-null  string 
 8   Games   206165 non-null  string 
 9   Year    206165 non-null  Int32  
 10  Season  206165 non-null  string 
 11  City    206165 non-null  string 
 12  Sport   206165 non-null  string 
 13  Event   206165 non-null  string 
 14  Medal   206165 non-null  string 
dtypes: Float32(1), Int32(4), string(10)
memory usage: 22.2 MB

Example 1: drop columns, df.dropna(axis = 1)

If rather than dropping rows with missing values, you want to drop columns with missing values, you simply set axis = 1. In the below example, you can see that 3 columns have been dropped, and the number of rows is still 271,116.

df3 = df.dropna(axis = 1)

df3.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 12 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   ID      271116 non-null  Int32 
 1   Name    271116 non-null  string
 2   Sex     271116 non-null  string
 3   Team    271116 non-null  string
 4   NOC     271116 non-null  string
 5   Games   271116 non-null  string
 6   Year    271116 non-null  Int32 
 7   Season  271116 non-null  string
 8   City    271116 non-null  string
 9   Sport   271116 non-null  string
 10  Event   271116 non-null  string
 11  Medal   271116 non-null  string
dtypes: Int32(2), string(10)
memory usage: 23.3 MB

Example 2: drop if all NA, df.dropna(how = ‘all’)

One interesting argument when using pandas dropna() is the how argument. There are two values that how can take on:

  • 'any' drops a row or column if any of its values are missing
  • 'all' drops a row or column only if ALL its values are missing

The default is how = 'any'.

df4 = df.dropna(how = 'all')

df4.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 15 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   ID      271116 non-null  Int32  
 1   Name    271116 non-null  string 
 2   Sex     271116 non-null  string 
 3   Age     261642 non-null  Int32  
 4   Height  210945 non-null  Int32  
 5   Weight  208241 non-null  Float32
 6   Team    271116 non-null  string 
 7   NOC     271116 non-null  string 
 8   Games   271116 non-null  string 
 9   Year    271116 non-null  Int32  
 10  Season  271116 non-null  string 
 11  City    271116 non-null  string 
 12  Sport   271116 non-null  string 
 13  Event   271116 non-null  string 
 14  Medal   271116 non-null  string 
dtypes: Float32(1), Int32(4), string(10)
memory usage: 27.1 MB

Example 3: drop based on threshold, df.dropna(thresh = 265000)

Another way to configure how pandas drops missing values is the thresh argument. thresh requires a minimum number of non-null values for the row/column to be kept. As you can see in the example below, all columns that have fewer than 265,000 rows have been dropped.

df5 = df.dropna(axis = 1, thresh = 265000)

df5.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 12 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   ID      271116 non-null  Int32 
 1   Name    271116 non-null  string
 2   Sex     271116 non-null  string
 3   Team    271116 non-null  string
 4   NOC     271116 non-null  string
 5   Games   271116 non-null  string
 6   Year    271116 non-null  Int32 
 7   Season  271116 non-null  string
 8   City    271116 non-null  string
 9   Sport   271116 non-null  string
 10  Event   271116 non-null  string
 11  Medal   271116 non-null  string
dtypes: Int32(2), string(10)
memory usage: 23.3 MB

Example 4: drop based on subset, df.dropna(subset = [“Height”])

Lastly, the subset argument allows you to specify the columns that should be used when evaluating which rows to evaluate for missing values.

df6 = df.dropna(axis = 0, subset = ["Height"])

df6.info()

Output:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 210945 entries, 0 to 271115
Data columns (total 15 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   ID      210945 non-null  Int32  
 1   Name    210945 non-null  string 
 2   Sex     210945 non-null  string 
 3   Age     210068 non-null  Int32  
 4   Height  210945 non-null  Int32  
 5   Weight  206853 non-null  Float32
 6   Team    210945 non-null  string 
 7   NOC     210945 non-null  string 
 8   Games   210945 non-null  string 
 9   Year    210945 non-null  Int32  
 10  Season  210945 non-null  string 
 11  City    210945 non-null  string 
 12  Sport   210945 non-null  string 
 13  Event   210945 non-null  string 
 14  Medal   210945 non-null  string 
dtypes: Float32(1), Int32(4), string(10)
memory usage: 22.7 MB

About

Einblick is an AI-native data science platform that provides data teams with an agile workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.