pandas
dropna()
is a function used to remove rows or columns with missing values (NaN
) from a DataFrame
. There are several interesting arguments you can leverage to tailor how missing data is handled. In this post, we’ll review the axis
, how
, thresh
, and subset
arguments. Open and fork the canvas below to see all of the code, or keep reading along.
df.dropna(axis = 0, how = ‘any’, inplace = False)
Basic Syntax: drop rows, The pandas
dropna()
function has 3 default argument values:
axis
: default value is 0, if 0, function evaluates missing values by row, if 1, function evaluates missing values by columnhow
: default value is'any'
, so any rows or columns with any missing values will be droppedinplace
: whether or not to alter the originalDataFrame
, default value isFalse
We're using data from Kaggle about Olympic athletes. The original shape of the data is as such:
df.shape
Output:
Out[22]: (271116, 15)
So when we use the default arguments for df.dropna()
, we expect to get a DataFrame with fewer than 271,116 rows.
# Default: axis = 0, how = 'any'
df2 = df.dropna()
df2.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 206165 entries, 0 to 271115
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 206165 non-null Int32
1 Name 206165 non-null string
2 Sex 206165 non-null string
3 Age 206165 non-null Int32
4 Height 206165 non-null Int32
5 Weight 206165 non-null Float32
6 Team 206165 non-null string
7 NOC 206165 non-null string
8 Games 206165 non-null string
9 Year 206165 non-null Int32
10 Season 206165 non-null string
11 City 206165 non-null string
12 Sport 206165 non-null string
13 Event 206165 non-null string
14 Medal 206165 non-null string
dtypes: Float32(1), Int32(4), string(10)
memory usage: 22.2 MB
df.dropna(axis = 1)
Example 1: drop columns, If rather than dropping rows with missing values, you want to drop columns with missing values, you simply set axis = 1
. In the below example, you can see that 3 columns have been dropped, and the number of rows is still 271,116.
df3 = df.dropna(axis = 1)
df3.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 271116 non-null Int32
1 Name 271116 non-null string
2 Sex 271116 non-null string
3 Team 271116 non-null string
4 NOC 271116 non-null string
5 Games 271116 non-null string
6 Year 271116 non-null Int32
7 Season 271116 non-null string
8 City 271116 non-null string
9 Sport 271116 non-null string
10 Event 271116 non-null string
11 Medal 271116 non-null string
dtypes: Int32(2), string(10)
memory usage: 23.3 MB
df.dropna(how = ‘all’)
Example 2: drop if all NA, One interesting argument when using pandas
dropna()
is the how
argument. There are two values that how
can take on:
'any'
drops a row or column if any of its values are missing'all'
drops a row or column only if ALL its values are missing
The default is how = 'any'
.
df4 = df.dropna(how = 'all')
df4.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 271116 non-null Int32
1 Name 271116 non-null string
2 Sex 271116 non-null string
3 Age 261642 non-null Int32
4 Height 210945 non-null Int32
5 Weight 208241 non-null Float32
6 Team 271116 non-null string
7 NOC 271116 non-null string
8 Games 271116 non-null string
9 Year 271116 non-null Int32
10 Season 271116 non-null string
11 City 271116 non-null string
12 Sport 271116 non-null string
13 Event 271116 non-null string
14 Medal 271116 non-null string
dtypes: Float32(1), Int32(4), string(10)
memory usage: 27.1 MB
df.dropna(thresh = 265000)
Example 3: drop based on threshold, Another way to configure how pandas
drops missing values is the thresh
argument. thresh
requires a minimum number of non-null values for the row/column to be kept. As you can see in the example below, all columns that have fewer than 265,000 rows have been dropped.
df5 = df.dropna(axis = 1, thresh = 265000)
df5.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 271116 non-null Int32
1 Name 271116 non-null string
2 Sex 271116 non-null string
3 Team 271116 non-null string
4 NOC 271116 non-null string
5 Games 271116 non-null string
6 Year 271116 non-null Int32
7 Season 271116 non-null string
8 City 271116 non-null string
9 Sport 271116 non-null string
10 Event 271116 non-null string
11 Medal 271116 non-null string
dtypes: Int32(2), string(10)
memory usage: 23.3 MB
df.dropna(subset = [“Height”])
Example 4: drop based on subset, Lastly, the subset
argument allows you to specify the columns that should be used when evaluating which rows to evaluate for missing values.
df6 = df.dropna(axis = 0, subset = ["Height"])
df6.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 210945 entries, 0 to 271115
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 210945 non-null Int32
1 Name 210945 non-null string
2 Sex 210945 non-null string
3 Age 210068 non-null Int32
4 Height 210945 non-null Int32
5 Weight 206853 non-null Float32
6 Team 210945 non-null string
7 NOC 210945 non-null string
8 Games 210945 non-null string
9 Year 210945 non-null Int32
10 Season 210945 non-null string
11 City 210945 non-null string
12 Sport 210945 non-null string
13 Event 210945 non-null string
14 Medal 210945 non-null string
dtypes: Float32(1), Int32(4), string(10)
memory usage: 22.7 MB
About
Einblick is an AI-native data science platform that provides data teams with an agile workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.