One of the most powerful functions in pandas is the groupby() function, which is an efficient way of summarizing large datasets with just a few lines of code. Groupby()
allows you to get aggregate statistics about multiple categories across one or more variables in a dataset. In this post, we’ll combine groupby()
with the function, count()
. We’ll cover basic syntax and a few examples, as well as compare count()
and size()
.
In the below code and canvas, we're using a subset of movie data from Netflix. Additionally, we added in a couple of movies with missing data to illustrate a few arguments and functions below.
df.groupby(by = [“col1”, “col2”, …]).count()
Basic Syntax: The main argument for df.groupby()
is by
, which specifies a column or columns to group the results. In the below example, we used 2 columns: "release_year"
and "rating"
. Then we can call the count()
function on the results to get a list of counts of each group.
# Group by Release Year and Rating, aggregate by count
movies.groupby(by = ["release_year", "rating"]).count()
Output:
type title
release_year rating
2001 G 3 2
PG 5 5
PG-13 8 8
R 14 14
TV-14 6 6
TV-MA 2 2
TV-PG 2 2
TV-Y7 1 1
...
2021 PG 11 11
PG-13 14 14
R 21 21
TV-14 73 73
TV-G 9 9
TV-MA 110 110
TV-PG 20 20
TV-Y 6 6
TV-Y7 13 13
as_index
and dropna
Additional groupby() Arguments: df.groupby(as_index = False).count()
Example 1: If you would like to retain the columns you are grouping by, you can set the argument as_index = False
, like so:
# Group by Release Year and Rating, aggregate by count
movies.groupby(by = ["release_year", "rating"], as_index = False).count()
Output:
release_year rating type title
0 2001 G 3 2
1 2001 PG 5 5
2 2001 PG-13 8 8
3 2001 R 14 14
4 2001 TV-14 6 6
5 2001 TV-MA 2 2
6 2001 TV-PG 2 2
7 2001 TV-Y7 1 1
...
199 2021 PG 11 11
200 2021 PG-13 14 14
201 2021 R 21 21
202 2021 TV-14 73 73
203 2021 TV-G 9 9
204 2021 TV-MA 110 110
205 2021 TV-PG 20 20
206 2021 TV-Y 6 6
207 2021 TV-Y7 13 13
Rather than having the grouped columns as the indices, each row has a complete set of values for all of the variables.
df.groupby(dropna = False).count()
Example 2: If you have missing values in columns you are grouping by, and you would like to keep count of them, set dropna = False
, like so:
# Group by Release Year and Rating, aggregate by count
movies2.groupby(by = ["release_year", "rating"], dropna = False).count()
Output:
type title
release_year rating
2001 G 3 2
PG 5 5
PG-13 8 8
R 14 14
TV-14 6 6
TV-MA 2 2
TV-PG 2 2
TV-Y7 1 1
NaN 1 1
...
2021 PG 11 11
PG-13 14 14
R 21 21
TV-14 73 73
TV-G 9 9
TV-MA 110 110
TV-PG 20 20
TV-Y 6 6
TV-Y7 13 13
As you can see, there is now a new row under "release_year" = 2001
, where "rating" = NaN
.
groupby count()
vs. pandas groupby size()
pandas In contrast to count(), which does not include missing values in the counts provided (see the first row of the previous examples where under the "type"
column, there are 3 movies released in 2001 with a G-rating, but there are only 2 movies in the "title"
column released in 2001 with a G-rating. There is a movie with a missing title that count()
is not including. The size()
function, however, only cares about how many rows of data are in the groups:
# Group by Release Year and Rating, aggregate by count
movies2.groupby(by = ["release_year", "rating"]).size()
Output:
release_year rating
2001 G 3
PG 5
PG-13 8
R 14
TV-14 6
TV-MA 2
TV-PG 2
TV-Y7 1
...
2021 PG 11
PG-13 14
R 21
TV-14 73
TV-G 9
TV-MA 110
TV-PG 20
TV-Y 6
TV-Y7 13
About
Einblick is an agile data science platform that provides data scientists with a collaborative workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick customers include Cisco, DARPA, Fuji, NetApp and USDA. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.