pandas basics: groupby() with count()

Einblick Content Team - April 6th, 2023

One of the most powerful functions in pandas is the groupby() function, which is an efficient way of summarizing large datasets with just a few lines of code. Groupby() allows you to get aggregate statistics about multiple categories across one or more variables in a dataset. In this post, we’ll combine groupby() with the function, count(). We’ll cover basic syntax and a few examples, as well as compare count() and size().

In the below code and canvas, we're using a subset of movie data from Netflix. Additionally, we added in a couple of movies with missing data to illustrate a few arguments and functions below.

Basic Syntax: df.groupby(by = [“col1”, “col2”, …]).count()

The main argument for df.groupby() is by, which specifies a column or columns to group the results. In the below example, we used 2 columns: "release_year" and "rating". Then we can call the count() function on the results to get a list of counts of each group.

# Group by Release Year and Rating, aggregate by count
movies.groupby(by = ["release_year", "rating"]).count()

Output:

                        type	title
release_year	rating		
    2001           G	3	    2
                  PG	5	    5
               PG-13	8	    8
                   R	14	    14
               TV-14	6	    6
               TV-MA	2	    2
               TV-PG	2	    2
               TV-Y7	1	    1
...
    2021	      PG	11	    11
               PG-13	14	    14
                   R	21	    21
               TV-14	73	    73
                TV-G	9	    9
               TV-MA	110	    110
               TV-PG	20	    20
                TV-Y	6	    6
               TV-Y7	13	    13

Additional groupby() Arguments: as_index and dropna

Example 1: df.groupby(as_index = False).count()

If you would like to retain the columns you are grouping by, you can set the argument as_index = False, like so:

# Group by Release Year and Rating, aggregate by count
movies.groupby(by = ["release_year", "rating"], as_index = False).count()

Output:

	release_year   rating   type   title
0	2001	       G	    3	   2
1	2001	       PG	    5	   5
2	2001	       PG-13	8	   8
3	2001	       R	    14	   14
4	2001	       TV-14	6	   6
5	2001	       TV-MA	2	   2
6	2001	       TV-PG	2	   2
7	2001	       TV-Y7	1	   1
...
199	2021	       PG	    11	   11
200	2021	       PG-13	14	   14
201	2021	       R	    21	   21
202	2021	       TV-14	73	   73
203	2021	       TV-G   	9	   9
204	2021	       TV-MA	110	   110
205	2021	       TV-PG	20	   20
206	2021	       TV-Y	    6	   6
207	2021	       TV-Y7	13	   13

Rather than having the grouped columns as the indices, each row has a complete set of values for all of the variables.

Example 2: df.groupby(dropna = False).count()

If you have missing values in columns you are grouping by, and you would like to keep count of them, set dropna = False, like so:

# Group by Release Year and Rating, aggregate by count
movies2.groupby(by = ["release_year", "rating"], dropna = False).count()

Output:

                        type	title
release_year	rating		
    2001           G	3	    2
                  PG	5	    5
               PG-13	8	    8
                   R	14	    14
               TV-14	6	    6
               TV-MA	2	    2
               TV-PG	2	    2
               TV-Y7	1	    1
                 NaN    1       1
...
    2021	      PG	11	    11
               PG-13	14	    14
                   R	21	    21
               TV-14	73	    73
                TV-G	9	    9
               TV-MA	110	    110
               TV-PG	20	    20
                TV-Y	6	    6
               TV-Y7	13	    13

As you can see, there is now a new row under "release_year" = 2001, where "rating" = NaN.

pandas groupby count() vs. pandas groupby size()

In contrast to count(), which does not include missing values in the counts provided (see the first row of the previous examples where under the "type" column, there are 3 movies released in 2001 with a G-rating, but there are only 2 movies in the "title" column released in 2001 with a G-rating. There is a movie with a missing title that count() is not including. The size() function, however, only cares about how many rows of data are in the groups:

# Group by Release Year and Rating, aggregate by count
movies2.groupby(by = ["release_year", "rating"]).size()

Output:

release_year  rating  
2001          G             3
              PG            5
              PG-13         8
              R            14
              TV-14         6
              TV-MA         2
              TV-PG         2
              TV-Y7         1
...
2021          PG           11
              PG-13        14
              R            21
              TV-14        73
              TV-G          9
              TV-MA       110
              TV-PG        20
              TV-Y          6
              TV-Y7        13

About

Einblick is an AI-native data science platform that provides data teams with an agile workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.