Aggregate and summarize your data with pandas groupby() and agg()

Einblick Content Team - May 4th, 2023

Aggregating data using one or more operations can be a really useful way to summarize large datasets. In particular, using pandas' groupby() can make this task even easier as you can determine different groups to compare. In this post, we'll cover how to use pandas' groupby() and agg() functions together so that you can easily summarize and aggregate your data.

The data we're using comes from Kaggle, and covers information about Olympic athletes from 1896 to 2016. We then subset the data for the 5 teams with the most athletes: France, Germany, Great Britain, Italy, and the United States. Check out the full code below.

Basic Syntax: df.groupby("col").agg(['func'])

For a basic use of these functions, you just need a column to group by, and a function that you want applied to all of the other numerical columns. In this example, our dataset has some columns with numeric data, and some with text data. As you can see, when we run the command, the function, currently returns a warning about the four columns that we could not aggregate over, but still prints our desired results.

# Basic aggregation
df.groupby("Season").agg(['mean'])

Output:

FutureWarning:
['Name', 'Sex', 'Team', 'Sport'] did not aggregate successfully. If any error is raised this will raise in a future version of pandas. Drop these columns/ops to avoid this warning.
Out[16]: 

  	    ID	            Age	        Height	    Weight	    Year
        mean	        mean	    mean	    mean	    mean
Season					
Summer	64592.282848	26.581148	176.627126	71.542282	1967.816088
Winter	64104.969639	25.106823	174.043178	69.697533	1987.252751

We were able to use the function to get the mean of the ID, age, height, and weight of all the athletes, grouped by the season they competed in.

Aggregate with external function: df.groupby("col").agg([func])

If you want to aggregate on a function from another package, like NumPy, you can use very similar syntax, just make sure you don't enclose the function in quotations, and don't add any parentheses.

# Aggregate with NumPy function
import numpy as np
df.groupby(["Season", "Team"]).agg([np.mean])

Output:

Aggregate for specific column and functions: df.groupby("col").col.agg(['func1', 'func2'])

If you only want to aggregate on a particular column, you can call that column after the groupby() function, as below. Additionally if you want to call multiple functions to use, you can do so by creating a list of the functions, and passing that list into the agg() function.

# Aggregate for specific column and functions
df.groupby("Season").Height.agg(['min', 'mean', 'max'])

Output:

	    min	mean	    max
Season			
Summer	137	176.627126	223
Winter	137	174.043178	200

Aggregate using lambda function: df.groupby("col").col.agg(lambda x: func(x))

If you want, you can also define your own custom function via lambda within the agg() function call.

# Aggregate using lambda function
df.groupby("Season").Height.agg(lambda x: np.max(x) * 2)

Output:

Season
Summer    446
Winter    400
Name: Height, dtype: Int32

Aggregate specific columns with specific functions: df.groupby("col").agg({'col1': func1, 'col2': func2})

If you want to aggregate specific columns with specific functions, you can do so by creating a mapping via a dictionary that is then passed to the agg() function, like below.

# Aggregate specific columns with specific functions
df.groupby(["Season", "Team"]).agg({'Height': ['min', 'mean', 'max'], 'Age': 'mean'})

Output:

Specify output column names: df.groupby("col").agg(col_name = pd.NamedAgg(column = "col", aggfunc = "func")

In the above example, it can be a bit confusing because of the nested headers. If you want to simplify the output, you can use the pd.NamedAgg() function to create new column names for each aggregated column.

# Specify output's column names
df.groupby(["Season", "Team"]).agg(min_height = pd.NamedAgg(column = "Height", aggfunc = "min"), mean_height = pd.NamedAgg(column = "Height", aggfunc = "mean"), 
max_height = pd.NamedAgg(column = "Height", aggfunc = "max"), 
mean_age = pd.NamedAgg(column = "Age", aggfunc = "mean"))

Output:

About

Einblick is an AI-native data science platform that provides data teams with an agile workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.

Start using Einblick

Pull all your data sources together, and build actionable insights on a single unified platform.

  • All connectors
  • Unlimited teammates
  • All operators