Aggregating data using one or more operations can be a really useful way to summarize large datasets. In particular, using pandas' groupby()
can make this task even easier as you can determine different groups to compare. In this post, we'll cover how to use pandas' groupby()
and agg()
functions together so that you can easily summarize and aggregate your data.
The data we're using comes from Kaggle, and covers information about Olympic athletes from 1896 to 2016. We then subset the data for the 5 teams with the most athletes: France, Germany, Great Britain, Italy, and the United States. Check out the full code below.
df.groupby("col").agg(['func'])
Basic Syntax: For a basic use of these functions, you just need a column to group by, and a function that you want applied to all of the other numerical columns. In this example, our dataset has some columns with numeric data, and some with text data. As you can see, when we run the command, the function, currently returns a warning about the four columns that we could not aggregate over, but still prints our desired results.
# Basic aggregation
df.groupby("Season").agg(['mean'])
Output:
FutureWarning:
['Name', 'Sex', 'Team', 'Sport'] did not aggregate successfully. If any error is raised this will raise in a future version of pandas. Drop these columns/ops to avoid this warning.
Out[16]:
ID Age Height Weight Year
mean mean mean mean mean
Season
Summer 64592.282848 26.581148 176.627126 71.542282 1967.816088
Winter 64104.969639 25.106823 174.043178 69.697533 1987.252751
We were able to use the function to get the mean of the ID, age, height, and weight of all the athletes, grouped by the season they competed in.
df.groupby("col").agg([func])
Aggregate with external function: If you want to aggregate on a function from another package, like NumPy, you can use very similar syntax, just make sure you don't enclose the function in quotations, and don't add any parentheses.
# Aggregate with NumPy function
import numpy as np
df.groupby(["Season", "Team"]).agg([np.mean])
Output:

df.groupby("col").col.agg(['func1', 'func2'])
Aggregate for specific column and functions: If you only want to aggregate on a particular column, you can call that column after the groupby()
function, as below. Additionally if you want to call multiple functions to use, you can do so by creating a list of the functions, and passing that list into the agg()
function.
# Aggregate for specific column and functions
df.groupby("Season").Height.agg(['min', 'mean', 'max'])
Output:
min mean max
Season
Summer 137 176.627126 223
Winter 137 174.043178 200
df.groupby("col").col.agg(lambda x: func(x))
Aggregate using lambda function: If you want, you can also define your own custom function via lambda
within the agg()
function call.
# Aggregate using lambda function
df.groupby("Season").Height.agg(lambda x: np.max(x) * 2)
Output:
Season
Summer 446
Winter 400
Name: Height, dtype: Int32
df.groupby("col").agg({'col1': func1, 'col2': func2})
Aggregate specific columns with specific functions: If you want to aggregate specific columns with specific functions, you can do so by creating a mapping via a dictionary that is then passed to the agg()
function, like below.
# Aggregate specific columns with specific functions
df.groupby(["Season", "Team"]).agg({'Height': ['min', 'mean', 'max'], 'Age': 'mean'})
Output:

df.groupby("col").agg(col_name = pd.NamedAgg(column = "col", aggfunc = "func")
Specify output column names: In the above example, it can be a bit confusing because of the nested headers. If you want to simplify the output, you can use the pd.NamedAgg()
function to create new column names for each aggregated column.
# Specify output's column names
df.groupby(["Season", "Team"]).agg(min_height = pd.NamedAgg(column = "Height", aggfunc = "min"), mean_height = pd.NamedAgg(column = "Height", aggfunc = "mean"),
max_height = pd.NamedAgg(column = "Height", aggfunc = "max"),
mean_age = pd.NamedAgg(column = "Age", aggfunc = "mean"))
Output:

About
Einblick is an AI-native data science platform that provides data teams with an agile workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.