Pandasql: quickly query pandas DataFrames using SQL

Becca Weng - February 23rd, 2023

Have you ever had your data stored in a pandas DataFrame, and wanted to use SQL to query the data? Then you have to check out the pandasql library. pandasql is a Python library that allows you to use SQL syntax to work with data in the pandas DataFrame structure. It provides simple access to powerful data analysis functionality and makes it easier for users familiar with SQL syntax to utilize the power of pandas.

Just open and fork the canvas below to check out our example using car sales transactions, or read on for an in-depth walkthrough of how to get started with pandasql.

Getting started with pandasql

# Imports and setup
__requires__= 'SQLAlchemy==1.4.46'
%pip install SQLAlchemy==1.4.46
%pip install pandasql
import sqlalchemy
from pandasql import *
import pandas as pd

First make sure that you have installed the library and imported the functions. Keep in mind that a particular version of SQLAlchemy is a dependency. pandasql was first released back in 2017 by a company called Yhat, Inc. that focused on deploying self service data science tools. Yhat has since been acquired by Alteryx.

Write a SQL query using pandasql.sqldf()

# NOTE: because we're using Einblick, we don't include Line 2, as the CSV is already saved natively into the Einblick platform as df when you upload the dataset.
# df = pd.read_csv("car_sales_transactions.csv")

query = """SELECT Model, Type, ExteriorColor, CarYear, Price 
FROM df 
WHERE df.Model NOT LIKE '%C%'"""

df2 = sqldf(query = query, env = locals())

Now that the packages have been installed and imported, and the DataFrame has been stored as df, you can write your query. The main function in pandasql is sqldf(). The function can two arguments:

  • query: a formatted SQL query stored as a string
  • env: either locals() or globals() to give the function access to environment variables**

The query in this example filters out car models of type C.

Check queried data

# Plot queried data, and check if columns were subset
import seaborn as sns
sns.set_theme()
sns.countplot(data = df2, x = "Model")
print(df2.columns)

Output:

Seaborn histogram of car models, checking pandasql querySeaborn histogram of car models, checking pandasql query

Based on the output, we can see that the SQL query worked! We only have car models A and B. Happy coding!

**NOTE on env and df

You may have noticed in the code above that we never read in a DataFrame using pd.read_csv() or a similar command. This is because we uploaded the CSV file directly into Einblick's platform, which automatically saves those CSV files as pandas DataFrames. By linking the car_sales_transaction table to the Python cell, the platform saved the data in the table as a DataFrame called df.

If you're not using Einblick, just make sure you've added in a line of code that instantiates a DataFrame object.

Einblick's table cell linked to a Python cellEinblick's table cell linked to a Python cell

About

Einblick is an AI-native data science platform that provides data teams with an agile workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.

Start using Einblick

Pull all your data sources together, and build actionable insights on a single unified platform.

  • All connectors
  • Unlimited teammates
  • All operators