Have you ever had your data stored in a
pandas DataFrame, and wanted to use SQL to query the data? Then you have to check out the
pandasql is a Python library that allows you to use SQL syntax to work with data in the
DataFrame structure. It provides simple access to powerful data analysis functionality and makes it easier for users familiar with SQL syntax to utilize the power of
Just open and fork the canvas below to check out our example using car sales transactions, or read on for an in-depth walkthrough of how to get started with
Getting started with
# Imports and setup __requires__= 'SQLAlchemy==1.4.46' %pip install SQLAlchemy==1.4.46 %pip install pandasql import sqlalchemy from pandasql import * import pandas as pd
First make sure that you have installed the library and imported the functions. Keep in mind that a particular version of
SQLAlchemy is a dependency.
pandasql was first released back in 2017 by a company called Yhat, Inc. that focused on deploying self service data science tools. Yhat has since been acquired by Alteryx.
Write a SQL query using
# NOTE: because we're using Einblick, we don't include Line 2, as the CSV is already saved natively into the Einblick platform as df when you upload the dataset. # df = pd.read_csv("car_sales_transactions.csv") query = """SELECT Model, Type, ExteriorColor, CarYear, Price FROM df WHERE df.Model NOT LIKE '%C%'""" df2 = sqldf(query = query, env = locals())
Now that the packages have been installed and imported, and the DataFrame has been stored as
df, you can write your query. The main function in
sqldf(). The function can two arguments:
query: a formatted SQL query stored as a string
globals()to give the function access to environment variables**
The query in this example filters out car models of type C.
Check queried data
# Plot queried data, and check if columns were subset import seaborn as sns sns.set_theme() sns.countplot(data = df2, x = "Model") print(df2.columns)
Based on the output, we can see that the SQL query worked! We only have car models A and B. Happy coding!
You may have noticed in the code above that we never read in a DataFrame using
pd.read_csv() or a similar command. This is because we uploaded the CSV file directly into Einblick's platform, which automatically saves those CSV files as
pandas DataFrames. By linking the
car_sales_transaction table to the Python cell, the platform saved the data in the table as a DataFrame called
If you're not using Einblick, just make sure you've added in a line of code that instantiates a DataFrame object.
Einblick is an agile data science platform that provides data scientists with a collaborative workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick customers include Cisco, DARPA, Fuji, NetApp and USDA. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.