Data Cleaning with Prompt AI

Becca Weng - September 22nd, 2023

A fundamental part of any data analysis, data cleaning is as critical as it can be tedious and time-consuming. This tutorial provides examples of how to harness generative AI to expedite and automate data cleaning tasks such as changing data types, handling missing values, and more. For the purposes of this post, we'll be using a data set of Goodreads data.

Video Walkthrough of Data Cleaning with Generative AI

If you would prefer to watch a video tutorial, you can check out the 4-minute video below, which goes through all of the natural language prompts I tried out to prepare the original data set for further analysis.

If you would like to follow along directly in the canvas, you can open and fork the canvas below. All of the prompts seen in the video are in Markdown cells to the left of the Python cell generated by Einblick Prompt.

Prompt 1: Replacing values

When looking at a histogram of the language code of the books, I could see that there were multiple labels for English, based on geography (eng-CA for Canada; eng-GB for Great Britain; eng-US for United States; eng for generic English). So for my first query, I asked Prompt to consolidate these values into one.

Replace en-CA, en-GB, and en-US with eng

Prompt Output

# PROMPT: Replace en-CA, en-GB, and en-US with eng

# Replace the values 'en-CA', 'en-GB', and 'en-US' with 'eng' in the 'language_code' column of the 'book_data' dataframe

# Import the necessary libraries
import pandas as pd

# Replace the values in the 'language_code' column
book_data['language_code'].replace(['en-CA', 'en-GB', 'en-US'], 'eng', inplace=True)

Prompt 2: Change data types

Next, I had noticed earlier that the column publication_datetime contained string values rather than datetime values. So I tried the following prompt:

Convert publication datetime column to datetime object

Prompt Output

# PROMPT: Convert publication datetime column to datetime object

# Import the necessary libraries
import pandas as pd

# Convert the 'publication_datetime' column to datetime object
book_data['publication_datetime'] = pd.to_datetime(book_data['publication_datetime'], format='%Y-%m-%d')

From the Table cell, we can see that the publication_datetime column is now a datetime column, as indicated by the calendar icon in the column title, rather than the capital A, which indicates a string column.

Prompts 3-5: Filter data and drop columns

Next I wanted to subset for a particular condition and drop superfluous rows. I tried the following prompts, with the resulting output.

Get the value counts for publisher
Subset the data for the top 10 publishers
Drop isbn13 and publication date columns

Prompt Output

# PROMPT: Get the value counts for publisher

# Import the necessary libraries
import pandas as pd

# Use the 'value_counts()' function on the 'publisher' column of the 'book_data' dataframe to get the value counts
publisher_counts = book_data['publisher'].value_counts()

# Display the value counts
display(publisher_counts)

Code Output:

Vintage                         318
Penguin Books                   261
Penguin Classics                184
...
DAW                              41
G.P. Putnam's Sons               40
Harper                           40
Name: publisher, dtype: int64
# PROMPT: Subset the data for the top 10 publishers

# Subset the 'book_data' dataframe to include only the top 10 publishers
top_publishers = book_data[book_data['publisher'].isin(publisher_counts.index[:10])]

Now we can see only 10 publishers when we pull out a histogram of publishers.

# PROMPT: Drop isbn13 and publication date columns

# Drop the 'isbn13' and 'publication_date' columns from the 'top_publishers' dataframe
top_publishers = top_publishers.drop(['isbn13', 'publication_date'], axis=1)

Prompt 6-7: Extract data and create new column

Lastly, I noticed that there was book series information between parentheses in the title column. I wanted to extract this information and then save it as a new column.

Save book series information saved in () in title column
Value counts of book series

Prompt Output

# PROMPT: Save book series information saved in () in title column
# Import the necessary libraries
import pandas as pd
import re

# Create a new column named 'book_series' in the 'top_publishers' dataframe
top_publishers['book_series'] = ""

# Use regular expressions to extract the book series information from the 'title' column and assign it to the 'book_series' column
top_publishers['book_series'] = top_publishers['title'].apply(lambda x: re.findall(r'\((.*?)\)', x)[0] if re.findall(r'\((.*?)\)', x) else "")

# Display the updated dataframe
display(top_publishers.head(3))

Code Output:

# PROMPT: Value counts of book series

# Get the value counts of the 'book_series' column in the 'top_publishers' dataframe
book_series_counts = top_publishers['book_series'].value_counts()

# Print the value counts
display(book_series_counts)

Code Output:

                                      1279
Ranma 1/2 (US 2nd                        4
Drifting Classroom                       3
The Drifting Classroom                   2
Kristin Lavransdatter  #2                2
                                      ... 
Death Note  #7                           1
The Master of Hestviken  #2              1
Mitch Rapp  #9                           1
Orphans  #4                              1
Voyage of the Jerle Shannara  #1-3       1
Name: book_series, Length: 303, dtype: int64

About

Einblick is an AI-native data science platform that provides data teams with an agile workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.

Start using Einblick

Pull all your data sources together, and build actionable insights on a single unified platform.

  • All connectors
  • Unlimited teammates
  • All operators