A fundamental part of any data analysis, data cleaning is as critical as it can be tedious and time-consuming. This tutorial provides examples of how to harness generative AI to expedite and automate data cleaning tasks such as changing data types, handling missing values, and more. For the purposes of this post, we'll be using a data set of Goodreads data.

Video Walkthrough of Data Cleaning with Generative AI
If you would prefer to watch a video tutorial, you can check out the 4-minute video below, which goes through all of the natural language prompts I tried out to prepare the original data set for further analysis.
If you would like to follow along directly in the canvas, you can open and fork the canvas below. All of the prompts seen in the video are in Markdown cells to the left of the Python cell generated by Einblick Prompt.
Prompt 1: Replacing values

When looking at a histogram of the language code of the books, I could see that there were multiple labels for English, based on geography (eng-CA
for Canada; eng-GB
for Great Britain; eng-US
for United States; eng
for generic English). So for my first query, I asked Prompt to consolidate these values into one.
Replace en-CA, en-GB, and en-US with eng
Prompt Output
# PROMPT: Replace en-CA, en-GB, and en-US with eng
# Replace the values 'en-CA', 'en-GB', and 'en-US' with 'eng' in the 'language_code' column of the 'book_data' dataframe
# Import the necessary libraries
import pandas as pd
# Replace the values in the 'language_code' column
book_data['language_code'].replace(['en-CA', 'en-GB', 'en-US'], 'eng', inplace=True)

Prompt 2: Change data types
Next, I had noticed earlier that the column publication_datetime
contained string
values rather than datetime
values. So I tried the following prompt:
Convert publication datetime column to datetime object
Prompt Output
# PROMPT: Convert publication datetime column to datetime object
# Import the necessary libraries
import pandas as pd
# Convert the 'publication_datetime' column to datetime object
book_data['publication_datetime'] = pd.to_datetime(book_data['publication_datetime'], format='%Y-%m-%d')

From the Table cell, we can see that the publication_datetime
column is now a datetime
column, as indicated by the calendar icon in the column title, rather than the capital A, which indicates a string
column.
Prompts 3-5: Filter data and drop columns
Next I wanted to subset for a particular condition and drop superfluous rows. I tried the following prompts, with the resulting output.
Get the value counts for publisher
Subset the data for the top 10 publishers
Drop isbn13 and publication date columns
Prompt Output
# PROMPT: Get the value counts for publisher
# Import the necessary libraries
import pandas as pd
# Use the 'value_counts()' function on the 'publisher' column of the 'book_data' dataframe to get the value counts
publisher_counts = book_data['publisher'].value_counts()
# Display the value counts
display(publisher_counts)
Code Output:
Vintage 318
Penguin Books 261
Penguin Classics 184
...
DAW 41
G.P. Putnam's Sons 40
Harper 40
Name: publisher, dtype: int64
# PROMPT: Subset the data for the top 10 publishers
# Subset the 'book_data' dataframe to include only the top 10 publishers
top_publishers = book_data[book_data['publisher'].isin(publisher_counts.index[:10])]
Now we can see only 10 publishers when we pull out a histogram of publishers.

# PROMPT: Drop isbn13 and publication date columns
# Drop the 'isbn13' and 'publication_date' columns from the 'top_publishers' dataframe
top_publishers = top_publishers.drop(['isbn13', 'publication_date'], axis=1)
Prompt 6-7: Extract data and create new column
Lastly, I noticed that there was book series information between parentheses in the title
column. I wanted to extract this information and then save it as a new column.
Save book series information saved in () in title column
Value counts of book series
Prompt Output
# PROMPT: Save book series information saved in () in title column
# Import the necessary libraries
import pandas as pd
import re
# Create a new column named 'book_series' in the 'top_publishers' dataframe
top_publishers['book_series'] = ""
# Use regular expressions to extract the book series information from the 'title' column and assign it to the 'book_series' column
top_publishers['book_series'] = top_publishers['title'].apply(lambda x: re.findall(r'\((.*?)\)', x)[0] if re.findall(r'\((.*?)\)', x) else "")
# Display the updated dataframe
display(top_publishers.head(3))
Code Output:

# PROMPT: Value counts of book series
# Get the value counts of the 'book_series' column in the 'top_publishers' dataframe
book_series_counts = top_publishers['book_series'].value_counts()
# Print the value counts
display(book_series_counts)
Code Output:
1279
Ranma 1/2 (US 2nd 4
Drifting Classroom 3
The Drifting Classroom 2
Kristin Lavransdatter #2 2
...
Death Note #7 1
The Master of Hestviken #2 1
Mitch Rapp #9 1
Orphans #4 1
Voyage of the Jerle Shannara #1-3 1
Name: book_series, Length: 303, dtype: int64
About
Einblick is an AI-native data science platform that provides data teams with an agile workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.