A fundamental part of any data analysis, data cleaning is as critical as it can be tedious and time-consuming. This tutorial provides examples of how to harness generative AI to expedite and automate data cleaning tasks such as changing data types, handling missing values, and more. For the purposes of this post, we'll be using a data set of Goodreads data.
Video Walkthrough of Data Cleaning with Generative AI
If you would prefer to watch a video tutorial, you can check out the 4-minute video below, which goes through all of the natural language prompts I tried out to prepare the original data set for further analysis.
If you would like to follow along directly in the canvas, you can open and fork the canvas below. All of the prompts seen in the video are in Markdown cells to the left of the Python cell generated by Einblick Prompt.
Prompt 1: Replacing values
When looking at a histogram of the language code of the books, I could see that there were multiple labels for English, based on geography (
eng-CA for Canada;
eng-GB for Great Britain;
eng-US for United States;
eng for generic English). So for my first query, I asked Prompt to consolidate these values into one.
Replace en-CA, en-GB, and en-US with eng
# PROMPT: Replace en-CA, en-GB, and en-US with eng # Replace the values 'en-CA', 'en-GB', and 'en-US' with 'eng' in the 'language_code' column of the 'book_data' dataframe # Import the necessary libraries import pandas as pd # Replace the values in the 'language_code' column book_data['language_code'].replace(['en-CA', 'en-GB', 'en-US'], 'eng', inplace=True)
Prompt 2: Change data types
Next, I had noticed earlier that the column
string values rather than
datetime values. So I tried the following prompt:
Convert publication datetime column to datetime object
# PROMPT: Convert publication datetime column to datetime object # Import the necessary libraries import pandas as pd # Convert the 'publication_datetime' column to datetime object book_data['publication_datetime'] = pd.to_datetime(book_data['publication_datetime'], format='%Y-%m-%d')
From the Table cell, we can see that the
publication_datetime column is now a
datetime column, as indicated by the calendar icon in the column title, rather than the capital A, which indicates a
Prompts 3-5: Filter data and drop columns
Next I wanted to subset for a particular condition and drop superfluous rows. I tried the following prompts, with the resulting output.
Get the value counts for publisher
Subset the data for the top 10 publishers
Drop isbn13 and publication date columns
# PROMPT: Get the value counts for publisher # Import the necessary libraries import pandas as pd # Use the 'value_counts()' function on the 'publisher' column of the 'book_data' dataframe to get the value counts publisher_counts = book_data['publisher'].value_counts() # Display the value counts display(publisher_counts)
Vintage 318 Penguin Books 261 Penguin Classics 184 ... DAW 41 G.P. Putnam's Sons 40 Harper 40 Name: publisher, dtype: int64
# PROMPT: Subset the data for the top 10 publishers # Subset the 'book_data' dataframe to include only the top 10 publishers top_publishers = book_data[book_data['publisher'].isin(publisher_counts.index[:10])]
Now we can see only 10 publishers when we pull out a histogram of publishers.
# PROMPT: Drop isbn13 and publication date columns # Drop the 'isbn13' and 'publication_date' columns from the 'top_publishers' dataframe top_publishers = top_publishers.drop(['isbn13', 'publication_date'], axis=1)
Prompt 6-7: Extract data and create new column
Lastly, I noticed that there was book series information between parentheses in the
title column. I wanted to extract this information and then save it as a new column.
Save book series information saved in () in title column
Value counts of book series
# PROMPT: Save book series information saved in () in title column # Import the necessary libraries import pandas as pd import re # Create a new column named 'book_series' in the 'top_publishers' dataframe top_publishers['book_series'] = "" # Use regular expressions to extract the book series information from the 'title' column and assign it to the 'book_series' column top_publishers['book_series'] = top_publishers['title'].apply(lambda x: re.findall(r'\((.*?)\)', x) if re.findall(r'\((.*?)\)', x) else "") # Display the updated dataframe display(top_publishers.head(3))
# PROMPT: Value counts of book series # Get the value counts of the 'book_series' column in the 'top_publishers' dataframe book_series_counts = top_publishers['book_series'].value_counts() # Print the value counts display(book_series_counts)
1279 Ranma 1/2 (US 2nd 4 Drifting Classroom 3 The Drifting Classroom 2 Kristin Lavransdatter #2 2 ... Death Note #7 1 The Master of Hestviken #2 1 Mitch Rapp #9 1 Orphans #4 1 Voyage of the Jerle Shannara #1-3 1 Name: book_series, Length: 303, dtype: int64
Einblick is an AI-native data science platform that provides data teams with an agile workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.