From reviews to summaries to tweets, there is a myriad of text data to analyze. Text data can be particularly interesting--and messy to work with. Thankfully, with the rise of generative AI, working with text data has never been easier. By leveraging generative AI, you can speed up tasks like replacing rogue values, standardizing formatting, and creating new columns from existing text data. In this tutorial, we'll use a Kaggle dataset of IMDB movies to provide examples of natural language queries that can help you manipulate string data.
Video Walkthrough of String Manipulation with Generative AI
If you would prefer to watch a video tutorial, you can check out the 4-minute video below, which goes through all of the natural language prompts I tried out to manipulate the string data columns.
To follow along in the canvas, you can open and fork the one embedded below. All the natural language prompts shown in the video are provided in Markdown cells to the left of the Python cells generated.
Prompts 1-3: Filter and change data types
First, I had noticed earlier that the
Released_Year column contained
string values rather than
int values. A lot of times this can be due to a data shift somewhere, such that the wrong data type ends up in a column. So I tried the following prompt:
Print value counts of release year
NOTE: The comments in the code reproduced below were all generated by Prompt, with the exception of the top comment in each code block with starts with "PROMPT:." I added the natural language prompt used to generate the code for reference.
# PROMPT: Print value counts of release year import pandas as pd # Use the value_counts() function on the 'Released_Year' column of the 'imdb_top_1000' dataframe released_year_counts = imdb_top_1000['Released_Year'].value_counts() # Print the value counts of the 'Released_Year' column display(released_year_counts)
2014 32 2004 31 2009 29 2013 28 2016 28 .. 1926 1 1936 1 1924 1 1921 1 PG 1 Name: Released_Year, Length: 100, dtype: int64
Sure enough, there is at least one row that is nonnumerical, and the data from the rating column was shifted over. So I try to isolate that row, and remove it before converting the whole column to integers.
Print rows where releasee year is PG
Drop the row where released year is PG. Convert released year to int
# PROMPT: Print rows where releasee year is PG # Filter the 'imdb_top_1000' dataframe to only show rows where the 'Released_Year' column is 'PG' filtered_df = imdb_top_1000[imdb_top_1000['Released_Year'] == 'PG'] # Display the filtered dataframe filtered_df
Then we can drop the row and convert the column.
# PROMPT: Drop the row where released year is PG. Convert released year to int # Modify the 'filtered_df' dataframe by dropping the row where 'Released_Year' is 'PG' filtered_df = imdb_top_1000.drop(imdb_top_1000[imdb_top_1000['Released_Year'] == 'PG'].index) # Convert the 'Released_Year' column to integer data type filtered_df['Released_Year'] = filtered_df['Released_Year'].astype(int)
From the Table cell, we can see that the
Released_Year column is now numerical, indicated by the hash symbol, rather than the capital A, which denotes a string column.
Prompts 4-5: Reformat string data
Next, I had noticed earlier that the
Runtime column contained the number value for the number of minutes, followed by the word
" min". So I asked Prompt to remove that part of the string throughout the column, and then to convert the column to integers again.
Get value counts of runtime
Remove " min" from each value in runtime, and convert runtime to int
# PROMPT: Get value counts of runtime # Import the necessary libraries import pandas as pd # Get the value counts of the 'Runtime' column from the 'filtered_df' dataframe runtime_counts = filtered_df['Runtime'].value_counts() # Display the frequency distribution of different runtimes in the dataset display(runtime_counts)
100 min 23 130 min 23 129 min 22 101 min 22 113 min 22 .. 185 min 1 205 min 1 212 min 1 238 min 1 78 min 1 Name: Runtime, Length: 140, dtype: int64
# PROMPT: Remove " min" from each value in runtime, and convert runtime to int # Remove the ' min' substring from each value in the 'Runtime' column filtered_df['Runtime'] = filtered_df['Runtime'].str.replace(' min', '') # Convert the data type of the 'Runtime' column to integer filtered_df['Runtime'] = filtered_df['Runtime'].astype(int)
Now the data looks like this:
Now we can see that the
Runtime column is numerical, as indicated by the hash symbol in the column title.
Prompts 6-9: Split strings and preprocess categorical variable
Lastly, I'm going to work with the
Genre column, which currently has all of the genres for each film as one long string, with genres separated by commas. If I want to use a predictive model on this data, I will likely need the genre data to be converted to several binary columns, with one column per genre. To do this, I will use the following prompts to examine and clean the data.
Get value counts of genre
Convert genre to a list of strings split on ", "
Get a unique list of genres
For each unique genre, create a binary column in filtered df. Set as 1 if the genre is in genres, and 0 otherwise
TIP: if you ever hit any errors in Einblick (regardless of if you're using Prompt), you can click the "Fix with Prompt" button to resolve the issue. Prompt will fix the problem and tell you what it changed. You can also use the "Change this cell" option to edit any code cell with Prompt.
# PROMPT: Get value counts of genre # Import the necessary libraries import pandas as pd # Get the value counts of the 'Genre' column from the 'filtered_df' dataframe genre_counts = filtered_df['Genre'].value_counts() # Display the frequency distribution of different genres in the dataset display(genre_counts)
Drama 85 Drama, Romance 37 Comedy, Drama 35 Comedy, Drama, Romance 31 Action, Crime, Drama 30 .. Adventure, Thriller 1 Animation, Action, Sci-Fi 1 Action, Crime, Comedy 1 Animation, Crime, Mystery 1 Adventure, Comedy, War 1 Name: Genre, Length: 202, dtype: int64
The data all seems to fit a similar format, so let's go ahead and try to split on the commas.
# PROMPT: Convert genre to a list of strings split on ", " # Split the 'Genre' column in the 'filtered_df' dataframe by the separator ', ' to create a list of strings filtered_df['Genres'] = filtered_df['Genre'].str.split(', ') # Display the modified dataframe filtered_df.head()
From the code and the output, we can see that Prompt created a new column called
Genres, plural to hold the lists of genres, and has preserved the original
# PROMPT: Get a unique list of genres # Extract the 'Genres' column from the 'filtered_df' dataframe genres_column = filtered_df['Genres'] # Create an empty list to store the unique genres unique_genres =  # Loop through each row in the 'Genres' column for genres in genres_column: # Split the genres by comma and add them to the unique genres list unique_genres.extend(genres) # Print the unique genres list display(list(set(unique_genres)))
['War', 'Sport', 'Romance', 'Drama', 'Adventure', 'Thriller', 'Music', 'Action', 'Western', 'Film-Noir', 'Sci-Fi', 'Musical', 'History', 'Family', 'Biography', 'Fantasy', 'Comedy', 'Animation', 'Crime', 'Mystery', 'Horror']
Now with the unique list of genres, it'll be easier to check that the columns get created correctly in the following step.
# PROMPT: For each unique genre, create a binary column in filtered df. Set as 1 if the genre is in genres, and 0 otherwise # For each movie, set the value in the binary column corresponding to its genre to 1 if the genre is present in the 'Genres' column, and 0 otherwise for index, row in filtered_df.iterrows(): genres = row['Genres'] for genre in filtered_df.columns[17:]: if genre in genres: filtered_df.at[index, genre] = 1 else: filtered_df.at[index, genre] = 0 # Display the modified dataframe filtered_df.head()
And there we have it. You can continue working with Prompt to further process the data, perform EDA, or start creating predictive models.
Einblick is an AI-native data science platform that provides data teams with an agile workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.