From reviews to summaries to tweets, there is a myriad of text data to analyze. Text data can be particularly interesting--and messy to work with. Thankfully, with the rise of generative AI, working with text data has never been easier. By leveraging generative AI, you can speed up tasks like replacing rogue values, standardizing formatting, and creating new columns from existing text data. In this tutorial, we'll use a Kaggle dataset of IMDB movies to provide examples of natural language queries that can help you manipulate string data.

Video Walkthrough of String Manipulation with Generative AI
If you would prefer to watch a video tutorial, you can check out the 4-minute video below, which goes through all of the natural language prompts I tried out to manipulate the string data columns.
To follow along in the canvas, you can open and fork the one embedded below. All the natural language prompts shown in the video are provided in Markdown cells to the left of the Python cells generated.
Prompts 1-3: Filter and change data types
First, I had noticed earlier that the Released_Year
column contained string
values rather than int
values. A lot of times this can be due to a data shift somewhere, such that the wrong data type ends up in a column. So I tried the following prompt:
Print value counts of release year
NOTE: The comments in the code reproduced below were all generated by Prompt, with the exception of the top comment in each code block with starts with "PROMPT:." I added the natural language prompt used to generate the code for reference.
Prompt Output
# PROMPT: Print value counts of release year
import pandas as pd
# Use the value_counts() function on the 'Released_Year' column of the 'imdb_top_1000' dataframe
released_year_counts = imdb_top_1000['Released_Year'].value_counts()
# Print the value counts of the 'Released_Year' column
display(released_year_counts)
Code Output:
2014 32
2004 31
2009 29
2013 28
2016 28
..
1926 1
1936 1
1924 1
1921 1
PG 1
Name: Released_Year, Length: 100, dtype: int64
Sure enough, there is at least one row that is nonnumerical, and the data from the rating column was shifted over. So I try to isolate that row, and remove it before converting the whole column to integers.
Print rows where releasee year is PG
Drop the row where released year is PG. Convert released year to int
# PROMPT: Print rows where releasee year is PG
# Filter the 'imdb_top_1000' dataframe to only show rows where the 'Released_Year' column is 'PG'
filtered_df = imdb_top_1000[imdb_top_1000['Released_Year'] == 'PG']
# Display the filtered dataframe
filtered_df
Code Output:

Then we can drop the row and convert the column.
# PROMPT: Drop the row where released year is PG. Convert released year to int
# Modify the 'filtered_df' dataframe by dropping the row where 'Released_Year' is 'PG'
filtered_df = imdb_top_1000.drop(imdb_top_1000[imdb_top_1000['Released_Year'] == 'PG'].index)
# Convert the 'Released_Year' column to integer data type
filtered_df['Released_Year'] = filtered_df['Released_Year'].astype(int)
Code Output:

From the Table cell, we can see that the Released_Year
column is now numerical, indicated by the hash symbol, rather than the capital A, which denotes a string column.
Prompts 4-5: Reformat string data
Next, I had noticed earlier that the Runtime
column contained the number value for the number of minutes, followed by the word " min"
. So I asked Prompt to remove that part of the string throughout the column, and then to convert the column to integers again.
Get value counts of runtime
Remove " min" from each value in runtime, and convert runtime to int
Prompt Output
# PROMPT: Get value counts of runtime
# Import the necessary libraries
import pandas as pd
# Get the value counts of the 'Runtime' column from the 'filtered_df' dataframe
runtime_counts = filtered_df['Runtime'].value_counts()
# Display the frequency distribution of different runtimes in the dataset
display(runtime_counts)
Code Output:
100 min 23
130 min 23
129 min 22
101 min 22
113 min 22
..
185 min 1
205 min 1
212 min 1
238 min 1
78 min 1
Name: Runtime, Length: 140, dtype: int64
# PROMPT: Remove " min" from each value in runtime, and convert runtime to int
# Remove the ' min' substring from each value in the 'Runtime' column
filtered_df['Runtime'] = filtered_df['Runtime'].str.replace(' min', '')
# Convert the data type of the 'Runtime' column to integer
filtered_df['Runtime'] = filtered_df['Runtime'].astype(int)
Now the data looks like this:

Now we can see that the Runtime
column is numerical, as indicated by the hash symbol in the column title.
Prompts 6-9: Split strings and preprocess categorical variable
Lastly, I'm going to work with the Genre
column, which currently has all of the genres for each film as one long string, with genres separated by commas. If I want to use a predictive model on this data, I will likely need the genre data to be converted to several binary columns, with one column per genre. To do this, I will use the following prompts to examine and clean the data.
Get value counts of genre
Convert genre to a list of strings split on ", "
Get a unique list of genres
For each unique genre, create a binary column in filtered df. Set as 1 if the genre is in genres, and 0 otherwise
TIP: if you ever hit any errors in Einblick (regardless of if you're using Prompt), you can click the "Fix with Prompt" button to resolve the issue. Prompt will fix the problem and tell you what it changed. You can also use the "Change this cell" option to edit any code cell with Prompt.
Prompt Output
# PROMPT: Get value counts of genre
# Import the necessary libraries
import pandas as pd
# Get the value counts of the 'Genre' column from the 'filtered_df' dataframe
genre_counts = filtered_df['Genre'].value_counts()
# Display the frequency distribution of different genres in the dataset
display(genre_counts)
Code Output:
Drama 85
Drama, Romance 37
Comedy, Drama 35
Comedy, Drama, Romance 31
Action, Crime, Drama 30
..
Adventure, Thriller 1
Animation, Action, Sci-Fi 1
Action, Crime, Comedy 1
Animation, Crime, Mystery 1
Adventure, Comedy, War 1
Name: Genre, Length: 202, dtype: int64
The data all seems to fit a similar format, so let's go ahead and try to split on the commas.
# PROMPT: Convert genre to a list of strings split on ", "
# Split the 'Genre' column in the 'filtered_df' dataframe by the separator ', ' to create a list of strings
filtered_df['Genres'] = filtered_df['Genre'].str.split(', ')
# Display the modified dataframe
filtered_df.head()
Code Output:

From the code and the output, we can see that Prompt created a new column called Genres
, plural to hold the lists of genres, and has preserved the original Genre
column.
# PROMPT: Get a unique list of genres
# Extract the 'Genres' column from the 'filtered_df' dataframe
genres_column = filtered_df['Genres']
# Create an empty list to store the unique genres
unique_genres = []
# Loop through each row in the 'Genres' column
for genres in genres_column:
# Split the genres by comma and add them to the unique genres list
unique_genres.extend(genres)
# Print the unique genres list
display(list(set(unique_genres)))
Code Output:
['War',
'Sport',
'Romance',
'Drama',
'Adventure',
'Thriller',
'Music',
'Action',
'Western',
'Film-Noir',
'Sci-Fi',
'Musical',
'History',
'Family',
'Biography',
'Fantasy',
'Comedy',
'Animation',
'Crime',
'Mystery',
'Horror']
Now with the unique list of genres, it'll be easier to check that the columns get created correctly in the following step.
# PROMPT: For each unique genre, create a binary column in filtered df. Set as 1 if the genre is in genres, and 0 otherwise
# For each movie, set the value in the binary column corresponding to its genre to 1 if the genre is present in the 'Genres' column, and 0 otherwise
for index, row in filtered_df.iterrows():
genres = row['Genres']
for genre in filtered_df.columns[17:]:
if genre in genres:
filtered_df.at[index, genre] = 1
else:
filtered_df.at[index, genre] = 0
# Display the modified dataframe
filtered_df.head()
Code Output:

And there we have it. You can continue working with Prompt to further process the data, perform EDA, or start creating predictive models.
About
Einblick is an AI-native data science platform that provides data teams with an agile workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.