Logistic Regression with Prompt AI

Becca Weng - October 23rd, 2023

In this tutorial, I go through an example workflow to build a first classification model, in this case a logistic regression model. Logistic regression, along with linear regression, is a fundamental part of supervised learning for any data analyst or data scientist. With Prompt, Einblick's AI agent that can write and fix code, create charts, and much more, building a model has never been faster or simpler. You still need to examine your data and the results to guide Prompt effectively, but natural language is truly the next step for programming.

The Data and Notebook

If you want to follow along with the video, feel free to open and fork the canvas-based notebook below.

Next, you can take a quick look at the movie data we have available:

Prompts 1-6: Data Cleaning

From a quick glance at the data, I saw that many movie genres were represented in the dataset, but comedy and action movies were the most common, so I started by filtering for just those two genres. Then, through a few more prompts, I continued to narrow the dataset before doing some EDA and building the model. The prompts I used are as follows:

Subset for comedy and action movies
Print the value counts of genre
Bar chart of rating
Subset for movies rated PG, PG-13, and R
Value counts of location
Subset for location = United States
# PROMPT: Subset for comedy and action movies

# Subset the 'movies_data' dataframe by filtering the 'genre' column for 'Comedy' and 'Action' genres

movies_data_subset = movies_data[(movies_data['genre'] == 'Comedy') | (movies_data['genre'] == 'Action')]
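
As a side note, the same filter can be written with `isin`, which is the form Prompt generates later for the rating subset. The sketch below checks that the two forms select the same rows. The `toy_movies` dataframe and its values are made up for illustration and are not the tutorial's `movies_data`.

```python
import pandas as pd

# Hypothetical stand-in for the tutorial's movies_data (values are invented)
toy_movies = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "genre": ["Comedy", "Action", "Drama", "Horror"],
})

# Boolean-OR filter, in the same style as the generated code above
by_or = toy_movies[(toy_movies["genre"] == "Comedy") | (toy_movies["genre"] == "Action")]

# Equivalent filter using isin with a list of genres
by_isin = toy_movies[toy_movies["genre"].isin(["Comedy", "Action"])]

# Both approaches keep exactly the same rows
print(by_or.equals(by_isin))
```

The `isin` form tends to be easier to extend when more than two categories are needed.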
# PROMPT: Print the value counts of genre

import pandas as pd

# Get the count of each unique value in the 'genre' column of the 'movies_data_subset' dataframe
genre_counts = movies_data_subset['genre'].value_counts()

# Print the counts
print(genre_counts)


Code Output:

Comedy    2245
Action    1705
Name: genre, dtype: int64
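
These counts also show how balanced the two classes are, which matters later when reading the model's accuracy. The sketch below turns the counts printed above into shares. It hard-codes the two numbers from the output rather than loading the data, and the variable names are my own. The balance may shift after the rating and location filters that follow.

```python
import pandas as pd

# Counts copied from the output above (not recomputed from the dataset)
counts_from_output = pd.Series({"Comedy": 2245, "Action": 1705}, name="genre")

# Share of each genre among the comedy and action movies
genre_shares = counts_from_output / counts_from_output.sum()

print(genre_shares.round(3))
```

At this stage comedies make up roughly 57% of the subset and action movies roughly 43%.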

After subsetting for comedy and action movies, I wanted to explore ratings.

# PROMPT: Bar chart of rating

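import matplotlib.pyplot as plt

# Create a bar chart of the 'rating' column from the 'movies_data_subset' dataframe
rating_counts = movies_data_subset['rating'].value_counts()
plt.figure(figsize=(10, 6))
plt.bar(rating_counts.index, rating_counts.values)
plt.xlabel('Rating')
plt.ylabel('Count')
plt.title('Distribution of Movie Ratings')
plt.show()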

Code Output:

Based on the bar chart, I decided just to subset for three movie ratings.

# PROMPT: Subset for movies rated PG, PG-13, and R

# Filter the 'rating' column of the 'movies_data_subset' dataframe for movies that are rated 'PG', 'PG-13', and 'R'

movies_data_subset_filtered = movies_data_subset[movies_data_subset['rating'].isin(['PG', 'PG-13', 'R'])]


Code Output:

Next I wanted to check out location and subset accordingly.

# PROMPT: Value counts of location

# Task: Get the value counts of the 'location' column

# Use the 'value_counts()' function on the 'location' column of the 'movies_data_subset_filtered' dataframe
location_counts = movies_data_subset_filtered['location'].value_counts()
print(location_counts)


Code Output:

United States                     3529
United Kingdom                      59
France                              27
Australia                           18
Canada                              17
Germany                             16
Brazil                              10
Italy                               10
Russia                               9
Japan                                8
Hong Kong                            8
Spain                                7
Israel                               6
South Korea                          6
Singapore                            5
Argentina                            5
China                                5
Norway                               5
Thailand                             4
Greece                               4
Denmark                              4
Sweden                               4
Ireland                              3
Iceland                              3
New Zealand                          2
South Africa                         2
Finland                              2
Belgium                              2
West Germany                         2
India                                1
Bahamas                              1
United Arab Emirates                 1
Portugal                             1
Kazakhstan                           1
Soviet Union                         1
Federal Republic of Yugoslavia       1
Latvia                               1
Poland                               1
Turkey                               1
Taiwan                               1
Mexico                               1
Puerto Rico                          1
Hungary                              1
Netherlands                          1
Name: location, dtype: int64
# PROMPT: Subset for location = United States

# Filter the 'location' column of the 'movies_data_subset_filtered' dataframe for rows where the value is equal to 'United States'

movies_data_subset_filtered_us = movies_data_subset_filtered[movies_data_subset_filtered['location'] == 'United States']


Code Output:

Prompts 7-8: Exploratory Data Analysis

Next I wanted to take a look at scores and budgets based on genre.

Distribution of score by genre
Distribution of budget by genre
Change: Use seaborn boxplots instead
# PROMPT: Distribution of score by genre

# Importing the necessary libraries
import matplotlib.pyplot as plt

# Grouping the data by genre and calculating the average score for each genre
genre_scores = movies_data_subset_filtered_us.groupby('genre')['score'].mean()

# Plotting the bar chart
plt.bar(genre_scores.index, genre_scores.values)
plt.xlabel('Genre')
plt.ylabel('Average Score')
plt.title('Distribution of Movie Scores by Genre')
plt.show()

Code Output:

# PROMPT: Distribution of budget by genre
# CHANGE: Use seaborn boxplots instead

# Importing the necessary libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Grouping the data by genre and calculating the average budget for each genre
genre_budgets = movies_data_subset_filtered_us.groupby('genre')['budget'].mean()

# Plotting the box plot
sns.boxplot(x=movies_data_subset_filtered_us['genre'], y=movies_data_subset_filtered_us['budget'])
plt.title('Distribution of Movie Budgets by Genre')

Code Output:
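
A boxplot shows the median and spread, while the first chart only showed the mean, so it can help to put the numbers side by side. The sketch below computes a few summary statistics per genre with `groupby` and `agg`. The `toy_budgets` dataframe is invented for illustration (budgets in millions) and does not reflect the real movie data.

```python
import pandas as pd

# Hypothetical budgets in millions of dollars (made-up values)
toy_budgets = pd.DataFrame({
    "genre": ["Action", "Action", "Action", "Comedy", "Comedy", "Comedy"],
    "budget": [50.0, 100.0, 150.0, 10.0, 20.0, 60.0],
})

# Mean, median, min, and max budget for each genre
budget_summary = toy_budgets.groupby("genre")["budget"].agg(["mean", "median", "min", "max"])

print(budget_summary)
```

In this toy data the comedy mean (30) sits above the comedy median (20) because of one larger budget, which is the kind of skew a boxplot makes visible and a bar chart of means hides.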

Prompts 9-11: Logistic Regression and Evaluation

Once preliminary EDA was finished, I moved on to removing extraneous columns and building out a first logistic regression model:

New dataframe of rating, genre, score, votes, budget, gross, and runtime
Predict genre, print accuracy
Change: Use logistic regression instead
Plot the confusion matrix
Change: Add axis tick labels for Action and Comedy
# PROMPT: New dataframe of rating, genre, score, votes, budget, gross, and runtime

# Create a new dataframe called 'new_dataframe' by selecting the columns 'rating', 'genre', 'score', 'votes', 'budget', 'gross', and 'runtime' from the 'movies_data_subset_filtered_us' dataframe.

new_dataframe = movies_data_subset_filtered_us[['rating', 'genre', 'score', 'votes', 'budget', 'gross', 'runtime']]
# PROMPT: Predict genre, print accuracy
# CHANGE: Use logistic regression instead

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Split the 'new_dataframe' into features (X) and target variable (y)
X = new_dataframe[['rating', 'score', 'votes', 'budget', 'gross', 'runtime']]
y = new_dataframe['genre']

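# Check for missing values in X
print(X.isnull().sum())

# Fill missing values in the numeric columns of X with the mean of each column
# (numeric_only=True skips the text column 'rating')
X = X.fillna(X.mean(numeric_only=True))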

# Encode the categorical variable 'rating' using one-hot encoding
X_encoded = pd.get_dummies(X, columns=['rating'])

# Split the data into training and testing sets using a 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.3, random_state=42)

# Train a logistic regression classifier on the training data
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on the testing data using the trained model
y_pred = model.predict(X_test)

# Evaluate the accuracy of the model by comparing the predicted genre with the actual genre in the testing data
accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy of the prediction model
print("Accuracy:", accuracy)

Code Output:
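
The generated cell fits `LogisticRegression` directly on columns with very different scales (for example votes, budget, and gross next to score and runtime). One common follow-up is to standardize the features first. The sketch below shows that pattern with a scikit-learn pipeline. It is not what Prompt produced in the tutorial: the data is synthetic, the column values are invented, and `max_iter=1000` is my own choice. Whether scaling would improve accuracy on the movie data is not something this sketch shows.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the movie features (all values are invented)
rng = np.random.default_rng(0)
n = 400
toy_genre = rng.choice(["Action", "Comedy"], size=n)
is_action = toy_genre == "Action"
toy_features = pd.DataFrame({
    "score": rng.normal(6.3, 1.0, n),
    "budget": rng.normal(4e7, 1e7, n) + is_action * 3e7,
    "runtime": rng.normal(100, 10, n) + is_action * 8,
})

# Same 70:30 split as in the tutorial cell
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_features, toy_genre, test_size=0.3, random_state=42
)

# Standardize each column, then fit logistic regression on the scaled values
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)

scaled_accuracy = scaled_model.score(X_te, y_te)
print("Test accuracy on synthetic data:", scaled_accuracy)
```

Putting the scaler inside the pipeline means it is fitted on the training split only, so no information from the test split leaks into the scaling.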


In the YouTube video linked earlier you can see that Prompt initially uses a decision tree to predict movie genre, but I was able to use the "Change the above cell" option to ask Prompt to use logistic regression instead. Once the model was built, Prompt was able to print the accuracy score, which is about 0.546.
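
An accuracy near 0.546 is easiest to judge against a baseline that always predicts the most common genre. The sketch below shows one way to compute such a baseline with scikit-learn's `DummyClassifier`. The labels are a tiny made-up example, not the tutorial's split, so the number it prints says nothing about the movie data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 6 comedies and 4 action movies for training, 3 and 2 for testing
toy_y_train = np.array(["Comedy"] * 6 + ["Action"] * 4)
toy_y_test = np.array(["Comedy"] * 3 + ["Action"] * 2)

# DummyClassifier ignores the features, so placeholder columns are enough
toy_X_train = np.zeros((len(toy_y_train), 1))
toy_X_test = np.zeros((len(toy_y_test), 1))

# Always predict the most frequent class seen during training
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(toy_X_train, toy_y_train)

baseline_accuracy = baseline.score(toy_X_test, toy_y_test)
print("Baseline accuracy:", baseline_accuracy)
```

Running the same few lines on the real `X_train`, `y_train`, `X_test`, and `y_test` would show whether the logistic regression model beats the majority-class guess.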

# PROMPT: Plot the confusion matrix
# CHANGE: Add axis tick labels for Action and Comedy

# Import the necessary libraries for plotting and evaluating the model
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Create a confusion matrix from the actual genres in 'y_test' and the predicted genres in 'y_pred'
cm = confusion_matrix(y_test, y_pred)

# Define the axis tick labels
labels = ['Action', 'Comedy']

# Plot the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, cmap='Blues', fmt='g')
plt.xlabel('Predicted Genre')
plt.ylabel('Actual Genre')
plt.title('Confusion Matrix')

# Set the axis tick labels
plt.xticks(ticks=[0.5, 1.5], labels=labels)
plt.yticks(ticks=[0.5, 1.5], labels=labels)


Code Output:
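
The confusion matrix can also be turned into per-genre recall and precision, which say more than a single accuracy number. The sketch below does that with NumPy on a small set of made-up labels. The label lists are invented, and passing `labels=` explicitly is my own addition to pin the row and column order.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up actual and predicted genres (not the tutorial's results)
toy_actual = ["Action", "Action", "Action", "Comedy", "Comedy", "Comedy", "Comedy", "Action"]
toy_predicted = ["Action", "Comedy", "Action", "Comedy", "Comedy", "Action", "Comedy", "Action"]
genre_labels = ["Action", "Comedy"]

# Rows are actual genres, columns are predicted genres, in the order of genre_labels
toy_cm = confusion_matrix(toy_actual, toy_predicted, labels=genre_labels)

# Recall: correct predictions divided by the number of actual movies in each genre
recall = toy_cm.diagonal() / toy_cm.sum(axis=1)

# Precision: correct predictions divided by the number of predictions for each genre
precision = toy_cm.diagonal() / toy_cm.sum(axis=0)

for name, r, p in zip(genre_labels, recall, precision):
    print(f"{name}: recall={r:.2f}, precision={p:.2f}")
```

scikit-learn's `classification_report` prints the same quantities in one call if a table is preferred.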

You can continue exploring the data or tuning and editing the model as needed with Prompt! Happy coding!


Einblick is an AI-native data science platform that provides data teams with an agile workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.
