In this tutorial, I go through an example workflow to build a first classification model, in this case a logistic regression model. Logistic regression, along with linear regression, is a fundamental part of supervised learning for any data analyst or data scientist. With Prompt, Einblick's AI agent that can write and fix code, create charts, and much more, building a model has never been faster or simpler. You still need to examine your data and the results to guide Prompt effectively, but natural language is truly the next step for programming.
The Data and Notebook
If you want to follow along with the video, feel free to open and fork the canvas-based notebook below.
Next, you can take a quick look at the movie data we have available:
Prompts 1-6: Data Cleaning
A quick glance at the data showed that many movie genres were represented in the dataset, with comedy and action movies the most common, so I started by filtering for just those two genres. Then, through a few more prompts, I continued to narrow the dataset before doing some EDA and building the model. The prompts I used are as follows:
Subset for comedy and action movies
Print the value counts of genre
Bar chart of rating
Subset for movies rated PG, PG-13, and R
Value counts of location
Subset for location = United States
# PROMPT: Subset for comedy and action movies
# Subset the 'movies_data' dataframe by filtering the 'genre' column for 'Comedy' and 'Action' genres
movies_data_subset = movies_data[(movies_data['genre'] == 'Comedy') | (movies_data['genre'] == 'Action')]

# PROMPT: Print the value counts of genre
import pandas as pd

# Get the count of each unique value in the 'genre' column of the 'movies_data_subset' dataframe
genre_counts = movies_data_subset['genre'].value_counts()
genre_counts

Comedy    2245
Action    1705
Name: genre, dtype: int64
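As an aside, the same two-genre filter can also be written with `isin()`, which scales more easily if you later want to keep more genres. This is only a minimal sketch on a small made-up dataframe (the `name` values and rows below are hypothetical, not taken from the real movie data):

```python
import pandas as pd

# Hypothetical toy data standing in for the tutorial's 'movies_data' dataframe
movies_data = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "genre": ["Comedy", "Action", "Drama", "Comedy", "Horror"],
})

# isin() keeps rows whose genre matches any value in the list
genres_to_keep = ["Comedy", "Action"]
movies_data_subset = movies_data[movies_data["genre"].isin(genres_to_keep)]

# value_counts() confirms that only the two chosen genres remain
genre_counts = movies_data_subset["genre"].value_counts()
print(genre_counts)
```

Both forms select the same rows; `isin()` just avoids chaining several `==` comparisons with `|`.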
After subsetting for comedy and action movies, I wanted to explore ratings.
# PROMPT: Bar chart of rating
import matplotlib.pyplot as plt

# Create a bar chart of the 'rating' column from the 'movies_data_subset' dataframe
plt.figure(figsize=(10, 6))
movies_data_subset['rating'].value_counts().plot(kind='bar')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.title('Distribution of Movie Ratings')
plt.show()
Based on the bar chart, I decided just to subset for three movie ratings.
# PROMPT: Subset for movies rated PG, PG-13, and R
# Filter the 'rating' column of the 'movies_data_subset' dataframe for movies that are rated 'PG', 'PG-13', and 'R'
movies_data_subset_filtered = movies_data_subset[movies_data_subset['rating'].isin(['PG', 'PG-13', 'R'])]
movies_data_subset_filtered
Next I wanted to check out location and subset accordingly.
# PROMPT: Value counts of location
# Task: Get the value counts of the 'location' column
# Use the 'value_counts()' function on the 'location' column of the 'movies_data_subset_filtered' dataframe
location_counts = movies_data_subset_filtered['location'].value_counts()
location_counts
United States                     3529
United Kingdom                      59
France                              27
Australia                           18
Canada                              17
Germany                             16
Brazil                              10
Italy                               10
Russia                               9
Japan                                8
Hong Kong                            8
Spain                                7
Israel                               6
South Korea                          6
Singapore                            5
Argentina                            5
China                                5
Norway                               5
Thailand                             4
Greece                               4
Denmark                              4
Sweden                               4
Ireland                              3
Iceland                              3
New Zealand                          2
South Africa                         2
Finland                              2
Belgium                              2
West Germany                         2
India                                1
Bahamas                              1
United Arab Emirates                 1
Portugal                             1
Kazakhstan                           1
Soviet Union                         1
Federal Republic of Yugoslavia       1
Latvia                               1
Poland                               1
Turkey                               1
Taiwan                               1
Mexico                               1
Puerto Rico                          1
Hungary                              1
Netherlands                          1
Name: location, dtype: int64
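When one category dominates like this, it can help to look at shares rather than raw counts. A minimal sketch, assuming a small made-up `location` column (the ten rows below are hypothetical, not the real data), uses the `normalize=True` option of `value_counts()`:

```python
import pandas as pd

# Hypothetical toy column: 8 of 10 movies are from the United States
locations = pd.Series(
    ["United States"] * 8 + ["United Kingdom", "France"],
    name="location",
)

# normalize=True returns each category's fraction of all rows instead of a count
location_share = locations.value_counts(normalize=True)
print(location_share)
```

A large share for a single location is one reasonable argument for subsetting to it, as done in the next prompt.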
# PROMPT: Subset for location = United States
# Filter the 'location' column of the 'movies_data_subset_filtered' dataframe for rows where the value is equal to 'United States'
movies_data_subset_filtered_us = movies_data_subset_filtered[movies_data_subset_filtered['location'] == 'United States']
movies_data_subset_filtered_us
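The three subsetting steps above (genre, rating, location) could also be collapsed into one boolean mask once you know which filters you want. The following is a sketch on made-up rows, and the variable names `mask` and `movies_us` are my own, not something Prompt generated:

```python
import pandas as pd

# Hypothetical toy data with the three columns used for filtering
movies_data = pd.DataFrame({
    "genre": ["Comedy", "Action", "Drama", "Comedy", "Action"],
    "rating": ["PG", "R", "PG-13", "G", "PG-13"],
    "location": ["United States", "United States", "United States",
                 "United States", "France"],
})

# Combine all three conditions with & so a row must satisfy every one
mask = (
    movies_data["genre"].isin(["Comedy", "Action"])
    & movies_data["rating"].isin(["PG", "PG-13", "R"])
    & (movies_data["location"] == "United States")
)

# .copy() gives an independent dataframe, so later edits do not touch the original
movies_us = movies_data[mask].copy()
print(movies_us)
```

Working step by step, as in the walkthrough, has the advantage that you can inspect the data between filters; the single mask is mostly useful once the filters are settled.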
Prompts 7-8: Exploratory Data Analysis
Next I wanted to take a look at scores and budgets based on genre.
Distribution of score by genre
Distribution of budget by genre
Change: Use seaborn boxplots instead
# PROMPT: Distribution of score by genre
# Importing the necessary libraries
import matplotlib.pyplot as plt

# Grouping the data by genre and calculating the average score for each genre
genre_scores = movies_data_subset_filtered_us.groupby('genre')['score'].mean()

# Plotting the bar chart
plt.bar(genre_scores.index, genre_scores.values)
plt.xlabel('Genre')
plt.ylabel('Average Score')
plt.title('Distribution of Movie Scores by Genre')
plt.show()
# PROMPT: Distribution of budget by genre
# CHANGE: Use seaborn boxplots instead
# Importing the necessary libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Plotting one box plot of budget per genre
sns.boxplot(x=movies_data_subset_filtered_us['genre'], y=movies_data_subset_filtered_us['budget'])
plt.xlabel('Genre')
plt.ylabel('Budget')
plt.title('Distribution of Movie Budgets by Genre')
plt.show()
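The switch from a bar chart of averages to box plots matters because an average alone can hide how the values are spread. As a rough illustration with made-up budget numbers (not the real dataset), two genres can share the same mean while having very different medians and ranges:

```python
import pandas as pd

# Hypothetical budgets in millions of dollars, chosen so both genres share a mean
movies = pd.DataFrame({
    "genre": ["Action", "Action", "Action", "Comedy", "Comedy", "Comedy"],
    "budget": [10.0, 20.0, 150.0, 50.0, 60.0, 70.0],
})

# Summarize each genre with several statistics instead of the mean alone
summary = movies.groupby("genre")["budget"].agg(["mean", "median", "min", "max"])
print(summary)
```

Here both means are 60, but the Action median is 20 because one large budget pulls the average up. A box plot shows that kind of skew directly, which a bar of means cannot.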
Prompts 9-11: Logistic Regression and Evaluation
Once preliminary EDA was finished, I moved on to removing extraneous columns and building out a first logistic regression model:
New dataframe of rating, genre, score, votes, budget, gross, and runtime
Predict genre, print accuracy
Change: Use logistic regression instead
Plot the confusion matrix
Change: Add axis tick labels for Action and Comedy
# PROMPT: New dataframe of rating, genre, score, votes, budget, gross, and runtime
# Create a new dataframe called 'new_dataframe' by selecting the columns 'rating', 'genre', 'score', 'votes', 'budget', 'gross', and 'runtime' from the 'movies_data_subset_filtered_us' dataframe.
new_dataframe = movies_data_subset_filtered_us[['rating', 'genre', 'score', 'votes', 'budget', 'gross', 'runtime']]
# PROMPT: Predict genre, print accuracy
# CHANGE: Use logistic regression instead
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Split the 'new_dataframe' into features (X) and target variable (y)
X = new_dataframe[['rating', 'score', 'votes', 'budget', 'gross', 'runtime']]
y = new_dataframe['genre']

# Check for missing values in X
X.isnull().sum()

# Fill missing values in the numeric columns of X with the mean of each column
# (numeric_only=True skips the text column 'rating', which recent pandas versions cannot average)
X = X.fillna(X.mean(numeric_only=True))

# Encode the categorical variable 'rating' using one-hot encoding
X_encoded = pd.get_dummies(X, columns=['rating'])

# Split the data into training and testing sets using a 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.3, random_state=42)

# Train a logistic regression classifier on the training data
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on the testing data using the trained model
y_pred = model.predict(X_test)

# Evaluate the accuracy of the model by comparing the predicted genre with the actual genre in the testing data
accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy of the prediction model
print("Accuracy:", accuracy)
In the YouTube video linked earlier you can see that Prompt initially uses a decision tree to predict movie genre, but I was able to use the "Change the above cell" option to ask Prompt to use logistic regression instead. Once the model was built, Prompt was able to print the accuracy score, which is about 0.546.
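An accuracy score is easier to interpret next to a baseline, and logistic regression often behaves better when features on very different scales (such as votes and budget) are standardized first. The sketch below is one possible refinement, not what Prompt generated in the video. It uses synthetic data from scikit-learn's `make_classification` as a stand-in for the movie features, so the numbers it prints say nothing about the real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data standing in for the movie features (an assumption)
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_redundant=0,
    n_clusters_per_class=1, class_sep=2.0, random_state=42,
)
# Inflate one column to imitate a dollar-scale feature such as budget
X[:, 0] *= 1_000_000

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Baseline: always predict the most frequent class seen in training
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

# Standardize the features, then fit logistic regression, in a single pipeline
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
model_acc = accuracy_score(y_test, model.predict(X_test))

print("Baseline accuracy:", baseline_acc)
print("Scaled logistic regression accuracy:", model_acc)
```

Putting the scaler inside the pipeline means it is fit on the training split only, so no information from the test split leaks into preprocessing. On the movie data, comparing the 0.546 score against such a baseline would show how much the model has actually learned.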
# PROMPT: Plot the confusion matrix
# CHANGE: Add axis tick labels for Action and Comedy
# Import the necessary libraries for plotting and evaluating the model
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Create a confusion matrix using the actual genres from the 'y_test' dataframe and the predicted genres
cm = confusion_matrix(y_test, y_pred)

# Define the axis tick labels
labels = ['Action', 'Comedy']

# Plot the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, cmap='Blues', fmt='g')
plt.xlabel('Predicted Genre')
plt.ylabel('Actual Genre')
plt.title('Confusion Matrix')

# Set the axis tick labels
plt.xticks(ticks=[0.5, 1.5], labels=labels)
plt.yticks(ticks=[0.5, 1.5], labels=labels)
plt.show()
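Beyond the heatmap, the confusion matrix can be turned into per-genre recall and precision, which tell you whether the model is weaker on one class. Here is a minimal sketch using a handful of made-up labels (the `y_true` and `y_pred_toy` lists are hypothetical, not the model's real predictions):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted genres for eight movies
y_true = ["Action", "Action", "Action", "Comedy",
          "Comedy", "Comedy", "Comedy", "Action"]
y_pred_toy = ["Action", "Comedy", "Action", "Comedy",
              "Comedy", "Action", "Comedy", "Action"]

# Passing labels fixes the row and column order: rows are actual, columns are predicted
labels = ["Action", "Comedy"]
cm = confusion_matrix(y_true, y_pred_toy, labels=labels)

# Recall: correct predictions divided by the number of actual movies in each genre
recall_per_class = cm.diagonal() / cm.sum(axis=1)
# Precision: correct predictions divided by the number of predictions for each genre
precision_per_class = cm.diagonal() / cm.sum(axis=0)

for label, recall, precision in zip(labels, recall_per_class, precision_per_class):
    print(f"{label}: recall={recall:.2f}, precision={precision:.2f}")
```

Passing `labels` explicitly also makes the tick-label order in a plot unambiguous, rather than relying on scikit-learn's default sorted order.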
You can continue exploring the data or tuning and editing the model as needed with Prompt! Happy coding!