In this tutorial, I walk through an example workflow going from raw data to a first linear regression model. Linear regression is one of the most fundamental supervised learning techniques available to any data analyst or data scientist, and with generative AI, building a model has never been faster or simpler. You still need to examine your data and results along the way, though, so that you can guide Prompt, Einblick's AI agent that can write and fix code, create beautiful charts, and much more.

**NOTE:** In the video, you can see that I make a few typos in the prompts I type in, and I didn't notice these until video editing, because Prompt AI was still able to process my requests correctly!

## The Data and Notebook

First, let's take a quick snapshot of our data, which provides information on various Adidas sales.

If you would like to, you can open and fork the canvas below to follow along with the video tutorial or as you read our in-depth guide below.
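If you're working outside Einblick and want a comparable first snapshot, the inspection step might look like the sketch below. The rows here are made up for illustration; only the column names match those used later in the tutorial.

```python
import pandas as pd

# A tiny stand-in for the Adidas sales data (column names match those used
# later in the tutorial; the values in these rows are hypothetical).
adidas_sales = pd.DataFrame({
    'Retailer': ['Foot Locker', 'West Gear', 'Amazon'],
    'Region': ['Northeast', 'West', 'South'],
    'Product': ["Men's Street Footwear", "Women's Apparel", "Men's Athletic Footwear"],
    'Units Sold': [1200, 850, 400],
    'Total Sales': [600000, 382500, 180000],
    'Operating Profit': [300000, 133875, 63000],
    'Operating Margin': [0.50, 0.35, 0.35],
    'Sales Method': ['In-store', 'Online', 'Outlet'],
})

# First look: dimensions, column types, and the first few rows
print(adidas_sales.shape)
print(adidas_sales.dtypes)
print(adidas_sales.head())
```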

## Prompts 1-3: Data Cleaning

After some initial examination of the data, I could see that there were different kinds of Adidas products sold, from apparel to footwear, and that there was some class imbalance in the Retailer column, so I wanted to filter down to just a few categories. I accomplished this with the following prompts.

- Filter for only footwear
- Print the frequency counts for retailer and region
- Filter for foot locker, west gear, and sports direct

```
# PROMPT: Filter for only footwear
adidas_sales_filtered = adidas_sales[adidas_sales['Product'].str.contains('footwear', case=False)]
```

After filtering for footwear (excluding clothing), I wanted to see if there were any retailers or regions that were overrepresented.

```
# PROMPT: Print the frequency counts for retailer and region
# Import the pandas library
import pandas as pd
# Use the value_counts() function on the 'Retailer' column of the 'adidas_sales_filtered' dataframe
retailer_counts = adidas_sales_filtered['Retailer'].value_counts()
# Use the value_counts() function on the 'Region' column of the 'adidas_sales_filtered' dataframe
region_counts = adidas_sales_filtered['Region'].value_counts()
# Print the frequency counts for both columns
print("Retailer Counts:")
print(retailer_counts)
print("\nRegion Counts:")
print(region_counts)
```

**Code Output:**

```
Retailer Counts:
Foot Locker 1771
West Gear 1585
Sports Direct 1352
Kohl's 687
Amazon 633
Walmart 406
Name: Retailer, dtype: int64
Region Counts:
West 1634
Northeast 1584
Midwest 1248
South 1152
Southeast 816
Name: Region, dtype: int64
```

From this, I could see that there were way more instances of Foot Locker, West Gear, and Sports Direct, so I focused on those.

```
# PROMPT: Filter for foot locker, west gear, and sports direct
# Filter the 'adidas_sales_filtered' dataframe for the retailers 'Foot Locker', 'West Gear', and 'Sports Direct'
retailers = ['Foot Locker', 'West Gear', 'Sports Direct']
filtered_df = adidas_sales_filtered[adidas_sales_filtered['Retailer'].isin(retailers)]
filtered_df
```

**Code Output:**

## Prompts 4-5: Exploratory Data Analysis

Next I wanted to take a look at sales and profit metrics based on retailer and region, so I used the following prompts.

- Plot the distributions of units sold, total sales, operating profit, and operating margin. Color by retailer. Use a 2 x 2 grid of plots.
- Plot the distributions of units sold, total sales, operating profit, and operating margin. Color by region. Use a 2 x 2 grid of plots.

**NOTE:** The request to "plot the distribution" is ambiguous, so Prompt first returned histograms (as seen in the video above). I thought box plots would be more illustrative in these cases, so I just used the **Change this cell** button to ask Prompt to update the code to box plots instead.

```
# PROMPT: Plot the distributions of units sold, total sales, operating profit, and operating margin. Color by retailer. Use a 2 x 2 grid of plots.
# CHANGE: Use box plots instead
# Import the necessary libraries
import matplotlib.pyplot as plt
import seaborn as sns
# Create a 2x2 grid of plots
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
# Plot the boxplot of 'Units Sold' colored by 'Retailer'
sns.boxplot(data=filtered_df, x='Retailer', y='Units Sold', ax=axes[0, 0])
axes[0, 0].set_title('Boxplot of Units Sold')
# Plot the boxplot of 'Total Sales' colored by 'Retailer'
sns.boxplot(data=filtered_df, x='Retailer', y='Total Sales', ax=axes[0, 1])
axes[0, 1].set_title('Boxplot of Total Sales')
# Plot the boxplot of 'Operating Profit' colored by 'Retailer'
sns.boxplot(data=filtered_df, x='Retailer', y='Operating Profit', ax=axes[1, 0])
axes[1, 0].set_title('Boxplot of Operating Profit')
# Plot the boxplot of 'Operating Margin' colored by 'Retailer'
sns.boxplot(data=filtered_df, x='Retailer', y='Operating Margin', ax=axes[1, 1])
axes[1, 1].set_title('Boxplot of Operating Margin')
# Adjust the layout and spacing of the subplots
plt.tight_layout()
# Display the plots
plt.show()
```

**Code Output:**

We can see there isn't much difference in the typical value of each metric across retailers, though there is perhaps some difference in spread. Since this didn't yield much information, I wanted to take a look at the metrics across different regions.

```
# PROMPT: Plot the distributions of units sold, total sales, operating profit, and operating margin. Color by region. Use a 2 x 2 grid of plots.
# CHANGE: Use box plots instead
# Import the necessary libraries
import matplotlib.pyplot as plt
import seaborn as sns
# Create a 2x2 grid of plots
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
# Plot the boxplot of 'Units Sold' colored by 'Region'
sns.boxplot(data=filtered_df, x='Region', y='Units Sold', ax=axes[0, 0])
axes[0, 0].set_title('Boxplot of Units Sold')
# Plot the boxplot of 'Total Sales' colored by 'Region'
sns.boxplot(data=filtered_df, x='Region', y='Total Sales', ax=axes[0, 1])
axes[0, 1].set_title('Boxplot of Total Sales')
# Plot the boxplot of 'Operating Profit' colored by 'Region'
sns.boxplot(data=filtered_df, x='Region', y='Operating Profit', ax=axes[1, 0])
axes[1, 0].set_title('Boxplot of Operating Profit')
# Plot the boxplot of 'Operating Margin' colored by 'Region'
sns.boxplot(data=filtered_df, x='Region', y='Operating Margin', ax=axes[1, 1])
axes[1, 1].set_title('Boxplot of Operating Margin')
# Adjust the layout and spacing of the subplots
plt.tight_layout()
# Display the plots
plt.show()
```

**Code Output:**

There was a visible difference across regions for each metric: some regions had higher or lower typical values, and some had much wider distributions.

## Prompts 6-8: Linear Regression and Evaluation

Once preliminary EDA was finished, I moved on to predicting operating margin based on some of the variables I had examined.

- Use linear regression to predict operating margin based on the retailer, region, product, total sales, and sales method
- Plot the key drivers of the regression model
- Evaluate the multicollinearity in the model, plot the correlation matrix as a heatmap
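One detail worth understanding before reading the generated code: the categorical columns are one-hot encoded with `pd.get_dummies(drop_first=True)`, which drops one level per category so the remaining dummies aren't perfectly collinear. A toy illustration on a hypothetical sales-method column:

```python
import pandas as pd

# Toy column with three possible sales methods
toy = pd.DataFrame({'Sales Method': ['In-store', 'Online', 'Outlet', 'Online']})

# drop_first=True drops one level ('In-store', the alphabetically first),
# so the remaining dummy columns fully determine the category without redundancy
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded)
```

A row with zeros in both remaining columns is implicitly the dropped level, 'In-store'.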

```
# PROMPT: Use linear regression to predict operating margin based on the retailer, region, product, total sales, and sales method
# FIX (to handle missing values)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Create a new dataframe 'df' with selected columns from 'filtered_df'
df = filtered_df[['Retailer', 'Region', 'Product', 'Total Sales', 'Sales Method', 'Operating Margin']]
# Drop rows with missing values (reassign rather than calling dropna(inplace=True)
# on a slice, which triggers pandas' SettingWithCopyWarning)
df = df.dropna()
# Convert categorical variables into numerical variables using one-hot encoding
df_encoded = pd.get_dummies(df, drop_first=True)
# Split the data into input features (X) and target variable (y)
X = df_encoded.drop('Operating Margin', axis=1)
y = df_encoded['Operating Margin']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Instantiate a linear regression model
model = LinearRegression()
# Fit the model to the training data
model.fit(X_train, y_train)
# Make predictions on the testing data
y_pred = model.predict(X_test)
# Evaluate the model's performance
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
mse, mae, r2
```

**Code Output:**

`(0.004242134817663707, 0.05070368745508383, 0.43978402669787175)`

From the output, we can see that this first model has an R-squared value of about 0.44, meaning roughly 44% of the variance in operating margin is explained by the model. But to get a deeper understanding of the model, let's take a look at the key drivers.
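As a side note, the MSE above is in squared units of operating margin, which is hard to interpret directly. Taking its square root gives the RMSE on the same scale as the target, so using the reported value:

```python
import math

# MSE reported by the model evaluation above
mse = 0.004242134817663707

# RMSE is on the same scale as operating margin (a proportion)
rmse = math.sqrt(mse)
print(round(rmse, 4))  # ≈ 0.0651
```

In other words, a typical prediction is off by roughly 6.5 percentage points of operating margin.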

```
# PROMPT: Plot the key drivers of the regression model
# Import necessary libraries
import matplotlib.pyplot as plt
import pandas as pd
# Retrieve the coefficients of the regression model
coefficients = model.coef_
# Create a pandas DataFrame to store the coefficients and their corresponding feature names
coefficients_df = pd.DataFrame({'Feature': X_train.columns, 'Coefficient': coefficients})
# Sort the DataFrame by the absolute values of the coefficients
coefficients_df['Absolute Coefficient'] = coefficients_df['Coefficient'].abs()
coefficients_df = coefficients_df.sort_values('Absolute Coefficient', ascending=False)
# Plot the coefficients using a bar chart
plt.figure(figsize=(10, 6))
plt.bar(coefficients_df['Feature'], coefficients_df['Coefficient'])
plt.xlabel('Feature')
plt.ylabel('Coefficient')
plt.title('Key Drivers of Regression Model')
plt.xticks(rotation=90)
plt.show()
```

**Code Output:**

Since Prompt has automatically sorted the key drivers by the absolute value of their coefficients, we can see that whether or not a sale was made online and whether it occurred in the South had the biggest effects on the model, followed by whether or not the product was men's street footwear.

```
# PROMPT: Evaluate the multicollinearity in the model, plot the correlation matrix as a heatmap
# CHANGE: Round the values to the nearest hundredth, and make the font smaller
# Import the necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt
# Create a correlation matrix using the 'X' dataframe
correlation_matrix = X[['Total Sales', 'Retailer_Sports Direct', 'Retailer_West Gear', 'Region_Northeast', 'Region_South', 'Region_Southeast', 'Region_West', "Product_Men's Street Footwear", "Product_Women's Athletic Footwear", "Product_Women's Street Footwear", 'Sales Method_Online', 'Sales Method_Outlet']].corr().round(2)
# Set the font size
sns.set(font_scale=0.8)
# Plot the correlation matrix as a heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
```

**Code Output:**

Prompt created the heatmap quickly, and after a quick request to round the values to the nearest hundredth and make the font smaller, the graph is quite illustrative. We can see a strong correlation between the online and outlet sales-method dummies, which is expected since each sale uses exactly one method, so we may need to drop one of those variables. Beyond that, there is not much multicollinearity present in the model.
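A correlation heatmap only shows pairwise relationships; a common complementary check is the variance inflation factor (VIF), which also picks up collinearity involving several predictors at once. One way to compute it, sketched here on synthetic data rather than the Adidas features, uses the fact that each VIF is a diagonal entry of the inverse correlation matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic predictors: x2 is nearly a copy of x1, x3 is independent
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # highly collinear with x1
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

# VIF_i equals the i-th diagonal entry of the inverse correlation matrix
corr = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(corr))
print(np.round(vif, 2))  # x1 and x2 get large VIFs; x3 stays near 1
```

A common rule of thumb flags VIFs above 5 or 10 as problematic; here the two collinear columns blow past that threshold while the independent one stays near 1.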

You can continue exploring the data or tuning and editing the model as needed with Prompt! Happy coding!

### About

Einblick is an AI-native data science platform that provides data teams with an agile workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.