Einblick Prompt vs. GitHub Copilot

Becca Weng - August 17th, 2023

Everyone wants an AI assistant or AI pair programmer to speed up how they code and how they work. Data scientists and analysts are no different. But not all AI applications are created equal, and not all are built specifically for the data domain. With the increasing prevalence of ads and products claiming to speed up work with generative AI, we wanted to create several deep dives on some major apps in the space. In this article, we’ll be focusing on one of the most talked-about products from every coder’s favorite version control platform: GitHub Copilot.

This is the second article in a series on leading generative AI apps in the data domain. If you’re interested, check out our first article covering Jupyter AI.

What is GitHub Copilot

GitHub Copilot is an “AI pair programmer” offering from GitHub. Available through code editors like VS Code, Neovim, and JetBrains, Copilot is GitHub’s answer to the generative AI craze. Copilot has been trained on publicly available code and text, including public GitHub repositories. Once GitHub Copilot is enabled, it will automatically offer autocomplete suggestions for comments, lines of code, and even functions. You can "query" the AI by typing out code comments or by beginning your line of code. The suggestions appear as light text, and you can use keyboard shortcuts to accept a suggestion in full, accept it one word at a time, or just keep coding and ignore Copilot’s offerings.

The Copilot for Individuals plan costs $10 per month or $100 per year, and the Copilot for Business plan costs $19 per user per month.

What is GitHub Copilot Chat

GitHub Copilot Chat is the chat interface, currently in a limited public beta. As an individual, you can sign up on the waitlist, assuming you have a GitHub Copilot for Individuals subscription. Copilot Chat is also available for those with a GitHub Copilot for Business subscription. Note that you can download the extension on VS Code, for example, but if your account has not been activated for Copilot Chat via the waitlist or other means, the chat icon will not appear in your code editor, and you will not be able to use Copilot Chat.

What is Einblick Prompt

Einblick is an AI-native data notebook that can write and fix code, create beautiful charts, build models, and much more. Einblick Prompt is the context-aware AI agent at the core of Einblick’s notebooks. Prompt was built for the data domain and can create and run entire data workflows in as little as one sentence. With a unique 2-D canvas layout, Einblick allows users to easily compare code and charts, as well as iterate quickly on different workflows. Once in Einblick, users can query Prompt by clicking anywhere on the canvas and selecting Prompt, or by selecting Prompt off an existing cell.

While running, Prompt will provide the steps it will take before the code populates and auto-runs, and it will ask the user any follow-up questions to fill in missing information. Since it is context-aware, Prompt will intuit most of what it needs, including metadata such as column names and dataset names. If unsatisfied with the output, you can ask Prompt to change the existing cell, or to fix any errors. Then you can continue building manually or use Prompt to add on to existing work.

Prompt vs. GitHub Copilot Feature Comparison

Feature | Einblick Prompt | GitHub Copilot
Generates commented code | |
Directly fixes and debugs code | |
Context-aware | |
Can auto-run code | |
Built for data tasks | |
Generates blocks of code | | ❌ (*See section on Results in Copilot)
Has a free tier | | ❌ (*See section on What is GitHub Copilot)
Interface with model in a data notebook | |
Generates executable code in code cells | |
Installation required | N/A | Required
Prompt persistence | History of last 5 prompts run in notebook | N/A
In-line code autocompletion | |

Installation and getting started

Using GitHub Copilot in VS Code

Simply install the GitHub Copilot extension via the Extensions tab on VS Code. Once you’ve reloaded VS Code, GitHub Copilot will be automatically enabled, and will provide you with grayed out, inline code suggestions as you work. There is currently no way to easily toggle GitHub Copilot. The behavior of the extension is much like autocomplete in Google Docs. As you type, Copilot will start providing suggestions. These suggestions change as you keep typing.

Using GitHub Copilot Chat in VS Code

Based on promotional materials from GitHub, you can install the GitHub Copilot Chat extension, and once VS Code has reloaded, there will be a chat icon on the left-hand panel of VS Code. At this moment, GitHub Copilot Chat is in limited beta, and unavailable to the general public.

Using Prompt in Einblick

There’s no installation required for Einblick Prompt. Prompt is available to all Einblick users at no extra cost. Just open up an Einblick data notebook, click anywhere and select Prompt. Type in your query, and you are ready to start building with Prompt. You don’t have to supply your own API key as Einblick has you covered.

Using Einblick Prompt vs. GitHub Copilot

To assess the two apps, I used the same natural language prompts and dataset in both platforms to see what kind of code, charts, and models Prompt and Copilot could create for me. For this experiment, I used the Adidas sales dataset from Kaggle. The data includes sales data from retailers across the US, including units sold, operating profit, operating margin, total sales, and sales method.

Results in GitHub Copilot

Since GitHub Copilot Chat was unavailable at the time of writing, I only used GitHub Copilot. For the code editor, I used VS Code (Visual Studio Code), as it is already my code editor of choice. At first, I was worried that GitHub Copilot would only work in .py files, but I was pleasantly surprised that it worked as expected in the .ipynb file as well. It took a little while to get used to autocomplete in a code editor, just as it took a while to get used to autocomplete in Google Docs. Sometimes the suggestions were off, but I could just keep typing, and the suggestions tended to correct course.

*Interestingly, Copilot was pretty resistant to generating blocks of code, unless I specified in a comment I was defining a function. 99% of the time Copilot would just provide suggestions to complete the line, even if the line was just a code comment. Once I started a new line, it would suggest a new line of code or another comment. In order to get the code examples I show below, I had to keep hitting new line to receive and accept suggestions.

Results in Einblick Prompt

Screenshot of Einblick Prompt showing list of steps

To use Prompt in Einblick, you have to actively choose to start Prompt, as stated above. Rather than providing inline suggestions, Prompt will generate blocks of commented code. But you can also use Prompt to fix any bugs or errors you hit, whether you are coding manually or using Prompt to kickstart a workflow.

Below, I have reproduced the prompts and results from Prompt and Copilot. You can see the full analysis in the Einblick canvas below. The Prompts are provided in Markdown cells to the left of each relevant Python cell.

Use-case: Data cleaning

Prompt 1: Filter data

Filter for only footwear

Einblick Prompt Output

adidas_sales_filtered = adidas_sales[adidas_sales['Product'].str.contains('footwear', case=False)]

This was Prompt's first attempt, and it met all of my needs. Prompt checked for the word "footwear," ensuring it wasn't case sensitive, and it did not assume that the category would just be "footwear." In this case, there are words describing the category, such as "Men's Apparel," "Men's Athletic Footwear," and "Women's Street Footwear." Prompt also was able to infer which column to check without any additional user input.
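
To see why the substring match matters, here is a minimal sketch on a toy dataframe. The 'Product' column name and the three category labels come from the dataset described above; the dataframe itself, its values, and the variable names are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the dataset; only the 'Product' column name and the
# category labels come from the article, everything else is illustrative.
toy_sales = pd.DataFrame({
    "Product": ["Men's Apparel", "Men's Athletic Footwear", "Women's Street Footwear"],
    "Units Sold": [10, 20, 30],
})

# A literal equality test finds nothing, because no category is exactly "footwear".
exact_match = toy_sales[toy_sales["Product"] == "footwear"]

# A case-insensitive substring test keeps both footwear categories.
substring_match = toy_sales[toy_sales["Product"].str.contains("footwear", case=False)]

print(len(exact_match), len(substring_match))
```

A very literal reading of "Filter for only footwear" tends to produce something like the first filter, which silently returns an empty dataframe on data shaped like this.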

GitHub Copilot Attempt 1

Getting Copilot to give the output I expected took a bit of prodding. First, Copilot tried to read in the data. But because Copilot lacks awareness of the file system, it didn’t know the name of the file. This would be helpful if you don’t know the syntax for pd.read_csv, but having to look up the filename counters some of the time saved.
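
If you don't remember the filename either, one way to avoid guessing is to list the CSV files in the working folder before calling pd.read_csv. The sketch below is self-contained, so it first writes a tiny sample file; the folder, the filename adidas_sales_sample.csv, and the sample rows are all hypothetical and are not the real Kaggle file.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Create a throwaway folder with one CSV so the sketch is self-contained.
# The file name and rows are hypothetical, not the real Kaggle file.
data_dir = Path(tempfile.mkdtemp())
(data_dir / "adidas_sales_sample.csv").write_text(
    "Retailer,Product,Units Sold\n"
    "Foot Locker,Men's Street Footwear,1200\n"
    "West Gear,Women's Apparel,900\n"
)

# List the CSV files in the folder instead of guessing the name.
csv_files = sorted(data_dir.glob("*.csv"))
print([path.name for path in csv_files])

# Load the first match into a dataframe named 'adidas', as in the article.
adidas = pd.read_csv(csv_files[0])
print(adidas.shape)
```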

After manually editing code to load the file, I asked GitHub Copilot via comment to “Filter for only footwear.” All of Copilot’s suggestions were pretty literal. Copilot also did not know that the dataframe was saved as adidas, rather than df, nor was it aware of what column to look into. As a result, I tried a second time, with a different prompt.

GitHub Copilot Attempt 2

This time, I asked Copilot to “Filter for footwear in product name.” This yielded better results.

Copilot still struggled with the column name and dataframe name, but with the new comment, it was able to search just for the word “footwear” in the column. Then I manually fixed the code to match my particular dataset.

Prompt 2: Frequency counts and filter

Print frequency counts for Retailer and Region
Filter for foot locker, west gear, and sports direct

For the next step in the analysis, I wanted to check the frequency counts for the retailers and regions, to see if there was any imbalance that I would want to filter for.

Einblick Prompt Output: Print frequency counts

retailer_counts = adidas_sales_filtered['Retailer'].value_counts()
print(retailer_counts)

region_counts = adidas_sales_filtered['Region'].value_counts()
print(region_counts)

Output:

Foot Locker      1771
West Gear        1585
Sports Direct    1352
Kohl's            687
Amazon            633
Walmart           406
Name: Retailer, dtype: int64
West         1634
Northeast    1584
Midwest      1248
South        1152
Southeast     816
Name: Region, dtype: int64

Based on the frequency counts, which Prompt served up easily, I decided to filter for 3 retailers. Prompt was able to determine what the name of the dataframe was now, after the previous filter, which was super convenient.
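
As a rough check on that decision, the counts can be turned into shares. This sketch re-enters the retailer counts printed above by hand; the variable names are my own.

```python
import pandas as pd

# Retailer counts copied from the value_counts() output above.
retailer_counts = pd.Series({
    "Foot Locker": 1771,
    "West Gear": 1585,
    "Sports Direct": 1352,
    "Kohl's": 687,
    "Amazon": 633,
    "Walmart": 406,
})

# Convert counts to shares of all footwear rows.
retailer_shares = retailer_counts / retailer_counts.sum()

# Share of rows covered by the three largest retailers.
top_three_share = retailer_shares.nlargest(3).sum()
print(f"{top_three_share:.1%}")
```

The three largest retailers account for roughly 73% of the footwear rows. On the real dataframe, value_counts(normalize=True) would give the same shares directly.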

Einblick Prompt Output: Filter for retailers

Then, I asked Prompt to “Filter for foot locker, west gear, and sports direct.” Even though I didn’t type the retailers in with title case, Prompt was able to properly case the strings, without additional prompting or any manual editing on my part.

import pandas as pd

retailers = ['Foot Locker', 'West Gear', 'Sports Direct']
filtered_sales = adidas_sales_filtered[adidas_sales_filtered['Retailer'].isin(retailers)]
filtered_sales.head()

Output:

GitHub Copilot Attempt 1

Since Copilot continues not to know what the dataframe is called, if you’re moving too quickly, you will run into NameErrors, like this one. After manually updating the code, I was able to move on to the next step.

After entering the initial prompt: “Filter for foot locker, west gear, and sports direct,” Copilot kept generating more comments, one line at a time, that seemed tangentially related, but irrelevant to my existing dataset or task.

GitHub Copilot Attempt 2

After deleting all of the comments, except for the first, I started manually coding what I wanted, and with that input, Copilot was able to generate executable code.
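
If you end up writing this filter by hand, one way to make it tolerant of casing is to lowercase the column before comparing. This is a sketch on a toy dataframe, not what either tool generated; the retailer names come from the dataset, and the rest is illustrative.

```python
import pandas as pd

# Toy frame; retailer names come from the article, the rest is illustrative.
toy_sales = pd.DataFrame({
    "Retailer": ["Foot Locker", "West Gear", "Sports Direct", "Amazon", "Walmart"],
    "Units Sold": [5, 4, 3, 2, 1],
})

# Retailers typed the way they appear in the prompt (all lowercase).
wanted = ["foot locker", "west gear", "sports direct"]

# Lowercase the column before comparing so the prompt's casing does not matter.
mask = toy_sales["Retailer"].str.lower().isin(wanted)
filtered_toy = toy_sales[mask]
print(filtered_toy["Retailer"].tolist())
```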

Use-case: EDA (exploratory data analysis)

Prompt 1: Complex charts

Plot the distributions of price per unit, units sold, total sales, operating profit, and operating margin by region

Changing and updating code

For both Prompt and GitHub Copilot, the initial charts used the default font size, which was too big for the number of charts. So I had to add a prompt to make the font smaller.

Make the font of the axis tick labels smaller

Einblick Prompt Output

import matplotlib.pyplot as plt
import seaborn as sns

# Create a new dataframe 'region_sales' by grouping 'filtered_sales' by 'Region' and calculating the sum of relevant columns
region_sales = filtered_sales.groupby('Region').sum()[['Price per Unit', 'Units Sold', 'Total Sales', 'Operating Profit', 'Operating Margin']]

# Plot the distributions of 'Price per Unit', 'Units Sold', 'Total Sales', 'Operating Profit', and 'Operating Margin' by 'Region' using boxplots
plt.figure(figsize=(12, 10))
plt.subplot(2, 3, 1)
sns.boxplot(x='Region', y='Price per Unit', data=filtered_sales)
plt.title('Distribution of Price per Unit by Region')
plt.xticks(fontsize=8)

plt.subplot(2, 3, 2)
sns.boxplot(x='Region', y='Units Sold', data=filtered_sales)
plt.title('Distribution of Units Sold by Region')
plt.xticks(fontsize=8)

plt.subplot(2, 3, 3)
sns.boxplot(x='Region', y='Total Sales', data=filtered_sales)
plt.title('Distribution of Total Sales by Region')
plt.xticks(fontsize=8)

plt.subplot(2, 3, 4)
sns.boxplot(x='Region', y='Operating Profit', data=filtered_sales)
plt.title('Distribution of Operating Profit by Region')
plt.xticks(fontsize=8)

plt.subplot(2, 3, 5)
sns.boxplot(x='Region', y='Operating Margin', data=filtered_sales)
plt.title('Distribution of Operating Margin by Region')
plt.xticks(fontsize=8)

plt.tight_layout()
plt.show()

Output:

GitHub Copilot Attempt 1

When I just let Copilot provide suggestions, it seemed to start the analysis from scratch, and tried to calculate potential columns to store price per unit and operating profit. While this would have been a good skeleton if I were starting from scratch, it would have been nice if the context of the prior prompts had been considered.

GitHub Copilot Attempt 2

After deleting some of the suggested libraries and redundant lines of code, I tried to steer Copilot toward the plotting goal by setting the chart style with sns.set_theme(). This generated suggested code that aligned much more closely with what I had hoped for. But I still had to update the column names according to the dataset I had loaded previously.
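
Whichever tool writes the first draft, five near-identical subplot blocks are a good candidate for a loop. Here is a sketch of that refactor, under a few assumptions: it uses plain matplotlib boxplots instead of seaborn, runs on made-up toy data, and shrinks the tick labels with tick_params rather than plt.xticks. Only the column names come from the dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Toy data; column names follow the article, values are made up.
toy_sales = pd.DataFrame({
    "Region": ["West", "West", "South", "South", "Midwest", "Midwest"],
    "Price per Unit": [50, 55, 40, 45, 42, 48],
    "Units Sold": [900, 1000, 700, 650, 800, 820],
    "Total Sales": [45000, 55000, 28000, 29250, 33600, 39360],
    "Operating Profit": [18000, 20000, 9000, 9500, 12000, 13000],
    "Operating Margin": [0.40, 0.36, 0.32, 0.32, 0.36, 0.33],
})
metrics = ["Price per Unit", "Units Sold", "Total Sales",
           "Operating Profit", "Operating Margin"]
regions = sorted(toy_sales["Region"].unique())

fig, axes = plt.subplots(2, 3, figsize=(12, 10))
flat_axes = axes.flatten()

# One boxplot per metric, drawn in a loop rather than five copied blocks.
for ax, metric in zip(flat_axes, metrics):
    groups = [toy_sales.loc[toy_sales["Region"] == r, metric].to_numpy() for r in regions]
    ax.boxplot(groups)
    ax.set_xticklabels(regions)
    ax.set_title(f"Distribution of {metric} by Region")
    ax.tick_params(axis="both", labelsize=8)  # smaller tick labels on both axes

# Hide the unused sixth panel of the 2x3 grid.
flat_axes[-1].set_visible(False)
fig.tight_layout()
```

With the loop, a follow-up request such as a different font size becomes a one-line change instead of five.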

Prompt 2: Simple charts

Plot operating margin by retailer
Plot operating margin by sales method

Einblick Prompt Output

Prompt created plots that show the average operating margin by retailer and sales method rather than the distribution without further prompting. If you're looking for the distribution of the operating margin, you may need to ask Prompt to change the cell.

import matplotlib.pyplot as plt

# Group the 'filtered_sales' dataframe by 'Retailer' and calculate the average of 'Operating Margin' column
average_margin_by_retailer = filtered_sales.groupby('Retailer')['Operating Margin'].mean()

# Plot the average operating margin by retailer
plt.bar(average_margin_by_retailer.index, average_margin_by_retailer.values)
plt.xlabel('Retailer')
plt.ylabel('Average Operating Margin')
plt.title('Average Operating Margin by Retailer')
plt.show()

Output:

import matplotlib.pyplot as plt

# Group the 'filtered_sales' dataframe by 'Sales Method' and calculate the average of 'Operating Margin' column
average_margin_by_sales_method = filtered_sales.groupby('Sales Method')['Operating Margin'].mean()

# Plot the average operating margin by sales method
plt.bar(average_margin_by_sales_method.index, average_margin_by_sales_method.values)
plt.xlabel('Sales Method')
plt.ylabel('Average Operating Margin')
plt.title('Average Operating Margin by Sales Method')
plt.show()

Output:

GitHub Copilot Output

While Copilot generated syntax that could work for these prompts immediately, including coloring the histograms, the lack of context-awareness meant needing to change the dataframe name and column names as in prior examples. Once updated, the charts produced served their purpose.
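
The difference between plotting an average and plotting a distribution is easy to miss, so here is a small sketch of what an average can hide. The data is made up so that two retailers share the same mean operating margin; only the column names come from the dataset.

```python
import pandas as pd

# Toy data; column names follow the article, values are made up.
toy_sales = pd.DataFrame({
    "Retailer": ["Foot Locker"] * 3 + ["West Gear"] * 3,
    "Operating Margin": [0.30, 0.40, 0.50, 0.10, 0.40, 0.70],
})

# Both retailers share the same mean, which is all a bar chart of averages shows.
mean_margin = toy_sales.groupby("Retailer")["Operating Margin"].mean()

# The spread differs, which is what a distribution plot would reveal.
margin_spread = toy_sales.groupby("Retailer")["Operating Margin"].agg(["min", "max", "std"])

print(mean_margin.round(2))
print(margin_spread.round(2))
```

If the spread is what you care about, it is worth saying "distribution" or "boxplot" explicitly in the prompt or comment.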

Use-case: Building ML models

Prompt 1: Build a regression model

Use linear regression to predict operating margin based on the retailer, region, product, price per unit, total sales, and sales method.

Einblick Prompt Output

For the Prompt output, I really appreciated the details–the code comments, the automated preprocessing under certain conditions, and printing out common evaluation metrics. I never mentioned any of these in the prompt, so it’s great to see Prompt including these things that make the code more readable. While this might not be the final model, it’s a great starting point for developing a more advanced model to push to production or to present to stakeholders.

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Create a new dataframe with relevant columns
prediction_data = filtered_sales[['Retailer', 'Region', 'Product', 'Price per Unit', 'Total Sales', 'Sales Method', 'Operating Margin']]

# Check for missing values
prediction_data.dropna(inplace=True)

# Preprocess text columns
for column in prediction_data.select_dtypes(include='object').columns:
    if len(prediction_data[column].unique()) > 50:
        prediction_data.drop(column, axis=1, inplace=True)
        print(f"Dropped column '{column}' due to too many distinct values.")
    else:
        prediction_data = pd.get_dummies(prediction_data, columns=[column], drop_first=True)

# Convert datetime columns to int (if any)
for column in prediction_data.select_dtypes(include='datetime').columns:
    prediction_data[column] = pd.to_datetime(prediction_data[column]).astype(int)

# Split the data into training and testing sets
X = prediction_data.drop('Operating Margin', axis=1)
y = prediction_data['Operating Margin']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Instantiate a linear regression model
model = LinearRegression()

# Fit the model to the training data
model.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = model.predict(X_test)

# Evaluate the model's performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

Output:

Mean Squared Error: 0.004176766340000534
R-squared: 0.44841658245373084
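
To make these two numbers easier to read, it can help to convert the MSE into a root mean squared error (RMSE), which is in the same units as the target. This sketch only re-uses the two metrics printed above.

```python
import math

# Metrics copied from the printed output above.
mse = 0.004176766340000534
r2 = 0.44841658245373084

# RMSE is in the same units as the target (operating margin).
rmse = math.sqrt(mse)

print(f"RMSE: {rmse:.4f}")
print(f"Share of variance explained: {r2:.1%}")
```

The RMSE comes out at about 0.065. If operating margin is stored as a fraction between 0 and 1 (an assumption on my part, suggested by the small MSE), that is a typical error of around 6.5 percentage points, with the model explaining a bit under half of the variance.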

GitHub Copilot Attempt 1

In the first attempt, I kept accepting Copilot’s suggestions to see if it would break itself out of the loop of importing way too many libraries and functions, but it continued for quite a while.

GitHub Copilot Attempt 2

After cleaning up the import list, I added a new prompt “Create dataframe with X variables,” which seemed to help Copilot to create more productive code (with the earlier caveats about lack of context-awareness and needing to adjust code for the dataset and column names).

I was pretty happy that it was able to even add in some preprocessing, but Copilot was still only suggesting 1 line comments or 1 line of code at a time, so it took a while to actually generate the code block shown below.

Output:

I did however really appreciate that it automatically included a plot of actual and predicted values. I thought that was a nice touch, and probably indicative of some of the training data that Copilot used, which included public repos.
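
For anyone curious about the conditional preprocessing in Prompt's regression code above, here is the same rule applied to a toy dataframe: text columns with more than 50 distinct values are dropped, and the rest are one-hot encoded. The 'Invoice ID' column and all values are hypothetical, and the sketch detects text columns with is_numeric_dtype rather than select_dtypes, which is a small departure from Prompt's code.

```python
import pandas as pd

# Toy frame: 'Retailer' has few distinct values, 'Invoice ID' (hypothetical) has many.
toy_data = pd.DataFrame({
    "Retailer": ["Foot Locker", "West Gear"] * 30,
    "Invoice ID": [f"INV-{i}" for i in range(60)],
    "Total Sales": range(60),
})

# Treat every non-numeric column as a text column.
text_columns = [
    column for column in toy_data.columns
    if not pd.api.types.is_numeric_dtype(toy_data[column])
]

# Same rule as Prompt's code: drop text columns with more than 50 distinct
# values, one-hot encode the rest (dropping the first level of each).
for column in text_columns:
    if toy_data[column].nunique() > 50:
        toy_data = toy_data.drop(columns=column)
    else:
        toy_data = pd.get_dummies(toy_data, columns=[column], drop_first=True)

print(toy_data.columns.tolist())
```

The identifier-like column disappears, and 'Retailer' becomes a single indicator column, because drop_first=True removes one level to avoid perfectly redundant dummies.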

Prompt 2: Interpret the model

Plot the key drivers of the regression model

Einblick Prompt Output

The output was really great, and what I expected. If I wanted to tune this chart, I would ask Prompt to change the cell so that the key drivers were ordered by importance.

import matplotlib.pyplot as plt

# Get the feature importances from the trained model
feature_importances = model.coef_

# Create a bar plot to visualize the feature importances
plt.figure(figsize=(10, 6))
plt.bar(X.columns, feature_importances) # Change X_train.columns to X.columns
plt.xlabel('Features')
plt.ylabel('Importance')
plt.title('Feature Importances')
plt.xticks(rotation=90)
plt.show()

Output:

GitHub Copilot Attempt 1

In the first attempt, Copilot suggested using an attribute that does not exist in the model object I had generated, so I redirected Copilot in the right direction.

GitHub Copilot Attempt 2

Once I did that, it was able to generate a plot pretty easily.
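
With either tool's output, ordering the key drivers by importance is a small change. Here is a sketch with hypothetical feature names and coefficient values standing in for X.columns and model.coef_; they are not the values from the model above.

```python
import pandas as pd

# Hypothetical coefficients standing in for model.coef_ and X.columns.
feature_names = ["Price per Unit", "Total Sales", "Region_West", "Sales Method_Online"]
coefficients = [0.002, -0.00001, -0.04, 0.09]

# Pair each coefficient with its feature name.
drivers = pd.Series(coefficients, index=feature_names)

# Order by absolute size so large negative effects are not buried.
ordered_drivers = drivers.reindex(drivers.abs().sort_values(ascending=False).index)
print(ordered_drivers)
```

Keep in mind that raw linear regression coefficients depend on the scale of each feature, so standardizing the inputs first gives a fairer ranking.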

Prompt 3: Check model assumptions

Evaluate the multicollinearity in the regression model

Einblick Prompt Output

Initially, Prompt just produced the correlation values without plotting the heatmap, and, as with the Copilot output, the font in the plot was too large. I used the following two additional prompts to update the output.

  • Plot the correlation matrix as a heatmap
  • Make the font in the plot smaller

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Calculate the correlation matrix for the independent variables in the 'X_train' dataframe
correlation_matrix = X_train.corr()

# Plot the correlation matrix as a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", annot_kws={"fontsize": 8}) # Modified line
plt.title('Correlation Matrix')
plt.show()

Output:

GitHub Copilot Output

By adding the second line “Calculate the correlation matrix for the independent variables in the ‘X_train’ dataframe”, I was able to get Copilot’s code to work for the dataset initially. But tweaking the size of the font was a bit tougher.

I added the phrase “use small font” to the comment, but the result was still a bit rough.
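
Separately from the font issue, a correlation heatmap only shows pairwise relationships. A common complement, which neither tool produced here, is the variance inflation factor (VIF). The sketch below computes it with NumPy on synthetic predictors; the variable names and data are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy predictors: x3 is almost a linear combination of x1 and x2,
# so no single pairwise correlation tells the whole story.
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + x2 + rng.normal(scale=0.1, size=500)
x4 = rng.normal(size=500)
X_toy = np.column_stack([x1, x2, x3, x4])

# The VIFs equal the diagonal of the inverse of the predictors'
# correlation matrix.
correlation_matrix = np.corrcoef(X_toy, rowvar=False)
vif = np.diag(np.linalg.inv(correlation_matrix))

for name, value in zip(["x1", "x2", "x3", "x4"], vif):
    print(f"{name}: VIF = {value:.1f}")
```

In this toy setup the pairwise correlation between x1 and x3 is only moderate, yet the VIFs for x1, x2, and x3 are very large, while x4 stays close to 1. A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity.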

Fixing Errors

Screenshot highlighting fix with prompt button in Einblick

Einblick Prompt

If you ever hit an error in Einblick, you will see a “Fix with Prompt” button appear, and Prompt will then determine how to debug your code based on the error message. If you just want to adjust some of the output, for example changing the columns referenced or the appearance of a visualization, you can use the “Change this cell” option and input a new natural language query. Prompt is context-aware, so you can chat with it like an assistant using prompts like “Make the graph pastel” or “Only include columns A, B, and C in the model.”

GitHub Copilot

There is currently no way to use GitHub Copilot to fix errors in code automatically. If there is something you want to tweak about a visualization or a chunk of code, you can add more code comments where relevant and hope that Copilot will generate the correct code, or you can fix it yourself manually.

Conclusion

GitHub Copilot is a great use of AI for code completion, particularly if your preferred code editor supports the extension. Copilot requires the user to be pretty hands-on, and provides flexibility as it will sometimes provide multiple suggestions at once. But Copilot is not context-aware, and it can be tedious to receive only one line of code at a time. There is no direct way to fix code, but the ability to “query” the AI simply with code comments is an efficient way to use the existing UI, and a hacky way to “fix” any errors or missed steps in the AI-generated code. Since Copilot is not made for the data domain, and is not aware of your file system, the suggestions are mainly based on public repos’ code style, so you may find yourself heavily editing the suggestions for small things, such as column names and dataframe names. Additionally, if Copilot does not provide a suggestion that works for you, you simply have to continue coding manually or consult the documentation.

Einblick Prompt, on the other hand, leverages OpenAI's technology specifically for the data science and data analytics domain. This allows Prompt to complete complex tasks, such as model building, with little supervision. As we saw above, Prompt automatically preprocessed the data before building the regression model to avoid errors. While Prompt does not provide multiple options with every query, it can generate many lines of code from as little as one sentence. The code is commented, and if there are any issues, Prompt can fix the code with one click of a button. Using the context of the canvas, Einblick Prompt provides answers tailored to the user's specific needs. Prompt can reason about which dataframe to use out of the many that may be in a notebook, and it can auto-run the code it generates.

About

Einblick is an AI-native data science platform that provides data teams with an agile workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.