Earlier this year we released Einblick Prompt, a large language model (LLM) powered code and no-code generation assistant built specifically for data scientists. Since its launch, thousands of people have used Prompt and we’ve received a lot of great feedback that allowed us to iterate on its capabilities quickly. The key to Einblick Prompt’s success is its ability to produce accurate results due to superior context-awareness.
LLM Context for Data Notebooks
Context is crucial for code generation. General GenAI tools like ChatGPT can quickly return boilerplate code for a high-level task, but the results are too generic because they lack information about the rest of the codebase. LLM-powered code generators like GitHub Copilot improve accuracy by identifying relevant code to include with a user prompt. However, code alone is typically not sufficient context for data science code generation.
Data scientists write code that is tightly coupled to the content of the datasets and files they work with, e.g., code to load data into a dataframe, perform joins, apply transformations, etc. Omitting this domain-specific context will often just produce overly generic boilerplate code that won't run without additional refinement by the programmer. For better performance, LLMs need access to this additional domain-specific data context.
Before we dive into how this can be accomplished within a data notebook, let’s look at some popular code generators and see if they can produce error-free code for a basic task (spoiler: they can’t).
We pick an example where the starting point is a notebook containing one cell with the following code that loads and displays the contents of a CSV file.
```python
import pandas as pd
import numpy as np

marketing = pd.read_csv("marketing.csv")
marketing.head(3)
```

```
   Age        Married                    EducationLevel  Employed
0   32  widow/widower  Associate's degree, occupational     False
1   46        married                   Master's degree     False
2   65        married                   Master's degree      True
```
The user now wants, in a second cell, to create a visualization of the Age column. The input prompt they use is: “plot age”.
Wrong Variable Names: Here’s what the first notebook AI we tested returned:
```python
# plot age
import matplotlib.pyplot as plt

plt.figure(figsize=(10,5))
plt.hist(df['age'], bins=20)
plt.xlabel('age')
plt.ylabel('count')
```

❌ Error
The code errors out. It used the wrong dataframe name (df instead of marketing) as well as the wrong column name (age instead of Age).
Request is too narrowly and literally interpreted: Let’s see what the second notebook AI returns:
```python
# plot age
plt.hist(df['age'], bins=20)
```

❌ Error
Executing this code yields several errors: a missing import (plt is not defined), the wrong dataframe name (df instead of marketing), and the wrong column name (age instead of Age).
Hallucinates a Data Source: And a third one
```python
import matplotlib.pyplot as plt

ages = [19, 23, 27, 30, 35, 41, 52, 60, 65, 72]
counts = [5, 8, 12, 9, 7, 5, 3, 2, 2, 1]

plt.scatter(ages, counts)
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Age Distribution')
plt.show()
```
The code generated by the third notebook AI we tested does not use the marketing data frame at all but uses hard-coded values instead.
Not a single one of the systems we tested generates code that runs on this basic example! Of course, a user could tweak the prompt to include more details about what they want to accomplish, or alternatively, accept these incorrect responses and refine them until the code runs without errors. But a code generator defeats its own purpose if the user has to carefully type out variable names and spell out long context in the input prompt.
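For comparison, here is what runnable code for this prompt looks like once the dataframe name (marketing) and column name (Age) are known. This version is hand-written for illustration, not output from any of the tools tested:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so this sketch runs anywhere
import matplotlib.pyplot as plt

# Stand-in for the dataframe loaded in the first cell; in the notebook
# it would already exist via marketing = pd.read_csv("marketing.csv")
marketing = pd.DataFrame({"Age": [32, 46, 65]})

plt.figure(figsize=(10, 5))
plt.hist(marketing["Age"], bins=20)
plt.xlabel("Age")
plt.ylabel("Count")
```

Producing this automatically requires knowing what the user's notebook actually contains, which is exactly what context provides.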
The good news is that many of these issues can be resolved by preparing and exposing more of the context available in a data notebook to the LLM. Most importantly:
- Dataframes that are currently in memory and their contents
- Constants and variables stored in kernel memory
- Code in other cells
- Files in the filesystem
Today we’re releasing our Einblick Prompt extension for JupyterLab that’ll help users generate, modify, and fix code in JupyterLab. In the following, we’re going to shed some light on the internals of the extension, explaining the importance of the different pieces of context and how we process and compute them to enhance prompts that are sent to the underlying LLM.
Context generation in Einblick’s JupyterLab extension
So what exactly makes up the context we want to provide to the data science-oriented code generator?
- What existing variables can I reference, and which of those are dataframes? What kind of data is in the dataframes?
- What cells exist near where I’m generating new code? In particular, which code is above (i.e., “before”) the new code I’m generating?
- What CSV and TSV files are available to me in the kernel’s filesystem?
Here’s a rough overview.
Since the notebook (or canvas if you’re using Einblick) is backed by a running kernel, we can retrieve which variables exist in memory and what types they are. In particular, it’s useful to learn which variables are dataframes and to get some information on them such as column names and a few sample rows.
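As a rough sketch, this kernel-side introspection might look like the following: scan a namespace for variables, and for anything dataframe-like, record its columns and a few sample rows. The function name and heuristics here are illustrative, not Einblick's actual implementation:

```python
def summarize_namespace(ns):
    """Summarize variables in a kernel namespace for use as LLM context."""
    summaries = {}
    for name, value in ns.items():
        if name.startswith("_"):
            continue  # skip IPython/Jupyter bookkeeping variables
        info = {"type": type(value).__name__}
        # Duck-type dataframes so this works without importing pandas here.
        if hasattr(value, "columns") and hasattr(value, "head"):
            info["columns"] = list(value.columns)
            info["sample"] = value.head(3).to_dict("records")
        summaries[name] = info
    return summaries
```

In a real JupyterLab extension, a summary like this would be computed by running code against the live kernel (via Jupyter's kernel messaging), since the frontend does not hold the variables itself.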
Next, we want to narrow down this list of variables to those that are actually relevant; there can be a lot of junk floating around in memory. It’s useful to assign a weight to each variable, and use these weights to cull the list or hint to the code generator. This weight is based on a combination of two basic spatial heuristics and a semantic relevance heuristic.
Spatial heuristics are simple. Given the location of the new cell we’re generating and the location of some other variable in code, we ask: 1) what is the distance between the two code blocks, and 2) is the variable above (i.e., “before”) our location in code?
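A toy version of such a weighting, with made-up constants, could look like this:

```python
def spatial_weight(var_cell_index, target_cell_index):
    """Weight a variable by where it was defined relative to the new cell.

    Both arguments are cell positions in the notebook; the constants below
    are illustrative, not Einblick's actual weights.
    """
    distance = abs(target_cell_index - var_cell_index)
    weight = 1.0 / (1 + distance)  # closer cells score higher
    if var_cell_index >= target_cell_index:
        weight *= 0.5  # penalty for variables defined at or below the new cell
    return weight
```

The weights can then be used to drop low-scoring variables from the context entirely, or to order the remaining ones in the prompt.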
But sometimes a variable created at the beginning of a notebook needs to be used later. We can also identify those variables with semantic relevance to the input query. This is done by first creating embeddings of variables using context metadata; for instance, for dataframes, the embedding might be generated from a summary of the data contents. Since all variables are saved in a vector DB, when we receive a user query, we can use vector search and retrieval to identify which variables are semantically most relevant to the query.
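The retrieval step above can be sketched with plain cosine similarity. In practice the embeddings would come from an embedding model applied to a textual summary of each variable, and the lookup would go through a vector database rather than a linear scan:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def most_relevant_variables(query_embedding, variable_embeddings, top_k=3):
    """Rank variable names by semantic similarity to the user query."""
    ranked = sorted(
        variable_embeddings.items(),
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]
```

This is how a variable defined far away in the notebook can still outrank a nearby one when it clearly matches the user's request.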
The steps above provide us with context about specific variables, but sometimes it’s important to include the full code in which those variables are referenced or assigned. We use the same spatial heuristics mentioned above and assume that whatever was executed immediately before is likely linked to the new code that should be generated.
Finally, the context would be incomplete without knowledge of the data files available to be read. For this, we look for CSV and TSV files in the current directory the notebook has open and make sure to include the full path to the file.
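This last step is a simple directory scan; something along these lines (a sketch, not the extension's actual code):

```python
from pathlib import Path

def list_data_files(directory="."):
    """Return full paths of CSV/TSV files that generated code could read."""
    return sorted(
        str(path.resolve())
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in {".csv", ".tsv"}
    )
```

Including full paths matters: generated code like pd.read_csv(...) fails if the model has to guess where a file lives.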
Thus, our context includes all the basic information we might need to quickly read the landscape. This includes: data files available to be read, nearby code, variables assigned or referenced in that nearby code, and which of those variables are dataframes. Once this context is assembled, it is straightforward to plug it into an LLM-prompt template and request Python code generation. Once the returned response is parsed, the code should be ready for execution in the Jupyter cell with no changes needed.
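To make the final step concrete, a prompt template of the kind described might look like the following. The wording and fields are our illustration, not Einblick's actual template:

```python
PROMPT_TEMPLATE = """You are a Python data science assistant.

Dataframes in memory:
{dataframes}

Nearby code:
{code}

Data files available:
{files}

User request: {request}
Return only runnable Python code for a single notebook cell."""

def build_prompt(dataframes, code, files, request):
    """Fill the template with the assembled notebook context."""
    return PROMPT_TEMPLATE.format(
        dataframes="\n".join(dataframes) or "(none)",
        code=code or "(none)",
        files="\n".join(files) or "(none)",
        request=request,
    )
```

With the dataframe summary for marketing included, the "plot age" prompt from earlier gives the model everything it needs to emit the correct variable and column names.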
Within the Einblick app itself (not the plugin), we actually slow things down and take a few steps to restate, interpret, and plan for code generation. This allows intermediate LLM calls to reason about and highlight important context (and eliminate irrelevant information). More accurate incorporation of context leads to higher-quality results.
As a notebook grows in size and complexity, there is more and more information we could potentially include in the context we send the code generator. Yet as we include more context, we begin to reach some limitations. For one, more context is not always better: some of it often serves only as noise, and the larger your notebook, the more pieces of logic you have that are not immediately relevant to one another. Another limitation is size, since an LLM's context window is bounded. It therefore becomes important to intelligently limit what is included in the context.
In Einblick, rather than having cells in a linear notebook layout, a user places cells on a 2D canvas. This provides an entirely new dimension (literally) of spatial context that we can use when deciding what additional context to send the code generator. Since cells can be moved around freely, distances between cells on a canvas provide a richer heuristic than distances between cells in a linear notebook. On a canvas, the user’s decision on where to place cells is more deliberate and thus more meaningful. Additionally, cells can be loosely grouped in Einblick into blocks called “zones.” We can use these groupings as another heuristic.
Another advantage Einblick has over a regular notebook is that it can include external data connections in its context. Data connections (e.g., a connection to a SQL database) are defined in the application layer rather than in code. Thus, once Einblick knows about a new connection, a user can opt to include knowledge of that connection in the context. This allows a code generator to know about the different tables and schemas in a database automatically.
Our extension for JupyterLab is available free of charge. Give it a try and let us know what you think!
Einblick is an AI-native data science platform that provides data teams with an agile workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.