Generative AI is hyped, but likely appropriately so. While there are plenty of articles discussing the hype, we want to talk about the hard problems facing engineers building production-ready software applications. AI models like GPT-4 are foundational technology: flexible and powerful, but tricky to use well.
Here, we share practical guidance from building Einblick Prompt, which is still in beta. We’re all building with large language models for the first time, and wanted to share our learnings ahead of launch in a few weeks.
Background
To date, most successful commercial applications of LLMs in software have been relatively straightforward. In most cases, the application calls a generative AI service by sending a single prompt and getting a single response back. Some parameters of the prompt might be tunable by user input, but mostly the API call to the language model can be thinly wrapped.
For example, an application that helps content marketers write blog posts is almost complete right out of the box. Here, we clipped part of a blog post written by GPT-3.5 with a simple input prompt, and it’s already pretty good at marketing Snowflake! The magic of the latest GPT models is that they just work, and work well out of the box.

Other examples of “give one request, get one response” include translation tools, chatbots, docs helpers, coding function writers, and so on. Language models are great at translation, summarization, and generating text in general, so these use cases fit them naturally. To be clear, this does not diminish the complexity or usefulness of these applications. These are just the solutions that can use language models natively and naturally, and therefore have moved the fastest.
But these initial use cases are not all LLMs can do. Microwaves were only for detecting tanks and planes until 1945, when an engineer wandered in front of a radar array and the chocolate bar in his pocket melted. A much broader scope of applications is emerging. People correctly noticed that if you can ask the language models to make plans and also to write code, then you conceptually have a system that can do anything. All you need is an execution engine that follows the language model’s directions; the model is the rat in Ratatouille, and your software is the chef.

This quickly expands the scale of possible useful prompts: rather than “text in, new text out,” we have “job to be done” in and “completed action” out, and that is possible for almost anything. But “anything” is hard to build for, so using LLMs becomes harder as well.
A Quick Detour - Key Terms
Let’s take a quick detour to introduce a few key terms, though you might also want a refresher from our blog posts on GPT (generative pre-trained transformers) or large language models (LLMs):
- Large Language Models (LLMs): neural networks that are built to work with text, whether predicting, summarizing, or in general creating more text from some existing input.
- Transformer Models: the neural network architecture behind most modern LLMs, which has become dominant because it is phenomenal at generating text.
- GPT-3, GPT-4: specific models in the class of transformer models
- ChatGPT, Google Bard, etc.: products built on top of the core language models
- Prompts: The text input that is given to a large language model to elicit a specific response. This might be anything from “give me a blog post of 500 words…” to “respond yes or no…” to “give me a JSON which…”
Context
Language models have general knowledge of the world, but for specific user needs, we likely need to provide the context. Some have called this giving the language model “long-term memory,” but that’s marketing talk. For instance, we might want to share data types and column names with the model if we want to generate code to make a chart or get correlations.
This is simple if there’s a single object to work on, one Excel file for instance. But the user likely has a much wider universe of things they could want to draw from: thousands of data tables in a database.
Why not give everything we know to the language model? If I want to write a paragraph about Taylor Swift, but I include context about every Taylor alive, that already feels wrong. Several specific considerations exist:
- There are hard limits on context windows. Some models cap out at a few thousand tokens; OpenAI recently raised its limit to 16,000 tokens.
- Not all context is relevant, but a lot of it could seem relevant. We can end up distracting the language model from the correct answer by providing a bunch of irrelevant information.
- You are billed for input tokens, so unintelligently throwing garbage into the context can rack up bigger bills.
- More complex inputs take longer to return a response. Throw on unnecessary context, and simple requests turn into monstrous 30-second delays.
There are two main ways we have tried to pick the right information.
Vector Storage and Retrieval

For the objects that could be provided as context, you can generate an embedding of the object, store that in a vector database, and then retrieve potentially relevant pieces of context based on vector similarity to new incoming requests. There’s plenty of documentation about this concept, so we will not discuss a precise how-to, but we will just note that it’s relatively straightforward to set up.
We use Faiss and build a vector representation of all the data objects a user might want to draw from in analysis. This works remarkably well out of the box and was quite easy to set up. In particular, we were surprised by how accurately it could retrieve the relevant elements without much fine-tuning. Granted, we typically have fewer context objects in play than your application might (data objects plus variables tends to be under 100), so your mileage may vary.
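To make the pattern concrete, here is a minimal sketch (not our exact implementation): the `embed` helper below is a stand-in for a real embeddings API, and the object descriptions are made up.

```python
import numpy as np
import faiss

DIM = 384  # arbitrary embedding size for this sketch

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embeddings API (e.g. OpenAI or sentence-transformers).
    Here we just hash words into a fixed-length vector so the sketch runs."""
    vec = np.zeros(DIM, dtype="float32")
    for word in text.lower().split():
        vec[hash(word) % DIM] += 1.0
    return vec

# Descriptions of the objects a user might reference (tables, dataframes, variables...)
context_objects = [
    "table orders: order_id int, customer_id int, total float, created_at date",
    "table customers: customer_id int, name str, region str",
    "dataframe df_movies: title str, rating str, revenue float",
]

# Build the index once, then query it for every incoming prompt.
vectors = np.vstack([embed(o) for o in context_objects])
index = faiss.IndexFlatL2(DIM)
index.add(vectors)

def retrieve_context(prompt: str, k: int = 2) -> list[str]:
    """Return the k context objects most similar to the prompt."""
    query = embed(prompt).reshape(1, -1)
    _, ids = index.search(query, k)
    return [context_objects[i] for i in ids[0]]

# Only the most similar objects get passed along with the user's request.
print(retrieve_context("plot total revenue by customer region"))
```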
Inference from User Interface
Instead of using vector similarity, one of the most powerful ways to filter and identify context is to log usage. From the user interface, you can draw cues about what is potentially relevant. Because Einblick is on a 2-dimensional canvas, we expect users to navigate and place objects in an ordered way, using our organization tools. And because data science is an iterative process, we expect a user to be following a thought process.
Therefore, we can draw upon temporal and spatial relationships between past user interactions and the current prompt to determine what context is most relevant. For instance, at the beginning of a project, when there are no datasets already in memory, the user likely wants to go to the data source to pull data. With each subsequent step, the user may transform the dataset a little bit, but likely the embedding that represents each of these transformed objects will look similar to each other–for instance, the before and after of casting columns from float to int. Only by tracking user interactions can we accurately determine what is most relevant.
You will end up wanting to combine these two approaches. But figuring out the right heuristic to choose between semantic relevance (from vector retrieval) and temporal/spatial relevance (from user interactions) is non-trivial. Typically, you expect the user to be moving along an ordered workflow, so you prioritize user cues. However, if you expect frequent context switches in how users work with your tool, then prioritize vector similarity.
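One way to combine the signals is a simple weighted score per candidate object. The weights and the recency/distance features below are illustrative rather than our production heuristic, and the object attributes (`embedding`, `last_used_at`, `canvas_distance`) are assumed for the sketch:

```python
import math
import time
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def relevance_score(obj, prompt_embedding, now=None,
                    w_semantic=0.5, w_recency=0.3, w_spatial=0.2):
    """Blend vector similarity with cues from the user interface.
    `obj` is assumed to carry an embedding, a last-touched timestamp (seconds),
    and a 2D canvas distance from where the prompt was issued."""
    now = now if now is not None else time.time()
    semantic = cosine_similarity(obj.embedding, prompt_embedding)
    # Objects touched minutes ago count more than ones touched hours ago.
    recency = math.exp(-(now - obj.last_used_at) / 600.0)
    # Objects placed near the prompt on the canvas count more than distant ones.
    spatial = 1.0 / (1.0 + obj.canvas_distance)
    return w_semantic * semantic + w_recency * recency + w_spatial * spatial

# Usage: rank candidates and keep only the top few as context, e.g.
# candidates.sort(key=lambda o: relevance_score(o, prompt_embedding), reverse=True)
```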
Latency...Will...Get...Bad

There are a few different causes of latency when using LLM services. You mostly will not be able to solve these problems unless you have a million dollars and Sam Altman’s email address. Users are accustomed to snappy, responsive workflows, so latency makes them less likely to adopt and enjoy LLM-powered functionality.
The first problem comes from providing complex inputs to transformer models like GPT-4. The input text needs to be split into smaller units called tokens, so longer inputs mean more tokens and a bigger computational workload. Storing and processing information in the model’s attention mechanism requires more memory and computation as the input length grows: the number of pairwise relationships that need to be computed grows quadratically with the number of tokens. This means that growing the length of the input will increase the delay in getting a response non-linearly.
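To get a feel for the effect, you can count tokens before sending a request. The sketch below uses the tiktoken library; the quadratic comparison count is a rough approximation of the attention cost, not an exact latency model:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

short_prompt = "Plot revenue by region."
long_prompt = short_prompt + "\n\nContext:\n" + "column description ... " * 500

for name, prompt in [("short", short_prompt), ("long", long_prompt)]:
    n = len(enc.encode(prompt))
    # Self-attention work scales roughly with n^2, so a 10x longer input
    # can mean on the order of 100x more pairwise token comparisons.
    print(f"{name}: {n} tokens, ~{n * n:,} pairwise comparisons")
```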
The second is just that the LLMs are probably running on someone else’s computer (i.e. “cloud service”), and so sometimes there’s just going to be traffic. Not for nothing, even major cloud providers like Azure have had problems scaling up their LLM services due to the boom in developers suddenly demanding these services, and an extant shortage of graphics processors.
There are a few things you can do to preemptively soften these issues from the user’s perspective:
- Share the plan: For complex, multistep processes that might require many language model API calls, share with the user what is going on. It helps the user understand that it is not a simple “one thing that is slow” but rather an evolving network of processes. This is a level of transparency that maybe is not typical in software (I don’t say “now calling X function”), but has long been used by installers which tend to be long-running processes.
- Return things as they come back: Ever notice how ChatGPT returns words one at a time? It returns them faster or slower based on load and complexity, but it makes me, the user, feel like something is going on in either case, and I can read as it generates (see the streaming sketch after this list).
- UI tricks: For user-facing steps, you can generate visual distractions along the way. You don’t just have to return the actual outputs of the language model bit by bit. We actually slowly print out statements that we have hardcoded, and have added more introductory language.
- Be upfront: Set expectations with the user outside of the software usage, and then communicate when things are slow. A few “sorries” when you reach a certain length of time do not hurt.
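Here is a minimal streaming sketch using the pre-v1 openai Python client (newer client versions expose the same idea with a slightly different call); it assumes an API key is already configured:

```python
import openai  # pre-v1 client; assumes OPENAI_API_KEY is set in the environment

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Explain what a vector database is."}],
    stream=True,  # yield chunks as they are generated instead of one final blob
)

for chunk in response:
    # Each chunk carries a small "delta" of newly generated text.
    delta = chunk["choices"][0]["delta"].get("content", "")
    print(delta, end="", flush=True)  # forward it to the UI immediately
```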
Humans in the Loop

Yes, the whole point of using multi-step complex generative AI processes is to remove the human. But the LLM doesn’t think; it just returns the text that is most probable to come next in a sequence of words. Thinking, luckily, still remains in the domain of humans, and we can hijack that to smooth out the workflow.
Build an “Ask Question” Off-ramp
LLMs tend to return convincing and confident answers. By default, the models don’t know how to “fail,” since they are just choosing the highest-probability continuation. This is what leads to fake facts and hallucinations. Instead of letting the model hallucinate, we can design prompts with “off-ramps” and ask the model to ask questions when needed. Then we just have to build the functionality to surface that back to the user in the front end. Dialogue is acceptable; confidently wrong is not.
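One simple way to build the off-ramp is to let the model choose between answering and asking in a structured reply. The prompt wording, field names, and handler below are illustrative:

```python
import json

OFFRAMP_PROMPT = """You help users analyze data.
If the request and the provided context are enough to proceed, respond with:
  {"action": "answer", "content": "<your answer>"}
If anything essential is missing or ambiguous, do NOT guess. Respond with:
  {"action": "ask", "content": "<one short clarifying question for the user>"}
Respond with a single JSON object and nothing else.
"""

def handle_model_reply(raw_reply: str) -> dict:
    reply = json.loads(raw_reply)
    if reply["action"] == "ask":
        # Surface the question in the front end instead of executing anything.
        return {"type": "clarification", "question": reply["content"]}
    return {"type": "result", "answer": reply["content"]}
```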
Make Revision Easy
When a bad result is achieved or some error is thrown in the process, make it easy for the user to trigger a revision, and feed the error back into the original process. For instance, if we ever hit a syntax error, we immediately offer the user an option to let us fix it. Alternatively, prompt dialogue from the user whether the step succeeded or failed: “What can I do next?” is easily merged with “Let me know if you have any issues.”
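A rough sketch of that loop, assuming a hypothetical `ask_llm` call and a sandboxed `run_code` helper; the key point is that the traceback goes back into the follow-up prompt rather than starting from scratch:

```python
import traceback

MAX_ATTEMPTS = 3  # avoid retrying forever

def generate_and_run(task: str, ask_llm, run_code):
    """ask_llm(prompt) -> code string; run_code(code) -> result or raises.
    Both helpers are assumed to exist elsewhere in the application."""
    prompt = f"Write Python code to: {task}"
    for attempt in range(MAX_ATTEMPTS):
        code = ask_llm(prompt)
        try:
            return run_code(code)
        except Exception:
            error = traceback.format_exc()
            # Feed the failure back into the next prompt instead of starting over.
            prompt = (
                f"The following code was written to: {task}\n\n{code}\n\n"
                f"It failed with this error:\n{error}\n"
                "Return a corrected version of the code."
            )
    raise RuntimeError("Could not produce working code; ask the user for help.")
```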
Give a Rating Option
There’s no better way to assess the quality of your generative AI processes than asking the human who made the request to leave a rating. Furthermore, you might give a frustrated user some catharsis when they have an immediate way to vent. Even if most users will not engage, you will hopefully capture a meaningful percentage of the “bad” cases that you need to fix. We think thumbs up/down is the easiest for users to engage with.
Small Changes Have a Big Impact
Building with LLMs can be frustratingly different from the determinism that comes with code: there is no fixed tie from input to output. If I add a word to a print statement in code, it just adds a word to the printed output. By contrast, if I add a word to a prompt, it might return the same output 95% of the time, but suddenly break a few cases. One of the most common cases for us is editing the instructions: the subtle difference between “Be sure to !pip install if any packages are used” and “Make sure to !pip install imported packages” recently caused errors in 10% of cases.
Also, remember that LLMs are auto-regressive. This means that both a) the order of the text as it comes in and b) the initial generated words in the output will affect the results that follow. If you are having the LLM build a JSON, you might want to swap the order of the fields, since the value generated for a prior item can affect the value of a subsequent one. We ran into a situation where having “tool”: “generate-python” caused the subsequent value in the JSON to be Python code, rather than the different metadata field we intended.
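As an illustration (the field names other than “tool” are made up), this is the kind of reordering the previous paragraph suggests:

```python
# Order that caused trouble: the "tool" value primes the model to start
# writing Python immediately, so the next field tends to come back as code.
SCHEMA_PROBLEMATIC = '{"tool": "generate-python", "title": "...", "code": "..."}'

# Reordered so the metadata fields are generated before anything code-like,
# which makes them much less likely to be contaminated with code.
SCHEMA_REORDERED = '{"title": "...", "description": "...", "tool": "generate-python", "code": "..."}'
```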
Practically though, this means that it is harder to fix small bugs. You need to thoroughly test every change, and debugging can become infuriating. We definitely feel like we are playing “whack-a-mole” sometimes.
There are some technical solutions here.
- Fine-tuning models, rather than just changing prompts around, can help ensure that overall responses are better. However, you probably will not fine-tune for every prompt, and we have over a dozen basic input prompts that we apply in different situations.
- One- or few-shot learning in the prompt itself seems to help with formatting, but you also run the risk of biasing the answer. This is a fancy way of saying that if you provide a few example input/output pairs in the prompt, the true input will likely get an output that looks more like the example outputs. This is great for formatting problems, but can hurt open-ended questions, because the model may imitate the substance of the examples and not just their format.
- OpenAI recently added function calling to its API, intended to help developers “more reliably get structured data back from the model.” We have started making use of it, and while not perfect, it reduces the likelihood of misformatted responses causing bugs in software (see the sketch below).
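For example, with the pre-v1 openai client, forcing a function call constrains the model’s reply to arguments matching a JSON schema. The function name and fields below are illustrative, not Einblick’s actual schema:

```python
import json
import openai  # pre-v1 client; newer versions expose the same idea as "tools"

functions = [{
    "name": "create_chart",  # illustrative function name
    "description": "Create a chart from a dataframe",
    "parameters": {
        "type": "object",
        "properties": {
            "dataframe": {"type": "string"},
            "x": {"type": "string"},
            "y": {"type": "string"},
            "kind": {"type": "string", "enum": ["bar", "line", "scatter"]},
        },
        "required": ["dataframe", "x", "y", "kind"],
    },
}]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": "Bar chart of revenue by region from df_orders"}],
    functions=functions,
    function_call={"name": "create_chart"},  # force structured arguments back
)

# The arguments come back as a JSON string; still validate before trusting them.
args = json.loads(response["choices"][0]["message"]["function_call"]["arguments"])
print(args)
```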
Otherwise, our recommendation is to collect a number of issues to be fixed at once. Be ready to spend quality time fixing issues.
A Few Errors to Catch
These are a few error cases that we did not think about handling until we hit them head-on:
- Censorship: When we pass context, we don’t always know what goes into those blocks. For example, a movies dataset with rating tags might say “R - Sex, Violence,” which then trips the content filter when the generated code contains `isSex` and `isViolence`. Make sure you read the AI service’s error documentation and explicitly catch these cases, since they are coded differently and carry payloads you might want to log.
- Timeouts: Sometimes the language model services slow to a crawl; responses that typically take 1 second have reached 400 seconds of latency. At some point, kill the job, even if there’s no explicit “failure” and the request would eventually succeed (see the sketch after this list).
- Endless Question Loops: If you give models the chance to ask users for clarification, make sure there is an escape condition from outside of the language model itself. Sometimes, the language model can get stuck clarifying forever, or fail to progress to the next step. For multi-step chains of prompts, set a limit at every given step.
- Verbosity: The language model sometimes likes to ramble on after the end of a block (for instance, a description of the code after writing the code). Make sure you validate returned objects against the expected schema and have a contingency when they don’t match, whether JSON, code, or something else.
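Here is a rough sketch of catching two of these cases (timeouts and the content filter) with the pre-v1 openai client; the timeout value, model, and logging are illustrative:

```python
import logging
import openai  # pre-v1 client; assumes an API key is configured

def call_llm(messages, timeout_s=60):
    try:
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=messages,
            request_timeout=timeout_s,  # kill the request instead of waiting forever
        )
    except openai.error.Timeout:
        logging.warning("LLM call timed out after %ss", timeout_s)
        return None
    choice = response["choices"][0]
    if choice.get("finish_reason") == "content_filter":
        # These cases are coded differently from normal failures; log the payload.
        logging.warning("Response truncated by the content filter: %r", choice)
        return None
    return choice["message"]["content"]
```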
In Summary
Most early commercial applications of Large Language Models (LLMs) have been relatively straightforward, involving a single prompt and response. Examples include content marketing, translation tools, chatbots, etc. These applications work well because the language models can naturally handle these use cases.
To build more complex applications, we need to learn new engineering skills and creatively work with the LLM services. During this article, we covered a few key areas:
- Adding Context: To provide context to language models for specific user needs, embedding and retrieval systems can be used. Vector storage and retrieval is one approach, while inference from user interface interactions is another. Combining these approaches can help in determining the most relevant context.
- Working with Latency: Latency can be a challenge when using LLMs due to the complexity of inputs and potential traffic on cloud services. We suggest strategies like sharing the plan with the user, returning outputs as they come in, using UI tricks for visual distractions, and setting upfront expectations to mitigate latency issues.
- Human in the Loop: Human involvement remains important in the workflow of LLM-powered processes. Off-ramps and question-asking mechanisms can be designed to avoid hallucinations and to solicit user input in case of errors. Providing options for revision and incorporating rating systems can also enhance the quality of generative AI processes.
- Small Changes are Hard: Even small changes in the input or prompts can have a significant impact on the results generated. It is crucial to thoroughly test and debug any changes made to ensure desired outcomes. Technical solutions such as fine-tuning models and using one or few-shot learning in the prompt can help improve overall responses and formatting. OpenAI's recent release of a function calling API can also assist in getting structured data from the model more reliably.
- Errors to Catch: We finally highlight some error cases that engineers may encounter when using LLMs, such as censorship triggers, timeouts, endless question loops, and verbosity in generated outputs. Handling these error cases proactively improves the reliability of the application.
About
Einblick is an AI-native data science platform that provides data teams with an agile workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.