Whenever there is extreme hype surrounding a particular technology, art piece, or restaurant, it makes sense to be a bit skeptical. But luckily for everyone, both in and out of the data community, the hype around large language models (LLMs) is probably justified. For the first time, we have a number of generative models that allow us to create text that reads naturally, follows from logical thought, and takes into account context.
Even if you are skeptical, you might have noticed that Google Translate (DeepL, etc.) has gotten quite good in the last few years. Ten years ago, the translated text always read awkwardly, and colloquialisms might be missed. Now, Google can do your Spanish homework (what this means for teachers and educators is a separate topic). Importantly, some of the advancements in translation that we’ve been consciously or unconsciously aware of are founded on the same principles that power the LLM revolution.
The space of LLMs is now a gold rush, meaning that innovation can be clouded by marketing speak, vaporware, and a thousand new tools and brands. The top influencer article for many of these concepts is empty of facts and content. Our goal is to share semi-technical content that will help ground you factually in this fast-shifting hype cycle.
This piece will cover four sections:
- Transformers: What is the underlying technology?
- GPT: How does it work, in brief?
- Making the Model Better: What are the technical ways to work with the model?
- From a Great Model to a Great App: What are people doing to make usable software?
Transformers (the T in GPT)
Transformers are a new-ish type of AI model that do really well at natural language tasks because they capture contextual relationships and long-range dependencies between words better than previous models.
Transformer models are a relatively new type of neural network architecture that has gained popularity in the last five years. Previous AI approaches, including recurrent neural networks, had already become popular in the natural language processing (NLP) space. Transformers improve on handling natural language text, namely by handling longer input sequences and capturing contextual relationships between words.
Under the hood, the original transformer model consists of an encoder and a decoder, each made up of many stacked layers that combine self-attention with small feed-forward networks. The encoder processes the input sequence and produces one context-aware vector per input token, rather than the single fixed-length context vector used by earlier recurrent encoder-decoder models. The decoder then generates the output sequence one token at a time, attending both to the encoder's vectors and to the tokens it has already generated. During training, the transformer model learns how to minimize the difference between its predicted output sequence and the correct output sequence. GPT models, covered later in this piece, use only the decoder half of this design.
Overall, the ability of transformers to model long-range dependencies and capture contextual relationships between words in natural language makes them highly effective for a wide range of NLP tasks.
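One mechanism behind this context handling is scaled dot-product self-attention, in which every token's representation becomes a weighted blend of all the tokens in the sequence. The sketch below is a simplified illustration in plain Python, not a real transformer layer: the toy vectors are made up, and the token vectors are used directly as queries, keys, and values, whereas a real model first passes them through learned projections.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Scaled dot-product attention over a short sequence of token vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        # How strongly this token attends to every token in the sequence.
        weights = softmax([dot(q, k) / scale for k in keys])
        # Blend the value vectors using those attention weights.
        mixed = [sum(w * v[i] for w, v in zip(weights, values))
                 for i in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy 2-dimensional vectors for three tokens (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Each row of `weights` says how much one token draws on every other token, regardless of how far apart they sit in the sequence, which is what lets the model relate distant words.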
Here are some things that are related, but not Transformer models:
- Neural Networks: A neural network is a machine learning model that uses interconnected nodes to process input data and generate output predictions, and the transformer model is a specific type of neural network designed for natural language processing tasks. All Transformers are neural networks, but many other types of neural networks exist.
- ChatGPT: An application built on GPT-3.5, a specific transformer model, with additional features to manage memory and a front-end that’s accessible through the web.
- Generative Images: Transformers can produce images, but there are other approaches as well, like Generative Adversarial Networks (GANs), which employ two neural networks in tandem.
Generative Pre-Trained Transformers (GPT!)
GPTs are trained to predict the next word.
As the title states, GPT stands for "Generative Pre-trained Transformer". It is a type of neural network architecture that is based on the Transformer architecture.
During training, the GPT model is fed a sequence of words, and at each position it predicts a probability distribution over the vocabulary for the next word based on the previous words. Its weights are adjusted so that the word that actually came next receives a higher probability. When generating text, the model samples from the predicted distribution to select the next word, appends that word to the sequence, and repeats the process until the desired length of text is reached. Strictly speaking, the model works on tokens, which are words or pieces of words.
For instance, the sentence so far might be “Paul needs apples. He is going to the” and the task is to complete it. The next word might be “supermarket,” “grocery,” or “orchard,” each with a different probability. Choosing between the first two probably doesn’t make a huge difference, but correctly determining from context whether “orchard” makes sense is important. The model, through training over huge datasets and its ability to capture relationships between words across sentences, would assign “orchard” a higher probability if the prior text was “Paul is a farmer. Paul needs apples.”
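To make the "Paul needs apples" example concrete, here is a minimal sketch of turning per-word scores into probabilities and sampling one word. The scores are invented for illustration and do not come from any real model; a real GPT produces such scores for every token in its vocabulary.

```python
import math
import random

def softmax(logits):
    """Convert a {word: score} mapping into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {word: math.exp(score - peak) for word, score in logits.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

def sample_next_word(probs, rng):
    """Pick one word at random, weighted by its probability."""
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=1)[0]

# Hypothetical scores for the word after "Paul needs apples. He is going to the".
logits_plain = {"supermarket": 2.0, "grocery": 1.8, "orchard": 0.2}
# Hypothetical scores when the context also includes "Paul is a farmer."
logits_farmer = {"supermarket": 1.0, "grocery": 0.9, "orchard": 2.5}

probs_plain = softmax(logits_plain)
probs_farmer = softmax(logits_farmer)

rng = random.Random(0)  # seeded so repeated runs behave the same way
next_word = sample_next_word(probs_farmer, rng)
```

Because the next word is sampled rather than always taken as the top choice, the same prompt can produce different completions from run to run.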
This has significantly advanced the state-of-the-art in NLP and led to the development of highly effective language model-based applications which can generate human-like responses to text-based conversations. But it is important to remember that all the model is doing is guessing the next word. Remember - they don’t think!
Three Ways to Get Better Responses: Context, Fine-Tuning, and New Models
Providing context, fine-tuning, and retraining are all ways to improve the performance of a GPT model, but they differ in their level of complexity and the amount of data required.
GPT models have been around for years, but all of a sudden, ChatGPT burst onto the scene. Within about two months, it had reportedly amassed 100 million users, providing a truly helpful and enjoyable AI-to-user experience. It is important to distinguish between the two: ChatGPT is the software and GPT is the engine. We can try to dissect ChatGPT’s application of GPT models as a loose case study to understand how language models can be tailored to become wildly successful applications.
Giving GPT a Memory
Providing context to a GPT model involves giving the model additional information about the task at hand, which helps it generate more accurate and relevant responses. For example, if the task is to generate a response to a customer support query, providing the model with information about the customer's history or the product they are using can help the model generate a more accurate response. This process is relatively simple and can be done with a small amount of additional data provided as part of the prompt. The caveat is that models are limited in the amount of information that can be provided as context, usually between one and ten thousand words.
When you are in ChatGPT, it is able to leverage your existing conversations to build upon and respond to your further inquiries for clarification. The GPT model itself has no memory, and so it relies on the application to provide it with historical conversation.
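A rough sketch of how an application might supply that memory: it keeps the conversation itself and prepends as many recent turns as fit into each new prompt. This is an illustration only, not how ChatGPT is actually implemented. The function name `build_prompt` and the word budget are assumptions, and word counts stand in for the token counts a real system would use.

```python
def build_prompt(history, new_message, max_words=50):
    """Assemble a prompt from the most recent turns that fit a word budget."""
    turns = history + [("user", new_message)]
    kept, used = [], 0
    # Walk backwards so the newest turns are kept first.
    for speaker, text in reversed(turns):
        cost = len(text.split())
        # Always keep the newest message; stop once older turns no longer fit.
        if kept and used + cost > max_words:
            break
        kept.append(f"{speaker}: {text}")
        used += cost
    # Restore chronological order before sending the prompt to the model.
    return "\n".join(reversed(kept))

history = [
    ("user", "What is a transformer?"),
    ("assistant", "A neural network architecture that is well suited to language tasks."),
]
prompt = build_prompt(history, "How is GPT related to it?")
```

With a tighter budget, for example `max_words=5`, only the newest message survives, which is why long conversations can appear to "forget" their beginning.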
Fine-tuning a GPT Model
Fine-tuning involves training the model on a specific task using a small or moderate amount of task-specific data. The model is initialized with its pre-trained weights and then trained further on the new task. Fine-tuning can significantly improve the model's performance on the new task with relatively little additional training data. Down the line, we get higher-quality results with fewer contextual tokens needed (tokens, rather than words, are the unit of measurement). We could always supply a lot of context to try to force the same result as a fine-tune, but that uses up the limited context on every request.
ChatGPT currently uses a version of GPT that was fine-tuned for conversation. Conversational chat is just one of the many use-cases for GPT models, but tuning for chat creates a much more natural experience for a human user. In this case, we are neither providing context nor retraining the entire model, but fine-tuning for chat.
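The idea of starting from pre-trained weights can be shown with a deliberately tiny stand-in: a one-parameter model `y = weight * x` trained by gradient descent. Nothing here resembles a real GPT. The numbers are made up, and the point is only that the same few training steps land much closer to the target when they start from a sensible weight than when they start from zero.

```python
def train(weight, examples, steps=5, learning_rate=0.01):
    """Run a few gradient-descent steps on the one-parameter model y = weight * x."""
    for _ in range(steps):
        # Gradient of the mean squared error with respect to the weight.
        gradient = sum(2 * (weight * x - y) * x for x, y in examples) / len(examples)
        weight -= learning_rate * gradient
    return weight

# A small task-specific dataset that follows y = 2.5 * x (made-up numbers).
task_examples = [(1.0, 2.5), (2.0, 5.0), (3.0, 7.5)]

pretrained_weight = 2.0  # stands in for weights learned on a large general dataset
fine_tuned = train(pretrained_weight, task_examples)
from_scratch = train(0.0, task_examples)

# Distance from the ideal weight of 2.5 after the same number of steps.
fine_tuned_error = abs(fine_tuned - 2.5)
from_scratch_error = abs(from_scratch - 2.5)
```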
A New Model
Retraining a GPT model involves training the model from scratch on a new dataset. This is a much more complex and time-consuming process than fine-tuning and requires a large amount of data to achieve good results. Retraining is often used when the task at hand is very different from the tasks that the model was pre-trained on, such as image captioning or speech recognition. Retraining allows the model to learn new patterns and relationships in the data and can result in a much more accurate and effective model for the new task.
While the count of parameters is largely marketing mumbo-jumbo, since there’s not a direct correlation to model quality, it is important to note that GPT-1 had just 117 million parameters and GPT-3 had 175 billion, meaning each released generation of GPT is materially more powerful than the last. When ChatGPT moves to GPT-4, it will be a version upgrade that will be interesting to see.
The Next Generation of GPT-powered Apps
Building applications on GPT models will depend on taking the core engine and building a chassis around the model itself to fit user needs.
While the ChatGPT chatbot is well-suited to answering general questions in dialogue with a user, it is not the only possible extension of the core GPT completion model. There are cynical and naive implementations of GPT technology that are just repackaging the existing model endpoint and selling it at a markup…but there are also many creative ways to take our “very smart next word guesser” and create really impactful applications.
Fine-Tuning for Specific Domains
This one is largely self-explanatory, but it bears repeating. By intelligently tuning the existing GPT model for the specific use-case that must be solved, domain-specific applications can be significantly improved. The trick here is that even though the dataset required to fine-tune is small (“few-shot learning”), finding the right example and answer pairs requires deep customer knowledge. Someone who is outside the medical profession will not be able to easily produce fine-tuning examples for a medical use-case. Testing and then user-testing their fine-tunings is what made OpenAI’s ChatGPT so great.
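In practice, a fine-tuning dataset is often just a file of example and answer pairs. The sketch below writes such pairs as JSON Lines (one JSON object per line) and flags incomplete entries. The field names `prompt` and `completion` and the sample content are assumptions for illustration; the service or library used for fine-tuning defines the exact format it expects.

```python
import json

def to_jsonl(examples):
    """Serialize examples as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(example) for example in examples)

def find_incomplete(examples):
    """Return the indexes of examples with a missing or empty field."""
    return [
        index for index, example in enumerate(examples)
        if not example.get("prompt", "").strip()
        or not example.get("completion", "").strip()
    ]

# Hypothetical pairs that a domain expert would write and review.
examples = [
    {"prompt": "Customer cannot log in after a password reset.",
     "completion": "Ask the customer to clear saved passwords and try the reset link again."},
    {"prompt": "Customer was charged twice for one order.",
     "completion": "Apologize, confirm the order number, and open a refund ticket."},
]

incomplete = find_incomplete(examples)
jsonl = to_jsonl(examples)
```

The code is the easy part. Writing pairs that reflect how an expert would actually respond is where the domain knowledge comes in.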
A chatbot has it easy: just grab context from the last few blocks of chat. For an application of arbitrary purpose, finding the right context at the right time is hard. A prompt for a medical chatbot might include a patient's symptoms and medical history, and then also studies that relate to those symptoms to help diagnose. But we cannot feed every patient’s entire history into the model as context. In fact, we may be limited to roughly 3,000 words of context. Intelligently saving all possible context, designing a retrieval system, and then providing just the right context at the right time will make an application seem smart.
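A retrieval step of this kind can be sketched in a few lines. The version below ranks stored notes by how many words they share with the question and keeps the best ones that fit a word budget. Word overlap is a crude stand-in chosen to keep the example self-contained; real systems typically use more capable similarity measures. The notes, function names, and budget are all hypothetical.

```python
def overlap_score(query, snippet):
    """Count how many distinct query words also appear in the snippet."""
    query_words = set(query.lower().split())
    snippet_words = set(snippet.lower().split())
    return len(query_words & snippet_words)

def select_context(query, snippets, max_words=20):
    """Pick the most relevant snippets that fit within the word budget."""
    ranked = sorted(snippets, key=lambda s: overlap_score(query, s), reverse=True)
    chosen, used = [], 0
    for snippet in ranked:
        cost = len(snippet.split())
        # Skip irrelevant snippets and any that would exceed the budget.
        if overlap_score(query, snippet) == 0 or used + cost > max_words:
            continue
        chosen.append(snippet)
        used += cost
    return chosen

# Hypothetical notes saved by a customer-support application.
notes = [
    "Order 1042 shipped on Monday and arrives Friday",
    "Customer prefers email over phone calls",
    "Refund for order 0991 was issued last month",
]
context = select_context("when does order 1042 arrive", notes, max_words=12)
```

Here only the first note is selected: it shares the most words with the question, and the budget leaves no room for the loosely related refund note.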
Even outside of context management, designing effective prompts can help ensure that the model produces helpful responses. If the prompt result is intended to feed into a technical function, we can ask GPT to return its answer as JSON, with the prompt spelling out the schema explicitly. We can request as part of a prompt that returned code completions are devoid of comments, or that they are richly commented. An application designed for younger students might always carry the imperative to “use simple language” as part of any prompt.
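As a sketch of the JSON idea, the prompt below spells out the shape it expects, and the application checks the reply before handing it to other code. The schema, the function names, and the sample reply are invented for illustration, and no real model is called.

```python
import json

def build_prompt(question):
    """Wrap a question with formatting instructions for the model."""
    return (
        "Answer the question below. Respond with JSON only, using exactly "
        'this shape: {"answer": string, "confidence": number between 0 and 1}. '
        "Use simple language.\n\n"
        f"Question: {question}"
    )

def parse_reply(reply):
    """Parse the model's reply and check that it has the requested fields."""
    data = json.loads(reply)  # raises ValueError if the reply is not valid JSON
    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON object")
    if not isinstance(data.get("answer"), str):
        raise ValueError("missing or non-string 'answer'")
    if not isinstance(data.get("confidence"), (int, float)):
        raise ValueError("missing or non-numeric 'confidence'")
    return data

prompt = build_prompt("What does the T in GPT stand for?")
# Stand-in for a model response; a real application would get this from the model.
sample_reply = '{"answer": "Transformer", "confidence": 0.9}'
result = parse_reply(sample_reply)
```

Validating the reply matters because the model is only asked, not forced, to follow the schema, so the application should be ready to retry or fall back when parsing fails.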
Knowing that all GPT can do is predict the next word in a sequence of words, it takes a clever user to trick it into reasoning through complex problems. If we ask GPT to return the list of steps to execute, and then ask how it would execute each step, we’ve essentially tricked the model into reasoning step-by-step. We actually discuss this in more detail in our blog post about why LangChain is so exciting. The hype cycle has also come around for Auto-GPT, which is an implementation of using GPT to do multi-step reasoning.
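The plan-then-execute pattern can be sketched without any particular framework. In the example below, `ask_model` is a placeholder for whatever function sends a prompt to a GPT model, and `fake_model` is a hypothetical offline stand-in that returns canned text. This is not the LangChain or Auto-GPT API, only the general shape of the loop.

```python
def run_chain(goal, ask_model):
    """Ask for a plan first, then ask the model to carry out each step in turn."""
    plan = ask_model(f"List the steps needed to: {goal}. One step per line.")
    steps = [line.strip() for line in plan.splitlines() if line.strip()]
    results = []
    for step in steps:
        # Earlier results are passed back in so each step can build on them.
        done_so_far = "\n".join(results)
        results.append(
            ask_model(f"Goal: {goal}\nDone so far:\n{done_so_far}\nNow do: {step}")
        )
    return steps, results

def fake_model(prompt):
    """Hypothetical stand-in for a GPT call so the sketch runs offline."""
    if prompt.startswith("List the steps"):
        return "Load the data\nClean the data\nPlot the result"
    # Echo back the step that was requested on the last line of the prompt.
    return "done: " + prompt.rsplit("Now do: ", 1)[-1]

steps, results = run_chain("summarize monthly sales", fake_model)
```

Each call is still just next-word prediction. The step-by-step behavior comes from the application feeding the model's own plan and earlier results back to it.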
From text to data pipeline with Einblick Prompt
At Einblick, we’re utilizing the latest in large language models to speed up development for data analysts and data scientists. Currently, we’re building Einblick Prompt, which will let users build entire data workflows with just one sentence. We believe the data space is a great domain for leveraging the technology behind LangChain and transformers like GPT because data projects all begin with natural language. The natural language tasks and requests then need to be converted into code, graphs, and decks by data analysts and data scientists, who know what the data pipeline needs to look like. This is a complex, iterative, multi-step process. So if you’re interested in getting access to Einblick Prompt first, join our waitlist today.
Einblick is an AI-native data science platform that provides data teams with an agile workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.