Although chatbots have evolved considerably over the last 50 years, it’s worth examining that history to better understand the technology underlying large language models, or LLMs. Since OpenAI’s breakthroughs in 2022, from DALL·E’s image generation to ChatGPT’s raps, essays, soliloquies, and other generated text, LLMs have entered the popular lexicon in a big way. Previously, we brought you a crash course in GPTs (Generative Pre-trained Transformers), the model underlying OpenAI’s technology. In this post, we’ll go a bit broader, looking at large language models as a whole category, of which GPT is just one example.
What is a large language model (LLM)?
Put simply, a large language model (LLM) is a type of model that uses deep learning to process natural language. LLMs are trained on large datasets drawn from sources such as books, articles, websites, and scripts, and use neural networks to learn the relationships between words, phrases, and sentences. They learn patterns, grammar, and semantics from the data, enabling them to generate coherent, relevant, and natural-sounding answers and interpretations.
Large language models can be used for tasks such as machine translation, text summarization, question answering, sentiment analysis, and more. Although LLMs can seem remarkable to us as end users, a few basic truths of machine learning still apply to them, just as they do to any other model:
- Training Time: large language models require a lot of training data and can take days or even weeks to train, depending on the size of the model and amount of data available.
- Compute Resources: training large language models requires significant compute resources.
- Data Quality: language models are only as good as the data they are trained on, so it is important to ensure that any datasets used have high quality and accuracy before training begins.
- Data Collection and Preparation: creating datasets that accurately represent a domain of interest takes time and effort in data collection, cleaning, normalization, and annotation before model training can begin. Additionally, existing datasets may not always reflect real-world scenarios, leading to difficulty generalizing learned behavior when models are tested in production settings.
Ways LLMs can differ
Large language models can diverge from each other along several axes or characteristics:
- Model Size: one important axis of divergence is the size of the model, often measured in terms of the number of parameters it contains. Larger models tend to have more expressive power and can capture finer-grained language nuances, but they also require more computational resources for training and inference.
- Training Data: the corpus of training data used to pre-train the models can vary. Models can be trained on different sources, such as books, websites, or domain-specific data. The size and diversity of the training data can impact the models' ability to generalize and handle specific domains or topics.
- Architecture: language models can employ different architectures, such as the Transformer architecture or its variants. Architectural choices can affect the models' capacity to capture long-range dependencies, handle context, and facilitate parallel processing.
- Pre-training: how a language model is initially trained and tuned is an important way in which models differ. A model can be pre-trained with a particular industry or corpus in mind, such as texts relevant to the medical field. This can change how effective the model is at generating natural language in particular contexts.
- Fine-tuning: the fine-tuning process can also vary across models. Different models may require specific fine-tuning techniques, data setups, or task-specific adaptations. The approach to fine-tuning can impact how well the model generalizes and performs on specific downstream tasks.
- Model Bias: language models can exhibit biases inherited from the training data. Models may be more or less prone to biases based on their training data sources and the methods used to mitigate biases during training.
- Language Support: some models are specifically designed for a particular language, while others are multilingual and can handle multiple languages.
These are just a few dimensions along which large language models can diverge. Each model's characteristics can have implications for performance, generalization, and suitability for specific tasks or domains. Researchers and developers consider these dimensions when selecting and evaluating models for their specific needs.
Pre-training and fine-tuning
In the context of large language models, pre-training refers to the initial phase of training where the model is exposed to a large corpus of unlabeled text data. During pre-training, the model learns to predict missing or masked words within the input text, essentially developing a deeper understanding of language patterns, grammar, and contextual relationships. This task is commonly known as masked language modeling (MLM) or masked token prediction.
To perform masked language modeling, a certain percentage of words in the input text are randomly masked, and the model is trained to predict the original words. By training on this task, the model learns to capture the statistical regularities and contextual dependencies present in the text. This pre-training process allows the model to acquire a broad understanding of language, including grammar, semantics, and world knowledge, without any specific task in mind.
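To make the masking step concrete, here is a minimal sketch in plain Python. It is a simplification: real recipes such as BERT’s operate on subword token IDs rather than whole words, and sometimes substitute a random token or leave the token unchanged instead of always inserting a mask symbol.

```python
import random

MASK = "[MASK]"  # placeholder symbol, in the style of BERT's mask token

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly hide roughly mask_prob of the tokens and record the originals.

    Returns the masked sequence plus a dict mapping each masked position
    to the original token the model must learn to predict.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

tokens = "the cat sat on the mat".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3)
print(masked)   # sequence with some tokens replaced by [MASK]
print(targets)  # masked positions mapped to the tokens to recover
```

During pre-training, the model receives `masked` as input and is penalized for failing to predict the tokens stored in `targets`.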
Once the pre-training phase is complete, the model is then fine-tuned on specific downstream tasks. Fine-tuning involves training the model on labeled data for specific tasks, such as text classification, question answering, or language translation. By fine-tuning on task-specific data, the model adapts its pre-trained knowledge to the specifics of the target task, leveraging the broad understanding of language acquired during pre-training.
Pre-training is a crucial step in large language models as it enables the model to learn from vast amounts of unsupervised data and develop a rich representation of language. This pre-trained knowledge serves as a foundation that can be further fine-tuned for specific tasks, allowing the model to generalize well and achieve strong performance on a wide range of natural language processing tasks.
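The division of labor between pre-training and fine-tuning can be sketched with a toy example. Everything here is illustrative: the `encode` function below is a hypothetical stand-in for a frozen pre-trained encoder (it just computes character frequencies), and fine-tuning is reduced to training a small logistic classification head on a handful of labeled sentiment examples.

```python
import math

def encode(text):
    """Stand-in for a frozen pre-trained encoder (purely illustrative):
    maps text to a normalized bag-of-characters feature vector."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    total = sum(vec) or 1.0
    return [v / total for v in vec]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(text, w, b):
    """Probability that text is positive, under the trained head."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, encode(text))) + b)

def avg_loss(examples, w, b):
    """Mean cross-entropy of the classification head on the examples."""
    total = 0.0
    for text, label in examples:
        p = min(max(predict(text, w, b), 1e-12), 1 - 1e-12)
        total -= label * math.log(p) + (1 - label) * math.log(1 - p)
    return total / len(examples)

def fine_tune(examples, steps=3000, lr=5.0):
    """Gradient descent on the head only; the 'encoder' stays frozen."""
    w, b = [0.0] * 26, 0.0
    for _ in range(steps):
        gw, gb = [0.0] * 26, 0.0
        for text, label in examples:
            x = encode(text)
            err = predict(text, w, b) - label  # prediction error
            gw = [g + err * xi for g, xi in zip(gw, x)]
            gb += err
        w = [wi - lr * g / len(examples) for wi, g in zip(w, gw)]
        b -= lr * gb / len(examples)
    return w, b

train = [("great movie", 1), ("loved it", 1),
         ("terrible film", 0), ("hated it", 0)]
w, b = fine_tune(train)
```

The key design point mirrors real fine-tuning: the expensive general-purpose representation is reused as-is, and only a small, task-specific component is updated on labeled data.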
Major LLMs throughout (recent) history
BERT is a large-scale language model introduced by Google researchers in 2018. It was developed by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT is based on the Transformer architecture and is trained on a large corpus of unlabeled text data.
What sets BERT apart is its bidirectional training approach. Unlike previous models that processed text in one direction (either left-to-right or right-to-left), BERT is trained in both directions, enabling it to better understand the context and meaning of words. This bidirectional training allows BERT to capture the dependencies between the preceding and succeeding words, resulting in improved language understanding.
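The difference in what the model gets to see can be illustrated with a tiny conceptual sketch (real models operate on token IDs and attention masks, not word lists):

```python
def left_to_right_context(tokens, i):
    """Context a unidirectional (GPT-style) model sees for position i."""
    return tokens[:i]

def bidirectional_context(tokens, i):
    """Context a BERT-style model sees for a masked token at position i."""
    return tokens[:i] + tokens[i + 1:]

tokens = "the bank raised interest rates".split()
print(left_to_right_context(tokens, 1))   # ['the']
print(bidirectional_context(tokens, 1))   # ['the', 'raised', 'interest', 'rates']
```

When disambiguating a word like "bank", the bidirectional context includes the words that follow it, which is often exactly the evidence a unidirectional model lacks.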
T5, or Text-to-Text Transfer Transformer, is a large-scale language model developed by researchers at Google Research. It was introduced in 2019 and represents a significant advancement in language modeling. T5 is built on the Transformer architecture and is trained using a "text-to-text" framework, in which it learns to map input text to output text.
Unlike traditional models that are trained for specific tasks, T5 is trained on a vast range of tasks and can be fine-tuned for various downstream applications. It is trained in a supervised manner using a large dataset consisting of pairs of input and target text. T5's flexibility allows it to be easily adapted to different tasks by simply providing the appropriate input-output text pair during fine-tuning.
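The text-to-text framing can be sketched in a few lines. The task prefixes below follow the style used in the T5 paper (e.g. "translate English to German", "summarize"), but the exact strings and targets here are illustrative, not quoted training data:

```python
def to_text_to_text(task_prefix, source):
    """Cast any task into T5's uniform text-in, text-out format by
    prepending a natural-language task prefix to the input."""
    return f"{task_prefix}: {source}"

# Every task, from translation to classification, becomes a pair of
# (input text, target text). Targets shown are illustrative.
pairs = [
    (to_text_to_text("translate English to German", "That is good."),
     "Das ist gut."),
    (to_text_to_text("summarize", "<long article text>"),
     "<short summary>"),
    (to_text_to_text("cola sentence", "The course is jumping well."),
     "not acceptable"),
]
print(pairs[0][0])  # translate English to German: That is good.
```

Because every task shares one input/output format, the same model, loss, and decoding procedure serve translation, summarization, and classification alike; switching tasks is just a matter of switching the prefix.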
XLNet is another notable large-scale language model introduced in 2019 by researchers at Carnegie Mellon University and Google. It was developed by Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet is built upon the Transformer-XL architecture and incorporates an autoregressive training method.
Unlike previous models that rely on left-to-right or right-to-left context, XLNet leverages a permutation-based approach. It considers all possible permutations of the input sequence during training, enabling it to capture dependencies beyond the traditional causal context. This approach allows XLNet to better model long-range dependencies and improve language understanding.
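A small sketch of the permutation idea, in plain Python. This is conceptual only: XLNet realizes sampled factorization orders through attention masks over a fixed token order, rather than by literally shuffling the input.

```python
import random

def sample_factorization_order(n, seed=0):
    """Sample one of the n! orders in which an autoregressive model
    could predict a sequence of n tokens. XLNet trains over sampled
    orders like this instead of always predicting left-to-right."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return order

def prediction_contexts(tokens, order):
    """For each position, the tokens it is conditioned on under the
    given order; these can include tokens to its right in the
    original sequence."""
    contexts, seen = {}, []
    for pos in order:
        contexts[pos] = list(seen)
        seen.append(tokens[pos])
    return contexts

tokens = "new york is a city".split()
order = sample_factorization_order(len(tokens))
ctx = prediction_contexts(tokens, order)
print(order)  # one sampled prediction order over positions 0..4
print(ctx)    # per-position conditioning sets under that order
```

Averaged over many sampled orders, every token is eventually predicted from every possible subset of the others, which is how XLNet gets bidirectional-style context while remaining autoregressive.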
LaMDA (Language Model for Dialogue Applications) is a language model developed by Google, and first released in 2021. LaMDA is designed specifically for conversational interactions and aims to improve natural language understanding and generation in dialogue-based systems.
Traditional language models often generate responses in a one-turn-at-a-time manner, lacking a broader understanding of the ongoing conversation. LaMDA addresses this limitation by considering the conversation as a whole, allowing it to maintain context and generate responses that are more relevant and coherent.
Google has highlighted that LaMDA’s gains come from improvements in language understanding, a better grasp of nuance, context, and the subtleties of conversation, rather than simply from scaling up model size.
The goal of LaMDA is to enable more fluid and natural conversations between humans and AI systems. It holds promise for applications such as chatbots, virtual assistants, and other dialogue-based systems where maintaining coherent and context-aware conversations is crucial for a satisfying user experience.
GPT through the years
GPT (Generative Pre-trained Transformer) refers to a series of large language models developed by OpenAI. Each iteration of the GPT series builds upon the previous one, introducing improvements in size, capabilities, and performance. Here's a breakdown of the key differences between GPT, GPT-2, GPT-3, and GPT-4:
- GPT was the original model in the GPT series, introduced in June 2018.
- GPT’s innovation focused on pre-training the model on unlabeled data, combined with the transformer architecture. The model was then fine-tuned on labeled data.
- GPT demonstrated impressive performance on various natural language processing tasks, showcasing the potential of large language models.
- GPT-2 was released in February 2019 and represented a significant advancement over its predecessor.
- It featured a much larger model size, with 1.5 billion parameters, more than 10 times the parameters of GPT, and was trained on more than 10 times the data, allowing it to capture more complex language patterns.
- GPT-2 exhibited remarkable language generation capabilities, generating coherent and contextually relevant responses.
- Due to concerns about misuse by malicious actors, OpenAI initially released only a limited version of GPT-2.
- GPT-3 was introduced in May 2020.
- It is significantly larger than GPT-2, with a whopping 175 billion parameters, making it one of the largest language models ever created.
- GPT-3 showcased extraordinary language understanding and generation abilities, producing human-like articles that readers found difficult to distinguish from human-written text.
GPT-3.5 and ChatGPT
- The InstructGPT models were built to follow instructions using reinforcement learning from human feedback (RLHF). These models were first released in January 2022.
- This technique was subsequently used for GPT-3.5, which powers ChatGPT.
- ChatGPT was fine-tuned using human conversations.
GPT-4
- GPT-4, the most recent model in the series, was released in March 2023.
- The main difference in GPT-4 versus the earlier iterations is that the model is multimodal, and thus can take text input as well as image input.
- GPT-4 also seems much more adept at completing human benchmarks, such as the bar exam.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. ‘BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding’. ArXiv [Cs.CL], 2019. arXiv. http://arxiv.org/abs/1810.04805.
Peters, Matthew E., Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. ‘Deep Contextualized Word Representations’. ArXiv [Cs.CL], 2018. arXiv. http://arxiv.org/abs/1802.05365.
Raffel, Colin, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. ‘Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer’. ArXiv [Cs.LG], 2020. arXiv. http://arxiv.org/abs/1910.10683.
Yang, Zhilin, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. ‘XLNet: Generalized Autoregressive Pre-training for Language Understanding’. ArXiv [Cs.CL], 2020. arXiv. http://arxiv.org/abs/1906.08237.
Einblick is an agile data science platform that provides data scientists with a collaborative workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick customers include Cisco, DARPA, Fuji, NetApp and USDA. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.