What is LoRA or Low-Rank Adaptation? How is it applied to LLMs?

Paul Yang - November 21st, 2023

LoRA, or Low-Rank Adaptation, is a technique used to fine-tune large language models in a more efficient way. To understand LoRA at a CS 101 level, let's break down the concept.

Defining Large Language Models (LLMs) and Fine Tuning

We presume that you already know what Large Language Models are – models that take a text input and return a probability distribution over the next word in the sequence. Language models have millions or even billions of parameters, which are like switches that need to be adjusted to make the machine work correctly. Adjusting all these parameters requires a lot of computational power and data, and large language models are trained on billions and now trillions of input words.

Fine-tuning is taking a model that already has been pre-trained, and teaching it something more specific. For instance, common use-cases include:

  • Changing the tone and voice of the outputs, to more consistently reflect a way of talking without having to prompt in detail.
  • Informing the output structure, helping the model return results in consistent formats like JSON, or ensuring the response contains exactly the fields and specifications required.
  • Controlling model behavior, ensuring specific response qualities like always being concise, or matching the language of the input request instead of responding in English when the input was non-English.

Remember that other than fine-tuning, many other approaches exist to achieve the style and structure of response that you want.

  • One- and Few-Shot Learning, which just involves including one or a few examples in the input prompt itself – for instance, including a couple of sentences written by Shakespeare if you want the model to mimic that style (see the sketch after this list).
  • OpenAI Function Calling, which involves defining a specific schema that responses must be returned in, as can be defined by a JSON schema.
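To make the few-shot idea concrete, here is a minimal sketch in Python (the prompt wording and example sentences are made up for illustration; any LLM completion endpoint could consume this string):

```python
# A few-shot prompt: a couple of in-style examples are packed directly into
# the input, so no fine-tuning is needed for the model to pick up the pattern.
few_shot_prompt = """Rewrite each sentence in the style of Shakespeare.

Sentence: The meeting is cancelled.
Rewritten: Alas, our gathering is foresworn and shall not come to pass.

Sentence: Please reply to my email.
Rewritten: I pray thee, send word in answer to my letter.

Sentence: The server is down again.
Rewritten:"""

# Sending this prompt to a completion endpoint would have the model continue
# the pattern established by the two examples above.
print(few_shot_prompt)
```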

Low-Rank Adaptation (LoRA)

LoRA is a clever shortcut for fine-tuning. Instead of adjusting all the parameters, LoRA proposes to only add and adjust a small set of new parameters.

Linear Algebra Reminders

  • Multiplying an m x k matrix with a k x n matrix yields a matrix of size m x n.
  • The rank of an m x n matrix is at most min(m, n); the product of an m x k and a k x n matrix therefore has rank at most k.
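Both facts are easy to check numerically; a quick sketch with NumPy (the dimensions here are arbitrary toy values):

```python
import numpy as np

# An (m x k) matrix times a (k x n) matrix gives an (m x n) matrix.
m, k, n = 6, 2, 5
A = np.random.randn(m, k)
B = np.random.randn(k, n)

product = A @ B
print(product.shape)                   # (6, 5)

# Even though the product is m x n, its rank is at most k.
print(np.linalg.matrix_rank(product))  # 2 for random A and B
```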

Imagine we have a pre-trained neural network layer that transforms input data x into output data y using a weight matrix W and a bias vector b. The transformation looks like this:

y = W * x + b

Here, W is a matrix of size m x n, where m is the number of output features and n is the number of input features, and b is a vector of size m.
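In code, this layer is just a matrix-vector product plus a bias; a small NumPy sketch (the sizes are toy values, not real LLM dimensions):

```python
import numpy as np

m, n = 8, 4                # number of output features, input features
W = np.random.randn(m, n)  # pre-trained weight matrix
b = np.random.randn(m)     # pre-trained bias vector

x = np.random.randn(n)     # input data
y = W @ x + b              # the layer's transformation: y = W * x + b
print(y.shape)             # (8,)
```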

Put into practical terms, imagine we had an LLM layer represented by a weight matrix of size 40,000 x 10,000 (not an unreasonable estimate for a GPT-3 layer, for instance), which amounts to 400 million values in the matrix.

Now, we want to fine-tune this layer for a new task using LoRA. Instead of updating the entire weight matrix W, we will introduce two smaller matrices A and B such that A is of size m x k and B is of size k x n, where k is much smaller than both m and n (this is the "low-rank" part – the product A * B can have rank at most k).

Just as an example, imagine we pick k = 2 as an extreme case. Then we can have A: 40,000 x 2, and B: 2 x 10,000, for a grand total of 100,000 potential parameters to tune across these two matrices – just 1/4,000th of the original 400 million. Thus, rather than relearn W, we simply have to learn A and B through the fine-tuning process.
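The parameter counting is simple enough to verify directly (same illustrative dimensions as above, not measured GPT-3 values):

```python
m, n, k = 40_000, 10_000, 2

full_params = m * n          # parameters in the original weight matrix W
lora_params = m * k + k * n  # parameters in A (m x k) plus B (k x n)

print(full_params)                 # 400000000 (400 million)
print(lora_params)                 # 100000
print(full_params // lora_params)  # 4000, i.e. a 4,000x reduction
```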

Then, it is easy to apply, as the LoRA adaptation updates the weight matrix:

W' = W + A * B

This example shows the key idea: introduce a low-rank structure that captures the essence of the changes needed for the new task, thereby reducing the number of parameters that need to be trained.
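Putting the pieces together, here is a minimal sketch of a LoRA-adapted layer in NumPy (a toy illustration under the assumptions above, not a reference implementation; real LoRA implementations also scale A * B by a constant factor):

```python
import numpy as np

m, n, k = 8, 4, 2

# Frozen pre-trained weights: these are never updated during fine-tuning.
W = np.random.randn(m, n)
b = np.random.randn(m)

# Small trainable LoRA matrices: fine-tuning only learns these.
A = np.zeros((m, k))       # commonly initialized so that A * B starts at zero
B = np.random.randn(k, n)

def lora_forward(x):
    # Equivalent to using the adapted weights W' = W + A * B.
    return (W + A @ B) @ x + b

x = np.random.randn(n)
print(lora_forward(x).shape)  # (8,)
```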

Benefits of LoRA

  • Efficiency: Since only a small number of parameters are trained, LoRA requires far less memory than traditional fine-tuning methods and can adapt a model much faster.
  • Flexibility: Using low-rank matrices allows selective adaptation of different parts of the model. For example, you may decide to adapt only certain layers or components, providing a modular approach to model personalization (see the sketch below).
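As a hedged sketch of what this looks like in practice, the Hugging Face peft library lets you pick which modules receive LoRA matrices (this assumes peft and transformers are installed; the module names to target vary by model architecture):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

# Apply LoRA only to GPT-2's attention projection layers, leaving the rest frozen.
config = LoraConfig(
    r=8,                        # the rank k from the discussion above
    lora_alpha=16,              # scaling factor applied to A * B
    target_modules=["c_attn"],  # GPT-2's combined attention projection module
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()  # only the LoRA parameters are trainable
```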

In summary, LoRA is like a smart hack in the world of machine learning that allows us to update giant language models quickly and efficiently by focusing on a small, manageable set of changes, all of which can be understood through the lens of linear algebra and matrix operations.
