The Cheapest, Fastest, and Easiest Way to Host an LLM

Paul Yang - November 30th, 2023

In late 2022, ChatGPT took the world by storm. At that point, only OpenAI’s hosted GPT-3 had the power to deliver quality answers. But by late 2023, a variety of open-access models match the performance of GPT-3. Of course, OpenAI’s GPT-4 is still the highest-quality model across almost every benchmark of readability, code generation, and reasoning. But not every task requires the best-of-the-best. Whether for cost control, data privacy, or customizability, there are many reasons why you might want to host your own large language model (LLM).

ChatGPT at capacity message (Retrieved Jan. 16, 2023)

Many open-access models take a pre-trained base model, like Meta’s powerful LLaMA 2, and fine-tune it for better practical performance. For instance, many derivative models of LLaMA 2 have been fine-tuned for instruction following (obeying input requests), chat (question/answer), or writing code.

We can take one of these fine-tuned models and serve it ourselves from our preferred deployment platform. One note, though: while many open-access models can be used for both research and commercial purposes, be aware of the license terms before deploying a model commercially. For instance, OpenAI prohibits its model outputs from being used to fine-tune other models, yet some popular LLaMA fine-tunings were built as academic projects on GPT-4 outputs.

You can jump below for "Just the Python Code." Or you can fork the canvas embedded here.

Why is this tutorial the cheapest, fastest, and easiest?

  • Cheapest: We will show you how to load a (small) model into a CPU-only environment, with no need for heavy GPU machinery. Of course, deploying bigger and more complex models will likely require dedicated hardware.
  • Fastest: We use a model that has undergone an approach called “model quantization,” which greatly reduces the model’s size and the compute needed to return results.
  • Easiest: Here, we will show you how to load a model in 10 lines of code and serve it in 20.

What Are We Using for the Demo?

We use Python for the entirety of this program. You will need command line access for some portions. First, we need to install a few packages and download one model.

The Model: Zephyr-7B quantized to 4bits – https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF

Zephyr is a model fine-tuned by HuggingFace from the pre-trained model Mistral-7B by Mistral AI.

Then, the 16-bit model parameters were quantized by contributor TheBloke and converted to GGUF, the file format used by the llama.cpp project we discuss below.

What is Quantization?

LLMs are really big and slow, and quantization is one of the techniques we can use to make them smaller without too much loss of information. Model weights are stored as 16-bit or 32-bit floats (remember that LLMs are just multiple giant matrices of numbers), but we can use 8-bit or 4-bit integers instead.

Obviously, this conversion is not 1:1; these data types don't cover the same range. So you create a mapping function: say every value between -10 and 10 in the original model maps to 0 in the reduced model. You lose accuracy with this inexact mapping, but empirically it does not destroy performance, while shrinking model size by 2-8x and enabling much faster inference on modest hardware.
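
To make the mapping idea concrete, here is a toy sketch of symmetric 4-bit quantization in NumPy. It is only an illustration, not the actual scheme llama.cpp uses, which quantizes weights in small blocks with per-block scale factors.

import numpy as np

# Toy symmetric 4-bit quantization: map floats onto the signed integer range -8..7
weights = np.random.randn(8).astype(np.float32)      # original float32 weights
scale = np.abs(weights).max() / 7                    # one scale factor for the whole tensor
quantized = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
dequantized = quantized.astype(np.float32) * scale   # approximate reconstruction at inference time

print("max quantization error:", np.abs(weights - dequantized).max())

Each weight now needs only 4 bits (plus a shared scale factor), at the cost of the small reconstruction error printed at the end.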

Loading the Model: Llama.cpp

Llama.cpp is an open-source package designed to help enable the deployment of small models “at the edge,” with the original goal of getting LLaMA to run on a MacBook. Alongside it, there are a few helper packages that make it easier to use and deploy models, especially quantized models.

Helper Packages

Transformers

!pip install transformers

Hugging Face Transformers is a popular library that provides pre-trained models and a suite of tools for natural language processing (NLP), including state-of-the-art models for tasks like translation, text generation, and sentiment analysis.
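
We don't lean on Transformers heavily in this tutorial, but to give a flavor of what the library offers, a sentiment-analysis pipeline is a one-liner (the first call downloads a small default model; the example sentence is just illustrative):

from transformers import pipeline

# Downloads a small default sentiment model on first use
classifier = pipeline("sentiment-analysis")
print(classifier("Self-hosting an LLM was easier than I expected."))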

PyTorch

!pip install torch

PyTorch is an open-source machine learning library based on the Torch library. It is best known for building and training neural networks, but it also provides tooling used in model deployment.

Accelerate

!pip install accelerate

This one is purely a helper here. The Hugging Face Accelerate package is designed to simplify distributed training and inference for deep learning models.

Downloading the Model

We use Hugging Face here, via the huggingface_hub library, to download the model. We run two huggingface-cli commands: first to authenticate, and then to actually download the model. You will need to register with Hugging Face and generate an API token first. The model file is then saved to our working directory.

from huggingface_hub import hf_hub_download
# Authenticate (set token to your Hugging Face API token), then pull down the 4-bit GGUF file
!huggingface-cli login --token {token}
!huggingface-cli download TheBloke/zephyr-7B-beta-GGUF zephyr-7b-beta.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False

It’s a tight 4GB on disk.
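
If you would rather stay entirely in Python, the hf_hub_download import above can do the same job; roughly, with the same repo and filename:

from huggingface_hub import hf_hub_download

# Download the GGUF file into the working directory and return its local path
model_path = hf_hub_download(
    repo_id="TheBloke/zephyr-7B-beta-GGUF",
    filename="zephyr-7b-beta.Q4_K_M.gguf",
    local_dir=".",
)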

Installing and Loading the Model

The llama.cpp project makes it super easy to use quantized models in the GGUF format. First install the Python bindings, and then loading the model is just two lines:

!pip install llama-cpp-python

from llama_cpp import Llama
llm = Llama(model_path="zephyr-7b-beta.Q4_K_M.gguf")
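
If you want a bit more control, the llama-cpp-python constructor exposes a few knobs that matter on CPU-only machines; the values here are illustrative, and defaults vary by version:

llm = Llama(
    model_path="zephyr-7b-beta.Q4_K_M.gguf",
    n_ctx=2048,     # context window size in tokens
    n_threads=4,    # number of CPU threads used for inference
)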

We can see an example of the output it returns with a simple command (remember to frame the prompt with Q: and A: to signal that the model should respond with an answer):

output = llm("Q: Name the planets in the solar system? A: ", max_tokens=128, stop=["Q:", "\n"], echo=True)
print(output)

The output is a Python dictionary rather than a bare string; the generated text lives under output['choices'][0]['text'], alongside some metadata.

You can choose whatever web framework you prefer for deployment. Full example code using Flask is provided below, but the relevant snippet is simply that we should return:

result = llm(input_string, max_tokens=max_tokens, stop=["Q:", "\n"], echo=True)

Within Einblick, you simply go to Kernel settings, enable incoming connections, and launch the app at host 0.0.0.0, port 7000.

Einblick Incoming Connections Tab Screenshot

Once this is ready, I can go to Terminal on my MacBook and get a response from my very own self-hosted model:
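
For example, a minimal request with curl against the Flask route defined below might look like this (YOUR_HOST is a placeholder for wherever your kernel is reachable):

curl -X POST http://YOUR_HOST:7000/llama_tokenize \
  -H "Content-Type: application/json" \
  -d '{"input_string": "Q: Name the planets in the solar system? A: "}'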

Or in Python:
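
Roughly the same thing with the requests library (again, YOUR_HOST is a placeholder):

import requests

# POST a prompt to the self-hosted endpoint and print the generated text
resp = requests.post(
    "http://YOUR_HOST:7000/llama_tokenize",
    json={"input_string": "Q: Name the planets in the solar system? A: "},
)
print(resp.json()["response"])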

Just the Python Code

!pip install llama-cpp-python
!pip install transformers
!pip install torch
!pip install accelerate
!pip install huggingface_hub
!pip install flask

from huggingface_hub import hf_hub_download

# Authenticate with your Hugging Face API token (define token first), then download the model
!huggingface-cli login --token {token}
!huggingface-cli download TheBloke/zephyr-7B-beta-GGUF zephyr-7b-beta.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False

from llama_cpp import Llama
llm = Llama(model_path="zephyr-7b-beta.Q4_K_M.gguf")

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/llama_tokenize', methods=['POST'])
def process_llama_tokenize():
    # Get the input string from the request body
    data = request.get_json()
    if 'input_string' not in data:
        return jsonify({'error': 'No input_string provided'}), 400

    input_string = data['input_string']
    max_tokens = 128  # Set your desired max_tokens value here

    # Run the model on the provided prompt and handle the result
    try:
        result = llm(input_string, max_tokens=max_tokens, stop=["Q:", "\n"], echo=True)
        return jsonify({'response': result['choices'][0]['text']}), 200
    except Exception as e:
        return jsonify({'error': str(e)}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=7000)

About

Einblick is an AI-native data science platform that provides data teams with an agile workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.