Which Quantization Method is Right for You? (GPTQ vs. GGUF vs. AWQ)




Throughout the last year, we have seen the Wild West of Large Language Models (LLMs). The pace at which new technology and models were released was astounding! As a result, we have many different standards and ways of working with LLMs.

In this article, we will explore one such topic, namely loading your local LLM through several (quantization) standards. With sharding, quantization, and different saving and compression strategies, it is not easy to know which method is suitable for you.

Throughout the examples, we will use Zephyr 7B, a fine-tuned variant of Mistral 7B that was trained with Direct Preference Optimization (DPO).

🔥 TIP: After each example of loading an LLM, it is advised to restart your notebook to prevent OutOfMemory errors. Loading multiple LLMs requires significant RAM/VRAM. You can reset memory by deleting the models and resetting your cache like so:

# Delete any models previously created
del model, tokenizer, pipe

# Empty VRAM cache
import torch
torch.cuda.empty_cache()
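
Depending on your environment, it can also help to run Python's garbage collector before emptying the CUDA cache. This is an optional addition to the snippet above:

# Optionally trigger Python's garbage collector as well
import gc
gc.collect()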

UPDATE: I uploaded a video version to YouTube that goes more in-depth into how to use these quantization methods.

1. HuggingFace

The most straightforward, and vanilla, way of loading your LLM is through 🤗 Transformers. HuggingFace has created a large suite of packages that allow us to do amazing things with LLMs!

We will start by installing 🤗 Transformers, among other packages, from its main branch to support newer models:

# Latest HF transformers version for Mistral-like models
pip install git+https://github.com/huggingface/transformers.git
pip install accelerate bitsandbytes xformers

After installation, we can use the following pipeline to easily load our LLM:

from torch import bfloat16
from transformers import pipeline

# Load in your LLM without any compression tricks
pipe = pipeline(
    "text-generation", 
    model="HuggingFaceH4/zephyr-7b-beta", 
    torch_dtype=bfloat16, 
    device_map="auto"
)

This method of loading an LLM generally does not perform any compression tricks for saving VRAM or increasing efficiency. As a rough estimate, bfloat16 uses two bytes per parameter, so a 7B-parameter model needs on the order of 14GB of VRAM for the weights alone.

To generate our prompt, we first have to create the necessary template. Fortunately, this can be done automatically if the chat template is saved in the underlying tokenizer:

# We use the tokenizer's chat template to format each message
# See https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot.",
    },
    {
        "role": "user", 
        "content": "Tell me a funny joke about Large Language Models."
    },
]
prompt = pipe.tokenizer.apply_chat_template(
    messages, 
    tokenize=False, 
    add_generation_prompt=True
)

The generated prompt, using the internal prompt template, is constructed like so:

The prompt is automatically constructed using the tokenizer's internal chat template. Notice that there are different tags for differentiating between the user and the assistant.
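
For reference, the Zephyr chat template produces a prompt along these lines. The exact special tokens are defined by the tokenizer, so treat this as an illustration rather than guaranteed output:

print(prompt)

# Expected output, roughly:
# <|system|>
# You are a friendly chatbot.</s>
# <|user|>
# Tell me a funny joke about Large Language Models.</s>
# <|assistant|>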

Then, we can start passing the prompt to the LLM to generate our answer:

outputs = pipe(
    prompt, 
    max_new_tokens=256, 
    do_sample=True, 
    temperature=0.1, 
    top_p=0.95
)
print(outputs[0]["generated_text"])

This gives us the following output:

Why did the Large Language Model go to the party?

To network and expand its vocabulary!

The punchline may be a bit cheesy, but Large Language Models are all about expanding their vocabulary and networking with other models to improve their language skills. So, this joke is a perfect fit for them!

For pure inference, this method is generally the least efficient as we are loading the entire model without any compression or quantization strategies.

It is, however, a great method to start with as it allows for easy loading and using the model!

2. Sharding

Before we go into quantization strategies, there is another trick that we can employ to reduce the necessary VRAM for loading our model. With sharding, we are essentially splitting our model up into small pieces or shards.

Sharding an LLM is nothing more than breaking it up into pieces. Each individual piece is much easier to handle and might prevent memory issues.

Each shard contains a smaller part of the model and aims to work around GPU memory limitations by distributing the model weights across different devices.

Remember when I said we did not perform any compression tricks before?

That was not entirely true…

The model that we loaded, Zephyr-7B-β, was actually already sharded for us! If you go to the model and click the “Files and versions” link, you will see that the model was split up into eight pieces.

The model was split up into eight small pieces or shards. This decreases the necessary VRAM as we only need to handle these small pieces.

Although we can shard a model ourselves, it is generally advised to be on the lookout for quantized models or even quantize them yourself.

Sharding is quite straightforward using the Accelerate package:

from accelerate import Accelerator

# Shard our model into pieces of 4GB
accelerator = Accelerator()
accelerator.save_model(
    model=pipe.model, 
    save_directory="/content/model", 
    max_shard_size="4GB"
)

And that is it! Because we sharded the model into pieces of 4GB instead of the original ~2GB shards, we end up with fewer files to load.
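
If you later want to load these shards back in, the Accelerate package can stream them onto the available devices. Below is a minimal sketch, assuming the shards were saved to /content/model as above:

from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

# Build an empty ("meta") model from the config, then stream in the shards
config = AutoConfig.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

model = load_checkpoint_and_dispatch(
    model,
    checkpoint="/content/model",  # directory containing the saved shards
    device_map="auto"
)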

3. Quantize with Bitsandbytes

A Large Language Model is represented by a bunch of weights and activations. These values are generally represented by the usual 32-bit floating point (float32) datatype.

The number of bits tells you something about how many values it can represent. Float32 can represent values with magnitudes roughly between 1.18e-38 and 3.4e38, quite a range! The lower the number of bits, the fewer values it can represent.

Common value representation methods. We aim to keep the number of bits as low as possible whilst maximizing both the range and precision of the representation.

As you might expect, if we choose a lower bit size, then the model becomes less accurate but it also needs to represent fewer values, thereby decreasing its size and memory requirements.

A different representation method can reduce the precision with which values are represented, to the point that some values cannot be represented at all (values too large for float16, for example). Examples were calculated with PyTorch.
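
To get a feel for this, here is a small PyTorch snippet in the spirit of the examples above. It shows how a value loses precision in 16-bit datatypes and how values outside float16's range overflow:

import torch

# Pi loses some precision when cast to 16-bit datatypes
value = torch.tensor(3.14159265358979, dtype=torch.float32)
print(value.to(torch.float16))   # tensor(3.1406, dtype=torch.float16)
print(value.to(torch.bfloat16))  # tensor(3.1406, dtype=torch.bfloat16)

# Values larger than float16's maximum (~65504) overflow to infinity
print(torch.tensor(100_000.0).to(torch.float16))  # tensor(inf, dtype=torch.float16)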

Quantization refers to converting an LLM from its original Float32 representation to something smaller. However, we do not simply want to use a smaller bit variant but map a larger bit representation to a smaller bit without losing too much information.

In practice, we see this often done with a new format, named 4bit-NormalFloat (NF4). This datatype does a few special tricks in order to efficiently represent a larger bit datatype. It consists of three steps:

  1. Normalization: The weights of the model are normalized so that we expect the weights to fall within a certain range. This allows for more efficient representation of more common values.

  2. Quantization: The weights are quantized to 4-bit. In NF4, the quantization levels are spaced according to the quantiles of a normal distribution, which matches the distribution of the normalized weights and allows the original 32-bit weights to be represented efficiently in just 4 bits.

  3. Dequantization: Although the weights are stored in 4-bit, they are dequantized to the compute dtype (for example, bfloat16) during the forward pass, so the actual computations run at higher precision. A simplified numeric sketch of this three-step flow follows below.
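
To make these steps concrete, here is a deliberately simplified sketch. It uses plain absmax rounding rather than the actual NF4 codebook, so treat it purely as an illustration of the normalize-quantize-dequantize flow:

import torch

weights = torch.randn(8)  # stand-in for a block of model weights

# 1. Normalization: scale the block into [-1, 1] using its absolute maximum
scale = weights.abs().max()
normalized = weights / scale

# 2. Quantization: round to one of 15 integer levels, which fits in 4 bits
quantized = torch.round(normalized * 7).to(torch.int8)  # values in [-7, 7]

# 3. Dequantization: reconstruct an approximation for computation
dequantized = (quantized.float() / 7) * scale
print(weights)
print(dequantized)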

To perform this quantization with HuggingFace, we need to define a configuration for the quantization with Bitsandbytes:

from transformers import BitsAndBytesConfig
from torch import bfloat16

# Our 4-bit configuration to load the LLM with less GPU memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # 4-bit quantization
    bnb_4bit_quant_type='nf4',  # Normalized float 4
    bnb_4bit_use_double_quant=True,  # Second quantization after the first
    bnb_4bit_compute_dtype=bfloat16  # Computation type
)

This configuration allows us to specify which quantization levels we are going for. Generally, we want to represent the weights with 4-bit quantization but do the inference in 16-bit.

Loading the model in a pipeline is then straightforward:

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# Zephyr with BitsAndBytes Configuration
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta",
    quantization_config=bnb_config,
    device_map='auto',
)

# Create a pipeline
pipe = pipeline(model=model, tokenizer=tokenizer, task='text-generation')
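
As an optional sanity check, you can inspect how much memory the 4-bit model occupies with get_memory_footprint(), which 🤗 Transformers exposes on any loaded model:

# Report the model's memory footprint in gigabytes
print(f"Model size: {model.get_memory_footprint() / 1e9:.2f} GB")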

Next up, we can use the same prompt as we did before:

# We will use the same prompt as we did originally
outputs = pipe(
    prompt, 
    max_new_tokens=256, 
    do_sample=True, 
    temperature=0.7, 
    top_p=0.95
)
print(outputs[0]["generated_text"])

This will give us the following output:

Why did the Large Language Model go to the party?

To network and expand its vocabulary!

The punchline may be a bit cheesy, but Large Language Models are all about expanding their vocabulary and networking with other models to improve their language skills. So, this joke is a perfect fit for them!

Quantization is a powerful technique to reduce the memory requirements of a model whilst keeping performance similar. It allows for faster loading, use, and fine-tuning of LLMs even on smaller GPUs.

4. Pre-Quantization (GPTQ vs. AWQ vs. GGUF)

Thus far, we have explored sharding and quantization techniques. Although they are useful to have in your skill set, it seems rather wasteful to apply them every time you load a model.

Instead, these models have often already been sharded and quantized for us. TheBloke, in particular, is a user on HuggingFace who has quantized a huge number of models for us to use.

At the moment of writing this, he has uploaded more than 2000 quantized models for us!

These quantized models actually come in many different shapes and sizes. Most notably, the GPTQ, GGUF, and AWQ formats are most frequently used to perform 4-bit quantization.

GPTQ: Post-Training Quantization for GPT Models

GPTQ is a post-training quantization (PTQ) method for 4-bit quantization that focuses primarily on GPU inference and performance.

The idea behind the method is that it compresses all weights to 4-bit by minimizing the mean squared error between the original and quantized layer outputs. During inference, it dynamically dequantizes its weights to float16 for improved performance whilst keeping memory low.

For a more detailed guide to the inner workings of GPTQ, definitely check out the following post: 4-bit Quantization with GPTQ

We start with installing a number of packages we need to load in GPTQ-like models in HuggingFace Transformers:

pip install optimum
pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/

After doing so, we can navigate to the model that we want to load, namely “TheBloke/zephyr-7B-beta-GPTQ” and choose a specific revision.

These revisions essentially indicate the quantization method, compression level, size of the model, etc.

For now, we are sticking with the “main” branch as that is generally a nice balance between compression and accuracy:

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load LLM and Tokenizer
model_id = "TheBloke/zephyr-7B-beta-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=False,
    revision="main"
)

# Create a pipeline
pipe = pipeline(model=model, tokenizer=tokenizer, task='text-generation')
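
If you want a different trade-off between compression and accuracy, you can point the revision parameter at one of the repository's other branches. Branch names differ per repository (check the "Files and versions" tab), so the branch below is only an illustrative example:

# Example only: load a specific quantization branch instead of "main"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=False,
    revision="gptq-4bit-32g-actorder_True"  # hypothetical branch name; verify it exists
)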

Although we installed a few additional dependencies, we could use the same pipeline as we used before, which is a great benefit of using GPTQ.

After loading the model, we can run a prompt as follows:

# We will use the same prompt as we did originally
outputs = pipe(
    prompt,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.1,
    top_p=0.95
)
print(outputs[0]["generated_text"])

This gives us the following generated text:

Why did the Large Language Model go to the party?

To show off its wit and charm, of course!

But unfortunately, it got lost in the crowd and couldn’t find its way back to its owner. The partygoers were impressed by its ability to blend in so seamlessly with the crowd, but the Large Language Model was just confused and wanted to go home. In the end, it was found by a group of humans who recognized its unique style and brought it back to its rightful place. From then on, the Large Language Model made sure to wear a name tag at all parties, just to be safe.

GPTQ is the most commonly used compression method since it is optimized for GPU usage. It is definitely worth starting with GPTQ and switching over to a CPU-focused method, like GGUF, if your GPU cannot handle such large models.

GGUF: GPT-Generated Unified Format

Although GPTQ does compression well, its focus on GPU can be a disadvantage if you do not have the hardware to run it.

GGUF, previously GGML, is a quantization method that allows users to use the CPU to run an LLM but also offload some of its layers to the GPU for a speed up.

Although using the CPU is generally slower than using a GPU for inference, it is an incredible format for those running models on CPU or Apple devices. Especially since we are seeing smaller and more capable models appearing, like Mistral 7B, the GGUF format might just be here to stay!

Using GGUF is rather straightforward with the ctransformers package, which we will need to install first:

pip install ctransformers[cuda]

After doing so, we can navigate to the model that we want to load, namely “TheBloke/zephyr-7B-beta-GGUF” and choose a specific file.

Like GPTQ, these files indicate the quantization method, compression level, size of the model, etc.

We are using “zephyr-7b-beta.Q4_K_M.gguf” since we focus on 4-bit quantization:

from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer, pipeline

# Load LLM and Tokenizer
# Use `gpu_layers` to specify how many layers will be offloaded to the GPU.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/zephyr-7B-beta-GGUF",
    model_file="zephyr-7b-beta.Q4_K_M.gguf",
    model_type="mistral", gpu_layers=50, hf=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta", use_fast=True
)

# Create a pipeline
pipe = pipeline(model=model, tokenizer=tokenizer, task='text-generation')
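
If you have no GPU available at all, you can keep every layer on the CPU by setting gpu_layers to 0; the rest of the call stays the same:

# CPU-only variant: no layers are offloaded to the GPU
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/zephyr-7B-beta-GGUF",
    model_file="zephyr-7b-beta.Q4_K_M.gguf",
    model_type="mistral",
    gpu_layers=0,
    hf=True
)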

After loading the model, we can run a prompt as follows:

# We will use the same prompt as we did originally
outputs = pipe(prompt, max_new_tokens=256)
print(outputs[0]["generated_text"])

This gives us the following output:

Why did the Large Language Model go to the party? To impress everyone with its vocabulary! But unfortunately, it kept repeating the same jokes over and over again, making everyone groan and roll their eyes. The partygoers soon realized that the Large Language Model was more of a party pooper than a party animal. Moral of the story: Just because a Large Language Model can generate a lot of words, doesn’t mean it knows how to be funny or entertaining. Sometimes, less is more!

GGUF is an amazing format if you want to leverage both the CPU and GPU when you, like me, are GPU-poor and do not have the latest and greatest GPU available.

AWQ: Activation-aware Weight Quantization

A new format on the block is AWQ (Activation-aware Weight Quantization) which is a quantization method similar to GPTQ. There are several differences between AWQ and GPTQ as methods but the most important one is that AWQ assumes that not all weights are equally important for an LLM’s performance.

In other words, a small fraction of particularly important weights is protected during quantization, which helps reduce the quantization loss.

As a result, their paper mentions a significant speed-up compared to GPTQ whilst keeping similar, and sometimes even better, performance.

The method is still relatively new and has not been adopted yet to the extent of GPTQ and GGUF, so it is interesting to see if all these methods can co-exist.

For AWQ, we will use the vLLM package as that was, at least in my experience, the path of least resistance to using AWQ:

pip install vllm

With vLLM, loading and using our model becomes painless:

from vllm import LLM, SamplingParams

# Load the LLM
sampling_params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=256)
llm = LLM(
    model="TheBloke/zephyr-7B-beta-AWQ", 
    quantization='awq', 
    dtype='half', 
    gpu_memory_utilization=.95, 
    max_model_len=4096
)

Then, we can easily run the model with .generate:

# Generate output based on the input prompt and sampling parameters
output = llm.generate(prompt, sampling_params)
print(output[0].outputs[0].text)

This gives us the following output:

Why did the Large Language Model go to the party? To network and expand its vocabulary! Why did the Large Language Model blush? Because it overheard another model saying it was a little too wordy! Why did the Large Language Model get kicked out of the library? It was being too loud and kept interrupting other models’ conversations with its endless chatter! …

Although it is a new format, AWQ is gaining popularity due to its speed and quality of compression!
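
If you would rather stay within the 🤗 Transformers pipeline used earlier, recent versions of transformers can also load AWQ checkpoints directly once the autoawq package is installed. A minimal sketch, assuming your transformers version supports AWQ:

# pip install autoawq
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "TheBloke/zephyr-7B-beta-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Create the same text-generation pipeline as before
pipe = pipeline(model=model, tokenizer=tokenizer, task="text-generation")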

🔥 TIP: For a more detailed comparison between these techniques with respect to VRAM/Perplexity, I highly advise reading this in-depth post with a follow-up here.

Thank you for reading!

If you are, like me, passionate about AI and/or Psychology, please feel free to add me on LinkedIn, follow me on Twitter, or subscribe to my Newsletter.

All images without a source credit were created by the author — Which means all of them, I like creating my own images ;)