To further improve LLMs, new architectures are developed that might even outperform the Transformer architecture. One of these methods is Mamba, a State Space Model.
Mamba was proposed in the paper Mamba: Linear-Time Sequence Modeling with Selective State Spaces. You can find its official implementation and model checkpoints in its repository.
In this post, I will introduce the field of State Space Models in the context of language modeling and explore concepts one by one to develop an intuition about the field. Then, we will cover how Mamba might challenge the Transformer architecture.
As a visual guide, expect many visualizations to develop an intuition about Mamba and State Space Models!
To illustrate why Mamba is such an interesting architecture, let’s do a short recap of Transformers first and explore one of their disadvantages.
A Transformer sees any textual input as a sequence that consists of tokens.
A major benefit of Transformers is that whatever input it receives, it can look back at any of the earlier tokens in the sequence to derive its representation.
Remember that a Transformer consists of two structures, a set of encoder blocks for representing text and a set of decoder blocks for generating text. Together, these structures can be used for several tasks, including translation.
We can adopt this structure to create generative models by using only decoders. This Transformer-based model, Generative Pre-trained Transformers (GPT), uses decoder blocks to complete some input text.
Let’s take a look at how that works!
A single decoder block consists of two main components, masked self-attention followed by a feed-forward neural network.
Self-attention is a major reason why these models work so well. It enables an uncompressed view of the entire sequence with fast training.
So how does it work?
It creates a matrix comparing each token with every token that came before. The weights in the matrix are determined by how relevant the token pairs are to one another.
During training, this matrix is created in one go. The attention between “My” and “name” does not need to be calculated first before we calculate the attention between “name” and “is”.
It enables parallelization, which speeds up training tremendously!
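To make this concrete, here is a minimal numpy sketch of a single masked self-attention head; the toy dimensions and the weight matrices W_q, W_k, and W_v are hypothetical placeholders, not an actual Transformer implementation:

import numpy as np

def masked_self_attention(X, W_q, W_k, W_v):
    # Single-head masked self-attention over token embeddings X of shape (seq_len, d)
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # compare every pair of tokens
    mask = np.triu(np.ones_like(scores), k=1)         # hide future tokens (decoder-style)
    scores = np.where(mask == 1, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                                # weighted sum of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))                           # three tokens, e.g. "My", "name", "is"
out = masked_self_attention(X, *(rng.normal(size=(4, 4)) for _ in range(3)))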
There is a flaw, however. When generating the next token, we need to re-calculate the attention for the entire sequence, even if we already generated some tokens.
Generating tokens for a sequence of length L needs roughly L² computations, which can be costly if the sequence length increases.
This need to recalculate the entire sequence is a major bottleneck of the Transformer architecture.
Let’s look at how a “classic” technique, Recurrent Neural Networks, solves this problem of slow inference.
A Recurrent Neural Network (RNN) is a sequence-based network. At each time step in a sequence it takes two inputs, namely the input at time step t and the hidden state of the previous time step t-1, to generate the next hidden state and predict the output.
RNNs have a looping mechanism that allows them to pass information from a previous step to the next. We can “unfold” this visualization to make it more explicit.
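As a rough sketch of that loop (an Elman-style RNN with hypothetical weight matrices, not any specific implementation):

import numpy as np

rng = np.random.default_rng(0)
hidden_size, embed_size = 8, 4
W_h = rng.normal(size=(hidden_size, hidden_size))   # previous hidden state -> new state
W_x = rng.normal(size=(hidden_size, embed_size))    # current input -> new state
W_y = rng.normal(size=(embed_size, hidden_size))    # new state -> output prediction

inputs = rng.normal(size=(5, embed_size))           # five token embeddings
h = np.zeros(hidden_size)
for x in inputs:                                    # one step per token, strictly sequential
    h = np.tanh(W_h @ h + W_x @ x)                  # depends only on h_{t-1} and x_t
    y = W_y @ h                                     # output for this time step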
When generating the output, the RNN only needs to consider the previous hidden state and current input. It prevents recalculating all previous hidden states which is what a Transformer would do.
In other words, RNNs can do inference fast as it scales linearly with the sequence length! In theory, it can even have an infinite context length.
To illustrate, let’s apply the RNN to the input text we have used before.
Each hidden state is the aggregation of all previous hidden states and is typically a compressed view.
There is a problem, however…
Notice that the last hidden state, when producing the name “Maarten”, no longer contains information about the word “Hello”. RNNs tend to forget information over time since they only consider one previous state.
This sequential nature of RNNs creates another problem. Training cannot be done in parallel since it needs to go through each step at a time sequentially.
The problem with RNNs, compared to Transformers, is completely the opposite! Their inference is incredibly fast, but their training cannot be parallelized.
Can we somehow find an architecture that does parallelize training like Transformers whilst still performing inference that scales linearly with sequence length?
Yes! This is what Mamba offers but before diving into its architecture, let’s explore the world of State Space Models first.
A State Space Model (SSM), like the Transformer and RNN, processes sequences of information, like text but also signals. In this section, we will go through the basics of SSMs and how they relate to textual data.
A State Space contains the minimum number of variables that fully describe a system. It is a way to mathematically represent a problem by defining a system’s possible states.
Let’s simplify this a bit. Imagine we are navigating through a maze. The “state space” is the map of all possible locations (states). Each point represents a unique position in the maze with specific details, like how far you are from the exit.
The “state space representation” is a simplified description of this map. It shows where you are (current state), where you can go next (possible future states), and what changes take you to the next state (going right or left).
Although State Space Models use equations and matrices to track this behavior, it is simply a way to track where you are, where you can go, and how you can get there.
The variables that describe a state, in our example the X and Y coordinates, as well as the distance to the exit, can be represented as “state vectors”.
Sounds familiar? That is because embeddings or vectors in language models are also frequently used to describe the “state” of an input sequence. For instance, a vector of your current position (state vector) could look a bit like this:
In terms of neural networks, the “state” of a system is typically its hidden state and in the context of Large Language Models, one of the most important aspects of generating a new token.
SSMs are models used to describe these state representations and make predictions of what their next state could be depending on some input.
Traditionally, at time t, SSMs:
map an input sequence x(t) — (e.g., moved left and down in the maze)
to a latent state representation h(t) — (e.g., distance to exit and x/y coordinates)
and derive a predicted output sequence y(t) — (e.g., move left again to reach the exit sooner)
However, instead of using discrete sequences (like moving left once) it takes as input a continuous sequence and predicts the output sequence.
SSMs assume that dynamic systems, such as an object moving in 3D space, can be predicted from its state at time t through two equations.
By solving these equations, we assume that we can uncover the statistical principles to predict the state of a system based on observed data (input sequence and previous state).
Its goal is to find this state representation h(t) such that we can go from an input to an output sequence.
These two equations are the core of the State Space Model.
The two equations will be referenced throughout this guide. To make them a bit more intuitive, they are color-coded so you can quickly reference them.
The state equation describes how the state changes (through matrix A) based on how the input influences the state (through matrix B).
As we saw before, h(t) refers to our latent state representation at any given time t, and x(t) refers to some input.
The output equation describes how the state is translated to the output (through matrix C) and how the input influences the output (through matrix D).
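In the standard SSM notation, and consistent with the descriptions above, the two equations read:

h′(t) = Ah(t) + Bx(t) (the state equation)
y(t) = Ch(t) + Dx(t) (the output equation)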
NOTE: Matrices A, B, C, and D are also commonly referred to as parameters since they are learnable.
Visualizing these two equations gives us the following architecture:
Let’s go through the general technique step-by-step to understand how these matrices influence the learning process.
Assume we have some input signal x(t), this signal first gets multiplied by matrix B which describes how the inputs influence the system.
The updated state (akin to the hidden state of a neural network) is a latent space that contains the core “knowledge” of the environment. We multiply the state with matrix A which describes how all the internal states are connected as they represent the underlying dynamics of the system.
As you might have noticed, matrix A is applied before creating the state representations and is updated after the state representation has been updated.
Then, we use matrix C to describe how the state can be translated to an output.
Finally, we can make use of matrix D to provide a direct signal from the input to the output. This is also often referred to as a skip-connection.
Since matrix D is similar to a skip-connection, the SSM is often regarded as the following without the skip-connection.
Going back to our simplified perspective, we can now focus on matrices A, B, and C as the core of the SSM.
We can update the original equations (and add some pretty colors) to signify the purpose of each matrix as we did before.
Together, these two equations aim to predict the state of a system from observed data. Since the input is expected to be continuous, the main representation of the SSM is a continuous-time representation.
Finding the state representation h(t) is analytically challenging if you have a continuous signal. Moreover, since we generally have a discrete input (like a textual sequence), we want to discretize the model.
To do so, we make use of the Zero-order hold technique. It works as follows. First, every time we receive a discrete signal, we hold its value until we receive a new discrete signal. This process creates a continuous signal the SSM can use:
How long we hold the value is represented by a new learnable parameter, called the step size ∆. It represents the resolution of the input.
Now that we have a continuous signal for our input, we can generate a continuous output and only sample the values according to the time steps of the input.
These sampled values are our discretized output!
Mathematically, we can apply the Zero-order hold as follows:
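As written in the S4/Mamba papers, the zero-order hold turns the continuous parameters (∆, A, B) into their discretized counterparts:

Ā = exp(∆A)
B̄ = (∆A)⁻¹(exp(∆A) − I) · ∆B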
Together, they allow us to go from a continuous SSM to a discrete SSM, represented by a formulation that, instead of function-to-function, x(t) → y(t), is now sequence-to-sequence, xₖ → yₖ:
Here, matrices A and B now represent discretized parameters of the model.
We use k instead of t to represent discretized timesteps and to make it a bit more clear when we refer to a continuous versus a discrete SSM.
NOTE: We still store the continuous form of matrix A and not the discretized version. During training, the continuous representation is discretized.
Now that we have a formulation of a discrete representation, let’s explore how we can actually compute the model.
Our discretized SSM allows us to formulate the problem in specific timesteps instead of continuous signals. A recurrent approach, as we saw before with RNNs is quite useful here.
If we consider discrete timesteps instead of a continuous signal, we can reformulate the problem with timesteps:
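With the discretized matrices (written here with a bar to distinguish them from their continuous counterparts), the two equations become a simple recurrence:

hₖ = Āhₖ₋₁ + B̄xₖ
yₖ = Chₖ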
At each timestep, we calculate how the current input (Bxₖ) influences the previous state (Ahₖ₋₁) and then calculate the predicted output (Chₖ).
This representation might already seem a bit familiar! We can approach it the same way we did with the RNN as we saw before.
Which we can unfold (or unroll) as such:
Notice how we can use this discretized version using the underlying methodology of an RNN.
This technique gives us both the advantages and disadvantages of an RNN, namely fast inference and slow training.
Another representation that we can use for SSMs is that of convolutions. Remember from classic image recognition tasks where we applied filters (kernels) to derive aggregate features:
Since we are dealing with text and not images, we need a 1-dimensional perspective instead:
The kernel that we use to represent this “filter” is derived from the SSM formulation:
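Writing out the recurrence, a known result from the S4 paper is that the kernel collects the contribution of the inputs from 0, 1, 2, … steps back:

K̄ = (CB̄, CĀB̄, CĀ²B̄, …)
y = x ∗ K̄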
Let’s explore how this kernel works in practice. Like convolution, we can use our SSM kernel to go over each set of tokens and calculate the output:
This also illustrates the effect padding might have on the output. I changed the order of padding to improve the visualization but we often apply it at the end of a sentence.
In the next step, the kernel is moved once over to perform the next step in the calculation:
In the final step, we can see the full effect of the kernel:
A major benefit of representing the SSM as a convolution is that it can be trained in parallel like Convolutional Neural Networks (CNNs). However, due to the fixed kernel size, their inference is not as fast and unbounded as RNNs.
These three representations, continuous, recurrent, and convolutional all have different sets of advantages and disadvantages:
Interestingly, we now have efficient inference with the recurrent SSM and parallelizable training with the convolutional SSM.
With these representations, there is a neat trick that we can use, namely choose a representation depending on the task. During training, we use the convolutional representation which can be parallelized and during inference, we use the efficient recurrent representation:
This model is referred to as the Linear State-Space Layer (LSSL).
These representations share an important property, namely that of Linear Time Invariance (LTI). LTI states that the SSM’s parameters, A, B, and C, are fixed for all timesteps. This means that matrices A, B, and C are the same for every token the SSM generates.
In other words, regardless of what sequence you give the SSM, the values of A, B, and C remain the same. We have a static representation that is not content-aware.
Before we explore how Mamba addresses this issue, let’s explore the final piece of the puzzle, matrix A.
Arguably one of the most important aspects of the SSM formulation is matrix A. As we saw before with the recurrent representation, it captures information about the previous state to build the new state.
In essence, matrix A produces the hidden state:
Creating matrix A can therefore be the difference between remembering only a few previous tokens and capturing every token we have seen thus far, especially in the context of the recurrent representation, since it only looks back at the previous state.
So how can we create matrix A in a way that retains a large memory (context size)?
We use Hungry Hungry Hippo! Or HiPPO for High-order Polynomial Projection Operators. HiPPO attempts to compress all input signals it has seen thus far into a vector of coefficients.
It uses matrix A to build a state representation that captures recent tokens well and decays older tokens. Its formula can be represented as follows:
Assuming we have a square matrix A, this gives us:
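For reference, the HiPPO (LegS) matrix that is commonly used to initialize A in S4 and Mamba is usually written as follows (sign conventions differ slightly between papers):

Aₙₖ = −√(2n+1) · √(2k+1) if n > k
Aₙₖ = −(n+1) if n = k
Aₙₖ = 0 if n < k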
Building matrix A using HiPPO was shown to be much better than initializing it as a random matrix. As a result, it more accurately reconstructs newer signals (recent tokens) compared to older signals (initial tokens).
The idea behind the HiPPO Matrix is that it produces a hidden state that memorizes its history.
Mathematically, it does so by tracking the coefficients of a Legendre polynomial which allows it to approximate all of the previous history.
HiPPO was then applied to the recurrent and convolution representations that we saw before to handle long-range dependencies. The result was Structured State Space for Sequences (S4), a class of SSMs that can efficiently handle long sequences.
It consists of three parts:
Structured State Space Models
HiPPO for handling long-range dependencies
Discretization for creating recurrent and convolutional representations
This class of SSMs has several benefits depending on the representation you choose (recurrent vs. convolution). It can also handle long sequences of text and store memory efficiently by building upon the HiPPO matrix.
NOTE: If you want to dive into more of the technical details on how to calculate the HiPPO matrix and build a S4 model yourself, I would HIGHLY advise going through the Annotated S4.
We finally have covered all the fundamentals necessary to understand what makes Mamba special. State Space Models can be used to model textual sequences but still have a set of disadvantages we want to prevent.
In this section, we will go through Mamba’s two main contributions:
A selective scan algorithm, which allows the model to filter (ir)relevant information
A hardware-aware algorithm that allows for efficient storage of (intermediate) results through parallel scan, kernel fusion, and recomputation
Together they create the selective SSM or S6 models which can be used, like self-attention, to create Mamba blocks.
Before exploring the two main contributions, let’s first explore why they are necessary.
State Space Models, and even the S4 (Structured State Space Model), perform poorly on certain tasks that are vital in language modeling and generation, namely the ability to focus on or ignore particular inputs.
We can illustrate this with two synthetic tasks, namely selective copying and induction heads.
In the selective copying task, the goal of the SSM is to copy parts of the input and output them in order:
However, a (recurrent/convolutional) SSM performs poorly in this task since it is Linear Time Invariant. As we saw before, the matrices A, B, and C are the same for every token the SSM generates.
As a result, an SSM cannot perform content-aware reasoning since it treats each token equally as a result of the fixed A, B, and C matrices. This is a problem as we want the SSM to reason about the input (prompt).
The second task an SSM performs poorly on is induction heads where the goal is to reproduce patterns found in the input:
In the above example, we are essentially performing one-shot prompting where we attempt to “teach” the model to provide an “A:” response after every “Q:”. However, since an SSM is time-invariant, it cannot select which previous tokens to recall from its history.
Let’s illustrate this by focusing on matrix B. Regardless of what the input x is, matrix B remains exactly the same and is therefore independent of x:
Likewise, A and C also remain fixed regardless of the input. This demonstrates the static nature of the SSMs we have seen thus far.
In comparison, these tasks are relatively easy for Transformers since they dynamically change their attention based on the input sequence. They can selectively “look” or “attend” at different parts of the sequence.
The poor performance of SSMs on these tasks illustrates the underlying problem with time-invariant SSMs, the static nature of matrices A, B, and C results in problems with content-awareness.
The recurrent representation of an SSM creates a small state that is quite efficient as it compresses the entire history. However, compared to a Transformer model which does no compression of the history (through the attention matrix), it is much less powerful.
Mamba aims to have the best of both worlds. A small state that is as powerful as the state of a Transformer:
As teased above, it does so by compressing data selectively into the state. When you have an input sentence, there is often information, like stop words, that does not have much meaning.
To selectively compress information, we need the parameters to be dependent on the input. To do so, let’s first explore the dimensions of the input and output in an SSM during training:
In a Structured State Space Model (S4), the matrices A, B, and C are independent of the input since their dimensions N and D are static and do not change.
Instead, Mamba makes matrices B and C, and even the step size ∆, dependent on the input by incorporating the sequence length and batch size of the input:
This means that for every input token, we now have different B and C matrices which solves the problem with content-awareness!
NOTE: Matrix A remains the same since we want the state itself to remain static, but the way it is influenced (through B and C) to be dynamic.
Together, they selectively choose what to keep in the hidden state and what to ignore since they are now dependent on the input.
A smaller step size ∆ results in ignoring specific words and instead using the previous context more whilst a larger step size ∆ focuses on the input words more than the context:
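As a rough PyTorch-style sketch of what making B, C, and ∆ input-dependent could look like (the module, names, and shapes below are hypothetical simplifications, not the official implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveParams(nn.Module):
    # Hypothetical sketch: per-token B, C, and step size delta, with a static A
    def __init__(self, d_model: int, d_state: int):
        super().__init__()
        self.to_B = nn.Linear(d_model, d_state)   # B now depends on the current token
        self.to_C = nn.Linear(d_model, d_state)   # C now depends on the current token
        self.to_delta = nn.Linear(d_model, 1)     # per-token step size ∆
        self.A = nn.Parameter(torch.randn(d_state, d_state))  # A itself stays static

    def forward(self, x):                         # x: (batch, seq_len, d_model)
        B = self.to_B(x)                          # (batch, seq_len, d_state)
        C = self.to_C(x)                          # (batch, seq_len, d_state)
        delta = F.softplus(self.to_delta(x))      # positive step size per token
        return self.A, B, C, delta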
Since these matrices are now dynamic, they cannot be calculated using the convolution representation since it assumes a fixed kernel. We can only use the recurrent representation and lose the parallelization the convolution provides.
To enable parallelization, let’s explore how we compute the output with recurrency:
Each state is the sum of the previous state (multiplied by A) plus the current input (multiplied by B). This is called a scan operation and can easily be calculated with a for loop.
Parallelization, in contrast, seems impossible since each state can only be calculated if we have the previous state. Mamba, however, makes this possible through the [parallel scan](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda) algorithm.
It assumes the order in which we do operations does not matter through the associative property. As a result, we can calculate the sequence in parts and iteratively combine them:
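A minimal sketch of both views, assuming a scalar input per step and dense matrices for readability (the real implementation works on batched tensors with fused CUDA kernels):

import numpy as np

def sequential_scan(A_bars, B_bars, xs, C):
    # Naive recurrent scan: h_k = A_k @ h_{k-1} + B_k * x_k, y_k = C @ h_k
    h = np.zeros(A_bars.shape[-1])
    ys = []
    for A_k, B_k, x_k in zip(A_bars, B_bars, xs):
        h = A_k @ h + B_k * x_k           # one state update per token
        ys.append(C @ h)
    return np.array(ys)

def combine(left, right):
    # Associative combine of two steps (A1, b1) and (A2, b2) into a single step.
    # Because this operator is associative, steps can be combined in parallel
    # (prefix-sum style) rather than strictly one after the other.
    A1, b1 = left
    A2, b2 = right
    return A2 @ A1, A2 @ b1 + b2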
Together, dynamic matrices B and C, and the parallel scan algorithm create the selective scan algorithm to represent the dynamic and fast nature of using the recurrent representation.
A disadvantage of recent GPUs is their limited transfer (IO) speed between their small but highly efficient SRAM and their large but slightly less efficient DRAM. Frequently copying information between SRAM and DRAM becomes a bottleneck.
Mamba, like Flash Attention, attempts to limit the number of times we need to go from DRAM to SRAM and vice versa. It does so through kernel fusion which allows the model to prevent writing intermediate results and continuously performing computations until it is done.
We can view the specific instances of DRAM and SRAM allocation by visualizing Mamba’s base architecture:
Here, the following are fused into one kernel:
The discretization step with step size ∆
The selective scan algorithm
The multiplication with C
The last piece of the hardware-aware algorithm is recomputation.
The intermediate states are not saved but are necessary for the backward pass to compute the gradients. Instead, the authors recompute those intermediate states during the backward pass.
Although this might seem inefficient, it is much less costly than reading all those intermediate states from the relatively slow DRAM.
We have now covered all components of its architecture which is depicted using the following image from its article:
This architecture is often referred to as a selective SSM or S6 model since it is essentially an S4 model computed with the selective scan algorithm.
The selective SSM that we have explored thus far can be implemented as a block, the same way we can represent self-attention in a decoder block.
Like the decoder, we can stack multiple Mamba blocks and use their output as the input for the next Mamba block:
It starts with a linear projection to expand upon the input embeddings. Then, a convolution is applied before the Selective SSM to prevent independent token calculations.
The Selective SSM has the following properties:
A recurrent SSM created through discretization
HiPPO initialization of matrix A to capture long-range dependencies
A selective scan algorithm to selectively compress information
A hardware-aware algorithm to speed up computation
We can expand on this architecture a bit more when looking at the code implementation and explore how an end-to-end example would look like:
Notice some changes, like the inclusion of normalization layers and softmax for choosing the output token.
When we put everything together, we get both fast inference and training and even unbounded context!
Using this architecture, the authors found it matches and sometimes even exceeds the performance of Transformer models of the same size!
Hopefully, this was an accessible introduction to Mamba and State Space Models. If you want to go deeper, I would suggest the following resources:
My ambition for BERTopic is to make it the one-stop shop for topic modeling by allowing for significant flexibility and modularity.
That has been the goal for the last few years and with the release of v0.16, I believe we are a BIG step closer to achieving that.
First, let’s take a small step back. What is BERTopic?
Well, BERTopic is a topic modeling framework that allows users to essentially create their version of a topic model. With many variations of topic modeling implemented, the idea is that it should support almost any use case.
With v0.16, several features were implemented that I believe will take BERTopic to the next level, namely:
Zero-Shot Topic Modeling
Model Merging
More Large Language Model (LLM) Support
In this tutorial, we will go through what these features are and for which use cases they could be helpful.
To start with, you can install BERTopic (with HF datasets) as follows:
pip install bertopic datasets
You can also follow along with the Google Colab Notebook to make sure everything works as intended.
UPDATE: I uploaded a video version to YouTube that goes more in-depth into how to use these new features:
Zero-shot techniques generally refer to having no examples to train your data on. Although you know the target, it is not assigned to your data.
In BERTopic, we use Zero-shot Topic Modeling to find pre-defined topics in large amounts of documents.
Imagine you have ArXiv abstracts about Machine Learning and you know that the topic “Large Language Models” is in there. With Zero-shot Topic Modeling, you can ask BERTopic to find all documents related to “Large Language Models”.
In essence, it is nothing more than semantic search! But… there is a neat trick ;-)
When you try to find the documents related to “Large Language Models”, many documents will be left that are not about that topic. So, what do you do with those remaining documents? You use BERTopic to find all the topics that were left!
As a result, you will have three scenarios of Zero-shot Topic Modeling:
No zero-shot topics were detected. This means that none of the documents would fit with the predefined topics and a regular BERTopic would be run.
Only zero-shot topics were detected. Here, we would not need to find additional topics since all original documents were assigned to one of the predefined topics.
Both zero-shot topics and clustered topics were detected. This means that some documents would fit with the predefined topics whereas others would not. For the latter, new topics were found.
Using Zero-shot BERTopic is straightforward:
from datasets import load_dataset
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
# We select a subsample of 5000 abstracts from ArXiv
dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
docs = dataset["abstract"][:5_000]
# We define a number of topics that we know are in the documents
zeroshot_topic_list = ["Clustering", "Topic Modeling", "Large Language Models"]
# We fit our model using the zero-shot topics
# and we define a minimum similarity. For each document,
# if the similarity does not exceed that value, it will be used
# for clustering instead.
topic_model = BERTopic(
embedding_model="thenlper/gte-small",
min_topic_size=15,
zeroshot_topic_list=zeroshot_topic_list,
zeroshot_min_similarity=.85,
representation_model=KeyBERTInspired()
)
topics, probs = topic_model.fit_transform(docs)
We can view the three pre-defined topics along with several newly discovered topics:
topic_model.get_topic_info()
Note that although we have pre-defined names for the topics, we still let BERTopic generate additional representations for them.
This gives exciting new insight into pre-defined topics!
So… when do you use Zero-shot Topic Modeling?
If you already know some of the topics in your data, this is a great solution for finding them! Since it can discover both pre-defined and new topics, it is an incredibly flexible technique.
This is a fun new feature, model merging!
Model merging refers to BERTopic’s capability to combine multiple pre-trained BERTopic models to create one large topic model. It explores which topics should be merged and which should remain separate.
It works as follows. When we pass a list of models to this new feature, .merge_models, the first model in the list is chosen as the baseline. This baseline is used to check whether all other models contain new topics based on the similarity between their topic embeddings.
Dissimilar topics are added to the baseline model whereas similar topics are assigned to the topic of the baseline. This means that all models need to use the same underlying embedding model.
Merging pre-trained BERTopic models is straightforward and only requires a few lines of code:
from bertopic import BERTopic
# Merge 3 pre-trained BERTopic models
merged_model = BERTopic.merge_models(
[topic_model_1, topic_model_2, topic_model_3]
)
And that is it! With a single function, .merge_models, you can merge pre-trained BERTopic models.
The benefit of merging pre-trained models is that it allows for a variety of creative and useful use cases. For instance, we could use it for:
Incremental Learning — We can continuously discover new topics by iteratively merging models. This can be used for issue tickets to quickly uncover pressing bugs/issues.
Batched Learning — Compute and memory problems can arise with large datasets or when you simply do not have the hardware for it. By splitting the training process up into smaller models, we can get similar performance whilst reducing the necessary compute.
Federated Learning — Merging models allow for the training to be distributed among different clients who do not wish to share their data. This increases privacy and security with respect to their data especially if a non-keyword-based method is used for generating the representations, such as using a Large Language Model.
Federated Learning is rather straightforward: simply run .merge_models on your central server.
The other two, incremental and batched learning, might require a bit of an example!
To perform both incremental and batched learning, we are going to mimic a typical .partial_fit pipeline. Here, we will train a base model first and then iteratively add a small newly trained model.
In each iteration, we can check any topics that were added to the base model:
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from datasets import load_dataset
# Prepare documents
all_docs = load_dataset("CShorten/ML-ArXiv-Papers")["train"]["abstract"][:20_000]
doc_chunks = [all_docs[i:i+5000] for i in range(0, len(all_docs), 5000)]
# Base Model
representation_model = KeyBERTInspired()
base_model = BERTopic(representation_model=representation_model, min_topic_size=15).fit(doc_chunks[0])
# Iteratively add small and newly trained models
for docs in doc_chunks[1:]:
    new_model = BERTopic(representation_model=representation_model, min_topic_size=15).fit(docs)
    updated_model = BERTopic.merge_models([base_model, new_model])

    # Let's print the newly discovered topics
    nr_new_topics = len(set(updated_model.topics_)) - len(set(base_model.topics_))
    new_topics = list(updated_model.topic_labels_.values())[-nr_new_topics:]
    print("The following topics are newly found:")
    print(f"{new_topics}\n")

    # Update the base model
    base_model = updated_model
To illustrate, this will give back newly found topics such as:
> The following topics are newly found:
[
‘50_forecasting_predicting_prediction_stocks’,
‘51_activity_activities_accelerometer_accelerometers’,
‘57_rnns_deepcare_neural_imputation’
]
It retains everything from the original model, including
Not only do we reduce the compute by splitting the training up into chunks, but we can monitor any new topics that were added to the model.
In practice, you can train a new model with a frequency that fits your use case. You might check for new topics monthly, weekly, or even daily if you have enough data.
Although we could use Large Language Models (LLMs) for a while now in BERTopic, the v0.16 release has several smaller additions that make working with LLMs a nicer experience!
To sum up, the following were added:
llama-cpp-python: Load any GGUF-compatible LLM with llama.cpp
Truncate documents: Use a variety of techniques to truncate documents when passing them to any LLM.
LangChain: Support for LCEL Runnables by @joshuasundance-swca
Let’s explore a short example of the first two features, llama.cpp and document truncation.
When you pass documents to any LLM module, they might exceed its token limit. Instead, we can truncate the documents passed to the LLM by defining a tokenizer and a doc_length.
The definition of a doc_length depends on the tokenizer you use. For example, a value of 100 can refer to truncating by the number of tokens or even characters.
To use this together with llama-cpp-python , let’s consider the following example. First, we install the necessary packages, prepare the environment, and download a small but capable model (Zephyr-7B):
pip install llama-cpp-python
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python
wget https://huggingface.co/TheBloke/zephyr-7B-alpha-GGUF/resolve/main/zephyr-7b-alpha.Q4_K_M.gguf
Loading a GGUF model with llama-cpp-python in BERTopic is straightforward:
from bertopic import BERTopic
from bertopic.representation import LlamaCPP
# Use llama.cpp to load in a 4-bit quantized version of Zephyr 7B Alpha
# and truncate each document to 50 words
representation_model = LlamaCPP(
"zephyr-7b-alpha.Q4_K_M.gguf",
tokenizer="whitespace",
doc_length=50
)
# Create our BERTopic model
topic_model = BERTopic(representation_model=representation_model, verbose=True)
And that is it! We created a model that truncates input documents and creates interesting topic representations without being constrained by its token limit.
Throughout the last year, we have seen the Wild West of Large Language Models (LLMs). The pace at which new technology and models were released was astounding! As a result, we have many different standards and ways of working with LLMs.
In this article, we will explore one such topic, namely loading your local LLM through several (quantization) standards. With sharding, quantization, and different saving and compression strategies, it is not easy to know which method is suitable for you.
Throughout the examples, we will use Zephyr 7B, a fine-tuned variant of Mistral 7B that was trained with Direct Preference Optimization (DPO).
🔥 TIP: After each example of loading an LLM, it is advised to restart your notebook to prevent OutOfMemory errors. Loading multiple LLMs requires significant RAM/VRAM. You can reset memory by deleting the models and resetting your cache like so:
# Delete any models previously created
del model, tokenizer, pipe
# Empty VRAM cache
import torch
torch.cuda.empty_cache()
UPDATE: I uploaded a video version to YouTube that goes more in-depth into how to use these quantization methods:
The most straightforward, and vanilla, way of loading your LLM is through 🤗 Transformers. HuggingFace has created a large suite of packages that allow us to do amazing things with LLMs!
We will start by installing HuggingFace, among others, from its main branch to support newer models:
# Latest HF transformers version for Mistral-like models
pip install git+https://github.com/huggingface/transformers.git
pip install accelerate bitsandbytes xformers
After installation, we can use the following pipeline to easily load our LLM:
from torch import bfloat16
from transformers import pipeline
# Load in your LLM without any compression tricks
pipe = pipeline(
"text-generation",
model="HuggingFaceH4/zephyr-7b-beta",
torch_dtype=bfloat16,
device_map="auto"
)
This method of loading an LLM generally does not perform any compression tricks for saving VRAM or increasing efficiency.
To generate our prompt, we first have to create the necessary template. Fortunately, this can be done automatically if the chat template is saved in the underlying tokenizer:
# We use the tokenizer's chat template to format each message
# See https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
{
"role": "system",
"content": "You are a friendly chatbot.",
},
{
"role": "user",
"content": "Tell me a funny joke about Large Language Models."
},
]
prompt = pipe.tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
The generated prompt, using the internal prompt template, is constructed like so:
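For Zephyr-7B-β, that string looks roughly as follows (the exact special tokens come from the tokenizer’s chat template):

<|system|>
You are a friendly chatbot.</s>
<|user|>
Tell me a funny joke about Large Language Models.</s>
<|assistant|>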
Then, we can start passing the prompt to the LLM to generate our answer:
outputs = pipe(
prompt,
max_new_tokens=256,
do_sample=True,
temperature=0.1,
top_p=0.95
)
print(outputs[0]["generated_text"])
This gives us the following output:
Why did the Large Language Model go to the party?
To network and expand its vocabulary!
The punchline may be a bit cheesy, but Large Language Models are all about expanding their vocabulary and networking with other models to improve their language skills. So, this joke is a perfect fit for them!
For pure inference, this method is generally the least efficient as we are loading the entire model without any compression or quantization strategies.
It is, however, a great method to start with as it allows for easy loading and using the model!
Before we go into quantization strategies, there is another trick that we can employ to reduce the necessary VRAM for loading our model. With sharding, we are essentially splitting our model up into small pieces or shards.
Each shard contains a smaller part of the model and aims to work around GPU memory limitations by distributing the model weights across different devices.
Remember when I said we did not perform any compression tricks before?
That was not entirely true…
The model that we loaded, Zephyr-7B-β, was actually already sharded for us! If you go to the model and click the “Files and versions” link, you will see that the model was split up into eight pieces.
Although we can shard a model ourselves, it is generally advised to be on the lookout for quantized models or even quantize them yourself.
Sharding is quite straightforward using the Accelerate package:
from accelerate import Accelerator
# Shard our model into pieces of at most 4GB
accelerator = Accelerator()
accelerator.save_model(
model=pipe.model,
save_directory="/content/model",
max_shard_size="4GB"
)
And that is it! Because we sharded the model into pieces of 4GB instead of 2GB, we created fewer files to load:
A Large Language Model is represented by a bunch of weights and activations. These values are generally represented by the usual 32-bit floating point (float32) datatype.
The number of bits tells you something about how many values it can represent. Float32 can represent values between 1.18e-38 and 3.4e38, quite a number of values! The lower the number of bits, the fewer values it can represent.
As you might expect, if we choose a lower bit size, then the model becomes less accurate but it also needs to represent fewer values, thereby decreasing its size and memory requirements.
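As a quick illustration of these ranges in PyTorch:

import torch

# Compare the bit width and representable range of common floating-point types
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(dtype, "| bits:", info.bits, "| max:", info.max, "| smallest normal:", info.tiny)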
Quantization refers to converting an LLM from its original Float32 representation to something smaller. However, we do not simply want to use a smaller bit variant but map a larger bit representation to a smaller bit without losing too much information.
In practice, we see this often done with a new format, named 4bit-NormalFloat (NF4). This datatype does a few special tricks in order to efficiently represent a larger bit datatype. It consists of three steps:
Normalization: The weights of the model are normalized so that we expect the weights to fall within a certain range. This allows for more efficient representation of more common values.
Quantization: The weights are quantized to 4-bit. In NF4, the quantization levels are spaced according to the quantiles of a normal distribution, matching the distribution of the normalized weights and thereby efficiently representing the original 32-bit weights.
Dequantization: Although the weights are stored in 4-bit, they are dequantized during computation which gives a performance boost during inference.
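To get a feel for this normalize → quantize → dequantize cycle, here is a toy symmetric (absmax) example; note that this is only a simplified illustration and not the actual NF4 procedure:

import numpy as np

def toy_quantize(weights, bits=4):
    # Toy symmetric quantization: normalize by the largest weight, round to integer
    # levels, then map back to float. Real NF4 uses normal-distribution-aware levels.
    levels = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit
    scale = np.max(np.abs(weights))                   # normalization step
    quantized = np.round(weights / scale * levels).astype(np.int8)
    dequantized = quantized.astype(np.float32) * scale / levels
    return quantized, dequantized

w = np.random.randn(8).astype(np.float32)
q, w_hat = toy_quantize(w)
print("max quantization error:", np.abs(w - w_hat).max())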
To perform this quantization with HuggingFace, we need to define a configuration for the quantization with Bitsandbytes:
from transformers import BitsAndBytesConfig
from torch import bfloat16
# Our 4-bit configuration to load the LLM with less GPU memory
bnb_config = BitsAndBytesConfig(
load_in_4bit=True, # 4-bit quantization
bnb_4bit_quant_type='nf4', # Normalized float 4
bnb_4bit_use_double_quant=True, # Second quantization after the first
bnb_4bit_compute_dtype=bfloat16 # Computation type
)
This configuration allows us to specify which quantization levels we are going for. Generally, we want to represent the weights with 4-bit quantization but do the inference in 16-bit.
Loading the model in a pipeline is then straightforward:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
# Zephyr with BitsAndBytes Configuration
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-alpha")
model = AutoModelForCausalLM.from_pretrained(
"HuggingFaceH4/zephyr-7b-alpha",
quantization_config=bnb_config,
device_map='auto',
)
# Create a pipeline
pipe = pipeline(model=model, tokenizer=tokenizer, task='text-generation')
Next up, we can use the same prompt as we did before:
# We will use the same prompt as we did originally
outputs = pipe(
prompt,
max_new_tokens=256,
do_sample=True,
temperature=0.7,
top_p=0.95
)
print(outputs[0]["generated_text"])
This will give us the following output:
Why did the Large Language Model go to the party?
To network and expand its vocabulary!
The punchline may be a bit cheesy, but Large Language Models are all about expanding their vocabulary and networking with other models to improve their language skills. So, this joke is a perfect fit for them!
Quantization is a powerful technique to reduce the memory requirements of a model whilst keeping performance similar. It allows for faster loading, using, and fine-tuning LLMs even with smaller GPUs.
Thus far, we have explored sharding and quantization techniques. Albeit useful techniques to have in your skillset, it seems rather wasteful to have to apply them every time you load the model.
Instead, these models have often already been sharded and quantized for us to use. TheBloke in particular is a user on HuggingFace that performs a bunch of quantizations for us to use.
At the moment of writing this, he has uploaded more than 2000 quantized models for us!
These quantized models actually come in many different shapes and sizes. Most notably, the GPTQ, GGUF, and AWQ formats are most frequently used to perform 4-bit quantization.
GPTQ is a post-training quantization (PTQ) method for 4-bit quantization that focuses primarily on GPU inference and performance.
The idea behind the method is that it will try to compress all weights to a 4-bit quantization by minimizing the mean squared error to that weight. During inference, it will dynamically dequantize its weights to float16 for improved performance whilst keeping memory low.
For a more detailed guide to the inner workings of GPTQ, definitely check out the following post: 4-bit Quantization with GPTQ
We start with installing a number of packages we need to load in GPTQ-like models in HuggingFace Transformers:
pip install optimum
pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
After doing so, we can navigate to the model that we want to load, namely “TheBloke/zephyr-7B-beta-GPTQ” and choose a specific revision.
These revisions essentially indicate the quantization method, compression level, size of the model, etc.
For now, we are sticking with the “main” branch as that is generally a nice balance between compression and accuracy:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
# Load LLM and Tokenizer
model_id = "TheBloke/zephyr-7B-beta-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map="auto",
trust_remote_code=False,
revision="main"
)
# Create a pipeline
pipe = pipeline(model=model, tokenizer=tokenizer, task='text-generation')
Although we installed a few additional dependencies, we could use the same pipeline as we used before which is a great benefit of using GPTQ.
After loading the model, we can run a prompt as follows:
# We will use the same prompt as we did originally
outputs = pipe(
prompt,
max_new_tokens=256,
do_sample=True,
temperature=0.1,
top_p=0.95
)
print(outputs[0]["generated_text"])
This gives us the following generated text:
Why did the Large Language Model go to the party?
To show off its wit and charm, of course!
But unfortunately, it got lost in the crowd and couldn’t find its way back to its owner. The partygoers were impressed by its ability to blend in so seamlessly with the crowd, but the Large Language Model was just confused and wanted to go home. In the end, it was found by a group of humans who recognized its unique style and brought it back to its rightful place. From then on, the Large Language Model made sure to wear a name tag at all parties, just to be safe.
GPTQ is the most often used compression method since it optimizes for GPU usage. It is definitely worth starting with GPTQ and switching over to a CPU-focused method, like GGUF if your GPU cannot handle such large models.
Although GPTQ does compression well, its focus on GPU can be a disadvantage if you do not have the hardware to run it.
GGUF, previously GGML, is a quantization method that allows users to use the CPU to run an LLM but also offload some of its layers to the GPU for a speed up.
Although using the CPU is generally slower than using a GPU for inference, it is an incredible format for those running models on CPU or Apple devices. Especially since we are seeing smaller and more capable models appearing, like Mistral 7B, the GGUF format might just be here to stay!
Using GGUF is rather straightforward with the ctransformers package which we will need to install first:
pip install ctransformers[cuda]
After doing so, we can navigate to the model that we want to load, namely “TheBloke/zephyr-7B-beta-GGUF” and choose a specific file.
Like GPTQ, these files indicate the quantization method, compression level, size of the model, etc.
We are using “zephyr-7b-beta.Q4_K_M.gguf” since we focus on 4-bit quantization:
from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer, pipeline
# Load LLM and Tokenizer
# Use `gpu_layers` to specify how many layers will be offloaded to the GPU.
model = AutoModelForCausalLM.from_pretrained(
"TheBloke/zephyr-7B-beta-GGUF",
model_file="zephyr-7b-beta.Q4_K_M.gguf",
model_type="mistral", gpu_layers=50, hf=True
)
tokenizer = AutoTokenizer.from_pretrained(
"HuggingFaceH4/zephyr-7b-beta", use_fast=True
)
# Create a pipeline
pipe = pipeline(model=model, tokenizer=tokenizer, task='text-generation')
After loading the model, we can run a prompt as follows:
# We will use the same prompt as we did originally
outputs = pipe(prompt, max_new_tokens=256)
print(outputs[0]["generated_text"])
This gives us the following output:
Why did the Large Language Model go to the party? To impress everyone with its vocabulary! But unfortunately, it kept repeating the same jokes over and over again, making everyone groan and roll their eyes. The partygoers soon realized that the Large Language Model was more of a party pooper than a party animal. Moral of the story: Just because a Large Language Model can generate a lot of words, doesn’t mean it knows how to be funny or entertaining. Sometimes, less is more!
GGUF is an amazing format if you want to leverage both the CPU and GPU when you, like me, are GPU-poor and do not have the latest and greatest GPU available.
A new format on the block is AWQ (Activation-aware Weight Quantization) which is a quantization method similar to GPTQ. There are several differences between AWQ and GPTQ as methods but the most important one is that AWQ assumes that not all weights are equally important for an LLM’s performance.
In other words, there is a small fraction of weights that will be skipped during quantization which helps with the quantization loss.
As a result, their paper mentions a significant speed-up compared to GPTQ whilst keeping similar, and sometimes even better, performance.
The method is still relatively new and has not been adopted yet to the extent of GPTQ and GGUF, so it is interesting to see if all these methods can co-exist.
For AWQ, we will use the vLLM package as that was, at least in my experience, the road of least resistance to using AWQ:
pip install vllm
With vLLM, loading and using our model becomes painless:
from vllm import LLM, SamplingParams
# Load the LLM
sampling_params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=256)
llm = LLM(
model="TheBloke/zephyr-7B-beta-AWQ",
quantization='awq',
dtype='half',
gpu_memory_utilization=.95,
max_model_len=4096
)
Then, we can easily run the model with .generate:
# Generate output based on the input prompt and sampling parameters
output = llm.generate(prompt, sampling_params)
print(output[0].outputs[0].text)
This gives us the following output:
Why did the Large Language Model go to the party? To network and expand its vocabulary! Why did the Large Language Model blush? Because it overheard another model saying it was a little too wordy! Why did the Large Language Model get kicked out of the library? It was being too loud and kept interrupting other models’ conversations with its endless chatter! …
Although it is a new format, AWQ is gaining popularity due to its speed and quality of compression!
🔥 TIP: For a more detailed comparison between these techniques with respect to VRAM/Perplexity, I highly advise reading this in-depth post with a follow-up here.
Large Language Models (LLMs) are becoming smaller, faster, and more efficient. Up to the point where I started to consider them for iterative tasks, like keyword extraction.
Having created KeyBERT, I felt that it was time to extend the package to also include LLMs. They are quite powerful and I wanted to prepare the package for when these models can be run on smaller GPUs.
As such, introducing KeyLLM, an extension to KeyBERT that allows you to use any LLM to extract, create, or even fine-tune the keywords! In this tutorial, we will go through keyword extraction with KeyLLM using the recently released Mistral 7B model.
Update: I uploaded a video version to YouTube that goes more in-depth into how to use KeyLLM
We will start by installing a number of packages that we are going to use throughout this example:
pip install --upgrade git+https://github.com/UKPLab/sentence-transformers
pip install keybert ctransformers[cuda]
pip install --upgrade git+https://github.com/huggingface/transformers
We are installing sentence-transformers from its main branch since it has a fix for community detection which we will use in the last few use cases. We do the same for transformers since it does not yet support the Mistral architecture.
In previous tutorials, we demonstrated how we could quantize the original model’s weight to make it run without running into memory problems.
Over the course of the last few months, TheBloke has been working hard on doing the quantization for hundreds of models for us.
This way, we can download the model directly which will speed things up quite a bit.
We’ll start with loading the model itself. We will offload 50 layers to the GPU. This will reduce RAM usage and use VRAM instead. If you are running into memory errors, reducing this parameter (gpu_layers) might help!
from ctransformers import AutoModelForCausalLM
# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
model = AutoModelForCausalLM.from_pretrained(
"TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
model_file="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
model_type="mistral",
gpu_layers=50,
hf=True
)
After having loaded the model itself, we want to create a 🤗 Transformers pipeline.
The main benefit of doing so is that these pipelines are found in many tutorials and are often used as a backend in packages. Thus far, ctransformers is not yet as natively supported as transformers.
Loading the Mistral tokenizer with ctransformers is not yet possible as the model is quite new. Instead, we use the tokenizer from the original repository.
from transformers import AutoTokenizer, pipeline
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
# Pipeline
generator = pipeline(
model=model, tokenizer=tokenizer,
task='text-generation',
max_new_tokens=50,
repetition_penalty=1.1
)
Let’s see if this works with a very basic example:
>>> response = generator("What is 1+1?")
>>> print(response[0]["generated_text"])
"""
What is 1+1?
A: 2
"""
Perfect! It can handle a very basic question. For the purpose of keyword extraction, let’s explore whether it can handle a bit more complexity.
prompt = """
I have the following document:
* The website mentions that it only takes a couple of days to deliver but I still have not received mine
Extract 5 keywords from that document.
"""
response = generator(prompt)
print(response[0]["generated_text"])
We get the following output:
"""
I have the following document:
* The website mentions that it only takes a couple of days to deliver but I still have not received mine
Extract 5 keywords from that document.
**Answer:**
1. Website
2. Mentions
3. Deliver
4. Couple
5. Days
"""
It does great! However, if we want the structure of the output to stay consistent regardless of the input text we will have to give the LLM an example.
This is where more advanced prompt engineering comes in. As with most Large Language Models, Mistral 7B expects a certain prompt format. This is tremendously helpful when we want to show it what a “correct” interaction looks like.
The prompt template is as follows:
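It wraps the instruction in special tokens, roughly like this (the same tokens also appear in the prompts below):

<s>[INST] {your instruction here} [/INST] {model answer}</s>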
Based on that template, let’s create a template for keyword extraction.
It needs to have two components:
Example prompt - This will be used to show the LLM what a “good” output looks like.
Keyword prompt - This will be used to ask the LLM to extract the keywords.
The first component, the example_prompt, will simply be an example of correctly extracting the keywords in the format that we are interested in.
The format is a key component since it will make sure that the LLM will always output keywords the way we want:
example_prompt = """
<s>[INST]
I have the following document:
- The website mentions that it only takes a couple of days to deliver but I still have not received mine.
Please give me the keywords that are present in this document and separate them with commas.
Make sure you to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST] meat, beef, eat, eating, emissions, steak, food, health, processed, chicken</s>"""
The second component, the keyword_prompt, will essentially be a repeat of the example_prompt but with two changes: it uses KeyBERT’s [DOCUMENT] tag to indicate where the input document will go, and it leaves the answer after [/INST] empty for the LLM to generate.
We can use the [DOCUMENT] tag to insert a document at a location of your choice. Having this option helps us to change the structure of the prompt if needed without being fixed on having the document at a specific location.
keyword_prompt = """
[INST]
I have the following document:
- [DOCUMENT]
Please give me the keywords that are present in this document and separate them with commas.
Make sure you to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST]
"""
Lastly, we combine the two prompts to create our final template:
>>> prompt = example_prompt + keyword_prompt
>>> print(prompt)
"""
<s>[INST]
I have the following document:
- The website mentions that it only takes a couple of days to deliver but I still have not received mine.
Please give me the keywords that are present in this document and separate them with commas.
Make sure you to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST] meat, beef, eat, eating, emissions, steak, food, health, processed, chicken</s>
[INST]
I have the following document:
- [DOCUMENT]
Please give me the keywords that are present in this document and separate them with commas.
Make sure you to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST]
"""
Now that we have our final prompt template, we can start exploring a couple of interesting new features in KeyBERT with KeyLLM. We will start by exploring KeyLLM using only Mistral’s 7B model.
KeyLLM
Keyword extraction with vanilla KeyLLM couldn’t be more straightforward; we simply ask it to extract keywords from a document.
This idea of extracting keywords from documents through an LLM is straightforward and allows for easily testing your LLM and its capabilities.
Using KeyLLM is straightforward; we start by loading our LLM through keybert.llm.TextGeneration and give it the prompt template that we created before.
🔥 TIP 🔥: If you want to use a different LLM, like ChatGPT, you can find a full overview of implemented algorithms here:
from keybert.llm import TextGeneration
from keybert import KeyLLM
# Load it in KeyLLM
llm = TextGeneration(generator, prompt=prompt)
kw_model = KeyLLM(llm)
After preparing our KeyLLM instance, it is as simple as running .extract_keywords over your documents:
documents = [
"The website mentions that it only takes a couple of days to deliver but I still have not received mine.",
"I received my package!",
"Whereas the most powerful LLMs have generally been accessible only through limited APIs (if at all), Meta released LLaMA's model weights to the research community under a noncommercial license."
]
keywords = kw_model.extract_keywords(documents)
We get the following keywords:
[['deliver',
'days',
'website',
'mention',
'couple',
'still',
'receive',
'mine'],
['package', 'received'],
['LLM',
'API',
'accessibility',
'release',
'license',
'research',
'community',
'model',
'weights',
'Meta']]
These seem like a great set of keywords!
You can play around with the prompt to specify the kind of keywords you want extracted, how long they can be, and even in which language they should be returned if your LLM is multi-lingual.
Efficient KeyLLM
Iterating your LLM over thousands of documents is not the most efficient approach! Instead, we can leverage embedding models to make the keyword extraction a bit more efficient.
This works as follows. First, we embed all of our documents and convert them to numerical representations. Second, we find out which documents are most similar to one another. We assume that documents that are highly similar will have the same keywords, so there would be no need to extract keywords for all documents. Third, we only extract keywords from 1 document in each cluster and assign the keywords to all documents in the same cluster.
This is much more efficient and also quite flexible. The clusters are generated purely based on the similarity between documents, without taking cluster structures into account. In other words, it is essentially finding near-duplicate documents that we expect to have the same set of keywords.
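To make that grouping idea concrete, here is a minimal sketch — not KeyLLM's actual internals — of how documents could be grouped with a cosine-similarity threshold:
from sentence_transformers import SentenceTransformer, util

# Embed the documents (assumes `documents` is a list of strings, as defined earlier)
model = SentenceTransformer('BAAI/bge-small-en-v1.5')
embeddings = model.encode(documents, convert_to_tensor=True)

# Greedily group documents whose similarity exceeds the threshold
threshold = 0.95
groups, assigned = [], set()
for i in range(len(documents)):
    if i in assigned:
        continue
    sims = util.cos_sim(embeddings[i], embeddings)[0]
    group = [j for j in range(len(documents)) if j not in assigned and sims[j] >= threshold]
    assigned.update(group)
    groups.append(group)

# Only the first document of each group would be sent to the LLM;
# its keywords would then be copied to the other documents in that group.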
To do this with KeyLLM, we embed our documents beforehand and pass them to .extract_keywords. The threshold indicates how similar documents minimally need to be in order to be assigned to the same cluster.
Increasing this value to something like .95 will identify near-identical documents, whereas setting it to something like .5 will identify documents about the same topic.
from keybert import KeyLLM
from sentence_transformers import SentenceTransformer
# Extract embeddings
model = SentenceTransformer('BAAI/bge-small-en-v1.5')
embeddings = model.encode(documents, convert_to_tensor=True)
# Load it in KeyLLM
kw_model = KeyLLM(llm)
# Extract keywords
keywords = kw_model.extract_keywords(documents, embeddings=embeddings, threshold=.5)
We get the following keywords:
>>> keywords
[['deliver',
'days',
'website',
'mention',
'couple',
'still',
'receive',
'mine'],
['deliver',
'days',
'website',
'mention',
'couple',
'still',
'receive',
'mine'],
['LLaMA',
'model',
'weights',
'release',
'noncommercial',
'license',
'research',
'community',
'powerful',
'LLMs',
'APIs']]
In this example, we can see that the first two documents were clustered together and received the same keywords. Instead of passing all three documents to the LLM, we only pass two documents. This can speed things up significantly if you have thousands of documents.
KeyBERT & KeyLLM
Before, we manually passed the embeddings to KeyLLM to essentially do a zero-shot extraction of keywords. We can further extend this example by leveraging KeyBERT.
Since KeyBERT generates keywords and embeds the documents, we can leverage that to not only simplify the pipeline but also suggest a number of keywords to the LLM.
These suggested keywords can help the LLM decide on the keywords to use. Moreover, it allows for everything within KeyBERT to be used with KeyLLM!
This efficient keyword extraction with both KeyBERT and KeyLLM only requires three lines of code! We create a KeyBERT model and assign it the LLM with the embedding model we previously created:
from keybert import KeyLLM, KeyBERT
# Load it in KeyLLM
kw_model = KeyBERT(llm=llm, model='BAAI/bge-small-en-v1.5')
# Extract keywords
keywords = kw_model.extract_keywords(documents, threshold=0.5)
We get the following keywords:
>>> keywords
[['deliver',
'days',
'website',
'mention',
'couple',
'still',
'receive',
'mine'],
['package', 'received'],
['LLM',
'API',
'accessibility',
'release',
'license',
'research',
'community',
'model',
'weights',
'Meta']]
And that is it! With KeyLLM you are able to use Large Language Models to help create better keywords. We can choose to extract keywords from the text itself or ask the LLM to come up with keywords.
By combining KeyLLM with KeyBERT, we increase its potential by doing some computation and suggestions beforehand.
🔥 TIP 🔥: You can use the [CANDIDATES] tag to pass the keywords generated by KeyBERT to the LLM as candidate keywords. That way, you can tell the LLM that KeyBERT has already generated a number of keywords and ask it to improve them.
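For example, a prompt using both tags could look roughly like this (a sketch; the exact wording is up to you):
[INST]
I have the following document:
- [DOCUMENT]
Here is a list of candidate keywords: [CANDIDATES]
Please improve these keywords so they best describe the document, and separate them with commas.
Make sure to only return the keywords and say nothing else.
[/INST]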
If you are, like me, passionate about AI and/or Psychology, please feel free to add me on LinkedIn, follow me on Twitter, or subscribe to my Newsletter:
Large Language Models (LLMs) are here to stay. With the recent release of Llama 2, LLMs are approaching the performance of ChatGPT and, with proper tuning, can even exceed it.
Using these LLMs is often not as straightforward as it seems especially if you want to fine-tune the LLM to your specific use case.
In this article, we will go through 3 of the most common methods for improving the performance of any LLM:
Prompt Engineering
Retrieval Augmented Generation (RAG)
Parameter Efficient Fine-Tuning (PEFT)
There are many more methods but these are the easiest and can result in major improvements without much work.
These 3 methods start from the least complex method, the so-called low-hanging fruits, to one of the more complex methods for improving your LLM.
To get the most out of LLMs, you can even combine all three methods!
Before we get started, here is a more in-depth overview of the methods for easier reference:
You can also follow along with the Google Colab Notebook to make sure everything works as intended.
Update: I uploaded a video version to YouTube that goes more in-depth into how to use these methods:
Before we get started, we need to load in an LLM to use throughout these examples. We’re going with the base Llama 2 as it shows incredible performance and because I am a big fan of sticking with foundation models in tutorials.
We will first need to accept the license before we can get started. Follow these steps:
After doing so, we can log in with our HuggingFace credentials so that this environment knows we have permission to download the Llama 2 model that we are interested in:
from huggingface_hub import notebook_login
notebook_login()
Next, we can load in the 13B variant of Llama 2:
from torch import cuda, bfloat16
import transformers
model_id = 'meta-llama/Llama-2-13b-chat-hf'
# 4-bit quantization to load Llama 2 with less GPU memory
bnb_config = transformers.BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type='nf4',
bnb_4bit_use_double_quant=True,
bnb_4bit_compute_dtype=bfloat16
)
# Llama 2 Tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
# Llama 2 Model
model = transformers.AutoModelForCausalLM.from_pretrained(
model_id,
trust_remote_code=True,
quantization_config=bnb_config,
device_map='auto',
)
model.eval()
# Our text generator
generator = transformers.pipeline(
model=model, tokenizer=tokenizer,
task='text-generation',
temperature=0.1,
max_new_tokens=500,
repetition_penalty=1.1
)
Most open-source LLMs have some sort of template that you must adhere to when creating prompts. In the case of Llama 2, the following helps guide the prompts:
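In short, the chat variants of Llama 2 expect a system prompt wrapped in <<SYS>> tags and the user prompt wrapped in [INST] tags (a sketch; check the official Llama 2 documentation for the exact tokens):
<s>[INST] <<SYS>>
{{ System Prompt }}
<</SYS>>
{{ User Prompt }} [/INST]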
This means that we would have to use the prompt as follows to generate text properly:
basic_prompt = """
<s>[INST] <<SYS>>
You are a helpful assistant
<</SYS>>
What is 1 + 1? [/INST]
"""
print(generator(basic_prompt)[0]["generated_text"])
Which generates the following output:
"""
Oh my, that's a simple one!
The answer to 1 + 1 is... (drumroll please)... 2! 😄
"""
What a cheeky LLM!
The template is less complex than it seems but with a bit of practice, you should get it right in no time.
Now, let’s dive into our first method for improving the output of an LLM, prompt engineering.
How we ask the LLM something has a major effect on the quality of the output that we get. We need to be precise and complete, and give examples of the output we are interested in.
This tailoring of your prompt is called prompt engineering.
Prompt engineering is such an amazing way to “tune” your model. It requires no updating of the model and you can quickly iterate over it.
There are two major concepts in prompt engineering:
Example-based
Thought-based
In example-based prompting, such as one-shot or few-shot learning, we provide the LLM with a couple of examples of what we are looking for.
This generally generates text that is more aligned with how we want it.
For example, let’s apply sentiment classification to a short review:
prompt = """
<s>[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
Classify the text into neutral, negative or positive.
Text: I think the food was okay. [/INST]
"""
print(generator(prompt)[0]["generated_text"])
Which generates the following output:
"""
Positive. The word "okay" is a mildly positive word,
indicating that the food was satisfactory or acceptable.
"""
Personally, I am not that convinced by the answer. I think it is more neutral than positive. Also, we have to search the generated text for the answer.
Instead, let’s give it an example of how we want the answer to be generated:
prompt = """
<s>[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
Classify the text into neutral, negative or positive.
Text: I think the food was alright.
Sentiment:
[/INST]
Neutral</s><s>
[INST]
Classify the text into neutral, negative or positive.
Text: I think the food was okay.
Sentiment:
[/INST]
"""
print(generator(prompt)[0]["generated_text"])
When we look at the output, we get the expected result:
"""
Neutral
"""
Now, the LLM is to the point and gives only the answer that we are interested in. Much better!
We can go a step further and ask the LLM to “reason” about its answer.
By having the LLM divide its thinking into smaller steps, it allows for more computation to be given to each step. These smaller steps are generally referred to as the *“thoughts”* of the LLM.
There are many ways that we can use thought-based prompting and we can make it quite complex as seen below:
However, let’s start with the most basic example. What if we ask the model to solve a problem by breaking the problem into steps?
prompt = """
<s>[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
Do the odd numbers in this group add up to an even number? 3, 5, 15, 32.
Solve by breaking the problem into steps.
Identify the odd numbers, add them, and indicate whether the result is odd or even.[/INST]
"""
print(generator(prompt)[0]["generated_text"])
The output we get shows its reasoning in detail:
"""
Hello! I'd be happy to help you with that. Let's break down the problem into steps:
Step 1: Identify the odd numbers in the group.
The odd numbers in the group are:
3, 5, 15
Step 2: Add the odd numbers.
3 + 5 + 15 = 23
Step 3: Is the result odd or even?
23 is an odd number.
Therefore, the sum of the odd numbers in the group is an odd number.
"""
As we have seen before, this is also called Chain-of-Thought, where the LLM has a sequence of individual *thoughts* or steps it follows.
These individual steps also help the model to stay accountable during its computation. Because it has “reasoned” about each step individually there is structure in its “thinking” process.
Although prompt engineering can get us an improvement, it cannot make the LLM know something it has not learned before.
When an LLM is trained in 2022, it has no knowledge about what has happened in 2023.
This is where Retrieval-Augmented Generation (RAG) comes in. It is a method of providing external knowledge to an LLM that it can leverage.
In RAG, a knowledge base, like Wikipedia, is converted to numerical representations to capture its meaning, called embeddings. These embeddings are stored in a vector database so that the information can easily be retrieved.
Then, when you give the LLM a certain prompt, the vector database is searched for information that relates to the prompt.
The most relevant information is then passed to the LLM as the additional context that it can use to derive its response.
In practice, RAG helps the LLM to “look up” information in its external knowledge base to improve its response.
To create a RAG pipeline or system, we can use the well-known and easy-to-use framework called LangChain.
We’ll start with creating a tiny knowledge base about Llama 2 and writing it into a text file:
# Our tiny knowledge base
knowledge_base = [
"On July 18, 2023, in partnership with Microsoft, Meta announced LLaMA-2, the next generation of LLaMA." ,
"Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ",
"The fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases.",
"Meta trained and released LLaMA-2 in three model sizes: 7, 13, and 70 billion parameters.",
"The model architecture remains largely unchanged from that of LLaMA-1 models, but 40% more data was used to train the foundational models.",
"The accompanying preprint also mentions a model with 34B parameters that might be released in the future upon satisfying safety targets."
]
with open(r'knowledge_base.txt', 'w') as fp:
fp.write('\n'.join(knowledge_base))
After doing so, we will need to create an embedding model that can convert text to numerical representations, namely embeddings.
We will choose a well-known sentence-transformers model, namely sentence-transformers/all-MiniLM-L6-v2.
🔥 TIP 🔥 You can find an amazing list of models at the Massive Text Embedding Benchmark (MTEB) Leaderboard.
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
# Embedding Model for converting text to numerical representations
embedding_model = HuggingFaceEmbeddings(
model_name='sentence-transformers/all-MiniLM-L6-v2'
)
Now that we have an embedding model and a tiny knowledge base, we can start developing our vector database. There are many to choose from but let’s choose one that we can use locally, namely FAISS.
🔥 TIP🔥 Other great vector databases are Pinecone and Weaviate.
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.document_loaders import TextLoader
# Load documents and split them
documents = TextLoader("knowledge_base.txt").load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
# Create local vector database
db = FAISS.from_documents(docs, embedding_model)
The db variable contains everything we need for the database, but we still need to combine it with an LLM. LangChain makes this incredibly easy and straightforward:
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline
# Load LLM into LangChain
llm = HuggingFacePipeline(pipeline=generator)
# RAG Pipeline
rag = RetrievalQA.from_chain_type(
llm=llm, chain_type='stuff',
retriever=db.as_retriever()
)
The great thing about the two variables llm and rag that we created is that we can use llm to do some prompting without the RAG pipeline and rag with the RAG pipeline.
Let’s start without RAG. What happens if we ask the LLM about itself, Llama 2?
>>> llm('What is Llama 2?')
"""
Llama 2 is a cryptocurrency that was created as a parody of the popular cryptocurrency Dogecoin. It was designed to be a more serious and less meme-focused alternative to Dogecoin, with a focus on privacy and security.
Llama 2 uses a proof-of-work consensus algorithm and has a block time of 1 minute. It also features a decentralized governance system that allows holders of the llama2 token to vote on proposals for the development and growth of the project.
One of the key features of Llama 2 is its use of "ring signatures" to enhance privacy and security. Ring signatures are a type of digital signature that can be used to hide the identity of the sender in a transaction. This makes it more difficult for hackers or other malicious actors to trace the source of a transaction.
Overall, Llama 2 is a privacy-focused cryptocurrency that is designed to provide users with greater control over their financial data and more secure transactions.
"""
Cryptocurrency? That is not exactly the answer that we were looking for… It seems that it has no knowledge about itself.
Let’s try to use the RAG pipeline instead:
>>> rag('What is Llama 2?')
"""
Llama 2 is a collection of pretrained and fine-tuned large language models
(LLMs) announced by Meta in partnership with Microsoft on July 18, 2023.
"""
That is much better!
Since we have given it external knowledge about Llama 2, it can leverage that information to generate more accurate answers.
🔥 TIP 🔥 Prompting can get difficult and complex quite quickly. If you want to know the exact prompt that is given to the LLM, you can run the following before running the LLM:
import langchain
langchain.debug = True
Both prompt engineering and RAG generally do not change the LLM itself. Its parameters remain the same and the model does not “learn” anything new; it simply leverages what it already knows together with the context we give it.
We can fine-tune the LLM for a specific use case with domain-specific data so that it learns something new.
Instead of fine-tuning the model’s billions of parameters, we can leverage PEFT instead, Parameter-Efficient Fine-Tuning. As the name implies, it is a subfield that focuses on efficiently fine-tuning an LLM with as few parameters as possible.
One of the most often used methods to do so is called Low-Rank Adaptation (LoRA). Instead of updating all of the original parameters, LoRA keeps the base model frozen and trains a small set of additional low-rank weight matrices on top of it.
These adapters can be seen as compact representations of the updates that matter most, so only a tiny fraction of parameters is actually trained. The beauty is that the resulting adapter weights can be merged into the base model or saved separately.
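Under the hood, this kind of adapter can be configured with the Hugging Face peft library. Here is a hedged sketch of what a manual LoRA setup could look like — the target modules and hyperparameters below are assumptions for illustration, not necessarily what AutoTrain uses:
from peft import LoraConfig, get_peft_model

# Low-rank adapters on the attention projections; the base model itself stays frozen
lora_config = LoraConfig(
    r=16,                                 # rank of the adapter matrices (assumed)
    lora_alpha=32,                        # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],  # which weight matrices receive adapters (assumed)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# `model` is the Llama 2 model we loaded earlier
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # typically well below 1% of all parameters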
The process of fine-tuning Llama 2 can be difficult with the many parameters out there. Fortunately, AutoTrain takes most of the difficulty away from you and allows you to fine-tune in only a single line!
We’ll start with the data. As always, it is the one thing that affects the resulting performance most!
We are going to turn the base Llama 2 model into a chat model, and we will use the OpenAssistant Guanaco dataset for that:
import pandas as pd
from datasets import load_dataset
# Load dataset in pandas
dataset = load_dataset("timdettmers/openassistant-guanaco")
df = pd.DataFrame(dataset["train"][:1000]).dropna()
df.to_csv("train.csv")
This dataset has a number of question/response schemes that you can train Llama 2 on. It differentiates the user with the ### Human tag and the response from the LLM with the ### Assistant tag.
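To give you an idea, an (illustrative, made-up) row in that text column looks roughly like this:
### Human: Can you explain what a large language model is?### Assistant: Sure! A large language model is a neural network trained on huge amounts of text to predict the next token in a sequence...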
We are only going to take 1000 samples from this dataset for illustration purposes but the performance will definitely increase with more quality data points.
NOTE: The dataset needs a text column, which is what AutoTrain will automatically use.
The training in itself is extremely straightforward after installing AutoTrain with only a single line of code:
autotrain llm --train \
--project_name Llama-Chat \
--model abhishek/llama-2-7b-hf-small-shards \
--data_path . \
--use_peft \
--use_int4 \
--learning_rate 2e-4 \
--num_train_epochs 1 \
--trainer sft \
--merge_adapter
There are a number of parameters that are important:
data_path: The path to your data. We saved a *train.csv* locally with a *text* column that AutoTrain will use during training.
model: The base model that we are going to fine-tune. It is a sharded version of the base model that allows for easier training.
use_peft & use_int4: These parameters enable efficient fine-tuning of the model, which reduces the VRAM that is necessary. They leverage, in part, LoRA.
merge_adapter: To make it easier to use the model, we merge the LoRA weights with the base model to create a new, stand-alone model.
And that is it! Fine-tuning a Llama 2 model this way is incredibly easy and since we merged the LoRA weights with the original model, you can load in the updated model as we did before.
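Since we passed --merge_adapter, the merged model should end up in the project folder. Loading it back is then (roughly) the same as before — note that the folder name below simply follows from the --project_name we chose, so treat the exact path as an assumption:
import transformers

# Load the fine-tuned, merged model from the AutoTrain output folder (assumed path)
tokenizer = transformers.AutoTokenizer.from_pretrained("Llama-Chat")
model = transformers.AutoModelForCausalLM.from_pretrained("Llama-Chat", device_map="auto")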
🔥 TIP 🔥 Although fine-tuning in one line is amazing, it is very much advised to go through the parameters yourself. Working through in-depth guides on what fine-tuning actually involves also helps you understand when things go wrong.
If you are, like me, passionate about AI and/or Psychology, please feel free to add me on LinkedIn, follow me on Twitter, or subscribe to my Newsletter:
With the advent of Llama 2, running strong LLMs locally has become more and more of a reality. Its accuracy approaches OpenAI’s GPT-3.5, which serves well for many use cases.
In this article, we will explore how we can use Llama2 for Topic Modeling without the need to pass every single document to the model. Instead, we are going to leverage BERTopic, a modular topic modeling technique that can use any LLM for fine-tuning topic representations.
Update: I uploaded a video version to YouTube that goes more in-depth into how to use BERTopic with Llama 2:
BERTopic works in a rather straightforward way. It consists of 5 sequential steps: embedding documents, reducing the embeddings in dimensionality, clustering the embeddings, tokenizing documents per cluster, and finally extracting the best-representing words per topic.
However, with the rise of LLMs like Llama 2, we can do much better than a bunch of independent words per topic. It is computationally not feasible to pass all documents to Llama 2 directly and have it analyze them. We can employ vector databases for search but we are not entirely sure which topics to search for.
Instead, we will leverage the clusters and topics that were created by BERTopic and have Llama 2 fine-tune and distill that information into something more accurate.
This is the best of both worlds, the topic creation of BERTopic together with the topic representation of Llama 2.
Now that this intro is out of the way, let’s start the hands-on tutorial!
We will start by installing a number of packages that we are going to use throughout this example:
pip install bertopic datasets accelerate bitsandbytes xformers adjustText
Keep in mind that you will need at least a T4 GPU in order to run this example, which can be used with a free Google Colab instance.
We are going to apply topic modeling on a number of ArXiv abstracts. They are a great source for topic modeling since they contain a wide variety of topics and are generally well-written.
from datasets import load_dataset
dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
# Extract abstracts to train on and corresponding titles
abstracts = dataset["abstract"]
titles = dataset["title"]
To give you an idea, an abstract looks like the following:
>>> # The abstract of "Attention Is All You Need"
>>> print(abstracts[13894])
"""
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks in an encoder-decoder configuration. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer, based
solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experiments on two machine translation tasks show these models to be
superior in quality while being more parallelizable and requiring significantly
less time to train. Our model achieves 28.4 BLEU on the WMT 2014
English-to-German translation task, improving over the existing best results,
including ensembles by over 2 BLEU. On the WMT 2014 English-to-French
translation task, our model establishes a new single-model state-of-the-art
BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction
of the training costs of the best models from the literature. We show that the
Transformer generalizes well to other tasks by applying it successfully to
English constituency parsing both with large and limited training data.
"""
Before we can load in Llama2 using a number of tricks, we will first need to accept the License for using Llama2. The steps are as follows:
After doing so, we can log in with our HuggingFace credentials so that this environment knows we have permission to download the Llama 2 model that we are interested in.
from huggingface_hub import notebook_login
notebook_login()
Now comes one of the more interesting components of this tutorial, how to load in a Llama 2 model on a T4-GPU!
We will be focusing on the 'meta-llama/Llama-2-13b-chat-hf' variant. It is large enough to give interesting and useful results whilst small enough that it can be run in our environment.
We start by defining our model and identifying if our GPU is correctly selected. We expect the output of device to show a CUDA device:
from torch import cuda
model_id = 'meta-llama/Llama-2-13b-chat-hf'
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'; print(device)
In order to load our 13 billion parameter model, we will need to perform some optimization tricks. Since we have limited VRAM and not an A100 GPU, we will need to “condense” the model a bit so that we can run it.
There are a number of tricks that we can use but the main principle is going to be 4-bit quantization.
This process reduces the model’s original 16-bit floating-point weights to only 4 bits, which reduces the GPU memory that we will need. It is a recent technique, and quite an elegant one, for efficient LLM loading and usage. You can find more about that method here in the QLoRA paper and on the amazing HuggingFace blog here.
from torch import bfloat16
import transformers
# Quantization to load an LLM with less GPU memory
bnb_config = transformers.BitsAndBytesConfig(
load_in_4bit=True, # 4-bit quantization
bnb_4bit_quant_type='nf4', # Normalized float 4
bnb_4bit_use_double_quant=True, # Second quantization after the first
bnb_4bit_compute_dtype=bfloat16 # Computation type
)
These four parameters that we just set are incredibly important and bring many LLM applications to consumers:
load_in_4bit - Load the model weights in 4-bit precision
bnb_4bit_quant_type - Use the normalized float 4 (nf4) data type
bnb_4bit_use_double_quant - Apply a second quantization after the first to save additional memory
bnb_4bit_compute_dtype - The data type (bfloat16) used during computation
Using this configuration, we can start loading in the model as well as the tokenizer:
# Llama 2 Tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
# Llama 2 Model
model = transformers.AutoModelForCausalLM.from_pretrained(
model_id,
trust_remote_code=True,
quantization_config=bnb_config,
device_map='auto',
)
model.eval()
Using the model and tokenizer, we will generate a HuggingFace transformers pipeline that allows us to easily generate new text:
# Our text generator
generator = transformers.pipeline(
model=model, tokenizer=tokenizer,
task='text-generation',
temperature=0.1,
max_new_tokens=500,
repetition_penalty=1.1
)
To check whether our model is correctly loaded, let’s try it out with a few prompts.
>>> prompt = "Could you explain to me how 4-bit quantization works as if I am 5?"
>>> res = generator(prompt)
>>> print(res[0]["generated_text"])
"""
Could you explain to me how 4-bit quantization works as if I am 5?
Sure! Imagine you have a big box of crayons. Each crayon represents a different color, like red, blue, green, and so on. Now, imagine that instead of using all the different colors, we only want to use four colors: red, blue, green, and yellow. We can do this by taking all the colors and dividing them into four groups.
Group 1: Red, Blue, Green
Group 2: Yellow
Now, when we want to draw something, we can only choose one color from each group. So, if we want to draw a red apple, we would choose the red crayon from Group 1 and the yellow crayon from Group 2. This way, we can only use four colors, but we can still make lots of different things!
In the same way, when we use 4-bit quantization, we take all the numbers and divide them into four groups. Each group has a different number of bits, just like our crayon groups had different colors. Then, when we want to represent a number, we can only choose one number from each group. This way, we can represent lots of different numbers using only four bits!
"""
Although we can directly prompt the model, there is actually a template that we need to follow. The template looks as follows:
"""
<s>[INST] <<SYS>>
{{ System Prompt }}
<</SYS>>
{{ User Prompt }}
[/INST]
{{ Model Answer }}
"""
This template consists of two main components, namely the {{ System Prompt }} and the {{ User Prompt }}:
The {{ System Prompt }} helps us guide the model during a conversation. For example, we can say that it is a helpful assistant that is specialized in labeling topics.
The {{ User Prompt }} is where we ask it a question.
You might have noticed the [INST] tags, which are used to identify the beginning and end of a prompt. We can use these to model the conversation history, as we will see more in-depth later on.
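For example, a short two-turn conversation could be encoded roughly as follows (a sketch of the format, not a prompt we use in this tutorial):
<s>[INST] <<SYS>>
{{ System Prompt }}
<</SYS>>
{{ First User Prompt }} [/INST] {{ First Model Answer }}</s>
<s>[INST] {{ Second User Prompt }} [/INST]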
Next, let’s see how we can use this template to optimize Llama 2 for topic modeling.
We are going to keep our system prompt simple and to the point:
# System prompt describes information given to all conversations
system_prompt = """
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant for labeling topics.
<</SYS>>
"""
We will tell the model that it is simply a helpful assistant for labeling topics since that is our main goal.
In contrast, our user prompt is going to be a bit more involved. It will consist of two components: an example and the main prompt.
Let’s start with the example. Most LLMs do a much better job of generating accurate responses if you give them an example to work with. We will show it an accurate example of the kind of output we are expecting.
# Example prompt demonstrating the output we are looking for
example_prompt = """
I have a topic that contains the following documents:
- Traditional diets in most cultures were primarily plant-based with a little meat on top, but with the rise of industrial style meat production and factory farming, meat has become a staple food.
- Meat, but especially beef, is the worst food in terms of emissions.
- Eating meat doesn't make you a bad person, not eating meat doesn't make you a good one.
The topic is described by the following keywords: 'meat, beef, eat, eating, emissions, steak, food, health, processed, chicken'.
Based on the information about the topic above, please create a short label of this topic. Make sure you to only return the label and nothing more.
[/INST] Environmental impacts of eating meat
"""
This example, based on a number of keywords and documents primarily about the impact of eating meat, helps the model understand the kind of output it should give. We show the model that we are expecting only the label, which is easier for us to extract.
Next, we will create a template that we can use within BERTopic:
# Our main prompt with documents ([DOCUMENTS]) and keywords ([KEYWORDS]) tags
main_prompt = """
[INST]
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: '[KEYWORDS]'.
Based on the information about the topic above, please create a short label of this topic. Make sure you to only return the label and nothing more.
[/INST]
"""
There are two BERTopic-specific tags that are of interest, namely [DOCUMENTS] and [KEYWORDS]:
[DOCUMENTS] contains the top 5 most relevant documents for the topic
[KEYWORDS] contains the top 10 most relevant keywords for the topic, as generated through c-TF-IDF
This template will be filled in accordingly for each topic. Finally, we can combine everything into our final prompt:
prompt = system_prompt + example_prompt + main_prompt
Before we can start with topic modeling, we will first need to perform two steps: pre-calculating the embeddings for each document and defining the sub-models of BERTopic.
By pre-calculating the embeddings for each document, we can speed up additional exploration steps and use the embeddings to quickly iterate over BERTopic’s hyperparameters if needed.
🔥 TIP: You can find a great overview of good embeddings for clustering on the MTEB Leaderboard.
from sentence_transformers import SentenceTransformer
# Pre-calculate embeddings
embedding_model = SentenceTransformer("BAAI/bge-small-en")
embeddings = embedding_model.encode(abstracts, show_progress_bar=True)
Next, we will define all sub-models in BERTopic and do some small tweaks to the number of clusters to be created, setting random states, etc.
from umap import UMAP
from hdbscan import HDBSCAN
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=150, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
As a small bonus, we are going to reduce the embeddings we created before to 2 dimensions so that we can use them for visualization purposes when we have created our topics.
# Pre-reduce embeddings for visualization purposes
reduced_embeddings = UMAP(n_neighbors=15, n_components=2, min_dist=0.0, metric='cosine', random_state=42).fit_transform(embeddings)
One of the ways we are going to represent the topics is with Llama 2 which should give us a nice label. However, we might want to have additional representations to view a topic from multiple angles.
Here, we will be using c-TF-IDF as our main representation and KeyBERT, MMR, and Llama 2 as our additional representations.
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, TextGeneration
# KeyBERT
keybert = KeyBERTInspired()
# MMR
mmr = MaximalMarginalRelevance(diversity=0.3)
# Text generation with Llama 2
llama2 = TextGeneration(generator, prompt=prompt)
# All representation models
representation_model = {
"KeyBERT": keybert,
"Llama2": llama2,
"MMR": mmr,
}
Now that we have our models prepared, we can start training our topic model! We supply BERTopic with the sub-models of interest, run .fit_transform, and see what kind of topics we get.
from bertopic import BERTopic
topic_model = BERTopic(
# Sub-models
embedding_model=embedding_model,
umap_model=umap_model,
hdbscan_model=hdbscan_model,
representation_model=representation_model,
# Hyperparameters
top_n_words=10,
verbose=True
)
# Train model
topics, probs = topic_model.fit_transform(abstracts, embeddings)
Now that we are done training our model, let’s see what topics were generated:
# Show top 3 most frequent topics
topic_model.get_topic_info()[1:4]
Index | Topic | Count | Representation | KeyBERT | Llama2 | MMR |
---|---|---|---|---|---|---|
1 | 0 | 10339 | [‘policy’, ‘reinforcement’, ‘rl’, ‘agent’, ‘learning’, ‘control’, ‘agents’, ‘to’, ‘reward’, ‘in’] | [‘learning’, ‘robots’, ‘reinforcement’, ‘dynamics’, ‘model’, ‘robotic’, ‘learned’, ‘robot’, ‘algorithms’, ‘exploration’] | [‘Reinforcement Learning Agent Control’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’] | [‘policy’, ‘reinforcement’, ‘rl’, ‘agent’, ‘learning’, ‘control’, ‘agents’, ‘to’, ‘reward’, ‘in’] |
2 | 1 | 3592 | [‘privacy’, ‘federated’, ‘fl’, ‘private’, ‘clients’, ‘data’, ‘learning’, ‘communication’, ‘local’, ‘client’] | [‘federated’, ‘decentralized’, ‘heterogeneity’, ‘distributed’, ‘algorithms’, ‘datasets’, ‘models’, ‘convergence’, ‘model’, ‘gradient’] | [‘Privacy-Preserving Machine Learning: Federated Learning’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’] | [‘privacy’, ‘federated’, ‘fl’, ‘private’, ‘clients’, ‘data’, ‘learning’, ‘communication’, ‘local’, ‘client’] |
3 | 2 | 3532 | [‘speech’, ‘audio’, ‘speaker’, ‘music’, ‘asr’, ‘acoustic’, ‘recognition’, ‘voice’, ‘the’, ‘model’] | [‘encoder’, ‘speech’, ‘voice’, ‘trained’, ‘language’, ‘models’, ‘neural’, ‘model’, ‘supervised’, ‘learning’] | [‘Speech Recognition and Audio-Visual Processing’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’] | [‘speech’, ‘audio’, ‘speaker’, ‘music’, ‘asr’, ‘acoustic’, ‘recognition’, ‘voice’, ‘the’, ‘model’] |
# Show top 3 least frequent topics
topic_model.get_topic_info()[-3:]
Index | Topic | Count | Representation | KeyBERT | Llama2 | MMR |
---|---|---|---|---|---|---|
118 | 117 | 160 | [‘design’, ‘circuit’, ‘circuits’, ‘synthesis’, ‘chip’, ‘designs’, ‘power’, ‘hardware’, ‘placement’, ‘hls’] | [‘circuits’, ‘circuit’, ‘analog’, ‘optimization’, ‘model’, ‘chip’, ‘technology’, ‘simulation’, ‘learning’, ‘neural’] | [‘Design Automation for Analog Circuits using Reinforcement Learning’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’] | [‘design’, ‘circuit’, ‘circuits’, ‘synthesis’, ‘chip’, ‘designs’, ‘power’, ‘hardware’, ‘placement’, ‘hls’] |
119 | 118 | 159 | [‘sentiment’, ‘aspect’, ‘analysis’, ‘polarity’, ‘reviews’, ‘opinion’, ‘text’, ‘task’, ‘twitter’, ‘language’] | [‘embeddings’, ‘sentiment’, ‘sentiments’, ‘supervised’, ‘annotated’, ‘corpus’, ‘aspect’, ‘multilingual’, ‘datasets’, ‘model’] | [‘Multilingual Aspect-Based Sentiment Analysis’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’] | [‘sentiment’, ‘aspect’, ‘analysis’, ‘polarity’, ‘reviews’, ‘opinion’, ‘text’, ‘task’, ‘twitter’, ‘language’] |
120 | 119 | 159 | [‘crowdsourcing’, ‘workers’, ‘crowd’, ‘worker’, ‘crowdsourced’, ‘labels’, ‘annotators’, ‘annotations’, ‘label’, ‘labeling’] | [‘crowdsourcing’, ‘crowdsourced’, ‘annotators’, ‘crowds’, ‘annotation’, ‘algorithms’, ‘aggregation’, ‘crowd’, ‘datasets’, ‘annotator’] | [‘Crowdsourced Data Labeling’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’] | [‘crowdsourcing’, ‘workers’, ‘crowd’, ‘worker’, ‘crowdsourced’, ‘labels’, ‘annotators’, ‘annotations’, ‘label’, ‘labeling’] |
We got over 100 topics and they all seem quite diverse. We can use the labels generated by Llama 2 and assign them to the topics that we have created. Normally, the default topic representation would be c-TF-IDF, but we will focus on the Llama 2 representations instead.
llama2_labels = [label[0][0].split("\n")[0] for label in topic_model.get_topics(full=True)["Llama2"].values()]
topic_model.set_topic_labels(llama2_labels)
We can go through each topic manually, which would take a lot of work, or we can visualize them all in a single interactive graph. BERTopic has a bunch of visualization functions that we can use. For now, we are sticking with visualizing the documents.
topic_model.visualize_documents(titles, reduced_embeddings=reduced_embeddings,
hide_annotations=True, hide_document_hover=False, custom_labels=True)
Although we can use the built-in visualization features of BERTopic, we can also create a static visualization that might be a bit more informative.
We start by creating the necessary variables that contain our reduced embeddings and representations:
import itertools
import pandas as pd
# Define colors for the visualization to iterate over
colors = itertools.cycle(['#e6194b', '#3cb44b', '#ffe119', '#4363d8', '#f58231', '#911eb4', '#46f0f0', '#f032e6', '#bcf60c', '#fabebe', '#008080', '#e6beff', '#9a6324', '#fffac8', '#800000', '#aaffc3', '#808000', '#ffd8b1', '#000075', '#808080', '#ffffff', '#000000'])
color_key = {str(topic): next(colors) for topic in set(topic_model.topics_) if topic != -1}
# Prepare dataframe and ignore outliers
df = pd.DataFrame({"x": reduced_embeddings[:, 0], "y": reduced_embeddings[:, 1], "Topic": [str(t) for t in topic_model.topics_]})
df["Length"] = [len(doc) for doc in abstracts]
df = df.loc[df.Topic != "-1"]
df = df.loc[(df.y > -10) & (df.y < 10) & (df.x < 10) & (df.x > -10), :]
df["Topic"] = df["Topic"].astype("category")
# Get centroids of clusters
mean_df = df.groupby("Topic").mean().reset_index()
mean_df.Topic = mean_df.Topic.astype(int)
mean_df = mean_df.sort_values("Topic")
Next, we will visualize the reduced embeddings with matplotlib and process the labels in such a way that it is visually more pleasing:
import seaborn as sns
from matplotlib import pyplot as plt
from adjustText import adjust_text
import matplotlib.patheffects as pe
import textwrap
fig = plt.figure(figsize=(20, 20))
sns.scatterplot(data=df, x='x', y='y', c=df['Topic'].map(color_key), alpha=0.4, sizes=(0.4, 10), size="Length")
# Annotate top 50 topics
texts, xs, ys = [], [], []
for row in mean_df.iterrows():
topic = row[1]["Topic"]
name = textwrap.fill(topic_model.custom_labels_[int(topic)], 20)
if int(topic) <= 50:
xs.append(row[1]["x"])
ys.append(row[1]["y"])
texts.append(plt.text(row[1]["x"], row[1]["y"], name, size=10, ha="center", color=color_key[str(int(topic))],
path_effects=[pe.withStroke(linewidth=0.5, foreground="black")]
))
# Adjust annotations such that they do not overlap
adjust_text(texts, x=xs, y=ys, time_lim=1, force_text=(0.01, 0.02), force_static=(0.01, 0.02), force_pull=(0.5, 0.5))
plt.axis('off')
plt.legend('', frameon=False)
plt.show()
If you are, like me, passionate about AI and/or Psychology, please feel free to add me on LinkedIn, follow me on Twitter, or subscribe to my Newsletter:
There have been many interesting, complex, and innovative solutions since the release of ChatGPT. The community has explored countless possibilities for improving its capabilities.
One of those is the well-known Auto-GPT package. With more than 140k stars, it is one of the highest-ranking repositories on Github!
Auto-GPT is an attempt at making GPT-4 fully autonomous.
Auto-GPT gives GPT-4 the power to make its own decisions
It sounds incredible and it definitely is! But how does it work?
In this post, we will go through Auto-GPT’s architecture and explore how it can reach autonomous behavior.
Auto-GPT has an overall architecture, or a main loop of sorts, that it uses to model autonomous behavior.
Let’s start by describing this overall loop, after which we will go through each step in depth:
The core of Auto-GPT is a cyclical sequence of steps:
Initialize the prompt with summarized information
GPT-4 proposes an action
The action is executed
Embed both the input and output of this cycle
Save embeddings to a vector database
These 5 steps make up the core of Auto-GPT and represent its main autonomous behavior.
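As a rough mental model, the loop could be sketched like this in pseudocode — none of these helper functions exist under these names in Auto-GPT; they are placeholders for illustration:
# Hypothetical sketch of Auto-GPT's main loop; all helpers are made-up placeholders
def autonomous_loop(agent_description, goals, vector_db):
    while not goals_reached(goals):
        # 1. Initialize the prompt with a summary of everything done so far
        summary = summarize_history(vector_db)
        prompt = build_prompt(agent_description, goals, summary)

        # 2. GPT-4 proposes an action (thoughts, reasoning, plan, criticism, command)
        proposal = gpt4(prompt)

        # 3. The proposed command is executed (e.g., a web search or writing a file)
        result = execute_command(proposal["command"])

        # 4. Embed both the input and output of this cycle
        embedding = embed(prompt + str(result))

        # 5. Save the embedding so future cycles can "remember" this step
        vector_db.add(embedding, metadata={"prompt": prompt, "result": result})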
Before we go through each step in-depth, there is a step before this cyclical sequence, namely initializing the agent.
Before Auto-GPT completes a task fully autonomously, it first needs to initialize an Agent. This agent essentially describes who GPT-4 is and what goal it should pursue.
Let’s say that we want Auto-GPT to create a recipe for vegan chocolate.
With that goal in mind, we need to give GPT-4 a bit of context about what an agent should be and what it should achieve:
We create a prompt defining two things:
Create 5 highly effective goals (these can be updated later on)
Create an appropriate role-based name (_GPT)
The name helps GPT-4 to continuously remember what it should model. The sub-goals are especially helpful to make small tasks for it to achieve.
Next, we give an example of what the desired output should be:
Giving examples to any generative Large Language Model works really well. By describing what the output should look like, it more easily generates accurate answers.
When we pass this prompt to GPT-4 using Auto-GPT, we get the following response:
It seems that GPT-4 has created a description of RecipeGPT for us. We can give this context to GPT-4 as a system prompt so that it continuously remembers its purpose.
Now that Auto-GPT has created a description of its agent, along with clear goals, it can start by taking its first autonomous action.
The very first step in its cyclical sequence is creating the prompt that triggers an action.
The prompt consists of three components:
System Prompt
Summary
Call to Action
We will go into the summary a bit later but the call to action is nothing more than asking GPT-4 which command it should use. The commands GPT-4 can use are defined in its System Prompt.
The system prompt is the context that we give to GPT-4 so that it remembers certain guidelines that it should follow.
As shown above, it consists of six guidelines:
The goals and description of the initialized Agent
Constraints it should adhere to
Commands it can use
Resources it has access to
Evaluation steps
Example of a valid JSON output
The last five steps are essentially constraints the Agent should adhere to.
Here is a more in-depth overview of what these guidelines and constraints generally look like:
As you can see, the system prompt sketches the boundaries in which GPT-4 can act. For example, in “Resources”, it describes that GPT-4 can use GPT-3.5 Agents for the delegation of simple tasks. Similarly, *“Evaluation”* tells GPT-4 that it should continuously self-criticize its own behavior to improve upon its next actions.
Together, the very first prompt looks a bit like the following:
Notice that in blue “I was created” is mentioned. Typically, this would contain a summary of all the actions it has taken. Since it was just created, it has no action taken before and the summary is nothing more than “I was created”.
In step 2, we give GPT-4 the prompt we defined in the previous step. It can then propose an action to take which should adhere to the following format:
You can see six individual steps being mentioned:
Thoughts
Reasoning
Plan
Criticism
Speak
Action
These steps describe a format of prompting called Reason and ACT (ReACT).
ReACT is one of Auto-GPT’s superpowers!
ReACT allows for GPT-4 to mimic self-criticism and demonstrate more complex reasoning than what is possible if we just ask the model directly.
Whenever we ask GPT-4 a question using the ReACT framework, we ask GPT-4 to output individual thoughts, actions, and observations before coming to a conclusion.
By having the model mimic extensive reasoning, it tends to give more accurate answers compared to directly answering the question.
In our example, Auto-GPT has extended the base ReACT framework and generates the following response:
As you can see, it follows the ReACT pipeline that we described before but includes additional criticism and reasoning steps.
It proposes to search the web to extract more information about popular recipes.
After having generated a response in valid JSON format, we can extract what RecipeGPT wants to do. In this case, it calls for a web search:
and in turn, will execute searching the web:
This action it can take, searching the web, is simply a tool at its disposal that generates a file containing the main body of the page.
Since we explained to GPT-4 in its system prompt that it can use web search, it considers this a valid action.
Auto-GPT is as autonomous as the number of tools it possesses
Do note that if the only tool at its disposal is searching the web, then we can start to argue how autonomous such a model really is!
Either way, we save the output to a file for later use.
Every step Auto-GPT has taken thus far is vital information for any next steps to take. Especially when it needs to take dozens of steps, for example for taking over the world, remembering what it has done thus far is important.
One method of doing so is by embedding the prompts and output it has generated. This allows us to convert text into numerical representations (embeddings) that we can save later on.
These embeddings are generated using OpenAI’s *text-embedding-ada-002* model, which works tremendously well across many use cases.
After having generated the embeddings, we need a place to store them. Pinecone is often used to create the vector database but many other systems can be used as long as you can easily find similar vectors.
The vector database allows us to quickly find information for an input query.
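To make the embed-store-retrieve cycle concrete, here is a hedged sketch using the OpenAI embeddings endpoint and plain cosine similarity as a toy stand-in for a real vector database — this is not Auto-GPT's actual code:
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(text: str) -> np.ndarray:
    # Embed text with the same embedding model Auto-GPT uses
    response = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(response.data[0].embedding)

# Toy "vector database": a list of (embedding, text) pairs
memory = []

def remember(text: str):
    memory.append((embed(text), text))

def recall(query: str, top_k: int = 3):
    # Return the stored steps most similar to the query
    q = embed(query)
    scores = [float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e))) for e, _ in memory]
    best = np.argsort(scores)[::-1][:top_k]
    return [memory[i][1] for i in best]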
We can query the vector database to find all the steps it has taken thus far. Using that information, we ask GPT-4 to create a **summary** of those actions:
This summary is then used to construct the prompt as we did in step 1.
That way, it can “remember” what it has done thus far and think about the next steps to be taken.
This completes the very first cycle of Auto-GPT’s autonomous behavior!
As you might have guessed, the cycle continues from where we started, asking GPT-4 to take action based on a history of actions.
Auto-GPT will continue until it has reached its goal or when you interrupt it.
During this cyclical process, it can keep track of estimated costs in order to make sure you do not spend too much on your Agent.
In the future, especially with the release of Llama2, I expect and hope that local models can reliably be used in Auto-GPT!
These GPT (Generative Pretrained Transformer) models seemingly removed the threshold for diving into Artificial Intelligence for those without a technical background. Anyone can just start asking the models a bunch of stuff and get scarily accurate answers.
At least, most of the time…
When it fails to reproduce the right output, it does not mean it is incapable of doing so. Often, we simply need to change what we ask, the prompt, in a way to guide the model toward the right answer.
This is often referred to as prompt engineering.
Many of the techniques in prompt engineering try to mimic the way humans think. Asking the model to “think aloud” or “let’s think step by step” are great examples of having the model mimic how we think.
These analogies between GPT models and human psychology are important since they help us understand how we can improve the output of GPT models. It shows us capabilities they might be missing.
This does not mean that I am advocating for any GPT model as general intelligence but it is interesting to see how and why we are trying to make GPT models “think” like humans.
Many of the analogies that you will see here are also discussed in this video. Andrej Karpathy shares amazing insights into Large Language Models from a psychological perspective and is definitely worth watching!
As a data scientist and psychologist myself, this is a subject that is close to my heart. It is incredibly interesting to see how these models behave, how we would like them to behave, and how we are nudging these models to behave like us.
There are a number of subjects where analogies between GPT models and human psychology give interesting insights that will be discussed in this article:
DISCLAIMER: When talking about analogies of GPT models with human psychology, there is a risk involved, namely the anthropomorphism of Artificial Intelligence. In other words, humanizing these GPT models. This is definitely not my intention. This post is not about existential risks or general intelligence but merely a fun exercise drawing similarities between us and GPT models. If anything, feel free to take this with a grain of salt!
A prompt is what we ask of a GPT model, for example: “Create a list of 10 book titles”.
When we try different questions in the hopes of improving the performance of the model, then we apply prompt engineering.
In psychology, there are many different forms of prompting individuals to exhibit certain behavior, which is typically used in applied behavior analysis (ABA) to learn new behavior.
There is a distinct difference between how this works in GPT models versus Psychology. In Psychology, prompting is about learning new behavior. Something the individual could not do before. For a GPT model, it is about demonstrating previously unseen behavior.
The main distinction lies in that an individual learns something entirely new and, to a certain degree, changes as an individual. In contrast, the GPT model was already capable of showing that behavior but did not due to its circumstances, namely the prompts. Even when you successfully elicit “appropriate” behavior from the model, the model itself did not change.
Prompting in GPT models is also a lot less subtle. Many of the techniques in prompting are as explicit as they can be (e.g., “You are a scientist. Summarize this article.”).
GPT models are copycats. They, and comparable models, are trained on mountains of textual data and try to replicate it as best they can.
This means that when you ask the model a question, it tries to generate a sequence of words that fits best with what it has seen during training. With enough training data, this sequence of words becomes more and more coherent.
However, such a model has no inherent capabilities of truly understanding the behavior it is mimicking. As with many things in this article, whether a GPT model truly is capable of reasoning is definitely open for discussion and often elicits passionate discussions.
Although we have inherent capabilities for mimicking behavior, it is much more involved and has a grounding in both social constructs and biology. We tend to, to some degree, understand mimicked behavior and can easily generalize it.
We have a preconceived notion of who we are, how our experiences have shaped us, and the views that we have of the world. We have an identity.
GPT models do not have an identity. It has a lot of knowledge about the world we live in and it knows what kind of answers we might prefer, but it has no sense of “self”.
It is not necessarily guided toward certain views like we are. From an identity perspective, it is a blank slate. This means that since a GPT model has a lot of knowledge about the world, it has some capabilities to mimic the identity you ask of it.
But as always, it is just mimicked behavior.
It does have a major advantage. We can ask the model to take on the role of a scientist, writer, editor, etc. and it will try to follow suit. By priming it towards mimicking certain identities, its output will be more tuned toward the task.
This is an interesting subject. There are many sources for evaluating Large Language Models on a wide variety of tests, such as the Hugging Face Leaderboard or using Elo ratings to challenge Large Language Models.
These are important tests to evaluate the capabilities of these models. However, what I consider to be a strength of a certain model, you might not agree with.
This relates to the model itself. Even if we tell it the scores of these tests, it still does not know where its strengths and weaknesses comparatively lie. For example, GPT-4 passed the bar exam, which we generally consider a big strength. However, the model might not realize that merely passing the bar is no longer the strength it seemed when it is in a room full of experienced lawyers.
In other words, it highly depends on the context of the situation when one’s capabilities are considered strengths or weaknesses. The same applies to our own capabilities. I might think myself to be proficient in Large Language Models but if you surround me with people like Andrew Ng, Sebastian Raschka, etc. my knowledge about Large Language Models is suddenly not the strength it was before.
This is important because the model does not instinctively know when something is a strength or weakness, so you should tell it.
For example, if you feel like the model is poor when solving mathematical equations, you can tell it to never perform any calculations itself but use the Wolfram Plugin instead.
In contrast, although we claim to have some notion of our own strengths and weaknesses, these are often subjective and tend to be heavily biased.
As mentioned previously, a GPT model does not know what it is good at or not in specific situations. You can help it make sense of the situation by adding an explanation of the situation to the prompt. By describing the situation, the model is primed towards generating more accurate answers.
This will not always make it capable across tasks. Like humans, explaining the situation helps but does not overcome all their weaknesses.
Instead, when we face something that we are not currently capable of we often rely on tools to overcome them. We use a calculator when doing complex equations or use a car for faster transportation.
This reliance on external tools is not something a GPT model automatically does. You will have to tell the model to use a specific external tool when you are convinced it is not capable of a certain task.
What is important here is that we rely on an enormous amount of tools on a daily basis, your phone, keys, glasses, etc. Giving a GPT model the same capabilities can be a tremendous help to its performance. These external tools are similar to the plugins that OpenAI has available.
A major disadvantage is that these models do not automatically use tools. They will only access plugins if you tell them that doing so is a possibility.
We typically have an inner voice that we converse with when solving difficult problems. “If I do this, then that will be the result, but if I do that, then that might give me a better solution”.
GPT models do not exhibit this behavior automatically. When you ask it a question it simply generates a number of words that most logically would follow that question. Sure, it does compute those words but it does not leverage those words to create this internal monologue.
As it turns out, asking the model to “think aloud” by saying “Let’s think step by step” tends to improve the answers it gives quite a bit. This is called chain-of-thought prompting and tries to emulate the thought processes of human reasoners. This does not necessarily mean that the model is “reasoning”, but it is interesting to see how much it improves performance.
As a nice little bonus, the model does not perform this monologue internally but writes it out, so following along with what the model is “thinking” gives amazing insight into its behavior.
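To make this concrete, below is a minimal sketch of chain-of-thought prompting, assuming the OpenAI Python client (v1-style API); the model name and the example question are placeholders rather than recommendations.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

question = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"

# Appending "Let's think step by step" nudges the model to write out its reasoning
response = client.chat.completions.create(
    model="gpt-4",  # placeholder model name
    messages=[{"role": "user", "content": question + "\n\nLet's think step by step."}],
)
print(response.choices[0].message.content)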
This “inner voice” is quite a bit simplified compared to how ours works. We are much more dynamic in the “conversations” we have with ourselves as well as the way we have those “conversations”. It can be symbolic, motoric, or even emotional in nature. For example, many athletes picture themselves performing the sport they excel in as a way to train for the actual thing. This is called mental imagery.
These conversations allow us to brainstorm. We use this to come up with new ideas, solve problems, and understand the context in which a problem appears. A GPT model, in contrast, will have to be told explicitly to brainstorm a solution through very specific instructions.
We can further relate this to our system 1 and system 2 thinking processes. System 1 thinking is an automatic, intuitive, and near-instantaneous process. We have very little control here. In contrast, system 2 is a conscious, slow, logical, and effortful process.
By giving a GPT model the ability of self-reflection, we are essentially trying to mimic this system 2 way of thinking. The model takes more time to generate an answer and looks over it carefully instead of quickly generating a response.
Roughly, you could say that without any prompt engineering, we enable its system 1 thinking process, whilst with specific instructions and chain-of-thought-like processes, we enable its system 2 way of thinking.
If you want to know more about our system 1 and system 2 thinking, there is an amazing book called Thinking, Fast and Slow that is worth reading!
Andrej Karpathy, in his video mentioned at the beginning of the article, makes a great comparison of a human’s memory capabilities versus that of a GPT model.
Our memory is quite complex, we have long-term memory, working memory, short-term memory, sensory memory, and more.
We can, very roughly, view the memory of a GPT model as four components and compare that to our own memory systems:
Long-term memory
Working memory
Sensory memory
External memory
The long-term memory of a GPT model can be viewed as the things it has learned whilst training on billions of data points. That information is, to a certain degree, represented within the model, which can reproduce it whenever needed. This long-term memory will stick with the model throughout its existence. In contrast, our long-term memory can decay over time, often referred to as the decay theory.
A GPT model’s long-term memory is perfect and does not decay over time
The working memory of a GPT model is everything that fits within the prompt you give it. The model can make full use of that information when performing its computation and giving back a response. This is a great analogy with our own working memory, a type of memory with a limited capacity that temporarily holds information. A GPT model, for instance, will “forget” its prompt after it has given its response. The reason why it seems to remember the conversation is that the conversation history is added to each new prompt.
A GPT model is forgetful when it comes to new information
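To illustrate how that apparent memory works, here is a minimal, hypothetical sketch of a chat loop: the growing message history is simply resent with every new prompt (again assuming the OpenAI Python client).
from openai import OpenAI

client = OpenAI()
history = []  # the model's "working memory" lives entirely in this list

def chat(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(model="gpt-4", messages=history)
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer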
Sensory memory relates to how we hold information derived from our senses, like visual, auditory, and haptic information. We use this information and pass it to our short-term or working memory for processing. This is similar to multi-modal GPT models, models that work on text, images, and even sound.
However, it might be more appropriate to say that GPT models have multi-modal working and long-term memory rather than sensory memory. These models tightly couple multi-modal data with their different forms of “memory”. So, as we have seen before, they rather mimic sensory memory.
A GPT model mimics sensory memory with a multi-modal training procedure
Lastly, GPT models become quite a bit stronger when you give them external memory. This refers to a database of information that it can access whenever it wants, like several books about physics. In contrast, our external memory uses cues from the environment to help us remember certain ideas and sensations. In a way, it is about accessing external information versus remembering internal information.
NOTE: I did not mention short-term memory. There is much discussion about whether short-term and working memory are actually the same thing. A difference often mentioned is that working memory does more than the short-term storage of information; it can also manipulate it. Working memory also makes for a better analogy with a GPT model, so let’s cherry-pick for a bit here.
As we have seen throughout this article, if we want a GPT model to do something, we should tell it.
This is important to note as it relates to a sense of autonomy. By default, we have a certain degree of autonomy. If I decide to grab a drink, I can.
This is different for a GPT model, which has no autonomy by default. It cannot operate independently unless we give it the necessary tools and environment to do so.
We can give a GPT model autonomy by having it create a number of tasks to execute in order to reach a certain end goal. For each task, it writes down the steps for completing it, reflects on them, and executes them if it has the tools to do so.
AutoGPT is a great example of giving a GPT model autonomy
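As a rough illustration of that loop, here is a hypothetical sketch; ask_llm stands in for whatever chat-completion call you use, and the prompts are purely illustrative.
def ask_llm(prompt: str) -> str:
    # Placeholder for a real chat-completion call; plug in your own client here
    raise NotImplementedError

def pursue_goal(goal: str, max_tasks: int = 5) -> None:
    # 1. Let the model break the goal into tasks
    tasks = ask_llm(f"List the tasks needed to achieve: {goal}").splitlines()
    for task in tasks[:max_tasks]:
        # 2. Plan the task, 3. reflect on the plan, 4. execute it
        plan = ask_llm(f"Write down the steps to complete this task: {task}")
        critique = ask_llm(f"Reflect on this plan and improve it:\n{plan}")
        result = ask_llm(f"Execute the improved plan and report the outcome:\n{critique}")
        print(f"{task}\n{result}\n")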
As a result, whatever the model is capable of is very much dependent on its environment, arguably to a larger degree than our environment impacts us. That is saying something, considering how strongly our environment shapes us.
This also means that although a GPT model can show impressively complex autonomous behavior, that behavior is fixed. It cannot decide to use a tool we never told it existed. We, in contrast, are more adaptable to new and previously unknown tools.
A common problem with GPT models is their tendency to confidently say something that is simply not true and not supported by their training data.
For example, when you ask a GPT model to generate factual information, like the revenue of Apple in 2019, it might generate completely false information.
This is called hallucination.
The term stems from hallucination in human psychology, where we perceive something as real whilst in reality it is not. The main difference here is that human hallucination is based on perception, whilst a model “hallucinates” incorrect facts.
It might be more appropriate to compare it with false memories. The tendency of humans to recall something differently from how it actually happened. This is similar to a GPT model that tries to reproduce things that actually never happened.
Interestingly, we can more easily generate false memories with suggestibility, priming, framing, etc. This seems to more closely match how a GPT model “hallucinates” as the prompt it receives is highly influential.
Our memories can also be influenced by prompts/phrases that we receive from others. For example, by asking a person “What shade of red was this car?” we are implicitly providing a person with a supposed “fact”, namely that the car was red even when it was not. This can generate false memories and is referred to as a presupposition.
A little over a month ago, OpenAI released a neural net for English speech recognition called Whisper. It has gained quite some popularity over the last few weeks due to its accuracy, ease of use, and most importantly because they open-sourced it!
With these kinds of releases, I can hardly wait to get my hands on such a model and play around with it. However, I like to have a fun or interesting use case to actually use it for.
So I figured, why not use it for creating transcripts of a channel I always enjoy watching, Kurzgesagt!
It is an amazing channel with incredibly well-explained videos focused on animated educational content, ranging from topics about Climate Change and Dinosaurs to Black Holes and Geoengineering.
I decided to do a little more than just create some transcripts. Instead, let us use BERTopic to see if we can extract the main topics found in Kurzgesagt’s videos.
Hence, this article is a tutorial about using Whisper and BERTopic to extract transcripts from Youtube videos and use topic modeling on top of them.
Before going into the actual code, we first need to install a few packages, namely Whisper, BERTopic, and Pytube.
pip install --upgrade git+https://github.com/openai/whisper.git
pip install git+https://github.com/pytube/pytube.git@refs/pull/1409/merge
pip install bertopic
We are purposefully choosing a specific pull request in Pytube since it fixes an issue with empty channels.
At the very last step, I am briefly introducing an upcoming feature of BERTopic, which you can already install with:
pip install git+https://github.com/MaartenGr/BERTopic.git@refs/pull/840/merge
We start by extracting all the metadata that we need from Kurzgesagt’s YouTube channel. Using Pytube, we can create a Channel object that allows us to extract the URLs and titles of their videos.
# Extract all video_urls
from pytube import YouTube, Channel
c = Channel('https://www.youtube.com/c/inanutshell/videos/')
video_urls = c.video_urls
video_titles = [video.title for video in c.videos]
We are also extracting the titles as they might come in handy when we are visualizing the topics later on.
When we have our URLs, we can start downloading the videos and extracting the transcripts. To create those transcripts, we make use of the recently released Whisper.
The model can be quite daunting for new users but it is essentially a sequence-to-sequence Transformer model which has been trained on several different speech-processing tasks. These tasks are fed into the encoder-decoder structure of the Transformer model which allows Whisper to replace several stages of the traditional speech-processing pipeline.
In other words, because it focuses on jointly representing multiple tasks, it can learn a variety of different processing steps all in a single model!
This is great because we can now use a single model to do all of the processing necessary. Below, we will import our Whisper model:
# Just two lines of code to load in a Whisper model!
import whisper
whisper_model = whisper.load_model("tiny")
Then, we iterate over our YouTube URLs, download the audio, and finally pass them through our Whisper model in order to generate the transcriptions:
# Infer all texts
texts = []
for url in video_urls[:100]:
    path = YouTube(url).streams.filter(only_audio=True)[0].download(filename="audio.mp4")
    transcription = whisper_model.transcribe(path)
    texts.append(transcription["text"])
And that is it! We now have transcriptions from 100 videos of Kurzgesagt.
NOTE: I opted for the tiny model due to its speed, but there are more accurate models in Whisper that are worth checking out.
BERTopic approaches topic modeling as a clustering task and as a result, assigns a single document to a single topic. To circumvent this, we can split our transcripts into sentences and run BERTopic on those:
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # the sentence tokenizer needs the punkt models

# Sentencize the transcripts and track their titles
docs = []
titles = []
for text, title in zip(texts, video_titles):
    sentences = sent_tokenize(text)
    docs.extend(sentences)
    titles.extend([title] * len(sentences))
Not only do we then have more data to train on, but we can also create more accurate topic representations.
NOTE: There might or might not be a feature for topic distributions coming up in BERTopic…
BERTopic is a topic modeling technique that focuses on modularity, transparency, and human evaluation. It is a framework that allows users to, within certain boundaries, build their own custom topic model.
BERTopic works by following a linear pipeline of clustering and topic extraction:
At each step of the pipeline, it makes few assumptions about all steps that came before that. For example, the c-TF-IDF representation does not care which input embeddings are used. This guiding philosophy of BERTopic allows for the sub-components to easily be swapped out. As a result, you can build your model however you like:
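As a small illustration of that modularity (a sketch, assuming a recent BERTopic version), you could for instance swap HDBSCAN for k-Means while keeping the rest of the pipeline untouched:
from sklearn.cluster import KMeans
from bertopic import BERTopic

# Any clustering model that implements .fit and .predict can take HDBSCAN's place
cluster_model = KMeans(n_clusters=50)
topic_model = BERTopic(hdbscan_model=cluster_model)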
Although we can use BERTopic in just a few lines, it is worthwhile to generate our embeddings up front such that we can reuse them multiple times later on without the need to regenerate them:
from sentence_transformers import SentenceTransformer
# Create embeddings from the documents
sentence_model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
embeddings = sentence_model.encode(docs)
Although the content of Kurzgesagt is in English, there might be some non-English terms out there, so I opted for a multilingual sentence-transformer model.
After having generated our embeddings, I wanted to tweak the sub-models slightly in order to best fit with our data:
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
# Define sub-models
vectorizer = CountVectorizer(stop_words="english")
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=20, min_samples=2, metric='euclidean', cluster_selection_method='eom')
# Train our topic model with BERTopic
topic_model = BERTopic(
    embedding_model=sentence_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer
).fit(docs, embeddings)
Now that we have fitted our BERTopic model, let us take a look at some of its topics. To do so, we run topic_model.get_topic_info().head(10) to get a dataframe of the most frequent topics:
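For reference, that call looks as follows; the topics are sorted by frequency and topic -1 collects the outlier sentences.
# Show the ten most frequent topics
freq = topic_model.get_topic_info()
freq.head(10)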
We can see topics about food, cells, the galaxy, and many more!
Although the model found some interesting topics, it seems like a lot of work to go through them all by hand. Instead, we can use a number of visualization techniques that make it a bit easier.
First, it might be worthwhile to generate some nicer-looking labels. To do so, we are going to generate our topic labels with generate_topic_labels.
We want the top 3 words per topic, separated by commas, and we are not so much interested in a topic prefix.
# Generate nicer looking labels and set them in our model
topic_labels = topic_model.generate_topic_labels(nr_words=3,
                                                 topic_prefix=False,
                                                 word_length=15,
                                                 separator=", ")
topic_model.set_topic_labels(topic_labels)
Now, we are ready to perform some interesting visualizations. First off, .visualize_documents! This method aims to visualize the documents and their corresponding topics interactively in a 2D space:
# Manually selected some interesting topics to prevent information overload
topics_of_interest = [33, 1, 8, 9, 0, 30, 27, 19, 16,
                      28, 44, 11, 21, 23, 26, 2, 37, 34, 3, 4, 5,
                      15, 17, 22, 38]

# I added the title to the documents themselves for easier interactivity
adjusted_docs = ["<b>" + title + "</b><br>" + doc[:100] + "..."
                 for doc, title in zip(docs, titles)]

# Visualize documents
topic_model.visualize_documents(
    adjusted_docs,
    embeddings=embeddings,
    hide_annotations=False,
    topics=topics_of_interest,
    custom_labels=True
)
As can be seen in the visualization above, we have a number of very different topics, ranging from dinosaurs and climate change to bacteria and even ants!
Since we have split each video up into sentences, we can model the distribution of topics per video. I recently saw a video called “What Happens if a Supervolcano Blows Up?”, so let’s see which topics can be found in that video:
import pandas as pd

# Topic frequency in "What Happens if a Supervolcano Blows Up?"
video_topics = [topic_model.custom_labels_[topic + 1]
                for topic, title in zip(topic_model.topics_, titles)
                if title == "What Happens if a Supervolcano Blows Up?"
                and topic != -1]
counts = pd.DataFrame({"Topic": video_topics}).value_counts()
counts
As expected, it seems to be mostly related to a topic about volcanic eruptions but also explosions in general.
In the upcoming BERTopic v0.13 release, there is the possibility to approximate the topic distributions for any document regardless of its size.
The method works by creating a sliding window over the document and calculating each window’s similarity to every topic:
We can generate these distributions for all of our documents by running the following and making sure that we calculate the distributions on a token level:
# We need to calculate the topic distributions on a token level
topic_distr, topic_token_distr = topic_model.approximate_distribution(
    docs, calculate_tokens=True
)
Now we need to choose a piece of text over which to model the topics. For that, I thought it would be interesting to explore how the model handles the Brilliant advertisement at the end of Kurzgesagt’s videos:
And with free trial of brilliant premium you can explore everything brilliant has to offer.
We input that document and run our visualization:
# Create a visualization using a styled dataframe if Jinja2 is installed
df = topic_model.visualize_approximate_distribution(docs[100], topic_token_distr[100])
df
As we can see, it seems to pick up topics about Brilliant and memberships, which seems to make sense in this case.
Interestingly, with this approach, we can take into account that there are not only multiple topics per document but even multiple topics per token!
As a result, we data scientists use this freely available software that is driving so many technologies whilst still having the opportunity to be involved in its development.
Over the last few years, I was fortunate enough to be involved in open-source and had the opportunity to develop and maintain several packages!
Developing open-source is more than just coding
During this time, there were plenty of hurdles to overcome and lessons to be learned. From tricky dependencies and API design choices to communication with the user base.
Working on open-source, whether as an author, maintainer or developer, can be quite daunting! With this article, I share some of my experiences in this field which hopefully helps those wanting to develop open-source.
When you create open-source software, you are typically not making the package exclusively for yourself. Users, from all types of different backgrounds, will be making use of your software. Proper documentation comes a long way in helping those users get started.
Moreover, do not underestimate the impact documentation can have on the usability of your package! You can use it to explain complex algorithms, give extensive tutorials, show use cases, and even allow for interactive examples.
Especially data science-related software can be difficult to understand when it involves complex algorithms. Approaching these explanations like a story has often helped me in making them more intuitive.
Trust me, writing good documentation is a skill in itself.
Another benefit is that writing solid documentation lowers the time spent on issues. There is less reason for users to ask questions if they can find the answers in your documentation.
An overview of how KeyBERT works is found in the documentation.
However, creating documentation is more than just writing it. Visualizing your algorithm or software goes a long way in making it intuitive. You can learn quite a lot from Jay Alammar when you want to visualize algorithmic principles in your documentation. His visualizations even ended up in the official Numpy documentation!
Your user base, the community, is an important component of your software. Since we are developing open-source, it is safe to say that we want them to be involved in the development.
By engaging with the community you entice them to share issues and bugs, but also feature requests and great ideas for further development! All of these help in creating something for them.
The open-source community is truly more than the sum of its parts
Many core features in BERTopic, like online topic modeling, have been implemented since they were highly requested by its users. As a result, the community is quite active and has been a tremendous help in detecting issues and developing new features.
Implementing feature requests by the community goes a long way! An excerpt of the discussion here.
Whether your package will be used millions of times or just a few, creating one is an excellent opportunity to learn more about open-source, MLOps, unit testing, API design, etc. I have learned more about these skills in developing open-source than I would have in my day-to-day job.
There is also a huge learning opportunity in interacting with the community itself. They are the ones that tell you which designs they like and which they do not. At times, I have seen the same issue popping up several times over the course of a few months. This indicates that I should rethink the design, as it was not as user-friendly as I had anticipated!
On top of that, developing open-source projects has given me the opportunity to collaborate with other developers.
Working on your own open-source projects outside of work does come with its disadvantages. To me, the most significant one is that maintaining the package, answering questions, and participating in the discussions can be quite a lot of work.
It definitely helps if you are intrinsically motivated but it still takes quite some time to make sure everything is held together.
Fortunately, you can look towards your community to help you out when answering questions, showcasing use cases, etc.
Over the course of the last few years, I have learned to be a bit more relaxed when it comes to breaking changes. Especially when it concerns dependencies, sometimes there is only so much you can do!
Knowing how often your package is used is a tremendous help in understanding how popular it is. However, many are still using Github stars to equate a package with quality and popularity.
Make sure to define the right metric. GitHub stars can be exaggerated simply through good marketing, and many stars do not necessarily imply quality. As data scientists, we must first understand what it is that we are exactly measuring. GitHub stars are nothing more than a user giving a star to a package. It does not even mean that they have used the software or that it actually works!
The number of downloads for KeyBERT. A much better indicator than Github stars.
Technically, I can pay a thousand people to star my repos. Instead, I focus on a variety of statistics, like downloads and forks, but also the number of issues I get on a daily basis.
For example, it is great if your package gets featured on Hacker News, but that does not tell you whether it is consistently used.
As a psychologist, I tend to focus a lot on the design of my packages. This includes things like documentation and tutorials but it even translates to how I code.
Making sure that the package is easy to use and install makes adoption much simpler. Especially when you focus on design philosophies such as modularity and transparency, some packages become a blast to use.
The modular design of topic modeling with BERTopic.
Taking the perspective of a psychologist whilst developing new features has made it much easier to know what to focus on. What are users looking for? How can I code in a way that explains the algorithm? Why are users actually using this package? What are the major disadvantages of my code?
Taking the time to understand the average user drives adoption
All of the above often leads to a basic but important rule: Keep It Super Simple.
Personally, if I find a new package difficult to install and use, I am less likely to adopt it in my workflow.