To further improve LLMs, new architectures are developed that might even outperform the Transformer architecture. One of these methods is Mamba, a State Space Model.
Mamba was proposed in the paper Mamba: Linear-Time Sequence Modeling with Selective State Spaces. You can find its official implementation and model checkpoints in its repository.
In this post, I will introduce the field of State Space Models in the context of language modeling and explore concepts one by one to develop an intuition about the field. Then, we will cover how Mamba might challenge the Transformer architecture.
As a visual guide, expect many visualizations to develop an intuition about Mamba and State Space Models!
To illustrate why Mamba is such an interesting architecture, let’s do a short recap of Transformers first and explore one of their disadvantages.
A Transformer sees any textual input as a sequence that consists of tokens.
A major benefit of Transformers is that whatever input it receives, it can look back at any of the earlier tokens in the sequence to derive its representation.
Remember that a Transformer consists of two structures, a set of encoder blocks for representing text and a set of decoder blocks for generating text. Together, these structures can be used for several tasks, including translation.
We can adopt this structure to create generative models by using only decoders. This Transformer-based model, Generative Pre-trained Transformers (GPT), uses decoder blocks to complete some input text.
Let’s take a look at how that works!
A single decoder block consists of two main components, masked self-attention followed by a feed-forward neural network.
Self-attention is a major reason why these models work so well. It enables an uncompressed view of the entire sequence with fast training.
So how does it work?
It creates a matrix comparing each token with every token that came before. The weights in the matrix are determined by how relevant the token pairs are to one another.
During training, this matrix is created in one go. The attention between “My” and “name” does not need to be calculated first before we calculate the attention between “name” and “is”.
It enables parallelization, which speeds up training tremendously!
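To make this concrete, here is a minimal numpy sketch of a single masked self-attention head; the toy dimensions and the weight matrices W_q, W_k, and W_v are hypothetical placeholders, not an actual Transformer implementation:

import numpy as np

def masked_self_attention(X, W_q, W_k, W_v):
    # Single-head masked self-attention over token embeddings X of shape (seq_len, d)
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # compare every pair of tokens
    mask = np.triu(np.ones_like(scores), k=1)         # hide future tokens (decoder-style)
    scores = np.where(mask == 1, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                                # weighted sum of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))                           # three tokens, e.g. "My", "name", "is"
out = masked_self_attention(X, *(rng.normal(size=(4, 4)) for _ in range(3)))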
There is a flaw, however. When generating the next token, we need to re-calculate the attention for the entire sequence, even if we already generated some tokens.
Generating tokens for a sequence of length L needs roughly L² computations, which can be costly if the sequence length increases.
This need to recalculate the entire sequence is a major bottleneck of the Transformer architecture.
Let’s look at how a “classic” technique, Recurrent Neural Networks, solves this problem of slow inference.
A Recurrent Neural Network (RNN) is a sequence-based network. At each time step in a sequence it takes two inputs, namely the input at time step t and the hidden state of the previous time step t-1, to generate the next hidden state and predict the output.
RNNs have a looping mechanism that allows them to pass information from a previous step to the next. We can “unfold” this visualization to make it more explicit.
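As a rough sketch of that loop (an Elman-style RNN with hypothetical weight matrices, not any specific implementation):

import numpy as np

rng = np.random.default_rng(0)
hidden_size, embed_size = 8, 4
W_h = rng.normal(size=(hidden_size, hidden_size))   # previous hidden state -> new state
W_x = rng.normal(size=(hidden_size, embed_size))    # current input -> new state
W_y = rng.normal(size=(embed_size, hidden_size))    # new state -> output prediction

inputs = rng.normal(size=(5, embed_size))           # five token embeddings
h = np.zeros(hidden_size)
for x in inputs:                                    # one step per token, strictly sequential
    h = np.tanh(W_h @ h + W_x @ x)                  # depends only on h_{t-1} and x_t
    y = W_y @ h                                     # output for this time step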
When generating the output, the RNN only needs to consider the previous hidden state and current input. It prevents recalculating all previous hidden states which is what a Transformer would do.
In other words, RNNs can do inference fast as it scales linearly with the sequence length! In theory, it can even have an infinite context length.
To illustrate, let’s apply the RNN to the input text we have used before.
Each hidden state is the aggregation of all previous hidden states and is typically a compressed view.
There is a problem, however…
Notice that the last hidden state, when producing the name “Maarten”, no longer contains information about the word “Hello”. RNNs tend to forget information over time since they only consider one previous state.
This sequential nature of RNNs creates another problem. Training cannot be done in parallel since it needs to go through each step at a time sequentially.
The problem with RNNs, compared to Transformers, is completely the opposite! Their inference is incredibly fast, but their training cannot be parallelized.
Can we somehow find an architecture that does parallelize training like Transformers whilst still performing inference that scales linearly with sequence length?
Yes! This is what Mamba offers but before diving into its architecture, let’s explore the world of State Space Models first.
A State Space Model (SSM), like the Transformer and RNN, processes sequences of information, like text but also signals. In this section, we will go through the basics of SSMs and how they relate to textual data.
A State Space contains the minimum number of variables that fully describe a system. It is a way to mathematically represent a problem by defining a system’s possible states.
Let’s simplify this a bit. Imagine we are navigating through a maze. The “state space” is the map of all possible locations (states). Each point represents a unique position in the maze with specific details, like how far you are from the exit.
The “state space representation” is a simplified description of this map. It shows where you are (current state), where you can go next (possible future states), and what changes take you to the next state (going right or left).
Although State Space Models use equations and matrices to track this behavior, it is simply a way to track where you are, where you can go, and how you can get there.
The variables that describe a state, in our example the X and Y coordinates, as well as the distance to the exit, can be represented as “state vectors”.
Sounds familiar? That is because embeddings or vectors in language models are also frequently used to describe the “state” of an input sequence. For instance, a vector of your current position (state vector) could look a bit like this:
In terms of neural networks, the “state” of a system is typically its hidden state and in the context of Large Language Models, one of the most important aspects of generating a new token.
SSMs are models used to describe these state representations and make predictions of what their next state could be depending on some input.
Traditionally, at time t, SSMs:
map an input sequence x(t) — (e.g., moved left and down in the maze)
to a latent state representation h(t) — (e.g., distance to exit and x/y coordinates)
and derive a predicted output sequence y(t) — (e.g., move left again to reach the exit sooner)
However, instead of using discrete sequences (like moving left once) it takes as input a continuous sequence and predicts the output sequence.
SSMs assume that dynamic systems, such as an object moving in 3D space, can be predicted from its state at time t through two equations.
By solving these equations, we assume that we can uncover the statistical principles to predict the state of a system based on observed data (input sequence and previous state).
Its goal is to find this state representation h(t) such that we can go from an input to an output sequence.
These two equations are the core of the State Space Model.
The two equations will be referenced throughout this guide. To make them a bit more intuitive, they are color-coded so you can quickly reference them.
The state equation describes how the state changes (through matrix A) based on how the input influences the state (through matrix B).
As we saw before, h(t) refers to our latent state representation at any given time t, and x(t) refers to some input.
The output equation describes how the state is translated to the output (through matrix C) and how the input influences the output (through matrix D).
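In the standard SSM notation, and consistent with the descriptions above, the two equations read:

h′(t) = Ah(t) + Bx(t) (the state equation)
y(t) = Ch(t) + Dx(t) (the output equation)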
NOTE: Matrices A, B, C, and D are also commonly referred to as parameters since they are learnable.
Visualizing these two equations gives us the following architecture:
Let’s go through the general technique step-by-step to understand how these matrices influence the learning process.
Assume we have some input signal x(t), this signal first gets multiplied by matrix B which describes how the inputs influence the system.
The updated state (akin to the hidden state of a neural network) is a latent space that contains the core “knowledge” of the environment. We multiply the state with matrix A which describes how all the internal states are connected as they represent the underlying dynamics of the system.
As you might have noticed, matrix A is applied before creating the state representations and is updated after the state representation has been updated.
Then, we use matrix C to describe how the state can be translated to an output.
Finally, we can make use of matrix D to provide a direct signal from the input to the output. This is also often referred to as a skip-connection.
Since matrix D is similar to a skip-connection, the SSM is often regarded as the following without the skip-connection.
Going back to our simplified perspective, we can now focus on matrices A, B, and C as the core of the SSM.
We can update the original equations (and add some pretty colors) to signify the purpose of each matrix as we did before.
Together, these two equations aim to predict the state of a system from observed data. Since the input is expected to be continuous, the main representation of the SSM is a continuous-time representation.
Finding the state representation h(t) is analytically challenging if you have a continuous signal. Moreover, since we generally have a discrete input (like a textual sequence), we want to discretize the model.
To do so, we make use of the Zero-order hold technique. It works as follows. First, every time we receive a discrete signal, we hold its value until we receive a new discrete signal. This process creates a continuous signal the SSM can use:
How long we hold the value is represented by a new learnable parameter, called the step size ∆. It represents the resolution of the input.
Now that we have a continuous signal for our input, we can generate a continuous output and only sample the values according to the time steps of the input.
These sampled values are our discretized output!
Mathematically, we can apply the Zero-order hold as follows:
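As written in the S4/Mamba papers, the zero-order hold turns the continuous parameters (∆, A, B) into their discretized counterparts:

Ā = exp(∆A)
B̄ = (∆A)⁻¹(exp(∆A) − I) · ∆B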
Together, they allow us to go from a continuous SSM to a discrete SSM, represented by a formulation that, instead of function-to-function, x(t) → y(t), is now sequence-to-sequence, xₖ → yₖ:
Here, matrices A and B now represent discretized parameters of the model.
We use k instead of t to represent discretized timesteps and to make it a bit more clear when we refer to a continuous versus a discrete SSM.
NOTE: We still store the continuous form of matrix A and not the discretized version. During training, the continuous representation is discretized.
Now that we have a formulation of a discrete representation, let’s explore how we can actually compute the model.
Our discretized SSM allows us to formulate the problem in specific timesteps instead of continuous signals. A recurrent approach, as we saw before with RNNs is quite useful here.
If we consider discrete timesteps instead of a continuous signal, we can reformulate the problem with timesteps:
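With the discretized matrices (written here with a bar to distinguish them from their continuous counterparts), the two equations become a simple recurrence:

hₖ = Āhₖ₋₁ + B̄xₖ
yₖ = Chₖ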
At each timestep, we calculate how the current input (Bxₖ) influences the previous state (Ahₖ₋₁) and then calculate the predicted output (Chₖ).
This representation might already seem a bit familiar! We can approach it the same way we did with the RNN as we saw before.
Which we can unfold (or unroll) as such:
Notice how we can use this discretized version using the underlying methodology of an RNN.
This technique gives us both the advantages and disadvantages of an RNN, namely fast inference and slow training.
Another representation that we can use for SSMs is that of convolutions. Remember from classic image recognition tasks where we applied filters (kernels) to derive aggregate features:
Since we are dealing with text and not images, we need a 1-dimensional perspective instead:
The kernel that we use to represent this “filter” is derived from the SSM formulation:
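Writing out the recurrence, a known result from the S4 paper is that the kernel collects the contribution of the inputs from 0, 1, 2, … steps back:

K̄ = (CB̄, CĀB̄, CĀ²B̄, …)
y = x ∗ K̄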
Let’s explore how this kernel works in practice. Like convolution, we can use our SSM kernel to go over each set of tokens and calculate the output:
This also illustrates the effect padding might have on the output. I changed the order of padding to improve the visualization but we often apply it at the end of a sentence.
In the next step, the kernel is moved once over to perform the next step in the calculation:
In the final step, we can see the full effect of the kernel:
A major benefit of representing the SSM as a convolution is that it can be trained in parallel like Convolutional Neural Networks (CNNs). However, due to the fixed kernel size, their inference is not as fast and unbounded as RNNs.
These three representations, continuous, recurrent, and convolutional all have different sets of advantages and disadvantages:
Interestingly, we now have efficient inference with the recurrent SSM and parallelizable training with the convolutional SSM.
With these representations, there is a neat trick that we can use, namely choose a representation depending on the task. During training, we use the convolutional representation which can be parallelized and during inference, we use the efficient recurrent representation:
This model is referred to as the Linear State-Space Layer (LSSL).
These representations share an important property, namely that of Linear Time Invariance (LTI). LTI states that the SSM’s parameters, A, B, and C, are fixed for all timesteps. This means that matrices A, B, and C are the same for every token the SSM generates.
In other words, regardless of what sequence you give the SSM, the values of A, B, and C remain the same. We have a static representation that is not content-aware.
Before we explore how Mamba addresses this issue, let’s explore the final piece of the puzzle, matrix A.
Arguably one of the most important aspects of the SSM formulation is matrix A. As we saw before with the recurrent representation, it captures information about the previous state to build the new state.
In essence, matrix A produces the hidden state:
Creating matrix A can therefore be the difference between remembering only a few previous tokens and capturing every token we have seen thus far, especially in the context of the recurrent representation, since it only looks back at the previous state.
So how can we create matrix A in a way that retains a large memory (context size)?
We use Hungry Hungry Hippo! Or HiPPO for High-order Polynomial Projection Operators. HiPPO attempts to compress all input signals it has seen thus far into a vector of coefficients.
It uses matrix A to build a state representation that captures recent tokens well and decays older tokens. Its formula can be represented as follows:
Assuming we have a square matrix A, this gives us:
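For reference, the HiPPO (LegS) matrix that is commonly used to initialize A in S4 and Mamba is usually written as follows (sign conventions differ slightly between papers):

Aₙₖ = −√(2n+1) · √(2k+1) if n > k
Aₙₖ = −(n+1) if n = k
Aₙₖ = 0 if n < k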
Building matrix A using HiPPO was shown to be much better than initializing it as a random matrix. As a result, it more accurately reconstructs newer signals (recent tokens) compared to older signals (initial tokens).
The idea behind the HiPPO Matrix is that it produces a hidden state that memorizes its history.
Mathematically, it does so by tracking the coefficients of a Legendre polynomial which allows it to approximate all of the previous history.
HiPPO was then applied to the recurrent and convolution representations that we saw before to handle long-range dependencies. The result was Structured State Space for Sequences (S4), a class of SSMs that can efficiently handle long sequences.
It consists of three parts:
Structured State Space Models
HiPPO for handling long-range dependencies
Discretization for creating recurrent and convolutional representations
This class of SSMs has several benefits depending on the representation you choose (recurrent vs. convolution). It can also handle long sequences of text and store memory efficiently by building upon the HiPPO matrix.
NOTE: If you want to dive into more of the technical details on how to calculate the HiPPO matrix and build a S4 model yourself, I would HIGHLY advise going through the Annotated S4.
We finally have covered all the fundamentals necessary to understand what makes Mamba special. State Space Models can be used to model textual sequences but still have a set of disadvantages we want to prevent.
In this section, we will go through Mamba’s two main contributions:
A selective scan algorithm, which allows the model to filter (ir)relevant information
A hardware-aware algorithm that allows for efficient storage of (intermediate) results through parallel scan, kernel fusion, and recomputation
Together they create the selective SSM or S6 models which can be used, like self-attention, to create Mamba blocks.
Before exploring the two main contributions, let’s first explore why they are necessary.
State Space Models, and even the S4 (Structured State Space Model), perform poorly on certain tasks that are vital in language modeling and generation, namely the ability to focus on or ignore particular inputs.
We can illustrate this with two synthetic tasks, namely selective copying and induction heads.
In the selective copying task, the goal of the SSM is to copy parts of the input and output them in order:
However, a (recurrent/convolutional) SSM performs poorly in this task since it is Linear Time Invariant. As we saw before, the matrices A, B, and C are the same for every token the SSM generates.
As a result, an SSM cannot perform content-aware reasoning since it treats each token equally as a result of the fixed A, B, and C matrices. This is a problem as we want the SSM to reason about the input (prompt).
The second task an SSM performs poorly on is induction heads where the goal is to reproduce patterns found in the input:
In the above example, we are essentially performing one-shot prompting where we attempt to “teach” the model to provide an “A:” response after every “Q:”. However, since an SSM is time-invariant, it cannot select which previous tokens to recall from its history.
Let’s illustrate this by focusing on matrix B. Regardless of what the input x is, matrix B remains exactly the same and is therefore independent of x:
Likewise, A and C also remain fixed regardless of the input. This demonstrates the static nature of the SSMs we have seen thus far.
In comparison, these tasks are relatively easy for Transformers since they dynamically change their attention based on the input sequence. They can selectively “look” or “attend” at different parts of the sequence.
The poor performance of SSMs on these tasks illustrates the underlying problem with time-invariant SSMs, the static nature of matrices A, B, and C results in problems with content-awareness.
The recurrent representation of an SSM creates a small state that is quite efficient as it compresses the entire history. However, compared to a Transformer model which does no compression of the history (through the attention matrix), it is much less powerful.
Mamba aims to have the best of both worlds. A small state that is as powerful as the state of a Transformer:
As teased above, it does so by compressing data selectively into the state. When you have an input sentence, there is often information, like stop words, that does not have much meaning.
To selectively compress information, we need the parameters to be dependent on the input. To do so, let’s first explore the dimensions of the input and output in an SSM during training:
In a Structured State Space Model (S4), the matrices A, B, and C are independent of the input since their dimensions N and D are static and do not change.
Instead, Mamba makes matrices B and C, and even the step size ∆, dependent on the input by incorporating the sequence length and batch size of the input:
This means that for every input token, we now have different B and C matrices which solves the problem with content-awareness!
NOTE: Matrix A remains the same since we want the state itself to remain static, but the way it is influenced (through B and C) to be dynamic.
Together, they selectively choose what to keep in the hidden state and what to ignore since they are now dependent on the input.
A smaller step size ∆ results in ignoring specific words and instead using the previous context more whilst a larger step size ∆ focuses on the input words more than the context:
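As a rough PyTorch-style sketch of what making B, C, and ∆ input-dependent could look like (the module, names, and shapes below are hypothetical simplifications, not the official implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveParams(nn.Module):
    # Hypothetical sketch: per-token B, C, and step size delta, with a static A
    def __init__(self, d_model: int, d_state: int):
        super().__init__()
        self.to_B = nn.Linear(d_model, d_state)   # B now depends on the current token
        self.to_C = nn.Linear(d_model, d_state)   # C now depends on the current token
        self.to_delta = nn.Linear(d_model, 1)     # per-token step size ∆
        self.A = nn.Parameter(torch.randn(d_state, d_state))  # A itself stays static

    def forward(self, x):                         # x: (batch, seq_len, d_model)
        B = self.to_B(x)                          # (batch, seq_len, d_state)
        C = self.to_C(x)                          # (batch, seq_len, d_state)
        delta = F.softplus(self.to_delta(x))      # positive step size per token
        return self.A, B, C, delta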
Since these matrices are now dynamic, they cannot be calculated using the convolution representation since it assumes a fixed kernel. We can only use the recurrent representation and lose the parallelization the convolution provides.
To enable parallelization, let’s explore how we compute the output with recurrency:
Each state is the sum of the previous state (multiplied by A) plus the current input (multiplied by B). This is called a scan operation and can easily be calculated with a for loop.
Parallelization, in contrast, seems impossible since each state can only be calculated if we have the previous state. Mamba, however, makes this possible through the [parallel scan](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda) algorithm.
It assumes the order in which we do operations does not matter through the associative property. As a result, we can calculate the sequence in parts and iteratively combine them:
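A minimal sketch of both views, assuming a scalar input per step and dense matrices for readability (the real implementation works on batched tensors with fused CUDA kernels):

import numpy as np

def sequential_scan(A_bars, B_bars, xs, C):
    # Naive recurrent scan: h_k = A_k @ h_{k-1} + B_k * x_k, y_k = C @ h_k
    h = np.zeros(A_bars.shape[-1])
    ys = []
    for A_k, B_k, x_k in zip(A_bars, B_bars, xs):
        h = A_k @ h + B_k * x_k           # one state update per token
        ys.append(C @ h)
    return np.array(ys)

def combine(left, right):
    # Associative combine of two steps (A1, b1) and (A2, b2) into a single step.
    # Because this operator is associative, steps can be combined in parallel
    # (prefix-sum style) rather than strictly one after the other.
    A1, b1 = left
    A2, b2 = right
    return A2 @ A1, A2 @ b1 + b2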
Together, dynamic matrices B and C, and the parallel scan algorithm create the selective scan algorithm to represent the dynamic and fast nature of using the recurrent representation.
A disadvantage of recent GPUs is their limited transfer (IO) speed between their small but highly efficient SRAM and their large but slightly less efficient DRAM. Frequently copying information between SRAM and DRAM becomes a bottleneck.
Mamba, like Flash Attention, attempts to limit the number of times we need to go from DRAM to SRAM and vice versa. It does so through kernel fusion which allows the model to prevent writing intermediate results and continuously performing computations until it is done.
We can view the specific instances of DRAM and SRAM allocation by visualizing Mamba’s base architecture:
Here, the following are fused into one kernel:
The discretization step with step size ∆
The selective scan algorithm
The multiplication with C
The last piece of the hardware-aware algorithm is recomputation.
The intermediate states are not saved but are necessary for the backward pass to compute the gradients. Instead, the authors recompute those intermediate states during the backward pass.
Although this might seem inefficient, it is much less costly than reading all those intermediate states from the relatively slow DRAM.
We have now covered all components of its architecture which is depicted using the following image from its article:
This architecture is often referred to as a selective SSM or S6 model since it is essentially an S4 model computed with the selective scan algorithm.
The selective SSM that we have explored thus far can be implemented as a block, the same way we can represent self-attention in a decoder block.
Like the decoder, we can stack multiple Mamba blocks and use their output as the input for the next Mamba block:
It starts with a linear projection to expand upon the input embeddings. Then, a convolution is applied before the Selective SSM to prevent independent token calculations.
The Selective SSM has the following properties:
A recurrent SSM created through discretization
HiPPO initialization of matrix A to capture long-range dependencies
A selective scan algorithm to selectively compress information
A hardware-aware algorithm to speed up computation
We can expand on this architecture a bit more when looking at the code implementation and explore how an end-to-end example would look like:
Notice some changes, like the inclusion of normalization layers and softmax for choosing the output token.
When we put everything together, we get both fast inference and training and even unbounded context!
Using this architecture, the authors found it matches and sometimes even exceeds the performance of Transformer models of the same size!
Hopefully, this was an accessible introduction to Mamba and State Space Models. If you want to go deeper, I would suggest the following resources:
My ambition for BERTopic is to make it the one-stop shop for topic modeling by allowing for significant flexibility and modularity.
That has been the goal for the last few years and with the release of v0.16, I believe we are a BIG step closer to achieving that.
First, let’s take a small step back. What is BERTopic?
Well, BERTopic is a topic modeling framework that allows users to essentially create their version of a topic model. With many variations of topic modeling implemented, the idea is that it should support almost any use case.
With v0.16, several features were implemented that I believe will take BERTopic to the next level, namely:
Zero-Shot Topic Modeling
Model Merging
More Large Language Model (LLM) Support
In this tutorial, we will go through what these features are and for which use cases they could be helpful.
To start with, you can install BERTopic (with HF datasets) as follows:
pip install bertopic datasets
You can also follow along with the Google Colab Notebook to make sure everything works as intended.
UPDATE: I uploaded a video version to YouTube that goes more in-depth into how to use these new features:
Zero-shot techniques generally refer to having no examples to train your data on. Although you know the target, it is not assigned to your data.
In BERTopic, we use Zero-shot Topic Modeling to find pre-defined topics in large amounts of documents.
Imagine you have ArXiv abstracts about Machine Learning and you know that the topic “Large Language Models” is in there. With Zero-shot Topic Modeling, you can ask BERTopic to find all documents related to “Large Language Models”.
In essence, it is nothing more than semantic search! But… there is a neat trick ;-)
When you try to find the documents related to “Large Language Models”, many documents will be left that are not about that topic. So, what do you do with those remaining documents? You use BERTopic to find all the topics that were left!
As a result, you will have three scenarios of Zero-shot Topic Modeling:
No zero-shot topics were detected. This means that none of the documents would fit with the predefined topics and a regular BERTopic would be run.
Only zero-shot topics were detected. Here, we would not need to find additional topics since all original documents were assigned to one of the predefined topics.
Both zero-shot topics and clustered topics were detected. This means that some documents would fit with the predefined topics whereas others would not. For the latter, new topics were found.
Using Zero-shot BERTopic is straightforward:
from datasets import load_dataset
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
# We select a subsample of 5000 abstracts from ArXiv
dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
docs = dataset["abstract"][:5_000]
# We define a number of topics that we know are in the documents
zeroshot_topic_list = ["Clustering", "Topic Modeling", "Large Language Models"]
# We fit our model using the zero-shot topics
# and we define a minimum similarity. For each document,
# if the similarity does not exceed that value, it will be used
# for clustering instead.
topic_model = BERTopic(
embedding_model="thenlper/gte-small",
min_topic_size=15,
zeroshot_topic_list=zeroshot_topic_list,
zeroshot_min_similarity=.85,
representation_model=KeyBERTInspired()
)
topics, probs = topic_model.fit_transform(docs)
We can view the three pre-defined topics along with several newly discovered topics:
topic_model.get_topic_info()
Note that although we have pre-defined names for the topics, we still let BERTopic generate additional representations for them.
This gives exciting new insight into pre-defined topics!
So… when do you use Zero-shot Topic Modeling?
If you already know some of the topics in your data, this is a great solution for finding them! Since it can discover both pre-defined and new topics, it is an incredibly flexible technique.
This is a fun new feature, model merging!
Model merging refers to BERTopic’s capability to combine multiple pre-trained BERTopic models to create one large topic model. It explores which topics should be merged and which should remain separate.
It works as follows. When we pass a list of models to this new feature, .merge_models, the first model in the list is chosen as the baseline. This baseline is used to check whether all other models contain new topics based on the similarity between their topic embeddings.
Dissimilar topics are added to the baseline model whereas similar topics are assigned to the topic of the baseline. This means that all models need to use the same underlying embedding model.
Merging pre-trained BERTopic models is straightforward and only requires a few lines of code:
from bertopic import BERTopic
# Merge 3 pre-trained BERTopic models
merged_model = BERTopic.merge_models(
[topic_model_1, topic_model_2, topic_model_3]
)
And that is it! With a single function, .merge_models, you can merge pre-trained BERTopic models.
The benefit of merging pre-trained models is that it allows for a variety of creative and useful use cases. For instance, we could use it for:
Incremental Learning — We can continuously discover new topics by iteratively merging models. This can be used for issue tickets to quickly uncover pressing bugs/issues.
Batched Learning — Compute and memory problems can arise with large datasets or when you simply do not have the hardware for it. By splitting the training process up into smaller models, we can get similar performance whilst reducing the necessary compute.
Federated Learning — Merging models allow for the training to be distributed among different clients who do not wish to share their data. This increases privacy and security with respect to their data especially if a non-keyword-based method is used for generating the representations, such as using a Large Language Model.
Federated Learning is rather straightforward: simply run .merge_models on your central server.
The other two, incremental and batched learning, might require a bit of an example!
To perform both incremental and batched learning, we are going to mimic a typical .partial_fit pipeline. Here, we will train a base model first and then iteratively add a small newly trained model.
In each iteration, we can check any topics that were added to the base model:
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from datasets import load_dataset
# Prepare documents
all_docs = load_dataset("CShorten/ML-ArXiv-Papers")["train"]["abstract"][:20_000]
doc_chunks = [all_docs[i:i+5000] for i in range(0, len(all_docs), 5000)]
# Base Model
representation_model = KeyBERTInspired()
base_model = BERTopic(representation_model=representation_model, min_topic_size=15).fit(doc_chunks[0])
# Iteratively add small and newly trained models
for docs in doc_chunks[1:]:
    new_model = BERTopic(representation_model=representation_model, min_topic_size=15).fit(docs)
    updated_model = BERTopic.merge_models([base_model, new_model])

    # Let's print the newly discovered topics
    nr_new_topics = len(set(updated_model.topics_)) - len(set(base_model.topics_))
    new_topics = list(updated_model.topic_labels_.values())[-nr_new_topics:]
    print("The following topics are newly found:")
    print(f"{new_topics}\n")

    # Update the base model
    base_model = updated_model
To illustrate, this will give back newly found topics such as:
> The following topics are newly found:
[
‘50_forecasting_predicting_prediction_stocks’,
‘51_activity_activities_accelerometer_accelerometers’,
‘57_rnns_deepcare_neural_imputation’
]
It retains everything from the original model, including
Not only do we reduce the compute by splitting the training up into chunks, but we can monitor any new topics that were added to the model.
In practice, you can train a new model with a frequency that fits your use case. You might check for new topics monthly, weekly, or even daily if you have enough data.
Although we could use Large Language Models (LLMs) for a while now in BERTopic, the v0.16 release has several smaller additions that make working with LLMs a nicer experience!
To sum up, the following were added:
llama-cpp-python: Load any GGUF-compatible LLM with llama.cpp
Truncate documents: Use a variety of techniques to truncate documents when passing them to any LLM.
LangChain: Support for LCEL Runnables by @joshuasundance-swca
Let’s explore a short example of the first two features, llama.cpp and document truncation.
When you pass documents to any LLM module, they might exceed its token limit. Instead, we can truncate the documents passed to the LLM by defining a tokenizer and a doc_length.
The definition of a doc_length depends on the tokenizer you use. For example, a value of 100 can refer to truncating by the number of tokens or even characters.
To use this together with llama-cpp-python , let’s consider the following example. First, we install the necessary packages, prepare the environment, and download a small but capable model (Zephyr-7B):
pip install llama-cpp-python
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python
wget https://huggingface.co/TheBloke/zephyr-7B-alpha-GGUF/resolve/main/zephyr-7b-alpha.Q4_K_M.gguf
Loading a GGUF model with llama-cpp-python in BERTopic is straightforward:
from bertopic import BERTopic
from bertopic.representation import LlamaCPP
# Use llama.cpp to load in a 4-bit quantized version of Zephyr 7B Alpha
# and truncate each document to 50 words
representation_model = LlamaCPP(
"zephyr-7b-alpha.Q4_K_M.gguf",
tokenizer="whitespace",
doc_length=50
)
# Create our BERTopic model
topic_model = BERTopic(representation_model=representation_model, verbose=True)
And that is it! We created a model that truncates input documents and creates interesting topic representations without being constrained by its token limit.
Throughout the last year, we have seen the Wild West of Large Language Models (LLMs). The pace at which new technology and models were released was astounding! As a result, we have many different standards and ways of working with LLMs.
In this article, we will explore one such topic, namely loading your local LLM through several (quantization) standards. With sharding, quantization, and different saving and compression strategies, it is not easy to know which method is suitable for you.
Throughout the examples, we will use Zephyr 7B, a fine-tuned variant of Mistral 7B that was trained with Direct Preference Optimization (DPO).
🔥 TIP: After each example of loading an LLM, it is advised to restart your notebook to prevent OutOfMemory errors. Loading multiple LLMs requires significant RAM/VRAM. You can reset memory by deleting the models and resetting your cache like so:
# Delete any models previously created
del model, tokenizer, pipe
# Empty VRAM cache
import torch
torch.cuda.empty_cache()
UPDATE: I uploaded a video version to YouTube that goes more in-depth into how to use these quantization methods:
The most straightforward, and vanilla, way of loading your LLM is through 🤗 Transformers. HuggingFace has created a large suite of packages that allow us to do amazing things with LLMs!
We will start by installing HuggingFace, among others, from its main branch to support newer models:
# Latest HF transformers version for Mistral-like models
pip install git+https://github.com/huggingface/transformers.git
pip install accelerate bitsandbytes xformers
After installation, we can use the following pipeline to easily load our LLM:
from torch import bfloat16
from transformers import pipeline
# Load in your LLM without any compression tricks
pipe = pipeline(
"text-generation",
model="HuggingFaceH4/zephyr-7b-beta",
torch_dtype=bfloat16,
device_map="auto"
)
This method of loading an LLM generally does not perform any compression tricks for saving VRAM or increasing efficiency.
To generate our prompt, we first have to create the necessary template. Fortunately, this can be done automatically if the chat template is saved in the underlying tokenizer:
# We use the tokenizer's chat template to format each message
# See https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
{
"role": "system",
"content": "You are a friendly chatbot.",
},
{
"role": "user",
"content": "Tell me a funny joke about Large Language Models."
},
]
prompt = pipe.tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
The generated prompt, using the internal prompt template, is constructed like so:
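For Zephyr-7B-β, that string looks roughly as follows (the exact special tokens come from the tokenizer’s chat template):

<|system|>
You are a friendly chatbot.</s>
<|user|>
Tell me a funny joke about Large Language Models.</s>
<|assistant|>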
Then, we can start passing the prompt to the LLM to generate our answer:
outputs = pipe(
prompt,
max_new_tokens=256,
do_sample=True,
temperature=0.1,
top_p=0.95
)
print(outputs[0]["generated_text"])
This gives us the following output:
Why did the Large Language Model go to the party?
To network and expand its vocabulary!
The punchline may be a bit cheesy, but Large Language Models are all about expanding their vocabulary and networking with other models to improve their language skills. So, this joke is a perfect fit for them!
For pure inference, this method is generally the least efficient as we are loading the entire model without any compression or quantization strategies.
It is, however, a great method to start with as it allows for easy loading and using the model!
Before we go into quantization strategies, there is another trick that we can employ to reduce the necessary VRAM for loading our model. With sharding, we are essentially splitting our model up into small pieces or shards.
Each shard contains a smaller part of the model and aims to work around GPU memory limitations by distributing the model weights across different devices.
Remember when I said we did not perform any compression tricks before?
That was not entirely true…
The model that we loaded, Zephyr-7B-β, was actually already sharded for us! If you go to the model and click the “Files and versions” link, you will see that the model was split up into eight pieces.
Although we can shard a model ourselves, it is generally advised to be on the lookout for quantized models or even quantize them yourself.
Sharding is quite straightforward using the Accelerate package:
from accelerate import Accelerator
# Shard our model into pieces of at most 4GB
accelerator = Accelerator()
accelerator.save_model(
model=pipe.model,
save_directory="/content/model",
max_shard_size="4GB"
)
And that is it! Because we sharded the model into pieces of 4GB instead of 2GB, we created fewer files to load:
A Large Language Model is represented by a bunch of weights and activations. These values are generally represented by the usual 32-bit floating point (float32) datatype.
The number of bits tells you something about how many values it can represent. Float32 can represent values between 1.18e-38 and 3.4e38, quite a number of values! The lower the number of bits, the fewer values it can represent.
As you might expect, if we choose a lower bit size, then the model becomes less accurate but it also needs to represent fewer values, thereby decreasing its size and memory requirements.
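As a quick illustration of these ranges in PyTorch:

import torch

# Compare the bit width and representable range of common floating-point types
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(dtype, "| bits:", info.bits, "| max:", info.max, "| smallest normal:", info.tiny)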
Quantization refers to converting an LLM from its original Float32 representation to something smaller. However, we do not simply want to use a smaller bit variant but map a larger bit representation to a smaller bit without losing too much information.
In practice, we see this often done with a new format, named 4bit-NormalFloat (NF4). This datatype does a few special tricks in order to efficiently represent a larger bit datatype. It consists of three steps:
Normalization: The weights of the model are normalized so that we expect the weights to fall within a certain range. This allows for more efficient representation of more common values.
Quantization: The weights are quantized to 4-bit. In NF4, the quantization levels are spaced according to the quantiles of a normal distribution, matching the distribution of the normalized weights and thereby efficiently representing the original 32-bit weights.
Dequantization: Although the weights are stored in 4-bit, they are dequantized during computation which gives a performance boost during inference.
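To get a feel for this normalize → quantize → dequantize cycle, here is a toy symmetric (absmax) example; note that this is only a simplified illustration and not the actual NF4 procedure:

import numpy as np

def toy_quantize(weights, bits=4):
    # Toy symmetric quantization: normalize by the largest weight, round to integer
    # levels, then map back to float. Real NF4 uses normal-distribution-aware levels.
    levels = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit
    scale = np.max(np.abs(weights))                   # normalization step
    quantized = np.round(weights / scale * levels).astype(np.int8)
    dequantized = quantized.astype(np.float32) * scale / levels
    return quantized, dequantized

w = np.random.randn(8).astype(np.float32)
q, w_hat = toy_quantize(w)
print("max quantization error:", np.abs(w - w_hat).max())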
To perform this quantization with HuggingFace, we need to define a configuration for the quantization with Bitsandbytes:
from transformers import BitsAndBytesConfig
from torch import bfloat16
# Our 4-bit configuration to load the LLM with less GPU memory
bnb_config = BitsAndBytesConfig(
load_in_4bit=True, # 4-bit quantization
bnb_4bit_quant_type='nf4', # Normalized float 4
bnb_4bit_use_double_quant=True, # Second quantization after the first
bnb_4bit_compute_dtype=bfloat16 # Computation type
)
This configuration allows us to specify which quantization levels we are going for. Generally, we want to represent the weights with 4-bit quantization but do the inference in 16-bit.
Loading the model in a pipeline is then straightforward:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
# Zephyr with BitsAndBytes Configuration
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-alpha")
model = AutoModelForCausalLM.from_pretrained(
"HuggingFaceH4/zephyr-7b-alpha",
quantization_config=bnb_config,
device_map='auto',
)
# Create a pipeline
pipe = pipeline(model=model, tokenizer=tokenizer, task='text-generation')
Next up, we can use the same prompt as we did before:
# We will use the same prompt as we did originally
outputs = pipe(
prompt,
max_new_tokens=256,
do_sample=True,
temperature=0.7,
top_p=0.95
)
print(outputs[0]["generated_text"])
This will give us the following output:
Why did the Large Language Model go to the party?
To network and expand its vocabulary!
The punchline may be a bit cheesy, but Large Language Models are all about expanding their vocabulary and networking with other models to improve their language skills. So, this joke is a perfect fit for them!
Quantization is a powerful technique to reduce the memory requirements of a model whilst keeping performance similar. It allows for faster loading, using, and fine-tuning LLMs even with smaller GPUs.
Thus far, we have explored sharding and quantization techniques. Albeit useful techniques to have in your skillset, it seems rather wasteful to have to apply them every time you load the model.
Instead, these models have often already been sharded and quantized for us to use. TheBloke in particular is a user on HuggingFace that performs a bunch of quantizations for us to use.
At the moment of writing this, he has uploaded more than 2000 quantized models for us!
These quantized models actually come in many different shapes and sizes. Most notably, the GPTQ, GGUF, and AWQ formats are most frequently used to perform 4-bit quantization.
GPTQ is a post-training quantization (PTQ) method for 4-bit quantization that focuses primarily on GPU inference and performance.
The idea behind the method is that it will try to compress all weights to a 4-bit quantization by minimizing the mean squared error to that weight. During inference, it will dynamically dequantize its weights to float16 for improved performance whilst keeping memory low.
For a more detailed guide to the inner workings of GPTQ, definitely check out the following post: 4-bit Quantization with GPTQ
We start with installing a number of packages we need to load in GPTQ-like models in HuggingFace Transformers:
pip install optimum
pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
After doing so, we can navigate to the model that we want to load, namely “TheBloke/zephyr-7B-beta-GPTQ” and choose a specific revision.
These revisions essentially indicate the quantization method, compression level, size of the model, etc.
For now, we are sticking with the “main” branch as that is generally a nice balance between compression and accuracy:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
# Load LLM and Tokenizer
model_id = "TheBloke/zephyr-7B-beta-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map="auto",
trust_remote_code=False,
revision="main"
)
# Create a pipeline
pipe = pipeline(model=model, tokenizer=tokenizer, task='text-generation')
Although we installed a few additional dependencies, we could use the same pipeline as we used before which is a great benefit of using GPTQ.
After loading the model, we can run a prompt as follows:
# We will use the same prompt as we did originally
outputs = pipe(
prompt,
max_new_tokens=256,
do_sample=True,
temperature=0.1,
top_p=0.95
)
print(outputs[0]["generated_text"])
This gives us the following generated text:
Why did the Large Language Model go to the party?
To show off its wit and charm, of course!
But unfortunately, it got lost in the crowd and couldn’t find its way back to its owner. The partygoers were impressed by its ability to blend in so seamlessly with the crowd, but the Large Language Model was just confused and wanted to go home. In the end, it was found by a group of humans who recognized its unique style and brought it back to its rightful place. From then on, the Large Language Model made sure to wear a name tag at all parties, just to be safe.
GPTQ is the most often used compression method since it optimizes for GPU usage. It is definitely worth starting with GPTQ and switching over to a CPU-focused method, like GGUF if your GPU cannot handle such large models.
Although GPTQ does compression well, its focus on GPU can be a disadvantage if you do not have the hardware to run it.
GGUF, previously GGML, is a quantization method that allows users to use the CPU to run an LLM but also offload some of its layers to the GPU for a speed up.
Although using the CPU is generally slower than using a GPU for inference, it is an incredible format for those running models on CPU or Apple devices. Especially since we are seeing smaller and more capable models appearing, like Mistral 7B, the GGUF format might just be here to stay!
Using GGUF is rather straightforward with the ctransformers package which we will need to install first:
pip install ctransformers[cuda]
After doing so, we can navigate to the model that we want to load, namely “TheBloke/zephyr-7B-beta-GGUF” and choose a specific file.
Like GPTQ, these files indicate the quantization method, compression level, size of the model, etc.
We are using “zephyr-7b-beta.Q4_K_M.gguf” since we focus on 4-bit quantization:
from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer, pipeline
# Load LLM and Tokenizer
# Use `gpu_layers` to specify how many layers will be offloaded to the GPU.
model = AutoModelForCausalLM.from_pretrained(
"TheBloke/zephyr-7B-beta-GGUF",
model_file="zephyr-7b-beta.Q4_K_M.gguf",
model_type="mistral", gpu_layers=50, hf=True
)
tokenizer = AutoTokenizer.from_pretrained(
"HuggingFaceH4/zephyr-7b-beta", use_fast=True
)
# Create a pipeline
pipe = pipeline(model=model, tokenizer=tokenizer, task='text-generation')
After loading the model, we can run a prompt as follows:
# We will use the same prompt as we did originally
outputs = pipe(prompt, max_new_tokens=256)
print(outputs[0]["generated_text"])
This gives us the following output:
Why did the Large Language Model go to the party? To impress everyone with its vocabulary! But unfortunately, it kept repeating the same jokes over and over again, making everyone groan and roll their eyes. The partygoers soon realized that the Large Language Model was more of a party pooper than a party animal. Moral of the story: Just because a Large Language Model can generate a lot of words, doesn’t mean it knows how to be funny or entertaining. Sometimes, less is more!
GGUF is an amazing format if you want to leverage both the CPU and GPU when you, like me, are GPU-poor and do not have the latest and greatest GPU available.
A new format on the block is AWQ (Activation-aware Weight Quantization) which is a quantization method similar to GPTQ. There are several differences between AWQ and GPTQ as methods but the most important one is that AWQ assumes that not all weights are equally important for an LLM’s performance.
In other words, there is a small fraction of weights that will be skipped during quantization which helps with the quantization loss.
As a result, their paper mentions a significant speed-up compared to GPTQ whilst keeping similar, and sometimes even better, performance.
The method is still relatively new and has not been adopted yet to the extent of GPTQ and GGUF, so it is interesting to see if all these methods can co-exist.
For AWQ, we will use the vLLM package as that was, at least in my experience, the road of least resistance to using AWQ:
pip install vllm
With vLLM, loading and using our model becomes painless:
from vllm import LLM, SamplingParams
# Load the LLM
sampling_params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=256)
llm = LLM(
model="TheBloke/zephyr-7B-beta-AWQ",
quantization='awq',
dtype='half',
gpu_memory_utilization=.95,
max_model_len=4096
)
Then, we can easily run the model with .generate:
# Generate output based on the input prompt and sampling parameters
output = llm.generate(prompt, sampling_params)
print(output[0].outputs[0].text)
This gives us the following output:
Why did the Large Language Model go to the party? To network and expand its vocabulary! Why did the Large Language Model blush? Because it overheard another model saying it was a little too wordy! Why did the Large Language Model get kicked out of the library? It was being too loud and kept interrupting other models’ conversations with its endless chatter! …
Although it is a new format, AWQ is gaining popularity due to its speed and quality of compression!
🔥 TIP: For a more detailed comparison between these techniques with respect to VRAM/Perplexity, I highly advise reading this in-depth post with a follow-up here.
Large Language Models (LLMs) are becoming smaller, faster, and more efficient. Up to the point where I started to consider them for iterative tasks, like keyword extraction.
Having created KeyBERT, I felt that it was time to extend the package to also include LLMs. They are quite powerful and I wanted to prepare the package for when these models can be run on smaller GPUs.
As such, introducing KeyLLM, an extension to KeyBERT that allows you to use any LLM to extract, create, or even fine-tune the keywords! In this tutorial, we will go through keyword extraction with KeyLLM using the recently released Mistral 7B model.
Update: I uploaded a video version to YouTube that goes more in-depth into how to use KeyLLM
We will start by installing a number of packages that we are going to use throughout this example:
pip install --upgrade git+https://github.com/UKPLab/sentence-transformers
pip install keybert ctransformers[cuda]
pip install --upgrade git+https://github.com/huggingface/transformers
We are installing sentence-transformers from its main branch since it has a fix for community detection which we will use in the last few use cases. We do the same for transformers since it does not yet support the Mistral architecture.
In previous tutorials, we demonstrated how we could quantize the original model’s weight to make it run without running into memory problems.
Over the course of the last few months, TheBloke has been working hard on doing the quantization for hundreds of models for us.
This way, we can download the model directly which will speed things up quite a bit.
We’ll start with loading the model itself. We will offload 50 layers to the GPU. This will reduce RAM usage and use VRAM instead. If you are running into memory errors, reducing this parameter (gpu_layers) might help!
from ctransformers import AutoModelForCausalLM
# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
model = AutoModelForCausalLM.from_pretrained(
"TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
model_file="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
model_type="mistral",
gpu_layers=50,
hf=True
)
After having loaded the model itself, we want to create a 🤗 Transformers pipeline.
The main benefit of doing so is that these pipelines are found in many tutorials and are often used as a backend in packages. Thus far, ctransformers is not yet as natively supported as transformers.
Loading the Mistral tokenizer with ctransformers is not yet possible as the model is quite new. Instead, we use the tokenizer from the original repository.
from transformers import AutoTokenizer, pipeline
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
# Pipeline
generator = pipeline(
model=model, tokenizer=tokenizer,
task='text-generation',
max_new_tokens=50,
repetition_penalty=1.1
)
Let’s see if this works with a very basic example:
>>> response = generator("What is 1+1?")
>>> print(response[0]["generated_text"])
"""
What is 1+1?
A: 2
"""
Perfect! It can handle a very basic question. For the purpose of keyword extraction, let’s explore whether it can handle a bit more complexity.
prompt = """
I have the following document:
* The website mentions that it only takes a couple of days to deliver but I still have not received mine
Extract 5 keywords from that document.
"""
response = generator(prompt)
print(response[0]["generated_text"])
We get the following output:
"""
I have the following document:
* The website mentions that it only takes a couple of days to deliver but I still have not received mine
Extract 5 keywords from that document.
**Answer:**
1. Website
2. Mentions
3. Deliver
4. Couple
5. Days
"""
It does great! However, if we want the structure of the output to stay consistent regardless of the input text we will have to give the LLM an example.
This is where more advanced prompt engineering comes in. As with most Large Language Models, Mistral 7B expects a certain prompt format. This is tremendously helpful when we want to show it what a “correct” interaction looks like.
The prompt template is as follows:
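It wraps the instruction in special tokens, roughly like this (the same tokens also appear in the prompts below):

<s>[INST] {your instruction here} [/INST] {model answer}</s>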
Based on that template, let’s create a template for keyword extraction.
It needs to have two components:
Example prompt - This will be used to show the LLM what a “good” output looks like.
Keyword prompt - This will be used to ask the LLM to extract the keywords.
The first component, the example_prompt, will simply be an example of correctly extracting the keywords in the format that we are interested in.
The format is a key component since it will make sure that the LLM will always output keywords the way we want:
example_prompt = """
<s>[INST]
I have the following document:
- The website mentions that it only takes a couple of days to deliver but I still have not received mine.
Please give me the keywords that are present in this document and separate them with commas.
Make sure you to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST] meat, beef, eat, eating, emissions, steak, food, health, processed, chicken</s>"""
The second component, the keyword_prompt, will essentially be a repeat of the example_prompt but with two changes: it uses KeyBERT’s [DOCUMENT] tag to indicate where the input document will go, and it leaves the answer after [/INST] empty for the LLM to generate.
We can use the [DOCUMENT] tag to insert a document at a location of your choice. Having this option helps us to change the structure of the prompt if needed without being fixed on having the document at a specific location.
keyword_prompt = """
[INST]
I have the following document:
- [DOCUMENT]
Please give me the keywords that are present in this document and separate them with commas.
Make sure you to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST]
"""
Lastly, we combine the two prompts to create our final template:
>>> prompt = example_prompt + keyword_prompt
>>> print(prompt)
"""
<s>[INST]
I have the following document:
- The website mentions that it only takes a couple of days to deliver but I still have not received mine.
Please give me the keywords that are present in this document and separate them with commas.
Make sure you to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST] meat, beef, eat, eating, emissions, steak, food, health, processed, chicken</s>
[INST]
I have the following document:
- [DOCUMENT]
Please give me the keywords that are present in this document and separate them with commas.
Make sure you to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST]
"""
Now that we have our final prompt template, we can start exploring a couple of interesting new features in KeyBERT with KeyLLM. We will start by exploring KeyLLM using only Mistral’s 7B model.
KeyLLM
Keyword extraction with vanilla KeyLLM couldn’t be more straightforward; we simply ask it to extract keywords from a document.
This idea of extracting keywords from documents through an LLM is straightforward and allows for easily testing your LLM and its capabilities.
Using KeyLLM is straightforward; we start by loading our LLM through keybert.llm.TextGeneration and give it the prompt template that we created before.
🔥 TIP 🔥: If you want to use a different LLM, like ChatGPT, you can find a full overview of implemented algorithms here:
from keybert.llm import TextGeneration
from keybert import KeyLLM
# Load it in KeyLLM
llm = TextGeneration(generator, prompt=prompt)
kw_model = KeyLLM(llm)
After preparing our KeyLLM instance, it is as simple as running .extract_keywords over your documents:
documents = [
"The website mentions that it only takes a couple of days to deliver but I still have not received mine.",
"I received my package!",
"Whereas the most powerful LLMs have generally been accessible only through limited APIs (if at all), Meta released LLaMA's model weights to the research community under a noncommercial license."
]
keywords = kw_model.extract_keywords(documents)
We get the following keywords:
[['deliver',
'days',
'website',
'mention',
'couple',
'still',
'receive',
'mine'],
['package', 'received'],
['LLM',
'API',
'accessibility',
'release',
'license',
'research',
'community',
'model',
'weights',
'Meta']]
These seem like a great set of keywords!
You can play around with the prompt to specify the kind of keywords you want extracted, how long they can be, and even in which language they should be returned if your LLM is multi-lingual.
Efficient KeyLLM
Iterating your LLM over thousands of documents is not the most efficient approach! Instead, we can leverage embedding models to make the keyword extraction a bit more efficient.
This works as follows. First, we embed all of our documents and convert them to numerical representations. Second, we find out which documents are most similar to one another. We assume that documents that are highly similar will have the same keywords, so there would be no need to extract keywords for all documents. Third, we only extract keywords from 1 document in each cluster and assign the keywords to all documents in the same cluster.
This is much more efficient and also quite flexible. The clusters are generated purely based on the similarity between documents, without taking cluster structures into account. In other words, it is essentially finding near-duplicate documents that we expect to have the same set of keywords.
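To make that grouping idea concrete, here is a minimal sketch — not KeyLLM's actual internals — of how documents could be grouped with a cosine-similarity threshold:
from sentence_transformers import SentenceTransformer, util

# Embed the documents (assumes `documents` is a list of strings, as defined earlier)
model = SentenceTransformer('BAAI/bge-small-en-v1.5')
embeddings = model.encode(documents, convert_to_tensor=True)

# Greedily group documents whose similarity exceeds the threshold
threshold = 0.95
groups, assigned = [], set()
for i in range(len(documents)):
    if i in assigned:
        continue
    sims = util.cos_sim(embeddings[i], embeddings)[0]
    group = [j for j in range(len(documents)) if j not in assigned and sims[j] >= threshold]
    assigned.update(group)
    groups.append(group)

# Only the first document of each group would be sent to the LLM;
# its keywords would then be copied to the other documents in that group.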
To do this with KeyLLM, we embed our documents beforehand and pass them to .extract_keywords. The threshold indicates how similar documents minimally need to be in order to be assigned to the same cluster.
Increasing this value to something like .95 will identify near-identical documents, whereas setting it to something like .5 will identify documents about the same topic.
from keybert import KeyLLM
from sentence_transformers import SentenceTransformer
# Extract embeddings
model = SentenceTransformer('BAAI/bge-small-en-v1.5')
embeddings = model.encode(documents, convert_to_tensor=True)
# Load it in KeyLLM
kw_model = KeyLLM(llm)
# Extract keywords
keywords = kw_model.extract_keywords(documents, embeddings=embeddings, threshold=.5)
We get the following keywords:
>>> keywords
[['deliver',
'days',
'website',
'mention',
'couple',
'still',
'receive',
'mine'],
['deliver',
'days',
'website',
'mention',
'couple',
'still',
'receive',
'mine'],
['LLaMA',
'model',
'weights',
'release',
'noncommercial',
'license',
'research',
'community',
'powerful',
'LLMs',
'APIs']]
In this example, we can see that the first two documents were clustered together and received the same keywords. Instead of passing all three documents to the LLM, we only pass two documents. This can speed things up significantly if you have thousands of documents.
KeyBERT & KeyLLM
Before, we manually passed the embeddings to KeyLLM to essentially do a zero-shot extraction of keywords. We can further extend this example by leveraging KeyBERT.
Since KeyBERT generates keywords and embeds the documents, we can leverage that to not only simplify the pipeline but also suggest a number of keywords to the LLM.
These suggested keywords can help the LLM decide on the keywords to use. Moreover, it allows for everything within KeyBERT to be used with KeyLLM!
This efficient keyword extraction with both KeyBERT and KeyLLM only requires three lines of code! We create a KeyBERT model and assign it the LLM with the embedding model we previously created:
from keybert import KeyLLM, KeyBERT
# Load it in KeyLLM
kw_model = KeyBERT(llm=llm, model='BAAI/bge-small-en-v1.5')
# Extract keywords
keywords = kw_model.extract_keywords(documents, threshold=0.5)
We get the following keywords:
>>> keywords
[['deliver',
'days',
'website',
'mention',
'couple',
'still',
'receive',
'mine'],
['package', 'received'],
['LLM',
'API',
'accessibility',
'release',
'license',
'research',
'community',
'model',
'weights',
'Meta']]
And that is it! With KeyLLM you are able to use Large Language Models to help create better keywords. We can choose to extract keywords from the text itself or ask the LLM to come up with keywords.
By combining KeyLLM with KeyBERT, we increase its potential by doing some computation and suggestions beforehand.
🔥 TIP 🔥: You can use the [CANDIDATES] tag to pass the keywords generated by KeyBERT to the LLM as candidate keywords. That way, you can tell the LLM that KeyBERT has already generated a number of keywords and ask it to improve them.
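For example, a prompt using both tags could look roughly like this (a sketch; the exact wording is up to you):
[INST]
I have the following document:
- [DOCUMENT]
Here is a list of candidate keywords: [CANDIDATES]
Please improve these keywords so they best describe the document, and separate them with commas.
Make sure to only return the keywords and say nothing else.
[/INST]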
If you are, like me, passionate about AI and/or Psychology, please feel free to add me on LinkedIn, follow me on Twitter, or subscribe to my Newsletter:
Large Language Models (LLMs) are here to stay. With the recent release of Llama 2, LLMs are approaching the performance of ChatGPT and, with proper tuning, can even exceed it.
Using these LLMs is often not as straightforward as it seems especially if you want to fine-tune the LLM to your specific use case.
In this article, we will go through 3 of the most common methods for improving the performance of any LLM:
Prompt Engineering
Retrieval Augmented Generation (RAG)
Parameter Efficient Fine-Tuning (PEFT)
There are many more methods but these are the easiest and can result in major improvements without much work.
These 3 methods start from the least complex method, the so-called low-hanging fruits, to one of the more complex methods for improving your LLM.
To get the most out of LLMs, you can even combine all three methods!
Before we get started, here is a more in-depth overview of the methods for easier reference:
You can also follow along with the Google Colab Notebook to make sure everything works as intended.
Update: I uploaded a video version to YouTube that goes more in-depth into how to use these methods:
Before we get started, we need to load in an LLM to use throughout these examples. We’re going with the base Llama 2 as it shows incredible performance and because I am a big fan of sticking with foundation models in tutorials.
We will first need to accept the license before we can get started. Follow these steps:
After doing so, we can log in with our HuggingFace credentials so that this environment knows we have permission to download the Llama 2 model that we are interested in:
from huggingface_hub import notebook_login
notebook_login()
Next, we can load in the 13B variant of Llama 2:
from torch import cuda, bfloat16
import transformers
model_id = 'meta-llama/Llama-2-13b-chat-hf'
# 4-bit quantization to load Llama 2 with less GPU memory
bnb_config = transformers.BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type='nf4',
bnb_4bit_use_double_quant=True,
bnb_4bit_compute_dtype=bfloat16
)
# Llama 2 Tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
# Llama 2 Model
model = transformers.AutoModelForCausalLM.from_pretrained(
model_id,
trust_remote_code=True,
quantization_config=bnb_config,
device_map='auto',
)
model.eval()
# Our text generator
generator = transformers.pipeline(
model=model, tokenizer=tokenizer,
task='text-generation',
temperature=0.1,
max_new_tokens=500,
repetition_penalty=1.1
)
Most open-source LLMs have some sort of template that you must adhere to when creating prompts. In the case of Llama 2, the following helps guide the prompts:
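In short, the chat variants of Llama 2 expect a system prompt wrapped in <<SYS>> tags and the user prompt wrapped in [INST] tags (a sketch; check the official Llama 2 documentation for the exact tokens):
<s>[INST] <<SYS>>
{{ System Prompt }}
<</SYS>>
{{ User Prompt }} [/INST]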
This means that we would have to use the prompt as follows to generate text properly:
basic_prompt = """
<s>[INST] <<SYS>>
You are a helpful assistant
<</SYS>>
What is 1 + 1? [/INST]
"""
print(generator(basic_prompt)[0]["generated_text"])
Which generates the following output:
"""
Oh my, that's a simple one!
The answer to 1 + 1 is... (drumroll please)... 2! 😄
"""
What a cheeky LLM!
The template is less complex than it seems but with a bit of practice, you should get it right in no time.
Now, let’s dive into our first method for improving the output of an LLM, prompt engineering.
How we ask the LLM something has a major effect on the quality of the output that we get. We need to be precise and complete, and give examples of the output we are interested in.
This tailoring of your prompt is called prompt engineering.
Prompt engineering is such an amazing way to “tune” your model. It requires no updating of the model and you can quickly iterate over it.
There are two major concepts in prompt engineering:
Example-based
Thought-based
In example-based prompting, such as one-shot or few-shot learning, we provide the LLM with a couple of examples of what we are looking for.
This generally generates text that is more aligned with how we want it.
For example, let’s apply sentiment classification to a short review:
prompt = """
<s>[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
Classify the text into neutral, negative or positive.
Text: I think the food was okay. [/INST]
"""
print(generator(prompt)[0]["generated_text"])
Which generates the following output:
"""
Positive. The word "okay" is a mildly positive word,
indicating that the food was satisfactory or acceptable.
"""
Personally, I am not that convinced by the answer. I think it is more neutral than positive. Also, we have to search the generated text for the answer.
Instead, let’s give it an example of how we want the answer to be generated:
prompt = """
<s>[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
Classify the text into neutral, negative or positive.
Text: I think the food was alright.
Sentiment:
[/INST]
Neutral</s><s>
[INST]
Classify the text into neutral, negative or positive.
Text: I think the food was okay.
Sentiment:
[/INST]
"""
print(generator(prompt)[0]["generated_text"])
When we look at the output, we get the expected result:
"""
Neutral
"""
Now, the LLM is to the point and gives only the answer that we are interested in. Much better!
We can go a step further and ask the LLM to “reason” about its answer.
By having the LLM divide its thinking into smaller steps, it allows for more computation to be given to each step. These smaller steps are generally referred to as the *“thoughts”* of the LLM.
There are many ways that we can use thought-based prompting and we can make it quite complex as seen below:
However, let’s start with the most basic example. What if we ask the model to solve a problem by breaking the problem into steps?
prompt = """
<s>[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
Do the odd numbers in this group add up to an even number? 3, 5, 15, 32.
Solve by breaking the problem into steps.
Identify the odd numbers, add them, and indicate whether the result is odd or even.[/INST]
"""
print(generator(prompt)[0]["generated_text"])
The output we get shows its reasoning in detail:
"""
Hello! I'd be happy to help you with that. Let's break down the problem into steps:
Step 1: Identify the odd numbers in the group.
The odd numbers in the group are:
3, 5, 15
Step 2: Add the odd numbers.
3 + 5 + 15 = 23
Step 3: Is the result odd or even?
23 is an odd number.
Therefore, the sum of the odd numbers in the group is an odd number.
"""
As we have seen before, this is also called Chain-of-Thought, where the LLM has a sequence of individual *thoughts* or steps it follows.
These individual steps also help the model to stay accountable during its computation. Because it has “reasoned” about each step individually there is structure in its “thinking” process.
Although prompt engineering can get us an improvement, it cannot make the LLM know something it has not learned before.
When an LLM is trained in 2022, it has no knowledge about what has happened in 2023.
This is where Retrieval-Augmented Generation (RAG) comes in. It is a method of providing external knowledge to an LLM that it can leverage.
In RAG, a knowledge base, like Wikipedia, is converted to numerical representations to capture its meaning, called embeddings. These embeddings are stored in a vector database so that the information can easily be retrieved.
Then, when you give the LLM a certain prompt, the vector database is searched for information that relates to the prompt.
The most relevant information is then passed to the LLM as the additional context that it can use to derive its response.
In practice, RAG helps the LLM to “look up” information in its external knowledge base to improve its response.
To create a RAG pipeline or system, we can use the well-known and easy-to-use framework called LangChain.
We’ll start with creating a tiny knowledge base about Llama 2 and writing it into a text file:
# Our tiny knowledge base
knowledge_base = [
"On July 18, 2023, in partnership with Microsoft, Meta announced LLaMA-2, the next generation of LLaMA." ,
"Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ",
"The fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases.",
"Meta trained and released LLaMA-2 in three model sizes: 7, 13, and 70 billion parameters.",
"The model architecture remains largely unchanged from that of LLaMA-1 models, but 40% more data was used to train the foundational models.",
"The accompanying preprint also mentions a model with 34B parameters that might be released in the future upon satisfying safety targets."
]
with open(r'knowledge_base.txt', 'w') as fp:
fp.write('\n'.join(knowledge_base))
After doing so, we will need to create an embedding model that can convert text to numerical representations, namely embeddings.
We will choose a well-known sentence-transformers model, namely sentence-transformers/all-MiniLM-L6-v2.
🔥 TIP 🔥 You can find an amazing list of models at the Massive Text Embedding Benchmark (MTEB) Leaderboard.
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
# Embedding Model for converting text to numerical representations
embedding_model = HuggingFaceEmbeddings(
model_name='sentence-transformers/all-MiniLM-L6-v2'
)
Now that we have an embedding model and a tiny knowledge base, we can start developing our vector database. There are many to choose from but let’s choose one that we can use locally, namely FAISS.
🔥 TIP🔥 Other great vector databases are Pinecone and Weaviate.
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.document_loaders import TextLoader
# Load documents and split them
documents = TextLoader("knowledge_base.txt").load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
# Create local vector database
db = FAISS.from_documents(docs, embedding_model)
The db variable contains everything we need for the database, but we still need to combine it with an LLM. LangChain makes this incredibly easy and straightforward:
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline
# Load LLM into LangChain
llm = HuggingFacePipeline(pipeline=generator)
# RAG Pipeline
rag = RetrievalQA.from_chain_type(
llm=llm, chain_type='stuff',
retriever=db.as_retriever()
)
The great thing about the two variables llm and rag that we created is that we can use llm to do some prompting without the RAG pipeline and rag with the RAG pipeline.
Let’s start without RAG. What happens if we ask the LLM about itself, Llama 2?
>>> llm('What is Llama 2?')
"""
Llama 2 is a cryptocurrency that was created as a parody of the popular cryptocurrency Dogecoin. It was designed to be a more serious and less meme-focused alternative to Dogecoin, with a focus on privacy and security.
Llama 2 uses a proof-of-work consensus algorithm and has a block time of 1 minute. It also features a decentralized governance system that allows holders of the llama2 token to vote on proposals for the development and growth of the project.
One of the key features of Llama 2 is its use of "ring signatures" to enhance privacy and security. Ring signatures are a type of digital signature that can be used to hide the identity of the sender in a transaction. This makes it more difficult for hackers or other malicious actors to trace the source of a transaction.
Overall, Llama 2 is a privacy-focused cryptocurrency that is designed to provide users with greater control over their financial data and more secure transactions.
"""
Cryptocurrency? That is not exactly the answer that we were looking for… It seems that it has no knowledge about itself.
Let’s try to use the RAG pipeline instead:
>>> rag('What is Llama 2?')
"""
Llama 2 is a collection of pretrained and fine-tuned large language models
(LLMs) announced by Meta in partnership with Microsoft on July 18, 2023.
"""
That is much better!
Since we have given it external knowledge about Llama 2, it can leverage that information to generate more accurate answers.
🔥 TIP 🔥 Prompting can get difficult and complex quite quickly. If you want to know the exact prompt that is given to the LLM, you can run the following before running the LLM:
import langchain
langchain.debug = True
Both prompt engineering and RAG generally do not change the LLM itself. Its parameters remain the same and the model does not “learn” anything new; it simply leverages what it already knows together with the context we give it.
We can fine-tune the LLM for a specific use case with domain-specific data so that it learns something new.
Instead of fine-tuning the model’s billions of parameters, we can leverage PEFT instead, Parameter-Efficient Fine-Tuning. As the name implies, it is a subfield that focuses on efficiently fine-tuning an LLM with as few parameters as possible.
One of the most often used methods to do so is called Low-Rank Adaptation (LoRA). Instead of updating all of the original parameters, LoRA keeps the base model frozen and trains a small set of additional low-rank weight matrices on top of it.
These adapters can be seen as compact representations of the updates that matter most, so only a tiny fraction of parameters is actually trained. The beauty is that the resulting adapter weights can be merged into the base model or saved separately.
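Under the hood, this kind of adapter can be configured with the Hugging Face peft library. Here is a hedged sketch of what a manual LoRA setup could look like — the target modules and hyperparameters below are assumptions for illustration, not necessarily what AutoTrain uses:
from peft import LoraConfig, get_peft_model

# Low-rank adapters on the attention projections; the base model itself stays frozen
lora_config = LoraConfig(
    r=16,                                 # rank of the adapter matrices (assumed)
    lora_alpha=32,                        # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],  # which weight matrices receive adapters (assumed)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# `model` is the Llama 2 model we loaded earlier
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # typically well below 1% of all parameters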
The process of fine-tuning Llama 2 can be difficult with the many parameters out there. Fortunately, AutoTrain takes most of the difficulty away from you and allows you to fine-tune in only a single line!
We’ll start with the data. As always, it is the one thing that affects the resulting performance most!
We are going to turn the base Llama 2 model into a chat model, and we will use the OpenAssistant Guanaco dataset for that:
import pandas as pd
from datasets import load_dataset
# Load dataset in pandas
dataset = load_dataset("timdettmers/openassistant-guanaco")
df = pd.DataFrame(dataset["train"][:1000]).dropna()
df.to_csv("train.csv")
This dataset has a number of question/response schemes that you can train Llama 2 on. It differentiates the user with the ### Human tag and the response from the LLM with the ### Assistant tag.
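To give you an idea, an (illustrative, made-up) row in that text column looks roughly like this:
### Human: Can you explain what a large language model is?### Assistant: Sure! A large language model is a neural network trained on huge amounts of text to predict the next token in a sequence...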
We are only going to take 1000 samples from this dataset for illustration purposes but the performance will definitely increase with more quality data points.
NOTE: The dataset needs a text column, which is what AutoTrain will automatically use.
The training in itself is extremely straightforward after installing AutoTrain with only a single line of code:
autotrain llm --train \
--project_name Llama-Chat \
--model abhishek/llama-2-7b-hf-small-shards \
--data_path . \
--use_peft \
--use_int4 \
--learning_rate 2e-4 \
--num_train_epochs 1 \
--trainer sft \
--merge_adapter
There are a number of parameters that are important:
data_path: The path to your data. We saved a *train.csv* locally with a *text* column that AutoTrain will use during training.
model: The base model that we are going to fine-tune. It is a sharded version of the base model that allows for easier training.
use_peft & use_int4: These parameters enable efficient fine-tuning of the model, which reduces the VRAM that is necessary. They leverage, in part, LoRA.
merge_adapter: To make it easier to use the model, we merge the LoRA weights with the base model to create a new, stand-alone model.
And that is it! Fine-tuning a Llama 2 model this way is incredibly easy and since we merged the LoRA weights with the original model, you can load in the updated model as we did before.
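Since we passed --merge_adapter, the merged model should end up in the project folder. Loading it back is then (roughly) the same as before — note that the folder name below simply follows from the --project_name we chose, so treat the exact path as an assumption:
import transformers

# Load the fine-tuned, merged model from the AutoTrain output folder (assumed path)
tokenizer = transformers.AutoTokenizer.from_pretrained("Llama-Chat")
model = transformers.AutoModelForCausalLM.from_pretrained("Llama-Chat", device_map="auto")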
🔥 TIP 🔥 Although fine-tuning in one line is amazing, it is very much advised to go through the parameters yourself. Working through in-depth guides on what fine-tuning actually involves also helps you understand when things go wrong.
If you are, like me, passionate about AI and/or Psychology, please feel free to add me on LinkedIn, follow me on Twitter, or subscribe to my Newsletter:
With the advent of Llama 2, running strong LLMs locally has become more and more of a reality. Its accuracy approaches OpenAI’s GPT-3.5, which serves well for many use cases.
In this article, we will explore how we can use Llama2 for Topic Modeling without the need to pass every single document to the model. Instead, we are going to leverage BERTopic, a modular topic modeling technique that can use any LLM for fine-tuning topic representations.
Update: I uploaded a video version to YouTube that goes more in-depth into how to use BERTopic with Llama 2:
BERTopic works in a rather straightforward way. It consists of 5 sequential steps: embedding documents, reducing the embeddings in dimensionality, clustering the embeddings, tokenizing documents per cluster, and finally extracting the best-representing words per topic.
However, with the rise of LLMs like Llama 2, we can do much better than a bunch of independent words per topic. It is computationally not feasible to pass all documents to Llama 2 directly and have it analyze them. We can employ vector databases for search but we are not entirely sure which topics to search for.
Instead, we will leverage the clusters and topics that were created by BERTopic and have Llama 2 fine-tune and distill that information into something more accurate.
This is the best of both worlds, the topic creation of BERTopic together with the topic representation of Llama 2.
Now that this intro is out of the way, let’s start the hands-on tutorial!
We will start by installing a number of packages that we are going to use throughout this example:
pip install bertopic datasets accelerate bitsandbytes xformers adjustText
Keep in mind that you will need at least a T4 GPU in order to run this example, which can be used with a free Google Colab instance.
We are going to apply topic modeling on a number of ArXiv abstracts. They are a great source for topic modeling since they contain a wide variety of topics and are generally well-written.
from datasets import load_dataset
dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
# Extract abstracts to train on and corresponding titles
abstracts = dataset["abstract"]
titles = dataset["title"]
To give you an idea, an abstract looks like the following:
>>> # The abstract of "Attention Is All You Need"
>>> print(abstracts[13894])
"""
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks in an encoder-decoder configuration. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer, based
solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experiments on two machine translation tasks show these models to be
superior in quality while being more parallelizable and requiring significantly
less time to train. Our model achieves 28.4 BLEU on the WMT 2014
English-to-German translation task, improving over the existing best results,
including ensembles by over 2 BLEU. On the WMT 2014 English-to-French
translation task, our model establishes a new single-model state-of-the-art
BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction
of the training costs of the best models from the literature. We show that the
Transformer generalizes well to other tasks by applying it successfully to
English constituency parsing both with large and limited training data.
"""
Before we can load in Llama2 using a number of tricks, we will first need to accept the License for using Llama2. The steps are as follows:
After doing so, we can log in with our HuggingFace credentials so that this environment knows we have permission to download the Llama 2 model that we are interested in.
from huggingface_hub import notebook_login
notebook_login()
Now comes one of the more interesting components of this tutorial, how to load in a Llama 2 model on a T4-GPU!
We will be focusing on the 'meta-llama/Llama-2-13b-chat-hf' variant. It is large enough to give interesting and useful results whilst small enough that it can be run in our environment.
We start by defining our model and identifying if our GPU is correctly selected. We expect the output of device to show a CUDA device:
from torch import cuda
model_id = 'meta-llama/Llama-2-13b-chat-hf'
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'; print(device)
In order to load our 13 billion parameter model, we will need to perform some optimization tricks. Since we have limited VRAM and not an A100 GPU, we will need to “condense” the model a bit so that we can run it.
There are a number of tricks that we can use but the main principle is going to be 4-bit quantization.
This process reduces the model’s original 16-bit floating-point weights to only 4 bits, which reduces the GPU memory that we will need. It is a recent technique, and quite an elegant one, for efficient LLM loading and usage. You can find more about that method here in the QLoRA paper and on the amazing HuggingFace blog here.
from torch import bfloat16
import transformers
# Quantization to load an LLM with less GPU memory
bnb_config = transformers.BitsAndBytesConfig(
load_in_4bit=True, # 4-bit quantization
bnb_4bit_quant_type='nf4', # Normalized float 4
bnb_4bit_use_double_quant=True, # Second quantization after the first
bnb_4bit_compute_dtype=bfloat16 # Computation type
)
These four parameters that we just set are incredibly important and bring many LLM applications to consumers:
load_in_4bit - Load the model weights in 4-bit precision
bnb_4bit_quant_type - Use the normalized float 4 (nf4) data type
bnb_4bit_use_double_quant - Apply a second quantization after the first to save additional memory
bnb_4bit_compute_dtype - The data type (bfloat16) used during computation
Using this configuration, we can start loading in the model as well as the tokenizer:
# Llama 2 Tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
# Llama 2 Model
model = transformers.AutoModelForCausalLM.from_pretrained(
model_id,
trust_remote_code=True,
quantization_config=bnb_config,
device_map='auto',
)
model.eval()
Using the model and tokenizer, we will generate a HuggingFace transformers pipeline that allows us to easily generate new text:
# Our text generator
generator = transformers.pipeline(
model=model, tokenizer=tokenizer,
task='text-generation',
temperature=0.1,
max_new_tokens=500,
repetition_penalty=1.1
)
To check whether our model is correctly loaded, let’s try it out with a few prompts.
>>> prompt = "Could you explain to me how 4-bit quantization works as if I am 5?"
>>> res = generator(prompt)
>>> print(res[0]["generated_text"])
"""
Could you explain to me how 4-bit quantization works as if I am 5?
Sure! Imagine you have a big box of crayons. Each crayon represents a different color, like red, blue, green, and so on. Now, imagine that instead of using all the different colors, we only want to use four colors: red, blue, green, and yellow. We can do this by taking all the colors and dividing them into four groups.
Group 1: Red, Blue, Green
Group 2: Yellow
Now, when we want to draw something, we can only choose one color from each group. So, if we want to draw a red apple, we would choose the red crayon from Group 1 and the yellow crayon from Group 2. This way, we can only use four colors, but we can still make lots of different things!
In the same way, when we use 4-bit quantization, we take all the numbers and divide them into four groups. Each group has a different number of bits, just like our crayon groups had different colors. Then, when we want to represent a number, we can only choose one number from each group. This way, we can represent lots of different numbers using only four bits!
"""
Although we can directly prompt the model, there is actually a template that we need to follow. The template looks as follows:
"""
<s>[INST] <<SYS>>
{{ System Prompt }}
<</SYS>>
{{ User Prompt }}
[/INST]
{{ Model Answer }}
"""
This template consists of two main components, namely the {{ System Prompt }} and the {{ User Prompt }}:
The {{ System Prompt }} helps us guide the model during a conversation. For example, we can say that it is a helpful assistant that is specialized in labeling topics.
The {{ User Prompt }} is where we ask it a question.
You might have noticed the [INST] tags, which are used to identify the beginning and end of a prompt. We can use these to model the conversation history, as we will see more in-depth later on.
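For example, a short two-turn conversation could be encoded roughly as follows (a sketch of the format, not a prompt we use in this tutorial):
<s>[INST] <<SYS>>
{{ System Prompt }}
<</SYS>>
{{ First User Prompt }} [/INST] {{ First Model Answer }}</s>
<s>[INST] {{ Second User Prompt }} [/INST]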
Next, let’s see how we can use this template to optimize Llama 2 for topic modeling.
We are going to keep our system prompt simple and to the point:
# System prompt describes information given to all conversations
system_prompt = """
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant for labeling topics.
<</SYS>>
"""
We will tell the model that it is simply a helpful assistant for labeling topics since that is our main goal.
In contrast, our user prompt is going to be a bit more involved. It will consist of two components: an example and the main prompt.
Let’s start with the example. Most LLMs do a much better job of generating accurate responses if you give them an example to work with. We will show it an accurate example of the kind of output we are expecting.
# Example prompt demonstrating the output we are looking for
example_prompt = """
I have a topic that contains the following documents:
- Traditional diets in most cultures were primarily plant-based with a little meat on top, but with the rise of industrial style meat production and factory farming, meat has become a staple food.
- Meat, but especially beef, is the worst food in terms of emissions.
- Eating meat doesn't make you a bad person, not eating meat doesn't make you a good one.
The topic is described by the following keywords: 'meat, beef, eat, eating, emissions, steak, food, health, processed, chicken'.
Based on the information about the topic above, please create a short label of this topic. Make sure you to only return the label and nothing more.
[/INST] Environmental impacts of eating meat
"""
This example, based on a number of keywords and documents primarily about the impact of eating meat, helps the model understand the kind of output it should give. We show the model that we are expecting only the label, which is easier for us to extract.
Next, we will create a template that we can use within BERTopic:
# Our main prompt with documents ([DOCUMENTS]) and keywords ([KEYWORDS]) tags
main_prompt = """
[INST]
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: '[KEYWORDS]'.
Based on the information about the topic above, please create a short label of this topic. Make sure you to only return the label and nothing more.
[/INST]
"""
There are two BERTopic-specific tags that are of interest, namely [DOCUMENTS] and [KEYWORDS]:
[DOCUMENTS] contains the top 5 most relevant documents for the topic
[KEYWORDS] contains the top 10 most relevant keywords for the topic, as generated through c-TF-IDF
This template will be filled in accordingly for each topic. Finally, we can combine everything into our final prompt:
prompt = system_prompt + example_prompt + main_prompt
Before we can start with topic modeling, we will first need to perform two steps: pre-calculating the embeddings for each document and defining the sub-models of BERTopic.
By pre-calculating the embeddings for each document, we can speed up additional exploration steps and use the embeddings to quickly iterate over BERTopic’s hyperparameters if needed.
🔥 TIP: You can find a great overview of good embeddings for clustering on the MTEB Leaderboard.
from sentence_transformers import SentenceTransformer
# Pre-calculate embeddings
embedding_model = SentenceTransformer("BAAI/bge-small-en")
embeddings = embedding_model.encode(abstracts, show_progress_bar=True)
Next, we will define all sub-models in BERTopic and do some small tweaks to the number of clusters to be created, setting random states, etc.
from umap import UMAP
from hdbscan import HDBSCAN
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=150, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
As a small bonus, we are going to reduce the embeddings we created before to 2 dimensions so that we can use them for visualization purposes when we have created our topics.
# Pre-reduce embeddings for visualization purposes
reduced_embeddings = UMAP(n_neighbors=15, n_components=2, min_dist=0.0, metric='cosine', random_state=42).fit_transform(embeddings)
One of the ways we are going to represent the topics is with Llama 2 which should give us a nice label. However, we might want to have additional representations to view a topic from multiple angles.
Here, we will be using c-TF-IDF as our main representation and KeyBERT, MMR, and Llama 2 as our additional representations.
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, TextGeneration
# KeyBERT
keybert = KeyBERTInspired()
# MMR
mmr = MaximalMarginalRelevance(diversity=0.3)
# Text generation with Llama 2
llama2 = TextGeneration(generator, prompt=prompt)
# All representation models
representation_model = {
"KeyBERT": keybert,
"Llama2": llama2,
"MMR": mmr,
}
Now that we have our models prepared, we can start training our topic model! We supply BERTopic with the sub-models of interest, run .fit_transform, and see what kind of topics we get.
from bertopic import BERTopic
topic_model = BERTopic(
# Sub-models
embedding_model=embedding_model,
umap_model=umap_model,
hdbscan_model=hdbscan_model,
representation_model=representation_model,
# Hyperparameters
top_n_words=10,
verbose=True
)
# Train model
topics, probs = topic_model.fit_transform(abstracts, embeddings)
Now that we are done training our model, let’s see what topics were generated:
# Show top 3 most frequent topics
topic_model.get_topic_info()[1:4]
Index | Topic | Count | Representation | KeyBERT | Llama2 | MMR |
---|---|---|---|---|---|---|
1 | 0 | 10339 | [‘policy’, ‘reinforcement’, ‘rl’, ‘agent’, ‘learning’, ‘control’, ‘agents’, ‘to’, ‘reward’, ‘in’] | [‘learning’, ‘robots’, ‘reinforcement’, ‘dynamics’, ‘model’, ‘robotic’, ‘learned’, ‘robot’, ‘algorithms’, ‘exploration’] | [‘Reinforcement Learning Agent Control’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’] | [‘policy’, ‘reinforcement’, ‘rl’, ‘agent’, ‘learning’, ‘control’, ‘agents’, ‘to’, ‘reward’, ‘in’] |
2 | 1 | 3592 | [‘privacy’, ‘federated’, ‘fl’, ‘private’, ‘clients’, ‘data’, ‘learning’, ‘communication’, ‘local’, ‘client’] | [‘federated’, ‘decentralized’, ‘heterogeneity’, ‘distributed’, ‘algorithms’, ‘datasets’, ‘models’, ‘convergence’, ‘model’, ‘gradient’] | [‘Privacy-Preserving Machine Learning: Federated Learning’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’] | [‘privacy’, ‘federated’, ‘fl’, ‘private’, ‘clients’, ‘data’, ‘learning’, ‘communication’, ‘local’, ‘client’] |
3 | 2 | 3532 | [‘speech’, ‘audio’, ‘speaker’, ‘music’, ‘asr’, ‘acoustic’, ‘recognition’, ‘voice’, ‘the’, ‘model’] | [‘encoder’, ‘speech’, ‘voice’, ‘trained’, ‘language’, ‘models’, ‘neural’, ‘model’, ‘supervised’, ‘learning’] | [‘Speech Recognition and Audio-Visual Processing’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’] | [‘speech’, ‘audio’, ‘speaker’, ‘music’, ‘asr’, ‘acoustic’, ‘recognition’, ‘voice’, ‘the’, ‘model’] |
# Show top 3 least frequent topics
topic_model.get_topic_info()[-3:]
Index | Topic | Count | Representation | KeyBERT | Llama2 | MMR |
---|---|---|---|---|---|---|
118 | 117 | 160 | [‘design’, ‘circuit’, ‘circuits’, ‘synthesis’, ‘chip’, ‘designs’, ‘power’, ‘hardware’, ‘placement’, ‘hls’] | [‘circuits’, ‘circuit’, ‘analog’, ‘optimization’, ‘model’, ‘chip’, ‘technology’, ‘simulation’, ‘learning’, ‘neural’] | [‘Design Automation for Analog Circuits using Reinforcement Learning’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’] | [‘design’, ‘circuit’, ‘circuits’, ‘synthesis’, ‘chip’, ‘designs’, ‘power’, ‘hardware’, ‘placement’, ‘hls’] |
119 | 118 | 159 | [‘sentiment’, ‘aspect’, ‘analysis’, ‘polarity’, ‘reviews’, ‘opinion’, ‘text’, ‘task’, ‘twitter’, ‘language’] | [‘embeddings’, ‘sentiment’, ‘sentiments’, ‘supervised’, ‘annotated’, ‘corpus’, ‘aspect’, ‘multilingual’, ‘datasets’, ‘model’] | [‘Multilingual Aspect-Based Sentiment Analysis’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’] | [‘sentiment’, ‘aspect’, ‘analysis’, ‘polarity’, ‘reviews’, ‘opinion’, ‘text’, ‘task’, ‘twitter’, ‘language’] |
120 | 119 | 159 | [‘crowdsourcing’, ‘workers’, ‘crowd’, ‘worker’, ‘crowdsourced’, ‘labels’, ‘annotators’, ‘annotations’, ‘label’, ‘labeling’] | [‘crowdsourcing’, ‘crowdsourced’, ‘annotators’, ‘crowds’, ‘annotation’, ‘algorithms’, ‘aggregation’, ‘crowd’, ‘datasets’, ‘annotator’] | [‘Crowdsourced Data Labeling’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’] | [‘crowdsourcing’, ‘workers’, ‘crowd’, ‘worker’, ‘crowdsourced’, ‘labels’, ‘annotators’, ‘annotations’, ‘label’, ‘labeling’] |
We got over 100 topics and they all seem quite diverse. We can use the labels generated by Llama 2 and assign them to the topics that we have created. Normally, the default topic representation would be c-TF-IDF, but we will focus on the Llama 2 representations instead.
llama2_labels = [label[0][0].split("\n")[0] for label in topic_model.get_topics(full=True)["Llama2"].values()]
topic_model.set_topic_labels(llama2_labels)
We can go through each topic manually, which would take a lot of work, or we can visualize them all in a single interactive graph. BERTopic has a bunch of visualization functions that we can use. For now, we are sticking with visualizing the documents.
topic_model.visualize_documents(titles, reduced_embeddings=reduced_embeddings,
hide_annotations=True, hide_document_hover=False, custom_labels=True)
Although we can use the built-in visualization features of BERTopic, we can also create a static visualization that might be a bit more informative.
We start by creating the necessary variables that contain our reduced embeddings and representations:
import itertools
import pandas as pd
# Define colors for the visualization to iterate over
colors = itertools.cycle(['#e6194b', '#3cb44b', '#ffe119', '#4363d8', '#f58231', '#911eb4', '#46f0f0', '#f032e6', '#bcf60c', '#fabebe', '#008080', '#e6beff', '#9a6324', '#fffac8', '#800000', '#aaffc3', '#808000', '#ffd8b1', '#000075', '#808080', '#ffffff', '#000000'])
color_key = {str(topic): next(colors) for topic in set(topic_model.topics_) if topic != -1}
# Prepare dataframe and ignore outliers
df = pd.DataFrame({"x": reduced_embeddings[:, 0], "y": reduced_embeddings[:, 1], "Topic": [str(t) for t in topic_model.topics_]})
df["Length"] = [len(doc) for doc in abstracts]
df = df.loc[df.Topic != "-1"]
df = df.loc[(df.y > -10) & (df.y < 10) & (df.x < 10) & (df.x > -10), :]
df["Topic"] = df["Topic"].astype("category")
# Get centroids of clusters
mean_df = df.groupby("Topic").mean().reset_index()
mean_df.Topic = mean_df.Topic.astype(int)
mean_df = mean_df.sort_values("Topic")
Next, we will visualize the reduced embeddings with matplotlib and process the labels in such a way that it is visually more pleasing:
import seaborn as sns
from matplotlib import pyplot as plt
from adjustText import adjust_text
import matplotlib.patheffects as pe
import textwrap
fig = plt.figure(figsize=(20, 20))
sns.scatterplot(data=df, x='x', y='y', c=df['Topic'].map(color_key), alpha=0.4, sizes=(0.4, 10), size="Length")
# Annotate top 50 topics
texts, xs, ys = [], [], []
for row in mean_df.iterrows():
topic = row[1]["Topic"]
name = textwrap.fill(topic_model.custom_labels_[int(topic)], 20)
if int(topic) <= 50:
xs.append(row[1]["x"])
ys.append(row[1]["y"])
texts.append(plt.text(row[1]["x"], row[1]["y"], name, size=10, ha="center", color=color_key[str(int(topic))],
path_effects=[pe.withStroke(linewidth=0.5, foreground="black")]
))
# Adjust annotations such that they do not overlap
adjust_text(texts, x=xs, y=ys, time_lim=1, force_text=(0.01, 0.02), force_static=(0.01, 0.02), force_pull=(0.5, 0.5))
plt.axis('off')
plt.legend('', frameon=False)
plt.show()
If you are, like me, passionate about AI and/or Psychology, please feel free to add me on LinkedIn, follow me on Twitter, or subscribe to my Newsletter:
There have been many interesting, complex, and innovative solutions since the release of ChatGPT. The community has explored countless possibilities for improving its capabilities.
One of those is the well-known Auto-GPT package. With more than 140k stars, it is one of the highest-ranking repositories on Github!
Auto-GPT is an attempt at making GPT-4 fully autonomous.
Auto-GPT gives GPT-4 the power to make its own decisions
It sounds incredible and it definitely is! But how does it work?
In this post, we will go through Auto-GPT’s architecture and explore how it can reach autonomous behavior.
Auto-GPT has an overall architecture, or a main loop of sorts, that it uses to model autonomous behavior.
Let’s start by describing this overall loop, after which we will go through each step in depth:
The core of Auto-GPT is a cyclical sequence of steps:
Initialize the prompt with summarized information
GPT-4 proposes an action
The action is executed
Embed both the input and output of this cycle
Save embeddings to a vector database
These 5 steps make up the core of Auto-GPT and represent its main autonomous behavior.
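As a rough mental model, the loop could be sketched like this in pseudocode — none of these helper functions exist under these names in Auto-GPT; they are placeholders for illustration:
# Hypothetical sketch of Auto-GPT's main loop; all helpers are made-up placeholders
def autonomous_loop(agent_description, goals, vector_db):
    while not goals_reached(goals):
        # 1. Initialize the prompt with a summary of everything done so far
        summary = summarize_history(vector_db)
        prompt = build_prompt(agent_description, goals, summary)

        # 2. GPT-4 proposes an action (thoughts, reasoning, plan, criticism, command)
        proposal = gpt4(prompt)

        # 3. The proposed command is executed (e.g., a web search or writing a file)
        result = execute_command(proposal["command"])

        # 4. Embed both the input and output of this cycle
        embedding = embed(prompt + str(result))

        # 5. Save the embedding so future cycles can "remember" this step
        vector_db.add(embedding, metadata={"prompt": prompt, "result": result})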
Before we go through each step in-depth, there is a step before this cyclical sequence, namely initializing the agent.
Before Auto-GPT completes a task fully autonomously, it first needs to initialize an Agent. This agent essentially describes who GPT-4 is and what goal it should pursue.
Let’s say that we want Auto-GPT to create a recipe for vegan chocolate.
With that goal in mind, we need to give GPT-4 a bit of context about what an agent should be and what it should achieve:
We create a prompt defining two things:
Create 5 highly effective goals (these can be updated later on)
Create an appropriate role-based name (_GPT)
The name helps GPT-4 to continuously remember what it should model. The sub-goals are especially helpful to make small tasks for it to achieve.
Next, we give an example of what the desired output should be:
Giving examples to any generative Large Language Model works really well. By describing what the output should look like, it more easily generates accurate answers.
When we pass this prompt to GPT-4 using Auto-GPT, we get the following response:
It seems that GPT-4 has created a description of RecipeGPT for us. We can give this context to GPT-4 as a system prompt so that it continuously remembers its purpose.
Now that Auto-GPT has created a description of its agent, along with clear goals, it can start by taking its first autonomous action.
The very first step in its cyclical sequence is creating the prompt that triggers an action.
The prompt consists of three components:
System Prompt
Summary
Call to Action
We will go into the summary a bit later but the call to action is nothing more than asking GPT-4 which command it should use. The commands GPT-4 can use are defined in its System Prompt.
The system prompt is the context that we give to GPT-4 so that it remembers certain guidelines that it should follow.
As shown above, it consists of six guidelines:
The goals and description of the initialized Agent
Constraints it should adhere to
Commands it can use
Resources it has access to
Evaluation steps
Example of a valid JSON output
The last five steps are essentially constraints the Agent should adhere to.
Here is a more in-depth overview of what these guidelines and constraints generally look like:
As you can see, the system prompt sketches the boundaries in which GPT-4 can act. For example, in “Resources”, it describes that GPT-4 can use GPT-3.5 Agents for the delegation of simple tasks. Similarly, *“Evaluation”* tells GPT-4 that it should continuously self-criticize its own behavior to improve upon its next actions.
Together, the very first prompt looks a bit like the following:
Notice that in blue “I was created” is mentioned. Typically, this would contain a summary of all the actions it has taken. Since it was just created, it has no action taken before and the summary is nothing more than “I was created”.
In step 2, we give GPT-4 the prompt we defined in the previous step. It can then propose an action to take which should adhere to the following format:
You can see six individual steps being mentioned:
Thoughts
Reasoning
Plan
Criticism
Speak
Action
These steps describe a format of prompting called Reason and ACT (ReACT).
ReACT is one of Auto-GPT’s superpowers!
ReACT allows for GPT-4 to mimic self-criticism and demonstrate more complex reasoning than what is possible if we just ask the model directly.
Whenever we ask GPT-4 a question using the ReACT framework, we ask GPT-4 to output individual thoughts, actions, and observations before coming to a conclusion.
By having the model mimic extensive reasoning, it tends to give more accurate answers compared to directly answering the question.
In our example, Auto-GPT has extended the base ReACT framework and generates the following response:
As you can see, it follows the ReACT pipeline that we described before but includes additional criticism and reasoning steps.
It proposes to search the web to extract more information about popular recipes.
After having generated a response in valid JSON format, we can extract what RecipeGPT wants to do. In this case, it calls for a web search:
and in turn, will execute searching the web:
This action it can take, searching the web, is simply a tool at its disposal that generates a file containing the main body of the page.
Since we explained to GPT-4 in its system prompt that it can use web search, it considers this a valid action.
Auto-GPT is as autonomous as the number of tools it possesses
Do note that if the only tool at its disposal is searching the web, then we can start to argue how autonomous such a model really is!
Either way, we save the output to a file for later use.
Every step Auto-GPT has taken thus far is vital information for any next steps to take. Especially when it needs to take dozens of steps, for example for taking over the world, remembering what it has done thus far is important.
One method of doing so is by embedding the prompts and output it has generated. This allows us to convert text into numerical representations (embeddings) that we can save later on.
These embeddings are generated using OpenAI’s *text-embedding-ada-002* model, which works tremendously well across many use cases.
After having generated the embeddings, we need a place to store them. Pinecone is often used to create the vector database but many other systems can be used as long as you can easily find similar vectors.
The vector database allows us to quickly find information for an input query.
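To make the embed-store-retrieve cycle concrete, here is a hedged sketch using the OpenAI embeddings endpoint and plain cosine similarity as a toy stand-in for a real vector database — this is not Auto-GPT's actual code:
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(text: str) -> np.ndarray:
    # Embed text with the same embedding model Auto-GPT uses
    response = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(response.data[0].embedding)

# Toy "vector database": a list of (embedding, text) pairs
memory = []

def remember(text: str):
    memory.append((embed(text), text))

def recall(query: str, top_k: int = 3):
    # Return the stored steps most similar to the query
    q = embed(query)
    scores = [float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e))) for e, _ in memory]
    best = np.argsort(scores)[::-1][:top_k]
    return [memory[i][1] for i in best]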
We can query the vector database to find all the steps it has taken thus far. Using that information, we ask GPT-4 to create a **summary** of those actions:
This summary is then used to construct the prompt as we did in step 1.
That way, it can “remember” what it has done thus far and think about the next steps to be taken.
This completes the very first cycle of Auto-GPT’s autonomous behavior!
As you might have guessed, the cycle continues from where we started, asking GPT-4 to take action based on a history of actions.
Auto-GPT will continue until it has reached its goal or when you interrupt it.
During this cyclical process, it can keep track of estimated costs in order to make sure you do not spend too much on your Agent.
In the future, especially with the release of Llama2, I expect and hope that local models can reliably be used in Auto-GPT!
These GPT (Generative Pretrained Transformer) models seemingly removed the threshold for diving into Artificial Intelligence for those without a technical background. Anyone can just start asking the models a bunch of stuff and get scarily accurate answers.
At least, most of the time…
When it fails to reproduce the right output, it does not mean it is incapable of doing so. Often, we simply need to change what we ask, the prompt, in a way to guide the model toward the right answer.
This is often referred to as prompt engineering.
Many of the techniques in prompt engineering try to mimic the way humans think. Asking the model to “think aloud” or “let’s think step by step” are great examples of having the model mimic how we think.
These analogies between GPT models and human psychology are important since they help us understand how we can improve the output of GPT models. It shows us capabilities they might be missing.
This does not mean that I am advocating for any GPT model as general intelligence but it is interesting to see how and why we are trying to make GPT models “think” like humans.
Many of the analogies that you will see here are also discussed in this video. Andrej Karpathy shares amazing insights into Large Language Models from a psychological perspective and is definitely worth watching!
As a data scientist and psychologist myself, this is a subject that is close to my heart. It is incredibly interesting to see how these models behave, how we would like them to behave, and how we are nudging these models to behave like us.
There are a number of subjects where analogies between GPT models and human psychology give interesting insights that will be discussed in this article:
DISCLAIMER: When talking about analogies of GPT models with human psychology, there is a risk involved, namely the anthropomorphism of Artificial Intelligence. In other words, humanizing these GPT models. This is definitely not my intention. This post is not about existential risks or general intelligence but merely a fun exercise drawing similarities between us and GPT models. If anything, feel free to take this with a grain of salt!
A prompt is what we ask of a GPT model, for example: “Create a list of 10 book titles”.
When we try different questions in the hopes of improving the performance of the model, then we apply prompt engineering.
In psychology, there are many different forms of prompting individuals to exhibit certain behavior, which is typically used in applied behavior analysis (ABA) to learn new behavior.
There is a distinct difference between how this works in GPT models versus Psychology. In Psychology, prompting is about learning new behavior. Something the individual could not do before. For a GPT model, it is about demonstrating previously unseen behavior.
The main distinction lies in that an individual learns something entirely new and, to a certain degree, changes as an individual. In contrast, the GPT model was already capable of showing that behavior but did not due to its circumstances, namely the prompts. Even when you successfully elicit “appropriate” behavior from the model, the model itself did not change.
Prompting in GPT models is also a lot less subtle. Many of the techniques in prompting are as explicit as they can be (e.g., “You are a scientist. Summarize this article.”).
GPT models are copycats. They, and comparable models, are trained on mountains of textual data and try to replicate it as best they can.
This means that when you ask the model a question, it tries to generate a sequence of words that fits best with what it has seen during training. With enough training data, this sequence of words becomes more and more coherent.
However, such a model has no inherent capabilities of truly understanding the behavior it is mimicking. As with many things in this article, whether a GPT model truly is capable of reasoning is definitely open for discussion and often elicits passionate discussions.
Although we have inherent capabilities for mimicking behavior, it is much more involved and has a grounding in both social constructs and biology. We tend to, to some degree, understand mimicked behavior and can easily generalize it.
We have a preconceived notion of who we are, how our experiences have shaped us, and the views that we have of the world. We have an identity.
GPT models do not have an identity. It has a lot of knowledge about the world we live in and it knows what kind of answers we might prefer, but it has no sense of “self”.
It is not necessarily guided toward certain views like we are. From an identity perspective, it is a blank slate. This means that since a GPT model has a lot of knowledge about the world, it has some capabilities to mimic the identity you ask of it.
But as always, it is just mimicked behavior.
It does have a major advantage. We can ask the model to take on the role of a scientist, writer, editor, etc. and it will try to follow suit. By priming it towards mimicking certain identities, its output will be more tuned toward the task.
This is an interesting subject. There are many sources for evaluating Large Language Models on a wide variety of tests, such as the Hugging Face Leaderboard or using Elo ratings to challenge Large Language Models.
These are important tests to evaluate the capabilities of these models. However, what I consider to be a strength of a certain model, you might not agree with.
This relates to the model itself. Even if we tell it the scores of these tests, it still does not know where its strengths and weaknesses comparatively lie. For example, GPT-4 passed the bar exam, which we generally consider a big strength. However, the model might not realize that merely passing the bar is no longer the strength it seemed when it is in a room full of experienced lawyers.
In other words, it highly depends on the context of the situation when one’s capabilities are considered strengths or weaknesses. The same applies to our own capabilities. I might think myself to be proficient in Large Language Models but if you surround me with people like Andrew Ng, Sebastian Raschka, etc. my knowledge about Large Language Models is suddenly not the strength it was before.
This is important because the model does not instinctively know when something is a strength or weakness, so you should tell it.
For example, if you feel like the model is poor when solving mathematical equations, you can tell it to never perform any calculations itself but use the Wolfram Plugin instead.
In contrast, although we claim to have some notion of our own strengths and weaknesses, these are often subjective and tend to be heavily biased.
As mentioned previously, a GPT model does not know what it is good at or not in specific situations. You can help it make sense of the situation by adding an explanation of the situation to the prompt. By describing the situation, the model is primed towards generating more accurate answers.
This will not always make it capable across tasks. Like humans, explaining the situation helps but does not overcome all their weaknesses.
Instead, when we face something that we are not currently capable of we often rely on tools to overcome them. We use a calculator when doing complex equations or use a car for faster transportation.
This reliance on external tools is not something a GPT model automatically does. You will have to tell the model to use a specific external tool when you are convinced it is not capable of a certain task.
What is important here is that we rely on an enormous amount of tools on a daily basis, your phone, keys, glasses, etc. Giving a GPT model the same capabilities can be a tremendous help to its performance. These external tools are similar to the plugins that OpenAI has available.
A major disadvantage is that these models do not automatically use tools. They will only access plugins if you tell them that doing so is a possibility.
We typically have an inner voice that we converse with when solving difficult problems. “If I do this, then that will be the result, but if I do that, then that might give me a better solution”.
GPT models do not exhibit this behavior automatically. When you ask it a question it simply generates a number of words that most logically would follow that question. Sure, it does compute those words but it does not leverage those words to create this internal monologue.
As it turns out, asking the model to “think aloud” by saying “Let’s think step by step” tends to improve the answers it gives quite a bit. This is called chain-of-thought prompting and tries to emulate the thought processes of human reasoners. This does not necessarily mean that the model is “reasoning”, but it is interesting to see how much it improves performance.
As a nice little bonus, the model does not perform this monologue internally but writes it out, so following along with what the model is “thinking” gives amazing insight into its behavior.
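To make this concrete, below is a minimal sketch of chain-of-thought prompting, assuming the OpenAI Python client (v1-style API); the model name and the example question are placeholders rather than recommendations.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

question = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"

# Appending "Let's think step by step" nudges the model to write out its reasoning
response = client.chat.completions.create(
    model="gpt-4",  # placeholder model name
    messages=[{"role": "user", "content": question + "\n\nLet's think step by step."}],
)
print(response.choices[0].message.content)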
This “inner voice” is quite a bit simplified compared to how ours works. We are much more dynamic in the “conversations” we have with ourselves as well as the way we have those “conversations”. It can be symbolic, motoric, or even emotional in nature. For example, many athletes picture themselves performing the sport they excel in as a way to train for the actual thing. This is called mental imagery.
These conversations allow us to brainstorm. We use this to come up with new ideas, solve problems, and understand the context in which a problem appears. A GPT model, in contrast, will have to be told explicitly to brainstorm a solution through very specific instructions.
We can further relate this to our system 1 and system 2 thinking processes. System 1 thinking is an automatic, intuitive, and near-instantaneous process. We have very little control here. In contrast, system 2 is a conscious, slow, logical, and effortful process.
By giving a GPT model the ability of self-reflection, we are essentially trying to mimic this system 2 way of thinking. The model takes more time to generate an answer and looks over it carefully instead of quickly generating a response.
Roughly, you could say that without any prompt engineering, we enable its system 1 thinking process, whilst with specific instructions and chain-of-thought-like processes, we enable its system 2 way of thinking.
If you want to know more about our system 1 and system 2 thinking, there is an amazing book called Thinking, Fast and Slow that is worth reading!
Andrej Karpathy, in his video mentioned at the beginning of the article, makes a great comparison of a human’s memory capabilities versus that of a GPT model.
Our memory is quite complex, we have long-term memory, working memory, short-term memory, sensory memory, and more.
We can, very roughly, view the memory of a GPT model as four components and compare that to our own memory systems:
Long-term memory
Working memory
Sensory memory
External memory
The long-term memory of a GPT model can be viewed as the things it has learned whilst training on billions of data points. That information is, to a certain degree, represented within the model, which can reproduce it whenever needed. This long-term memory will stick with the model throughout its existence. In contrast, our long-term memory can decay over time, often referred to as the decay theory.
A GPT model’s long-term memory is perfect and does not decay over time
The working memory of a GPT model is everything that fits within the prompt you give it. The model can make full use of that information when performing its computation and giving back a response. This is a great analogy with our own working memory, a type of memory with a limited capacity that temporarily holds information. A GPT model, for instance, will “forget” its prompt after it has given its response. The reason why it seems to remember the conversation is that the conversation history is added to each new prompt.
A GPT model is forgetful when it comes to new information
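To illustrate how that apparent memory works, here is a minimal, hypothetical sketch of a chat loop: the growing message history is simply resent with every new prompt (again assuming the OpenAI Python client).
from openai import OpenAI

client = OpenAI()
history = []  # the model's "working memory" lives entirely in this list

def chat(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(model="gpt-4", messages=history)
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer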
Sensory memory relates to how we hold information derived from our senses, like visual, auditory, and haptic information. We use this information and pass it to our short-term or working memory for processing. This is similar to multi-modal GPT models, models that work on text, images, and even sound.
However, it might be more appropriate to say that GPT models have multi-modal working and long-term memory rather than sensory memory. These models tightly couple multi-modal data with their different forms of “memory”. So, as we have seen before, they rather mimic sensory memory.
A GPT model mimics sensory memory with a multi-modal training procedure
Lastly, GPT models become quite a bit stronger when you give them external memory. This refers to a database of information that it can access whenever it wants, like several books about physics. In contrast, our external memory uses cues from the environment to help us remember certain ideas and sensations. In a way, it is about accessing external information versus remembering internal information.
NOTE: I did not mention short-term memory. There is much discussion about whether short-term and working memory are actually the same thing. A difference often mentioned is that working memory does more than the short-term storage of information; it can also manipulate it. Working memory also makes for a better analogy with a GPT model, so let’s cherry-pick for a bit here.
As we have seen throughout this article, if we want a GPT model to do something, we should tell it.
This is important to note as it relates to a sense of autonomy. By default, we have a certain degree of autonomy. If I decide to grab a drink, I can.
This is different for a GPT model, which has no autonomy by default. It cannot operate independently unless we give it the necessary tools and environment to do so.
We can give a GPT model autonomy by having it create a number of tasks to execute in order to reach a certain end goal. For each task, it writes down the steps for completing it, reflects on them, and executes them if it has the tools to do so.
AutoGPT is a great example of giving a GPT model autonomy
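As a rough illustration of that loop, here is a hypothetical sketch; ask_llm stands in for whatever chat-completion call you use, and the prompts are purely illustrative.
def ask_llm(prompt: str) -> str:
    # Placeholder for a real chat-completion call; plug in your own client here
    raise NotImplementedError

def pursue_goal(goal: str, max_tasks: int = 5) -> None:
    # 1. Let the model break the goal into tasks
    tasks = ask_llm(f"List the tasks needed to achieve: {goal}").splitlines()
    for task in tasks[:max_tasks]:
        # 2. Plan the task, 3. reflect on the plan, 4. execute it
        plan = ask_llm(f"Write down the steps to complete this task: {task}")
        critique = ask_llm(f"Reflect on this plan and improve it:\n{plan}")
        result = ask_llm(f"Execute the improved plan and report the outcome:\n{critique}")
        print(f"{task}\n{result}\n")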
As a result, whatever the model is capable of is very much dependent on its environment, arguably to a larger degree than our environment impacts us. That is saying something, considering how strongly our environment shapes us.
This also means that although a GPT model can show impressively complex autonomous behavior, that behavior is fixed. It cannot decide to use a tool we never told it existed. We, in contrast, are more adaptable to new and previously unknown tools.
A common problem with GPT models is their tendency to confidently say something that is simply not true and not supported by their training data.
For example, when you ask a GPT model to generate factual information, like the revenue of Apple in 2019, it might generate completely false information.
This is called hallucination.
The term stems from hallucination in human psychology, where we perceive something as real whilst in reality it is not. The main difference here is that human hallucination is based on perception, whilst a model “hallucinates” incorrect facts.
It might be more appropriate to compare it with false memories. The tendency of humans to recall something differently from how it actually happened. This is similar to a GPT model that tries to reproduce things that actually never happened.
Interestingly, we can more easily generate false memories with suggestibility, priming, framing, etc. This seems to more closely match how a GPT model “hallucinates” as the prompt it receives is highly influential.
Our memories can also be influenced by prompts/phrases that we receive from others. For example, by asking a person “What shade of red was this car?” we are implicitly providing a person with a supposed “fact”, namely that the car was red even when it was not. This can generate false memories and is referred to as a presupposition.
A little over a month ago, OpenAI released a neural net for English speech recognition called Whisper. It has gained quite some popularity over the last few weeks due to its accuracy, ease of use, and most importantly because they open-sourced it!
With these kinds of releases, I can hardly wait to get my hands on such a model and play around with it. However, I like to have a fun or interesting use case to actually use it for.
So I figured, why not use it for creating transcripts of a channel I always enjoy watching, Kurzgesagt!
It is an amazing channel with incredibly well-explained videos focused on animated educational content, ranging from topics about Climate Change and Dinosaurs to Black Holes and Geoengineering.
I decided to do a little more than just create some transcripts. Instead, let us use BERTopic to see if we can extract the main topics found in Kurzgesagt’s videos.
Hence, this article is a tutorial about using Whisper and BERTopic to extract transcripts from Youtube videos and use topic modeling on top of them.
Before going into the actual code, we first need to install a few packages, namely Whisper, BERTopic, and Pytube.
pip install --upgrade git+https://github.com/openai/whisper.git
pip install git+https://github.com/pytube/pytube.git@refs/pull/1409/merge
pip install bertopic
We are purposefully choosing a specific pull request in Pytube since it fixes an issue with empty channels.
At the very last step, I am briefly introducing an upcoming feature of BERTopic, which you can already install with:
pip install git+https://github.com/MaartenGr/BERTopic.git@refs/pull/840/merge
We start by extracting all the metadata that we need from Kurzgesagt’s YouTube channel. Using Pytube, we can create a Channel object that allows us to extract the URLs and titles of their videos.
# Extract all video_urls
from pytube import YouTube, Channel
c = Channel('https://www.youtube.com/c/inanutshell/videos/')
video_urls = c.video_urls
video_titles = [video.title for video in c.videos]
We are also extracting the titles as they might come in handy when we are visualizing the topics later on.
When we have our URLs, we can start downloading the videos and extracting the transcripts. To create those transcripts, we make use of the recently released Whisper.
The model can be quite daunting for new users but it is essentially a sequence-to-sequence Transformer model which has been trained on several different speech-processing tasks. These tasks are fed into the encoder-decoder structure of the Transformer model which allows Whisper to replace several stages of the traditional speech-processing pipeline.
In other words, because it focuses on jointly representing multiple tasks, it can learn a variety of different processing steps all in a single model!
This is great because we can now use a single model to do all of the processing necessary. Below, we will import our Whisper model:
# Just two lines of code to load in a Whisper model!
import whisper
whisper_model = whisper.load_model("tiny")
Then, we iterate over our YouTube URLs, download the audio, and finally pass them through our Whisper model in order to generate the transcriptions:
# Infer all texts
texts = []
for url in video_urls[:100]:
    path = YouTube(url).streams.filter(only_audio=True)[0].download(filename="audio.mp4")
    transcription = whisper_model.transcribe(path)
    texts.append(transcription["text"])
And that is it! We now have transcriptions from 100 videos of Kurzgesagt.
NOTE: I opted for the tiny model due to its speed, but there are more accurate models in Whisper that are worth checking out.
BERTopic approaches topic modeling as a clustering task and as a result, assigns a single document to a single topic. To circumvent this, we can split our transcripts into sentences and run BERTopic on those:
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # the sentence tokenizer needs the punkt models

# Sentencize the transcripts and track their titles
docs = []
titles = []
for text, title in zip(texts, video_titles):
    sentences = sent_tokenize(text)
    docs.extend(sentences)
    titles.extend([title] * len(sentences))
Not only do we then have more data to train on, but we can also create more accurate topic representations.
NOTE: There might or might not be a feature for topic distributions coming up in BERTopic…
BERTopic is a topic modeling technique that focuses on modularity, transparency, and human evaluation. It is a framework that allows users to, within certain boundaries, build their own custom topic model.
BERTopic works by following a linear pipeline of clustering and topic extraction:
At each step of the pipeline, it makes few assumptions about all steps that came before that. For example, the c-TF-IDF representation does not care which input embeddings are used. This guiding philosophy of BERTopic allows for the sub-components to easily be swapped out. As a result, you can build your model however you like:
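As a small illustration of that modularity (a sketch, assuming a recent BERTopic version), you could for instance swap HDBSCAN for k-Means while keeping the rest of the pipeline untouched:
from sklearn.cluster import KMeans
from bertopic import BERTopic

# Any clustering model that implements .fit and .predict can take HDBSCAN's place
cluster_model = KMeans(n_clusters=50)
topic_model = BERTopic(hdbscan_model=cluster_model)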
Although we can use BERTopic in just a few lines, it is worthwhile to generate our embeddings up front such that we can reuse them multiple times later on without the need to regenerate them:
from sentence_transformers import SentenceTransformer
# Create embeddings from the documents
sentence_model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
embeddings = sentence_model.encode(docs)
Although the content of Kurzgesagt is in English, there might be some non-English terms out there, so I opted for a multilingual sentence-transformer model.
After having generated our embeddings, I wanted to tweak the sub-models slightly in order to best fit with our data:
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
# Define sub-models
vectorizer = CountVectorizer(stop_words="english")
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=20, min_samples=2, metric='euclidean', cluster_selection_method='eom')
# Train our topic model with BERTopic
topic_model = BERTopic(
    embedding_model=sentence_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer
).fit(docs, embeddings)
Now that we have fitted our BERTopic model, let us take a look at some of its topics. To do so, we run topic_model.get_topic_info().head(10) to get a dataframe of the most frequent topics:
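For reference, that call looks as follows; the topics are sorted by frequency and topic -1 collects the outlier sentences.
# Show the ten most frequent topics
freq = topic_model.get_topic_info()
freq.head(10)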
We can see topics about food, cells, the galaxy, and many more!
Although the model found some interesting topics, it seems like a lot of work to go through them all by hand. Instead, we can use a number of visualization techniques that make it a bit easier.
First, it might be worthwhile to generate some nicer-looking labels. To do so, we are going to generate our topic labels with generate_topic_labels.
We want the top 3 words per topic, separated by commas, and we are not so much interested in a topic prefix.
# Generate nicer looking labels and set them in our model
topic_labels = topic_model.generate_topic_labels(nr_words=3,
                                                 topic_prefix=False,
                                                 word_length=15,
                                                 separator=", ")
topic_model.set_topic_labels(topic_labels)
Now, we are ready to perform some interesting visualizations. First off, .visualize_documents! This method aims to visualize the documents and their corresponding topics interactively in a 2D space:
# Manually selected some interesting topics to prevent information overload
topics_of_interest = [33, 1, 8, 9, 0, 30, 27, 19, 16,
                      28, 44, 11, 21, 23, 26, 2, 37, 34, 3, 4, 5,
                      15, 17, 22, 38]

# I added the title to the documents themselves for easier interactivity
adjusted_docs = ["<b>" + title + "</b><br>" + doc[:100] + "..."
                 for doc, title in zip(docs, titles)]

# Visualize documents
topic_model.visualize_documents(
    adjusted_docs,
    embeddings=embeddings,
    hide_annotations=False,
    topics=topics_of_interest,
    custom_labels=True
)
As can be seen in the visualization above, we have a number of very different topics, ranging from dinosaurs and climate change to bacteria and even ants!
Since we have split each video up into sentences, we can model the distribution of topics per video. I recently saw a video called “What Happens if a Supervolcano Blows Up?”, so let’s see which topics can be found in that video:
import pandas as pd

# Topic frequency in "What Happens if a Supervolcano Blows Up?"
video_topics = [topic_model.custom_labels_[topic + 1]
                for topic, title in zip(topic_model.topics_, titles)
                if title == "What Happens if a Supervolcano Blows Up?"
                and topic != -1]
counts = pd.DataFrame({"Topic": video_topics}).value_counts()
counts
As expected, it seems to be mostly related to a topic about volcanic eruptions but also explosions in general.
In the upcoming BERTopic v0.13 release, there is the possibility to approximate the topic distributions for any document regardless of its size.
The method works by creating a sliding window over the document and calculating each window’s similarity to every topic:
We can generate these distributions for all of our documents by running the following and making sure that we calculate the distributions on a token level:
# We need to calculate the topic distributions on a token level
topic_distr, topic_token_distr = topic_model.approximate_distribution(
    docs, calculate_tokens=True
)
Now we need to choose a piece of text over which to model the topics. For that, I thought it would be interesting to explore how the model handles the Brilliant advertisement at the end of Kurzgesagt’s videos:
And with free trial of brilliant premium you can explore everything brilliant has to offer.
We input that document and run our visualization:
# Create a visualization using a styled dataframe if Jinja2 is installed
df = topic_model.visualize_approximate_distribution(docs[100], topic_token_distr[100])
df
As we can see, it seems to pick up topics about Brilliant and memberships, which seems to make sense in this case.
Interestingly, with this approach, we can take into account that there are not only multiple topics per document but even multiple topics per token!
As a result, we data scientists use this freely available software that is driving so many technologies whilst still having the opportunity to be involved in its development.
Over the last few years, I was fortunate enough to be involved in open-source and had the opportunity to develop and maintain several packages!
Developing open-source is more than just coding
During this time, there were plenty of hurdles to overcome and lessons to be learned. From tricky dependencies and API design choices to communication with the user base.
Working on open-source, whether as an author, maintainer or developer, can be quite daunting! With this article, I share some of my experiences in this field which hopefully helps those wanting to develop open-source.
When you create open-source software, you are typically not making the package exclusively for yourself. Users, from all types of different backgrounds, will be making use of your software. Proper documentation comes a long way in helping those users get started.
Moreover, do not underestimate the impact documentation can have on the usability of your package! You can use it to explain complex algorithms, give extensive tutorials, show use cases, and even allow for interactive examples.
Especially data science-related software can be difficult to understand when it involves complex algorithms. Approaching these explanations like a story has often helped me in making them more intuitive.
Trust me, writing good documentation is a skill in itself.
Another benefit is that writing solid documentation lowers the time spent on issues. There is less reason for users to ask questions if they can find the answers in your documentation.
An overview of how KeyBERT works is found in the documentation.
However, creating documentation is more than just writing it. Visualizing your algorithm or software goes a long way in making it intuitive. You can learn quite a lot from Jay Alammar when you want to visualize algorithmic principles in your documentation. His visualizations even ended up in the official Numpy documentation!
Your user base, the community, is an important component of your software. Since we are developing open-source, it is safe to say that we want them to be involved in the development.
By engaging with the community you entice them to share issues and bugs, but also feature requests and great ideas for further development! All of these help in creating something for them.
The open-source community is truly more than the sum of its parts
Many core features in BERTopic, like online topic modeling, have been implemented since they were highly requested by its users. As a result, the community is quite active and has been a tremendous help in detecting issues and developing new features.
Implementing feature requests by the community goes a long way! An excerpt of the discussion here.
Whether your package will be used millions of times or just a few, creating one is an excellent opportunity to learn more about open-source, MLOps, unit testing, API design, etc. I have learned more about these skills in developing open-source than I would have in my day-to-day job.
There is also a huge learning opportunity in interacting with the community itself. They are the ones that tell you which designs they like and which they do not. At times, I have seen the same issue popping up several times over the course of a few months. This indicates that I should rethink the design, as it was not as user-friendly as I had anticipated!
On top of that, developing open-source projects has given me the opportunity to collaborate with other developers.
Working on your own open-source projects outside of work does come with its disadvantages. To me, the most significant one is that maintaining the package, answering questions, and participating in the discussions can be quite a lot of work.
It definitely helps if you are intrinsically motivated but it still takes quite some time to make sure everything is held together.
Fortunately, you can look towards your community to help you out when answering questions, showcasing use cases, etc.
Over the course of the last few years, I have learned to be a bit more relaxed when it comes to breaking changes. Especially when it concerns dependencies, sometimes there is only so much you can do!
Knowing how often your package is used is a tremendous help in understanding how popular it is. However, many are still using Github stars to equate a package with quality and popularity.
Make sure to define the right metric. GitHub stars can be exaggerated simply through good marketing, and many stars do not necessarily imply quality. As data scientists, we must first understand what it is that we are exactly measuring. GitHub stars are nothing more than a user giving a star to a package. It does not even mean that they have used the software or that it actually works!
The number of downloads for KeyBERT. A much better indicator than Github stars.
Technically, I can pay a thousand people to star my repos. Instead, I focus on a variety of statistics, like downloads and forks, but also the number of issues I get on a daily basis.
For example, it is great if your package gets featured on Hacker News, but that does not tell you whether it is consistently used.
As a psychologist, I tend to focus a lot on the design of my packages. This includes things like documentation and tutorials but it even translates to how I code.
Making sure that the package is easy to use and install makes adoption much simpler. Especially when you focus on design philosophies such as modularity and transparency, some packages become a blast to use.
The modular design of topic modeling with BERTopic.
Taking the perspective of a psychologist whilst developing new features has made it much easier to know what to focus on. What are users looking for? How can I code in a way that explains the algorithm? Why are users actually using this package? What are the major disadvantages of my code?
Taking the time to understand the average user drives adoption
All of the above often leads to a basic but important rule: Keep It Super Simple.
Personally, if I find a new package difficult to install and use, I am less likely to adopt it in my workflow.