# -*- coding: utf-8 -*-
"""intuitions-about-bertopic.ipynb
Automatically generated by Colab.
Original file is located at
https://colab.research.google.com/drive/17SuUlYoAk7Wjr00R_ySKt8_ghSjL639y
https://www.kaggle.com/code/johannareiml/intuitions-about-bertopic/
Tutorial - Topic Modeling with BERTopic
ADAPTED FROM THE ORIGINAL NOTEBOOK.
(last updated 11th July 2024)
In this tutorial we will explore how to use BERTopic to create topics from the well-known 20 Newsgroups dataset. The most frequent use cases and methods are discussed, together with important parameters to keep an eye out for.
BERTopic
BERTopic is a topic modeling technique that leverages 🤗 transformers and a custom class-based TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.
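As a rough sketch of the class-based TF-IDF idea (following the formulation in the BERTopic paper; the library's exact weighting may differ in detail): all documents in a cluster are concatenated and treated as a single class c, and each term t is scored as

$$W_{t,c} = \mathrm{tf}_{t,c} \cdot \log\left(1 + \frac{A}{f_t}\right)$$

where tf_{t,c} is the frequency of term t within class c, f_t is the frequency of term t across all classes, and A is the average number of words per class. Terms that are frequent in one cluster but rare elsewhere therefore dominate that topic's description.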
Enabling the GPU
First, you'll need to enable GPUs for the notebook:
Navigate to Notebook Settings in the sidebar on the right
Open the Session options section
Select "GPU T4 x2" from the Accelerator drop-down
Installing BERTopic
We start by installing BERTopic from PyPi:
"""
!pip install bertopic
"""
Data
For this example, we use the popular 20 Newsgroups dataset, which contains roughly 18,000 newsgroup posts. We only consider the 'soc.religion.christian' and 'alt.atheism' groups for demo purposes.
"""
!pip install datasets
from datasets import load_dataset
# load our filtered 20newsgroups dataset from huggingface
dataset = load_dataset("aihpi/20_newsgroups_demo", split="train")
dataset
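# Optional sanity check: peek at the start of one raw post
# (a minimal sketch; it assumes the "text" column, which is also used further below)
print(dataset[0]["text"][:300])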
"""
Topic Modeling
In this example, we will go through the main components of BERTopic and the steps necessary to create a strong topic model.
Training
We start by instantiating BERTopic. We set language to english since our documents are in the English language. If you would like to use a multi-lingual model, please use language="multilingual" instead.
We will also calculate the topic probabilities (calculate_probabilities=True below). However, this can slow down BERTopic significantly on large amounts of data (>100_000 documents). It is advised to turn this off if you want to speed up the model.
"""
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
# the CountVectorizer turns documents into word counts; stop words are common tokens
# like "the" or "a" that carry little topical meaning and are removed
vectorizer_model = CountVectorizer(stop_words="english")
# calculate_probabilities=True computes the full document-topic probability matrix, as discussed above
topic_model = BERTopic(vectorizer_model=vectorizer_model, language="english", calculate_probabilities=True, verbose=True)
# get the text from the dataset and fit the topic model
docs = dataset["text"]
topic_model.fit(docs);
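# Because calculate_probabilities=True was set, the fitted model stores a full
# document-topic probability matrix (a quick check; rows: documents, columns: topics)
print(topic_model.probabilities_.shape)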
"""
Extracting Topics
After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents. -1 refers to all outliers and should typically be ignored.
NOTE: BERTopic is stochastic which means that the topics might differ across runs. This is mostly due to the stochastic nature of UMAP.
"""
df = topic_model.get_topic_info()
df
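# get_topic returns the top terms of a single topic together with their
# c-TF-IDF scores; topic 0 is the largest real (non-outlier) topic
topic_model.get_topic(0)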
"""
Attributes
There are a number of attributes that you can access after having trained your BERTopic model:
Attribute                Description
topics_                  The topics that are generated for each document after training or updating the topic model.
probabilities_           The probabilities that are generated for each document if HDBSCAN is used.
topic_sizes_             The size of each topic.
topic_mapper_            A class for tracking topics and their mappings anytime they are merged/reduced.
topic_representations_   The top n terms per topic and their respective c-TF-IDF values.
c_tf_idf_                The topic-term matrix as calculated through c-TF-IDF.
topic_labels_            The default labels for each topic.
custom_labels_           Custom labels for each topic as generated through .set_topic_labels.
topic_embeddings_        The embeddings for each topic if embedding_model was used.
representative_docs_     The representative documents for each topic if HDBSCAN is used.
For example, to access the predicted topics for the first few documents, we simply run the following:
"""
# inspect the predicted topic label for the first few documents
n_examples = 3
example_docs = docs[:n_examples]
example_docs_topic_id = topic_model.topics_[:n_examples]
for doc, topic_id in zip(example_docs, example_docs_topic_id):
    # topic_labels_ maps a topic id to its default label, e.g. "0_word1_word2_..."
    topic = topic_model.topic_labels_[topic_id]
    print("-----------------------")
    print(topic, doc)
    print("-----------------------")
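# topic_sizes_ maps each topic id to the number of documents assigned to it;
# topic -1 again collects the outliers (a quick sketch of another attribute)
print(topic_model.topic_sizes_)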
"""
Visualization
There are several visualization options available in BERTopic, namely the visualization of topics, probabilities and topics over time. Topic modeling is, to a certain extent, quite subjective. Visualizations help understand the topics that were created.
Visualize Topics
After having trained our BERTopic model, we could iteratively go through perhaps a hundred topics to get a good understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. Instead, we can visualize the topics that were generated in a way very similar to LDAvis:
"""
topic_model.visualize_topics()
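# BERTopic visualizations are Plotly figures, so they can be exported and shared
# (a minimal sketch; "intertopic_distance.html" is just an example filename)
fig = topic_model.visualize_topics()
fig.write_html("intertopic_distance.html")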
"""
Visualize Topic Hierarchy
The topics that were created can be hierarchically reduced. In order to understand the potential hierarchical structure of the topics, we can use scipy.cluster.hierarchy to create clusters and visualize how they relate to one another. This might help in selecting an appropriate nr_topics when reducing the number of topics that you have created.
"""
topic_model.visualize_hierarchy()
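# If the dendrogram suggests that some topics belong together, they can be merged
# after training (a sketch: reduce_topics updates the model in place, and
# nr_topics=10 is an arbitrary example value; pick it from the hierarchy above)
topic_model.reduce_topics(docs, nr_topics=10)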
"""
Visualize Terms
We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.
"""
topic_model.visualize_barchart(top_n_topics=5)
"""
Model serialization
The model and its internal settings can easily be saved. Note that the documents and embeddings will not be saved. With safetensors serialization, the underlying UMAP and HDBSCAN models are also not saved; use serialization="pickle" if you need to persist the full pipeline.
"""
# Save model (set save_ctfidf=True if you also want to persist the c-TF-IDF weights)
topic_model.save("20newsgroup_model", serialization="safetensors")
# Load model
my_model = BERTopic.load("20newsgroup_model")
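# The loaded model can assign topics to unseen documents (a minimal sketch;
# transform re-embeds the texts, so the embedding model must be available locally)
new_docs = ["I have been reading the Bible and thinking a lot about faith."]
topics, probs = my_model.transform(new_docs)
print(topics)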