
v0.13 #840

Merged: 53 commits merged into master from v0.13 on Jan 4, 2023
Conversation

MaartenGr (Owner) commented on Nov 15, 2022

Highlights:

  • Calculate topic distributions with .approximate_distribution regardless of the cluster model used
    • Generates topic distributions at the document and token level
    • Can be used for any document regardless of its size!
  • Fully supervised BERTopic
    • You can now use a classification model for the clustering step instead, creating a fully supervised topic model
  • Manual topic modeling
    • Generate topic representations from labels directly
    • Allows for skipping the embedding and clustering steps in order to go directly to the topic representation step
  • Reduce outliers with 4 different strategies using .reduce_outliers
  • Install BERTopic without SentenceTransformers for a lightweight package:
    • pip install --no-deps bertopic
    • pip install --upgrade numpy hdbscan umap-learn pandas scikit-learn tqdm plotly pyyaml
  • Get metadata of trained documents, such as topics and probabilities, using .get_document_info(docs)
  • Added more support for cuML's HDBSCAN
    • Calculate and predict probabilities during fit_transform and transform respectively
    • This should give a major speed-up when setting calculate_probabilities=True
  • Added more images to the documentation, along with many changes/updates/clarifications
  • Get representative documents for non-HDBSCAN models by comparing document and topic c-TF-IDF representations
  • Sklearn Pipeline Embedder by @koaning in #791

Fixes:

Documentation

Personally, I believe that documentation can be seen as a feature and is an often underestimated aspect of open source. So I went a bit overboard😅... and created an animation about the three pillars of BERTopic using Manim. Many other visualizations were added, one for each variation of BERTopic, along with many smaller changes.

[Animation: BERTopicOverview.mp4]

Topic Distributions

The difficulty with a cluster-based topic modeling technique is that it does not directly consider that documents may contain multiple topics. With the new release, we can now model the distributions of topics! We even consider that a single word might be related to multiple topics. If a document is a mixture of topics, what is preventing a single word from being one as well?

[Image: approximate_distribution]

To do so, we approximate the distribution of topics in a document by calculating and summing the similarities of token sets (created by applying a sliding window over the document) with the topics:

# After fitting your model run the following for either your trained documents or even unseen documents
topic_distr, _ = topic_model.approximate_distribution(docs)
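
For a single document, the resulting row of topic_distr can be plotted with the existing visualize_distribution function; a minimal sketch, assuming the model was fitted as above:

# topic_distr has shape (n_documents, n_topics); visualize the first document's distribution
topic_model.visualize_distribution(topic_distr[0])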

To calculate and visualize the topic distributions in a document on a token-level, we can run the following:

# We need to calculate the topic distributions on a token level
topic_distr, topic_token_distr = topic_model.approximate_distribution(docs, calculate_tokens=True)

# Create a visualization using a styled dataframe if Jinja2 is installed
df = topic_model.visualize_approximate_distribution(docs[0], topic_token_distr[0]); df


Supervised Topic Modeling

BERTopic now supports fully-supervised classification! Instead of using a clustering algorithm, like HDBSCAN, we can replace it with a classifier, like Logistic Regression:

from bertopic import BERTopic
from bertopic.dimensionality import BaseDimensionalityReduction
from sklearn.datasets import fetch_20newsgroups
from sklearn.linear_model import LogisticRegression

# Get labeled data
data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
docs = data['data']
y = data['target']

# Allows us to skip over the dimensionality reduction step
empty_dimensionality_model = BaseDimensionalityReduction()

# Create a classifier to be used instead of the cluster model
clf = LogisticRegression()

# Create a fully supervised BERTopic instance
topic_model = BERTopic(
    umap_model=empty_dimensionality_model,
    hdbscan_model=clf
)
topics, probs = topic_model.fit_transform(docs, y=y)
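
Once fitted this way, the supervised model behaves like any other BERTopic model, so unseen documents can still be assigned a topic. A small sketch, where new_docs is a hypothetical list of strings:

# Assign topics to unseen documents using the trained supervised model
new_docs = ["The goalie made an incredible save in the third period."]
pred_topics, pred_probs = topic_model.transform(new_docs)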

Manual Topic Modeling

When you already have a set of labels and simply want to extract topic representations from them, you might not need to learn how those labels can be predicted. We can bypass the embeddings -> dimensionality reduction -> clustering steps and go straight to the c-TF-IDF representation of our labels:

from bertopic import BERTopic
from bertopic.backend import BaseEmbedder
from bertopic.cluster import BaseCluster
from bertopic.dimensionality import BaseDimensionalityReduction

# Prepare our empty sub-models
empty_embedding_model = BaseEmbedder()
empty_dimensionality_model = BaseDimensionalityReduction()
empty_cluster_model = BaseCluster()

# Fit BERTopic without actually performing any clustering
topic_model = BERTopic(
    embedding_model=empty_embedding_model,
    umap_model=empty_dimensionality_model,
    hdbscan_model=empty_cluster_model,
)
topics, probs = topic_model.fit_transform(docs, y=y)
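
Because the resulting topics mirror your own labels, it can be handy to map the numeric topic IDs back to readable label names. A hedged sketch, assuming the 20 newsgroups data and variables from the supervised example above:

# Find the original newsgroup label that dominates each numeric topic (purely illustrative)
import pandas as pd

label_names = [data["target_names"][i] for i in y]
df = pd.DataFrame({"topic": topics, "label": label_names})
topic_to_label = df.groupby("topic")["label"].agg(lambda s: s.value_counts().index[0])
print(topic_to_label.head())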

Outlier Reduction

Outlier reduction is a frequently discussed topic in BERTopic, as its default cluster model, HDBSCAN, has a tendency to generate many outliers. This often helps in the topic representation step, as we do not consider documents that are less relevant, but you might still want to assign those outliers to actual topics. In the modular philosophy of BERTopic, keeping training times in mind, it is now possible to perform outlier reduction after having trained your topic model. This allows for easy iteration and prevents having to train BERTopic many times to find the parameters you are searching for. There are 4 different strategies that you can use, so make sure to check out the documentation!

Using it is rather straightforward:

new_topics = topic_model.reduce_outliers(docs, topics)
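
The strategy is selected with the strategy parameter, and the new assignments can then be fed back into the model to refresh its topic representations; a sketch, assuming the trained topic_model and the original docs and topics from before:

# Reassign outliers by comparing document and topic c-TF-IDF representations
new_topics = topic_model.reduce_outliers(docs, topics, strategy="c-tf-idf")

# Recompute the topic representations with the outlier-free assignments
topic_model.update_topics(docs, topics=new_topics)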

Lightweight BERTopic

The default embedding model in BERTopic is one of the amazing sentence-transformers models, namely "all-MiniLM-L6-v2". Although this model performs well out of the box, it typically needs a GPU to transform the documents into embeddings in a reasonable time. Moreover, the installation requires pytorch which often results in a rather large environment, memory-wise.

Fortunately, it is possible to install BERTopic without sentence-transformers and use it as a lightweight solution instead. The installation can be done as follows:

pip install --no-deps bertopic
pip install --upgrade numpy hdbscan umap-learn pandas scikit-learn tqdm plotly pyyaml

Then, we can use BERTopic with a CPU-based embedding technique instead of sentence-transformers:

from bertopic import BERTopic
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

pipe = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(100)
)

topic_model = BERTopic(embedding_model=pipe)

As a result, the entire package and resulting model can be run quickly on the CPU and no GPU is necessary!
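
As a quick end-to-end check of the lightweight setup, the pipeline-based model can be fitted just like any other BERTopic model; a sketch, reusing the 20 newsgroups documents loaded earlier:

# Fit the CPU-only model; the TF-IDF + SVD pipeline replaces the default sentence-transformer
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())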

Document Information

Get information about the documents on which the topic model was trained, including the documents themselves, their respective topics, the name of each topic, the top n words of each topic, whether it is a representative document, and the probability of the clustering if the cluster model supports it. There are also options to include other metadata, such as the topic distributions or the x and y coordinates of the reduced embeddings, which you can learn more about in the documentation.

To get the document info, you will only need to pass the documents on which the topic model was trained:

>>> topic_model.get_document_info(docs)

Document                              Topic   Name                        Top_n_words                Probability   ...
I am sure some bashers of Pens...         0   0_game_team_games_season    game - team - games...        0.200010   ...
My brother is in the market for...       -1   -1_can_your_will_any        can - your - will...          0.420668   ...
Finally you said what you dream...       -1   -1_can_your_will_any        can - your - will...          0.807259   ...
Think! It's the SCSI card doing...       49   49_windows_drive_dos_file   windows - drive - docs...     0.071746   ...
1) I have an old Jasmine drive...        49   49_windows_drive_dos_file   windows - drive - docs...     0.038983   ...
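
Since the result is a regular pandas DataFrame, it can be filtered directly; for example, to keep only the documents flagged as representative of their topic (a sketch, assuming the default column names):

# get_document_info returns a pandas DataFrame with one row per training document
doc_info = topic_model.get_document_info(docs)

# Keep only the documents marked as representative of their topic
representative_docs = doc_info[doc_info["Representative_document"] == True]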

MaartenGr merged commit 06dcd47 into master on Jan 4, 2023
MaartenGr deleted the v0.13 branch on May 4, 2023