v0.13 #840

Merged 53 commits on Jan 4, 2023
Commits
94f47c2
Test approximation of topic distribution
MaartenGr Nov 15, 2022
d36a9bd
Test without jinja for now
MaartenGr Nov 15, 2022
776ee49
Update tests
MaartenGr Nov 15, 2022
137b421
Take empty documents into account for approximating topic distributions
MaartenGr Nov 15, 2022
1965e49
Added padding and batch_size parameters, more documentation and examples
MaartenGr Nov 15, 2022
6558628
get_representative_docs now works for all cluster models
MaartenGr Nov 17, 2022
b9a0d1b
Fix online learning bug
MaartenGr Nov 17, 2022
763c294
Prepare for light weight installation option
MaartenGr Nov 17, 2022
baf1593
Fix lightweight installation + update docs
MaartenGr Nov 17, 2022
dd720f9
Fully supervised BERTopic by adding classification to the cluster step
MaartenGr Nov 20, 2022
52ff7ba
Add empty dimensionality and cluster modules for manual topic modelin…
MaartenGr Nov 20, 2022
ce560e0
Start documentation for manual topic modeling
MaartenGr Nov 23, 2022
307ecb7
More documentation for manual topic modeling
MaartenGr Nov 23, 2022
c842bde
Added supervised documentation
MaartenGr Nov 24, 2022
5fffe49
Added support for cuML's HDBSCAN approximate_predict and all_points_m…
MaartenGr Nov 25, 2022
3c7cbed
A lot of documentation updates, added several images
MaartenGr Nov 26, 2022
7d3651e
Add lightweight installation and usage
MaartenGr Nov 27, 2022
ceca56a
Added loads of images for all BERTopic extensions
MaartenGr Nov 28, 2022
00b626c
Update documentation
MaartenGr Nov 28, 2022
0a689ef
Lots of small documentation changes
MaartenGr Nov 29, 2022
79c4a44
Fix #807
MaartenGr Nov 29, 2022
fd9f22b
Up HDBSCAN version and fix #782
MaartenGr Nov 29, 2022
a7927a2
Fix #744
MaartenGr Nov 29, 2022
57963cf
Fix #703
MaartenGr Nov 29, 2022
5466255
Correct index
MaartenGr Nov 29, 2022
d183bb1
Fix #837 by updating the documentation
MaartenGr Nov 29, 2022
e93823d
Update documentation and allow for skipping over embedding with a mod…
MaartenGr Dec 1, 2022
a592127
Small doc changes
MaartenGr Dec 1, 2022
97c11cc
Add three pillars of BERTopic animation using Manim Community
MaartenGr Dec 4, 2022
ac725fc
Doc change
MaartenGr Dec 6, 2022
4a29f8a
.
MaartenGr Dec 6, 2022
61b697b
Add documentation on how to install cuml on google colab
MaartenGr Dec 6, 2022
3d02de3
Fix #871
MaartenGr Dec 9, 2022
1a3bf04
Catch import error
MaartenGr Dec 9, 2022
e5843e2
Up version
MaartenGr Dec 9, 2022
b304e21
Update sklearn pipeline documentation
MaartenGr Dec 14, 2022
32e3622
Update README
MaartenGr Dec 15, 2022
ad83b2b
Different namespace cuml
MaartenGr Dec 16, 2022
2e0a717
Added function to reduce outliers, documentation, and tests
MaartenGr Dec 18, 2022
1e78429
Update testing and add dataframe for approximate distribution to visu…
MaartenGr Dec 18, 2022
f0fdf0d
Update documentation
MaartenGr Dec 18, 2022
6849a19
Add .get_document_info to get meta data on trained documents
MaartenGr Dec 19, 2022
3eb44d0
Merge from main branch to keep track of recent PRs
MaartenGr Dec 19, 2022
da5a73e
Fix spacy merge
MaartenGr Dec 19, 2022
ea6c9bd
Updated empty document in spacy as no vector was returned otherwise
MaartenGr Dec 19, 2022
8a0df22
Fix gensim empty document
MaartenGr Dec 19, 2022
7c75010
Prepare changelog, small changes
MaartenGr Dec 20, 2022
0bf22e6
Fixed seed for sampling representative docs
MaartenGr Dec 25, 2022
6aa2274
Add testing, doc updates
MaartenGr Dec 27, 2022
cb36a49
Small changes
MaartenGr Dec 27, 2022
637d4fe
Update docs
MaartenGr Dec 28, 2022
93ff4ca
Merge remote-tracking branch 'origin/master' into v0.13
MaartenGr Jan 3, 2023
db8c1be
Small doc change
MaartenGr Jan 3, 2023
2 changes: 1 addition & 1 deletion .flake8
@@ -1,2 +1,2 @@
[flake8]
[flake8]
max-line-length = 160
2 changes: 1 addition & 1 deletion .github/workflows/testing.yml
@@ -25,7 +25,7 @@ jobs:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
python -m pip install --upgrade pip
pip install -e ".[test]"
- name: Run Checking Mechanisms
run: make check
1 change: 1 addition & 0 deletions .gitignore
@@ -73,6 +73,7 @@ ENV/
env.bak/
venv.bak/

# Artifacts
.idea
.idea/
.vscode
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

-Copyright (c) 2022, Maarten P. Grootendorst
+Copyright (c) 2023, Maarten P. Grootendorst

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
119 changes: 55 additions & 64 deletions README.md
@@ -15,13 +15,16 @@ allowing for easily interpretable topics whilst keeping important words in the t

BERTopic supports
[**guided**](https://maartengr.github.io/BERTopic/getting_started/guided/guided.html),
(semi-) [**supervised**](https://maartengr.github.io/BERTopic/getting_started/supervised/supervised.html),
[**supervised**](https://maartengr.github.io/BERTopic/getting_started/supervised/supervised.html),
[**semi-supervised**](https://maartengr.github.io/BERTopic/getting_started/semisupervised/semisupervised.html),
[**manual**](https://maartengr.github.io/BERTopic/getting_started/manual/manual.html),
[**long-document**](https://maartengr.github.io/BERTopic/getting_started/distribution/distribution.html),
[**hierarchical**](https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.html),
[**class-based**](https://maartengr.github.io/BERTopic/getting_started/topicsperclass/topicsperclass.html),
[**dynamic**](https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html), and
[**online**](https://maartengr.github.io/BERTopic/getting_started/online/online.html) topic modeling. It even supports visualizations similar to LDAvis!

Corresponding medium posts can be found [here](https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6?source=friends_link&sk=0b5a470c006d1842ad4c8a3057063a99)
and [here](https://towardsdatascience.com/interactive-topic-modeling-with-bertopic-1ea55e7d73d8?sk=03c2168e9e74b6bda2a1f3ed953427e4). For a more detailed overview, you can read the [paper](https://arxiv.org/abs/2203.05794).
Corresponding medium posts can be found [here](https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6?source=friends_link&sk=0b5a470c006d1842ad4c8a3057063a99), [here](https://towardsdatascience.com/interactive-topic-modeling-with-bertopic-1ea55e7d73d8?sk=03c2168e9e74b6bda2a1f3ed953427e4) and [here](https://towardsdatascience.com/using-whisper-and-bertopic-to-model-kurzgesagts-videos-7d8a63139bdf?sk=b1e0fd46f70cb15e8422b4794a81161d). For a more detailed overview, you can read the [paper](https://arxiv.org/abs/2203.05794) or see a [brief overview](https://maartengr.github.io/BERTopic/algorithm/algorithm.html).

## Installation

@@ -31,8 +34,7 @@ Installation, with sentence-transformers, can be done using [pypi](https://pypi.
pip install bertopic
```

You may want to install more depending on the transformers and language backends that you will be using.
The possible installations are:
If you want to install BERTopic with other embedding models, you can choose one of the following:

```bash
pip install bertopic[flair]
@@ -82,8 +84,8 @@ Topic Count Name
3 381 22_key_encryption_keys_encrypted
```

-1 refers to all outliers and should typically be ignored. Next, let's take a look at the most
frequent topic that was generated, topic 0:
The `-1` topic refers to all outlier documents and is typically ignored. Next, let's take a look at the most
frequent topic that was generated:

```python
>>> topic_model.get_topic(0)
@@ -100,7 +102,22 @@ frequent topic that was generated, topic 0:
('pc', 0.003047105930670237)]
```

**NOTE**: Use `BERTopic(language="multilingual")` to select a model that supports 50+ languages.
Using `.get_document_info`, we can also extract information on a document level, such as their corresponding topics, probabilities, whether they are representative documents for a topic, etc.:

```python
>>> topic_model.get_document_info(docs)

Document Topic Name Top_n_words Probability ...
I am sure some bashers of Pens... 0 0_game_team_games_season game - team - games... 0.200010 ...
My brother is in the market for... -1 -1_can_your_will_any can - your - will... 0.420668 ...
Finally you said what you dream... -1 -1_can_your_will_any can - your - will... 0.807259 ...
Think! It's the SCSI card doing... 49 49_windows_drive_dos_file windows - drive - docs... 0.071746 ...
1) I have an old Jasmine drive... 49 49_windows_drive_dos_file windows - drive - docs... 0.038983 ...
```

> **Note**
>
> Use `BERTopic(language="multilingual")` to select a model that supports 50+ languages.
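
To make the columns of the `.get_document_info` output shown earlier easier to read, here is a minimal sketch of how such document-level rows could be assembled. It is an illustration only and does not call BERTopic: `topic_name`, `document_info`, and the input lists are hypothetical, and the sample values are copied from the output above (the word lists are limited to the four words visible in each topic name).

```python
# Hypothetical stand-ins for a trained model's outputs; this is NOT the BERTopic API.
docs = ["I am sure some bashers of Pens...", "My brother is in the market for..."]
topics = [0, -1]
probabilities = [0.200010, 0.420668]
top_words = {
    0: ["game", "team", "games", "season"],
    -1: ["can", "your", "will", "any"],
}


def topic_name(topic_id, words, n=4):
    """Build a label such as '0_game_team_games_season' from the top n words."""
    return "_".join([str(topic_id)] + words[:n])


def document_info(docs, topics, probabilities, top_words):
    """Return one row (a dict) per document, mirroring the columns shown above."""
    rows = []
    for doc, topic, prob in zip(docs, topics, probabilities):
        words = top_words[topic]
        rows.append({
            "Document": doc,
            "Topic": topic,
            "Name": topic_name(topic, words),
            "Top_n_words": " - ".join(words),
            "Probability": prob,
        })
    return rows


rows = document_info(docs, topics, probabilities, top_words)
for row in rows:
    print(row["Topic"], row["Name"], row["Probability"])
```

Each row ties a document to its assigned topic, so rows with topic `-1` can be filtered out in one pass when outliers should be ignored.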

## Visualize Topics
After having trained our BERTopic model, we can iteratively go through hundreds of topics to get a good
@@ -114,51 +131,19 @@ topic_model.visualize_topics()

<img src="images/topic_visualization.gif" width="60%" height="60%" align="center" />

We can create an overview of the most frequent topics in a way that they are easily interpretable.
Horizontal barcharts typically convey information rather well and allow for an intuitive representation
of the topics:

```python
topic_model.visualize_barchart()
```

<img src="images/topics.png" width="70%" height="70%" align="center" />


Find all possible visualizations with interactive examples in the documentation
[here](https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html).

## Embedding Models
BERTopic supports many embedding models that can be used to embed the documents and words:
* Sentence-Transformers
* 🤗 Transformers
* Flair
* Spacy
* Gensim
* USE

[**Sentence-Transformers**](https://github.com/UKPLab/sentence-transformers) is typically used as it has shown great results embedding documents
meant for semantic similarity. Simply select any from their documentation
[here](https://www.sbert.net/docs/pretrained_models.html) and pass it to BERTopic:
## Modularity
By default, the main steps for topic modeling with BERTopic are sentence-transformers, UMAP, HDBSCAN, and c-TF-IDF run in sequence. However, it assumes some independence between these steps which makes BERTopic quite modular. In other words, BERTopic not only allows you to build your own topic model but to explore several topic modeling techniques on top of your customized topic model:

```python
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2")
```
https://user-images.githubusercontent.com/25746895/205490350-cd9833e7-9cd5-44fa-8752-407d748de633.mp4

Similarly, you can choose any [**🤗 Transformers**](https://huggingface.co/models) model and pass it to BERTopic:

```python
from transformers.pipelines import pipeline
You can swap out any of these models or even remove them entirely. Starting with the embedding step, you can find out how to do this [here](https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html) and more about the underlying algorithm and assumptions [here](https://maartengr.github.io/BERTopic/algorithm/algorithm.html).

embedding_model = pipeline("feature-extraction", model="distilbert-base-cased")
topic_model = BERTopic(embedding_model=embedding_model)
```

Click [here](https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html)
for a full overview of all supported embedding models.

## Overview
BERTopic has quite a number of functions that quickly can become overwhelming. To alleviate this issue, you will find an overview
## Functionality
BERTopic has many functions that quickly can become overwhelming. To alleviate this issue, you will find an overview
of all methods and a short description of its purpose.

### Common
@@ -173,48 +158,54 @@ Below, you will find an overview of common functions in BERTopic.
| Access all topics | `.get_topics()` |
| Get topic freq | `.get_topic_freq()` |
| Get all topic information| `.get_topic_info()` |
| Get all document information| `.get_document_info(docs)` |
| Get representative docs per topic | `.get_representative_docs()` |
| Update topic representation | `.update_topics(docs, n_gram_range=(1, 3))` |
| Generate topic labels | `.generate_topic_labels()` |
| Set topic labels | `.set_topic_labels(my_custom_labels)` |
| Merge topics | `.merge_topics(docs, topics_to_merge)` |
| Reduce nr of topics | `.reduce_topics(docs, nr_topics=30)` |
| Reduce outliers | `.reduce_outliers(docs, topics)` |
| Find topics | `.find_topics("vehicle")` |
| Save model | `.save("my_model")` |
| Load model | `BERTopic.load("my_model")` |
| Get parameters | `.get_params()` |
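
The table above lists `.reduce_outliers(docs, topics)` but only shows its call. As a rough illustration of the general idea, the sketch below reassigns each outlier document (topic `-1`) to its most similar topic when the similarity exceeds a threshold. This is an assumption about one plausible strategy, not the library's implementation; `cosine`, `reassign_outliers`, and the toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def reassign_outliers(doc_vectors, topics, topic_vectors, threshold=0.0):
    """Move documents labelled -1 to the closest topic if similarity > threshold."""
    new_topics = []
    for vec, topic in zip(doc_vectors, topics):
        if topic != -1:
            # Documents that already have a topic are left untouched.
            new_topics.append(topic)
            continue
        best_topic, best_sim = -1, threshold
        for candidate, topic_vec in topic_vectors.items():
            sim = cosine(vec, topic_vec)
            if sim > best_sim:
                best_topic, best_sim = candidate, sim
        new_topics.append(best_topic)
    return new_topics


# Toy 2-dimensional representations, made up for illustration.
topic_vectors = {0: [1.0, 0.0], 1: [0.0, 1.0]}
doc_vectors = [[0.9, 0.1], [0.2, 0.8], [0.0, 0.0]]
topics = [0, -1, -1]

new_topics = reassign_outliers(doc_vectors, topics, topic_vectors, threshold=0.5)
print(new_topics)  # [0, 1, -1]
```

The threshold keeps documents that resemble no topic in the outlier class, which is why the all-zero vector in the example stays at `-1`.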


### Attributes
After having trained your BERTopic model, a number of attributes are saved within your model. These attributes, in part,
After having trained your BERTopic model, several attributes are saved within your model. These attributes, in part,
refer to how model information is stored on an estimator during fitting. The attributes that you see below all end in `_` and are
public attributes that can be used to access model information.

| Attribute | Description |
|------------------------|---------------------------------------------------------------------------------------------|
| topics_ | The topics that are generated for each document after training or updating the topic model. |
| probabilities_ | The probabilities that are generated for each document if HDBSCAN is used. |
| topic_sizes_ | The size of each topic |
| topic_mapper_ | A class for tracking topics and their mappings anytime they are merged/reduced. |
| topic_representations_ | The top *n* terms per topic and their respective c-TF-IDF values. |
| c_tf_idf_ | The topic-term matrix as calculated through c-TF-IDF. |
| topic_labels_ | The default labels for each topic. |
| custom_labels_ | Custom labels for each topic as generated through `.set_topic_labels`. |
| topic_embeddings_ | The embeddings for each topic if `embedding_model` was used. |
| representative_docs_ | The representative documents for each topic if HDBSCAN is used. |
| `.topics_` | The topics that are generated for each document after training or updating the topic model. |
| `.probabilities_` | The probabilities that are generated for each document if HDBSCAN is used. |
| `.topic_sizes_` | The size of each topic |
| `.topic_mapper_` | A class for tracking topics and their mappings anytime they are merged/reduced. |
| `.topic_representations_` | The top *n* terms per topic and their respective c-TF-IDF values. |
| `.c_tf_idf_` | The topic-term matrix as calculated through c-TF-IDF. |
| `.topic_labels_` | The default labels for each topic. |
| `.custom_labels_` | Custom labels for each topic as generated through `.set_topic_labels`. |
| `.topic_embeddings_` | The embeddings for each topic if `embedding_model` was used. |
| `.representative_docs_` | The representative documents for each topic if HDBSCAN is used. |
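
Several of the attributes above refer to c-TF-IDF. The sketch below shows the idea in plain Python, assuming the formulation from the BERTopic paper: the weight of term t in class c is its frequency in that class multiplied by log(1 + A / f_t), where A is the average number of words per class and f_t is the frequency of t across all classes. The function name `c_tf_idf`, the whitespace tokenization, and the toy documents are assumptions for illustration; the library's own implementation may normalize and tokenize differently.

```python
import math
from collections import Counter


def c_tf_idf(docs, topics):
    """Return {topic: {term: weight}} using a class-based TF-IDF weighting."""
    # Treat all documents of a topic as one concatenated bag of words.
    class_counts = {}
    for doc, topic in zip(docs, topics):
        class_counts.setdefault(topic, Counter()).update(doc.lower().split())

    # Frequency of each term across all classes, and the average words per class.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    avg_words = sum(total_counts.values()) / len(class_counts)

    weights = {}
    for topic, counts in class_counts.items():
        weights[topic] = {
            term: tf * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
    return weights


# Toy corpus, made up for illustration.
docs = ["the game was a great game", "the team won", "the drive failed", "format the drive"]
topics = [0, 0, 1, 1]

weights = c_tf_idf(docs, topics)
top_terms = {topic: max(w, key=w.get) for topic, w in weights.items()}
print(top_terms)
```

In this toy corpus, "the" occurs in both classes and therefore receives a lower weight than class-specific terms such as "game" or "drive", even though its raw counts are equal to theirs.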


### Variations
There are many different use cases in which topic modeling can be used. As such, a number of
variations of BERTopic have been developed such that one package can be used across across many use cases.
There are many different use cases in which topic modeling can be used. As such, several variations of BERTopic have been developed such that one package can be used across many use cases.

| Method | Code |
|-----------------------|---|
| (semi-) Supervised Topic Modeling | `.fit(docs, y=y)` |
| Topic Modeling per Class | `.topics_per_class(docs, classes)` |
| Dynamic Topic Modeling | `.topics_over_time(docs, timestamps)` |
| Hierarchical Topic Modeling | `.hierarchical_topics(docs)` |
| Guided Topic Modeling | `BERTopic(seed_topic_list=seed_topic_list)` |
| [Topic Distribution Approximation](https://maartengr.github.io/BERTopic/getting_started/distribution/distribution.html) | `.approximate_distribution(docs)` |
| [Online Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/online/online.html) | `.partial_fit(doc)` |
| [Semi-supervised Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/semisupervised/semisupervised.html) | `.fit(docs, y=y)` |
| [Supervised Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/supervised/supervised.html) | `.fit(docs, y=y)` |
| [Manual Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/manual/manual.html) | `.fit(docs, y=y)` |
| [Topic Modeling per Class](https://maartengr.github.io/BERTopic/getting_started/topicsperclass/topicsperclass.html) | `.topics_per_class(docs, classes)` |
| [Dynamic Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html) | `.topics_over_time(docs, timestamps)` |
| [Hierarchical Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.html) | `.hierarchical_topics(docs)` |
| [Guided Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/guided/guided.html) | `BERTopic(seed_topic_list=seed_topic_list)` |


### Visualizations
Evaluating topic models can be rather difficult due to the somewhat subjective nature of evaluation.
2 changes: 1 addition & 1 deletion bertopic/__init__.py
@@ -1,6 +1,6 @@
from bertopic._bertopic import BERTopic

-__version__ = "0.12.0"
+__version__ = "0.13.0"

__all__ = [
"BERTopic",