Fix issue with zeroshot topic modeling missing outlier #1957

MaartenGr · 2024-04-29T12:58:57Z

Addresses #1946

timdhu

Thanks for picking up this bug so quickly (before I'd even noticed it!)

I've tested the changes locally using the example in the tutorial:

from datasets import load_dataset
from pandas import DataFrame

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired

# %%
# We select a subsample of 5000 abstracts from ArXiv
dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
docs = dataset["abstract"][:5_000]

# We define a number of topics that we know are in the documents
zeroshot_topic_list = ["Clustering", "Topic Modeling", "Large Language Models"]

# We fit our model using the zero-shot topics
# and we define a minimum similarity. For each document,
# if the similarity does not exceed that value, it will be used
# for clustering instead.
topic_model = BERTopic(
    embedding_model="thenlper/gte-small", 
    min_topic_size=15,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=.85,
    representation_model=KeyBERTInspired()
)
topics, probabilities = topic_model.fit_transform(docs)

With this change, I get the following error:

File /opt/homebrew/lib/python3.11/site-packages/bertopic/_bertopic.py:448, in BERTopic.fit_transform(self, documents, embeddings, images, y)
    446 # Combine Zero-shot with outliers
    447 if self._is_zeroshot() and len(documents) != len(doc_ids):
--> 448     predictions = self._combine_zeroshot_topics(documents, assigned_documents, assigned_embeddings)
    450 return predictions, self.probabilities_

File /opt/homebrew/lib/python3.11/site-packages/bertopic/_bertopic.py:3717, in BERTopic._combine_zeroshot_topics(self, documents, assigned_documents, embeddings)
   3714         new_mappings[topic] = topic - 1
   3716 # Re-map the topics including all representations (labels, sizes, embeddings, etc.)
-> 3717 self.topics_ = [new_mappings[topic] for topic in self.topics_]
   3718 self.topic_representations_ = {new_mappings[topic]: repr for topic, repr in self.topic_representations_.items()}
   3719 self.topic_labels_ = {new_mappings[topic]: label for topic, label in self.topic_labels_.items()}

File /opt/homebrew/lib/python3.11/site-packages/bertopic/_bertopic.py:3717, in <listcomp>(.0)
   3714         new_mappings[topic] = topic - 1
   3716 # Re-map the topics including all representations (labels, sizes, embeddings, etc.)
-> 3717 self.topics_ = [new_mappings[topic] for topic in self.topics_]
   3718 self.topic_representations_ = {new_mappings[topic]: repr for topic, repr in self.topic_representations_.items()}
   3719 self.topic_labels_ = {new_mappings[topic]: label for topic, label in self.topic_labels_.items()}

KeyError: nan

I think this is because the reverse_topic_labels is applied before "Outliers" is added to it. If I move the assignment up by 2 lines then the tutorial produces the expected result.

timdhu · 2024-05-02T13:42:21Z

bertopic/_bertopic.py

        df.Label = df.Label.map(reverse_topic_labels)
        merged_model.topics_ = df.Label.values
+        if self._outliers:
+            reverse_topic_labels["Outliers"] = -1


Suggested change

-        df.Label = df.Label.map(reverse_topic_labels)
-        merged_model.topics_ = df.Label.values
-        if self._outliers:
-            reverse_topic_labels["Outliers"] = -1
+        if self._outliers:
+            reverse_topic_labels["Outliers"] = -1
+        df.Label = df.Label.map(reverse_topic_labels)
+        merged_model.topics_ = df.Label.values

Otherwise "Outliers" isn't added to reverse_topic_labels until after it's used, meaning that outliers are assigned nan instead of -1 in merged_model.topics_ and line 3718 throws an error.
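The ordering problem can be reproduced outside BERTopic with plain pandas. The following is a minimal sketch, not the library's code: `reverse_topic_labels` is the name from the diff, while `labels`, `new_mappings`, and the sample values are made up for illustration. It shows that a label missing from the dictionary at mapping time becomes `nan`, and that a later integer-keyed lookup then raises `KeyError`.

```python
import pandas as pd

# Illustrative stand-ins: three documents, one of them an outlier.
labels = pd.Series(["Clustering", "Topic Modeling", "Outliers"])
reverse_topic_labels = {"Clustering": 0, "Topic Modeling": 1}

# Original order: map first, register "Outliers" afterwards.
# Series.map leaves keys that are missing from the dict as NaN,
# and mutating the dict later does not change the mapped result.
mapped_too_early = labels.map(reverse_topic_labels)
reverse_topic_labels["Outliers"] = -1

# A later re-mapping step that only knows integer topic ids
# (hypothetical, mirroring the list comprehension in the traceback).
new_mappings = {-1: -1, 0: 0, 1: 1}
try:
    [new_mappings[topic] for topic in mapped_too_early.values]
    raised_key_error = False
except KeyError:
    raised_key_error = True

# Suggested order: the dict already contains "Outliers" when map runs.
mapped_after_fix = labels.map(reverse_topic_labels)
remapped = [new_mappings[topic] for topic in mapped_after_fix.values]

print(mapped_too_early.isna().tolist(), raised_key_error, remapped)
```

With the dictionary completed before the call to `map`, every label resolves to an integer and the lookup succeeds.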

Aside: I think that the way this is set up means that the values in merged_model.topics_ have data type Int64, because initially df.Label has integers and nan values in them, before the nans are replaced with -1.

Not a big issue but it means that the data type for self.topics_ is different if you merge models.
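To make the data type remark concrete, here is a small sketch of how pandas behaves in general. It is not taken from BERTopic, and the variable names are hypothetical. It assumes the difference in question is between the NumPy-backed values returned by `.values` and a plain Python list of ints; the exact dtype seen in a merged model may differ.

```python
import numpy as np
import pandas as pd

labels = pd.Series(["Clustering", "Outliers", "Topic Modeling"])

# With a missing key the mapped column holds NaN, which forces a float dtype.
incomplete = labels.map({"Clustering": 0, "Topic Modeling": 1})

# With every key present the mapped column keeps an integer dtype.
complete = labels.map({"Clustering": 0, "Topic Modeling": 1, "Outliers": -1})

# .values yields a NumPy array of NumPy integers,
# whereas .tolist() yields plain Python ints.
topics_as_array = complete.values
topics_as_list = complete.tolist()

print(incomplete.dtype, complete.dtype)
print(type(topics_as_array[0]).__name__, type(topics_as_list[0]).__name__)
```

If a plain list of Python ints is wanted after merging, converting with `.tolist()` is one option, though whether that matters depends on how the attribute is used downstream.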

MaartenGr · 2024-05-02

Awesome! Thanks for looking into this, it is highly appreciated. I made both changes you suggested, which should hopefully resolve this issue. If they do fix the underlying issue, I will most likely create a new minor release (0.16.2), since zero-shot topic modeling in BERTopic is widely used.

Fix issue with zeroshot topic modeling missing outlier

2f22474

MaartenGr mentioned this pull request Apr 29, 2024

IndexError: list index out of range when using zeroshot_topic_list in 0.16.1 #1946

Open

MaartenGr changed the title ~~Fix issue with zeroshot topic modeling missing outlier~~ Fix issue with zeroshot topic modeling missing outlier (#1946) Apr 29, 2024

MaartenGr changed the title ~~Fix issue with zeroshot topic modeling missing outlier (#1946)~~ Fix issue with zeroshot topic modeling missing outlier Apr 29, 2024

timdhu suggested changes May 2, 2024

MaartenGr mentioned this pull request May 2, 2024

bertopic version 0.16.0 - when adding representation model together with zeroshot_topic_list end with failure #1963

Open

MaartenGr added 2 commits May 2, 2024 16:53

Make sure the order is now correct of outlier assignment

a17f4d4

Update typing

f106a85

timdhu approved these changes May 2, 2024

ianrandman mentioned this pull request May 3, 2024

Issues with Zero-shot Topic Modeling regarding outliers and future operations #1967

Closed

MaartenGr merged commit 1aa73b3 into master May 7, 2024
2 checks passed

MaartenGr deleted the fix_1946 branch July 22, 2024 08:23
