Zero topic distributions for some documents using approximate_distribution() #2150
Comments
Have you tried looking at some of the hyperparameters of |
Thanks for your reply! However, may I ask how the minimum similarity relates to my problem, namely that some documents have no probability assigned to any of the clusters/topics created by the model?
Sure! You need a minimum similarity to decide which subset of topics is most related to your document. It allows you to filter the most related topics. By lowering the minimum similarity, you will get more topics related to the document (although their similarity values will not change).
Thanks for your explanations! So there is a mechanism within |
That's correct!
Yes! In practice, it calculates all the similarities and then simply ignores those that do not exceed the threshold, but the result is the same.
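To make the thresholding described above concrete, here is a minimal sketch in plain Python. It is not BERTopic's actual implementation: the function name filter_similarities and the similarity values are hypothetical, and only the idea of ignoring similarities below a minimum comes from the discussion. It shows how a document whose similarities are all low ends up with an all-zero row, and how lowering the threshold brings topics back without changing their values.

```python
def filter_similarities(similarities, min_similarity=0.1):
    """Keep similarities at or above the threshold; zero out the rest.

    Hypothetical helper for illustration only, not part of BERTopic.
    """
    return [s if s >= min_similarity else 0.0 for s in similarities]


# Hypothetical document-to-topic similarities, all fairly low.
doc_similarities = [0.08, 0.05, 0.02]

strict = filter_similarities(doc_similarities, min_similarity=0.1)
lenient = filter_similarities(doc_similarities, min_similarity=0.01)

print(strict)   # every value falls below 0.1, so the whole row is zero
print(lenient)  # every value passes, and the values themselves are unchanged
```

If this matches what happens in the reported case, passing a lower min_similarity to approximate_distribution() would be the first thing to try.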
Have you searched existing issues? 🔎
Describe the bug
Dear creators of BERTopic,
Thanks for your work; this package is amazing and I have been using it for a long time. However, I found that some documents (regardless of whether they were used to train the model) have zero topic distributions for all topics created by BERTopic after applying the approximate_distribution() function to them. In other words, the topic distribution matrix produced by approximate_distribution() has some rows that sum to 0. The code below does several things: (1) builds a simple BERTopic setup with PCA and KMeans (from cuML) as the dimensionality reduction and clustering techniques; (2) defines a function to split the documents and pre-calculated embeddings; (3) fits the model on the training data and computes topic distributions for both the training and test sets.
If more information is needed, please let me know. Thanks!
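As a rough illustration of the symptom, the sketch below finds rows of a topic distribution matrix that sum to 0. The helper find_zero_rows and the small matrix are hypothetical stand-ins; in a real run the matrix would be the first value returned by approximate_distribution().

```python
def find_zero_rows(topic_distr):
    """Return the indices of documents whose topic distribution sums to 0.

    Hypothetical helper for illustration only, not part of BERTopic.
    """
    return [i for i, row in enumerate(topic_distr) if sum(row) == 0]


# Hypothetical 3-document, 3-topic distribution matrix.
topic_distr = [
    [0.6, 0.4, 0.0],
    [0.0, 0.0, 0.0],  # no topic assigned at all
    [0.0, 0.0, 1.0],
]

empty_docs = find_zero_rows(topic_distr)
print(empty_docs)  # indices of the affected documents
```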
Reproduction
BERTopic Version
0.16.2