Simplify zero-shot topic modeling #2060
- zero-shot topic modeling is now only the equivalent of a clustering step
- removed implementation where this functionality is done through merging two models
- all documents are used at once when calculating representations
- probability comes from cosine similarity when zero-shot topics are used
- validate `nr_topics` with respect to how many zero-shot topics matched
- track `self._outliers` and `self.topic_labels_` using `@property`, as they are derivatives of other attributes
- validate existence of outliers before outlier reduction
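To illustrate the "probability comes from cosine similarity" item, here is a minimal sketch of how documents could be matched to zero-shot topics. It is not the code from this PR: the function name `assign_zeroshot_topics`, the `min_similarity` argument (standing in for `zeroshot_min_similarity`), and the use of `-1` for unmatched documents are assumptions made for the example.

```python
import numpy as np


def assign_zeroshot_topics(doc_embeddings, topic_embeddings, min_similarity=0.7):
    """Hypothetical helper: match documents to zero-shot topics by cosine similarity.

    Returns the chosen topic index per document (-1 when no topic reaches
    `min_similarity`) and the best similarity, which plays the role of the
    probability.
    """
    # Normalize rows so that a dot product equals the cosine similarity.
    docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    topics = topic_embeddings / np.linalg.norm(topic_embeddings, axis=1, keepdims=True)

    similarities = docs @ topics.T  # shape: (n_docs, n_topics)
    best_topic = similarities.argmax(axis=1)
    best_similarity = similarities.max(axis=1)

    # Documents below the threshold stay unassigned (assumed convention: -1).
    assignments = np.where(best_similarity >= min_similarity, best_topic, -1)
    return assignments, best_similarity


doc_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
topic_embeddings = np.array([[1.0, 0.0], [0.0, 1.0]])
assignments, probabilities = assign_zeroshot_topics(
    doc_embeddings, topic_embeddings, min_similarity=0.8
)
```

In this toy input the third document sits halfway between both topics (cosine similarity of about 0.71), so with a threshold of 0.8 it is left unassigned, while the first two documents match their topics with similarity 1.0.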
…d with new topic embedding (#2)
Thank you for your work on this! I left a couple of small comments. Other than that, can you run `ruff`? With a PR that was recently merged, we now use `ruff` for the formatting/linting.
# Conflicts: # bertopic/_bertopic.py
…strings, lower threshold zeroshot test, fix outliers for probabilities during zeroshot (#2)
Thanks for pointing out the recent incorporation with I have fixed my changes with
It looks like this is still failing because I think you only ran one of the two Ruff commands. Ruff has
I believe the code check in Python 3.8 failed because it's not familiar with the
@ianrandman Awesome, everything passed and I think we addressed all the comments we had. Just to be 100% sure, shall I go ahead and merge this?
Yes, all good to merge if it looks good to you. Happy to be done with this :).
@ianrandman Awesome, thank you for taking the time over the last couple of weeks to work on this. It is greatly appreciated and hopefully this will also make it easier for you to use BERTopic instead of your own fork. If there are any other changes you would like to see, please let me know!
`type(self.hdbscan_model) != BaseCluster` when checking whether model is zero-shot
`BaseCluster` during `fit_transform()` during zero-shot topic modeling.
`self._outliers` rather than tracking it to maintain alignment using `@property`
`@property`
`topic_to`, `topics_from` for mapping
`reduce_outliers()`
`zeroshot_min_similarity`. Otherwise, the calculated representation is used.

Fixes #1967