A recent version of scikit-learn (v1.3, I believe) added an HDBSCAN implementation with base functionality. Since scikit-learn is already a requirement of BERTopic, it makes sense to use that implementation instead of the original one, especially as scikit-learn has more contributors. It might also alleviate the common installation issues related to HDBSCAN.
There are a couple of issues worth mentioning:
Calculation of probabilities is, if I'm not mistaken, not implemented in scikit-learn's HDBSCAN
A solution would be to use the cosine similarities as the default method of calculating probabilities
The feature set is smaller than the original implementation
Speed needs to be tested to identify whether this is worth it
Accuracy, whatever that means in this context, might also need some exploration
For those reading this, I'm interested to hear what you all think about this suggested change!
Anything that simplifies the dependency structure is progress. Would it affect the use of the cuML implementation? If not, you could leave that as the faster alternative.
It doesn't affect the cuML implementation, since this should be nothing more than a drop-in replacement of the original implementation. One major disadvantage is that soft clustering is currently not implemented in the scikit-learn version, which makes it difficult to return probabilities.