Scikit-learn's HDBSCAN Implementation #2031

Open
MaartenGr opened this issue Jun 3, 2024 · 2 comments
Labels
enhancement New feature or request question Further information is requested

Comments

@MaartenGr
Owner

A recent version of scikit-learn (v1.3, I believe) added an HDBSCAN implementation with base functionality. Since scikit-learn is already a requirement of BERTopic, it makes sense to use that implementation instead of the original one, as scikit-learn has more contributors. Moreover, this might alleviate common installation issues related to HDBSCAN.

There are a couple of issues worth mentioning:

  • Calculating full document-topic probabilities (soft clustering) is, if I'm not mistaken, not implemented in scikit-learn's HDBSCAN; its `probabilities_` attribute only gives each sample's membership strength for its assigned cluster
    • A solution would be to use cosine similarities as the default method of calculating probabilities
  • The feature set is smaller than that of the original implementation
  • Speed needs to be benchmarked against the original implementation to determine whether the switch is worth it
  • Accuracy, whatever that means in this context, might also need some exploration

For those reading this, I'm interested to hear what you all think about this suggested change!

@MaartenGr MaartenGr added enhancement New feature or request question Further information is requested labels Jun 8, 2024
@MaartenGr MaartenGr pinned this issue Jun 22, 2024
@joshhilton

Anything that simplifies the dependency structure is progress. Would it affect the use of the cuML implementation? If not, you could leave that as the faster alternative.

@MaartenGr
Owner Author

It doesn't affect the cuML implementation, since the scikit-learn version should be nothing more than a drop-in replacement for the original implementation. One major disadvantage is that soft clustering is currently not implemented in the scikit-learn version, which makes it difficult to return probabilities.
