
Sklearn Pipeline Embedder #791

Merged · 7 commits · Nov 1, 2022

Conversation

@koaning (Contributor) commented Oct 21, 2022

This PR fixes #768.

@koaning (Contributor, Author) commented Oct 21, 2022

This is a bit of a WIP because there are a few questions on my end.

  • I didn't know which tests you'd like me to add for this.
  • I noticed that I had to run pip install mkdocstrings-python in order to run mkdocs serve. This is possibly because the mkdocstrings>=0.8.0 dependency in setup.py isn't pinned tightly enough. Want me to change that as well?
  • I could be wrong, but are you running unit-tests on your docs to make sure that they run? If not, I could explore if it makes sense to run mktestdocs there.
  • I mentioned scikit-partial in the docs. This is a personal project, so I just want to make sure you're cool with that.
  • I kept verbose=False in the embed method even though it's a no-op. Should that be documented explicitly?
  • Your sklearn dependency is a bit old; maybe it's time to upgrade? Not 100% sure it's needed, though.
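
The core idea of this PR can be sketched as a small backend that wraps any scikit-learn pipeline and exposes an `embed` method calling `pipeline.transform`. The sketch below is duck-typed (the dummy pipeline stands in for e.g. `make_pipeline(TfidfVectorizer(), TruncatedSVD(100))`) so it runs without scikit-learn installed; the merged backend's exact signature may differ:

```python
class SklearnEmbedder:
    """Sketch of a backend wrapping a scikit-learn pipeline.

    `pipe` is anything with a `transform` method; the real BERTopic
    backend may accept extra arguments or return numpy arrays.
    """

    def __init__(self, pipe):
        self.pipe = pipe

    def embed(self, documents, verbose=False):
        # `verbose` is accepted for API compatibility but is a no-op,
        # as discussed in this thread.
        return self.pipe.transform(documents)


class DummyPipe:
    """Stand-in for a fitted scikit-learn pipeline."""

    def transform(self, documents):
        # Embed each document as (char length, word count), just for illustration.
        return [[len(d), len(d.split())] for d in documents]


embeddings = SklearnEmbedder(DummyPipe()).embed(["hello world", "a b c"])
print(embeddings)  # [[11, 2], [5, 3]]
```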

@MaartenGr (Owner) commented

Thanks for the PR and apologies for the late reply!

> I didn't know which tests you'd like me to add for this.

This is something I am still trying to figure out, seeing as most language backends are quite time-consuming to test from a computational perspective. Especially with the default pipeline using HDBSCAN, there need to be quite a few documents (a couple of thousand) for HDBSCAN to find a sufficient number of topics (> 2) to properly test the topic model. As a result, creating a minimal out-of-the-box example of a BERTopic model can take some time. Now that I think about it, this ties in nicely with the possibility of a more lightweight pipeline, and thereby with this PR.

In other words, as much as I would love tests for this, I am afraid it might take you quite some time to figure out the best approach. If you have time, please do, but note that it is by no means expected here.

> I noticed that I had to install pip install mkdocstrings-python in order to run mkdocs serve. This is possibly because you didn't hard-pin the mkdocstrings>=0.8.0 dependency in setup.py. Want me to change that as well?

Nice catch, that would be great!

> I could be wrong, but are you running unit-tests on your docs to make sure that they run? If not, I could explore if it makes sense to run mktestdocs there.

Nope, I have no unit tests on the documentation/docstrings themselves. In the past there have definitely been mistakes in there that mktestdocs would have caught. However, the default pipeline with HDBSCAN requires quite a number of documents, which I think would result in tests that run for several hours. It might be worthwhile to explore this a bit in the future, but quite a few examples would then have to be rewritten, like:

```python
from bertopic import BERTopic

# `docs` is assumed to be a list of document strings
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
hierarchical_topics = topic_model.hierarchical_topics(docs)
```
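
What mktestdocs does can be sketched in a few lines of stdlib Python: collect every ```python fence from a markdown file and execute it, failing loudly on errors. The helper name and regex below are illustrative; the real library is used with pytest, parametrizing a test over `docs/**/*.md`:

```python
import pathlib
import re

# Matches the body of ```python fenced blocks in a markdown file.
FENCE = re.compile(r"```python\n(.*?)```", re.DOTALL)


def check_md_file(path: pathlib.Path) -> int:
    """Execute every ```python fence in a markdown file; raise on failure.

    Returns the number of blocks executed (illustrative helper, not the
    actual mktestdocs API).
    """
    blocks = FENCE.findall(path.read_text())
    for block in blocks:
        exec(compile(block, str(path), "exec"), {})
    return len(blocks)


# Tiny demonstration with a throwaway markdown file.
demo = pathlib.Path("demo.md")
demo.write_text("Example:\n```python\nx = 1 + 1\nassert x == 2\n```\n")
print(check_md_file(demo))  # 1
demo.unlink()
```

With the real library, the equivalent pattern is a pytest test parametrized over all documentation files that calls mktestdocs' checker on each one.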

> I mentioned scikit-partial in the docs. This is a personal project, so I just want to make sure you're cool with that.

No problem! It fits nicely with the online topic modeling approach.

> I kept verbose=False in the embed method even though it's a no-op. Should that be documented explicitly?

Seeing as you can define the verbosity in the sklearn pipeline, it might be worthwhile to document that explicitly.

@koaning (Contributor, Author) commented Oct 24, 2022

Another alternative is to have a weekly job that runs the heavy documentation tests. The thinking here is that you'd like to fix them when they occur, but checking that once a week might be fine.

I'll leave it up to you to consider if that'll work, but it's a pattern that works grand for other things as well. At scikit-lego, we run a weekly job that checks what happens when we install the latest version of sklearn. It typically catches a breaking bug before our users do.
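
A weekly job like the one described can be expressed as a scheduled GitHub Actions workflow. The file path, job names, and install commands below are illustrative, not taken from either repository:

```yaml
# .github/workflows/doc-tests.yml — illustrative sketch; adjust names to the repo.
name: weekly-doc-tests
on:
  schedule:
    - cron: "0 6 * * 1"   # every Monday at 06:00 UTC
  workflow_dispatch: {}    # allow manual runs too
jobs:
  test-docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - run: pip install -e ".[test]" mktestdocs
      - run: pytest tests/test_docs.py
```

The same `schedule` trigger works for the scikit-lego pattern of installing the latest sklearn release and re-running the test suite.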

@koaning (Contributor, Author) commented Oct 24, 2022

Cool. I think I just addressed your comments. Let me know if I forgot something!

@MaartenGr (Owner) commented

> Another alternative is to have a weekly job that runs the heavy documentation tests. The thinking here is that you'd like to fix them when they occur, but checking that once a week might be fine.
>
> I'll leave it up to you to consider if that'll work, but it's a pattern that works grand for other things as well. At scikit-lego, we run a weekly job that checks what happens when we install the latest version of sklearn. It typically catches a breaking bug before our users do.

Thanks! It indeed might make more sense to leave the heavy testing for a cronjob instead of having to go through that each time something small is tested. I'll take a look at the implementation at scikit-lego.

> Cool. I think I just addressed your comments. Let me know if I forgot something!

Yep, you definitely did, thanks! There is one last thing that I forgot to mention. To minimize the code necessary to use different language models, you can pass, for example, a SentenceTransformer model directly as embedding_model without needing to access bertopic.backend; it will automatically be recognized as a SentenceTransformerBackend. This is defined in backend/_utils.py.

Could you add the SklearnEmbedder also there? As a result, you would be able to do the following instead:

```python
from bertopic import BERTopic
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

pipe = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(100)
)

# Use `pipe` directly instead of importing a backend from BERTopic
topic_model = BERTopic(embedding_model=pipe)
```
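
The automatic recognition described above boils down to type-based dispatch: a selector inspects the raw `embedding_model` and wraps it in the matching backend. The sketch below is duck-typed so it runs standalone; the real code in `backend/_utils.py` checks against scikit-learn's actual `Pipeline`/estimator types, and the helper names here are only assumptions:

```python
class SklearnEmbedder:
    """Minimal backend wrapper for a scikit-learn-style pipeline (sketch)."""

    def __init__(self, pipe):
        self.pipe = pipe

    def embed(self, documents, verbose=False):
        return self.pipe.transform(documents)


def select_backend(embedding_model):
    """Sketch of the dispatch done in bertopic/backend/_utils.py."""
    # Already a backend? Use it as-is.
    if hasattr(embedding_model, "embed"):
        return embedding_model
    # Anything with `transform` is treated as an sklearn pipeline here;
    # the real implementation uses isinstance checks per supported library.
    if hasattr(embedding_model, "transform"):
        return SklearnEmbedder(embedding_model)
    raise ValueError("Unrecognized embedding model")


class DummyPipe:
    """Stand-in for a fitted scikit-learn pipeline."""

    def transform(self, docs):
        return [[len(d)] for d in docs]


backend = select_backend(DummyPipe())
print(backend.embed(["abc"]))  # [[3]]
```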

@koaning (Contributor, Author) commented Oct 25, 2022

Ah, nice feature. I'll give that another look and will also check if the docs still make sense.

@koaning (Contributor, Author) commented Oct 28, 2022

I figured I'd try working on this but hit this error:

```
Traceback (most recent call last):
  File "/workspaces/BERTopic/check.py", line 5, in <module>
    from bertopic import BERTopic
  File "/workspaces/BERTopic/bertopic/__init__.py", line 1, in <module>
    from bertopic._bertopic import BERTopic
  File "/workspaces/BERTopic/bertopic/_bertopic.py", line 23, in <module>
    import hdbscan
  File "/usr/local/python/3.10.4/lib/python3.10/site-packages/hdbscan/__init__.py", line 1, in <module>
    from .hdbscan_ import HDBSCAN, hdbscan
  File "/usr/local/python/3.10.4/lib/python3.10/site-packages/hdbscan/hdbscan_.py", line 509, in <module>
    memory=Memory(cachedir=None, verbose=0),
TypeError: Memory.__init__() got an unexpected keyword argument 'cachedir'
```

Just to check, this is on your radar?

@koaning (Contributor, Author) commented Oct 28, 2022

@MaartenGr I made the changes, no tests still, but I think the changes are in.

@MaartenGr (Owner) commented

> Just to check, this is on your radar?

Yes, thanks for mentioning it though. It is an issue with HDBSCAN, which has not taken the upcoming changes in joblib's 1.2.0 release into account. Although there is a fix for this on HDBSCAN's main branch, it has not yet been released to PyPI. A quick fix in BERTopic would be pinning joblib to 1.1.0, but that might not be preferred due to https://nvd.nist.gov/vuln/detail/CVE-2022-21797. I don't think it should be an issue in this context, but I want to be sure and wait for an HDBSCAN release. For now, removing HDBSCAN and re-installing it from its main branch should solve your issue. It would have been nice if I had mentioned this before you started working on it, sorry! 😅

All in all, it is a bit of a tricky situation, seeing as PyPI does not allow requirements to point directly at branches/commits, only at releases.

> @MaartenGr I made the changes, no tests still, but I think the changes are in.

Thanks for the work, LGTM! I'll wait a few days to see if HDBSCAN gets updated but if it does not, I'll go ahead and merge this.

@MaartenGr MaartenGr merged commit e5a2912 into MaartenGr:master Nov 1, 2022
@MaartenGr MaartenGr mentioned this pull request Dec 27, 2022
Closes: Would you be open to a lightweight vectorizer/embedder? (#768)