
Sklearn Pipeline Embedder #791

Merged · 7 commits · Nov 1, 2022

Conversation

@koaning (Contributor) commented Oct 21, 2022

This PR fixes #768.

@koaning (Contributor, Author) commented Oct 21, 2022

This is a bit of a WIP because there are a few questions on my end.

  • I didn't know which tests you'd like me to add for this.
  • I noticed that I had to run pip install mkdocstrings-python in order to run mkdocs serve. This is possibly because the mkdocstrings>=0.8.0 dependency in setup.py isn't pinned tightly enough. Want me to change that as well?
  • I could be wrong, but are you running unit-tests on your docs to make sure that they run? If not, I could explore if it makes sense to run mktestdocs there.
  • I mentioned scikit-partial in the docs. This is a personal project, so I just want to make sure you're cool with that.
  • I kept verbose=False in the embed method even though it's a no-op. Should that be documented explicitly?
  • Your sklearn dependency is a bit old; maybe it's time to upgrade? Not 100% sure it's needed, though.
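
The core idea of this PR can be sketched as a small backend that wraps any scikit-learn pipeline and exposes an `embed` method calling `pipeline.transform`. The sketch below is duck-typed (the dummy pipeline stands in for e.g. `make_pipeline(TfidfVectorizer(), TruncatedSVD(100))`) so it runs without scikit-learn installed; the merged backend's exact signature may differ:

```python
class SklearnEmbedder:
    """Sketch of a backend wrapping a scikit-learn pipeline.

    `pipe` is anything with a `transform` method; the real BERTopic
    backend may accept extra arguments or return numpy arrays.
    """

    def __init__(self, pipe):
        self.pipe = pipe

    def embed(self, documents, verbose=False):
        # `verbose` is accepted for API compatibility but is a no-op,
        # as discussed in this thread.
        return self.pipe.transform(documents)


class DummyPipe:
    """Stand-in for a fitted scikit-learn pipeline."""

    def transform(self, documents):
        # Embed each document as (char length, word count), just for illustration.
        return [[len(d), len(d.split())] for d in documents]


embeddings = SklearnEmbedder(DummyPipe()).embed(["hello world", "a b c"])
print(embeddings)  # [[11, 2], [5, 3]]
```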

@MaartenGr (Owner) commented

Thanks for the PR and apologies for the late reply!

> I didn't know which tests you'd like me to add for this.

This is something I am still trying to figure out, seeing as most language backends are quite time-consuming to test from a computational perspective. Especially with the default pipeline using HDBSCAN, there need to be quite a few documents (a couple of thousand) for HDBSCAN to find a sufficient number of topics (> 2) to properly test the topic model. As a result, creating a minimal out-of-the-box example of a BERTopic model can take some time. Now that I think about it, this ties in nicely with the possibility of a more lightweight pipeline, and thereby with this PR.

In other words, as much as I would love tests for this, I am afraid it might take you quite some time to figure out the best approach. If you have time, please do, but note that it is by no means expected here.

> I noticed that I had to install pip install mkdocstrings-python in order to run mkdocs serve. This is possibly because you didn't hard-pin the mkdocstrings>=0.8.0 dependency in setup.py. Want me to change that as well?

Nice catch, that would be great!

> I could be wrong, but are you running unit-tests on your docs to make sure that they run? If not, I could explore if it makes sense to run mktestdocs there.

Nope, I have no unit tests on the documentation/docstrings themselves. In the past there have definitely been mistakes in there that mktestdocs would have caught. However, the default pipeline with HDBSCAN requires quite a number of documents, which I think would result in tests that run for several hours. It might be worthwhile to explore this a bit in the future, but quite a few examples would then have to be rewritten, like:

```python
from bertopic import BERTopic

# `docs` is assumed to be a list of document strings
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
hierarchical_topics = topic_model.hierarchical_topics(docs)
```
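
What mktestdocs does can be sketched in a few lines of stdlib Python: collect every ```python fence from a markdown file and execute it, failing loudly on errors. The helper name and regex below are illustrative; the real library is used with pytest, parametrizing a test over `docs/**/*.md`:

```python
import pathlib
import re

# Matches the body of ```python fenced blocks in a markdown file.
FENCE = re.compile(r"```python\n(.*?)```", re.DOTALL)


def check_md_file(path: pathlib.Path) -> int:
    """Execute every ```python fence in a markdown file; raise on failure.

    Returns the number of blocks executed (illustrative helper, not the
    actual mktestdocs API).
    """
    blocks = FENCE.findall(path.read_text())
    for block in blocks:
        exec(compile(block, str(path), "exec"), {})
    return len(blocks)


# Tiny demonstration with a throwaway markdown file.
demo = pathlib.Path("demo.md")
demo.write_text("Example:\n```python\nx = 1 + 1\nassert x == 2\n```\n")
print(check_md_file(demo))  # 1
demo.unlink()
```

With the real library, the equivalent pattern is a pytest test parametrized over all documentation files that calls mktestdocs' checker on each one.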

> I mentioned scikit-partial in the docs. This is a personal project, so I just want to make sure you're cool with that.

No problem! It fits nicely with the online topic modeling approach.

> I kept verbose=False in the embed method even though it's a no-op. Should that be documented explicitly?

Seeing as you can define the verbosity in the sklearn pipeline, it might be worthwhile to document that explicitly.

@koaning (Contributor, Author) commented Oct 24, 2022

Another alternative is to have a weekly job that runs the heavy documentation tests. The thinking here is that you'd like to fix them when they occur, but checking that once a week might be fine.

I'll leave it up to you to consider if that'll work, but it's a pattern that works grand for other things as well. At scikit-lego, we run a weekly job that checks what happens when we install the latest version of sklearn. It typically catches a breaking bug before our users do.
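
A weekly job like the one described can be expressed as a scheduled GitHub Actions workflow. The file path, job names, and install commands below are illustrative, not taken from either repository:

```yaml
# .github/workflows/doc-tests.yml — illustrative sketch; adjust names to the repo.
name: weekly-doc-tests
on:
  schedule:
    - cron: "0 6 * * 1"   # every Monday at 06:00 UTC
  workflow_dispatch: {}    # allow manual runs too
jobs:
  test-docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - run: pip install -e ".[test]" mktestdocs
      - run: pytest tests/test_docs.py
```

The same `schedule` trigger works for the scikit-lego pattern of installing the latest sklearn release and re-running the test suite.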

@koaning (Contributor, Author) commented Oct 24, 2022

Cool. I think I just addressed your comments. Let me know if I forgot something!

@MaartenGr (Owner) commented

> Another alternative is to have a weekly job that runs the heavy documentation tests. The thinking here is that you'd like to fix them when they occur, but checking that once a week might be fine.
>
> I'll leave it up to you to consider if that'll work, but it's a pattern that works grand for other things as well. At scikit-lego, we run a weekly job that checks what happens when we install the latest version of sklearn. It typically catches a breaking bug before our users do.

Thanks! It indeed might make more sense to leave the heavy testing for a cronjob instead of having to go through that each time something small is tested. I'll take a look at the implementation at scikit-lego.

> Cool. I think I just addressed your comments. Let me know if I forgot something!

Yep, you definitely did, thanks! There is one last thing that I forgot to mention. To minimize the code necessary to use different language models, you can pass, for example, a SentenceTransformer model directly as embedding_model without needing to access bertopic.backend; it will automatically be recognized as a SentenceTransformerBackend. This is defined in backend/_utils.py.

Could you add the SklearnEmbedder also there? As a result, you would be able to do the following instead:

```python
from bertopic import BERTopic
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

pipe = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(100)
)

# Use `pipe` directly instead of importing a backend from BERTopic
topic_model = BERTopic(embedding_model=pipe)
```
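
The automatic recognition described above boils down to type-based dispatch: a selector inspects the raw `embedding_model` and wraps it in the matching backend. The sketch below is duck-typed so it runs standalone; the real code in `backend/_utils.py` checks against scikit-learn's actual `Pipeline`/estimator types, and the helper names here are only assumptions:

```python
class SklearnEmbedder:
    """Minimal backend wrapper for a scikit-learn-style pipeline (sketch)."""

    def __init__(self, pipe):
        self.pipe = pipe

    def embed(self, documents, verbose=False):
        return self.pipe.transform(documents)


def select_backend(embedding_model):
    """Sketch of the dispatch done in bertopic/backend/_utils.py."""
    # Already a backend? Use it as-is.
    if hasattr(embedding_model, "embed"):
        return embedding_model
    # Anything with `transform` is treated as an sklearn pipeline here;
    # the real implementation uses isinstance checks per supported library.
    if hasattr(embedding_model, "transform"):
        return SklearnEmbedder(embedding_model)
    raise ValueError("Unrecognized embedding model")


class DummyPipe:
    """Stand-in for a fitted scikit-learn pipeline."""

    def transform(self, docs):
        return [[len(d)] for d in docs]


backend = select_backend(DummyPipe())
print(backend.embed(["abc"]))  # [[3]]
```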

@koaning (Contributor, Author) commented Oct 25, 2022

Ah, nice feature. I'll give that another look and will also check if the docs still make sense.

@koaning (Contributor, Author) commented Oct 28, 2022

I figured I'd try working on this but hit this error:

```
Traceback (most recent call last):
  File "/workspaces/BERTopic/check.py", line 5, in <module>
    from bertopic import BERTopic
  File "/workspaces/BERTopic/bertopic/__init__.py", line 1, in <module>
    from bertopic._bertopic import BERTopic
  File "/workspaces/BERTopic/bertopic/_bertopic.py", line 23, in <module>
    import hdbscan
  File "/usr/local/python/3.10.4/lib/python3.10/site-packages/hdbscan/__init__.py", line 1, in <module>
    from .hdbscan_ import HDBSCAN, hdbscan
  File "/usr/local/python/3.10.4/lib/python3.10/site-packages/hdbscan/hdbscan_.py", line 509, in <module>
    memory=Memory(cachedir=None, verbose=0),
TypeError: Memory.__init__() got an unexpected keyword argument 'cachedir'
```

Just to check, this is on your radar?

@koaning (Contributor, Author) commented Oct 28, 2022

@MaartenGr I made the changes, no tests still, but I think the changes are in.

@MaartenGr (Owner) commented

> Just to check, this is on your radar?

Yes, thanks for mentioning it though. It is an issue with HDBSCAN, which has not taken the upcoming changes in joblib's 1.2.0 release into account. Although there is a fix for this on HDBSCAN's main branch, it has not yet been released to PyPI. A quick fix in BERTopic would be pinning joblib to 1.1.0, but that might not be preferred due to https://nvd.nist.gov/vuln/detail/CVE-2022-21797. I don't think it should be an issue in this context, but I want to be sure and wait for an HDBSCAN release. For now, removing HDBSCAN and re-installing it from its main branch should solve your issue. It would have been nice if I had mentioned this before you started working on it, sorry! 😅

All in all, it is a bit of a tricky situation, seeing as PyPI does not allow requirements to point directly at branches/commits, only at releases.

> @MaartenGr I made the changes, no tests still, but I think the changes are in.

Thanks for the work, LGTM! I'll wait a few days to see if HDBSCAN gets updated but if it does not, I'll go ahead and merge this.

@MaartenGr MaartenGr merged commit e5a2912 into MaartenGr:master Nov 1, 2022
@MaartenGr MaartenGr mentioned this pull request Dec 27, 2022
Closes: Would you be open to a lightweight vectorizer/embedder? (#768)