Sklearn Pipeline Embedder #791
Conversation
This is a bit of a WIP because there are a few questions on my end.
Thanks for the PR and apologies for the late reply!
This is something I am still trying to figure out, seeing as most language backends are quite time-consuming to test from a computational perspective. Especially when using the default pipeline with HDBSCAN, there need to be quite a few documents (a couple of thousand) for HDBSCAN to find a sufficient number of topics (> 2) to properly test the topic model. As a result, creating a minimal example of a BERTopic model out of the box can take some time. This, now that I think about it, ties in nicely with the possibility of a more lightweight pipeline and thereby this PR. In other words, as much as I would love tests for this, I am afraid it might take you quite some time to figure out the best approach. If you have time, please do, but note that it is by no means expected here.
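To make the "lightweight pipeline" idea above concrete, a test could embed a handful of documents with a pure sklearn pipeline instead of a pretrained language model. This is only a sketch (the documents and component parameters are illustrative), but it shows why such a pipeline is cheap enough for unit tests:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# A handful of toy documents -- far fewer than HDBSCAN would need
docs = [
    "sklearn pipelines are handy",
    "topic modeling with bertopic",
    "tfidf plus svd is a lightweight embedder",
    "hdbscan needs many documents to find topics",
]

# Lightweight embedding pipeline: no GPU, no pretrained weights to download
pipe = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2))
embeddings = pipe.fit_transform(docs)
print(embeddings.shape)  # (4, 2)
```

Fitting and transforming here takes milliseconds, whereas a sentence-transformer backend would first have to download model weights and run a forward pass per document.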
Nice catch, that would be great!
Nope, I have no unit tests on the documentation/docstrings themselves. In the past, there definitely have been mistakes in there that such tests would have caught; see for example BERTopic/bertopic/_bertopic.py, lines 828 to 833, at commit 09c1732.
No problem! It fits nicely with the online topic modeling approach.
Seeing as you can define the verbosity in the sklearn pipeline, it might be worthwhile to document that explicitly.
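For reference, sklearn pipelines take a `verbose` flag that logs each step as it is fitted; a minimal sketch (the fit documents are illustrative):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# verbose=True makes the pipeline print progress for each step while fitting
pipe = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2),
    verbose=True,
)
pipe.fit([
    "some example documents",
    "to fit the verbose pipeline on",
    "a third document here",
])
# During fit, sklearn prints progress lines like:
# [Pipeline] ... (step 1 of 2) Processing tfidfvectorizer, total= ...
```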
Another alternative is to have a weekly job that runs the heavy documentation tests. The thinking here is that you'd like to fix them when they occur, but checking that once a week might be fine. I'll leave it up to you to consider if that'll work, but it's a pattern that works grand for other things as well. At scikit-lego, we run a weekly job that checks what happens when we install the latest version of sklearn. It typically catches a breaking bug before our users do.
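A weekly job like the one described could be set up as a scheduled GitHub Actions workflow. The sketch below is illustrative only: the filename, job name, and the `.[test]` extras name are assumptions, not BERTopic's actual configuration:

```yaml
# .github/workflows/weekly-heavy-tests.yml (illustrative sketch)
name: weekly-heavy-tests

on:
  schedule:
    - cron: "0 6 * * 1"   # every Monday at 06:00 UTC

jobs:
  heavy-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      # Install the package with its (hypothetical) test extras
      - run: pip install -e ".[test]"
      # Run the slow documentation/doctest suite only on this schedule
      - run: pytest tests/ --doctest-modules
```

Because the workflow only fires on the `schedule` trigger, regular pull requests stay fast while regressions are still caught within a week.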
Cool. I think I just addressed your comments. Let me know if I forgot something!
Thanks! It indeed might make more sense to leave the heavy testing for a cronjob instead of having to go through that each time something small is tested. I'll take a look at the implementation at scikit-lego.
Yep, you definitely did, thanks! There is one last thing that I forgot to mention. In order to minimize the code necessary to use different language models, you can supply, for example, a SentenceTransformer model directly into BERTopic. Could you add the SklearnEmbedder there as well? As a result, you would be able to do the following instead:

```python
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

pipe = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(100)
)

# Use `pipe` directly instead of needing to import it from BERTopic
topic_model = BERTopic(embedding_model=pipe)
```
Ah, nice feature. I'll give that another look and will also check if the docs still make sense.
I figured I'd try working on this but hit this error:
Just to check, is this on your radar?
@MaartenGr I made the changes, no tests still, but I think the changes are in.
Yes, thanks for mentioning it though. It is an issue with HDBSCAN, which has not taken into account the upcoming changes to joblib in its 1.2.0 release. Although there is a fix for this on HDBSCAN's main branch, it has not yet been released on PyPI. A quick fix in BERTopic would be pinning joblib to 1.1.0, but that might not be preferred due to https://nvd.nist.gov/vuln/detail/CVE-2022-21797. I don't think it should be an issue in this context, but I want to be sure and will wait for an HDBSCAN release. For now, removing HDBSCAN and re-installing it from its main branch should solve your issue. It would have been nice if I had mentioned this before you started working on it, sorry! 😅 All in all, a bit of a tricky situation, seeing as PyPI does not allow requirements directly from main branches/commits, only from releases.
Thanks for the work, LGTM! I'll wait a few days to see if HDBSCAN gets updated but if it does not, I'll go ahead and merge this.
This PR fixes #768.