Implementation from cuML in BERTopic #495
There currently is a GPU-accelerated implementation by rapidsai that you can find here. I have yet to try it out myself, but from what I have heard there is quite a big speed-up!
cc @VibhuJawa
@p-dre A few days ago, I released BERTopic v0.10.0, which allows you to use different models for HDBSCAN and UMAP. This also allows you to use the GPU-accelerated versions of HDBSCAN and UMAP developed by cuML. After installing cuML, you can run it with BERTopic as follows:

```python
from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

# Create instances of GPU-accelerated UMAP and HDBSCAN
umap_model = UMAP(n_components=5, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=20, gen_min_span_tree=True)

# Pass the above models to be used in BERTopic
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs)
```

It should speed up BERTopic quite a bit! Also, since you can now replace HDBSCAN and UMAP, you could also replace them with algorithms like PCA and k-Means, which might be a bit faster. It could hurt the quality of the resulting topics though, so some experimentation might be necessary.
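To make that PCA/k-Means suggestion concrete, here is a hedged sketch using scikit-learn. The random embedding array, component count, and cluster count are all illustrative assumptions, not tuned values:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Stand-in for document embeddings (e.g. from a sentence transformer).
rng = np.random.RandomState(42)
embeddings = rng.rand(200, 384)

# PCA replaces UMAP for dimensionality reduction,
# k-Means replaces HDBSCAN for clustering.
dim_model = PCA(n_components=5)
cluster_model = KMeans(n_clusters=10, n_init=10, random_state=42)

reduced = dim_model.fit_transform(embeddings)
labels = cluster_model.fit_predict(reduced)

# In BERTopic these would be passed as:
# topic_model = BERTopic(umap_model=dim_model, hdbscan_model=cluster_model)
```

Both models implement the scikit-learn `fit`/`transform`/`predict` conventions, which is what lets them slot into the same places as UMAP and HDBSCAN.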
@MaartenGr, thanks a lot. It's great to learn that it is now possible to use different models for HDBSCAN and UMAP. From a benchmark perspective, we saw the following speedups on an end-to-end BERTopic workflow (check out the full blog here). UMAP:
@MaartenGr Amazing, thank you very much!!!
@MaartenGr As I'm also wondering whether the
@kuchenrolle After using the
As a note, membership_vector and all_points_membership_vectors are on our radar for cuML's HDBSCAN. Perhaps this might be an opportunity to define something like a duck-typed check. For example:

```python
def is_dataframe_like(df) -> bool:
    """Looks like a Pandas DataFrame"""
    if (df.__class__.__module__, df.__class__.__name__) == (
        "pandas.core.frame",
        "DataFrame",
    ):
        # fast exec for most likely input
        return True
    typ = df.__class__
    return (
        all(hasattr(typ, name) for name in ("groupby", "head", "merge", "mean"))
        and all(hasattr(df, name) for name in ("dtypes", "columns"))
        and not any(hasattr(typ, name) for name in ("name", "dtype"))
    )
```

The AutoML library TPOT did something similar when they added support for cuML and defined:

```python
def _is_selector(estimator):
    selector_attributes = [
        "get_support",
        "transform",
        "inverse_transform",
        "fit_transform",
    ]
    return all(hasattr(estimator, attr) for attr in selector_attributes)
```

I'd be happy to participate in a discussion on this topic if there is interest.
I would also be very interested in the all_points_membership_vectors functionality via cuML HDBSCAN. In some use cases it offers a good way to reduce the -1 clusters considerably without significant quality loss. However, with the hdbscan.HDBSCAN implementation and large datasets (several million records), it suffers greatly in terms of efficiency.
@beckernick Interesting! I haven't seen such a pattern before, but it definitely seems like it would fit nicely with the use cases described here. Assuming the goal is to have a 1:1 mapping of functionality between the original HDBSCAN and cuML HDBSCAN, a few functions are still missing, like
We're big fans of these duck-typing-based utilities. I think whether it makes sense to wait depends on the nature of the integration you'd be interested in supporting. We do plan to expand our HDBSCAN support. At the moment (if folks didn't want to wait), I suspect we could resolve the "missing probabilities" issue noted above with some duck typing or light special casing around here (and the equivalent in the
BERTopic/bertopic/_bertopic.py, lines 1431 to 1437 in 407fd4f
Having thought a bit more about the duck typing approach: because functions like these live at module level, a basic dispatch procedure based on explicitly supported types/backends could be appealing. It's conceptually quite similar to the Embedder backends you've built already, but oriented toward hdbscan dispatch rather than embedders. We do something similar in cuML to enable a variety of input and output data types that we've opted to support. If BERTopic doesn't expect an explosion of many HDBSCAN backends beyond hdbscan and cuML (the way the NumPy/SciPy community does and has for different kinds of arrays), the explicit backend approach you've taken for Embedders and the equivalent dispatch approach we took in cuML could work well and be quite lightweight here. Perhaps some kind of dispatching mechanism for module-level functions, vaguely like the following, might be of interest (but for the hdbscan functions rather than NumPy):

```python
import numpy as np

SUPPORTED_FUNCTIONS = {
    "arange",
    "empty",
}

def _has_cupy():  # has_cuml
    try:
        import cupy
        return True
    except ImportError:
        return False

def delegator(obj, func):
    if func not in SUPPORTED_FUNCTIONS:
        raise AttributeError("Unsupported function")
    if isinstance(obj, np.ndarray):
        return getattr(np, func)
    if _has_cupy():
        import cupy
        if isinstance(obj, cupy.ndarray):
            return getattr(cupy, func)
    raise TypeError("Unsupported backend")
```

```python
>>> import cupy as cp  # assume cupy is available at runtime for some users
>>> delegator(np.array([0, 1]), "arange"), delegator(cp.array([0, 1]), "empty")
(<function numpy.arange>,
 <function cupy._creation.basic.empty(shape, dtype=<class 'float'>, order='C')>)
```

This would potentially enable something like:

BERTopic/bertopic/_bertopic.py, lines 388 to 389 in 407fd4f

to become:

```python
if is_supported_hdbscan(self.hdbscan_model):
    predictions, probabilities = approximate_predict_dispatch(
        self.hdbscan_model, umap_embeddings
    )
```

and handle both backends.
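A minimal sketch of what such helpers could look like for the HDBSCAN case. `is_supported_hdbscan` and `approximate_predict_dispatch` are the hypothetical names from the snippet above; the cuML import path is an assumption and the module-name check is just one possible dispatch key:

```python
def is_supported_hdbscan(model):
    """Duck-typed check: does this model come from a known HDBSCAN
    backend (the CPU hdbscan package or cuML)?"""
    module = model.__class__.__module__
    return module.startswith("hdbscan") or module.startswith("cuml")

def approximate_predict_dispatch(model, embeddings):
    """Lazily import and call the module-level approximate_predict
    that matches the model's backend."""
    module = model.__class__.__module__
    if module.startswith("cuml"):
        from cuml.cluster.hdbscan import approximate_predict  # assumed path
    else:
        from hdbscan import approximate_predict
    return approximate_predict(model, embeddings)
```

Dispatching on the class's module keeps the check cheap and avoids importing either backend until it is actually needed.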
I'm attempting to install RAPIDS on Colab using the RAPIDS notebook (rapids-colab-template). It installs, and then I install BERTopic (pip install bertopic). However, when I run from bertopic import BERTopic, I get:
The error occurs when BERTopic imports UMAP. pynndescent shows as being installed (ver 0.5.7). Has anyone successfully used RAPIDS with BERTopic on Colab? If so, how are you doing the install?
cuML and RAPIDS generally follow the NumPy deprecation policy and, as a result, dropped support for Python 3.7 after December 2021. Colab doesn't support Python 3.8+, which means that RAPIDS libraries on Colab are tied to the 21.12 release. It's possible something in the environment (perhaps cuML, but potentially another package) is inconsistent with the pynndescent that pip is trying to install. You can try SageMaker Studio Lab as a Colab replacement, but note that it can take a few tries to get a GPU due to demand; I was able to get one after a few attempts within 3-5 minutes. If you'd like to try RAPIDS on SageMaker Studio Lab, I recommend using the RAPIDS start page and clicking "Open in Studio Lab", as it provides a getting-started notebook. I was able to use cuML + BERTopic after creating the following environment at the terminal in Studio Lab:
Super! Thanks so much for taking the time.
I could make RAPIDS work on Colab simply by installing BERTopic before running the rapidsai-csp-utils scripts. Alternatively, you could patch
Hi, I get an exception when importing BERTopic:
It should work if you run
Hi @MaartenGr, is it possible to run
Thank you.
@PeggyFan In BERTopic v0.12 the
The speedup from using cuML for UMAP and HDBSCAN is fantastic! However, I was having an issue predicting new instances: an error was thrown when using the .transform function after instantiating with the cuML HDBSCAN. This is because the cuML HDBSCAN does not have a predict function, nor is it an instance of hdbscan.HDBSCAN (as pointed out by @beckernick). Code that causes the issue in .transform:
BERTopic/bertopic/_bertopic.py, lines 427 to 437 in 09c1732
It seems that an approximate_predict function was recently added to cuml.cluster (rapidsai/cuml@cb2d681), so I was able to hack around this by creating a custom HDBSCAN class as follows:
This gives a predict function and seems to circumvent the issue (as long as you don't need the probabilities of the predictions). ... Hopefully this helps anyone experiencing the same problem.
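The custom class itself isn't shown above, but a generic wrapper in this spirit might look like the following sketch. All names here are illustrative; it assumes the backend provides a module-level `approximate_predict(model, X)` returning labels and probabilities (for cuML, this likely also requires the model to be fitted with `prediction_data=True`):

```python
class HDBSCANWithPredict:
    """Wraps an HDBSCAN-like model and exposes .predict() built on a
    backend approximate_predict function (illustrative, not BERTopic API)."""

    def __init__(self, model, approximate_predict_fn):
        self._model = model
        self._approximate_predict = approximate_predict_fn

    def __getattr__(self, name):
        # Delegate everything else (fit, labels_, ...) to the wrapped model.
        return getattr(self._model, name)

    def predict(self, X):
        # approximate_predict returns (labels, probabilities);
        # the predict path here only needs the labels.
        labels, _probs = self._approximate_predict(self._model, X)
        return labels
```

With cuML this could be instantiated as `HDBSCANWithPredict(cuml_model, approximate_predict)`, using the `approximate_predict` importable from cuML's hdbscan module (an assumed import path).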
It looks like cuML's latest release implemented both
@ldsands Thank you for mentioning this. I am indeed already working on exploring this implementation within BERTopic. There are a few other features that I am currently working on, but I'll let you know as soon as a first draft is online!
A few days ago, the v0.13 version of BERTopic was released. It implements support for cuML's new features and should work nicely. I'll keep this page open for all other updates regarding cuML.
@MaartenGr Thank you very much. Do you plan to update the conda version as well? We had problems installing bertopic via pip on an HPC cluster, but it worked well with conda.
@p-dre My apologies, I keep forgetting to update the conda version! I just merged the updated feedstock, so it should be released soon. If it does not work out, please let me know!
Since cuML is now fully supported in BERTopic, I'll close this issue.
In my experience, UMAP and HDBSCAN are the most computationally intensive parts of BERTopic. In their original form, however, the packages are only partially parallel and not usable on GPU.
The NVIDIA RAPIDS cuML library (https://github.com/rapidsai/cuml) includes a solution for both models that is usable on GPU. This would significantly increase the speed of the calculation.
https://developer.nvidia.com/blog/gpu-accelerated-hierarchical-dbscan-with-rapids-cuml-lets-get-back-to-the-future/
Is an implementation conceivable?