
Implementation of cuML in BERTopic #495

Closed
p-dre opened this issue Apr 4, 2022 · 26 comments

Comments

@p-dre

p-dre commented Apr 4, 2022

In my experience, UMAP and HDBSCAN are the most computationally intensive parts of BERTopic. However, in their original form, these packages are only partially parallelized and cannot run on a GPU.

NVIDIA's RAPIDS cuML library (https://github.com/rapidsai/cuml) includes a GPU-based solution for both models, which would significantly speed up the computation.
https://developer.nvidia.com/blog/gpu-accelerated-hierarchical-dbscan-with-rapids-cuml-lets-get-back-to-the-future/
Is an implementation conceivable?

@MaartenGr
Owner

There currently is a GPU-accelerated implementation by rapidsai that you can find here. I have yet to try it out myself, but from what I have heard there is quite a big speed-up!

@beckernick
Contributor

cc @VibhuJawa

@MaartenGr
Owner

@p-dre A few days ago, I released BERTopic v0.10.0, which allows you to use different models for HDBSCAN and UMAP, including the GPU-accelerated versions developed by cuML. After installing cuML, you can use them with BERTopic as follows:

from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

# Create instances of GPU-accelerated UMAP and HDBSCAN
umap_model = UMAP(n_components=5, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=20, gen_min_span_tree=True)

# Pass the above models to be used in BERTopic
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs)

It should speed up BERTopic quite a bit! Also, since you can now replace HDBSCAN and UMAP, you could swap in other algorithms, such as PCA and k-Means, which might be a bit faster. That could hurt the quality of the resulting topics, though, so some experimentation might be necessary. A minimal sketch of that swap follows; it assumes the replacements follow the scikit-learn fit/transform and fit/predict conventions BERTopic expects, and the parameter values are illustrative, not tuned:
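
from bertopic import BERTopic
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Illustrative, untuned parameters
dim_model = PCA(n_components=5)
cluster_model = KMeans(n_clusters=50)

# Swap both defaults out for the faster (but possibly lower-quality) models
topic_model = BERTopic(umap_model=dim_model, hdbscan_model=cluster_model)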

@VibhuJawa

VibhuJawa commented May 2, 2022

@MaartenGr, thanks a lot! It's great to learn that it is now possible to use different models for HDBSCAN and UMAP.

From a benchmarking perspective, we saw the following speedups on an end-to-end BERTopic workflow (check out the full blog here):

UMAP: 2718 s (CPU) to 98 s (GPU)
HDBSCAN: 382 s (CPU) to 92 s (GPU)

@p-dre
Author

p-dre commented May 2, 2022

@MaartenGr Amazing, thank you very much!

@kuchenrolle

@MaartenGr As cuml.cluster.HDBSCAN is not an instance of hdbscan.HDBSCAN, the isinstance checks in lines 388, 1431, and 1548 return False, so the probabilities (hdbscan_model.probabilities_) are ignored, even though the cuML implementation does provide them.

I'm also wondering whether hdbscan.HDBSCAN could be initialized with the result from cuml.cluster.HDBSCAN, so that the HDBSCAN.membership_vector method could be used when BERTopic is called with calculate_probabilities=True.

@MaartenGr
Owner

@kuchenrolle After using the cuml.cluster.HDBSCAN model, you can access the probabilities with topic_model.hdbscan_model.probabilities_. I am not entirely sure, though, whether we can use the membership_vector in cuML through the original method. A minimal sketch of that access pattern, assuming a fitted model and a docs list of documents:
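
from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

umap_model = UMAP(n_components=5, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=20)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, _ = topic_model.fit_transform(docs)  # docs: your list of documents

# The flat cluster probabilities live on the fitted cuML estimator itself,
# so no isinstance check is needed to read them
probabilities = topic_model.hdbscan_model.probabilities_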

@beckernick
Contributor

beckernick commented May 20, 2022

As a note, membership_vector and all_points_membership_vectors are on our radar for cuML's HDBSCAN.

Perhaps this might be an opportunity to define something like is_hdbscan_like in the spirit of scikit-learn's is_classifier and is_regressor? We use this pattern in Dask quite a bit for duck-typing based checks to support multiple backends via dispatching. (Perhaps explicit dispatching might be of interest here, too).

For example:

def is_dataframe_like(df) -> bool:
    """Looks like a Pandas DataFrame"""
    if (df.__class__.__module__, df.__class__.__name__) == (
        "pandas.core.frame",
        "DataFrame",
    ):
        # fast exec for most likely input
        return True
    typ = df.__class__
    return (
        all(hasattr(typ, name) for name in ("groupby", "head", "merge", "mean"))
        and all(hasattr(df, name) for name in ("dtypes", "columns"))
        and not any(hasattr(typ, name) for name in ("name", "dtype"))
    )

The AutoML library TPOT did something similar when they added support for cuML and defined _is_selector and _is_transformer. They used this pattern again when they later added _is_resampler to include support for the scikit-learn-contrib project imbalanced-learn.

def _is_selector(estimator):
    selector_attributes = [
        "get_support",
        "transform",
        "inverse_transform",
        "fit_transform",
    ]
    return all(hasattr(estimator, attr) for attr in selector_attributes)
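
Applied to the case at hand, a hypothetical is_hdbscan_like could follow the same pattern; the attribute list below is an assumption for illustration, not an agreed contract:

def is_hdbscan_like(estimator) -> bool:
    """Duck-typing sketch: looks like an HDBSCAN implementation."""
    hdbscan_attributes = [
        "fit",
        "fit_predict",
        "min_cluster_size",
    ]
    return all(hasattr(estimator, attr) for attr in hdbscan_attributes)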

I'd be happy to participate in a discussion on this topic if there is interest.

@nilsblessing

nilsblessing commented May 23, 2022

I would also be very interested in the all_points_membership_vectors functionality via cuML HDBSCAN. In some use cases it offers a good way to considerably reduce the number of -1 (outlier) assignments without significant quality loss. With the hdbscan.HDBSCAN implementation and large datasets (several million records), however, it is very slow. For reference, a minimal sketch of that outlier-reduction pattern with the CPU implementation (it assumes the model was created with prediction_data=True):
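
import hdbscan
import numpy as np

# Soft-clustering membership of every point in every cluster;
# requires prediction_data=True on the fitted model
memberships = hdbscan.all_points_membership_vectors(topic_model.hdbscan_model)

# Reassign each -1 (outlier) document to its most probable cluster
labels = topic_model.hdbscan_model.labels_.copy()
outliers = labels == -1
labels[outliers] = np.argmax(memberships[outliers], axis=1)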

@MaartenGr
Owner

@beckernick Interesting! I haven't seen such a pattern before, but it definitely seems like it would fit nicely with the use cases described here.

Assuming the goal is a 1:1 mapping of functionality between the original HDBSCAN and cuML HDBSCAN, a few functions are still missing, like .membership_vector and, I believe, .approximate_predict, that are necessary to reach the same functionality. Would it make sense to wait until those are developed before creating an is_hdbscan_like function?

@beckernick
Contributor

beckernick commented May 25, 2022

We're big fans of these duck-typing-based utilities. I think whether it makes sense to wait depends on the nature of the integration you'd be interested in supporting. We do plan to expand our HDBSCAN support.

At the moment (if folks didn't want to wait), I suspect we could resolve the "missing probabilities" issue noted above with some duck typing or light special casing around here (and the equivalent in the transform codepath):

BERTopic/bertopic/_bertopic.py

Lines 1431 to 1437 in 407fd4f

if isinstance(self.hdbscan_model, hdbscan.HDBSCAN):
    probabilities = self.hdbscan_model.probabilities_
    self._save_representative_docs(documents)
    if self.calculate_probabilities:
        probabilities = hdbscan.all_points_membership_vectors(self.hdbscan_model)
else:
    probabilities = None

Having thought a bit more about the duck-typing approach: because functions like all_points_membership_vectors, approximate_predict, and membership_vector live in the top-level module namespace, it's more challenging to rely on pure duck typing alone instead of including some kind of explicit dispatch/delegation process. Protocol-based dispatch mechanisms are elegant (such as NEP-18 and NEP-35 in NumPy), but I don't think there's clarity on such a protocol in this scenario.

A basic dispatch procedure based on explicitly supported types/backends could be appealing, as it's conceptually quite similar to the Embedder backends you've built already but oriented for hdbscan dispatch rather than embedders. We do something similar in cuML to enable a variety of input and output data types that we've opted to support.

If BERTopic doesn't expect an explosion of many HDBSCAN backends beyond hdbscan and cuML (like the NumPy/SciPy community does and has for different kinds of arrays), the explicit backend approach you've done for Embedders and the equivalent dispatch approach we took in cuML could work well and be quite lightweight here. Perhaps some kind of dispatching mechanism for module-level functions vaguely like the following might be of interest (but for approximate_predict, all_points_membership_vectors, and membership_vector in hdbscan/cuml) ?

import numpy as np

SUPPORTED_FUNCTIONS = {
    "arange",
    "empty",
}

def _has_cupy():  # analogous to a _has_cuml check
    try:
        import cupy  # noqa: F401
        return True
    except ImportError:
        return False

def delegator(obj, func):
    if func not in SUPPORTED_FUNCTIONS:
        raise AttributeError("Unsupported function")

    if isinstance(obj, np.ndarray):
        return getattr(np, func)
    if _has_cupy():
        import cupy
        if isinstance(obj, cupy.ndarray):
            return getattr(cupy, func)
    raise TypeError("Unsupported backend")

# Assume cupy is available at runtime for some users
import cupy as cp
delegator(np.array([0, 1]), "arange"), delegator(cp.array([0, 1]), "empty")
(<function numpy.arange>,
 <function cupy._creation.basic.empty(shape, dtype=<class 'float'>, order='C')>)

This would potentially enable something like:

if isinstance(self.hdbscan_model, hdbscan.HDBSCAN):
    predictions, probabilities = hdbscan.approximate_predict(self.hdbscan_model, umap_embeddings)

To become:

if is_supported_hdbscan(self.hdbscan_model):
    predictions, probabilities = approximate_predict_dispatch(self.hdbscan_model, umap_embeddings)

And handle both backends.
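
A sketch of what those two helpers could look like under the explicit two-backend assumption; it presumes cuML exposes cuml.cluster.approximate_predict, which it only gained in a later release (see the comments further down this thread):

import hdbscan

def _has_cuml() -> bool:
    try:
        import cuml  # noqa: F401
        return True
    except ImportError:
        return False

def is_supported_hdbscan(model) -> bool:
    """True only for the explicitly supported HDBSCAN backends."""
    if isinstance(model, hdbscan.HDBSCAN):
        return True
    if _has_cuml():
        from cuml.cluster import HDBSCAN as cumlHDBSCAN
        return isinstance(model, cumlHDBSCAN)
    return False

def approximate_predict_dispatch(model, embeddings):
    """Route approximate_predict to whichever backend `model` belongs to."""
    if isinstance(model, hdbscan.HDBSCAN):
        return hdbscan.approximate_predict(model, embeddings)
    if _has_cuml():
        from cuml.cluster import HDBSCAN as cumlHDBSCAN
        from cuml.cluster import approximate_predict
        if isinstance(model, cumlHDBSCAN):
            return approximate_predict(model, embeddings)
    raise TypeError("Unsupported HDBSCAN backend")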

@drob-xx

drob-xx commented Jun 5, 2022

I'm attempting to install RAPIDS on Colab using the RAPIDS notebook (rapids-colab-template). It installs, and then I install BERTopic (pip install bertopic). However, when I run from bertopic import BERTopic, I get:

DistributionNotFound: The 'pynndescent' distribution was not found and is required by the application

This happens when BERTopic imports UMAP, even though pynndescent shows as installed (v0.5.7). Has anyone successfully used RAPIDS with BERTopic on Colab? If so, how did you do the install?

@beckernick
Contributor

beckernick commented Jun 8, 2022

cuML and RAPIDS generally follow the NumPy deprecation policy and, as a result, dropped support for Python 3.7 after December 2021.

Colab doesn't support Python 3.8+, which means that RAPIDS libraries on Colab are tied to the 21.12 release. It's possible something in the environment (perhaps cuML, but potentially another package) is inconsistent with the pynndescent that pip is trying to install. You can try SageMaker Studio Lab as a Colab replacement, but note that it can take a few tries to get a GPU due to demand. After a few attempts, I was able to get a GPU within 3-5 minutes.

If you'd like to try RAPIDS on SageMaker Studio Lab, I recommend using the RAPIDS start page and clicking "Open in Studio Lab", as it provides a getting started notebook.

[Screenshot: RAPIDS start page with the "Open in Studio Lab" button]

I was able to use cuML + BERTopic after creating the following environment at the terminal in Studio Lab:

mamba create -n rapids-22.04 -c rapidsai -c nvidia -c conda-forge rapids=22.04 python=3.9 cudatoolkit=11.4
conda activate rapids-22.04
pip install bertopic
(rapids-22.04) studio-lab-user@default:~$ ipython
Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21) 
Type 'copyright', 'credits' or 'license' for more information
IPython 8.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from bertopic import BERTopic
   ...: from cuml.cluster import HDBSCAN
   ...: from cuml.manifold import UMAP
   ...: from sklearn.datasets import fetch_20newsgroups
   ...: 
   ...: docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']
   ...: 
   ...: # Create instances of GPU-accelerated UMAP and HDBSCAN
   ...: umap_model = UMAP(n_components=5, min_dist=0.0)
   ...: hdbscan_model = HDBSCAN(min_samples=20, gen_min_span_tree=True)
   ...: 
   ...: # Pass the above models to be used in BERTopic
   ...: topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
   ...: topics, probs = topic_model.fit_transform(docs)
   ...: 
Downloading: 100%|...| (embedding model download progress bars omitted)
Label prop iterations: 23
Label prop iterations: 6
Label prop iterations: 5
Label prop iterations: 4
Label prop iterations: 2
Iterations: 5
1592,148,632,24,235,1116
Label prop iterations: 2
Iterations: 1
329,45,115,9,44,83

@drob-xx

drob-xx commented Jun 9, 2022

Super! Thanks so much for taking the time.

@thefonseca

thefonseca commented Jul 8, 2022

> I'm attempting to install RAPIDS on Colab using the RAPIDS notebook (rapids-colab-template). It installs, and then I install BERTopic (pip install bertopic). However, when I run from bertopic import BERTopic, I get:
>
> DistributionNotFound: The 'pynndescent' distribution was not found and is required by the application
>
> This happens when BERTopic imports UMAP, even though pynndescent shows as installed (v0.5.7). Has anyone successfully used RAPIDS with BERTopic on Colab? If so, how did you do the install?

I was able to make RAPIDS work on Colab simply by installing BERTopic before running the rapidsai-csp-utils scripts.

Alternatively, you could patch _bertopic.py and plotting/_topics.py by changing the import from umap import UMAP to from cuml.manifold import UMAP. Not elegant, but it works :)
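
A related workaround, sketched here under the assumption that BERTopic only needs the UMAP class from the umap package, is to register a stub module before importing BERTopic rather than editing its files:

import sys
import types

from cuml.manifold import UMAP

# Make `from umap import UMAP` resolve to cuML's UMAP by registering
# a stub module under the name "umap" before bertopic is imported
umap_stub = types.ModuleType("umap")
umap_stub.UMAP = UMAP
sys.modules["umap"] = umap_stub

from bertopic import BERTopic  # now picks up cuML's UMAP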

@esettouf

Hi,
I have a follow-up question. I downloaded and ran the rapidsai-csp-utils scripts after installing BERTopic, but I have issues importing BERTopic because of a cffi version mismatch: BERTopic requires version 1.15.0, while rapidsai requires version 1.15.1. I tried (un-)installing the 1.15.0 version but still got an error. Did you encounter similar issues, or do you know how I could fix this?

Exception when importing BERTopic:
Exception: Version mismatch: this is the 'cffi' package version 1.15.1, located in '/usr/local/lib/python3.7/dist-packages/cffi/api.py'. When we import the top-level '_cffi_backend' extension module, we get version 1.15.0, located in '/usr/local/lib/python3.7/dist-packages/_cffi_backend.cpython-37m-x86_64-linux-gnu.so'. The two versions should be equal; check your installation.

@thefonseca

It should work if you run pip uninstall -y cffi followed by pip install cffi. But don't forget to restart the runtime before importing BERTopic.

@PeggyFan

Hi @MaartenGr,

Is it possible to run merge_topics with the cuML implementation?
For one thing, probs is missing from the model when using cuML HDBSCAN,
and I got the following error:

AttributeError                            Traceback (most recent call last)
Input In [32], in <cell line: 1>()
----> 1 topics= topic_model._map_predictions(topic_model.hdbscan_model.labels)
      2 probs = hdbscan.all_points_membership_vectors(topic_model.hdbscan_model)
      3 probs = topic_model._map_probabilities(probs, original_topics=True)

File base.pyx:269, in cuml.common.base.Base.__getattr__()

AttributeError: labels

Thank you.

@MaartenGr
Owner

@PeggyFan In BERTopic v0.12, the merge_topics function should work with models other than the default CPU-based HDBSCAN model. The code you shared seems to be custom code, so I cannot say much about what is happening there.

@emarsc

emarsc commented Oct 27, 2022

The speedup from using cuML for UMAP and HDBSCAN is fantastic! However, I had an issue predicting new instances: an error was thrown when calling the .transform function after instantiating with the cuML HDBSCAN.

This is because the cuML HDBSCAN does not have a predict function, nor is it an instance of hdbscan.HDBSCAN (as pointed out by @beckernick).

The code that causes the issue in .transform:

if isinstance(self.hdbscan_model, hdbscan.HDBSCAN):
    predictions, probabilities = hdbscan.approximate_predict(self.hdbscan_model, umap_embeddings)

    # Calculate probabilities
    if self.calculate_probabilities:
        probabilities = hdbscan.membership_vector(self.hdbscan_model, umap_embeddings)
        logger.info("Calculated probabilities with HDBSCAN")
else:
    predictions = self.hdbscan_model.predict(umap_embeddings)
    probabilities = None

It seems that an approximate_predict function was recently added to cuml.cluster (rapidsai/cuml@cb2d681), so I was able to hack around this by creating a custom HDBSCAN class as follows:

from cuml.cluster import HDBSCAN, approximate_predict
        
class GPUHDBSCAN(HDBSCAN):
    def predict(self, umap_embeddings):
        predictions, probabilities = approximate_predict(self, umap_embeddings)
        return predictions

This gives a predict function and seems to circumvent the issue (as long as you don't need the probabilities of the predictions). Hypothetical usage of the subclass follows; whether cuML's HDBSCAN needs prediction_data=True for approximate_predict is version-dependent, so treat that flag as an assumption:
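
from bertopic import BERTopic
from cuml.manifold import UMAP

umap_model = UMAP(n_components=5, min_dist=0.0)
# prediction_data=True may be required for approximate_predict,
# depending on the cuML version (assumption)
hdbscan_model = GPUHDBSCAN(min_samples=20, prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, _ = topic_model.fit_transform(docs)      # docs: training documents
new_topics, _ = topic_model.transform(new_docs)  # hits GPUHDBSCAN.predict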

... Hopefully this helps anyone experiencing the same problem.

@ldsands

ldsands commented Nov 4, 2022

It looks like cuML's latest release implemented both approximate_predict and all_points_membership_vectors. I'm not sure if it is possible yet, but it would be great to see seamless cuML integration in BERTopic!

@MaartenGr
Owner

@ldsands Thank you for mentioning this. I am indeed already exploring this implementation within BERTopic. There are a few other features that I am currently working on, but I'll let you know as soon as a first draft is online!

@MaartenGr
Owner

A few days ago, v0.13 of BERTopic was released. It implements support for cuML's new features and should work nicely. I'll keep this issue open for any other updates regarding cuML.

@p-dre
Author

p-dre commented Jan 9, 2023

@MaartenGr Thank you very much. Do you plan to update the conda version as well? We had problems installing bertopic via pip on an HPC cluster, but it worked well with conda.

@MaartenGr
Owner

@p-dre My apologies, I keep forgetting to update the conda version! I just merged the updated feedstock, so it should be released soon. If it does not work out, please let me know!

@MaartenGr
Owner

Since cuML is now fully supported in BERTopic, I'll close this issue.
