
Implementation of cuML in BERTopic #495

Closed
p-dre opened this issue Apr 4, 2022 · 26 comments

Comments

@p-dre

p-dre commented Apr 4, 2022

In my experience, UMAP and HDBSCAN are the most computationally intensive parts of BERTopic. However, in their original form, these packages are only partially parallelized and cannot run on a GPU.

NVIDIA's RAPIDS cuML library (https://github.com/rapidsai/cuml) includes a GPU-based solution for both models, which would significantly speed up the computation.
https://developer.nvidia.com/blog/gpu-accelerated-hierarchical-dbscan-with-rapids-cuml-lets-get-back-to-the-future/
Is an implementation conceivable?

@MaartenGr
Owner

There currently is a GPU-accelerated implementation by rapidsai that you can find here. I have yet to try it out myself, but from what I have heard there is quite a big speed-up!

@beckernick
Contributor

cc @VibhuJawa

@MaartenGr
Owner

@p-dre A few days ago, I released BERTopic v0.10.0, which allows you to use different models for HDBSCAN and UMAP, including the GPU-accelerated versions developed by cuML. After installing cuML, you can use them with BERTopic as follows:

from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

# Create instances of GPU-accelerated UMAP and HDBSCAN
umap_model = UMAP(n_components=5, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=20, gen_min_span_tree=True)

# Pass the above models to be used in BERTopic
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs)

It should speed up BERTopic quite a bit! Also, since you can now replace HDBSCAN and UMAP, you could swap in other algorithms, such as PCA and k-Means, which might be a bit faster. That could hurt the quality of the resulting topics, though, so some experimentation might be necessary. A minimal sketch of that swap follows; it assumes the replacements follow the scikit-learn fit/transform and fit/predict conventions BERTopic expects, and the parameter values are illustrative, not tuned:
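
from bertopic import BERTopic
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Illustrative, untuned parameters
dim_model = PCA(n_components=5)
cluster_model = KMeans(n_clusters=50)

# Swap both defaults out for the faster (but possibly lower-quality) models
topic_model = BERTopic(umap_model=dim_model, hdbscan_model=cluster_model)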

@VibhuJawa

VibhuJawa commented May 2, 2022

@MaartenGr, thanks a lot! It's great to learn that it is now possible to use different models for HDBSCAN and UMAP.

From a benchmarking perspective, we saw the following speedups on an end-to-end BERTopic workflow (check out the full blog here):

UMAP: 2718 s (CPU) to 98 s (GPU)
HDBSCAN: 382 s (CPU) to 92 s (GPU)

@p-dre
Author

p-dre commented May 2, 2022

@MaartenGr Amazing, thank you very much!

@kuchenrolle

@MaartenGr As cuml.cluster.HDBSCAN is not an instance of hdbscan.HDBSCAN, the isinstance checks in lines 388, 1431, and 1548 return False, so the probabilities (hdbscan_model.probabilities_) are ignored, even though the cuML implementation does provide them.

I'm also wondering whether hdbscan.HDBSCAN could be initialized with the result from cuml.cluster.HDBSCAN, so that the HDBSCAN.membership_vector method could be used when BERTopic is called with calculate_probabilities=True.

@MaartenGr
Owner

@kuchenrolle After using the cuml.cluster.HDBSCAN model, you can access the probabilities with topic_model.hdbscan_model.probabilities_. I am not entirely sure, though, whether we can use the membership_vector in cuML through the original method. A minimal sketch of that access pattern, assuming a fitted model and a docs list of documents:
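
from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

umap_model = UMAP(n_components=5, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=20)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, _ = topic_model.fit_transform(docs)  # docs: your list of documents

# The flat cluster probabilities live on the fitted cuML estimator itself,
# so no isinstance check is needed to read them
probabilities = topic_model.hdbscan_model.probabilities_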

@beckernick
Contributor

beckernick commented May 20, 2022

As a note, membership_vector and all_points_membership_vectors are on our radar for cuML's HDBSCAN.

Perhaps this might be an opportunity to define something like is_hdbscan_like in the spirit of scikit-learn's is_classifier and is_regressor? We use this pattern in Dask quite a bit for duck-typing based checks to support multiple backends via dispatching. (Perhaps explicit dispatching might be of interest here, too).

For example:

def is_dataframe_like(df) -> bool:
    """Looks like a Pandas DataFrame"""
    if (df.__class__.__module__, df.__class__.__name__) == (
        "pandas.core.frame",
        "DataFrame",
    ):
        # fast exec for most likely input
        return True
    typ = df.__class__
    return (
        all(hasattr(typ, name) for name in ("groupby", "head", "merge", "mean"))
        and all(hasattr(df, name) for name in ("dtypes", "columns"))
        and not any(hasattr(typ, name) for name in ("name", "dtype"))
    )

The AutoML library TPOT did something similar when they added support for cuML and defined _is_selector and _is_transformer. They used this pattern again when they later added _is_resampler to include support for the scikit-learn-contrib project imbalanced-learn.

def _is_selector(estimator):
    selector_attributes = [
        "get_support",
        "transform",
        "inverse_transform",
        "fit_transform",
    ]
    return all(hasattr(estimator, attr) for attr in selector_attributes)
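
Applied to the case at hand, a hypothetical is_hdbscan_like could follow the same pattern; the attribute list below is an assumption for illustration, not an agreed contract:

def is_hdbscan_like(estimator) -> bool:
    """Duck-typing sketch: looks like an HDBSCAN implementation."""
    hdbscan_attributes = [
        "fit",
        "fit_predict",
        "min_cluster_size",
    ]
    return all(hasattr(estimator, attr) for attr in hdbscan_attributes)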

I'd be happy to participate in a discussion on this topic if there is interest.

@nilsblessing

nilsblessing commented May 23, 2022

I would also be very interested in the all_points_membership_vectors functionality via cuML HDBSCAN. In some use cases it offers a good way to considerably reduce the number of -1 (outlier) assignments without significant quality loss. With the hdbscan.HDBSCAN implementation and large datasets (several million records), however, it is very slow. For reference, a minimal sketch of that outlier-reduction pattern with the CPU implementation (it assumes the model was created with prediction_data=True):
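
import hdbscan
import numpy as np

# Soft-clustering membership of every point in every cluster;
# requires prediction_data=True on the fitted model
memberships = hdbscan.all_points_membership_vectors(topic_model.hdbscan_model)

# Reassign each -1 (outlier) document to its most probable cluster
labels = topic_model.hdbscan_model.labels_.copy()
outliers = labels == -1
labels[outliers] = np.argmax(memberships[outliers], axis=1)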

@MaartenGr
Owner

@beckernick Interesting! I haven't seen such a pattern before, but it definitely seems like it would fit nicely with the use cases described here.

Assuming the goal is a 1:1 mapping of functionality between the original HDBSCAN and cuML HDBSCAN, a few functions are still missing, like .membership_vector and, I believe, .approximate_predict, that are necessary to reach the same functionality. Would it make sense to wait until those are developed before creating an is_hdbscan_like function?

@beckernick
Contributor

beckernick commented May 25, 2022

We're big fans of these duck-typing-based utilities. I think whether it makes sense to wait depends on the nature of the integration you'd be interested in supporting. We do plan to expand our HDBSCAN support.

At the moment (if folks didn't want to wait), I suspect we could resolve the "missing probabilities" issue noted above with some duck typing or light special casing around here (and the equivalent in the transform codepath):

BERTopic/bertopic/_bertopic.py

Lines 1431 to 1437 in 407fd4f

if isinstance(self.hdbscan_model, hdbscan.HDBSCAN):
    probabilities = self.hdbscan_model.probabilities_
    self._save_representative_docs(documents)
    if self.calculate_probabilities:
        probabilities = hdbscan.all_points_membership_vectors(self.hdbscan_model)
else:
    probabilities = None

Having thought a bit more about the duck-typing approach: because functions like all_points_membership_vectors, approximate_predict, and membership_vector live in the top-level module namespace, it's more challenging to rely on pure duck typing alone instead of including some kind of explicit dispatch/delegation process. Protocol-based dispatch mechanisms are elegant (such as NEP-18 and NEP-35 in NumPy), but I don't think there's clarity on such a protocol in this scenario.

A basic dispatch procedure based on explicitly supported types/backends could be appealing, as it's conceptually quite similar to the Embedder backends you've built already but oriented for hdbscan dispatch rather than embedders. We do something similar in cuML to enable a variety of input and output data types that we've opted to support.

If BERTopic doesn't expect an explosion of many HDBSCAN backends beyond hdbscan and cuML (like the NumPy/SciPy community does and has for different kinds of arrays), the explicit backend approach you've done for Embedders and the equivalent dispatch approach we took in cuML could work well and be quite lightweight here. Perhaps some kind of dispatching mechanism for module-level functions vaguely like the following might be of interest (but for approximate_predict, all_points_membership_vectors, and membership_vector in hdbscan/cuml) ?

import numpy as np

SUPPORTED_FUNCTIONS = {
    "arange",
    "empty",
}

def _has_cupy():  # analogous to a _has_cuml check
    try:
        import cupy  # noqa: F401
        return True
    except ImportError:
        return False

def delegator(obj, func):
    if func not in SUPPORTED_FUNCTIONS:
        raise AttributeError("Unsupported function")

    if isinstance(obj, np.ndarray):
        return getattr(np, func)
    if _has_cupy():
        import cupy
        if isinstance(obj, cupy.ndarray):
            return getattr(cupy, func)
    raise TypeError("Unsupported backend")

# Assume cupy is available at runtime for some users
import cupy as cp
delegator(np.array([0, 1]), "arange"), delegator(cp.array([0, 1]), "empty")
(<function numpy.arange>,
 <function cupy._creation.basic.empty(shape, dtype=<class 'float'>, order='C')>)

This would potentially enable something like:

if isinstance(self.hdbscan_model, hdbscan.HDBSCAN):
    predictions, probabilities = hdbscan.approximate_predict(self.hdbscan_model, umap_embeddings)

To become:

if is_supported_hdbscan(self.hdbscan_model):
    predictions, probabilities = approximate_predict_dispatch(self.hdbscan_model, umap_embeddings)

And handle both backends.
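
A sketch of what those two helpers could look like under the explicit two-backend assumption; it presumes cuML exposes cuml.cluster.approximate_predict, which it only gained in a later release (see the comments further down this thread):

import hdbscan

def _has_cuml() -> bool:
    try:
        import cuml  # noqa: F401
        return True
    except ImportError:
        return False

def is_supported_hdbscan(model) -> bool:
    """True only for the explicitly supported HDBSCAN backends."""
    if isinstance(model, hdbscan.HDBSCAN):
        return True
    if _has_cuml():
        from cuml.cluster import HDBSCAN as cumlHDBSCAN
        return isinstance(model, cumlHDBSCAN)
    return False

def approximate_predict_dispatch(model, embeddings):
    """Route approximate_predict to whichever backend `model` belongs to."""
    if isinstance(model, hdbscan.HDBSCAN):
        return hdbscan.approximate_predict(model, embeddings)
    if _has_cuml():
        from cuml.cluster import HDBSCAN as cumlHDBSCAN
        from cuml.cluster import approximate_predict
        if isinstance(model, cumlHDBSCAN):
            return approximate_predict(model, embeddings)
    raise TypeError("Unsupported HDBSCAN backend")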

@drob-xx

drob-xx commented Jun 5, 2022

I'm attempting to install RAPIDS on Colab using the RAPIDS notebook (rapids-colab-template). It installs, and then I install BERTopic (pip install bertopic). However, when I run from bertopic import BERTopic, I get:

DistributionNotFound: The 'pynndescent' distribution was not found and is required by the application

This happens when BERTopic imports UMAP, even though pynndescent shows as installed (v0.5.7). Has anyone successfully used RAPIDS with BERTopic on Colab? If so, how did you do the install?

@beckernick
Contributor

beckernick commented Jun 8, 2022

cuML and RAPIDS generally follow the NumPy deprecation policy and, as a result, dropped support for Python 3.7 after December 2021.

Colab doesn't support Python 3.8+, which means that RAPIDS libraries on Colab are tied to the 21.12 release. It's possible something in the environment (perhaps cuML, but potentially another package) is inconsistent with the pynndescent that pip is trying to install. You can try SageMaker Studio Lab as a Colab replacement, but note that it can take a few tries to get a GPU due to demand. After a few attempts, I was able to get a GPU within 3-5 minutes.

If you'd like to try RAPIDS on SageMaker Studio Lab, I recommend using the RAPIDS start page and clicking "Open in Studio Lab", as it provides a getting started notebook.

[Screenshot: RAPIDS start page with the "Open in Studio Lab" button]

I was able to use cuML + BERTopic after creating the following environment at the terminal in Studio Lab:

mamba create -n rapids-22.04 -c rapidsai -c nvidia -c conda-forge rapids=22.04 python=3.9 cudatoolkit=11.4
conda activate rapids-22.04
pip install bertopic
(rapids-22.04) studio-lab-user@default:~$ ipython
Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21) 
Type 'copyright', 'credits' or 'license' for more information
IPython 8.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from bertopic import BERTopic
   ...: from cuml.cluster import HDBSCAN
   ...: from cuml.manifold import UMAP
   ...: from sklearn.datasets import fetch_20newsgroups
   ...: 
   ...: docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']
   ...: 
   ...: # Create instances of GPU-accelerated UMAP and HDBSCAN
   ...: umap_model = UMAP(n_components=5, min_dist=0.0)
   ...: hdbscan_model = HDBSCAN(min_samples=20, gen_min_span_tree=True)
   ...: 
   ...: # Pass the above models to be used in BERTopic
   ...: topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
   ...: topics, probs = topic_model.fit_transform(docs)
   ...: 
Downloading: 100%|...| (embedding model download progress bars omitted)
Label prop iterations: 23
Label prop iterations: 6
Label prop iterations: 5
Label prop iterations: 4
Label prop iterations: 2
Iterations: 5
1592,148,632,24,235,1116
Label prop iterations: 2
Iterations: 1
329,45,115,9,44,83

@drob-xx

drob-xx commented Jun 9, 2022

Super! Thanks so much for taking the time.

@thefonseca

thefonseca commented Jul 8, 2022

> I'm attempting to install RAPIDS on Colab using the RAPIDS notebook (rapids-colab-template). It installs, and then I install BERTopic (pip install bertopic). However, when I run from bertopic import BERTopic, I get:
>
> DistributionNotFound: The 'pynndescent' distribution was not found and is required by the application
>
> This happens when BERTopic imports UMAP, even though pynndescent shows as installed (v0.5.7). Has anyone successfully used RAPIDS with BERTopic on Colab? If so, how did you do the install?

I was able to make RAPIDS work on Colab simply by installing BERTopic before running the rapidsai-csp-utils scripts.

Alternatively, you could patch _bertopic.py and plotting/_topics.py by changing the import from umap import UMAP to from cuml.manifold import UMAP. Not elegant, but it works :)
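
A related workaround, sketched here under the assumption that BERTopic only needs the UMAP class from the umap package, is to register a stub module before importing BERTopic rather than editing its files:

import sys
import types

from cuml.manifold import UMAP

# Make `from umap import UMAP` resolve to cuML's UMAP by registering
# a stub module under the name "umap" before bertopic is imported
umap_stub = types.ModuleType("umap")
umap_stub.UMAP = UMAP
sys.modules["umap"] = umap_stub

from bertopic import BERTopic  # now picks up cuML's UMAP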

@esettouf

Hi,
I have a follow-up question. I downloaded and ran the rapidsai-csp-utils scripts after installing BERTopic, but I have issues importing BERTopic because of a cffi version mismatch: BERTopic requires version 1.15.0, while rapidsai requires version 1.15.1. I tried (un-)installing the 1.15.0 version but still got an error. Did you encounter similar issues, or do you know how I could fix this?

Exception when importing BERTopic:
Exception: Version mismatch: this is the 'cffi' package version 1.15.1, located in '/usr/local/lib/python3.7/dist-packages/cffi/api.py'. When we import the top-level '_cffi_backend' extension module, we get version 1.15.0, located in '/usr/local/lib/python3.7/dist-packages/_cffi_backend.cpython-37m-x86_64-linux-gnu.so'. The two versions should be equal; check your installation.

@thefonseca

It should work if you run pip uninstall -y cffi followed by pip install cffi. But don't forget to restart the runtime before importing BERTopic.

@PeggyFan

Hi @MaartenGr,

Is it possible to run merge_topics with the cuML implementation?
For one thing, probs is missing from the model when using cuML HDBSCAN,
and I got the following error:

AttributeError                            Traceback (most recent call last)
Input In [32], in <cell line: 1>()
----> 1 topics= topic_model._map_predictions(topic_model.hdbscan_model.labels)
      2 probs = hdbscan.all_points_membership_vectors(topic_model.hdbscan_model)
      3 probs = topic_model._map_probabilities(probs, original_topics=True)

File base.pyx:269, in cuml.common.base.Base.__getattr__()

AttributeError: labels

Thank you.

@MaartenGr
Owner

@PeggyFan In BERTopic v0.12, the merge_topics function should work with models other than the default CPU-based HDBSCAN model. The code you shared seems to be custom code, so I cannot say much about what is happening there.

@emarsc

emarsc commented Oct 27, 2022

The speedup from using cuML for UMAP and HDBSCAN is fantastic! However, I had an issue predicting new instances: an error was thrown when calling the .transform function after instantiating with the cuML HDBSCAN.

This is because the cuML HDBSCAN does not have a predict function, nor is it an instance of hdbscan.HDBSCAN (as pointed out by @beckernick).

The code that causes the issue in .transform:

if isinstance(self.hdbscan_model, hdbscan.HDBSCAN):
    predictions, probabilities = hdbscan.approximate_predict(self.hdbscan_model, umap_embeddings)

    # Calculate probabilities
    if self.calculate_probabilities:
        probabilities = hdbscan.membership_vector(self.hdbscan_model, umap_embeddings)
        logger.info("Calculated probabilities with HDBSCAN")
else:
    predictions = self.hdbscan_model.predict(umap_embeddings)
    probabilities = None

It seems that an approximate_predict function was recently added to cuml.cluster (rapidsai/cuml@cb2d681), so I was able to hack around this by creating a custom HDBSCAN class as follows:

from cuml.cluster import HDBSCAN, approximate_predict
        
class GPUHDBSCAN(HDBSCAN):
    def predict(self, umap_embeddings):
        predictions, probabilities = approximate_predict(self, umap_embeddings)
        return predictions

This gives a predict function and seems to circumvent the issue (as long as you don't need the probabilities of the predictions). Hypothetical usage of the subclass follows; whether cuML's HDBSCAN needs prediction_data=True for approximate_predict is version-dependent, so treat that flag as an assumption:
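
from bertopic import BERTopic
from cuml.manifold import UMAP

umap_model = UMAP(n_components=5, min_dist=0.0)
# prediction_data=True may be required for approximate_predict,
# depending on the cuML version (assumption)
hdbscan_model = GPUHDBSCAN(min_samples=20, prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, _ = topic_model.fit_transform(docs)      # docs: training documents
new_topics, _ = topic_model.transform(new_docs)  # hits GPUHDBSCAN.predict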

... Hopefully this helps anyone experiencing the same problem.

@ldsands

ldsands commented Nov 4, 2022

It looks like cuML's latest release implemented both approximate_predict and all_points_membership_vectors. I'm not sure if it is possible yet, but it would be great to see seamless cuML integration in BERTopic!

@MaartenGr
Owner

@ldsands Thank you for mentioning this. I am indeed already exploring this implementation within BERTopic. There are a few other features that I am currently working on, but I'll let you know as soon as a first draft is online!

@MaartenGr
Owner

A few days ago, v0.13 of BERTopic was released. It implements support for cuML's new features and should work nicely. I'll keep this issue open for any other updates regarding cuML.

@p-dre
Author

p-dre commented Jan 9, 2023

@MaartenGr Thank you very much. Do you plan to update the conda version as well? We had problems installing bertopic via pip on an HPC cluster, but it worked well with conda.

@MaartenGr
Owner

@p-dre My apologies, I keep forgetting to update the conda version! I just merged the updated feedstock, so it should be released soon. If it does not work out, please let me know!

@MaartenGr
Owner

Since cuML is now fully supported in BERTopic, I'll close this issue.
