
How can I use "precomputed" HDBSCAN in BERTopic? #2209

Open
shadabmeymandi opened this issue Nov 11, 2024 Discussed in #2202 · 9 comments

Comments

@shadabmeymandi

shadabmeymandi commented Nov 11, 2024

BERTopic version = 0.16.4

umap_model = umap.UMAP(n_neighbors=15, n_components=24, min_dist=0.0, metric='cosine', random_state=100)
embedding_model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')
hdbscan_model = HDBSCAN(metric='euclidean', min_cluster_size=10, cluster_selection_method='eom', prediction_data=True)
model = BERTopic(hdbscan_model=hdbscan_model, embedding_model=embedding_model, umap_model=umap_model, language="english", calculate_probabilities=True)
topics, probabilities = model.fit_transform(sentecnes)

As you know, when I run the code above there is no problem and everything is okay, but I must use HDBSCAN as below:

hdbscan_model = HDBSCAN(metric='precomputed', min_cluster_size=10, cluster_selection_method='eom', prediction_data=True)
cluster_labels = hdbscan_model.fit_predict(distances)

because I have a pairwise matrix "distances" containing the distances between my embeddings.
With metric='precomputed' I can't run my BERTopic model, and the error is:

ValueError: operands could not be broadcast together with shapes (24,) (24,489)
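For reference, a pairwise matrix like "distances" is typically built from the sentence embeddings beforehand. A minimal sketch with scikit-learn (not code from this thread; the 489 matches the sentence count mentioned later, while the 768-dimensional random embeddings are a made-up stand-in for real sentence embeddings):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(42)
# Stand-in for real sentence embeddings: 489 sentences, 768 dimensions
embeddings = rng.normal(size=(489, 768))

# Square, symmetric matrix of cosine distances between all sentence pairs;
# this is the kind of array HDBSCAN(metric='precomputed') expects.
distances = pairwise_distances(embeddings, metric="cosine")
print(distances.shape)  # (489, 489)
```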

@MaartenGr
Owner

I don't think it's possible to use "precomputed" at the moment, but I'm not entirely sure. You would have to skip the UMAP step with an empty model and perhaps use the distance matrix in place of the embeddings. So perhaps something like this:

from bertopic import BERTopic
from bertopic.dimensionality import BaseDimensionalityReduction

# Fit BERTopic without actually performing any dimensionality reduction
empty_dimensionality_model = BaseDimensionalityReduction()
topic_model = BERTopic(umap_model=empty_dimensionality_model)

# Fit with distance matrix <- Not sure if this works though
topic_model.fit_transform(documents, distance_matrix)

Note: When you create an issue, make sure to fill in the bug report, as information is still missing, such as the version of BERTopic. Reproduction also shows how you can format your code nicely so that it's easy to read, which is missing here.
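For what it's worth, BaseDimensionalityReduction is essentially a pass-through: it leaves whatever it is given untouched, so the cluster model receives the "embeddings" as-is. A rough stand-in illustrating the idea (my own sketch of its role, not BERTopic's actual class):

```python
import numpy as np

class IdentityReduction:
    """Pass-through 'reducer' mimicking the role of
    bertopic.dimensionality.BaseDimensionalityReduction:
    fit() does nothing and transform() returns its input unchanged."""

    def fit(self, X):
        return self

    def transform(self, X):
        return X

# A square distance matrix passed through the empty reducer is unchanged,
# so whatever is fed in as "embeddings" reaches the cluster model as-is.
distance_matrix = np.zeros((489, 489))
reduced = IdentityReduction().fit(distance_matrix).transform(distance_matrix)
print(reduced.shape)  # (489, 489)
```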

@shadabmeymandi
Author

Thank you for your answer, but it doesn't work.

BERTopic version = 0.16.4
When I use:

from bertopic import BERTopic
from bertopic.dimensionality import BaseDimensionalityReduction

# Fit BERTopic without actually performing any dimensionality reduction
empty_dimensionality_model = BaseDimensionalityReduction()
topic_model = BERTopic(umap_model=empty_dimensionality_model, hdbscan_model=hdbscan_model, embedding_model=embedding_model, language="english", calculate_probabilities=True)

topic_model.fit_transform(documents, distance_matrix) 

the error is:
AttributeError: No prediction data was generated

and when I use:
topic_model.fit_transform(documents)
instead of
topic_model.fit_transform(documents, distance_matrix)

the error is:
ValueError: operands could not be broadcast together with shapes (384,) (384,489)

also when I use:
topic_model.fit_transform(distance_matrix)

the error is:
TypeError: Make sure that the iterable only contains strings.

@MaartenGr
Owner

the error is :
AttributeError: No prediction data was generated

Could you share the full error log? Without it, I have no clue which lines of code this refers to.

@shadabmeymandi
Author

Sure, here you are:
distance_matrix is a (489, 489) matrix,
and "sentecnes" is an array of 489 sentences.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
[<ipython-input-109-c6760120d626>](https://localhost:8080/#) in <cell line: 2>()
      1 # Fit with distance matrix <- Not sure if this works though
----> 2 topic_model.fit_transform(sentecnes,distance_matrix)

4 frames
[/usr/local/lib/python3.10/dist-packages/hdbscan/hdbscan_.py](https://localhost:8080/#) in prediction_data_(self)
   1432     def prediction_data_(self):
   1433         if self._prediction_data is None:
-> 1434             raise AttributeError("No prediction data was generated")
   1435         else:
   1436             return self._prediction_data

AttributeError: No prediction data was generated

@MaartenGr
Owner

Thank you for sharing. That is not the full error log, though. Note the 4 frames; you will need to click on that to see the full log.

@shadabmeymandi
Copy link
Author

The full error log :

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
[<ipython-input-109-c211ff22e61f>](https://localhost:8080/#) in <cell line: 2>()
      1 # # Fit with distance matrix <- Not sure if this works though
----> 2 topic_model.fit_transform(sentecnes,distance_matrix)

4 frames
[/usr/local/lib/python3.10/dist-packages/bertopic/_bertopic.py](https://localhost:8080/#) in fit_transform(self, documents, embeddings, images, y)
    461         if len(documents) > 0:
    462             # Cluster reduced embeddings
--> 463             documents, probabilities = self._cluster_embeddings(umap_embeddings, documents, y=y)
    464             if self._is_zeroshot() and len(assigned_documents) > 0:
    465                 documents, embeddings = self._combine_zeroshot_topics(

[/usr/local/lib/python3.10/dist-packages/bertopic/_bertopic.py](https://localhost:8080/#) in _cluster_embeddings(self, umap_embeddings, documents, partial_fit, y)
   3793 
   3794             if self.calculate_probabilities and is_supported_hdbscan(self.hdbscan_model):
-> 3795                 probabilities = hdbscan_delegator(self.hdbscan_model, "all_points_membership_vectors")
   3796 
   3797         if not partial_fit:

[/usr/local/lib/python3.10/dist-packages/bertopic/cluster/_utils.py](https://localhost:8080/#) in hdbscan_delegator(model, func, embeddings)
     35     if func == "all_points_membership_vectors":
     36         if isinstance(model, hdbscan.HDBSCAN):
---> 37             return hdbscan.all_points_membership_vectors(model)
     38 
     39         str_type_model = str(type(model)).lower()

[/usr/local/lib/python3.10/dist-packages/hdbscan/prediction.py](https://localhost:8080/#) in all_points_membership_vectors(clusterer)
    646     """
    647     clusters = np.array(sorted(list(clusterer.condensed_tree_._select_clusters()))).astype(np.intp)
--> 648     all_points = clusterer.prediction_data_.raw_data
    649 
    650     # When no clusters found, return array of 0's

[/usr/local/lib/python3.10/dist-packages/hdbscan/hdbscan_.py](https://localhost:8080/#) in prediction_data_(self)
   1432     def prediction_data_(self):
   1433         if self._prediction_data is None:
-> 1434             raise AttributeError("No prediction data was generated")
   1435         else:
   1436             return self._prediction_data

AttributeError: No prediction data was generated

@MaartenGr
Owner

Hmmm, not sure why it states that no prediction data was generated. I assume you still had prediction_data=True, right? You could also try it without calculate_probabilities=True and calculate those later.

@shadabmeymandi
Copy link
Author

shadabmeymandi commented Nov 23, 2024

No difference whether I had prediction_data=True or not, but without calculate_probabilities=True the problem was solved and I built my BERTopic model. Unfortunately, this is a model without the UMAP step!
Actually I need all the steps, so do you have any suggestion for how I can calculate the topic coherence score under these circumstances:

  1. My model comes from:
hdbscan_model_1 = HDBSCAN(metric='euclidean', min_cluster_size=5, cluster_selection_method='eom', prediction_data=True)
model = BERTopic(hdbscan_model=hdbscan_model_1, embedding_model=embedding_model, umap_model=umap_model, language="english", calculate_probabilities=True)
_, _ = model.fit_transform(documents)
  2. My topics come from:
hdbscan_model_2 = HDBSCAN(metric='precomputed', min_cluster_size=5, cluster_selection_method='eom', prediction_data=True)
topics = hdbscan_model_2.fit_predict(distance_matrix)

In other words, how can I combine 1 and 2 to calculate my topic coherence score?

@MaartenGr
Owner

In other words, how can I compound 1 and 2 for calculating my topic coherence score ?

You can't. That's not how this works, since you would effectively have two different sets of clusters being generated.

Instead, what you might be able to do is something like this:

model = HDBSCAN(
    metric='precomputed',
    min_cluster_size=5,
    cluster_selection_method='eom',
    prediction_data=True
)

class ClusterModel:
    def fit(self, embeddings):
        # Create the distance matrix from the embeddings
        distance_matrix = ...

        # Fit the precomputed-metric model and expose the labels
        self.hdbscan_model = model.fit(distance_matrix)
        self.labels_ = self.hdbscan_model.labels_
        return self

    def predict(self, embeddings):
        # Create the distance matrix
        distance_matrix = ...
        return self.hdbscan_model.predict(distance_matrix)

and then pass that as the cluster model instead:

hdbscan_model = ClusterModel()
topic_model = BERTopic(hdbscan_model=hdbscan_model, umap_model=umap_model)

I haven't checked it myself, but this gives you a basic template to work from.
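To make the template above concrete, here is a runnable sketch of the same wrapper pattern: BERTopic calls fit(embeddings), the wrapper builds the distance matrix internally and exposes labels_. Since hdbscan may not be installed here, the demo uses scikit-learn's DBSCAN with metric='precomputed' as a stand-in clusterer; with hdbscan available you would pass HDBSCAN(metric='precomputed', ...) instead. The cosine-distance choice and the toy data are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

class ClusterModel:
    """Wrap any clusterer that expects metric='precomputed' behind the
    fit(embeddings) interface BERTopic calls, building the distance
    matrix internally and exposing labels_ afterwards."""

    def __init__(self, clusterer, metric="cosine"):
        self.clusterer = clusterer
        self.metric = metric

    def fit(self, embeddings):
        # Turn the embeddings handed to us into a square distance matrix
        distance_matrix = pairwise_distances(embeddings, metric=self.metric)
        # Fit the precomputed-metric clusterer and expose the labels
        self.labels_ = self.clusterer.fit_predict(distance_matrix)
        return self

# Toy demo: two well-separated groups of points
rng = np.random.default_rng(0)
group_a = rng.normal(loc=1.0, scale=0.05, size=(20, 8))
group_b = rng.normal(loc=-1.0, scale=0.05, size=(20, 8))
X = np.vstack([group_a, group_b])

# DBSCAN stands in for HDBSCAN(metric='precomputed', ...) here
cluster_model = ClusterModel(DBSCAN(eps=0.5, min_samples=5, metric="precomputed"))
cluster_model.fit(X)
print(len(set(cluster_model.labels_)))  # 2
```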
