How can I use "precomputed" HDBSCAN in BERTopic? #2209
Comments
I don't think it's possible to use "precomputed" at the moment, but I'm not entirely sure. You would have to skip the UMAP step with an empty model and perhaps use the distance matrix in place of the embeddings. So perhaps something like this:

from bertopic import BERTopic
from bertopic.dimensionality import BaseDimensionalityReduction

# Fit BERTopic without actually performing any dimensionality reduction
empty_dimensionality_model = BaseDimensionalityReduction()
topic_model = BERTopic(umap_model=empty_dimensionality_model)

# Fit with distance matrix <- Not sure if this works though
topic_model.fit_transform(documents, distance_matrix)

Note: When you create an issue, make sure to fill in the bug report template, as information is still missing, such as the version of BERTopic. The Reproduction section also shows how to format your code so that it's easy to read, which is missing here. |
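For reference, the `distance_matrix` mentioned above could be built from document embeddings along these lines. This is a minimal sketch using NumPy only; the function name `cosine_distance_matrix` and the choice of cosine distance are assumptions for illustration, not something the thread specifies.

```python
import numpy as np


def cosine_distance_matrix(embeddings):
    """Return the (n, n) pairwise cosine distance matrix for n embeddings.

    Hypothetical helper: the thread does not say which distance is used.
    """
    X = np.asarray(embeddings, dtype=np.float64)
    # Normalise each row to unit length; leave all-zero rows untouched.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    X = X / norms
    # Cosine distance = 1 - cosine similarity.
    D = 1.0 - X @ X.T
    # Remove small negative values caused by floating-point rounding.
    np.clip(D, 0.0, 2.0, out=D)
    # The distance from a point to itself is zero by definition.
    np.fill_diagonal(D, 0.0)
    return D
```

The result has one row and one column per document, which is the shape a `metric='precomputed'` clusterer expects.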
Thank you for your answer, but it doesn't work. BERTopic version = 0.16.4
The error is: and when I use: the error is: also when I use: the error is: |
Could you share the full error log? Without it, I have no clue which lines of code this refers to. |
Sure, here you are:
|
Thank you for sharing. That is not the full error log though. Note those |
The full error log:
|
Hmmm, not sure why it states that no prediction data was generated. I can assume you still had |
No difference when I had
In other words, how can I combine 1 and 2 to calculate my topic coherence score? |
You can't. That's not how this works, since you would effectively have two different sets of clusters being generated. Instead, what you might be able to do is something like this:

from hdbscan import HDBSCAN  # assuming the hdbscan package

model = HDBSCAN(
    metric='precomputed',
    min_cluster_size=5,
    cluster_selection_method='eom',
    prediction_data=True
)

class ClusterModel:
    def fit(self, embeddings):
        self.hdbscan_model = model
        # create distance matrix
        distance_matrix = ...
        # Fit the model on the distance matrix, not on the raw embeddings
        self.labels_ = self.hdbscan_model.fit_predict(distance_matrix)
        return self

    def predict(self, embeddings):
        # create distance matrix
        distance_matrix = ...
        # NOTE: HDBSCAN itself has no predict() method, so this call is a placeholder
        return self.hdbscan_model.predict(distance_matrix)

and then pass that as the cluster model instead:

hdbscan_model = ClusterModel()
topic_model = BERTopic(hdbscan_model=hdbscan_model, umap_model=umap_model)

I haven't checked it myself, but now you have the basic template to work out. |
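The template above could be fleshed out roughly as follows. This is a sketch under stated assumptions, not a tested BERTopic recipe: the class name `PrecomputedClusterModel`, the `distance_fn(A, B)` signature, and the nearest-neighbour fallback in `predict` are all hypothetical choices made here. The fallback is a swapped-in heuristic, since a clusterer fitted on a precomputed matrix has no built-in way to place unseen points.

```python
import numpy as np


class PrecomputedClusterModel:
    """Hypothetical wrapper: converts embeddings to distances before clustering.

    `clusterer` is any object with fit(distance_matrix) and a labels_ attribute,
    for example HDBSCAN(metric='precomputed').
    `distance_fn(A, B)` must return the (len(A), len(B)) distance matrix.
    """

    def __init__(self, clusterer, distance_fn):
        self.clusterer = clusterer
        self.distance_fn = distance_fn

    def fit(self, embeddings, y=None):
        X = np.asarray(embeddings, dtype=np.float64)
        distance_matrix = np.asarray(self.distance_fn(X, X), dtype=np.float64)
        n = X.shape[0]
        if distance_matrix.shape != (n, n):
            raise ValueError(
                f"expected a ({n}, {n}) distance matrix, got {distance_matrix.shape}"
            )
        # The clusterer only ever sees distances, never the raw embeddings.
        self.clusterer.fit(distance_matrix)
        self.labels_ = np.asarray(self.clusterer.labels_)
        self._train_embeddings = X
        return self

    def predict(self, embeddings):
        # Assumption: give each new point the label of its nearest training point.
        X_new = np.asarray(embeddings, dtype=np.float64)
        cross = np.asarray(self.distance_fn(X_new, self._train_embeddings))
        return self.labels_[cross.argmin(axis=1)]
```

One way to use it would be `BERTopic(hdbscan_model=PrecomputedClusterModel(model, pairwise_distances))`, with `pairwise_distances` imported from `sklearn.metrics`. Because `fit` exposes `labels_` and the class has `predict`, it matches the interface BERTopic documents for custom cluster models.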
BERTopic version = 0.16.4
As you know, when I run the above code there is no problem and everything is OK, but I must use HDBSCAN as below:
because I have a pairwise matrix "distances[ ]" holding the distances between my embeddings.
With metric='precomputed' I can't run my BERTopic model, and the error is:
ValueError: operands could not be broadcast together with shapes (24,) (24,489)
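The thread does not show where inside BERTopic this error is raised, so the following only illustrates the NumPy rule behind the message and a pre-check that could be run on the matrix first. Both helper names, `shapes_broadcast` and `check_distance_matrix`, are hypothetical.

```python
import numpy as np


def shapes_broadcast(shape_a, shape_b):
    """Return True if arrays of these two shapes can be broadcast together."""
    try:
        np.broadcast(np.zeros(shape_a), np.zeros(shape_b))
    except ValueError:
        return False
    return True


def check_distance_matrix(distances, n_documents):
    """Validate a precomputed distance matrix before passing it to a clusterer."""
    D = np.asarray(distances, dtype=np.float64)
    if D.ndim != 2 or D.shape[0] != D.shape[1]:
        raise ValueError(f"expected a square matrix, got shape {D.shape}")
    if D.shape[0] != n_documents:
        raise ValueError(
            f"matrix covers {D.shape[0]} items but there are {n_documents} documents"
        )
    if (D < 0).any():
        raise ValueError("distances must be non-negative")
    if not np.allclose(D, D.T):
        raise ValueError("distance matrix must be symmetric")
    return D
```

NumPy aligns shapes from the last axis backwards, so `(24,)` is compared against the trailing `489` of `(24, 489)` and fails, whereas `(489,)` or `(24, 1)` would broadcast. A valid precomputed matrix for n documents is always `(n, n)`.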