IndexError on visualize_heatmap() when all outliers are removed by reduce_outliers() and update_topics() #1455
Comments
Although I think the solution you propose makes sense, I am not entirely sure how that would affect other parts of the topic model. There are quite a few places where I think this needs to be extensively tested across a couple of use cases where |
Yes, I agree with you. However, BERTopic/bertopic/_bertopic.py Lines 3231 to 3234 in 37064e2
I am not entirely sure about this. It is not so much about whether
Here, it is safe to do this because this is the very first place where outliers can actually be generated. After that, attributes are updated depending on whether outliers exist. These updated attributes should be checked if we are going to set |
I understand your concerns. Indeed, it seems necessary to carefully check the impact that the modification would have. I have performed unit testing after making this update, and no errors were found. However, I am also unaware of the internal effects that this fix is having. Unfortunately, since I am not familiar with this library, further checks seem difficult. By the way, is it unusual behavior for all outliers to be eliminated by |
Unfortunately, I do not think that the unit tests cover that specific case, so it can easily be missed. Feel free to open the PR, and if I can find some time in the upcoming weeks, I can manually check what needs to be checked.
Kinda a poor interface on my part. I figured that users would manually tweak the threshold parameter to reduce some outliers but not all. The idea was that if users wanted to remove all outliers, they would not use HDBSCAN at all since there are other clustering algorithms out there that do not assume outliers and might perform better. So yes, this should definitely be addressed since setting a default threshold parameter is not feasible due to the different ranges of similarity scores across the strategies. If you want, a PR with the fix of setting |
Thank you. I have created a Pull Request for this fix. (#1466)
I encountered an IndexError with the visualization of heatmap. Here's the sequence of steps I took and the resulting error.
I instantiated a BERTopic model and ran the fit_transform() method as follows:
Subsequently, I executed the following code:
After that, when I used topic_model.visualize_heatmap(), I encountered this IndexError:
I suspect the problem originates from topic_model._outliers. Before reducing outliers, the index was off by one compared to the topic number, as shown below (where the first column represents the index and the second column represents the topic number):
After outlier reduction, it appears that all outliers were removed, so the index and the topic number became identical, as shown below:
However, topic_model._outliers returned 1 in both cases. In the visualize_heatmap() method of _heatmap.py, the embeddings array is sliced using topic_model._outliers, like this: embeddings = np.array(topic_model.topic_embeddings_)[topic_model._outliers:]
Both patterns produce an indices array like [ 0 1 2 3 ... 17 18]. The shape of embeddings varies depending on the slice. With outliers, embeddings.shape is (19, 768), but without outliers, it's (18, 768) because the slice still drops the first row. This discrepancy causes the IndexError in embeddings = embeddings[indices], since index 18 is out of bounds for an array with 18 rows.
I believe topic_model._outliers should be updated to 0 when all outliers are removed. A potential fix could involve adding the following line to the update_topics() method in _bertopic.py:
I've tested this fix for my specific error and it seemed to resolve the issue. However, I haven't tested other parts of the library, so I'm unsure whether this is the right solution. If it's appropriate, I'm willing to submit a PR.
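The off-by-one described above can be reproduced without BERTopic at all. The following is a minimal NumPy sketch under stated assumptions, not the library's actual code: the helper names outlier_offset and slice_topic_embeddings are hypothetical, the embedding dimension is shrunk from 768 to 8, and the offset rule (1 if topic -1 is present, else 0) is only one plausible way to express the proposed fix.

```python
import numpy as np


def outlier_offset(topics):
    """Hypothetical helper: 1 if the outlier topic -1 is present, else 0."""
    return 1 if -1 in set(topics) else 0


def slice_topic_embeddings(topic_embeddings, offset, indices):
    """Mimic the slicing from the report: drop `offset` leading rows, then select `indices`."""
    embeddings = np.asarray(topic_embeddings)[offset:]
    return embeddings[indices]


dim = 8                  # toy dimension (assumption); the report used 768
indices = np.arange(19)  # one index per topic 0..18

# Before outlier reduction: one row per topic -1, 0, ..., 18 (20 rows).
# Dropping the first row (the outlier topic) leaves 19 rows, so indexing works.
with_outliers = np.zeros((20, dim))
before = slice_topic_embeddings(with_outliers, 1, indices)

# After outlier reduction: one row per topic 0..18 (19 rows), but the offset is
# still 1. The slice leaves 18 rows, so index 18 is out of bounds.
without_outliers = np.zeros((19, dim))
try:
    slice_topic_embeddings(without_outliers, 1, indices)
    stale_offset_failed = False
except IndexError:
    stale_offset_failed = True

# Recomputing the offset from the updated topic assignments avoids the error.
topics_after = list(range(19))  # no -1 left after reduction
after = slice_topic_embeddings(
    without_outliers, outlier_offset(topics_after), indices
)
```

Whether recomputing the offset inside update_topics() is safe for the rest of the model is exactly the open question raised in the comments; this sketch only covers the slicing step.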