Probabilities table has topics out of order #1024
Comments
In the docstrings of
This enables the soft clustering capabilities of HDBSCAN as described here. This soft clustering approach is a post-hoc approximation of the cluster membership probabilities, so the estimated probabilities can differ from what the hard cluster assignment suggests. You can read more about that in the link about soft clustering. |
Thanks, I've had a read of that page. That makes sense, but I don't think it answers my query. That documentation states that "The probability value at the ith entry of the vector is the probability that that point is a member of the ith cluster". My point is that this does not hold in my results. If we take the first entry of topic 3 in the table above (index 2942) as an example, it has been assigned to topic number 3 with a probability of 0.306. Therefore that probability value (approximately) should appear in column 3 of the probabilities matrix, but it instead appears in column 0, and the probability in column 3 is much lower. Based on the highest probability in the probabilities matrix, it should have been assigned to topic 0, not 3. (I'm not saying here that the topic assignment is incorrect - the topic assignments look amazing - only that the probability vector ordering is mixed up.) I'm assuming here that each document (that isn't an outlier) is assigned to the topic that it has the highest probability of being a member of, and that the columns of the probability matrix align to the topic numbers. If either of those isn't true, that may be where I'm going wrong. |
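A minimal sketch of the check described in this comment, assuming `topics` is the list of assigned topic numbers and `probs` is the matching list of probability rows (one column per topic). The helper names and the toy values are invented for illustration and are not part of BERTopic. The rows are only indexed and measured with `len`, so a NumPy matrix should work as well as nested lists.

```python
def argmax(row):
    """Return the index of the largest value in one row of probabilities."""
    return max(range(len(row)), key=lambda i: row[i])


def misaligned_documents(topics, probs, outlier_label=-1):
    """List documents whose assigned topic is not the highest-probability column.

    Outliers are skipped because they have no column of their own.
    """
    return [
        doc_id
        for doc_id, (topic, row) in enumerate(zip(topics, probs))
        if topic != outlier_label and argmax(row) != topic
    ]


# Toy values, invented for illustration only.
toy_topics = [0, 3, 1, -1]
toy_probs = [
    [1.0, 0.0, 0.0, 0.0],        # assigned 0, highest in column 0: aligned
    [0.306, 0.01, 0.02, 0.05],   # assigned 3, highest in column 0: misaligned
    [0.1, 0.7, 0.1, 0.1],        # assigned 1, highest in column 1: aligned
    [0.2, 0.2, 0.2, 0.2],        # outlier: ignored
]

mismatches = misaligned_documents(toy_topics, toy_probs)
print(mismatches)
```

A handful of mismatches would not be alarming by itself; the share of mismatching documents per topic is the more useful number to report.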
What I meant with my message is that the soft clustering is merely an approximation and that it may happen that the highest probability will not match a given cluster since it is a post-hoc approximation of the cluster-document probabilities. As such, it is not surprising that the index of the highest probability does not match the assigned topic since it approximates the probabilities which are not inherent to the cluster assignment. |
Ah I see what you're saying. I did misunderstand your initial comment. So HDBSCAN does a hard clustering step to determine the topics and then a soft clustering step to determine the probability of membership, and the two processes are independent? That makes more sense. Still, the results I'm seeing are surprising. Using the dataset I mentioned in #1006 with five very distinct topics, HDBSCAN clusters them near perfectly, but then the probability matrix suggests that all of my documents about cows have a 70-100% chance of belonging to the cluster about immunology, and all of my documents about immunology have a 70-100% chance of belonging to the cluster about rocks. This seems systematic, not simply fuzzy as I might expect if the soft clustering is not an exact representation. I have been looking through the code and I think the issue lies in the
My documents about cows then have a high probability of belonging to the cows cluster; my rocks documents have a high probability of belonging to the rocks cluster, etc. My guess is that when |
Thanks for figuring out where this all might be going wrong. Based on what you're seeing, it would indeed seem that something systematic is at fault here. I would be surprised if it is a result of the BERTopic/bertopic/_bertopic.py Line 376 in 1ee8141
My guess is that the mapping at that step is not working correctly but as far as I know, it has always worked and I am not sure what changed between the last few versions that would explain this. Is the dataset that you mentioned by chance publicly available? If so, would you mind creating a reproducible example of what is happening here? If I can re-create the issue, then perhaps fixing it becomes much easier. |
Certainly! It's not a public dataset but happy to share.
Thanks for taking a look at this |
It took a while but I think I know what is happening. To start off, you are using a nice trick for reducing outliers by specifying the As it so happens, when the topics are sorted by their frequency, the 2D-probabilities get mapped to those topics also. At least, that is what should be happening. Typically, these 2D-probabilities are only generated with HDBSCAN, which almost always produces outliers to some extent. This assumption can be found in the following lines: BERTopic/bertopic/_bertopic.py Lines 3336 to 3338 in 1ee8141
I think, but I am not sure yet, that simply removing |
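To make the frequency-sorting step above concrete, here is a toy sketch in plain Python. It is not BERTopic's code: `frequency_mapping` and `remap_columns` are hypothetical helpers, and the data is invented. It only illustrates the general point that when topic labels are renumbered by size, the probability columns have to be moved with the same mapping, otherwise every document appears to belong to a different topic in a systematic way.

```python
from collections import Counter


def argmax(row):
    """Return the index of the largest value in one row of probabilities."""
    return max(range(len(row)), key=lambda i: row[i])


def frequency_mapping(labels, outlier_label=-1):
    """Map each non-outlier label to its rank by size (0 = largest cluster)."""
    counts = Counter(label for label in labels if label != outlier_label)
    ranked = [label for label, _ in counts.most_common()]
    mapping = {old: new for new, old in enumerate(ranked)}
    mapping[outlier_label] = outlier_label  # outliers keep their label
    return mapping


def remap_columns(probs, mapping, outlier_label=-1):
    """Copy each old column into the position of its new label."""
    n_topics = len(probs[0])
    remapped = [[0.0] * n_topics for _ in probs]
    for old, new in mapping.items():
        if old == outlier_label:
            continue  # outliers have no column
        for row_out, row_in in zip(remapped, probs):
            row_out[new] = row_in[old]
    return remapped


# Toy clustering with no outliers, invented for illustration.
old_labels = [0, 1, 1, 1, 2, 2]
old_probs = [
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.0, 1.0, 0.0],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
    [0.0, 0.1, 0.9],
]

mapping = frequency_mapping(old_labels)
new_labels = [mapping[label] for label in old_labels]
new_probs = remap_columns(old_probs, mapping)

# Columns moved together with the labels: every document lines up.
aligned = [argmax(row) == label for row, label in zip(new_probs, new_labels)]
# Labels renumbered but columns left in the old order: nothing lines up.
stale = [argmax(row) == label for row, label in zip(old_probs, new_labels)]
print(mapping, aligned, stale)
```

In this toy case the stale columns produce a mismatch for every document, which resembles the systematic pattern reported earlier in the thread rather than the occasional disagreement expected from soft clustering.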
Ah interesting. Thanks for tracking it down. And sorry, I thought I had checked to make sure my UMAP settings weren't the culprit but I must've still not been generating any outliers. Apologies if I led you a little astray in that. |
I've been following this thread, very interesting edge case. Just for clarification I believe the |
Yes, I believe it should be removing |
Hi there,
I'm quite confused by the probabilities table produced using calculate_probabilities=True. I think in some cases the topics are all out of order.
I've processed the sample dataset with 5 topics so that the resulting table is easier to interpret, and calculated the embeddings separately so the topic clustering is quicker to rerun. Neither of these steps changes the behaviour I'm seeing.
Resulting topics:
I've joined the probabilities table to the document info table:
I would expect the probability of the chosen topic to appear in the corresponding column of the probabilities table. It appears that it always does if the probability of the chosen topic = 1. If the probability is < 1, it may appear in another column. In the example below, that is usually column 0 (but not always).
(I have cut some unnecessary columns from the screenshot)
I should note that due to the stochastic nature of the process, this behaviour is variable. It is not always column 0 where the probability ends up, and I have occasionally seen the probabilities in the correct columns, so you may need to run it a few times if you don't see it happening. I would say I'm seeing them fail to line up about 4 times out of 5.
bertopic 0.14.0