# Issues with Zero-shot Topic Modeling regarding outliers and future operations #1967
@ianrandman Thanks for sharing the issues extensively and a bunch of great solutions! I haven't read through them all yet, but I wanted to let you know that this is on my radar. Hopefully somewhere next week I'll find a moment to read this properly (as it definitely deserves proper attention!). What I can share right now is that I intend to release a hotfix (v0.16.2) soon, when some things on my side slow down a bit, before I delve into any major new updates like these.
Being able to use `topics_over_time` with zero-shot topics would indeed be a very welcome addition.
Indeed, the outlier topic should have made its way into the merged model, which caused some issues. The pull request (and v0.16.2) should fix that, but as we will discuss below, there may be better solutions.
I am not sure whether I agree that the documents of the zero-shot topics should be included in the general c-TF-IDF process, as that would require a third manual model that needs the topics of both the zero-shot and the regular model in order to correctly generate the representations. Having said that, let's go over the proposed solutions and see what's what:
Although this should be fixed in the new release, having such a check could be worthwhile in case this happens more often. I would want to prevent having too much code though, since the repo is already getting quite big for a sole maintainer (me).
Those should not be added since that would make
I believe this should be fixed with the new release, foregoing any issues with respect to the ordering.
This is rather tricky, since we want to prevent calculating too much on the one hand and having too much manual work on the other. The easiest solution is to introduce a third model that uses the topics generated from the first two models in order to easily generate the c-TF-IDF representations, topic embeddings, etc. without the need for more manual work. However, that would mean that all representation models are going to be run twice, which is definitely not preferred. An option would be to find the zero-shot topics, calculate the clusters, and then create a single manual BERTopic model using the zero-shot topics and clusters. That way, there is very minimal work needed and representations only need to run once.
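That combination step could be sketched roughly like this (a hypothetical helper, not BERTopic code; the function name, argument names, and the ID layout are assumptions — zero-shot topics keep IDs `0..n_zeroshot-1`, cluster IDs are offset to follow them, and outliers stay at `-1`):

```python
def combine_topics(zeroshot_assigned, cluster_topics, n_zeroshot):
    """Merge per-document zero-shot assignments with cluster assignments.

    zeroshot_assigned: one entry per document; a zero-shot topic ID, or -1 if
                       the document was left for clustering.
    cluster_topics:    cluster labels (with -1 for outliers) for exactly the
                       documents marked -1 above, in order.
    n_zeroshot:        number of matched zero-shot topics (the ID offset).
    """
    it = iter(cluster_topics)
    out = []
    for t in zeroshot_assigned:
        if t != -1:
            out.append(t)  # keep the zero-shot topic ID as-is
        else:
            c = next(it)
            out.append(-1 if c == -1 else n_zeroshot + c)  # offset cluster IDs
    return out
```

With a single topic array like this, representations only need to be computed once over all documents.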
With zero-shot topics, representations should be run only once, not twice, since the merging procedure does nothing more than simply merge the models without recalculating them.
Ah! Going in chronological order with answering seems to have its downsides as, if I'm not mistaken, that's the same solution I presented above: calculate zero-shot topics and clusters without the representations and then use both in a manual BERTopic model, thereby simplifying the process and still allowing for a single calculation of the representations whilst having c-TF-IDF representations of the topics. Right?
I think I answered most of this although I would almost think about going into a call with this many changes to make sure we are all on the same page. But that depends on how you see your role in this.
I think nothing except the few mentions above. Mostly, I think it's important to minimize any complexity going forward, since that has been the main source of issues. Wherever we can simplify things, that would be preferred.
At the moment, I can't think of any reason not to do this. At the time it seemed like an elegant solution to the approach but this is indeed more efficient.
The code base is already fairly large, and although I do not mind adding to it, it needs to be done from a "we are going to simplify things" perspective. Your suggestions are great. I understand the bigger picture you are sketching and might still need to deep-dive into things, but as it stands now I think it is mainly the replacement procedure that matters. What do you think?
Hi Maarten, I have made the changes discussed during our call and have run into a handful of minor issues I hope you can provide insight on. Some are directly related to zero-shot; some are not.

### Probabilities assignments

We agreed to use similarities in place of probabilities, but they are not perfectly aligned with one another. The second problem that I believe should be addressed happens during the normalization of those probabilities.

### Checking for zero-shot

We agreed to replace the HDBSCAN model after the zero-shot work (see `BERTopic/bertopic/_bertopic.py`, lines 3599 to 3608 at `ef0b9a9`).
Thoughts on how to handle this check for the HDBSCAN model? Or an alternative suggestion for the assignment of the model?

### Topic reduction

When a model is zero-shot, topic reduction can result in multiple zero-shot topics being mapped to a single new topic. A design decision is needed for how to handle these topic labels. I propose concatenation, joined with a separator. This step also jumbles the ordering of topics, so new topics that include at least one original zero-shot topic are not guaranteed to come before other topics. I could add some logic to realign the ordering with this rule, but it is a bit tedious with all the bits that must be changed.

On a side note, I think there is a variable naming issue in the topic reduction function (see `BERTopic/bertopic/_bertopic.py`, lines 4169 to 4172 at `ef0b9a9`).
This includes a mapping from the new topic ID to the old topic IDs that were reduced to the new topic. Therefore, I believe the variable names should be swapped.

### Custom labels
```python
if self.custom_labels_ is not None:
    if len(self.custom_labels_) == len(info):
        labels = {topic - self._outliers: label for topic, label in enumerate(self.custom_labels_)}
        info["CustomName"] = info["Topic"].map(labels)
```
or `update_topics()` should handle this somehow. Thoughts?
They are indeed not perfectly aligned with one another. With flat probabilities/similarities, this is not an issue as we can simply give back the similarity of the assigned topic. With topic-doc probabilities/similarities, I'm not sure how to solve that issue without attempting to create a separate model (i.e., logistic regression) that is trained (and perhaps even overfitted) on the embeddings (X) and assigned topic (y) to extract hopefully more accurate probabilities. It is something interesting to think about but indeed, let's ignore it for now as it might be outside of the scope of this PR.
Yeah, that indeed makes more sense. It was there originally when using the HDBSCAN probabilities (which kinda sum to 1) but that does not work here. I believe softmax would work here to create these distributions of probabilities. It has been suggested before and I think there is no harm in trying it out here. It might require some manual checking to see whether the distributions make sense but I believe this is common practice.
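A rough sketch of that suggestion (plain NumPy, not the library's implementation): applying a numerically stable softmax to each document's topic similarities yields per-document distributions that sum to one.

```python
import numpy as np

def similarities_to_probabilities(similarities: np.ndarray) -> np.ndarray:
    """Turn an (n_docs, n_topics) cosine-similarity matrix into row-wise
    probability distributions via a numerically stable softmax."""
    shifted = similarities - similarities.max(axis=1, keepdims=True)  # avoid overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)
```

As noted, the resulting distributions would still need a sanity check against the raw similarities, since softmax compresses differences between close scores.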
Hmmm, we could replace the HDBSCAN model after doing all of the zero-shot work, but I can imagine it would be odd to have a single line somewhere at the end doing so.
This is a difficult one. Although I generally agree with concatenation, it might merge dozens of topics together which results in a rather strange topic representation. The thing is, if you merge a relatively small zero-shot topic with a large non-zero-shot topic, then from a theoretical perspective the topic representation should be geared more towards the large non-zero-shot topic and not the smaller. In many cases, I believe these merged topics should then get entirely new labels which are calculated through the topic representations.
Personally, I would skip that for now. It seems like a lot of work keeping track of the order for something that many users are unlikely to be using.
If I'm not mistaken, the functionality is correct but merely the names are incorrect? If so, then they should indeed be renamed.
Oh, that's definitely a bug then!
I agree with leaving discussions on probabilities to the future. Some more research shows there may be some better ways than softmax for treating cosine similarities as probabilities.
I think this is also best left for a future PR.
Agreed.
Yes, functionality is correct. Will make this change.
Bit confused here. I thought your suggestion during the call was to replace the HDBSCAN model (see `BERTopic/bertopic/_bertopic.py`, lines 521 to 529 at `0a28916`).
I am not sure of the need for that check.
After some thought and discussion, I believe there is a good compromise. If the original topics that make up the new topic all came from clustering, proceed as normal (i.e., generate a name using the representation model(s)). If one or more of the original topics are zero-shot, calculate the cosine similarity of all the documents in the new cluster (i.e., the topic embedding) with each of the zero-shot labels from the original documents, and choose the best-fitting label if it meets the similarity threshold.
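That compromise might look roughly like this (a hypothetical sketch, not the PR's code; the function name and `min_similarity`, standing in for the threshold discussed here, are assumptions):

```python
import numpy as np

def pick_zeroshot_label(topic_embedding, label_embeddings, labels, min_similarity=0.7):
    """Return the best-fitting zero-shot label for a merged topic, or None to
    fall back to representation-based naming."""
    topic = topic_embedding / np.linalg.norm(topic_embedding)
    cands = label_embeddings / np.linalg.norm(label_embeddings, axis=1, keepdims=True)
    sims = cands @ topic  # cosine similarity of each label to the topic embedding
    best = int(sims.argmax())
    return labels[best] if sims[best] >= min_similarity else None
```

Returning `None` here makes the fallback explicit: a merged topic dominated by non-zero-shot documents would fail the threshold and get an entirely new representation-based label, matching the theoretical concern above.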
That certainly sounds reasonable!
Yes, in my changes, that is the case.
Hmmm, we could either remove that check or add a parameter for it.
I am using zero-shot topic modeling and have run into some issues (a subset of which has already been brought up in past issues), plus I have some concerns about the correctness of the resulting topic model. I have two strategies in mind to resolve these, one of which I have already mostly written as a patch to the process for my project.
My flow is roughly:

```python
topic_model = BERTopic(...)
topic_model.fit(...)
topic_model.reduce_topics(...)
topic_model.reduce_outliers(...)
topic_model.update_topics(...)
```
## The Problems

Let's consider the case that some zero-shot topics have matched with some of the documents. The next step is to continue with clustering and normal topic modeling for the remaining documents in `self`. In `_combine_zeroshot_topics`, a dummy model for the zero-shot topics is made and fitted to generate topic representations. This is merged with `self` (which currently represents the topic model for the un-assigned documents that went on to be clustered) using `BERTopic.merge_models` to create `merged_model`. Some cleanup is done before attributes in `self` are replaced with those in `merged_model`.

The final topics are `0` through `#matching_zeroshot_topics - 1` and `#matching_zeroshot_topics` through `#matching_zeroshot_topics + #clusters - 1`. The outlier topic is now treated as a normal topic with ID `#matching_zeroshot_topics`.

### Outliers
The first issue is the loss of the outlier topic. The `merged_model._outliers` is `0`, which makes its way into `self._outliers`. This attribute is now wrong, and it will no longer correctly be used in future operations as an offset to exclude the outlier topic.

The `-1` topic is now gone, so we have lost any meaningful reference to outliers existing. Future operations look at topic `-1` for identifying outlier documents. This issue is outlined in #1957.

If we were to call `reduce_outliers` on a model without outliers, there is no validation. I have not looked into other strategies, but for the `c-tf-idf` strategy, the function will fail on `bow_doc = self.vectorizer_model.transform(outlier_docs)` because `outlier_docs` is empty. A subset of the `reduce_outliers` issue is described in #1771.

### `self.c_tf_idf_`
Any future operation that references the `self.c_tf_idf_` matrix no longer works, because `self.c_tf_idf_ == None`, which is a carry-over from the `merged_model`. In my workflow, the `self.c_tf_idf_` matrix will be set correctly if I `reduce_topics`, because the final step there is to extract the topic representations, which includes fitting `self.vectorizer_model` and `self.ctfidf_model` and setting `self.c_tf_idf_`.

### Configuration-related attributes
We have now seen outliers and the c-TF-IDF matrix not persist through the merge. Any other configuration-related attributes that may be important later are also not persisted. In my case, the relevant ones are `self.zeroshot_topic_list`, `self.vectorizer_model`, and `self.ctfidf_model`, but there are many more. These get reset to their defaults. This especially becomes a problem if representations are ever updated again (which they are a couple more times in my workflow), because important properties like `reduce_frequent_words=True` in the `ClassTfidfTransformer` are no longer set. This is a direct consequence of the attribute replacement in `_combine_zeroshot_topics`.

### Correctness of representations
The representations in the `merged_model`, which are copied over into `self`, are effectively a concatenation of the representations from the `zeroshot_model` and `self` (i.e., the clustered model). These representations are calculated in part with the `c_tf_idf_` matrix, which is produced in part from the size of the vocabulary. The problem is that the vocabularies in `zeroshot_model` and `self` are only subsets of the real vocabulary, which includes all documents. As a user, I expect my zero-shot topics to be used at the very start as a sort of filtering step but mostly not to have any effects later on (except the zero-shot topic labels persisting). This implies I expect representations to be calculated the way they would normally be, which is to use all documents.

While the representations calculated in each model separately are not necessarily wrong, I would argue it would be more correct for the representations to be calculated considering all documents. That is, I could call `self._extract_topics(...)` after fitting, and the resulting representations (barring the special handling of zero-shot topic labels) should not change.

### `topic_embeddings_`
For this one, I am not 100% sure yet whether it is an issue. I would expect that upon merging the `zeroshot_model` and `self`, the `merged_model.topic_embeddings_` should essentially just be a concatenation of the two sets of topic embeddings (plus maybe some quirks relating to where that outlier topic goes). In my debugging, it appears that the topic embeddings from `zeroshot_model.topic_embeddings_` do not change, but the ones for the topics from clustering (from `self.topic_embeddings_`) do change, and the cosine similarity of those with the ones in the new `merged_model` is not even high. Perhaps I am missing something when doing this analysis, or perhaps `BERTopic.merge_models` truly is not preserving topic embeddings, which I believe is wrong, as topic embeddings are only a function of the documents in each topic (and their embeddings) in this case. Maybe it's related to the `mappings` parameter in `_create_topic_vectors`? Still debugging this one...

## Proposed Solutions
### `reduce_outliers`

I discussed a lack of validation for the existence of outliers, leading to unclear errors. This function should start with a validation check that looks for `-1` in `self.topics_` (or really any attribute that has a mapping from topic IDs); a more explicit solution is to check the value of `self._outliers`. This private attribute can fall out of sync with reality, so I propose deriving it from one of the other attributes.
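A derived check along those lines could be as simple as (a sketch with a hypothetical helper name; `topics_` is the per-document topic list):

```python
def derive_outliers(topics_):
    """Derive the outlier flag from the topic assignments themselves
    instead of trusting the cached private attribute."""
    return 1 if -1 in set(topics_) else 0
```

Because this recomputes the flag from `topics_` on demand, it cannot fall out of sync when `topics_` is reassigned directly.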
One of the specific reasons I propose this is that in my flow, I calculate new topic IDs for all of my documents using `topic_model.reduce_outliers(...)`. This does not change the model at all, so I set `topic_model.topics_ = topic_model.reduce_outliers(...)`. I will later call `topic_model.update_topics(...)` to update the representations (which also sets the `self._outliers` attribute), but if I look for whether the `topic_model` still has outliers between these operations, `topic_model._outliers` is no longer a reliable source of information.

### `_combine_zeroshot_topics`
This is where the bulk of my fixes through patching have been for the previously described problems.
#### Persisting of attributes from `self`

I described why I believe configuration-related attributes should not get cleared out. Obviously, (the majority of) attributes that were affected by fitting should be overridden in `self` by those in `merged_model`. I have also added `umap_model` and `hdbscan_model` as attributes that should come from `merged_model`, but I am less sure about the implications of these.

I copy all attributes that are not these into a dictionary. The end of the function is changed to keep those configuration-related properties. The `vectorizer_model` and `ctfidf_model` are kept for their configurations, even though they were only fitted to the clustered documents. This could be problematic. The result of this patch includes `self._outliers` being set properly.

#### Missing Outlier Topic
I have described that the resulting model has the zero-shot topics starting at ID `0`, followed by the clustered topics, including the outlier topic. Right after the merge happens, the outlier topic is actually in the correct spot with ID `-1`, but a gap exists after the zero-shot topics for where it will be moved to. I let this and the related operations on the representation-related attributes run.

If `self._outliers` is set to 1, I manually move the outlier topic back to ID `-1`. This involves three operations. For each of the affected mapping attributes, a new mapping from `-1` to the info of the outlier topic (which is at ID `#matching_zeroshot_topics`) is made. The mappings at `#matching_zeroshot_topics + 1` and above (all of the non-outlier topics that came from clustering) are shifted by 1 to a lower topic number. After the shift, the last mapping (ID `#matching_zeroshot_topics + #clusters - 1`) is deleted.

The same concept is implemented for `merged_model.topics_` (the topic assignment for each document) but through the use of `numpy` operations before it is converted back to a `list`.
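The `numpy` remapping can be illustrated like this (hypothetical example values, not the patch itself; `n_zeroshot` is the number of matched zero-shot topics, so the misplaced outlier topic sits at that ID):

```python
import numpy as np

n_zeroshot = 3  # example: three zero-shot topics matched
topics = np.array([0, 1, 2, 3, 4, 5, 3])  # merged per-document assignments

remapped = topics.copy()
remapped[topics == n_zeroshot] = -1  # move the outlier topic back to -1
remapped[topics > n_zeroshot] -= 1   # shift clustered topics down to close the gap
topics_ = remapped.tolist()
```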
For an implementation that I think would be more appropriate to merge into this repo, I think `BERTopic.merge_models` should be changed to get this correct from the start.

#### Missing `c_tf_idf_` Matrix and Lack of Correctness

Both of these issues can be handled in the same operation, which can be summarized by
`merged_model._extract_topics(all_docs, all_embeddings)`.

We want to update the representations in the merged model, which fits the `vectorizer_model` and `ctfidf_model` and sets the `c_tf_idf_` matrix, but this time with all documents. The `_combine_zeroshot_topics` function only accepts the assigned documents (`assigned_documents`), the assigned embeddings (referred to in this function as `embeddings`), and the remaining documents (`documents`). To get all embeddings, I patch the `fit_transform` function to pass in the embeddings of the clustered documents. In `_combine_zeroshot_topics`, `embeddings` is moved to `assigned_embeddings` to keep the naming consistent with `_zeroshot_topic_modeling`, and `embeddings` is now used for the embeddings of the clustered documents (`documents`). Before extracting topics, I set the configuration-related models on `merged_model` so we use the right configuration for those.
I also make sure afterwards to call `merged_model.topic_labels_.update(zeroshot_topic_labels)` so they do not get overridden by the topic words. I have set `zeroshot_topic_labels` as a mapping from topic ID to label only for the zero-shot topics.

I have not tested it yet, but I should also call `_save_representative_docs` to update the representative docs, as these are a function of the `c_tf_idf_` matrix and the `vectorizer_model` and `ctfidf_model` models, which are different now (by using all docs) than when the representative docs were originally calculated (when the matrix and models were built from vocabulary subsets in the `zeroshot_model` and `self`, i.e., the clustered model).

### `fit_transform`
So it seems like the changes to `_combine_zeroshot_topics` are lengthy, potentially incomplete, and computationally inefficient (the representation model(s) are run again, and the representations from `zeroshot_model` and `self` before the merge are essentially thrown out). A simpler solution that I have not written or detailed extensively is to avoid this merging of models in the first place.

The idea starts with separating the assigned documents and assigned embeddings from the documents and embeddings to be clustered -- same as the current implementation. The next steps are dimensionality reduction and clustering -- no changes yet.
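The separation step at the start could be sketched as follows (plain NumPy, assuming precomputed document and label embeddings; `min_similarity` plays the role of `zeroshot_min_similarity`, and the function name is hypothetical):

```python
import numpy as np

def split_zeroshot(doc_embeddings, label_embeddings, min_similarity=0.7):
    """Assign each document to its best zero-shot topic if similar enough,
    otherwise mark it with -1 so it goes on to regular clustering."""
    # cosine similarity via normalized dot products
    docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    labels = label_embeddings / np.linalg.norm(label_embeddings, axis=1, keepdims=True)
    sims = docs @ labels.T
    best = sims.argmax(axis=1)
    best_sims = sims[np.arange(len(best)), best]
    return np.where(best_sims >= min_similarity, best, -1)
```

Documents marked `-1` here are exactly the ones whose embeddings would then go through dimensionality reduction and clustering.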
Now, instead of performing representation-related operations on just the topics that came from clustering, we can merge the topics from clustering with the topics from zero-shot topic modeling. Some care needs to be taken to ensure the outlier topic stays at `-1`, but I do not think it matters much whether zero-shot topics come next or get tacked onto the end (in fact, this alternate ordering is proposed in #1771). The representations will be calculated using all topics and all documents. There would be no need for `_combine_zeroshot_topics`, but it could perhaps be repurposed for combining the zero-shot topics with the topics from clusters after the clustering step. The `vectorizer_model` and `ctfidf_model` will be fitted to all documents, and `self.c_tf_idf_` will include the total vocabulary.

Now, operations that reference these (`reduce_topics` and `reduce_outliers` in my case) will (mostly) work, making no distinction between zero-shot topics and topics from clustering. There will still need to be some extra care whenever topic representations are updated (like in `reduce_topics`) to not override the zero-shot topic labels.

A couple more bits of extra care I can think of need to be taken:
- `self.nr_topics` may be set. I believe from the perspective of the user, this should apply to all topics, not just the ones from clustering, as is currently the case. There should be validation that `self.nr_topics` exceeds the number of zero-shot topics that matched. If this is not the case, the user should be prompted to either raise `self.nr_topics` or raise the `zeroshot_min_similarity`.
- `self._map_probabilities(probabilities, original_topics=True)` needs to have the right shape. I have not dealt with these, but I fear these could already have problems in the current implementation when zero-shot topic modeling is used, because the probabilities appear to be only for the clustered documents, not all documents.

## Final Discussion

What are your thoughts on making these changes to `BERTopic` topic models?