
Issues with Zero-shot Topic Modeling regarding outliers and future operations #1967

Closed
ianrandman opened this issue May 3, 2024 · 9 comments · Fixed by #2060

Comments

@ianrandman
Contributor

ianrandman commented May 3, 2024

I am using zero-shot topic modeling and have run into some issues (a subset of which has already been brought up in past issues), plus I have some concerns about the correctness of the resulting topic model. I have two strategies in mind to resolve these, one of which I have already mostly written as a patch to the process for my project.

My flow is roughly:

  1. topic_model = BERTopic(...)
  2. topic_model.fit(...)
  3. topic_model.reduce_topics(...)
  4. topic_model.reduce_outliers(...)
  5. topic_model.update_topics(...)
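Roughly in code (a sketch; the dataset, the zero-shot parameters, and the reassignment of topics_ are illustrative placeholders, not my exact configuration):

```python
from bertopic import BERTopic

# placeholder zero-shot setup
topic_model = BERTopic(
    zeroshot_topic_list=["climate change", "economy"],
    zeroshot_min_similarity=0.85,
)
topics, probs = topic_model.fit_transform(docs)

topic_model.reduce_topics(docs, nr_topics=30)
new_topics = topic_model.reduce_outliers(docs, topic_model.topics_)
topic_model.update_topics(docs, topics=new_topics)
```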

The Problems

Let's consider the case that some zero-shot topics have matched with some of the documents. The next step is to continue with clustering and normal topic modeling for the remaining documents in self.

In _combine_zeroshot_topics, a dummy model for the zero-shot topics is made and fitted to generate topic representations. This is merged with self (which at this point represents the topic model for the unassigned documents that went on to be clustered) using BERTopic.merge_models to create merged_model. Some cleanup is done before the attributes in self are replaced with those in merged_model.

The final topics are 0 through `#matching_zeroshot_topics - 1` (the zero-shot topics) and `#matching_zeroshot_topics` through `#matching_zeroshot_topics + #clusters - 1` (the topics from clustering). The outlier topic is now treated as a normal topic with ID `#matching_zeroshot_topics`.

Outliers

The first issue is the loss of the outlier topic. The merged_model._outliers is 0, which makes its way into self._outliers.
This attribute is now wrong and it will no longer correctly be used in future operations as an offset to exclude the outlier topic.
The -1 topic is now gone, so we have lost any meaningful reference to outliers existing. Future operations look at topic -1 for identifying outlier documents. This issue is outlined in #1957.

If we were to call reduce_outliers on a model without outliers, there is no validation. I have not looked into other strategies, but for the c-tf-idf strategy, the function will fail on bow_doc = self.vectorizer_model.transform(outlier_docs) because outlier_docs is empty. A subset of the reduce_outliers issue is described in #1771.

self.c_tf_idf_

Any future operation that references the self.c_tf_idf_ matrix no longer works, because self.c_tf_idf_ is None, which is a carry-over from the merged_model. In my workflow, the self.c_tf_idf_ matrix will be set correctly if I reduce_topics, because the final step is to extract the topic representations, which includes fitting self.vectorizer_model and self.ctfidf_model and setting self.c_tf_idf_.

Configuration-related attributes

We have now seen outliers and the c-TF-IDF matrix not persist through the merge. Any other configuration-related attributes that may be important later are also not persisted. In my case, the relevant ones are self.zeroshot_topic_list, self.vectorizer_model, and self.ctfidf_model, but there are many more. These get reset to their defaults. This especially becomes a problem if representations are ever updated again (which they are, a couple more times, in my workflow), because important properties like reduce_frequent_words=True in the ClassTfidfTransformer are no longer set.

This is a direct consequence of

```python
self.__dict__.clear()
self.__dict__.update(merged_model.__dict__)
```

in _combine_zeroshot_topics.

Correctness of representations

The representations in the merged_model, which are copied over into self, are effectively a concatenation of the representations from the zeroshot_model and self (i.e. the clustered model). These representations are calculated in part with the c_tf_idf_ matrix, which is produced in part from the size of the vocabulary. The problem is that the vocabularies in zeroshot_model and self are only subsets of the real vocabulary, which includes all documents. As a user, I expect my zero-shot topics to be used at the very start as a sort of filtering step but mostly not have any effects later on (except the zero-shot topic labels persisting). This implies I expect representations to be calculated the way they would normally be, which is to use all documents.

While the representations calculated in each model separately are not necessarily wrong, I would argue it would be more correct for the representations to be calculated considering all documents. That is, I could call self._extract_topics(...) after fitting, and the resulting representations (barring the special handling of zero-shot topic labels) should not change.

topic_embeddings_

For this one, I am not 100% sure yet whether it is an issue. I would expect that upon merging the zeroshot_model and self, the merged_model.topic_embeddings_ should essentially just be a concatenation of the two sets of topic embeddings (plus maybe some quirks relating to where that outlier topic goes). In my debugging, it appears that the topic embeddings from zeroshot_model.topic_embeddings_ do not change, but the ones for the topics from clustering (from self.topic_embeddings_) do change, and the cosine similarity of those with the ones in the new merged_model is not even high. Perhaps I am missing something when doing this analysis, or perhaps BERTopic.merge_models truly is not preserving topic embeddings, which I believe is wrong, as topic embeddings are only a function of the documents in each topic (and their embeddings) in this case. Maybe it's related to the mappings parameter in _create_topic_vectors? Still debugging this one...

Proposed Solutions

reduce_outliers

I discussed a lack of validation for the existence of outliers, leading to unclear errors. This function should start with a validation check to look for -1 in self.topics_ (or really any attribute that has a mapping from topic IDs), or a more explicit solution is to check the value of self._outliers.

This private attribute can fall out of sync with reality, so I propose deriving it from one of the other attributes. This can look like

```python
@property
def _outliers(self) -> int:
    # alternative solution, which is less efficient:
    # return int(-1 in self.topics_)
    # other mapping attributes like self.topic_sizes_ can be used
    return int(-1 in self.topic_labels_)  # O(1) operation
```

One of the specific reasons I propose this is that in my flow, I calculate new topic IDs for all of my documents using topic_model.reduce_outliers(...). This does not change the model at all, so I set topic_model.topics_ = topic_model.reduce_outliers(...). I will later call topic_model.update_topics(...) to update the representations (which also sets the self._outliers attribute), but if I check whether the topic_model still has outliers between these operations, topic_model._outliers is no longer a reliable source of information.
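Concretely, the pattern looks like this (a sketch; docs stands in for my documents):

```python
# reduce_outliers only returns new assignments; the model itself is untouched
new_topics = topic_model.reduce_outliers(docs, topic_model.topics_)
topic_model.topics_ = new_topics

# between these two calls, topic_model._outliers still reports the old state
has_outliers = -1 in topic_model.topics_  # reliable
# topic_model._outliers                   # stale until update_topics runs

topic_model.update_topics(docs, topics=new_topics)  # re-syncs _outliers
```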

_combine_zeroshot_topics

This is where the bulk of my fixes through patching have been for the previously described problems.

Persisting of attributes from self

I described why I believe configuration-related attributes should not get cleared out. Obviously, (the majority of) attributes that were affected by fitting should be overridden in self by those in merged_model. To be more specific,

```python
# Public attributes
self.topics_ = None
self.probabilities_ = None
self.topic_sizes_ = None
self.topic_mapper_ = None
self.topic_representations_ = None
self.topic_embeddings_ = None
self.topic_labels_ = None
self.custom_labels_ = None
self.c_tf_idf_ = None
self.representative_images_ = None
self.representative_docs_ = {}
self.topic_aspects_ = {}
```

are the majority of attributes that should get overridden. I have also added umap_model and hdbscan_model as attributes that should come from merged_model, but I am less sure about the implications of these.

I copy all attributes that are not these into a dictionary. The end of the function is changed to

```python
# Update the class internally
self.__dict__.clear()
self.__dict__.update(merged_model.__dict__)
####### Patched Changes #######
self.__dict__.update(properties_to_keep)
###############################
```

to keep those configuration-related properties. The vectorizer_model and ctfidf_model are kept for their configurations, even though they were only fitted to the clustered documents. This could be problematic. The result of this patch includes self._outliers being set properly.
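A minimal sketch of how properties_to_keep is built (the attributes_to_override name is illustrative; its contents are the fit-affected attributes listed above):

```python
# keep every attribute (configuration, zero-shot settings, etc.) that is
# not one of the fit-affected attributes listed above
attributes_to_override = {
    "topics_", "probabilities_", "topic_sizes_", "topic_mapper_",
    "topic_representations_", "topic_embeddings_", "topic_labels_",
    "custom_labels_", "c_tf_idf_", "representative_images_",
    "representative_docs_", "topic_aspects_", "umap_model", "hdbscan_model",
}
properties_to_keep = {
    key: value for key, value in self.__dict__.items()
    if key not in attributes_to_override
}
```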

Missing Outlier Topic

I have described that the resulting model has the zero-shot topics starting at ID 0 followed by the clustered topics, including the outlier topic. Right after the merge happens, the outlier topic is actually in the correct spot with ID -1, but a gap exists after the zero-shot topics for where it will be moved to. I let this and the related operations on the representation-related attributes run.

If self._outliers is set to 1, I will manually move the outlier topic back to ID -1. This involves three operations. For each of

```python
[
    merged_model.topic_labels_,
    merged_model.topic_sizes_,
    merged_model.representative_docs_,
    merged_model.topic_representations_
]
```

a new mapping from -1 to the info of the outlier topic (which is at ID `#matching_zeroshot_topics`) is made. The mappings at `#matching_zeroshot_topics + 1` and above (all of the non-outlier topics that came from clustering) are shifted by 1 to a lower topic number. After the shift, the last mapping (ID `#matching_zeroshot_topics + #clusters - 1`) is deleted.

The same concept is implemented for merged_model.topics_ (topic assignment for each document) but through the use of numpy operations before it is converted back to a list.
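A sketch of both operations, with n_zeroshot standing in for `#matching_zeroshot_topics`:

```python
import numpy as np

def shift_outlier_topic(mapping: dict, n_zeroshot: int) -> dict:
    """Move the outlier topic from ID n_zeroshot back to -1 and shift
    every clustered topic above it down by one."""
    shifted = {}
    for topic, value in mapping.items():
        if topic == n_zeroshot:
            shifted[-1] = value
        elif topic > n_zeroshot:
            shifted[topic - 1] = value
        else:
            shifted[topic] = value
    return shifted

# n_zeroshot: number of matched zero-shot topics (placeholder)
for attr in ["topic_labels_", "topic_sizes_", "representative_docs_",
             "topic_representations_"]:
    setattr(merged_model, attr,
            shift_outlier_topic(getattr(merged_model, attr), n_zeroshot))

# same remapping for the per-document assignments, vectorized
topics = np.asarray(merged_model.topics_)
topics = np.where(topics == n_zeroshot, -1,
                  np.where(topics > n_zeroshot, topics - 1, topics))
merged_model.topics_ = topics.tolist()
```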

For an implementation that I think would be more appropriate to merge into this repo, BERTopic.merge_models should be changed to get this correct from the start.

Missing c_tf_idf_ Matrix and Lack of Correctness

Both of these issues can be handled in the same operation, which can be summarized by merged_model._extract_topics(all_docs, all_embeddings).

We want to update the representations in the merged model, which fits the vectorizer_model and ctfidf_model and sets the c_tf_idf_ matrix, but this time with all documents. The _combine_zeroshot_topics function only accepts the assigned documents (assigned_documents), assigned embeddings (referred to in this function as embeddings), and remaining documents (documents). To get all embeddings, I patch the fit_transform function to pass in the embeddings of the clustered documents. In _combine_zeroshot_topics, embeddings is moved to assigned_embeddings to keep the naming consistent with _zeroshot_topic_modeling, and embeddings is now used for the embeddings of clustered documents (documents).

Before extracting topics, I set

```python
merged_model.ctfidf_model = self.ctfidf_model
merged_model.vectorizer_model = self.vectorizer_model
```

so we use the right configuration for those.

I also make sure afterwards to call merged_model.topic_labels_.update(zeroshot_topic_labels) so they do not get overridden by the topic words. I have set zeroshot_topic_labels as a mapping from topic ID to label only for the zero-shot topics.

I have not tested it yet, but I should also call _save_representative_docs to update the representative docs, as these are a function of the c_tf_idf_ matrix and the vectorizer_model and ctfidf_model models, which are different now (by using all docs) than when the representative docs were originally calculated (when the matrix and models were built from vocabulary subsets in the zeroshot_model and self, i.e. the clustered model).

fit_transform

So the changes to _combine_zeroshot_topics are lengthy, potentially incomplete, and computationally inefficient (representation model(s) are run again, and the representations from zeroshot_model and self before the merge are essentially thrown out). A simpler solution that I have not written or detailed extensively is to avoid this merging of models in the first place.

The idea starts with separating the assigned documents and assigned embeddings from the documents and embeddings to be clustered -- same as the current implementation. The next steps are dimensionality reduction and clustering -- no changes yet.

Now, instead of performing representation-related operations on just the topics that came from clustering, we can merge the topics from clustering with the topics from zero-shot topic modeling, as sketched below. Some care needs to be taken to ensure the outlier topic stays at -1, but I do not think it matters much if zero-shot topics come next or get tacked onto the end (in fact, this alternate ordering is proposed in #1771). The representations will be calculated using all topics and all documents. There would be no need for _combine_zeroshot_topics, but it perhaps could be repurposed for the combining of zero-shot topics with the topics from clusters after the clustering step. The vectorizer_model and ctfidf_model will be fitted to all documents, and self.c_tf_idf_ will include the total vocabulary.
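A rough sketch of that combining step (all names here are placeholders; assigned_topics holds the zero-shot assignments 0 through n_zeroshot - 1, and cluster_labels is the raw clustering output with -1 marking outliers):

```python
import numpy as np

n_zeroshot = len(set(assigned_topics))

# keep -1 for outliers, offset genuine clusters past the zero-shot IDs
cluster_topics = np.asarray(cluster_labels)
cluster_topics = np.where(cluster_topics >= 0,
                          cluster_topics + n_zeroshot,
                          cluster_topics)

# one combined assignment over all documents; a single representation pass
# then fits vectorizer_model/ctfidf_model on the full vocabulary
all_topics = list(assigned_topics) + cluster_topics.tolist()
```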

Now, operations that reference these (reduce_topics and reduce_outliers in my case) will (mostly) work, making no distinction between zero-shot topics and topics from clustering. There will still need to be some extra care whenever topic representations are updated (like in reduce_topics) to not override the zero-shot topic labels.

A couple more areas I can think of that need extra care:

  • When fitting, self.nr_topics may be set. I believe from the perspective of the user, this should apply to all topics, not just ones from clustering, as is currently the case. There should be validation that self.nr_topics exceeds the number of zero-shot topics that matched. If this is not the case, the user should be prompted to either raise self.nr_topics or raise the zeroshot_min_similarity.
  • The returned probabilities and the related self._map_probabilities(probabilities, original_topics=True) need to have the right shape. I have not dealt with these, but I fear these could already have problems in the current implementation when zero-shot topic modeling is used, because the probabilities appear to be only for the clustered documents, not all documents.

Final Discussion

  • Are there reasons for the quirks I have described in the problems section that are not explained in the referenced issue and PR?
  • Do you see flaws in my description of my current patches?
  • Does my alternative implementation of zero-shot topic modeling make sense, or are there specific reasons for why it is implemented as the merging of two BERTopic topic models?
  • Anything else relevant to know or discuss?
@MaartenGr
Owner

@ianrandman Thanks for sharing the issues extensively and a bunch of great solutions! I haven't read through them all but wanted to let you know that this is on my radar. Hopefully somewhere next week, I'll find a moment to spend time reading this properly (as it definitely deserves proper attention!).

What I can share right now is that it is my intention to release a hotfix soon (v0.16.2) when some things on my side slow down a bit before I delve into any major new updates, like these.

@torqw

torqw commented May 16, 2024

Being able to use topics_over_time with zero-shot would indeed be a very welcome addition.

@MaartenGr
Owner

> Outliers
>
> The first issue is the loss of the outlier topic. The merged_model._outliers is 0, which makes its way into self._outliers.
> This attribute is now wrong and it will no longer correctly be used in future operations as an offset to exclude the outlier topic.
> The -1 topic is now gone, so we have lost any meaningful reference to outliers existing. Future operations look at topic -1 for identifying outlier documents. This issue is outlined in #1957.
>
> If we were to call reduce_outliers on a model without outliers, there is no validation. I have not looked into other strategies, but for the c-tf-idf strategy, the function will fail on bow_doc = self.vectorizer_model.transform(outlier_docs) because outlier_docs is empty. A subset of the reduce_outliers issue is described in #1771.

Indeed, the outlier should have made its way to the merged model, and its loss caused some issues. The pull request (and v0.16.2) should fix that, but as we will discuss below, there may be better solutions.

> Correctness of representations
>
> The representations in the merged_model, which are copied over into self, are effectively a concatenation of the representations from the zeroshot_model and self (i.e. the clustered model). These representations are calculated in part with the c_tf_idf_ matrix, which is produced in part from the size of the vocabulary. The problem is that the vocabularies in zeroshot_model and self are only subsets of the real vocabulary, which includes all documents. As a user, I expect my zero-shot topics to be used at the very start as a sort of filtering step but mostly not have any effects later on (except the zero-shot topic labels persisting). This implies I expect representations to be calculated the way they would normally be, which is to use all documents.
>
> While the representations calculated in each model separately are not necessarily wrong, I would argue it would be more correct for the representations to be calculated considering all documents. That is, I could call self._extract_topics(...) after fitting, and the resulting representations (barring the special handling of zero-shot topic labels) should not change.

I am not sure whether I agree that the documents of the zero-shot topics should be included in the general c-TF-IDF process, as that would require a third, manual model that needs the topics of both the zeroshot and the regular model in order to correctly generate the representations. Having said that, let's go over the proposed solutions and see what's what:

> I discussed a lack of validation for the existence of outliers, leading to unclear errors. This function should start with a validation check to look for -1 in self.topics_ (or really any attribute that has a mapping from topic IDs), or a more explicit solution is to check the value of self._outliers.

Although this should be fixed in the new release, having said that, a check could be worthwhile in case this happens more often. I would want to prevent having too much code though since the repo is already getting quite big for a sole maintainer (me).

> are the majority of attributes that should get overridden. I have also added umap_model and hdbscan_model as attributes that should come from merged_model, but I am less sure about the implications of these.

Those should not be added since that would make .transform not possible. HDBSCAN wasn't trained on the zeroshot topics and as such using the cosine similarity between topics and document embeddings is preferred.

> For an implementation that I think would be more appropriate to merge into this repo, BERTopic.merge_models should be changed to get this correct from the start.

I believe this should be fixed with the new release, foregoing any issues with respect to the ordering.

> Missing c_tf_idf_ Matrix and Lack of Correctness

This is rather tricky since we want to prevent either calculating too much on the one hand and having too much manual work on the other. The easiest solution is to introduce a third model that uses the topics generated from the first two models in order to easily generate the c-TF-IDF representations, topic embeddings, etc. without the need for more manual work. However, that would mean that all representation models are going to be run twice, which is definitely not preferred. An option would be to find the zero-shot topics, calculate the clusters, and then create a single manual BERTopic model using the zero-shot topics and clusters. That way, there is very minimal work needed and representations only need to run once.

> So the changes to _combine_zeroshot_topics are lengthy, potentially incomplete, and computationally inefficient (representation model(s) are run again, and the representations from zeroshot_model and self before the merge are essentially thrown out).

With zero-shot topics, representations should be run only once, not twice, since the merging procedure does nothing more than simply merging the models and not recalculating them.

> Now, instead of performing representation-related operations on just the topics that came from clustering, we can merge the topics from clustering with the topics from zero-shot topic modeling. Some care needs to be taken to ensure the outlier topic stays at -1, but I do not think it matters much if zero-shot topics come next or get tacked onto the end (in fact, this alternate ordering is proposed in #1771). The representations will be calculated using all topics and all documents. There would be no need for _combine_zeroshot_topics, but it perhaps could be repurposed for the combining of zero-shot topics with the topics from clusters after the clustering step. The vectorizer_model and ctfidf_model will be fitted to all documents, and self.c_tf_idf_ will include the total vocabulary.

Ah! Going in chronological order with answering seems to have its downsides as, if I'm not mistaken, that's the same solution I presented above. Calculate zero-shot topics and clusters without the representations and then use both in a manual BERTopic model, thereby simplifying the process and still allowing for a single calculation of the representations whilst having c-TF-IDF representations of the topics. Right?

> Are there reasons for the quirks I have described in the problems section that are not explained in the referenced issue and PR?

I think I answered most of this although I would almost think about going into a call with this many changes to make sure we are all on the same page. But that depends on how you see your role in this.

> Do you see flaws in my description of my current patches?

I think nothing except the few mentions above. Mostly, I think it's important to minimize any complexity going forward since that has been the main source of issues. Wherever we can simplify things, that would be preferred.

> Does my alternative implementation of zero-shot topic modeling make sense, or are there specific reasons for why it is implemented as the merging of two BERTopic topic models?

At the moment, I can't think of any reason not to do this. At the time it seemed like an elegant solution to the approach but this is indeed more efficient.

> Anything else relevant to know or discuss?

The code base is already fairly large and although I do not mind adding to it, it needs to be done from a "we are going to simplify things" perspective. Your suggestions are great. I understand the bigger picture you are sketching and might still need to deep-dive into things, but as it stands now I think mainly the procedure of replacing .merge_models with a manual BERTopic would be the way to go.

What do you think?

@ianrandman
Contributor Author

ianrandman commented Jun 13, 2024

Hi Maarten,

I have made changes discussed during our call and have run into a handful of minor issues I hope you can provide insight on. Some are directly related to zero-shot. Some not.

Probabilities assignments

We agreed to use transform() to produce probabilities using cosine similarity in place of the probabilities produced by the HDBSCAN model. The first issue I have is these probabilities are not necessarily aligned with the topic assignments at this stage (i.e. the most "probable" topic assignment for a given document is not necessarily the one it was assigned to). This is more of an issue in theory, and we can agree to just ignore it.

The second problem that I believe should be addressed happens during self.probabilities_ = self._map_probabilities(probabilities, original_topics=True). I am not too familiar with this function in depth, but it seems to me that adding "probabilities" that are actually cosine similarities can produce undesired behavior (perhaps on the user's side) when the sum does not equal 1, possibly even exceeding it. Though I have not thought through it too much, I wonder if applying a softmax to the cosine similarities in transform() would be helpful in producing "probabilities" that make more sense.
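For illustration, a sketch of a numerically stable softmax over the similarity matrix inside transform() (the sim_matrix naming mirrors the snippet further below):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

sim_matrix = cosine_similarity(embeddings, np.array(self.topic_embeddings_))

# softmax over topics so each document's "probabilities" sum to 1;
# subtracting the row max is only for numerical stability
exp = np.exp(sim_matrix - sim_matrix.max(axis=1, keepdims=True))
probabilities = exp / exp.sum(axis=1, keepdims=True)
```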

Checking for zero-shot

We agreed to replace the HDBSCAN model with a BaseCluster during the fit_transform stage. However, this causes the check for whether the model is zero-shot to incorrectly return False, causing downstream problems.

BERTopic/bertopic/_bertopic.py

Lines 3599 to 3608 in ef0b9a9

```python
def _is_zeroshot(self):
    """ Check whether zero-shot topic modeling is possible

    * There should be a cluster model used
    * Embedding model is necessary to convert zero-shot topics to embeddings
    * Zero-shot topics should be defined
    """
    if self.zeroshot_topic_list is not None and self.embedding_model is not None and type(self.hdbscan_model) != BaseCluster:
        return True
    return False
```

Thoughts on how to handle this check for the HDBSCAN model? Or an alternative suggestion for the assignment of the model during fit_transform?

Topic reduction

When a model is zero-shot, topic reduction can result in multiple zero-shot topics being mapped to a single new topic. A design decision is needed for how to handle these topic labels. I propose concatenation, joined with "_". An alternative could look something like keeping the best fitting label. Thoughts?

This step also jumbled the ordering of topics, so new topics that include at least one original zero-shot topic are not guaranteed to come before other topics. I could add some logic to realign the ordering with this rule, but it is a bit tedious with all the bits that must be changed.

On a side note, I think there is a variable naming issue in the topic reduction function.

BERTopic/bertopic/_bertopic.py

Lines 4169 to 4172 in ef0b9a9

```python
mappings = {topic_from: {"topics_to": topics_to,
                         "topic_sizes": [self.topic_sizes_[topic] for topic in topics_to]}
            for topic_from, topics_to in mappings.items()}
```

This includes a mapping from the new topic ID to the old topic IDs that were reduced to the new topic. Therefore, I believe topic_from should be renamed to topic_to and topics_to renamed to topics_from.
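With the suggested renaming, the snippet would read as follows (note that any downstream reads of the "topics_to" dictionary key would need the same rename):

```python
mappings = {topic_to: {"topics_from": topics_from,
                       "topic_sizes": [self.topic_sizes_[topic] for topic in topics_from]}
            for topic_to, topics_from in mappings.items()}
```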

Custom labels in update_topics()

In update_topics(), topic assignments may change, and some topics may disappear. This can cause the topics to fall out of sync with self.custom_labels_ if used, causing them to not show up in get_topic_info(). I believe there should be a clearer message in get_topic_info() when this happens

BERTopic/bertopic/_bertopic.py

Lines 1527 to 1530 in ef0b9a9

```python
if self.custom_labels_ is not None:
    if len(self.custom_labels_) == len(info):
        labels = {topic - self._outliers: label for topic, label in enumerate(self.custom_labels_)}
        info["CustomName"] = info["Topic"].map(labels)
```

or update_topics() should handle this somehow. Thoughts?

@MaartenGr
Owner

> The first issue I have is these probabilities are not necessarily aligned with the topic assignments at this stage (i.e. the most "probable" topic assignment for a given document is not necessarily the one it was assigned to). This is more of an issue in theory, and we can agree to just ignore it.

They are indeed not perfectly aligned with one another. With flat probabilities/similarities, this is not an issue as we can simply give back the similarity of the assigned topic. With topic-doc probabilities/similarities, I'm not sure how to solve that issue without attempting to create a separate model (i.e., logistic regression) that is trained (and perhaps even overfitted) on the embeddings (X) and assigned topic (y) to extract hopefully more accurate probabilities.

It is something interesting to think about but indeed, let's ignore it for now as it might be outside of the scope of this PR.

> The second problem that I believe should be addressed happens during self.probabilities_ = self._map_probabilities(probabilities, original_topics=True). I am not too familiar with this function in depth, but it seems to me that adding "probabilities" that are actually cosine similarities can produce undesired behavior (perhaps on the user's side) when the sum does not equal 1, possibly even exceeding it. Though I have not thought through it too much, I wonder if applying a softmax to the cosine similarities in transform() would be helpful in producing "probabilities" that make more sense.

Yeah, that indeed makes more sense. It was there originally when using the HDBSCAN probabilities (which kinda sum to 1) but that does not work here. I believe softmax would work here to create these distributions of probabilities. It has been suggested before and I think there is no harm in trying it out here. It might require some manual checking to see whether the distributions make sense but I believe this is common practice.

> Thoughts on how to handle this check for the HDBSCAN model? Or an alternative suggestion for the assignment of the model during fit_transform?

Hmmm, we could replace the HDBSCAN model after doing all of the zero-shot work but I can imagine it would be odd to have a single line somewhere at the end of .fit_transform just for replacing it with BaseCluster...

> When a model is zero-shot, topic reduction can result in multiple zero-shot topics being mapped to a single new topic. A design decision is needed for how to handle these topic labels. I propose concatenation, joined with "_". An alternative could look something like keeping the best fitting label. Thoughts?

This is a difficult one. Although I generally agree with concatenation, it might merge dozens of topics together which results in a rather strange topic representation. The thing is, if you merge a relatively small zero-shot topic with a large non-zero-shot topic, then from a theoretical perspective the topic representation should be geared more towards the large non-zero-shot topic and not the smaller. In many cases, I believe these merged topics should then get entirely new labels which are calculated through the topic representations.

> This step also jumbled the ordering of topics, so new topics that include at least one original zero-shot topic are not guaranteed to come before other topics. I could add some logic to realign the ordering with this rule, but it is a bit tedious with all the bits that must be changed.

Personally, I would skip that for now. It seems like a lot of work keeping track of the order for something that many users are unlikely to be using.

> This includes a mapping from the new topic ID to the old topic IDs that were reduced to the new topic. Therefore, I believe topic_from should be renamed to topic_to and topics_to renamed to topics_from.

If I'm not mistaken, the functionality is correct but merely the names are incorrect? If so, then they should indeed be renamed.

> In update_topics(), topic assignments may change, and some topics may disappear. This can cause the topics to fall out of sync with self.custom_labels_ if used, causing them to not show up in get_topic_info(). I believe there should be a clearer message in get_topic_info() when this happens or update_topics() should handle this somehow. Thoughts?

Oh, that's definitely a bug then if .update_topics does not re-align the self.custom_labels_. I believe this should be handled in .update_topics since updating the topics should also update the order of the custom labels.

@ianrandman
Contributor Author

> They are indeed not perfectly aligned with one another. With flat probabilities/similarities, this is not an issue as we can simply give back the similarity of the assigned topic. With topic-doc probabilities/similarities, I'm not sure how to solve that issue without attempting to create a separate model (i.e., logistic regression) that is trained (and perhaps even overfitted) on the embeddings (X) and assigned topic (y) to extract hopefully more accurate probabilities.
>
> It is something interesting to think about but indeed, let's ignore it for now as it might be outside of the scope of this PR.

> Yeah, that indeed makes more sense. It was there originally when using the HDBSCAN probabilities (which kinda sum to 1) but that does not work here. I believe softmax would work here to create these distributions of probabilities. It has been suggested before and I think there is no harm in trying it out here. It might require some manual checking to see whether the distributions make sense but I believe this is common practice.

I agree with leaving discussions on probabilities to the future. Some more research shows there may be some better ways than softmax for treating cosine similarities as probabilities.

> Oh, that's definitely a bug then if .update_topics does not re-align the self.custom_labels_. I believe this should be handled in .update_topics since updating the topics should also update the order of the custom labels.

I think this is also best left for a future PR.

> Personally, I would skip that for now. It seems like a lot of work keeping track of the order for something that many users are unlikely to be using.

Agreed.

> If I'm not mistaken, the functionality is correct but merely the names are incorrect? If so, then they should indeed be renamed.

Yes, functionality is correct. Will make this change.

> Hmmm, we could replace the HDBSCAN model after doing all of the zero-shot work but I can imagine it would be odd to have a single line somewhere at the end of .fit_transform just for replacing it with BaseCluster...

Bit confused here. I thought your suggestion during the call was to replace the HDBSCAN model with a BaseCluster near the end of fit_transform(). Part of the reasoning is to get probabilities using the transform() function if zero-shot, which checks for the HDBSCAN model being a BaseCluster, and if so, calculate cosine similarities for probabilities. The other reason being that future operations should not use the HDBSCAN model, because it was not fitted on all the documents.

```python
elif type(self.hdbscan_model) == BaseCluster:
    logger.info("Predicting topic assignments through cosine similarity of topic and document embeddings.")
    sim_matrix = cosine_similarity(embeddings, np.array(self.topic_embeddings_))
    predictions = np.argmax(sim_matrix, axis=1) - self._outliers
    if self.calculate_probabilities:
        probabilities = sim_matrix
    else:
        probabilities = np.max(sim_matrix, axis=1)
```

I am not sure of the need for checking type(self.hdbscan_model) != BaseCluster when determining whether the model is zero-shot. Removing this check would solve the problem. Unit tests also pass without it.

> This is a difficult one. Although I generally agree with concatenation, it might merge dozens of topics together which results in a rather strange topic representation. The thing is, if you merge a relatively small zero-shot topic with a large non-zero-shot topic, then from a theoretical perspective the topic representation should be geared more towards the large non-zero-shot topic and not the smaller. In many cases, I believe these merged topics should then get entirely new labels which are calculated through the topic representations.

After some thought and discussion, I believe there is a good compromise. If the original topics that make up the new topic all came from clustering, proceed as normal (i.e. generate a name using the representation model(s)). If one or more of the original topics are zero-shot, calculate the cosine similarity of all the documents in the new cluster (i.e. the topic embedding) with each of the zero-shot labels from the original topics. Choose the best fitting label, if it meets the zeroshot_min_similarity threshold. If none meet the threshold, a calculated name will be used, and the zero-shot labels will be thrown out.
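A sketch of that label selection (topic_embedding, zeroshot_labels, and zeroshot_label_embeddings are assumed placeholders):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# similarity of the merged topic's embedding to each candidate zero-shot label
sims = cosine_similarity(topic_embedding.reshape(1, -1),
                         zeroshot_label_embeddings)[0]
best = int(np.argmax(sims))

if sims[best] >= self.zeroshot_min_similarity:
    label = zeroshot_labels[best]  # keep the best-fitting zero-shot label
else:
    label = None  # fall back to a name calculated from the representation
```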

@MaartenGr
Owner

> Bit confused here. I thought your suggestion during the call was to replace the HDBSCAN model with a BaseCluster near the end of fit_transform(). Part of the reasoning is to get probabilities using the transform() function if zero-shot, which checks for the HDBSCAN model being a BaseCluster, and if so, calculate cosine similarities for probabilities. The other reason being that future operations should not use the HDBSCAN model, because it was not fitted on all the documents.

_is_zeroshot is still run when HDBSCAN is not a BaseCluster right? If so and assuming it gets updated to a BaseCluster at the end of .fit_transform, there should be no issues with _is_zeroshot.

> After some thought and discussion, I believe there is a good compromise. If the original topics that make up the new topic all came from clustering, proceed as normal (i.e. generate a name using the representation model(s)). If one or more of the original topics are zero-shot, calculate the cosine similarity of all the documents in the new cluster (i.e. the topic embedding) with each of the zero-shot labels from the original topics. Choose the best fitting label, if it meets the zeroshot_min_similarity threshold. If none meet the threshold, a calculated name will be used, and the zero-shot labels will be thrown out.

That certainly sounds reasonable!

@ianrandman
Contributor Author

> _is_zeroshot is still run when HDBSCAN is not a BaseCluster right? If so and assuming it gets updated to a BaseCluster at the end of .fit_transform, there should be no issues with _is_zeroshot.

Yes, in my changes, _is_zeroshot() is used during topic reduction and whenever self.topic_labels_ is referenced. When setting the HDBSCAN model to BaseCluster() at the end of fit_transform(), subsequent checks for self._is_zeroshot() will incorrectly return False.

@MaartenGr
Owner

Hmmm, we could either remove that check or add a parameter in _is_zeroshot that essentially asks whether we should check for the cluster model. Both do not seem that elegant to me but I'm alright either way.
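For reference, a rough sketch of the parameter option (the naming is only illustrative):

```python
def _is_zeroshot(self, check_cluster_model: bool = True) -> bool:
    """Check whether zero-shot topic modeling is possible."""
    if self.zeroshot_topic_list is None or self.embedding_model is None:
        return False
    # optionally skip the cluster-model check, e.g. after hdbscan_model
    # has been swapped for a BaseCluster at the end of fit_transform
    if check_cluster_model and type(self.hdbscan_model) == BaseCluster:
        return False
    return True
```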
