Fix #1797 and #1798 (#1804)

Merged: 2 commits, Feb 14, 2024
Changes from 1 commit
28 changes: 27 additions & 1 deletion bertopic/_bertopic.py
@@ -3237,7 +3237,7 @@ def merge_models(cls, models, min_similarity: float = .7, embedding_model=None):
merged_topics["topic_aspects"][str(new_topic_val)] = selected_topics["topic_aspects"][str(new_topic)]

# Add new embeddings
- new_tensors = tensors[new_topic - selected_topics["_outliers"]]
+ new_tensors = tensors[new_topic + selected_topics["_outliers"]]
merged_tensors = np.vstack([merged_tensors, new_tensors])

# Topic Mapper
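The one-character fix above hinges on how topic embeddings are laid out when a model has an outlier topic: row 0 of the embedding matrix belongs to topic -1, so topic i sits at row i plus the outlier offset, and the offset must be added, not subtracted. A minimal sketch with toy values (not BERTopic's actual tensors) to illustrate the indexing:

```python
import numpy as np

# Toy embedding matrix for a model with an outlier topic (-1) and topics 0, 1.
tensors = np.array([[0.0, 0.0],   # row 0 -> topic -1 (outliers)
                    [1.0, 1.0],   # row 1 -> topic 0
                    [2.0, 2.0]])  # row 2 -> topic 1
outliers = 1       # selected_topics["_outliers"]: 1 if the model has an outlier topic
new_topic = 1

wrong = tensors[new_topic - outliers]  # row 0: picks the outlier embedding by mistake
fixed = tensors[new_topic + outliers]  # row 2: the embedding that belongs to topic 1
```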
@@ -3663,6 +3663,32 @@ def _combine_zeroshot_topics(self,
self.__dict__.clear()
self.__dict__.update(merged_model.__dict__)
Review comment:

I spoke too soon! merged_model.__dict__ replaces self._outliers with 0, so the outliers are never repositioned a couple of lines down.

Review comment:

Maybe it's just best to set outliers = self._outliers just before merging on line 3627, then use that on line 3668 instead?

We'd also have to set self._outliers = 1 inside the if statement.
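A self-contained sketch of the issue and the suggested workaround, using a toy class rather than BERTopic itself (class and variable names here are illustrative, not from the PR):

```python
# Illustration of why the flag gets lost: dict.update on __dict__ overwrites
# attributes, so a value cached beforehand is the only one that survives the merge.
class Model:
    def __init__(self, outliers):
        self._outliers = outliers

self_model = Model(outliers=1)       # zero-shot model that has an outlier topic
merged_model = Model(outliers=0)     # merged model built without the -1 topic

has_outliers = bool(self_model._outliers)           # cache before merging
self_model.__dict__.clear()
self_model.__dict__.update(merged_model.__dict__)   # now self_model._outliers == 0

if has_outliers:                     # the cached flag still reflects the original model
    self_model._outliers = 1         # restore it so the -1 topic can be repositioned
```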

Review comment:

I tested this out and it looks good to me

(attached screenshot: outliers)

Owner Author:

Awesome, thanks for checking this out! I just pushed a commit which should finally resolve this issue. If you could do a final check, I can merge this PR.

logger.info("Zeroshot Step 3 - Completed \u2713")

# Move -1 topic back to position 0 if it exists
if self._outliers:
nr_zeroshot_topics = len(set(y))

# Re-map the topics such that the -1 topic is at position 0
new_mappings = {}
for topic in self.topics_:
if topic < nr_zeroshot_topics:
new_mappings[topic] = topic
elif topic == nr_zeroshot_topics:
new_mappings[topic] = -1
else:
new_mappings[topic] = topic - 1

# Re-map the topics including all representations (labels, sizes, embeddings, etc.)
self.topics_ = [new_mappings[topic] for topic in self.topics_]
self.topic_representations_ = {new_mappings[topic]: repr for topic, repr in self.topic_representations_.items()}
self.topic_labels_ = {new_mappings[topic]: label for topic, label in self.topic_labels_.items()}
self.topic_sizes_ = collections.Counter(self.topics_)
self.topic_embeddings_ = np.vstack([
self.topic_embeddings_[nr_zeroshot_topics],
self.topic_embeddings_[:nr_zeroshot_topics],
self.topic_embeddings_[nr_zeroshot_topics+1:]
])

return self.topics_

def _guided_topic_modeling(self, embeddings: np.ndarray) -> Tuple[List[int], np.array]:
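To make the re-mapping in the added block concrete, here is a toy run with assumed values (not taken from the PR): with two zero-shot topics, the combined model temporarily labels the outlier topic with ID 2 and the remaining clustered topics with 3, 4, and so on, and the mapping sends 2 back to -1 while shifting the higher IDs down by one.

```python
# Toy illustration of the re-mapping above; values are assumed for the example.
nr_zeroshot_topics = 2
topics_ = [0, 1, 2, 3, 4, 2, 0]           # 2 is the temporary ID of the outlier topic

new_mappings = {}
for topic in topics_:
    if topic < nr_zeroshot_topics:
        new_mappings[topic] = topic       # zero-shot topics keep their IDs
    elif topic == nr_zeroshot_topics:
        new_mappings[topic] = -1          # the outlier topic becomes -1 again
    else:
        new_mappings[topic] = topic - 1   # clustered topics shift down by one

print([new_mappings[t] for t in topics_])  # [0, 1, -1, 2, 3, -1, 0]
```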