problems with merge_topics #648

Closed
iamsha5q opened this issue Jul 31, 2022 · 5 comments

Comments

iamsha5q commented Jul 31, 2022

I created the following dataframe from the model output:

topics, probs = model.fit_transform(vic_msg)
topic_df = model.get_topic_info()

I then created a second dataframe containing my messages and the topic from the model output. When a message is assigned to topic -1, I assign it the topic with the highest probability instead.

# create dataframe with topics
df = pd.DataFrame({'topic': topics, 'message': vic_msg})
df['topic_assigned'] = " "
for i, row in df.iterrows():
    if row.topic == -1:
        df.at[i,'topic_assigned'] = np.where(probs[i] == probs[i].max())[0][0]
    else:
        df.at[i,'topic_assigned'] = row.topic
df = df.merge(topic_df[['Topic', 'Name']], how='left', left_on='topic_assigned', right_on='Topic' )
df.rename(columns = {'Name':'topic_keywords'}, inplace = True)
df = df[['topic','topic_assigned', 'topic_keywords', 'message']]
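The row-by-row loop above can also be written without iterrows. The following is only a minimal sketch on toy data (the values of topics, probs, and messages are made up, not real model output), and it assumes that column i of probs corresponds to topic i, which is exactly the assumption that stops holding once topics are merged:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the model output (hypothetical values, not real results).
topics = [0, -1, 1, -1]
probs = np.array([
    [0.9, 0.1],
    [0.2, 0.7],
    [0.3, 0.6],
    [0.5, 0.4],
])
messages = ["a", "b", "c", "d"]

df = pd.DataFrame({"topic": topics, "message": messages})

# Column index of the largest probability in each row
# (first occurrence on ties, like np.where(...)[0][0] in the loop).
best_topic = probs.argmax(axis=1)

# Keep the model's topic unless the document is an outlier (-1).
df["topic_assigned"] = np.where(df["topic"] == -1, best_topic, df["topic"])
```

Because the column is filled with integers from the start, the later merge on Topic compares integers with integers rather than relying on an object column initialised with " ".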

The df above worked perfectly until I decided to merge some topics as follows:

topics_to_merge = [[141,142],[143,144]]
model.merge_topics(vic_msg, topics, topics_to_merge)

When I rebuild df afterwards, some messages are still assigned to topics that were removed by the merge. topic_df, however, correctly shows the newly merged topics.

Say message[1] was allocated to topic 141. Before merging, probs[1] or model.visualize_distribution(probs[1]) showed some values, but after merging they no longer do. I've reduced 140 topics to 115 topics, so any message previously assigned to a topic > 115 now has no topic to map to.

When I run len(probs[1]), the size is still about 141 topics. Does that mean probs is not updated with the new probabilities from the merge? If I try to capture the output of merge_topics instead, I get an error:

topics_merge, probs_merge = model.merge_topics(vic_msg, topics, topics_to_merge)
TypeError: cannot unpack non-iterable NoneType object
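The TypeError above is what Python raises when the right-hand side of a tuple unpacking is None, which is consistent with merge_topics updating the model in place and returning nothing in this version. A minimal sketch with a hypothetical stand-in class (InPlaceModel is not part of BERTopic) reproduces the same message:

```python
class InPlaceModel:
    """Hypothetical stand-in for a model whose merge method mutates state."""

    def __init__(self, topics):
        self.topics = list(topics)

    def merge_topics(self, topics_to_merge):
        # Relabel every topic in a group to the smallest id of that group.
        for group in topics_to_merge:
            target = min(group)
            self.topics = [target if t in group else t for t in self.topics]
        # No return statement, so the call evaluates to None.


model = InPlaceModel([141, 142, 143, 144, 5])

message = None
try:
    # Unpacking None into two names fails after the method has already run.
    merged, merged_probs = model.merge_topics([[141, 142], [143, 144]])
except TypeError as err:
    message = str(err)

print(message)
print(model.topics)
```

Note that the merge itself has already happened by the time the unpacking fails, so the state on the model is updated even though the assignment raised.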

Do you have any idea what is happening here, @MaartenGr?

@iamsha5q
Author

I fixed it by running the code below, which I found in another discussion. Thanks!

topics = model._map_predictions(model.hdbscan_model.labels_)
probs = hdbscan.all_points_membership_vectors(model.hdbscan_model)
probs = model._map_probabilities(probs, original_topics=True)
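To illustrate what such a remapping has to achieve, here is a small sketch in plain NumPy. It is not BERTopic's implementation: the helper remap_after_merge is hypothetical, the new topic ids are simply the surviving old ids renumbered in ascending order (BERTopic may order topics differently, for example by size), and summing the probability columns of merged topics is an assumption made for illustration.

```python
import numpy as np


def remap_after_merge(topics, probs, topics_to_merge):
    """Hypothetical helper: relabel topics and collapse probability columns."""
    n_topics = probs.shape[1]

    # Every old topic starts out mapped to itself.
    target = np.arange(n_topics)
    for group in topics_to_merge:
        # Assumption: a merged group keeps the smallest id in the group.
        target[list(group)] = min(group)

    # Renumber the surviving ids to 0..k-1; new_ids[old_id] is the new id.
    kept, new_ids = np.unique(target, return_inverse=True)

    # Assumption: the probability of a merged topic is the sum of its parts.
    new_probs = np.zeros((probs.shape[0], len(kept)))
    for old_id, new_id in enumerate(new_ids):
        new_probs[:, new_id] += probs[:, old_id]

    # Outliers (-1) stay outliers; everything else is relabelled.
    new_topics = [t if t == -1 else int(new_ids[t]) for t in topics]
    return new_topics, new_probs


old_topics = [2, -1, 3]
old_probs = np.array([
    [0.10, 0.20, 0.30, 0.40],
    [0.40, 0.30, 0.20, 0.10],
    [0.25, 0.25, 0.25, 0.25],
])
new_topics, new_probs = remap_after_merge(old_topics, old_probs, [[1, 2]])
```

After such a step, len(new_probs[i]) matches the reduced number of topics, and every non-outlier label in new_topics is a valid column index again.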

@iamsha5q
Author

iamsha5q commented Aug 9, 2022

Hi @MaartenGr, it turns out that I'm still having issues with this. After executing the commands above, I realized that the representative docs are not assigned correctly to the new topics after merging. I'm still unsure how to assign the merged topics to the documents. Any help is appreciated.

@MaartenGr
Owner

@iamsha5q There is indeed a bug in merge_topics at the moment. It will be fixed in the next release. Because the fix involves significant changes to the internal structure, it will ship with a new full release; a separate PR would not cover it entirely.

Having said that, I believe you can work around it by running the following:

self._map_representative_docs()
updated_probs = self._map_probabilities(probs)

Quite a bit of code for the new release is already written, so I hope to open a PR in the coming weeks so that you can start using the fix early.

MaartenGr added a commit that referenced this issue Aug 10, 2022
MaartenGr mentioned this issue Aug 10, 2022
@iamsha5q
Author

Thanks Maarten, I might just wait for the next release then. Even after running _map_representative_docs(), the documents are still not mapped properly.

MaartenGr added a commit that referenced this issue Sep 11, 2022
* Online/incremental topic modeling with .partial_fit
* Expose c-TF-IDF model for customization with bertopic.vectorizers.ClassTfidfTransformer
* Expose attributes for easier access to internal data
* Major changes to the Algorithm page of the documentation, which now contains three overviews of the algorithm
* Added an example of combining BERTopic with KeyBERT
* Added many tests with the intention of making development a bit more stable
* Fix #632, #648, #673, #682, #667, #664
@MaartenGr
Owner

With the new release, this should be fixed! However, if you still run into any issues, please let me know.
