Unexpected Behavior After Merging Topics #632
Comments
Update: get_representative_docs() seems to be drawing from the original labels, rather than the merged labels.
Thank you for sharing this issue. From what I can see, there might be an issue with the way
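As a toy illustration of the symptom described above (plain Python, not BERTopic's internals), the sketch below groups documents by pre-merge labels and by post-merge labels. The names group_docs, original_labels, and merge_map are hypothetical and exist only for this example.

```python
from collections import defaultdict


def group_docs(docs, labels):
    """Group documents by their topic label."""
    groups = defaultdict(list)
    for doc, label in zip(docs, labels):
        groups[label].append(doc)
    return dict(groups)


docs = ["a", "b", "c", "d"]
original_labels = [0, 1, 2, 1]

# Hypothetical merge: topic 2 is folded into topic 0.
merge_map = {2: 0}
merged_labels = [merge_map.get(label, label) for label in original_labels]

# Grouping on the stale labels still reports topic 2,
# while grouping on the merged labels does not.
stale = group_docs(docs, original_labels)
fresh = group_docs(docs, merged_labels)
```

If representative documents are picked from a grouping like stale, they can refer to topics that no longer exist after the merge.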
Thanks Maarten! Sorry for the slight mess. I copied this out of the notebook that I've been troubleshooting in.

# Calculate Embeddings
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")

# Set Up Model
# Initialize Topic Model
topic_model = BERTopic(verbose=True, n_gram_range=(1, 2), min_topic_size=30, embedding_model=sentence_model, vectorizer_model=vectorizer_model)  # Applying a custom count vectorizer in order to use our custom stopword list

# GO!
# Run it!
topics, probs = topic_model.fit_transform(docs, embeddings)
topic_model.get_topic_info().head(1000)
topic_model.get_topic_info().to_excel('topic_summary.xlsx')

# Save out labeled topics
output_df = copy.deepcopy(df_for_processing)
output_df.to_excel('Labeled_Docs.xlsx')

# Save out summary along with representative docs for each topic to help understand them
summary = topic_model.get_topic_info()

# Iterate through representative docs and add to summary
def get_rep_docs_by_row(row):
summary["Representative Docs"] = summary["Topic"].apply(get_rep_docs_by_row)
summary.to_excel('Topic Summary.xlsx')

fig = topic_model.visualize_topics()
topic_model.visualize_topics()
fig = topic_model.visualize_documents(docs, embeddings=embeddings, hide_document_hover=False)
topic_model.visualize_documents(docs, embeddings=embeddings, hide_document_hover=False)
hierarchical_topics = topic_model.hierarchical_topics(docs, topics)
fig = topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

# Run the hierarchy visualization with the original embeddings
topic_model.visualize_hierarchical_documents(docs, hierarchical_topics, embeddings=embeddings, hide_document_hover=False)

# Topics by class
topics_per_class = topic_model.topics_per_class(docs, topics, classes=df_for_processing['Classes'])
fig = topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=30, normalize_frequency=True)
topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=30, normalize_frequency=True)

# Merge Topics
topic_model.load("my_model")  # Reload model to avoid conflicts
topic_model.update_topics(docs, topics, vectorizer_model=vectorizer_model)  # For applying changes to topic labeling (e.g., stopwords)
topics_to_merge = [[45,37,20,38,26, 22,10,27],
topic_model.merge_topics(docs, topics, topics_to_merge)
topic_model.save("my_model_merged", save_embedding_model=False)
topic_model.get_topic_info().head(1000)

# Save out labeled topics
output_df = copy.deepcopy(df_for_processing)
output_df.to_excel('Labeled_Docs_Merged.xlsx')

topic_model.visualize_topics()

# Run the visualization with the original embeddings
topic_model.visualize_documents(docs, embeddings=embeddings, hide_document_hover=False)
hierarchical_topics = topic_model.hierarchical_topics(docs, topics)
fig = topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
fig = topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=20, normalize_frequency=True)
topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=20, normalize_frequency=True)
One thing that I've noticed is that it tends to get more and more messed up with each iteration, if I go through multiple rounds of merges.
I just checked the code of
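One way repeated merges can drift, sketched here in plain Python under the assumption that each round is fed the labels from before the previous merge. This is a toy model, not BERTopic's implementation, and apply_merge is a hypothetical helper.

```python
def apply_merge(labels, groups):
    """Relabel every topic in each group to the group's first topic id."""
    mapping = {topic: group[0] for group in groups for topic in group}
    return [mapping.get(label, label) for label in labels]


labels = [0, 1, 2, 3, 2, 1]

# Round 1: fold topic 2 into topic 1.
round1 = apply_merge(labels, [[1, 2]])

# Round 2, chained: starts from the labels produced by round 1.
round2 = apply_merge(round1, [[0, 3]])

# Round 2, stale: starts from the original labels, so round 1 is lost.
round2_stale = apply_merge(labels, [[0, 3]])
```

In the chained case only topics 0 and 1 remain, while the stale case still contains topic 2. Feeding each round the labels produced by the previous one avoids this particular mismatch.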
* Online/incremental topic modeling with .partial_fit
* Expose c-TF-IDF model for customization with bertopic.vectorizers.ClassTfidfTransformer
* Expose attributes for easier access to internal data
* Major changes to the Algorithm page of the documentation, which now contains three overviews of the algorithm
* Added an example of combining BERTopic with KeyBERT
* Added many tests with the intention of making development a bit more stable
* Fix #632, #648, #673, #682, #667, #664
Hey Maarten! It seems as if topic_model.merge_topics() still does not propagate the changes internally: If I run I get the same number of topics back.
@darebfh Which version of BERTopic are you currently using?
Hey Maarten, All the best and thanks for this awesome tool!
@darebfh Thanks for the kind words and glad to hear that the issue was resolved! I'll keep it in mind :)
As I run my analyses, I find that the outputs from my initial model are usually very good, but usually still require a certain amount of tweaking to work perfectly. I was using the reduce_topics method for this, but have recently switched to the merge_topics method for the greater degree of control that it affords. However, I have run into some issues:
1. When I try to get the topic mapping to each doc using topic_model._map_predictions(topic_model.hdbscan_model.labels_), the topics that it outputs do not match those that are summarized when I run get_topic_info(). In a recent example, the summary showed 13 topics, while there were only 3 unique topics in the list of mappings from hdbscan_model.labels_.
2. Most of the visualizations look good, including: visualize_topics(), visualize_hierarchy(hierarchical_topics=hierarchical_topics), and visualize_topics_per_class. However, visualize_documents() appears to be using the bad mappings from problem 1, giving me an output like this:
It seems like there must be accurate topic mappings stored somewhere, otherwise none of the visualizations and summaries would work. Am I missing something obvious?
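A quick way to quantify the mismatch described above is to compare the set of topic ids assigned to documents with the set listed in the summary. The sketch below is plain Python with made-up data; compare_topic_sets, doc_topics, and summary_topics are hypothetical names, and in practice the two lists would come from the per-document mapping and the "Topic" column of the summary.

```python
def compare_topic_sets(doc_topics, summary_topics):
    """Report topic ids that appear on only one side of the comparison."""
    doc_set, summary_set = set(doc_topics), set(summary_topics)
    return {
        "only_in_docs": sorted(doc_set - summary_set),
        "only_in_summary": sorted(summary_set - doc_set),
    }


# Hypothetical per-document topic assignments.
doc_topics = [-1, 0, 0, 1, -1]
# Hypothetical topic ids listed in a summary table.
summary_topics = [-1, 0, 1, 2, 3]

report = compare_topic_sets(doc_topics, summary_topics)
```

A non-empty entry on either side suggests the two sources are no longer describing the same set of topics.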
Thanks!