Unexpected Behavior After Merging Topics #632

Closed
CoandaEffect opened this issue Jul 22, 2022 · 10 comments · Fixed by #668

Comments

@CoandaEffect

As I run my analyses, I find that the outputs from my initial model are usually very good, but they still need some tweaking to work perfectly. I was using the reduce_topics method for this, but have recently switched to the merge_topics method for the greater degree of control that it affords. However, I have run into some issues:

  1. When I try to get the topic mapping to each doc using topic_model._map_predictions(topic_model.hdbscan_model.labels_), the topics that it outputs do not match those that are summarized when I run get_topic_info(). In a recent example, the summary showed 13 topics, while there were only 3 unique topics in the list of mappings from hdbscan_model.labels_.

  2. Most of the visualizations look good, including: visualize_topics(), visualize_hierarchy(hierarchical_topics=hierarchical_topics), and visualize_topics_per_class. However, visualize_documents() appears to be using the bad mappings from problem 1, giving me an output like this:

[Screenshot: visualize_documents() output]

It seems like there must be accurate topic mappings stored somewhere, otherwise none of the visualizations and summaries would work. Am I missing something obvious?

Thanks!
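
The mismatch described in problem 1 can be checked by comparing the set of topics assigned to documents against the set listed in the summary. The following is a minimal sketch in plain Python, not BERTopic code: `doc_topics` and `summary_topics` are hypothetical stand-ins for the per-document labels and the `Topic` column of `get_topic_info()`, and the helper name is illustrative.

```python
def compare_topic_sets(doc_topics, summary_topics):
    """Report topics that appear on only one side of the comparison."""
    assigned = set(doc_topics)
    summarized = set(summary_topics)
    return {
        "only_in_documents": sorted(assigned - summarized),
        "only_in_summary": sorted(summarized - assigned),
    }


# Hypothetical data: the summary lists five topics, the labels only three.
doc_topics = [-1, 0, 0, 1, 1, -1, 0]
summary_topics = [-1, 0, 1, 2, 3]

report = compare_topic_sets(doc_topics, summary_topics)
print(report)
```

If both lists in the report are empty, the per-document labels and the summary agree on which topics exist; anything else points to a stale set of labels on one side.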

@CoandaEffect
Author

Update: get_representative_docs() seems to be drawing from the original labels, rather than the merged labels.
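
While representative documents still reflect the original labels, one possible workaround is to group documents by their current labels and pull a few per topic. The sketch below simply takes the first few documents for each label, which is a stand-in and not how BERTopic selects representative documents; `docs`, `labels`, and the helper name are hypothetical.

```python
from collections import defaultdict


def sample_docs_per_topic(docs, labels, n_docs=3):
    """Collect up to n_docs documents for each topic label, in input order."""
    if len(docs) != len(labels):
        raise ValueError("docs and labels must have the same length")
    grouped = defaultdict(list)
    for doc, label in zip(docs, labels):
        if len(grouped[label]) < n_docs:
            grouped[label].append(doc)
    return dict(grouped)


docs = ["cats purr", "dogs bark", "cats nap", "stocks fell", "dogs fetch"]
labels = [0, 0, 0, 1, 0]  # hypothetical labels after merging the pet topics
examples = sample_docs_per_topic(docs, labels, n_docs=2)
print(examples)
```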

@MaartenGr
Owner

Thank you for sharing this issue. From what I can see, there might be an issue with the way merge_topics is currently working but I cannot be sure. Can you share your entire code for getting these issues? Including training and merging topics.

@CoandaEffect
Author

CoandaEffect commented Jul 25, 2022

Thanks Maarten! Sorry for the slight mess. I copied this out of the notebook that I've been troubleshooting in.

```python
import copy

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Note: docs, vectorizer_model, and df_for_processing come from earlier notebook cells (not shown)

# Calculate Embeddings
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=True)

# Set Up Model

# Initialize Topic Model
# Applying a custom count vectorizer in order to use our custom stopword list
topic_model = BERTopic(verbose=True, n_gram_range=(1, 2), min_topic_size=30, embedding_model=sentence_model, vectorizer_model=vectorizer_model)

# GO!

# Run it!
topics, probs = topic_model.fit_transform(docs, embeddings)
topic_model.save("my_model", save_embedding_model=False)

topic_model.get_topic_info().head(1000)

topic_model.get_topic_info().to_excel('topic_summary.xlsx')

# Save out labeled topics
output_df = copy.deepcopy(df_for_processing)
output_df['Topics'] = topic_model._map_predictions(topic_model.hdbscan_model.labels_)

output_df.to_excel('Labeled_Docs.xlsx')

# Save out summary along with representative docs for each topic to help understand them
summary = topic_model.get_topic_info()
summary.drop(0, inplace=True)

# Iterate through representative docs and add to summary
def get_rep_docs_by_row(row):
    return topic_model.get_representative_docs(row)

summary["Representative Docs"] = summary["Topic"].apply(get_rep_docs_by_row)

summary.to_excel('Topic Summary.xlsx')

fig = topic_model.visualize_topics()
fig.write_html('Topic Distances.html')

topic_model.visualize_topics()

fig = topic_model.visualize_documents(docs, embeddings=embeddings, hide_document_hover=False)
fig.write_html('Embedding Space.html')

topic_model.visualize_documents(docs, embeddings=embeddings, hide_document_hover=False)

hierarchical_topics = topic_model.hierarchical_topics(docs, topics)

fig = topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
fig.write_html("Topic Heirarchy.html")

topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

# Run the hierarchy visualization with the original embeddings
topic_model.visualize_hierarchical_documents(docs, hierarchical_topics, embeddings=embeddings, hide_document_hover=False)

# Topics by class
topics_per_class = topic_model.topics_per_class(docs, topics, classes=df_for_processing['Classes'])

fig = topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=30, normalize_frequency=True)
fig.write_html('Topics by Class.html')

topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=30, normalize_frequency=True)

# Merge Topics
topic_model.load("my_model")  # Reload model to avoid conflicts

topic_model.update_topics(docs, topics, vectorizer_model=vectorizer_model)  # For applying changes to topic labeling (e.g., stopwords)

topics_to_merge = [[45, 37, 20, 38, 26, 22, 10, 27],
                   [19, 43],
                   [23, 36, 35],
                   [42, 31, 24, 33, 13, 39, 28],
                   [16, 1],
                   [40, 29, 18],
                   [47, 34],
                   [9, 0, 41, 44, 5],
                   [21, 6, 2, 30, 8],
                   [32, 11]]

topic_model.merge_topics(docs, topics, topics_to_merge)

topic_model.save("my_model_merged", save_embedding_model=False)

topic_model.get_topic_info().head(1000)

# Save out labeled topics
output_df = copy.deepcopy(df_for_processing)
output_df['Topics'] = topic_model._map_predictions(topic_model.hdbscan_model.labels_)

output_df.to_excel('Labeled_Docs_Merged.xlsx')

topic_model.visualize_topics()

# Run the visualization with the original embeddings
topic_model.visualize_documents(docs, embeddings=embeddings, hide_document_hover=False)

hierarchical_topics = topic_model.hierarchical_topics(docs, topics)

fig = topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
fig.write_html("Topic Heirarchy Merged.html")

topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

fig = topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=20, normalize_frequency=True)
fig.write_html("Topics by Class Merged.html")

topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=20, normalize_frequency=True)
```

@CoandaEffect
Author

One thing I've noticed is that the mappings get progressively more scrambled with each iteration if I go through multiple rounds of merges.

@MaartenGr
Owner

I just checked the code of merge_topics and I believe I understand the issue here. It seems that the topics are not properly updated across some of the functions. It is something that definitely can be fixed but most likely will require some work across BERTopic.
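
To illustrate why labels that are not updated consistently drift further apart with every round of merging, here is a minimal sketch in plain Python. It assumes each merge group collapses to its first member, which is an illustrative convention and not a description of BERTopic's internals; both helper names are hypothetical.

```python
def build_merge_mapping(topics_to_merge):
    """Map every topic in a group to the group's first member."""
    mapping = {}
    for group in topics_to_merge:
        target = group[0]
        for topic in group:
            mapping[topic] = target
    return mapping


def apply_mapping(labels, mapping):
    """Relabel documents; topics absent from the mapping are unchanged."""
    return [mapping.get(label, label) for label in labels]


original = [0, 1, 2, 3, 4, 2, 1]

# Round 1 merges topics 1 and 2; round 2 merges topics 1 and 3.
round_one = build_merge_mapping([[1, 2]])
round_two = build_merge_mapping([[1, 3]])

# Consistent: each round starts from the labels produced by the previous round.
after_one = apply_mapping(original, round_one)
after_two = apply_mapping(after_one, round_two)

# Stale: the second round is applied to the original labels, losing round one.
stale = apply_mapping(original, round_two)

print(after_two)
print(stale)
```

In the stale case, documents from topic 2 keep their old label even though topic 2 no longer exists after the first merge. Any function that reads the original cluster labels instead of the updated ones would show this kind of disagreement.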

MaartenGr added a commit that referenced this issue Aug 10, 2022
@MaartenGr MaartenGr mentioned this issue Aug 10, 2022
MaartenGr added a commit that referenced this issue Sep 11, 2022
* Online/incremental topic modeling with .partial_fit
* Expose c-TF-IDF model for customization with bertopic.vectorizers.ClassTfidfTransformer
* Expose attributes for easier access to internal data
* Major changes to the Algorithm page of the documentation, which now contains three overviews of the algorithm
* Added an example of combining BERTopic with KeyBERT
* Added many tests with the intention of making development a bit more stable
* Fix #632, #648, #673, #682, #667, #664
@darebfh

darebfh commented Nov 17, 2022

Hey Maarten! It seems as if topic_model.merge_topics() still does not propagate the changes internally:

If I run
topic_model.get_topic_info()
topic_model.merge_topics(docs, [1,2])
and again
topic_model.get_topic_info()

I get the same number of topics back.
Is there any way of updating the topic_model manually?

@MaartenGr
Owner

@darebfh Which version of BERTopic are you currently using?

@darebfh

darebfh commented Nov 18, 2022

I'm using the latest build, 0.12.0:
[Screenshot of the installed version]

@darebfh

darebfh commented Nov 18, 2022

Hey Maarten,
sorry for bothering you, of course the error was on my side! ;)
When creating the arrays of topics to be merged, I assumed that the user input would be implicitly converted to int, but it was parsed as string, hence not applying the merge at all.
You COULD add a "Received string, expected int" exception, but of course this would be the cherry on top ;)

All the best and thanks for this awesome tool!
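
The "Received string, expected int" check suggested above could be done on the caller's side before passing topics to the model. The following is a minimal sketch, assuming topic IDs are plain integers; `validate_topics_to_merge` is a hypothetical helper and not part of BERTopic.

```python
import numbers


def validate_topics_to_merge(topics_to_merge):
    """Return the merge groups as lists, rejecting non-integer topic IDs."""
    # Accept a single flat group such as [1, 2] as well as a list of groups.
    if topics_to_merge and not isinstance(topics_to_merge[0], (list, tuple)):
        topics_to_merge = [topics_to_merge]

    validated = []
    for group in topics_to_merge:
        for topic in group:
            # bool counts as an integral type in Python, so exclude it explicitly.
            if isinstance(topic, bool) or not isinstance(topic, numbers.Integral):
                raise TypeError(
                    f"Received {type(topic).__name__}, expected int: {topic!r}"
                )
        validated.append(list(group))
    return validated


# Topic IDs read from user input arrive as strings and must be converted first.
raw_input_ids = ["1", "2"]
converted = validate_topics_to_merge([int(value) for value in raw_input_ids])
print(converted)
```

Passing `raw_input_ids` directly would raise a `TypeError` instead of silently leaving the topics unmerged.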

@MaartenGr
Owner

@darebfh Thanks for the kind words and glad to hear that the issue was resolved! I'll keep it in mind :)
