Unexpected Behavior After Merging Topics #632

CoandaEffect · 2022-07-22T17:22:51Z

As I run my analyses, I find that the outputs from my initial model are usually very good, but usually still require a certain amount of tweaking to work perfectly. I was using the reduce_topics method for this, but have recently switched to the merge_topics method for the greater degree of control that it affords. However, I have run into some issues:

1. When I try to get the topic mapping to each doc using topic_model._map_predictions(topic_model.hdbscan_model.labels_), the topics that it outputs do not match those that are summarized when I run get_topic_info(). In a recent example, the summary showed 13 topics, while there were only 3 unique topics in the list of mappings from hdbscan_model.labels_.
2. Most of the visualizations look good, including visualize_topics(), visualize_hierarchy(hierarchical_topics=hierarchical_topics), and visualize_topics_per_class(). However, visualize_documents() appears to be using the bad mappings from problem 1, giving me an output like this:

It seems like there must be accurate topic mappings stored somewhere, otherwise none of the visualizations and summaries would work. Am I missing something obvious?

Thanks!
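
A minimal sketch of how the mismatch described above could be checked, assuming only that the topic ids from the summary and the per-document labels are available as plain sequences. The helper name `compare_topic_labels` and the sample values are hypothetical and not part of BERTopic.

```python
def compare_topic_labels(summary_topics, document_topics):
    """Return the topic ids that appear on only one side of the comparison."""
    summary_set = set(summary_topics)
    document_set = set(document_topics)
    return {
        "only_in_summary": sorted(summary_set - document_set),
        "only_in_documents": sorted(document_set - summary_set),
    }


# Made-up values for illustration: the summary lists five topics,
# but the per-document labels only ever use three of them.
summary_topics = [-1, 0, 1, 2, 3]
document_topics = [0, 0, 1, -1, 1, 0]

mismatch = compare_topic_labels(summary_topics, document_topics)
print(mismatch)
```

With a fitted model, `summary_topics` could come from `topic_model.get_topic_info()["Topic"]` and `document_topics` from the per-document mapping discussed above. Two empty lists would indicate that both views agree on which topics exist.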

CoandaEffect · 2022-07-22T21:07:39Z

Update: get_representative_docs() seems to be drawing from the original labels, rather than the merged labels.

MaartenGr · 2022-07-23T08:00:47Z

Thank you for sharing this issue. From what I can see, there might be an issue with the way merge_topics is currently working but I cannot be sure. Can you share your entire code for getting these issues? Including training and merging topics.

CoandaEffect · 2022-07-25T15:43:58Z

Thanks Maarten! Sorry for the slight mess. I copied this out of the notebook that I've been troubleshooting in.

```python
import copy

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# docs, df_for_processing, and vectorizer_model are defined earlier in the notebook.

# Calculate embeddings
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=True)

# Set up model: initialize the topic model.
# A custom count vectorizer is applied in order to use our custom stopword list.
topic_model = BERTopic(
    verbose=True,
    n_gram_range=(1, 2),
    min_topic_size=30,
    embedding_model=sentence_model,
    vectorizer_model=vectorizer_model,
)

# Run it!
topics, probs = topic_model.fit_transform(docs, embeddings)
topic_model.save("my_model", save_embedding_model=False)

topic_model.get_topic_info().head(1000)

topic_model.get_topic_info().to_excel('topic_summary.xlsx')

# Save out labeled topics
output_df = copy.deepcopy(df_for_processing)
output_df['Topics'] = topic_model._map_predictions(topic_model.hdbscan_model.labels_)

output_df.to_excel('Labeled_Docs.xlsx')

# Save out summary along with representative docs for each topic to help understand them
summary = topic_model.get_topic_info()
summary.drop(0, inplace=True)  # Drop the first row of the summary


# Iterate through representative docs and add to summary
def get_rep_docs_by_row(row):
    return topic_model.get_representative_docs(row)


summary["Representative Docs"] = summary["Topic"].apply(get_rep_docs_by_row)

summary.to_excel('Topic Summary.xlsx')

fig = topic_model.visualize_topics()
fig.write_html('Topic Distances.html')

topic_model.visualize_topics()

fig = topic_model.visualize_documents(docs, embeddings=embeddings, hide_document_hover=False)
fig.write_html('Embedding Space.html')

topic_model.visualize_documents(docs, embeddings=embeddings, hide_document_hover=False)

hierarchical_topics = topic_model.hierarchical_topics(docs, topics)

fig = topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
fig.write_html("Topic Heirarchy.html")

topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

# Run the hierarchy visualization with the original embeddings
topic_model.visualize_hierarchical_documents(docs, hierarchical_topics, embeddings=embeddings, hide_document_hover=False)

# Topics by class
topics_per_class = topic_model.topics_per_class(docs, topics, classes=df_for_processing['Classes'])

fig = topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=30, normalize_frequency=True)
fig.write_html('Topics by Class.html')

topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=30, normalize_frequency=True)

# Merge topics
# Reload model to avoid conflicts. load() returns a new model, so the result is
# assigned; the embedding model is passed back in because it was not saved.
topic_model = BERTopic.load("my_model", embedding_model=sentence_model)

# For applying changes to topic labeling (e.g., stopwords)
topic_model.update_topics(docs, topics, vectorizer_model=vectorizer_model)

topics_to_merge = [
    [45, 37, 20, 38, 26, 22, 10, 27],
    [19, 43],
    [23, 36, 35],
    [42, 31, 24, 33, 13, 39, 28],
    [16, 1],
    [40, 29, 18],
    [47, 34],
    [9, 0, 41, 44, 5],
    [21, 6, 2, 30, 8],
    [32, 11],
]

topic_model.merge_topics(docs, topics, topics_to_merge)

topic_model.save("my_model_merged", save_embedding_model=False)

topic_model.get_topic_info().head(1000)

# Save out labeled topics
output_df = copy.deepcopy(df_for_processing)
output_df['Topics'] = topic_model._map_predictions(topic_model.hdbscan_model.labels_)

output_df.to_excel('Labeled_Docs_Merged.xlsx')

topic_model.visualize_topics()

# Run the visualization with the original embeddings
topic_model.visualize_documents(docs, embeddings=embeddings, hide_document_hover=False)

hierarchical_topics = topic_model.hierarchical_topics(docs, topics)

fig = topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
fig.write_html("Topic Heirarchy Merged.html")

topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

fig = topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=20, normalize_frequency=True)
fig.write_html("Topics by Class Merged.html")

topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=20, normalize_frequency=True)
```

CoandaEffect · 2022-07-25T17:31:57Z

One thing that I've noticed is that it tends to get more and more messed up with each iteration, if I go through multiple rounds of merges.

MaartenGr · 2022-07-25T18:20:59Z

I just checked the code of merge_topics and I believe I understand the issue here. It seems that the topics are not properly updated across some of the functions. It is something that definitely can be fixed but most likely will require some work across BERTopic.
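
To illustrate what "properly updated" means here, the following is a rough sketch of how a merge can be expressed as a mapping from old topic ids to new ones and then applied to the per-document labels. It is not BERTopic's actual implementation: the helper names are hypothetical, and mapping each group onto its first member is an assumption made for the example (the library may also renumber topics afterwards).

```python
def build_merge_mapping(all_topics, topics_to_merge):
    """Map every topic id to itself, then point each merged group at its first member."""
    mapping = {topic: topic for topic in set(all_topics)}
    for group in topics_to_merge:
        target = group[0]  # Assumption: the first id in the group survives the merge
        for topic in group:
            mapping[topic] = target
    return mapping


def apply_mapping(document_topics, mapping):
    """Relabel each document with its post-merge topic id."""
    return [mapping[topic] for topic in document_topics]


# Made-up per-document labels; topics 1 and 2 are merged.
document_topics = [0, 1, 2, 3, 1, 2, -1]
mapping = build_merge_mapping(document_topics, [[1, 2]])
merged = apply_mapping(document_topics, mapping)
print(merged)
```

Anything that reads per-document labels (document visualizations, representative documents, exported labels) would need to read the relabeled list rather than the original cluster labels; otherwise it shows the pre-merge topics, which matches the symptoms reported above.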

* Online/incremental topic modeling with .partial_fit
* Expose c-TF-IDF model for customization with bertopic.vectorizers.ClassTfidfTransformer
* Expose attributes for easier access to internal data
* Major changes to the Algorithm page of the documentation, which now contains three overviews of the algorithm
* Added an example of combining BERTopic with KeyBERT
* Added many tests with the intention of making development a bit more stable
* Fix #632, #648, #673, #682, #667, #664

darebfh · 2022-11-17T08:35:08Z

Hey Maarten! It seems as if topic_model.merge_topics() still does not propagate the changes internally:

If I run
topic_model.get_topic_info()
topic_model.merge_topics(docs, [1,2])
and again
topic_model.get_topic_info()

I get the same number of topics back.
Is there any way of updating the topic_model manually?

MaartenGr · 2022-11-18T06:45:27Z

@darebfh Which version of BERTopic are you currently using?

darebfh · 2022-11-18T07:35:46Z

I'm using the latest build, 0.12.0.

darebfh · 2022-11-18T08:42:05Z

Hey Maarten,
sorry for bothering you, of course the error was on my side! ;)
When creating the arrays of topics to be merged, I assumed that the user input would be implicitly converted to int, but it was parsed as string, hence not applying the merge at all.
You COULD add a "Received string, expected int" exception, but of course this would be the cherry on top ;)

All the best and thanks for this awesome tool!
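
A small sketch of the kind of check suggested above, applied on the caller's side before the ids are passed to merge_topics. The helper `parse_topic_ids` is hypothetical and not part of BERTopic; it assumes the user input arrives as a list of strings or integers.

```python
def parse_topic_ids(raw_values):
    """Convert user-supplied topic ids to int, failing loudly on anything else."""
    parsed = []
    for value in raw_values:
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool):
            raise TypeError(f"Expected a topic id, received bool: {value!r}")
        if isinstance(value, int):
            parsed.append(value)
        elif isinstance(value, str):
            try:
                parsed.append(int(value.strip()))
            except ValueError:
                raise ValueError(f"Cannot interpret {value!r} as an integer topic id") from None
        else:
            raise TypeError(f"Expected int or str, received {type(value).__name__}")
    return parsed


# Ids read from a text field arrive as strings and are converted before merging.
topics_to_merge = parse_topic_ids(["1", " 2"])
print(topics_to_merge)
```

The converted list can then be passed on as the topics to merge, so a string such as "1" no longer fails to match the integer topic id 1.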

MaartenGr · 2022-11-19T07:03:07Z

@darebfh Thanks for the kind words and glad to hear that the issue was resolved! I'll keep it in mind :)

kkadu mentioned this issue Aug 1, 2022

Issue with supervised topic modeling approach to predict new documents #645

Closed

MaartenGr added a commit that referenced this issue Aug 10, 2022

Fix #632 and #648

a6bbf49

MaartenGr mentioned this issue Aug 10, 2022

v0.12 #668

Merged

MaartenGr closed this as completed in #668 Sep 11, 2022
