reduce_outliers and update_topics remove stop_words and ngram_range effects #2114

Open
1 task done
shj37 opened this issue Aug 6, 2024 · 1 comment
Labels
bug Something isn't working

Comments


shj37 commented Aug 6, 2024

Have you searched existing issues? 🔎

  • I have searched and found no existing issues

Describe the bug

After running reduce_outliers and then update_topics, the settings passed to vectorizer_model (stop words, n-gram range) no longer have any effect. The resulting topic representations contain only single words. Thanks.

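from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance
from sklearn.feature_extraction.text import CountVectorizer

# docs, embeddings, and stop_words are defined elsewhere (not shown in this report)
vectorizer_model = CountVectorizer(stop_words=stop_words, ngram_range=(1, 4), min_df=5)
representation_model = MaximalMarginalRelevance(diversity=0.5)

topic_model_outlier_reduction = BERTopic(
    vectorizer_model=vectorizer_model,
    representation_model=representation_model,
    top_n_words=15,
    min_topic_size=15,
    calculate_probabilities=True,
)
topics_outlier_reduction, probs_outlier_reduction = topic_model_outlier_reduction.fit_transform(docs, embeddings)

# Reassign outlier documents to existing topics
new_topics = topic_model_outlier_reduction.reduce_outliers(
    docs,
    topics_outlier_reduction,
    threshold=0.2,
    strategy="distributions",  # probabilities=probs_outlier_reduction,
)

# Called without vectorizer_model or representation_model; this is the step after which the n-grams disappear
topic_model_outlier_reduction.update_topics(docs, topics=new_topics)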

BERTopic Version

0.16.0

shj37 added the bug (Something isn't working) label Aug 6, 2024
MaartenGr (Owner) commented

That's expected behavior, since .update_topics recomputes the topic representations and does not reuse your vectorizer_model or representation_model unless you pass them in again. So instead of this:

topic_model_outlier_reduction.update_topics(docs, topics=new_topics)

you should do this:

topic_model_outlier_reduction.update_topics(
    docs,
    topics=new_topics,
    vectorizer_model=vectorizer_model,
    representation_model=representation_model 
)
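
To make it harder to forget the extra arguments, the two steps can be wrapped in one function. The following is only a minimal sketch: the helper name `reduce_outliers_and_update` is hypothetical and not part of BERTopic, and it assumes nothing beyond the `reduce_outliers` and `update_topics` calls already shown in this thread, with the same keyword arguments.

```python
from typing import Any, List, Sequence


def reduce_outliers_and_update(
    topic_model: Any,
    docs: Sequence[str],
    topics: List[int],
    vectorizer_model: Any,
    representation_model: Any,
    strategy: str = "distributions",
    threshold: float = 0.2,
) -> List[int]:
    """Hypothetical helper (not a BERTopic API).

    Reassigns outlier documents, then refreshes the topic representations
    while passing the original custom models to update_topics again.
    """
    new_topics = topic_model.reduce_outliers(
        docs, topics, strategy=strategy, threshold=threshold
    )
    # Pass the same models that were used when the topic model was created,
    # so stop words and n-gram settings still apply to the new representations.
    topic_model.update_topics(
        docs,
        topics=new_topics,
        vectorizer_model=vectorizer_model,
        representation_model=representation_model,
    )
    return new_topics
```

With the objects from the original report, the call would look like `new_topics = reduce_outliers_and_update(topic_model_outlier_reduction, docs, topics_outlier_reduction, vectorizer_model, representation_model)`.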
