Dominant topic in a document #954

Closed
dineshdisprz opened this issue Jan 24, 2023 · 5 comments

Comments

@dineshdisprz

Is there any way to get the dominant topic for each of the documents?

@GeorgeDittmar

When you say dominant topic, do you mean the one with the most members, or something else?

@MaartenGr
Owner

After training the model, you can access the assigned topic for each document with topic_model.topics_. These are in the same order as the input documents. Technically, these are the dominant topics, since each document is assigned to a single topic.

However, if you want to model the distribution of topics in the documents, it might be worthwhile to use .approximate_distribution instead. You can find more about that here.
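
A minimal sketch of the two options, assuming an already fitted BERTopic model is passed in. The helper name `summarize_topics` is hypothetical. `topics_` and `approximate_distribution` are the attribute and method named in this comment; the assumption that column j of the returned matrix corresponds to topic j (with the outlier topic -1 having no column) should be checked against your BERTopic version.

```python
def summarize_topics(topic_model, docs):
    """Pair each document with its assigned topic and its approximate distribution.

    Assumes `topic_model` is a fitted BERTopic model and `docs` is the same
    list of documents, in the same order, that the model was fitted on.
    """
    # One topic id per document, in input order (-1 marks outliers).
    assigned = list(topic_model.topics_)

    # approximate_distribution returns a tuple; the first element is a
    # (n_documents x n_topics) matrix of topic weights.
    topic_distr, _ = topic_model.approximate_distribution(docs)

    summary = []
    for doc, topic, row in zip(docs, assigned, topic_distr):
        weights = [float(w) for w in row]
        # Assumption: column j corresponds to topic j.
        # A row of zeros means no topic scored for this document.
        if any(weights):
            strongest = max(range(len(weights)), key=weights.__getitem__)
        else:
            strongest = None
        summary.append(
            {
                "doc": doc,
                "assigned_topic": topic,
                "strongest_in_distribution": strongest,
                "distribution": weights,
            }
        )
    return summary
```

The assigned topic and the strongest topic in the distribution come from different procedures, so they are reported side by side here rather than merged.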

@dineshdisprz
Author

> When you say dominant topic do you mean the one with the most members or something else?

Each document is represented by a set of topics. The topic that is discussed most in a document is its "dominant topic".

@dineshdisprz
Author

> After training the model, you can access the assigned topic for each document with topic_model.topics_. These are in the same order as the input documents. Technically, these are the dominant topics, since each document is assigned to a single topic.
>
> However, if you want to model the distribution of topics in the documents, it might be worthwhile to use .approximate_distribution instead. You can find more about that here.

The documentation says that topic_model.topics_ tracks the most recent topics. Is that the same as the dominant topic? I ask because when I checked the per-document topic probabilities (by setting calculate_probabilities=True), the topic with the highest probability differed from the one in topic_model.topics_.

@MaartenGr
Owner

> The documentation says that topic_model.topics_ tracks the most recent topics. Is that the same as the dominant topic?

Yes, these are the dominant topics per document.

> I ask because when I checked the per-document topic probabilities (by setting calculate_probabilities=True), the topic with the highest probability differed from the one in topic_model.topics_.

The calculated probabilities are approximations, as a result of how HDBSCAN generates them. They may therefore differ from the assigned topics and should be treated as just that: an approximation. I have had this question a couple of times before, so I'll make sure to state this more clearly in the documentation.
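
To see how often the two disagree, a small check like the following can help. It is a sketch, not part of BERTopic: `find_disagreements` is a hypothetical helper, and it assumes `probabilities` is the per-document matrix obtained with calculate_probabilities=True, with column j corresponding to topic j. Documents assigned to the outlier topic -1 are skipped, on the assumption that the matrix has no column for it.

```python
def find_disagreements(assigned, probabilities):
    """Return (doc_index, assigned_topic, most_probable_topic) for each mismatch.

    `assigned` is a sequence of topic ids (such as topic_model.topics_) and
    `probabilities` a matching sequence of per-topic probability rows.
    """
    mismatches = []
    for index, (topic, row) in enumerate(zip(assigned, probabilities)):
        if topic == -1:
            # Assumption: outliers have no column in the matrix, so skip them.
            continue
        weights = [float(w) for w in row]
        # Index of the largest probability in this row.
        most_probable = max(range(len(weights)), key=weights.__getitem__)
        if most_probable != topic:
            mismatches.append((index, topic, most_probable))
    return mismatches


# Toy data standing in for a fitted model's output (illustrative values only).
toy_assigned = [0, 1, -1, 1]
toy_probabilities = [
    [0.7, 0.3],
    [0.6, 0.4],
    [0.5, 0.5],
    [0.2, 0.8],
]
toy_mismatches = find_disagreements(toy_assigned, toy_probabilities)
print(toy_mismatches)  # [(1, 1, 0)]
```

With a fitted model, the call would be along the lines of `find_disagreements(topic_model.topics_, topic_model.probabilities_)`. A non-empty result is expected behavior given the approximation described above, not a sign of a bug.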

MaartenGr added a commit that referenced this issue Feb 4, 2023
MaartenGr mentioned this issue Feb 8, 2023
MaartenGr added a commit that referenced this issue Feb 14, 2023
* Add representation models
  * bertopic.representation.KeyBERTInspired
  * bertopic.representation.PartOfSpeech
  * bertopic.representation.MaximalMarginalRelevance
  * bertopic.representation.Cohere
  * bertopic.representation.OpenAI
  * bertopic.representation.TextGeneration
  * bertopic.representation.LangChain
  * bertopic.representation.ZeroShotClassification
* Fix topic selection when extracting repr docs
* Improve documentation, #769, #954, #912
* Add wordcloud example to documentation
* Add title param for each graph, #800
* Improved nr_topics procedure
* Fix #952, #903, #911, #965. Add #976