Topic Modelling and Visualization #163
base: master
Conversation
support MultiIndex as function parameter; returns MultiIndex where Representation was returned (missing: correct test) Co-authored-by: Henri Froese <hf2000510@gmail.com>
missing: adapt tests for new types Co-authored-by: Henri Froese <hf2000510@gmail.com>
Co-authored-by: Maximilian Krahn <maximilian.krahn@icloud.com>
…pic_model Co-authored-by: Maximilian Krahn <maximilian.krahn@icloud.com>
@jbesomi we created a short notebook where we display the functionality of those two pipelines. 🐰 We think those two will be the main use cases of the implemented functions. The third 🥉 use case, finding relevant words per topic and including them in a DataFrame, is just a variant of the second algorithm, but with the documents clustered by a clustering algorithm like kmeans or assigned to a topic with LSA/LDA. Once those functions are ready to merge, we will prepare an exhaustive tutorial to introduce users to Topic Modeling 💯
For now, reviewed only
Thanks for the review! As I commented above, we'll have to go through this again anyway once #156 is merged 🙏
PCoA is implemented in a sub-optimal way in the pyLDAvis library. We change this (by adding 1 character to their code). Co-authored-by: Maximilian Krahn <maximilian.krahn@icloud.com>
We have also updated this branch, so it is now sourced from master 🥳 It is now ready to be reviewed or merged 🦀 🤞
This would be a very useful feature. Any pending blockers or any expected date for merging and releasing?
Hey @kepler The remaining step for the HeroSeries is:

Hello, do you have any news on this topic or when it will be released? Thanks :)
Hi, is there any news on when this PR will be implemented?
This PR implements support for Topic Modelling in Texthero (see #42). You may want to look at the showcasing notebook before reading this.
Overview
We implement six new functions:

- `lda` (Latent Dirichlet Allocation (LDA))
- `truncatedSVD` (truncated Singular Value Decomposition), the same as Latent Semantic Analysis / Indexing (LSA / LSI)
- `visualize_topics` to visualize topics with pyLDAvis
- `topics_from_topic_model` to get topics for documents after using `lda`/`truncatedSVD`
- `top_words_per_document` to get the most relevant words ("keywords") for every document
- `top_words_per_topic` to get the most relevant words for every topic (= cluster)

There are now two main ways for users to find, visualize, and understand the topics in their datasets (see the usage sketch after this list):

1. `tfidf`/`count`/`term_frequency` [optional: -> flair embeddings] [optional: -> dimensionality reduction, e.g. `truncatedSVD`] -> clustering. The clusters are then understood as "topics". Users can now use e.g. `visualize_topics(s_tfidf, s_clustered)` to see their clusters/topics visualized, and they can do `top_words_per_topic(s_tfidf, s_clustered)` to get the most relevant words for every cluster.
2. `tfidf`/`count`/`term_frequency` -> `lda`. Users can now use e.g. `visualize_topics(s_tfidf, s_lda)` to see the topics found by LDA visualized; they can do `s_topics = topics_from_topic_model(s_tfidf, s_lda)` to get the best-matching topic for every document and then `top_words_per_topic(s_tfidf, s_topics)` to get the most relevant words for every topic.
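Putting the two pipelines together, a rough usage sketch (function names follow this PR's description; the exact signatures and namespace are assumptions and may change before merging):

```python
import pandas as pd
import texthero as hero

s = pd.Series([
    "football is played with a ball",
    "tennis is played with a racket",
    "stocks and bonds are traded on markets",
])
# Represent the documents (tokenizing first, as newer texthero versions expect tokenized input).
s_tfidf = hero.tfidf(hero.tokenize(hero.clean(s)))

# Way 1: clustering -> the clusters are the topics.
s_clustered = hero.kmeans(s_tfidf, n_clusters=2)
hero.visualize_topics(s_tfidf, s_clustered)            # interactive pyLDAvis view
print(hero.top_words_per_topic(s_tfidf, s_clustered))  # most relevant words per cluster

# Way 2: LDA -> topics from the topic model.
s_lda = hero.lda(s_tfidf, n_components=2)              # n_components is an assumed parameter name
hero.visualize_topics(s_tfidf, s_lda)
s_topics = hero.topics_from_topic_model(s_tfidf, s_lda)
print(hero.top_words_per_topic(s_tfidf, s_topics))
```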
The new functions in detail (excerpts of their docstrings + some explanations)
LDA
This is a very straightforward wrapper around sklearn's LDA. It returns a matrix with dimensions `number of documents x number of topics` (the "document-topic-matrix") that relates documents to topics: `document_topic_matrix[i][j]` says how strongly document i belongs to topic j (unnormalized!).
truncatedSVD
Like e.g. PCA, a dimensionality reduction; see this for an example of using the sklearn implementation. It is used just like e.g. PCA.
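A minimal LSA example with plain sklearn (again, not this PR's wrapper), for intuition:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["cats purr", "dogs bark", "cats and dogs play"]
tfidf = TfidfVectorizer().fit_transform(docs)       # documents x terms
svd = TruncatedSVD(n_components=2, random_state=42)
document_topic_matrix = svd.fit_transform(tfidf)    # documents x topics (LSA)
```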
visualize_topics
This is our coolest new function: it visualizes the topics interactively. It builds upon `pyLDAvis` and extends it so that we are not restricted to `lda` to profit from the great visualization interface.

The first input is the output of `tfidf`/`term_frequency`/`count`. This gives us a relation (/matrix) documents->terms. The second input has to give us a relation documents->topics. This can either be the output of one of our clustering functions (then the clusters are the topics, so we have one topic per document, and we create a document-topic-matrix from that) or the output of `lda` (then, as described above, we already have a document-topic-matrix).

From those two relations (documents->topics, documents->terms), the function calculates a distribution of documents to topics and a distribution of topics to terms (similarly to what pyLDAvis does internally, but we extend it to clustering input and not only LDA). These distributions are then passed to pyLDAvis, which visualizes them. The function `visualize_topics` and its helper functions are really well documented 🥈, so it should be clear what's happening in the code after reading this.
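To illustrate the clustering case described above, here is a sketch of the core idea (illustrative code, not this PR's internals): one cluster ID per document becomes a one-hot document-topic-matrix.

```python
import numpy as np
import pandas as pd

s_clustered = pd.Series([0, 2, 1, 0], dtype="category")   # one cluster/topic per document
n_topics = len(s_clustered.cat.categories)
document_topic_matrix = np.zeros((len(s_clustered), n_topics))
document_topic_matrix[np.arange(len(s_clustered)), s_clustered.cat.codes.to_numpy()] = 1.0
# Each row is a degenerate distribution: all mass on the document's own cluster.
```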
topics_from_topic_model
Find the topics from a topic model. The input has to be the output of one of `lda`, `truncatedSVD`, i.e. the output of one of Texthero's Topic Modelling functions that returns a relation between documents and topics (the document_topic_matrix). The function uses this relation to calculate the best-matching topic per document and returns a Series with the topic IDs.

The document_topic_matrix relates documents to topics: for each document (i.e. for each row), it shows how strongly that document belongs to each topic, so `document_topic_matrix[X][Y]` = how strongly document X belongs to topic Y (as explained above). We use `np.argmax` to find, for each document (each row), the index of the topic that the document belongs to most strongly. E.g. when the first row of the document_topic_matrix is `[0.2, 0.1, 0.2, 0.5]`, the first document is put into topic/cluster 3, as the entry at index 3 (counting from 0) is the best-matching topic.

We return a CategorySeries (see #164), i.e. a Series with an ID per document describing which cluster it belongs to.
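A minimal sketch of the argmax step described above (pure numpy/pandas, with illustrative names):

```python
import numpy as np
import pandas as pd

document_topic_matrix = np.array([
    [0.2, 0.1, 0.2, 0.5],   # document 0 -> topic 3
    [0.7, 0.1, 0.1, 0.1],   # document 1 -> topic 0
])
s_topics = pd.Series(np.argmax(document_topic_matrix, axis=1), dtype="category")
# s_topics: [3, 0] -- one topic ID per document
```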
top_words_per_topic
The function takes as first input a DocumentTermDF (i.e. the output of `tfidf`, `term_frequency`, `count`) and as second input a CategorySeries (see #164) that assigns a topic/cluster to every document (i.e. the output of a clustering function or of `topics_from_topic_model`).

The function uses the given clustering from the second input, which relates documents to topics; the first input relates documents to terms. From those two relations (documents->topics, documents->terms), the function calculates a distribution of documents to topics and a distribution of topics to terms. These distributions are again used to find the most relevant terms per topic through pyLDAvis (see their original paper on how they determine relevant terms).
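For reference, the relevance measure pyLDAvis uses to rank terms (from the Sievert & Shirley 2014 paper mentioned above), sketched in plain numpy; the helper name and inputs are illustrative:

```python
import numpy as np

def relevance(p_term_given_topic, p_term, lam=0.6):
    # lam = 1: rank terms by pure within-topic probability; lam = 0: rank by lift.
    return lam * np.log(p_term_given_topic) + (1 - lam) * np.log(p_term_given_topic / p_term)
```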
top_words_per_document
Very similar to `top_words_per_topic`, only that every document is treated as its own topic/cluster, so pyLDAvis finds relevant words ("keywords") that are characteristic of each document.
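A hypothetical call, following the description above (the exact signature is my assumption from this PR's description, not a released API):

```python
# Assumed: only the DocumentTermDF is needed, since each document is its own topic.
keywords = hero.top_words_per_document(s_tfidf)
```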
Showcase / Example
See this notebook for examples for this PR.