Topic Modelling and Visualization #163
base: master
Conversation
support MultiIndex as function parameter; returns MultiIndex where Representation was returned (missing: correct test) Co-authored-by: Henri Froese <hf2000510@gmail.com>
missing: adapt tests for new types Co-authored-by: Henri Froese <hf2000510@gmail.com>
Co-authored-by: Maximilian Krahn <maximilian.krahn@icloud.com>
…pic_model Co-authored-by: Maximilian Krahn <maximilian.krahn@icloud.com>
@jbesomi we created a short notebook where we display the functionality of those two pipelines. 🐰 We think those two will be the main use cases of the implemented functions. The third 🥉 use case, finding relevant words per topic and including them in a DataFrame, is just a variant of the second algorithm, but with the documents clustered by a clustering algorithm like kmeans or assigned to a topic with LSA/LDA. Once those functions are ready to merge, we will prepare an exhaustive tutorial to introduce users to Topic Modeling 💯
For now, reviewed only
Thanks for the review! As I commented above, we'll have to go through this again anyway once #156 is merged 🙏
PCoA is implemented in a sub-optimal way in the pyLDAvis library. We change this (by adding 1 character to their code). Co-authored-by: Maximilian Krahn <maximilian.krahn@icloud.com>
We have also updated this branch, so it is now sourced from master 🥳 It is now ready to be reviewed or merged 🦀 🤞
This would be a very useful feature. Any pending blockers or any expected date for merging and releasing?
Hey @kepler The remaining step for the HeroSeries is:

Hello, do you have any news on this topic or when it will be released? Thanks :)
Hi, is there any news on when this PR will be implemented?
This PR implements support for Topic Modelling in Texthero (see #42). You may want to look at the showcasing notebook before reading this.
Overview
We implement six new functions:

- `lda` (Latent Dirichlet Allocation (LDA))
- `truncatedSVD` (truncated Singular Value Decomposition), the same as Latent Semantic Analysis / Indexing (LSA / LSI)
- `visualize_topics` to visualize topics with pyLDAvis
- `topics_from_topic_model` to get topics for documents after using `lda`/`truncatedSVD`
- `top_words_per_document` to get the most relevant words ("keywords") for every document
- `top_words_per_topic` to get the most relevant words for every topic (= cluster)

There are now two main ways for users to find, visualize, and understand the topics in their datasets (see the usage sketch after this list):

1. `tfidf`/`count`/`term_frequency` [optional: -> flair embeddings] [optional: -> dimensionality reduction, e.g. `truncatedSVD`] -> clustering. The clusters are then understood as "topics". Users can now use e.g. `visualize_topics(s_tfidf, s_clustered)` to see their clusters/topics visualized, and they can do `top_words_per_topic(s_tfidf, s_clustered)` to get the most relevant words for every cluster.
2. `tfidf`/`count`/`term_frequency` -> `lda`. Users can now use e.g. `visualize_topics(s_tfidf, s_lda)` to see the topics found by LDA visualized; they can do `s_topics = topics_from_topic_model(s_tfidf, s_lda)` to get the best-matching topic for every document and then `top_words_per_topic(s_tfidf, s_topics)` to get the most relevant words for every topic.
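Putting the two pipelines together, a rough usage sketch (function names follow this PR's description; the exact signatures and namespace are assumptions and may change before merging):

```python
import pandas as pd
import texthero as hero

s = pd.Series([
    "football is played with a ball",
    "tennis is played with a racket",
    "stocks and bonds are traded on markets",
])
# Represent the documents (tokenizing first, as newer texthero versions expect tokenized input).
s_tfidf = hero.tfidf(hero.tokenize(hero.clean(s)))

# Way 1: clustering -> the clusters are the topics.
s_clustered = hero.kmeans(s_tfidf, n_clusters=2)
hero.visualize_topics(s_tfidf, s_clustered)            # interactive pyLDAvis view
print(hero.top_words_per_topic(s_tfidf, s_clustered))  # most relevant words per cluster

# Way 2: LDA -> topics from the topic model.
s_lda = hero.lda(s_tfidf, n_components=2)              # n_components is an assumed parameter name
hero.visualize_topics(s_tfidf, s_lda)
s_topics = hero.topics_from_topic_model(s_tfidf, s_lda)
print(hero.top_words_per_topic(s_tfidf, s_topics))
```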
The new functions in detail (excerpts of their docstrings + some explanations)
LDA
This is a very straightforward wrapper around sklearn's LDA. It returns a matrix with dimensions `number of documents x number of topics` (the "document-topic-matrix") that relates documents to topics: `document_topic_matrix[i][j]` says how strongly document i belongs to topic j (unnormalized!).
truncatedSVD
Like e.g. PCA, a dimensionality reduction; see this for an example of using the sklearn implementation. It is used just like e.g. PCA.
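A minimal LSA example with plain sklearn (again, not this PR's wrapper), for intuition:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["cats purr", "dogs bark", "cats and dogs play"]
tfidf = TfidfVectorizer().fit_transform(docs)       # documents x terms
svd = TruncatedSVD(n_components=2, random_state=42)
document_topic_matrix = svd.fit_transform(tfidf)    # documents x topics (LSA)
```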
visualize_topics
This is our coolest new function: it visualizes the topics interactively. It builds upon `pyLDAvis` and extends it so that we are not restricted to `lda` to profit from the great visualization interface.

The first input is the output of `tfidf`/`term_frequency`/`count`. This gives us a relation (/matrix) documents->terms. The second input has to give us a relation documents->topics. This can either be the output of one of our clustering functions (then the clusters are the topics, so we have one topic per document, and we create a document-topic-matrix from that) or the output of `lda` (then, as described above, we already have a document-topic-matrix).

From those two relations (documents->topics, documents->terms), the function calculates a distribution of documents to topics and a distribution of topics to terms (similarly to what pyLDAvis does internally, but we extend it to clustering input and not only LDA). These distributions are then passed to pyLDAvis, which visualizes them. The function `visualize_topics` and its helper functions are really well documented 🥈, so it should be clear what's happening in the code after reading this.
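To illustrate the clustering case described above, here is a sketch of the core idea (illustrative code, not this PR's internals): one cluster ID per document becomes a one-hot document-topic-matrix.

```python
import numpy as np
import pandas as pd

s_clustered = pd.Series([0, 2, 1, 0], dtype="category")   # one cluster/topic per document
n_topics = len(s_clustered.cat.categories)
document_topic_matrix = np.zeros((len(s_clustered), n_topics))
document_topic_matrix[np.arange(len(s_clustered)), s_clustered.cat.codes.to_numpy()] = 1.0
# Each row is a degenerate distribution: all mass on the document's own cluster.
```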
topics_from_topic_model
Find the topics from a topic model. The input has to be the output of one of `lda`, `truncatedSVD`, i.e. the output of one of Texthero's Topic Modelling functions that returns a relation between documents and topics (the document_topic_matrix). The function uses this relation to calculate the best-matching topic per document and returns a Series with the topic IDs.

The document_topic_matrix relates documents to topics: for each document (i.e. for each row), it shows how strongly that document belongs to each topic, so `document_topic_matrix[X][Y]` = how strongly document X belongs to topic Y (as explained above). We use `np.argmax` to find, for each document (each row), the index of the topic that the document belongs to most strongly. E.g. when the first row of the document_topic_matrix is `[0.2, 0.1, 0.2, 0.5]`, the first document is put into topic/cluster 3, as the entry at index 3 (counting from 0) is the best-matching topic.

We return a CategorySeries (see #164), i.e. a Series with an ID per document describing which cluster it belongs to.
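A minimal sketch of the argmax step described above (pure numpy/pandas, with illustrative names):

```python
import numpy as np
import pandas as pd

document_topic_matrix = np.array([
    [0.2, 0.1, 0.2, 0.5],   # document 0 -> topic 3
    [0.7, 0.1, 0.1, 0.1],   # document 1 -> topic 0
])
s_topics = pd.Series(np.argmax(document_topic_matrix, axis=1), dtype="category")
# s_topics: [3, 0] -- one topic ID per document
```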
top_words_per_topic
The function takes as first input a DocumentTermDF (i.e. the output of `tfidf`, `term_frequency`, `count`) and as second input a CategorySeries (see #164) that assigns a topic/cluster to every document (i.e. the output of a clustering function or of `topics_from_topic_model`).

The function uses the given clustering from the second input, which relates documents to topics; the first input relates documents to terms. From those two relations (documents->topics, documents->terms), the function calculates a distribution of documents to topics and a distribution of topics to terms. These distributions are again used to find the most relevant terms per topic through pyLDAvis (see their original paper on how they determine relevant terms).
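For reference, the relevance measure pyLDAvis uses to rank terms (from the Sievert & Shirley 2014 paper mentioned above), sketched in plain numpy; the helper name and inputs are illustrative:

```python
import numpy as np

def relevance(p_term_given_topic, p_term, lam=0.6):
    # lam = 1: rank terms by pure within-topic probability; lam = 0: rank by lift.
    return lam * np.log(p_term_given_topic) + (1 - lam) * np.log(p_term_given_topic / p_term)
```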
top_words_per_document
Very similar to `top_words_per_topic`, only that every document is treated as its own topic/cluster, so pyLDAvis finds relevant words ("keywords") that are characteristic of each document.
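A hypothetical call, following the description above (the exact signature is my assumption from this PR's description, not a released API):

```python
# Assumed: only the DocumentTermDF is needed, since each document is its own topic.
keywords = hero.top_words_per_document(s_tfidf)
```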
Showcase / Example
See this notebook for examples for this PR.