[WIP][DNM] Visualize topic model difference (need feedback) #1243

Closed
wants to merge 31 commits into from


menshikh-iv
Contributor

@menshikh-iv menshikh-iv commented Mar 26, 2017

I think gensim currently offers too little for visualizing models, which motivated me to start working in this direction.
I see two important cases here (both need only a dictionary and a topic-word matrix, so they apply to any topic model):

Case 1: Difference between two models

I have two topic models and the question is "What is the difference between them?". I treat a model as a matrix Theta (topics x dictionary). We need to calculate how "similar" the two models are, and to decide what "similarity" and "difference" mean. The difference between models is well described by the difference between their topics. Using this idea, I construct a topic x topic matrix for the two models, where matrix[topic_i][topic_j] describes the difference between these two topics. For this purpose I used "distance functions" such as KL, Hellinger, and Jaccard. To annotate matrix[topic_i][topic_j], I used the intersection and the symmetric difference of the top_n words of each topic.

This approach shows how different the models are, and also which words are specific to each pair of topics.
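The matrix described in Case 1 can be sketched in plain Python as follows. This is a minimal illustration, not the code from this PR: `hellinger`, `top_words`, and `topic_diff` are hypothetical helper names, the two toy models are made-up topic-word distributions over a four-word vocabulary, and Hellinger is only one of the distances mentioned above.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2.0)


def top_words(topic, vocab, top_n):
    """Set of the top_n most probable words of a topic."""
    ranked = sorted(range(len(topic)), key=lambda i: topic[i], reverse=True)
    return {vocab[i] for i in ranked[:top_n]}


def topic_diff(model_a, model_b, vocab, top_n=2):
    """Return (diff, annotation), both indexed as [topic_of_a][topic_of_b].

    annotation holds (shared top words, top words that appear in only one topic).
    """
    diff, annotation = [], []
    for topic_a in model_a:
        words_a = top_words(topic_a, vocab, top_n)
        diff_row, annotation_row = [], []
        for topic_b in model_b:
            words_b = top_words(topic_b, vocab, top_n)
            diff_row.append(hellinger(topic_a, topic_b))
            annotation_row.append((sorted(words_a & words_b), sorted(words_a ^ words_b)))
        diff.append(diff_row)
        annotation.append(annotation_row)
    return diff, annotation


# Toy data (assumed for illustration): each row is one topic over the vocabulary.
vocab = ["cat", "dog", "stock", "bond"]
model_a = [[0.5, 0.4, 0.05, 0.05], [0.05, 0.05, 0.5, 0.4]]
model_b = [[0.05, 0.05, 0.45, 0.45], [0.45, 0.45, 0.05, 0.05]]

diff, annotation = topic_diff(model_a, model_b, vocab)
```

In this toy setup, topic 0 of the first model is close to topic 1 of the second one, so `diff[0][1]` is small and its annotation lists only shared words.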

Case 2: One-by-one difference between models in train process

Training the model takes a very long time, so I save the model to disk every N documents. The questions are "How do I know whether to continue training, or whether the model has already converged and there is no point in continuing?" and "How can I see what happens to the model during training?".

To solve this, I train an LDA model and dump it every N documents. I construct a matrix of shape (num_topics, models_count), ordered by training time, where matrix[topic_i][time_j] represents the difference for topic_i between the previous and the next model at training time time_j. This shows well what happens during training: we can see whether the model converges, and we can spot "trash topics" (topics that are constantly changing).

I also plot sum(diff_between_two_models) over this matrix and compare it with perplexity and coherence (u_mass).
As a result, I noticed that this approach works better than perplexity in anomalous situations (something is wrong with the model, but perplexity does not change), and that it correlates with coherence while being much faster and simpler.
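A rough sketch of the per-topic training matrix and its column sums, under stated assumptions: the checkpoints are given as plain lists of topic-word distributions (in the PR they would be models dumped to disk), `training_diff` is a hypothetical name, and Hellinger stands in for whichever distance is chosen. With M checkpoints the sketch produces M - 1 columns, one per consecutive pair.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2.0)


def training_diff(checkpoints):
    """matrix[topic][step] = distance of that topic between checkpoint step and step + 1."""
    num_topics = len(checkpoints[0])
    matrix = [[] for _ in range(num_topics)]
    for prev_model, next_model in zip(checkpoints, checkpoints[1:]):
        for topic_id in range(num_topics):
            matrix[topic_id].append(hellinger(prev_model[topic_id], next_model[topic_id]))
    return matrix


# Toy checkpoints (assumed): topic 0 settles down, topic 1 keeps changing ("trash topic").
checkpoints = [
    [[0.4, 0.3, 0.3], [0.8, 0.1, 0.1]],
    [[0.6, 0.2, 0.2], [0.1, 0.8, 0.1]],
    [[0.6, 0.2, 0.2], [0.1, 0.1, 0.8]],
]

matrix = training_diff(checkpoints)
# One scalar per training step: the summed change over all topics.
totals = [sum(column) for column in zip(*matrix)]
```

A row of `matrix` that drops to zero marks a topic that has stopped changing, while a row that stays large marks a topic that is still moving; `totals` is the single curve that can be plotted next to perplexity or coherence.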

You can see this solution in the current commits (warning: the plots may not be displayed on GitHub, so you should open the HTML version of the notebook).

I would like to see your suggestions and comments (@piskvorky and @tmylk)

P.S.

The next step is to work with the model code (BaseModels or something else) to collect the necessary data from models during training and calculate everything on the fly.

In addition, the plans include:

  • Deeper introspection of models (using external corpus like pyLDAvis or Termite)
  • Work with visualization (perhaps a web-application or something)

@tmylk
Contributor

tmylk commented Mar 28, 2017

btw there is a termite integration in https://github.com/baali/TopicModellingExperiments

@menshikh-iv
Contributor Author

@tmylk thank you

@@ -0,0 +1,80 @@
from random import sample
import numpy as np
Contributor

@tmylk tmylk May 8, 2017

Please add this to LDAModel or BaseTopicModel class

@tmylk
Contributor

tmylk commented May 10, 2017

Please add more explanatory text and split topic diff visualisation into two:

  1. Topic i vs topic j: the upper triangular matrix (without the diagonal). We want these to be different.

  2. Topic i vs topic i: just the diagonal. We want these to be the same.
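The split requested above could look roughly like this. It is a sketch only: `split_diff` is a hypothetical helper, and the 3 x 3 matrix is made-up data standing in for a square topic difference matrix in which topic i of one model corresponds to topic i of the other.

```python
def split_diff(diff):
    """Split a square topic-difference matrix into its two views.

    upper: {(i, j): value} for i < j, i.e. topic i vs a different topic j.
    diagonal: [value] for topic i vs topic i.
    """
    size = len(diff)
    upper = {(i, j): diff[i][j] for i in range(size) for j in range(i + 1, size)}
    diagonal = [diff[i][i] for i in range(size)]
    return upper, diagonal


# Toy difference matrix (assumed values).
diff = [
    [0.00, 0.80, 0.90],
    [0.80, 0.10, 0.70],
    [0.90, 0.70, 0.05],
]

upper, diagonal = split_diff(diff)
```

Large values in `upper` mean that distinct topics are well separated, and small values in `diagonal` mean that each topic stayed close to itself.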

@HarryBaker

I am working on a similar project that I think ties in with topic2topic_difference. What I am working on is validating that my LDA models are reproducible. That is, I want to prove that if I were to create an identical model under the same parameters, it would be almost identical to the original model. This is to show that my topics are not a result of chance, but that they are accurate representations of the training corpus. Could I apply the output of topic2topic_difference to this goal?

@menshikh-iv
Contributor Author

menshikh-iv commented May 12, 2017

@HarryBaker yeah, you can use topic2topic_difference for this purpose. You can choose a more suitable metric (for example distance="jaccard" or another).

But remember that topics can change places (for example t1, t2, t3 in the first model and t3, t1, t2 in the second).
In that case the difference between the topic-word matrices will be significant even though the topics may be identical.

If you fix the topic order, you will not have problems with this approach; otherwise you have to work with permutations (by topic) of the topic-word matrix.
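The reordering problem can be made concrete with a small sketch. The names `positional_distance` and `best_permutation` are hypothetical, the topics are toy distributions, and the exhaustive search over permutations is only practical for a handful of topics because its cost grows factorially.

```python
import math
from itertools import permutations


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2.0)


def positional_distance(model_a, model_b):
    """Sum of distances between topics that share the same index."""
    return sum(hellinger(a, b) for a, b in zip(model_a, model_b))


def best_permutation(model_a, model_b):
    """Order of model_b's topics that minimises the summed distance to model_a."""
    indices = range(len(model_b))
    return min(
        permutations(indices),
        key=lambda perm: sum(hellinger(model_a[i], model_b[perm[i]]) for i in indices),
    )


# Same three toy topics in both models, stored in a different order.
t1, t2, t3 = [0.7, 0.2, 0.1], [0.1, 0.7, 0.2], [0.2, 0.1, 0.7]
model_a = [t1, t2, t3]
model_b = [t3, t1, t2]

naive = positional_distance(model_a, model_b)        # looks like a big difference
perm = best_permutation(model_a, model_b)            # (1, 2, 0)
aligned = positional_distance(model_a, [model_b[i] for i in perm])  # 0.0
```

Comparing by index reports a large difference although the two models contain exactly the same topics; after reordering the second model the distance is zero.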

@HarryBaker

My plan was to design a script that would compare two models, and then try to match corresponding topics based on their similarity score. That way you would identify the most obvious matches first, and then calculate the degree of dis-similarity of the more junk topics. If the models are actually similar I should expect to see that there would not be as many dissimilar junk topics.
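One way to read this plan is a greedy one-to-one matching: sort all topic pairs by distance, accept the closest pair whose topics are both still free, and repeat. The sketch below assumes topics are represented by sets of top words and compared with Jaccard distance; `greedy_match` and the example word sets are illustrative, not code from the PR or from the fork.

```python
def jaccard_distance(words_a, words_b):
    """1 - |intersection| / |union| of two word sets (0 = same words)."""
    union = words_a | words_b
    if not union:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(union)


def greedy_match(topics_a, topics_b):
    """Match topics one-to-one, most similar pairs first.

    Returns a list of (index_in_a, index_in_b, distance).
    """
    pairs = sorted(
        (jaccard_distance(a, b), i, j)
        for i, a in enumerate(topics_a)
        for j, b in enumerate(topics_b)
    )
    used_a, used_b, matches = set(), set(), []
    for distance, i, j in pairs:
        if i not in used_a and j not in used_b:
            matches.append((i, j, distance))
            used_a.add(i)
            used_b.add(j)
    return matches


# Toy top-word sets (assumed): two clean topics and one junk topic per model.
topics_a = [{"tumor", "breast", "cancer"}, {"pregnancy", "birth", "woman"}, {"the", "of", "and"}]
topics_b = [{"birth", "pregnancy", "woman"}, {"patient", "study", "data"}, {"cancer", "tumor", "breast"}]

matches = greedy_match(topics_a, topics_b)
```

The obvious matches come out first with distance 0, and the leftover junk topics are paired last with a large distance, which is the quantity to inspect when judging how similar the two models are.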

@menshikh-iv
Contributor Author

@HarryBaker Nice, I will write this method for LdaModel this week.

@HarryBaker

I'm going to work on it over the next few days. Can I send in a pull request if I make significant progress?

@menshikh-iv
Contributor Author

@HarryBaker yes

@HarryBaker

Ok, I have a few questions about topic2topic_difference().

In line 61 you have the chained assignment:

        z[topic1][topic2] = z[topic2][topic1] = distance_func(d1[topic1], d2[topic2])

I'm having trouble understanding why you assign z[topic2][topic1] = distance_func(d1[topic1], d2[topic2]). Given two distinct LDA models, distance_func(d1[topic1], d2[topic2]) would be different from distance_func(d1[topic2], d2[topic1]). That is, topic 4 in model 1 might be identical to topic 6 in model 2, but that does not mean that topic 6 in model 1 is identical to topic 4 in model 2.

Is this what you meant by "topics can change places"?
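The concern can be checked on toy data. In the sketch below, `cross_model_distances` is a hypothetical helper that fills every cell independently, and `d1` and `d2` are made-up topic-word matrices for two distinct models. The mirrored assignment is only safe when both arguments are the same model and the distance function itself is symmetric.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2.0)


def cross_model_distances(d1, d2, distance_func):
    """z[i][j] = distance between topic i of d1 and topic j of d2, no mirroring."""
    return [[distance_func(topic1, topic2) for topic2 in d2] for topic1 in d1]


# Two distinct toy models over a two-word vocabulary (assumed values).
d1 = [[0.9, 0.1], [0.5, 0.5]]
d2 = [[0.1, 0.9], [0.9, 0.1]]

z = cross_model_distances(d1, d2, hellinger)

# z[0][1] compares d1[0] with d2[1], which are identical here.
# z[1][0] compares d1[1] with d2[0], which are not.
same_topic = z[0][1]
different_topic = z[1][0]
```

Here `z[0][1]` is zero while `z[1][0]` is not, so writing one value into both cells would overwrite a real difference.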

@tmylk
Contributor

tmylk commented May 15, 2017

@HarryBaker In the next version of the code only the upper triangle (topic1 > topic2) will be shown to avoid this confusion.

@HarryBaker

I might just be misunderstanding what this code is intended for. I think I'm using it for a slightly different purpose than what it was designed for, because I'm using it to compare 2 completely distinct LDA models. My goal is to create N duplicate LDA models under the same parameters, and then use topic2topic_difference to show how similar they are. I want to prove that the LDA models I produce are similar, and thus reproducible. I'm working in biomedical research, so quantitatively proving reproducibility is very important.

However, from what I understand it sounds like this code was intended to compare the same topics of a single LDA model across different iterations of training. Am I correct?

@menshikh-iv
Contributor Author

@HarryBaker Yes, I work with models from different iterations

@HarryBaker

Would you consider adding a method that is intended to compare two distinct models? I think it would be very helpful for certain projects. It would allow you both to validate models (as I am currently doing), as well as compare similar models that are not identical. For instance, in a previous project I was studying multiple models created from the same dataset, but over different periods of time. A method to compare two distinct models would have been helpful to match topics over time periods. I was working on something similar, but your code is much simpler and more compact.

@tmylk
Contributor

tmylk commented May 16, 2017

@HarryBaker If you wish to write a new method to compare two models trained on exactly the same data but with a different random seed that would be welcome. Please create another issue for that. However imho it will need some kind of alignment, say on top 10 words, to suggest that topic 5 became topic 10 with another random initialization.

@HarryBaker

Ok, I will do that.

What do you mean by alignment? I have been using Jaccard distance between the top 15 words (using the code in model_difference.py) and have been getting good results. It's very similar to the results I've gotten using KL divergence, but Jaccard works slightly better.

@tmylk
Contributor

tmylk commented May 16, 2017

@HarryBaker replied in #1328

@menshikh-iv
Contributor Author

@HarryBaker from my experience, Jaccard is more stable and robust (unlike KL or Hellinger). But Jaccard is not sensitive enough for some tasks.
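A small illustration of the sensitivity trade-off, as a sketch with made-up numbers: the two toy topics below share the same top two words but weight them very differently. `top_ids` is a hypothetical helper, and top_n = 2 is chosen only to keep the example short.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2.0)


def top_ids(topic, top_n):
    """Indices of the top_n most probable words of a topic."""
    ranked = sorted(range(len(topic)), key=lambda i: topic[i], reverse=True)
    return set(ranked[:top_n])


def jaccard_distance(ids_a, ids_b):
    """1 - |intersection| / |union| of two index sets."""
    union = ids_a | ids_b
    if not union:
        return 0.0
    return 1.0 - len(ids_a & ids_b) / len(union)


# Same top-2 words, different probability mass (assumed values).
topic_a = [0.60, 0.30, 0.05, 0.05]
topic_b = [0.35, 0.35, 0.15, 0.15]

jaccard = jaccard_distance(top_ids(topic_a, 2), top_ids(topic_b, 2))
hell = hellinger(topic_a, topic_b)
```

Jaccard on the top words reports no difference at all, while Hellinger still registers the shift in probabilities. That is the sense in which a set-based distance is robust but can miss changes.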

@HarryBaker

I agree. The big issue I am trying to eliminate is word chaining in topics, where two distinct groups of words are assigned to the same topic because they have one word in common. For instance, in the corpus I'm studying I've noticed that words about breast cancer and words about pregnancy are often assigned to the same topic, because they share "woman" as a word. KL isn't appropriate for comparing topics here because highly probable words might get wrongfully chained into a topic. Jaccard does a much better job for this specific task.

@menshikh-iv
Contributor Author

@HarryBaker Could you try the new code from the 'develop' branch, PR 1334?

@HarryBaker

I think that my application of your code is different than how you intended it. I have a modified version of your code in my fork. Is your code used during the training of an LDA model?

@menshikh-iv
Contributor Author

@HarryBaker

Is your code used during the training of an LDA model?

If you mean the second case (One-by-one difference between models in train process), that will be the next step, coming soon (:

I think that my application of your code is different than how you intended it

Check the current version; I rewrote the code a bit to implement some of your wishes.

@tmylk
Contributor

tmylk commented May 23, 2017

@menshikh-iv why is it still empty in http://nbviewer.jupyter.org/github/menshikh-iv/gensim/blob/de1c667a9702fddeef166c8ff6b8c14cb4206cdc/docs/notebooks/model_difference.ipynb ?

Could you please add a png image to the notebook AND also link to the HTML version of the latest notebook? so that people can see the awesome viz.

Then will merge

@HarryBaker

Yup, those are the changes that would have been necessary for me.

I'm not sure if it helps you, but in my fork (which I link to here: #1328) I wrote some functions to match topics between two models and then find the average similarity. It gives the option to enforce a bijection between topics as well.

@piskvorky
Owner

This sounds like an awesome feature! What's the status here?

@menshikh-iv
Contributor Author

The first case is now in 2.2.0 (#1374 and #1334); the second case will be part of #1396 and #1399.

4 participants