[WIP][DNM] Visualize topic model difference (need feedback) #1243
btw there is a termite integration in https://github.com/baali/TopicModellingExperiments
@tmylk thank you
@@ -0,0 +1,80 @@
from random import sample
import numpy as np
Please add this to LDAModel or BaseTopicModel class
Please add more explanatory text and split topic diff visualisation into two:
I am working on a similar project that I think ties in with topic2topic_difference. What I am working on is validating that my LDA models are reproducible. That is, I want to prove that if I were to create an identical model under the same parameters, it would be almost identical to the original model. This is to show that my topics are not a result of chance, but that they are accurate representations of the training corpus. Could I apply the output of topic2topic_difference to this goal?
@HarryBaker yeah, you can use topic2topic_difference for this purpose. You can choose a more suitable metric (for example But remember that topics can change places (for example in the first model If you fix the topic order, you will not have problems with this approach; otherwise, you should work with permutations (by topics) of the topic-word matrix.
My plan was to design a script that would compare two models, and then try to match corresponding topics based on their similarity score. That way you would identify the most obvious matches first, and then calculate the degree of dissimilarity of the more junk topics. If the models are actually similar, I should expect to see fewer dissimilar junk topics.
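The matching plan described above (take the most obvious matches first, then look at what is left) could be sketched roughly as follows. This is only an illustration, not code from the PR: the function name `greedy_match` and the toy distance values are hypothetical, and the input is assumed to be a precomputed topic-by-topic distance matrix between two models.

```python
def greedy_match(distances):
    """Greedily pair topics of model 1 (rows) with topics of model 2 (columns).

    distances[i][j] is the distance between topic i of model 1 and
    topic j of model 2. Pairs are taken in order of increasing distance,
    and every topic is used at most once, so the result is one-to-one.
    """
    candidates = sorted(
        (dist, i, j)
        for i, row in enumerate(distances)
        for j, dist in enumerate(row)
    )
    used_rows, used_cols, matches = set(), set(), []
    for dist, i, j in candidates:
        if i not in used_rows and j not in used_cols:
            matches.append((i, j, dist))
            used_rows.add(i)
            used_cols.add(j)
    return matches


# Hypothetical distances between 3 topics of model 1 and 3 topics of model 2.
distances = [
    [0.9, 0.1, 0.8],
    [0.2, 0.7, 0.9],
    [0.8, 0.9, 0.6],
]
matches = greedy_match(distances)
# Average distance over matched pairs: lower means more similar models.
mean_distance = sum(dist for _, _, dist in matches) / len(matches)
```

The pairs matched last carry the largest distances, so they are the natural candidates for "junk" topics. A greedy pass is not guaranteed to minimize the total distance; an optimal assignment algorithm would be the alternative if that matters.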
@HarryBaker That's nice, I will write this method for LdaModel this week
I'm going to work on it over the next few days. Can I send in a pull request if I make significant progress?
@HarryBaker yes
Ok, I have a few questions about topic2topic_difference(). In line 61 you have the chained assignment:
I'm having trouble understanding why you assign z[topic2][topic1] = distance_func(d1[topic1], d2[topic2]). Given two distinct LDA models, distance_func(d1[topic1], d2[topic2]) would be different from distance_func(d1[topic2], d2[topic1]). That is, topic 4 in model 1 might be identical to topic 6 in model 2, but that does not mean that topic 6 in model 1 is identical to topic 4 in model 2. Is this what you meant by "topics can change places"?
@HarryBaker In the next version of the code only the upper triangle
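The concern raised above can be checked on toy data. The sketch below is illustrative only: the two "models" are hypothetical lists of topic-word distributions over a shared three-word vocabulary, and Hellinger distance stands in for distance_func. Even with a symmetric distance function, the cross-model matrix z is generally not symmetric, so mirroring one triangle into the other is only valid when both axes index the same set of topics.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two distributions over the same vocabulary."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)


# Hypothetical topic-word distributions (rows are topics).
model1 = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
model2 = [[0.1, 0.1, 0.8], [0.3, 0.4, 0.3]]

# z[i][j] compares topic i of model 1 with topic j of model 2.
z = [[hellinger(t1, t2) for t2 in model2] for t1 in model1]

# Topic 1 of model 1 is identical to topic 0 of model 2 ...
same = z[1][0]
# ... but topic 0 of model 1 is not identical to topic 1 of model 2.
different = z[0][1]
```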
I might just be misunderstanding what this code is intended for. I think I'm using it for a slightly different purpose than what it was designed for, because I'm using it to compare 2 completely distinct LDA models. My goal is to create N duplicate LDA models under the same parameters, and then use topic2topic_difference to show how similar they are. My goal is to prove that the LDA models I produce are similar, and are thus reproducible. I'm working in biomedical research, so quantitatively proving reproducibility is very important. However, from what I understand it sounds like this code was intended to compare the same topics of a single LDA model across different iterations of training. Am I correct?
@HarryBaker Yes, I work with models from different iterations
Would you consider adding a method that is intended to compare two distinct models? I think it would be very helpful for certain projects. It would allow you both to validate models (as I am currently doing), as well as compare similar models that are not identical. For instance, in a previous project I was studying multiple models created from the same dataset, but over different periods of time. A method to compare two distinct models would have been helpful to match topics over time periods. I was working on something similar, but your code is much simpler and more compact.
@HarryBaker If you wish to write a new method to compare two models trained on exactly the same data but with a different random seed that would be welcome. Please create another issue for that. However imho it will need some kind of alignment, say on top 10 words, to suggest that topic 5 became topic 10 with another random initialization.
Ok, I will do that. What do you mean by alignment? I have been using Jaccard distance between the top 15 words (using the code in model_difference.py) and have been getting good results. It's very similar to the results I've gotten using KL divergence, but Jaccard works slightly better.
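For reference, Jaccard distance between the top words of two topics can be sketched as below. This is not the code from model_difference.py: the helper names and the toy word probabilities are made up for illustration, and a topic is assumed to be a dict mapping words to probabilities.

```python
def top_words(topic, n):
    """Return the set of the n most probable words of a topic (dict: word -> probability)."""
    ranked = sorted(topic.items(), key=lambda item: item[1], reverse=True)
    return {word for word, _ in ranked[:n]}


def jaccard_distance(words_a, words_b):
    """1 - |A & B| / |A | B|; 0.0 means identical word sets, 1.0 means disjoint."""
    union = words_a | words_b
    if not union:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(union)


# Hypothetical topics from two different models.
topic_a = {"woman": 0.30, "cancer": 0.25, "breast": 0.20, "tumor": 0.15, "therapy": 0.10}
topic_b = {"woman": 0.30, "pregnancy": 0.25, "birth": 0.20, "cancer": 0.15, "infant": 0.10}

# With n=3 the two topics share only "woman" out of five distinct words.
distance = jaccard_distance(top_words(topic_a, 3), top_words(topic_b, 3))
```

Because only set membership counts, the result ignores how probable each word is, which makes the measure stable but coarse.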
@HarryBaker replied in #1328
@HarryBaker from my experience, Jaccard is more stable and robust (unlike KL or Hellinger). But Jaccard is not sensitive enough for some tasks.
I agree. The big issue I am trying to eliminate is word chaining in topics, where two distinct groups of words are assigned to the same topic because they have one word in common. For instance, in the corpus I'm studying I've noticed that words about breast cancer and words about pregnancy are often assigned to the same topic, because they share "woman" as a word. KL isn't appropriate for comparing topics here because highly probable words might get wrongfully chained into a topic. Jaccard does a much better job for this specific task.
@HarryBaker Could you try new code from 'develop' branch PR 1334?
I think that my application of your code is different than how you intended it. I have a modified version of your code in my fork. Is your code used during the training of an LDA model?
If you mean the second case (
Check the current version; I rewrote the code a bit to accommodate some of your requests
@menshikh-iv why is it still empty in http://nbviewer.jupyter.org/github/menshikh-iv/gensim/blob/de1c667a9702fddeef166c8ff6b8c14cb4206cdc/docs/notebooks/model_difference.ipynb ? Could you please add a png image to the notebook AND also link to the HTML version of the latest notebook, so that people can see the awesome viz? Then I will merge
Yup, those are the changes that would have been necessary for me. I'm not sure if it helps you, but in my fork (which I link to here: #1328) I wrote some functions to match topics between two models, and then find the average similarity. It gives the option to enforce a bijection between topics as well.
This sounds like an awesome feature! What's the status here?
I think gensim does not yet offer enough for visualizing models. This motivated me to start working in this direction.
I see two important cases in this field (given a dictionary and a topic matrix, for any topic model):
Case 1: Difference between two models
I have two topic models and the question is "What is the difference between them?". I treat a model as a matrix Theta (Topic x Dictionary). We need to calculate how "similar" the two models are, and to decide what "similarity" and "difference" mean. The difference between models is well described by the difference between their topics. Using this idea, I construct a topic X topic matrix for the two models, where matrix[topic_i][topic_j] describes the difference between those two topics. For this purpose, I used "distance functions" such as KL, Hellinger, and Jaccard. To annotate matrix[topic_i][topic_j], I used the intersection and the symmetric difference between the top_n words of each topic.
This approach allows you to see how different the models are. It also shows the specific words for every topic pair.
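A minimal sketch of Case 1, under stated assumptions: each model is a list of topic rows over a shared vocabulary, every probability is strictly positive (so KL divergence is defined without smoothing), and the names `model_diff`, `theta1`, `theta2` and the toy values are hypothetical rather than taken from the PR.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two strictly positive distributions over the same vocabulary."""
    return sum(a * math.log(a / b) for a, b in zip(p, q))


def top_n_words(row, vocab, n):
    """Set of the n most probable words in one topic row."""
    order = sorted(range(len(row)), key=lambda k: row[k], reverse=True)
    return {vocab[k] for k in order[:n]}


def model_diff(theta1, theta2, vocab, top_n=2):
    """Return (diff, annotation), both indexed as [topic_i][topic_j].

    diff holds the distance between topic_i of model 1 and topic_j of model 2;
    annotation holds (shared top words, top words unique to either topic).
    """
    diff, annotation = [], []
    for row1 in theta1:
        words1 = top_n_words(row1, vocab, top_n)
        diff_row, annotation_row = [], []
        for row2 in theta2:
            words2 = top_n_words(row2, vocab, top_n)
            diff_row.append(kl_divergence(row1, row2))
            annotation_row.append((sorted(words1 & words2), sorted(words1 ^ words2)))
        diff.append(diff_row)
        annotation.append(annotation_row)
    return diff, annotation


# Toy data: model 2 contains the same two topics as model 1, in swapped order.
vocab = ["cancer", "tumor", "birth", "infant"]
theta1 = [[0.5, 0.3, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4]]
theta2 = [[0.1, 0.1, 0.5, 0.3], [0.5, 0.3, 0.1, 0.1]]
diff, annotation = model_diff(theta1, theta2, vocab)
```

In this toy example the zero entry sits off the diagonal (diff[0][1]), which is what swapped topic order looks like in the matrix.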
Case 2: One-by-one difference between models during training
Training the model takes a very long time. I save the model to disk every N documents. The questions are "How do I know whether to continue training, or whether the model has already converged and there is no point in continuing?" and "How can I see what happens to the model during training?".
To answer this, I train an LDA model and dump it every N documents. I construct a matrix of shape (num_topics, models_count), ordered by training time, where matrix[topic_i][time_j] represents the difference for topic_i between the previous and next model at training time time_j. This shows well what happens during training. We can see whether the model converges, and we can spot "trash topics" (topics that are constantly changing).
Also, I plot sum(diff_between_two_models) of this matrix and compare it with perplexity and coherence (u_mass).
As a result, I noticed that this approach works better than perplexity in anomalous situations (something is wrong with the model, but perplexity does not change), and that it correlates with coherence while being much faster and simpler.
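Case 2 can be sketched in the same spirit. The snapshots below are hypothetical hand-written topic-word matrices standing in for models dumped every N documents, Hellinger distance is just one possible choice, and the function name `training_diff` is made up for this example. Note that models_count snapshots give models_count - 1 consecutive comparisons, so the matrix here has one column fewer than the number of snapshots.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two distributions over the same vocabulary."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)


def training_diff(snapshots, distance=hellinger):
    """matrix[topic_i][time_j]: distance for topic_i between snapshots j and j + 1."""
    num_topics = len(snapshots[0])
    return [
        [distance(prev[t], nxt[t]) for prev, nxt in zip(snapshots, snapshots[1:])]
        for t in range(num_topics)
    ]


# Four hypothetical snapshots of a 2-topic model over a 4-word vocabulary.
snapshots = [
    [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]],
    [[0.6, 0.2, 0.1, 0.1], [0.1, 0.4, 0.4, 0.1]],
    [[0.7, 0.1, 0.1, 0.1], [0.4, 0.1, 0.1, 0.4]],
    [[0.7, 0.1, 0.1, 0.1], [0.1, 0.4, 0.1, 0.4]],
]
matrix = training_diff(snapshots)

# One number per training step: the summed change over all topics.
totals = [sum(column) for column in zip(*matrix)]
```

Here topic 0 stops changing after the third snapshot, while topic 1 keeps moving, which is the pattern described above for a "trash topic". A `totals` curve that flattens near zero would suggest convergence.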
You can see this solution in the current commits (warning: the plot may not be displayed on GitHub, so you should open the HTML version of the notebook).
I would like to see your suggestions and comments (@piskvorky and @tmylk)
P.S.
The next step is to work with the code of the models (BaseModels or something else) to collect the necessary data from models during training and calculate things on the fly.
In addition, the plans include: