conversion function naming #1270
In this context, a corpus is a list/iterator/generator of tuples in bag-of-words format. There is more context in the tutorial you linked. What is the context of this conversion? |
I want to use the word2vec though ;) The context is that I'm trying to teach my students about word2vec using gensim and we have only used the sklearn representation so far. I think I got the representation but I'm still confused by the naming. So in |
The input to word2vec is not a corpus aka a list of tuples, but an iterable of lists of words - sentences. |
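To make the contrast concrete, here is a minimal sketch of the two input formats being discussed; the toy texts are invented for the example:

```python
from gensim import corpora

texts = [["human", "computer", "interaction"],
         ["graph", "trees", "graph"]]

# word2vec-style input: an iterable of token lists ("sentences")
sentences = texts

# gensim "corpus" input: each document is a list of (token_id, count) tuples
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
print(corpus)  # e.g. [[(0, 1), (1, 1), (2, 1)], [(3, 2), (4, 1)]]
```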
Actually the simplest gensim-sklearn word2vec integration code is in the shorttext package |
I only want to transform, not train, so then the interface is word-based, right? |
thanks for the hint for shorttext. That doesn't have paragraph2vec, though, right? Btw, is there a pretrained model for that? |
Not aware of a large pre-trained doc2vec model. This week there will be a small trained doc2vec model with TensorBoard viz in this PR by @parulsethi |
@tmylk awesome, thanks! |
pretrained doc2vec here: https://github.com/jhlau/doc2vec though unclear if that's applicable to other domains. |
somewhat unrelated, have you thought about including the feature of using a pretrained word model for the doc2vec as done here jhlau@9dc0f79 ? |
Initializing word vectors from pre-trained vectors is possible to do manually in the main branch, without that fork. Though it's debated on the mailing list by @gojomo whether that's helpful or not. |
hm upgraded to 1.0.1 |
That is strange
|
ah, it's probably because I load the model?

    from gensim import models
    w = models.KeyedVectors.load_word2vec_format(
        '../GoogleNews-vectors-negative300.bin', binary=True) |
that is not the model, that is just the vectors from the model :) You cannot train them, only run read-only queries with them. |
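As a sketch of that read-only usage (same vectors file as above; the words queried are just examples):

```python
from gensim.models import KeyedVectors

# Only the vectors are loaded here - no trainable model state comes with them.
wv = KeyedVectors.load_word2vec_format(
    '../GoogleNews-vectors-negative300.bin', binary=True)

# Read-only queries work fine:
print(wv['king'][:5])                    # raw 300-d vector (first 5 dims)
print(wv.most_similar('king', topn=3))   # nearest neighbours by cosine similarity

# ...but there is no train() to call: continued training needs a full Word2Vec model.
```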
Looking at this issue history, I see @amueller's comments that seem to be reacting to @tmylk's answers... but no @tmylk comments at all. Some GitHub bug? If you're loading directly into a KeyedVectors, no need to access the [...]. There's experimental support for merging some pretrained word-vectors into a pre-discovered vocabulary, in [...] |
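The truncated sentence above presumably refers to gensim's experimental intersect_word2vec_format(); a rough sketch of that flow, with an invented toy corpus (the exact signature and location of the method vary across gensim versions):

```python
from gensim.models import Word2Vec

# Toy corpus, invented for the example; any iterable of token lists would do.
sentences = [["teaching", "word2vec", "with", "gensim"],
             ["students", "know", "the", "sklearn", "representation"]]

model = Word2Vec(size=300, min_count=1)   # `size` in gensim 1.x/2.x era
model.build_vocab(sentences)              # discover the local vocabulary first

# Merge pre-trained vectors, but only for words already in the discovered vocab.
# lockf=1.0 lets the imported vectors keep training; lockf=0.0 freezes them.
model.intersect_word2vec_format(
    '../GoogleNews-vectors-negative300.bin', binary=True, lockf=1.0)

model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)
```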
We've done just that. It's all documented in this paper: https://arxiv.org/abs/1607.05368. Long story short, pre-trained word embeddings help most when you are training doc2vec on a small document collection (e.g. a special domain of text). |
I can also see only @amueller's side of the conversation.
The confusion comes from the fact that both scipy and gensim have been calling their data structure "sparse", for almost a decade now... :( In scipy, it denotes a sparse matrix in CSR / CSC / whatever; in gensim it's anything that you can iterate over, yielding iterables of (id, value) 2-tuples. Maybe call it "gensim-sparse" vs "scipy-sparse"? I'm also +1 on renaming the generic gensim structure to something else entirely. "Sparse" is taken (scipy). "Corpus" is taken (NLP). Any other ideas? |
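To make the two senses concrete, a tiny illustration using gensim.matutils (the vector is invented for the example):

```python
import numpy as np
from scipy import sparse
from gensim import matutils

dense = np.array([0.0, 3.0, 0.0, 1.0])

# "scipy-sparse": an actual sparse matrix object (CSR here)
scipy_sparse = sparse.csr_matrix(dense)

# "gensim-sparse": just a sequence of (index, value) 2-tuples for one document
gensim_sparse = matutils.full2sparse(dense)
print(gensim_sparse)          # e.g. [(1, 3.0), (3, 1.0)]

# Sparse2Corpus wraps a scipy matrix as a streamed gensim corpus
# (documents are columns by default, hence the transpose for a row-vector document)
corpus = matutils.Sparse2Corpus(scipy_sparse.T)
print(list(corpus))           # e.g. [[(1, 3.0), (3, 1.0)]]
```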
@jhlau Thanks for your comment & analysis - but I found some of the parameter-choices and evaluations/explanations in your paper confusing, to the point of not being convinced of that conclusion. Some of my observations are in the gensim forum messages at https://groups.google.com/d/msg/gensim/MYbZBkM5KKA/lBKGf7WNDwAJ and https://groups.google.com/d/msg/gensim/MYbZBkM5KKA/j5OKViKzEgAJ. As an example, the claim in section 5 – "More importantly, using pre-trained word embeddings never harms the performance" – seems too strong; more on that below. |
Not sure what you are confused about, but looking at your comments on the links:
That seems to correspond to my understanding of doc2vec. What we found is that pure PV-DBOW ('dm=0, dbow_words=0') is pretty bad. PV-DBOW with skip-gram word training ('dm=0, dbow_words=1') is generally the best option, and PV-DM ('dm=1') at best performs on par with PV-DBOW, but is often slightly worse and requires more training iterations (since its parameter count is much larger). Feel free to ask any specific questions that you feel are not clear. I wasn't aware of any of these discussions as no one had tagged me.
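For readers following along, those settings map onto gensim's Doc2Vec parameters roughly like this (tiny invented corpus just to make the snippet runnable):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["some", "example", "tokens"], tags=[0]),
        TaggedDocument(words=["a", "second", "tiny", "document"], tags=[1])]

# Pure PV-DBOW: doc-vectors only, no word-vector training (reported above as "pretty bad")
pure_dbow = Doc2Vec(docs, dm=0, dbow_words=0, min_count=1)

# PV-DBOW with interleaved skip-gram word training (reported above as generally best)
dbow_plus_sg = Doc2Vec(docs, dm=0, dbow_words=1, min_count=1)

# PV-DM, gensim's default mode (dm=1), here with context-vector averaging
pv_dm = Doc2Vec(docs, dm=1, min_count=1)
```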
This function does not really work, as it uses pre-trained embeddings only for words that are in the model. The forked version of gensim that I've built, on the other hand, also loads new word embeddings. That is the key difference.
On section 5, table 6, what we really meant is that adding pre-trained word vectors doesn't harm performance substantially. Overall, we see that using pre-trained embeddings is generally beneficial for small training collections, and in the worst case it gives similar performance, so there's little reason not to do it. |
Nice thread hijacking! 😆 Perhaps mailing list better? |
I take all the blame for mixing about 10 issues into one.
exactly, that was confusing for me. |
Can you give a reference for that - even how that works? That's not described in the original paper, right? [Sorry for hijack-continuation, I'm already on too many mailing lists. Maybe a separate issue?] |
@piskvorky - Could discuss on gensim list if @jhlau would also like that forum, but keeping full context here for now. @jhlau - For background, I am the original implementor of the
I didn't see any specific measurements in the paper about pure PV-DBOW – am I misreading something? (There, as here, I only see statements to the effect of, "we tried it but it was pretty bad".) As mentioned in my 2nd-referenced-message, comparing pure PV-DBOW with arguments like From the paper's description & your posted code, it appears all
Yes, but if someone is only computing doc-vectors over a current corpus C, and will be doing further training over just examples from current corpus C, and further inference just using documents from corpus C, why would any words that never appear in C be of any value? Sure, earlier larger corpus P may have pre-trained lots of other words. But any training/inference on C will never update or even consult those slots in the vector array, so why load them? Now, there might be some vague intuition that bringing in such words could help later, when you start presenting new documents for inference, say from some new set D, that have words that are outside the vocabulary of C, but were in P. But there are problems with this hope:
These subtle issues are why I'm wary of a superficially-simple API to "bring in pretrained embeddings". That would make that step seem like an easy win, when I don't yet consider the evidence for that (including your paper) to be strong. And it introduces tradeoffs and unintuitive behaviors with regard to the P-but-not-C vocabulary words, and the handling of D examples with such words. I see the limits and lock-options of
The benefits in that table generally look small to me, and I suspect they'd be even smaller with the fairer training-time comparison I suggest above. But "never harms" (with italicized emphasis!) was an unsupportable word choice if in fact you really meant 'substantially', and the adjacent data table provides actual examples where pre-trained embeddings harmed the evaluation score. Such a mismatch also lowers my confidence in all nearby claims. |
The original Paragraph Vectors paper only describes that PV-DBOW mode: the doc-vector-in-training, alone, is optimized to predict each word in turn. It's not averaged with any word-vectors, nor does the paper explicitly describe training word-vectors at the same time – though it's a naturally composable approach, given how analogous PV-DBOW is with skip-gram words, with the PV-DBOW doc-vector being like a magic pseudo-word that, within one text example, has an 'infinite' effective window, floating into every context. That 'floating word' is indeed how Mikolov's small patch to word2vec.c, adding a sentence-vectors training option, worked. The followup paper, "Document Embeddings with Paragraph Vector" (https://arxiv.org/abs/1507.07998), seems to share my interpretation, because it observes that word-vector training was an extra option they chose (section 3, paragraph 2):
However, in the only places this paper compares "PV w/out word-training" against PV-with-word-training (figures 4 and 5), the without-word-training variant is very similar in evaluation score, and even better at 1 out of 4 comparison points (lower dimensionality in figure 4). And I suspect the same conjecture I've made about @jhlau's results applies here: using some/all of the time saved from not-training-words to do more iterations of pure-DBOW would be a fairer comparison and would further improve plain PV-DBOW's relative performance. |
Indeed. Its performance is far worse than PV-DBOW with SG, so we omitted them entirely.
I disagree that is a fairer comparison. What would be a fairer comparison, though, is extracting the most optimal performance from both methods. If PV-DBOW without SG takes longer to converge to optimal performance, then yes, I agree that one should train it more (but not by arbitrarily setting some 'standardised' epoch number). I did the same when comparing with PV-DM - it uses many more training epochs, but the key point is finding its best performance. I might go back and run PV-DBOW without SG to check if this is the case.
The intention was to check the original paragraph vector, so yes, I only experimented with the dm_concat=1 option. In terms of observations, we found what you've seen: that the increased number of parameters is hardly worth it.
Not quite, because often there is a vocab filter for low-frequency words. A word might have been filtered out due to this frequency threshold and excluded from the dataset, but it could be included back again when you import it from a larger pre-trained word embedding model.
That wasn't quite the intention behind including the new vocab, for all the reasons you pointed out below.
Fair point. The wording might have been a little strong, but I stand by what I said previously, and the key point is to take a step back and look at the bigger picture. Ultimately the interpretation is up to the users - they can make the choice whether they want to incorporate pre-trained embeddings or not. |
My concern is that without seeing the numbers, & knowing what parameters were tested, it's hard to use this observation to guide future work.
Sure, never mind any default epoch-counts (or epoch-ratios). The conjecture is that even though PV-DBOW-without-SG may benefit from more epochs, these are so much faster (perhaps ~15X in the If you get a chance to test that, in a comparable set-up to the published results, I'd love to see the numbers and it'd give me far more confidence in any conclusions. (Similarly, the paper's reporting of 'optimal' parameters in Table 4, §3.3, and footnotes 9 & 11 would be far more informative if it also reported the full range of alternative values tried, and in what combinations.)
I understand that choice. But given the dubiousness of the original paper's PV-DM-with-concatenation results, comparative info about the gensim default PV-DM-with-averaging mode could be more valuable. That mode might be competitive with PV-DBOW, especially on large datasets. So if you're ever thinking of a followup paper...
I see. That's an interesting subset of the combined vocabulary – but raises the same concerns about vector-quality-vs-frequency as come into play in picking a |
@gojomo Ah, in the original paper I thought they implemented Figure 2, but they actually implemented Figure 3 (I only skimmed). |
@amueller - I'd describe it as, figure-2 is PV-DM (available in gensim as |
Closing as resolved open-ended discussion. |
Hey. I'm trying to go from the CSR format used in scikit-learn to the gensim format and I'm a bit confused.
There are some instructions here:
https://radimrehurek.com/gensim/tut1.html#compatibility-with-numpy-and-scipy
But the naming seems odd. Why is "corpus to CSC" the inverse of "sparse to corpus"?
Looking at the helper functions here is even more confusing imo.
Does "corpus" mean an iterator over lists of tuples or what is the interface here?
There are some other functions like [...] and full2sparse. In this context "sparse" means a sequence of 2-tuples, while in "Sparse2Corpus" the "sparse" means "scipy sparse matrix". Is it possible to explain what "sparse", "scipy", "dense" and "corpus" mean in all these functions? It seems to me like there is no consistent convention.
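For what it's worth, a sketch of the round-trip that seems to be asked about here, using gensim.matutils (the matrix is random toy data; the documents_columns argument is the part that is easy to get backwards):

```python
from scipy import sparse
from gensim import matutils

# A scikit-learn-style document-term matrix: rows = documents, columns = terms.
X = sparse.random(5, 10, density=0.3, format='csr')

# scipy CSR -> gensim corpus. gensim treats documents as columns by default,
# so tell Sparse2Corpus that our documents are rows instead.
corpus = matutils.Sparse2Corpus(X, documents_columns=False)

# gensim corpus -> scipy CSC. Documents come back as columns (terms x docs),
# so transpose to return to the scikit-learn orientation.
X_back = matutils.corpus2csc(corpus, num_terms=10).T
print(X_back.shape)  # (5, 10)
```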