Add `evaluate_word_analogies` (will replace `accuracy`) method for gensim.models.KeyedVectors #1935
The `accuracy` function evaluates the performance of word2vec models on the analogy task. The `restrict_vocab` parameter defines which part of the model vocabulary is used for evaluation. The previous default was the 30 000 most frequent words (analogy questions containing words beyond this threshold are simply skipped). It does make sense to use some kind of limit here, as the evaluation running time depends on the size of the vocabulary used.
However, 30 000 is a very small value, given that typical models nowadays feature hundreds of thousands or even millions of words in their vocabularies. This leads to unrealistic evaluation scores, calculated only on small parts of a test set and a model.
Therefore, I suggest increasing the default value of `restrict_vocab` 10-fold, to 300 000. This is more in line with the typical vocabulary size of contemporary word embedding models, and is also consistent with the default `restrict_vocab` value for the `evaluate_word_pairs` function.
Note that although the original C word2vec does mention 30 000 as a good threshold value for analogies evaluation, the default behavior of its `compute-accuracy` executable is still to use no threshold (i.e., to evaluate on the whole vocabulary).
Hello @akutuzov, why is this important (I mean, why should we change the default value if a user can specify |
Well, backwards compatibility would not be broken: only the evaluation scores would be different (but much more realistic in most cases). Of course, this should be highlighted in the changelog. Changing the default value is important because otherwise people get over-inflated evaluation scores. For example, suppose one uses the current default of 30 000 and evaluates on the pretty much standard Google Analogies test set. The semantic part of this test set contains about 12 000 quadruplets. But with the 30 000 threshold, about half of the test set is silently skipped (with the Google News model, 5 678 out of 12 280 questions are skipped). As a result, the models are evaluated only on the questions containing high-frequency words, which is of course easier. Moreover, the candidate answers are also selected only from the words within the threshold. All this makes such evaluation scores highly unreliable and dependent on word frequency fluctuations in the training corpora. The 300 000 threshold which I suggest will at least cover most 'normal words' of any natural language and make the evaluation scores for different models more comparable. It can of course be set to something else, 100 000 or 400 000 if you like; the point is that the order of magnitude should be hundreds of thousands, not tens of thousands. Finally, increasing the threshold will make Gensim-produced evaluation scores closer to the scores produced by the original |
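To make the skipping effect concrete, here is a minimal sketch in plain Python. It does not use Gensim; the helper `split_by_threshold`, the toy vocabulary, and the toy questions are all hypothetical and only illustrate how a frequency cutoff silently drops questions.

```python
def split_by_threshold(questions, vocab_by_frequency, restrict_vocab):
    """Split analogy questions into (kept, skipped) under a frequency cutoff.

    vocab_by_frequency is assumed to be sorted from most to least frequent,
    so the first `restrict_vocab` entries are the words that stay visible.
    """
    allowed = set(vocab_by_frequency[:restrict_vocab])
    kept, skipped = [], []
    for quad in questions:
        # A question survives only if all four of its words are within the cutoff.
        if all(word in allowed for word in quad):
            kept.append(quad)
        else:
            skipped.append(quad)
    return kept, skipped


# Toy data (hypothetical): eight words, most frequent first.
vocab = ["king", "queen", "man", "woman", "paris", "france", "oslo", "norway"]
questions = [
    ("man", "woman", "king", "queen"),
    ("paris", "france", "oslo", "norway"),
]

kept, skipped = split_by_threshold(questions, vocab, restrict_vocab=6)
print(len(kept), "kept;", len(skipped), "skipped")
```

With the cutoff at 6 the second question is dropped because "oslo" and "norway" fall outside the allowed words; raising the cutoff to 8 keeps both.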
If making this change, people using the same data & same eval-method will, after an upgrade, get 'worse' scores - which is likely to cause alarm, confusion, and support requests. (It's not quite 'backward compaTIbility' that's being broken, but 'backward compaRAbility'.) But the case for being more realistic, and especially matching the original I'd suggest adding a new, fixed method that's more directly analogous to the |
Maybe adding a new correct method and marking In fact, this silent skipping of OOV questions is wrong even if the threshold is very permissive (or even if there is no threshold at all). The current implementation (and the original word2vec implementation as well) makes it possible to get high scores on Google Analogies even if your model's vocabulary contains only 10 words, for example. If these 10 words cover at least 1 question from the test set and produce the correct answer, the method will output a score of 100%. I think that the fair way to evaluate is to penalize models for lacking words from the test set. This is what the |
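The difference between skipping OOV questions and penalizing them can be sketched as follows. This is an illustration in plain Python, not the Gensim implementation; the function `analogy_accuracy` and its outcome labels are hypothetical, and the `dummy4unknown` flag only mirrors the parameter name proposed in this PR.

```python
def analogy_accuracy(outcomes, dummy4unknown=False):
    """Compute accuracy over per-question outcomes.

    Each outcome is one of 'correct', 'incorrect' or 'oov' (a question that
    contains at least one out-of-vocabulary word).
    """
    correct = outcomes.count("correct")
    incorrect = outcomes.count("incorrect")
    oov = outcomes.count("oov")
    if dummy4unknown:
        # Penalize the model: every OOV question counts as a wrong answer.
        incorrect += oov
    # Otherwise OOV questions are skipped and do not enter the total.
    total = correct + incorrect
    return correct / total if total else 0.0


# A tiny model that can answer only one of ten questions.
outcomes = ["correct"] + ["oov"] * 9

lenient = analogy_accuracy(outcomes)
strict = analogy_accuracy(outcomes, dummy4unknown=True)
print(lenient, strict)
```

Under the skipping behaviour the tiny model scores 100%, while counting OOV questions as failures brings it down to 10%.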
That evaluation process was selected precisely to match the C tool behaviour 1:1. Did something change in Otherwise I'm -1 on diverging from the original at this point, at least under the established name. That's just confusing. |
As I've said above, the default |
Alright, thanks. That means If that's the case, mimicking its current behaviour is OK. |
@piskvorky @gojomo And what about the total, can I merge this one? |
Note that it doesn't make sense to precisely mimic the |
I had a look and Mikolov's word2vec paper also used 30,000. If that is the established standard, I'm -1 on deviating at this point. One option would be to change the default to match the C tool default ("off"). Another to create a new, non-conflicting evaluation method/process. |
@piskvorky yes, 'because of the way the task is framed, performance also depends on the size of the vocabulary to be searched: Mikolov et al. (2013a) pick the nearest neighbour among vectors for 1M words, Mikolov et al. (2013c) among 700K words, and we among 300K words.' I would say there is unfortunately no established standard here, and most users will simply run the evaluation script with default parameters. Overall, I support the suggestion of @gojomo to implement a new evaluation method and mark the current If everyone agrees to this plan, I can start implementing this new method. |
Sounds good to me, thanks for investigating @akutuzov . |
Despite the word2vec.c precedent, calling this just For example, a plausible non-running-time reason for clipping the analogies evaluation to more-frequent words is that the ‘long tail’ of words includes many that might crowd-out the ‘right’ answer, without being wholly ‘wrong’. They may be near-synonyms of the ‘best’ answer, or just be idiosyncratically-placed because of their few training examples. But they still help on most tasks, even if they hurt analogies! And you might not want to discard them before training - because they still have other value, and perhaps even help improve the ‘fat head’ words. (For example, |
@akutuzov will you make the changes in the current PR or in a new one? If in a new one, please close the current PR. |
@menshikh-iv I think I will work in this PR |
New method `evaluate_word_analogies` to solve word analogies. Implements a more sensible frequency threshold and the `dummy4unknown` parameter. It also runs twice as fast as the previous `accuracy` method, which is now deprecated.
OK, so as discussed before, I implemented a new
I marked the |
@@ -859,6 +967,7 @@ def log_accuracy(section):
section['section'], 100.0 * correct / (correct + incorrect), correct, correct + incorrect
)

@deprecated("Method will be removed in 4.0.0, use self.evaluate_word_analogies() instead")
This is the correct way, all fine 👍
gensim/models/keyedvectors.py
Outdated
@@ -850,6 +851,113 @@ def n_similarity(self, ws1, ws2):
v2 = [self[word] for word in ws2]
return dot(matutils.unitvec(array(v1).mean(axis=0)), matutils.unitvec(array(v2).mean(axis=0)))

@staticmethod
def log_evaluate_word_analogies(section):
Maybe it would be better to hide this method (with a leading `_`)?
What exactly do you mean? Or maybe you can point to an example of such hiding in the existing Gensim code?
I mean, why not `_log_evaluate_word_analogies`? I'm asking because this method looks like a helper for `evaluate_word_analogies`, nothing more.
Done.
gensim/models/keyedvectors.py
Outdated
def evaluate_word_analogies(self, analogies, restrict_vocab=300000, case_insensitive=True, dummy4unknown=False):
"""
Compute performance of the model on an analogy test set
Please use numpy-style docstrings (http://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_numpy.html and https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt)
Done.
gensim/models/keyedvectors.py
Outdated
def evaluate_word_analogies(self, analogies, restrict_vocab=300000, case_insensitive=True, dummy4unknown=False):
"""
Compute performance of the model on an analogy test set
(see https://aclweb.org/aclwiki/Analogy_(State_of_the_art)).
this should be rendered as a link, should look like
`Analogy (State of the art) <https://aclweb.org/aclwiki/Analogy_(State_of_the_art)>`_
Done
gensim/models/keyedvectors.py
Outdated
Compute performance of the model on an analogy test set
(see https://aclweb.org/aclwiki/Analogy_(State_of_the_art)).
`analogies` is a filename where lines are 4-tuples of words, split into sections by ": SECTION NAME" lines.
See questions-words.txt in
This file is also provided in the current repo: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/questions-words.txt. It is also part of the gensim package, so its path on the local machine can be retrieved as
from gensim.test.utils import datapath
datapath("questions-words.txt")
There is no need to download the source code of the C version to look at this file.
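For readers unfamiliar with the file layout described in the docstring (4-tuples of words, split into sections by ": SECTION NAME" lines), here is a minimal parsing sketch in plain Python. The function `parse_analogies` is hypothetical and is not the Gensim parser; the sample lines are written in the style of questions-words.txt.

```python
def parse_analogies(lines):
    """Group analogy 4-tuples by the ': SECTION NAME' header preceding them."""
    sections = []
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(": "):
            # A header line opens a new section.
            current = {"section": line[2:], "questions": []}
            sections.append(current)
            continue
        if current is None:
            raise ValueError("question line found before any ': SECTION' header")
        words = line.split()
        if len(words) != 4:
            continue  # skip malformed lines instead of failing
        current["questions"].append(tuple(words))
    return sections


sample = """: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand
: family
boy girl brother sister
"""

sections = parse_analogies(sample.splitlines())
for section in sections:
    print(section["section"], len(section["questions"]))
```

Skipping malformed lines rather than raising is a design choice of this sketch only; a stricter parser might report them.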
Done.
gensim/models/keyedvectors.py
Outdated
oov_ratio = float(oov) / line_no * 100
logger.info('Quadruplets with out-of-vocabulary words: %.1f%%', oov_ratio)
if not dummy4unknown:
    logger.info('NB: analogies containing OOV words were skipped from evaluation! '
nitpick: please use hanging indents (instead of vertical)
Done.
gensim/models/keyedvectors.py
Outdated
Parameters
----------
`analogies` is a filename where lines are 4-tuples of words,
Should be
parameter_1 : type_1
Description_1.
parameter_2 : type_2
Description_2.
...
example
Done.
gensim/models/keyedvectors.py
Outdated
with out-of-vocabulary words. Otherwise (default False), these
tuples are skipped entirely and not used in the evaluation.

References
Please don't use a References section (it will cause problems in the future, "thanks" to the autosummary sphinx plugin); simply add it as a link with a description (as I mentioned before in #1935 (comment)).
Done.
@menshikh-iv Is everything OK now? |