Add "doesn't match" evaluation to KeyedVectors #2765

Open
wants to merge 2 commits into develop
Conversation

n8stringham

Background

The inspiration for this contribution arose during my research for my senior thesis project in mathematics at Pomona College. In my project I have been investigating the generation of word embeddings for Medieval Latin using the tools offered by gensim. In order to determine the accuracy of these embeddings, I have relied primarily on three types of tasks--analogies, odd-one-out, and topk similarity.

Testing on analogies has been quite straightforward thanks to the already implemented wv.evaluate_word_analogies() function. This method makes large-scale experimentation and testing of word embeddings easy because it provides a means for test cases to be generated from a custom file in the same style as Mikolov et al.'s analogy set.

However, in addition to analogy testing, I also wanted to evaluate my word embeddings on both the odd-one-out task and topk similarity. Though functions to perform these tasks already exist in the form of wv.doesnt_match() and wv.most_similar(), there was no similarly convenient way to apply them to a large test set. For my own purposes, I set out to implement functions that provide the flexible, scaled evaluation capabilities of evaluate_word_analogies() for the odd-one-out and topk similarity tasks.

In this PR I add two functions: evaluate_doesnt_match() and evaluate_top_k_similarity(). Both seek to emulate the style of the evaluate_word_analogies() function in the types of parameters they take and the format of the test-set .txt file. Details are provided below.

Doesn't Match Evaluation on a File

This function expands the functionality of model.wv.doesnt_match() by allowing the user
to perform this evaluation task at scale on a custom .txt file of categories. It does this by
creating all possible "odd-one-out" groupings from the categories file. The groups are composed
of k_in words from one category and 1 word from a different category.

The function expects the .txt file to follow the formatting conventions of the
evaluate_word_analogies() function, where each category has a heading line (identified
by a colon) followed, on the next line, by a list of space-separated words that belong to that category.

e.g.
:fruits
apple banana pear raspberry

Note that each category in the .txt file must have at least k_in
words; otherwise, comparison groups can't be created.
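
For reference, here is a minimal sketch of how a file in this format might be parsed; the helper name and return type are my own illustrative assumptions, not the PR's actual code.

```python
# Minimal parsing sketch for the ':category' test-file format described above.
# The function name and return type are illustrative assumptions, not PR code.
def read_categories(path):
    """Return a dict mapping category name -> list of words."""
    categories = {}
    current = None
    with open(path, encoding="utf-8") as fin:
        for line in fin:
            line = line.strip()
            if not line:
                continue
            if line.startswith(":"):
                current = line[1:].strip()
                categories[current] = []
            elif current is not None:
                categories[current].extend(line.split())
    return categories
```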

In the event that categories contain the same word, the function could produce comparison groups that contain duplicate words.

For example, consider the following .txt file.
:food
apple hamburger hotdog soup
:fruit
apple pear banana grape

With k_in=3, some comparison groups would contain duplicate words.

[apple, hamburger, hotdog, apple]
[apple, hamburger, soup, apple]
[apple, hotdog, soup, apple]

By default, this function ignores these comparisons; set
eval_dupes=True to include them.
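
To make the grouping concrete, here is a rough sketch of how such comparison groups could be built from the parsed categories, and how groups with duplicate words would be skipped by default; the function name and exact behaviour are illustrative assumptions rather than the PR's implementation.

```python
from itertools import combinations

# Illustrative sketch of the grouping logic described above; assumes
# `categories` is a dict of category name -> list of words (see read_categories).
def build_odd_one_out_groups(categories, k_in, eval_dupes=False):
    """Yield groups of k_in words from one category plus 1 word from another."""
    for cat, words in categories.items():
        if len(words) < k_in:
            continue  # not enough words to form the in-category part of a group
        for in_group in combinations(words, k_in):
            for other_cat, other_words in categories.items():
                if other_cat == cat:
                    continue
                for odd_word in other_words:
                    group = list(in_group) + [odd_word]
                    # skip groups containing duplicate words unless requested
                    if not eval_dupes and len(set(group)) != len(group):
                        continue
                    yield group
```

A model would then be scored by checking, for each group, whether wv.doesnt_match(group) returns the odd word.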

Topk Similarity Evaluation on a File

This function evaluates the accuracy of a word2vec model
using the topk similarity metric.

The user provides the function with a .txt file of words divided
into categories. The file is expected to follow the formatting conventions of the
evaluate_word_analogies() method, where each category has a heading line (identified
by a colon) followed, on the next line, by a list of space-separated words that belong to that category.

e.g.
:fruits
apple banana pear raspberry

For each word in the file, the function generates a topk similarity list. This list
is compared against the other entries in that word's category in order to find
matches between the two. The number of matches is then used to compute
one of two accuracy measures -- topk_in_cat or cat_in_topk.
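
One plausible reading of those two measures, sketched here with gensim's existing most_similar(); the precise definitions and the function name are assumptions on my part, not the PR's code.

```python
# Illustrative sketch only; the definitions of topk_in_cat and cat_in_topk
# are inferred from the description above, not taken from the PR.
def topk_similarity_scores(wv, category_words, topk=10):
    """Return per-word topk_in_cat and cat_in_topk scores for one category."""
    results = []
    for word in category_words:
        if word not in wv:
            continue  # skip out-of-vocabulary words
        neighbours = {w for w, _ in wv.most_similar(word, topn=topk)}
        mates = set(category_words) - {word}
        matches = neighbours & mates
        results.append({
            "word": word,
            # fraction of the top-k neighbours that belong to the word's category
            "topk_in_cat": len(matches) / topk,
            # fraction of the word's category mates found among its top-k neighbours
            "cat_in_topk": len(matches) / len(mates) if mates else 0.0,
        })
    return results
```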

Summary

These two additional functions have been useful in my own experiments with creating good word embeddings, allowing me to evaluate their performance on the odd-one-out and similarity tasks with the same level of robustness as the analogy task. I believe this is an important tool because measuring the goodness of embeddings can often be ambiguous; access to multiple evaluation methods can help bring clarity to the task of assessment.

Since I needed to implement these functions for my own work it seemed fitting to offer them back as a contribution to the project. I hope you find them useful.

Thanks!

Nate Stringham

@mpenkov (Collaborator) commented Mar 21, 2020

Thank you for your interest in gensim and your effort.

In its current form, I don't think the contribution is a good fit for gensim, for the following reasons:

  • The code is not reusable. You output results to standard output, but gensim users aren't likely to look there for results.
  • You expect the input to live as a file on the local storage, in a specific format.
  • Overall, this stuff seems more like application logic than library functionality.

@piskvorky What do you think?

@piskvorky (Owner) commented Mar 21, 2020

I think the functionality is useful, but maybe too specific. This would be a better fit as a stand-alone extension (Python package), not in core gensim. Also for maintenance reasons.

@mikeizbicki commented Mar 21, 2020

I am @n8stringham 's advisor for this project, so I think I can help clarify the point of this contribution and address your concerns, @mpenkov and @piskvorky.

Concern 1: Overall, this stuff seems more like application logic than library functionality.

I think this isn't the case. Word analogies are the current gold standard for evaluating word embeddings, but they only work with large amounts of training data. @n8stringham 's contribution is designed for the low resource setting. In particular, these measures start showing improvements before the analogy metric starts showing improvement. The low resource setting is widely applicable, and that is why we believe the measures should be included directly in gensim.

@n8stringham is currently writing up a paper demonstrating the wide applicability of these measures, and so maybe from your perspective it would make more sense to include these functions after the paper has been published and the documentation can link to the paper?

Concern 2: The code is not reusable. You output results to standard output, but gensim users aren't likely to look there for results.

The code does not output to stdout, but returns the results. It's true that there is a debug flag which causes detailed printing to stdout, but this could easily be removed.

Concern 3: You expect the input to live as a file on the local storage, in a specific format.

This exactly follows the pattern of the evaluate_word_analogies function, and so a different input method would be weird.

Concern 4: maintenance reasons

I would definitely understand if this would impose too much of a maintenance burden, and so you don't want to include it for that reason. But our hope is that the wide applicability of the methods would make the maintenance burden worth the cost.

@piskvorky (Owner) commented Mar 21, 2020

Thanks. I do see value in better evaluation functions. My main worry is we have several already, with various parameters, and it's chaos for users.

So to me this is a question of discovery + documentation for users: "when would I use this?", "why would I use this and not something else?", plus maintenance going forward. Unless the use-cases are clear and attract a convincing user base, it will be yet another algorithm we include to bit-rot in Gensim.

Having a thorough analysis + paper to refer to definitely helps. Anything that will communicate to users this is "general and robust enough" and it will solve their problem.

@piskvorky piskvorky changed the title Add doesnt match eval function Add "doesn't match" evaluation in KeyedVectors Mar 23, 2020
@piskvorky piskvorky changed the title Add "doesn't match" evaluation in KeyedVectors Add "doesn't match" evaluation to KeyedVectors Mar 23, 2020
@mpenkov (Collaborator) commented Mar 24, 2020

Another thing I've noticed is that the added functionality doesn't need to be part of the class it's being added to. The new functionality consists of two methods, but neither of those methods access self. They are essentially pure functions masquerading as methods.

From a maintainer's point of view, if we were to keep this, it'd be better to move these out of the class. They could live pretty much anywhere (same module, different module, different package, or outside of gensim altogether).

@gojomo (Collaborator) commented Mar 24, 2020

> Another thing I've noticed is that the added functionality doesn't need to be part of the class it's being added to. The new functionality consists of two methods, but neither of those methods access self. They are essentially pure functions masquerading as methods.
>
> From a maintainer's point of view, if we were to keep this, it'd be better to move these out of the class. They could live pretty much anywhere (same module, different module, different package, or outside of gensim altogether).

Agreed - and except for accrued tradition/practice, this same reasoning could apply to the other evaluate methods as well, putting them in some other focused module.

@mpenkov mpenkov added the stale Waiting for author to complete contribution, no recent effort label Jun 29, 2021
@mpenkov (Collaborator) commented Jun 29, 2021

Ping @n8stringham : are you able to complete this PR?

@n8stringham (Author)

@mpenkov Sorry for the delay. The paper I was working on has been published (https://aclanthology.org/2020.eval4nlp-1.17/). In addition to describing these evaluation functions, we also developed a method to automatically generate test sets for them in any language supported by Wikidata. I ended up putting together a small PyPI package that includes the evaluation functions as well as functions to generate multilingual test sets. The code currently lives at https://github.com/n8stringham/gensim-evaluations.

I'd be happy to add the functions to gensim if it still seems worthwhile.

> except for accrued tradition/practice, this same reasoning could apply to the other evaluate methods as well, putting them in some other focused module.

If not, do you still want someone to work on this?

@gojomo (Collaborator) commented Aug 16, 2021

Adding more evaluation options would be a plus. Each new evaluation function has potential intrinsic value, perhaps better capturing how well word-vectors work for specific downstream uses. But also, having a variety could better communicate to users the idea that the traditional 'analogies' evaluation isn't the end-all/be-all of word-vector quality for all downstream tasks. (Sometimes sets of word-vectors that do better on analogies do worse when used in tasks like classification.)

And, refactoring such that KeyedVectors doesn't keep growing with methods that don't even need self would also be valuable.

I could see such functions potentially as either:

  • top-level functions in an evaluations module; or...
  • methods on some sort of single word-vectors-evaluations utility class; or even...
  • grouped per-evaluation-type into different modules/types (e.g. AnalogiesEvaluation.evaluate(keyed_vectors, test_file), with the alternate evals as much as possible surfacing common overall summary scores, subset scoring, etc.); a rough sketch of this last option follows.
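
A possible shape for that last option; the class and method names here are invented for illustration and are not part of gensim's API.

```python
# Hypothetical sketch of a per-evaluation-type class living outside KeyedVectors;
# names are invented for illustration, not part of gensim.
class DoesntMatchEvaluation:
    """Bundles one evaluation type, keeping KeyedVectors itself lean."""

    def __init__(self, k_in=3, eval_dupes=False):
        self.k_in = k_in
        self.eval_dupes = eval_dupes

    def evaluate(self, keyed_vectors, test_file):
        """Parse test_file, build groups, and score keyed_vectors.doesnt_match()."""
        ...  # stub: the evaluation logic itself would live here

# Usage would mirror the existing evaluate_* methods:
#     score = DoesntMatchEvaluation(k_in=3).evaluate(model.wv, "categories.txt")
```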
