Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added Function relative_cosine_similarity in keyedvectors.py #2307

Merged
merged 18 commits into from
Jan 15, 2019
Merged

Added Function relative_cosine_similarity in keyedvectors.py #2307

merged 18 commits into from
Jan 15, 2019

Conversation

rsdel2007
Copy link
Contributor

Fixes #2175.
I have implemented relative_cosine_similarity as function according to the paper and as @gojomo suggested in #2175 discussion.
According to paper :
rcs(top-n) = cosine_similarity(wordA,wordB)/(sum of cosine_similarities of top-n similar words to wordA)
For finding the top-n similar words I have used method similar_by_word(word,topn).

@@ -1384,7 +1384,42 @@ def init_sims(self, replace=False):
else:
self.vectors_norm = (self.vectors / sqrt((self.vectors ** 2).sum(-1))[..., newaxis]).astype(REAL)

def relative_cosine_similarity(self, wa, wb, topn=10):
"""Compute the relative cosine similarity between two words given top-n similar words,
proposed by Artuur Leeuwenberg,Mihaela Vela,Jon Dehdari,Josef van Genabith
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spaces after commas.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

"A Minimally Supervised Approach for Synonym Extraction with Word Embeddings"
<https://ufal.mff.cuni.cz/pbml/105/art-leeuwenberg-et-al.pdf>.
To calculate relative cosine similarity between two words, equation (1) of the paper is used.
For WordNet synonyms, if rcs(topn=10) is greater than 0.10 than wa and wb are more similar than
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second of three 'than's on this line should actually be 'then' (consequently) not 'than' (comparative).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah...Thanks:).


return rcs
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need blank line before next method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

"""

result = self.similar_by_word(wa, topn)
topn_words = []
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

topn_words is never needed to calculate results - so no good reason to create.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay...Done.

relative cosine similarity between wa and wb.
"""

result = self.similar_by_word(wa, topn)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this is a list of results, using a plural variable name would be slightly better. Also, it's common in the existing gensim code to call the list-of-most-similar-items sims (short for 'similars'), so I'd recommend that variable name here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. sims is used as variable name.


topn_cosine = np.array(topn_cosine)

norm = np.sum(topn_cosine)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

norm isn't a good name here, as it usually means something other than a sum.

But, there's not really a need to loop-append, convert-to-np-array, or put the sum calculation in a local variable. The sum can be a short, idiomatic calculation at the place where it's needed as the denominator of the final return-value calculation, for example just: sum(result[1] for result in sims).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@gojomo
Copy link
Collaborator

gojomo commented Dec 23, 2018

Thanks for your contribution! I've added some line-by-line comments; there should also be a unit test method that confirms expected results in some minimal way.

Automatic tests are failing, but that appears unrelated to your changes - @menshikh-iv, it again looks related to some doc-building command.

@rsdel2007
Copy link
Contributor Author

Yeah, sure I will make the changes.

@rsdel2007
Copy link
Contributor Author

@gojomo can you help me, how to write unit test?

@gojomo
Copy link
Collaborator

gojomo commented Dec 24, 2018

@rsdel2007 Take a look at the existing tests in https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_keyedvectors.py as models – and make something that at least minimally uses the new function, and verifies its results according to some idea of where it's beneficial. (There might be ideas in the original paper proposing relative-cosine-similarity about the kinds of word-comparisons where it'd be meaningful. Matching any results claimed there, even in a tiny way, would both help confirm proper operation and highlight proper use.)

While you are developing your test locally, you can run all the tests in test_keyedvectors.py by executing the file itself – its main() runs its own tests. After you commit code, the multiple automatic testing services used by the gensim project will run it along with all others.

@rsdel2007
Copy link
Contributor Author

I have written the unit test in the test_keyedvectors.py, but when I am running the code this error is coming.
Can you help me here @gojomo ?
test

@gojomo
Copy link
Collaborator

gojomo commented Dec 26, 2018

It is almost always better to cut & paste actual error text than to use a screenshot - it ensures the text will appear in search results, etc.

Are you sure you're executing the tests inside the same active environment (finding the same working copy of gensim) as where you added relative_cosine_similarity()? I presume you've run the method separately in your own ad-hoc tests... how did you run it then? Can you still run it there?

@rsdel2007
Copy link
Contributor Author

Yes I am sure that these both are in the same environment. I have checked it twice, but still I got stuck at this.

@horpto
Copy link
Contributor

horpto commented Dec 27, 2018

@rsdel2007
Add your tests into PR even they are not completed successfully.

@rsdel2007
Copy link
Contributor Author

@horpto ,@gojomo I have added the test. Please me tell what changes should I do?

@gojomo
Copy link
Collaborator

gojomo commented Dec 27, 2018

The test is not succeeding - see for example https://travis-ci.org/RaRe-Technologies/gensim/jobs/472632483

But, you absolutely need to figure out locally how to run the tests without the has no attribute 'relative_cosine_similarity'error appearing. Without being at your system, I don't have further ideas, except to say that the error is suggestive that the python/library-search-directories that you're running test_keyedvectors.py with is not your development python-env, but some other one, such as the system default. (You should be using separate envs, as are typically created by virtualenv or venv, to keep projects' libraries separate during development.) You could try looking up other unit-test tutorials or virtual-env tutorials that may be custom to the "visual studio" tool you appear to be using.

The unit-test needs to pass and make sense according to the supposed benefits of this new calculation, ideally by matching the claims/examples of the paper in which it originated.

@rsdel2007
Copy link
Contributor Author

Thanks @gojomo I figure out the problem.
One problem which I am getting while preparing for the unit test is that results for topn similar words to a given word are not same for most_similar and wordnet. That's why I am getting some differences from the paper.
So according to you what should I do?

@gojomo
Copy link
Collaborator

gojomo commented Dec 29, 2018

Can you fashion a test which probes/demonstrates the same advantages the paper claims for this measure, even if the unit-test environment only has access to much smaller corpuses/vector-sets?

@rsdel2007
Copy link
Contributor Author

@gojomo I have added the test.

@rsdel2007
Copy link
Contributor Author

@gojomo @menshikh-iv, Please take a look.

Copy link
Owner

@piskvorky piskvorky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code style.

gensim/models/keyedvectors.py Outdated Show resolved Hide resolved
gensim/models/keyedvectors.py Show resolved Hide resolved
gensim/models/keyedvectors.py Show resolved Hide resolved
gensim/models/keyedvectors.py Show resolved Hide resolved
gensim/models/keyedvectors.py Outdated Show resolved Hide resolved
gensim/models/keyedvectors.py Outdated Show resolved Hide resolved
@rsdel2007
Copy link
Contributor Author

@piskvorky I have made the changes. Please take a look.

cos_sim.append(self.vectors.similarity("good", wordnet_syn[i]))
cos_sim = sorted(cos_sim, reverse=True) # cosine_similarity of "good" with wordnet_syn in decreasing order
# computing relative_cosine_similarity of two similar words
rcs_wordnet = self.vectors.similarity("good", "nice") / sum(cos_sim[i] for i in range(10))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what this is calculating. It's kind of like the relactive_cosine_similarity() formula, but now with only WordNet synonyms as contributors to the denominator. And, only those synonyms which happen to be in this vector-set. Are all those words in the euclidean_vectors.bin test vectors set? As a result, I'm not sure what the following asserts really test. Is this matching something in the paper?

Copy link
Contributor Author

@rsdel2007 rsdel2007 Dec 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this is the problem which I found while making test.There is not any claim or any perfect result in the paper and I can't find any way to confirm on corpus other than wordnet, so I think best way will be to compare the relative_cosine_similarity of wordnet synonyms and most_similar ones under a threshold of 0.125.

Let me give explain the insights of the section relative cosine similarity of the paper_:

  1. Construct a set of the top 10 most (cosine) similar words for w1 (called topn in the paper).
  2. Calculate a normalized score for each of the words in the topn, by dividing by the sum of the topn cosine similarity scores.

They mostly wanted to know if the most similar word of w1 was a synonym or not, and not a synonym/hypernym etc. They expected that if the most (cosine) similar word is a lot more (cosine) similar than the other words in the topn it is more likely to be a synonym, than if it is only slightly more similar. So this is what the rcs takes into account.
So, they come to conclusion which is the only claim in the paper is that if a word pair have a rcs greater 0.10 than it is more likely to be an arbitrary pair.
0.10 can be used threshold but this result is based on wordnet corpus. On a short corpus this result may be more lower. The threshold is nothing but the mean of the cosine_similarities of topn words. So on a short corpus it may be anything less than 0.10.

@gojomo ,Can you suggest some better way to test why looking at above description?

I am looking forward for helping to contribute for the tests

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know offhand what's in euclidean_vectors.bin - if there's any overlap with the wordnet words you've chosen. But if there's one or more word-and-nearest-neighbor pairs in that set-of-word-vectors (or some other available-at-unit-testing set-of-word-vectors) that the RCS measure successfully identifies as synonyms, and one or more other word-and-nearest-neighbor pairs that the RCS measure also successfully rejects as synonyms, then having the test method show that functionality would be useful as a demonstration/confirmation of the RCS functionality. (And, at least a little, a guard against any future regressions where that breaks due to other changes... which seems unlikely here, but is one of the reasons for ensuring this kind of test coverage.)

Maybe @viplexke, who originally suggested this in #2175, has some other application/test ideas?

self.assertTrue(np.allclose(rcs_wordnet, rcs, 0, 0.125))
# computing relative_cosine_similarity for two non-similar words
rcs = self.vectors.relative_cosine_similarity("good", "worst", 10)
self.assertTrue(rcs < 0.10)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is 0.10 an important threshold from the paper, or just chosen because it works? Is this sort of contrast – between a word good and a near-antonym worst – the sort of thing RCS is supposed to be good for?

gensim/models/keyedvectors.py Outdated Show resolved Hide resolved
@viplexke
Copy link

viplexke commented Jan 4, 2019 via email

@rsdel2007
Copy link
Contributor Author

@viplexke can you provide test cases?

@viplexke
Copy link

viplexke commented Jan 8, 2019 via email

@viplexke
Copy link

viplexke commented Jan 8, 2019 via email

@rsdel2007
Copy link
Contributor Author

rsdel2007 commented Jan 9, 2019

@viplexke according to paper,
rcs(wa,wb,topn)= cs(wa,wb)/(sum of cosine_similarities of top-n similar words to wa).
So, rcs(x.word, x.synonym) /rcs(x.word, x.antonym) will be equal to cs(x.word, x.synonym) / cs(x.word, x.antonym).

CS(x.word, x.synonym) / CS(x.word, x.antonym) < RCS(x.word, x.synonym) / RCS(x.word, x.antonym)

So how can this prove the functionality?

@viplexke
Copy link

viplexke commented Jan 10, 2019 via email

@rsdel2007
Copy link
Contributor Author

@viplexke Can you please check again. It seems same to me again.

@rsdel2007
Copy link
Contributor Author

@viplexke Since this feature is not merged, you post it here.
wdyt @gojomo ?

@viplexke
Copy link

viplexke commented Jan 10, 2019

@viplexke Can you please check again. It seems same to me again.

You're right, sorry about that.
I'll find another test and change the test function accordingly.

Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, thanks @rsdel2007, I like current PR when @gojomo will be satisfied - I'll merge that (ping me, Gordon)

@@ -195,6 +195,7 @@ class Vocab(object):
and for constructing binary trees (incl. both word leaves and inner nodes).

"""

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated changes, please revert all of it (stay PR compact)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -1384,12 +1387,42 @@ def init_sims(self, replace=False):
else:
self.vectors_norm = (self.vectors / sqrt((self.vectors ** 2).sum(-1))[..., newaxis]).astype(REAL)

def relative_cosine_similarity(self, wa, wb, topn=10):
"""Compute the relative cosine similarity between two words given top-n similar words,
proposed by Artuur Leeuwenberg, Mihaela Vela, Jon Dehdari, Josef van Genabith
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to make a proper link in the doc, please use

by `Artuur Leeuwenberg, ... <https://ufal.mff.cuni.cz/pbml/105/art-leeuwenberg-et-al.pdf>`_

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay..Done.

Parameters
----------
wa: str
word for which we have to look top-n similar word.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sentence should start from uppercased letter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

gensim/models/keyedvectors.py Show resolved Hide resolved
gensim/models/keyedvectors.py Show resolved Hide resolved
"""
sims = self.similar_by_word(wa, topn)
assert sims, "Failed code invariant: list of similar words must never be empty."
rcs = (self.similarity(wa, wb)) / (sum(result[1] for result in sims))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need wrap left part with ()

Copy link
Owner

@piskvorky piskvorky Jan 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, prepend float if this is meant to be a float division. Both to avoid potential errors due to integer operands in python2, and to make the intent clear.

Also, can you please unpack result into appropriately named variables, instead of writing result[1]?

Copy link
Contributor

@menshikh-iv menshikh-iv Jan 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
rcs = (self.similarity(wa, wb)) / (sum(result[1] for result in sims))
rcs = float(self.similarity(wa, wb)) / sum(sim for _, sim in sims)

Copy link
Owner

@piskvorky piskvorky Jan 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@menshikh-iv cool! You have to teach me how to do that :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

def test_relative_cosine_similarity(self):
"""Test relative_cosine_similarity returns expected results with an input of a word pair and topn"""
wordnet_syn = ['good', 'goodness', 'commodity', 'trade_good', 'full', 'estimable', 'honorable',
'respectable', 'beneficial', 'just', 'upright', 'adept', 'expert', 'practiced', 'proficient',
Copy link
Contributor

@menshikh-iv menshikh-iv Jan 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

format it properly, please, like

wordnet_syn = [
    'good', 'goodness', 'commodity', 'trade_good', 'full', 'estimable', 'honorable',
    'respectable', 'beneficial', 'just', 'upright', 'adept', 'expert', 'practiced', 'proficient',
    'skillful', 'skilful', 'dear', 'near', 'dependable', 'safe', 'secure', 'right', 'ripe', 'well',
    'effective', 'in_effect', 'in_force', 'serious', 'sound', 'salutary', 'honest', 'undecomposed',
    'unspoiled', 'unspoilt', 'thoroughly', 'soundly',
]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -1385,6 +1385,36 @@ def init_sims(self, replace=False):
logger.info("precomputing L2-norms of word weight vectors")
self.vectors_norm = _l2_norm(self.vectors, replace=replace)

def relative_cosine_similarity(self, wa, wb, topn=10):
"""Compute the relative cosine similarity between two words given top-n similar words,
by Artuur Leeuwenberg, ... "A Minimally Supervised Approach for Synonym Extraction with Word Embeddings"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be proper link rendered by sphinx, in this case

by `Artuur ..... <https://ufal.mff.cuni.cz/pbml/105/art-leeuwenberg-et-al.pdf>`_

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. 👍

Copy link
Contributor

@menshikh-iv menshikh-iv Jan 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, this was an example, still incorrect (+ still missing ` and _)

by `Artur <another authors, paper name> <URL_LINK>`_

Copy link
Contributor Author

@rsdel2007 rsdel2007 Jan 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I have to include another authors and paper name in < > and add link in a separate < >?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just replace with this

`Artuur Leeuwenberga, Mihaela Velab , Jon Dehdaribc, Josef van Genabithbc "A Minimally Supervised Approach for Synonym Extraction
with Word Embeddings" <https://ufal.mff.cuni.cz/pbml/105/art-leeuwenberg-et-al.pdf>`_.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks 😄

@menshikh-iv
Copy link
Contributor

@gojomo how do you feel about this PR? Looks good for me, waiting approve from you (or notes about what's still should be fixed)

gensim/models/keyedvectors.py Outdated Show resolved Hide resolved
@gojomo
Copy link
Collaborator

gojomo commented Jan 14, 2019

Though the difficulty in demonstrating/testing the value of this calculation has me more doubtful than initially of its value, I suspect the simple implementation is correct, and the test is OK/passing, and there's little risk of harm to users who don't use it, so merging is OK with me!

@menshikh-iv
Copy link
Contributor

congratz @rsdel2007 👍

@menshikh-iv menshikh-iv merged commit a864e02 into piskvorky:develop Jan 15, 2019
@rsdel2007 rsdel2007 deleted the rcs branch January 15, 2019 12:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants