Added Function relative_cosine_similarity in keyedvectors.py #2307
Conversation
gensim/models/keyedvectors.py
Outdated
@@ -1384,7 +1384,42 @@ def init_sims(self, replace=False):
        else:
            self.vectors_norm = (self.vectors / sqrt((self.vectors ** 2).sum(-1))[..., newaxis]).astype(REAL)

    def relative_cosine_similarity(self, wa, wb, topn=10):
        """Compute the relative cosine similarity between two words given top-n similar words,
        proposed by Artuur Leeuwenberg,Mihaela Vela,Jon Dehdari,Josef van Genabith
Spaces after commas.

Done.
gensim/models/keyedvectors.py
Outdated
        "A Minimally Supervised Approach for Synonym Extraction with Word Embeddings"
        <https://ufal.mff.cuni.cz/pbml/105/art-leeuwenberg-et-al.pdf>.
        To calculate relative cosine similarity between two words, equation (1) of the paper is used.
        For WordNet synonyms, if rcs(topn=10) is greater than 0.10 than wa and wb are more similar than
Second of three 'than's on this line should actually be 'then' (consequently), not 'than' (comparative).

Yeah... Thanks :).
gensim/models/keyedvectors.py
Outdated

        return rcs
Need blank line before next method.

Done.
gensim/models/keyedvectors.py
Outdated
        """

        result = self.similar_by_word(wa, topn)
        topn_words = []
topn_words is never needed to calculate results - so no good reason to create.

Okay... Done.
gensim/models/keyedvectors.py
Outdated
        relative cosine similarity between wa and wb.
        """

        result = self.similar_by_word(wa, topn)
As this is a list of results, using a plural variable name would be slightly better. Also, it's common in the existing gensim code to call the list-of-most-similar-items sims (short for 'similars'), so I'd recommend that variable name here.

Done. sims is used as the variable name.
gensim/models/keyedvectors.py
Outdated

        topn_cosine = np.array(topn_cosine)

        norm = np.sum(topn_cosine)
norm isn't a good name here, as it usually means something other than a sum.

But, there's not really a need to loop-append, convert-to-np-array, or put the sum calculation in a local variable. The sum can be a short, idiomatic calculation at the place where it's needed as the denominator of the final return-value calculation, for example just: sum(result[1] for result in sims).
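The suggested refactor can be sketched in a few lines. Everything here is a made-up stand-in for illustration: `sims` mimics the shape of a most_similar-style result (a list of (word, cosine_similarity) tuples) and the similarity values are invented, not actual gensim output.

```python
# Hypothetical most_similar-style result: (word, cosine_similarity) tuples.
sims = [("great", 0.81), ("nice", 0.79), ("decent", 0.60)]

cos_ab = 0.75  # made-up cosine similarity between wa and wb

# Denominator computed inline where it is needed, as suggested above:
# no loop-append, no np.array conversion, no intermediate 'norm' variable.
rcs = cos_ab / sum(result[1] for result in sims)
```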
Done.
Thanks for your contribution! I've added some line-by-line comments; there should also be a unit test method that confirms expected results in some minimal way. Automatic tests are failing, but that appears unrelated to your changes - @menshikh-iv, it again looks related to some doc-building command.

Yeah, sure, I will make the changes.

@gojomo can you help me with how to write the unit test?

@rsdel2007 Take a look at the existing tests in https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_keyedvectors.py as models – and make something that at least minimally uses the new function, and verifies its results according to some idea of where it's beneficial. (There might be ideas in the original paper proposing relative-cosine-similarity about the kinds of word-comparisons where it'd be meaningful. Matching any results claimed there, even in a tiny way, would both help confirm proper operation and highlight proper use.) While you are developing your test locally, you can run all the tests in

I have written the unit test in the

It is almost always better to cut & paste actual error text than to use a screenshot - it ensures the text will appear in search results, etc. Are you sure you're executing the tests inside the same active environment (finding the same working copy of gensim) as where you added

Yes, I am sure both are in the same environment. I have checked it twice, but I am still stuck on this.

@rsdel2007

The test is not succeeding - see for example https://travis-ci.org/RaRe-Technologies/gensim/jobs/472632483 But you absolutely need to figure out locally how to run the tests without the

The unit test needs to pass and make sense according to the supposed benefits of this new calculation, ideally by matching the claims/examples of the paper in which it originated.

Thanks @gojomo, I figured out the problem.

Can you fashion a test which probes/demonstrates the same advantages the paper claims for this measure, even if the unit-test environment only has access to much smaller corpuses/vector-sets?

@gojomo I have added the test.

@gojomo @menshikh-iv, please take a look.
Code style.
@piskvorky I have made the changes. Please take a look.
        cos_sim.append(self.vectors.similarity("good", wordnet_syn[i]))
cos_sim = sorted(cos_sim, reverse=True)  # cosine_similarity of "good" with wordnet_syn in decreasing order
# computing relative_cosine_similarity of two similar words
rcs_wordnet = self.vectors.similarity("good", "nice") / sum(cos_sim[i] for i in range(10))
I'm not sure what this is calculating. It's kind of like the relative_cosine_similarity() formula, but now with only WordNet synonyms as contributors to the denominator. And, only those synonyms which happen to be in this vector-set. Are all those words in the euclidean_vectors.bin test vectors set? As a result, I'm not sure what the following asserts really test. Is this matching something in the paper?
Actually, this is the problem I found while making the test. There is no precise claim or exact result in the paper, and I can't find any way to confirm on a corpus other than wordnet, so I think the best way will be to compare the relative_cosine_similarity of wordnet synonyms and most_similar ones under a threshold of 0.125.
Let me explain the insights of the paper's section on relative cosine similarity:
- Construct a set of the top 10 most (cosine) similar words for w1 (called topn in the paper).
- Calculate a normalized score for each of the words in the topn, by dividing by the sum of the topn cosine similarity scores.

They mostly wanted to know if the most similar word of w1 was a synonym or not, rather than a hypernym etc. They expected that if the most (cosine) similar word is a lot more (cosine) similar than the other words in the topn, it is more likely to be a synonym than if it is only slightly more similar. So this is what the rcs takes into account.
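These two steps can be sketched in plain Python; the words and cosine values below are made up purely for illustration, not taken from any real vector set.

```python
# Step 1: a hypothetical top-10 most (cosine) similar words for w1.
topn = [("great", 0.81), ("nice", 0.79), ("decent", 0.60), ("bad", 0.58),
        ("fine", 0.55), ("solid", 0.54), ("okay", 0.52), ("superb", 0.50),
        ("fair", 0.49), ("poor", 0.47)]

# Step 2: normalize each cosine score by the sum of the topn scores.
total = sum(cos for _, cos in topn)
rcs = {word: cos / total for word, cos in topn}

# The normalized scores sum to 1, so with topn=10 the mean score is 0.10;
# a word clearly above that mean stands out from the rest of the topn.
```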
So they come to the conclusion, which is the only claim in the paper, that if a word pair has an rcs greater than 0.10, it is more likely to be a synonym pair than an arbitrary pair. 0.10 can be used as a threshold, but this result is based on the wordnet corpus; on a small corpus this result may be lower. The threshold is nothing but the mean of the cosine_similarities of the topn words, so on a small corpus it may be anything less than 0.10.

@gojomo, can you suggest some better way to test, looking at the above description?

I am looking forward to help in contributing the tests.
I don't know offhand what's in euclidean_vectors.bin - if there's any overlap with the wordnet words you've chosen. But if there's one or more word-and-nearest-neighbor pairs in that set-of-word-vectors (or some other available-at-unit-testing set-of-word-vectors) that the RCS measure successfully identifies as synonyms, and one or more other word-and-nearest-neighbor pairs that the RCS measure also successfully rejects as synonyms, then having the test method show that functionality would be useful as a demonstration/confirmation of the RCS functionality. (And, at least a little, a guard against any future regressions where that breaks due to other changes... which seems unlikely here, but is one of the reasons for ensuring this kind of test coverage.)

Maybe @viplexke, who originally suggested this in #2175, has some other application/test ideas?
self.assertTrue(np.allclose(rcs_wordnet, rcs, 0, 0.125))
# computing relative_cosine_similarity for two non-similar words
rcs = self.vectors.relative_cosine_similarity("good", "worst", 10)
self.assertTrue(rcs < 0.10)
Is 0.10 an important threshold from the paper, or just chosen because it works? Is this sort of contrast – between a word good and a near-antonym worst – the sort of thing RCS is supposed to be good for?
Hi, according to the paper, precision of rcs should go up when increasing
the similarity threshold, in contrast with plain cosine similarity, which
was the main motive behind rcs. So I thought maybe calculating such a
precision curve for one or two top10 cosine similarity sets could prove the
usefulness. Testing on a few individual pairs may work as well, using the
fact that rcs gives lower values for co-hyponyms and related words (false
positives) than plain cs. I hope that helps. I'll try to give test cases..
@viplexke can you provide test cases?
"""test cases in [word, synonym, antonym, co-hyponym, related word] form"""
Here are 5 dictionaries with 5 hierarchical category keys each.

dic_mother = {"word": "mother", "synonym": "mom", "antonym": "father",
              "co-hyponym": "father", "related_word": "birth"}
dic_cautious = {"word": "cautious", "synonym": "careful", "antonym": "careless",
                "co-hyponym": "shrewd", "related_word": "danger"}
dic_guy = {"word": "guy", "synonym": "dude", "antonym": "girl",
           "co-hyponym": "bachelor", "related_word": "son"}
dic_white = {"word": "white", "synonym": "milky", "antonym": "black",
             "co-hyponym": "blue", "related_word": "blank"}
dic_bend = {"word": "bend", "synonym": "curl", "antonym": "straighten",
            "co-hyponym": "stretch", "related_word": "river"}

For each case, we can calculate 6 values:

(R)CS(x.word, x.synonym) / (R)CS(x.word, x.antonym)
(R)CS(x.word, x.synonym) / (R)CS(x.word, x.co-hyponym)
(R)CS(x.word, x.synonym) / (R)CS(x.word, x.related_word)

where (R)CS is the (relative) cosine similarity, and each ratio with RCS should be higher than the respective ratio with CS:

CS(x.word, x.synonym) / CS(x.word, x.antonym) < RCS(x.word, x.synonym) / RCS(x.word, x.antonym)

I think this shows that it does what it's designed for, and that it's not broken. If this fails significantly, then I was wrong :)
"each ratio with RCS should be higher than the respective ratio with CS" - of course I don't mean this universally. The more the better.
@viplexke according to the paper,
rcs(wa, wb, topn) = cs(wa, wb) / (sum of cosine_similarities of the top-n words most similar to wa).
So rcs(x.word, x.synonym) / rcs(x.word, x.antonym) will be equal to cs(x.word, x.synonym) / cs(x.word, x.antonym).
So how can this prove the functionality?
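This cancellation is easy to check numerically. A toy sketch with made-up cosine values, where both comparison words share the same query word x.word and hence the same topn denominator:

```python
# Made-up cosine similarities from x.word to its topn neighbours.
topn_sims = [0.80, 0.75, 0.60, 0.55, 0.50]
cs_syn, cs_ant = 0.80, 0.55  # cs(x.word, x.synonym), cs(x.word, x.antonym)

# Both rcs values divide by the same sum, since the topn belongs to x.word.
denom = sum(topn_sims)
rcs_syn = cs_syn / denom
rcs_ant = cs_ant / denom

# The shared denominator cancels, so the RCS ratio equals the CS ratio.
ratio_rcs = rcs_syn / rcs_ant
ratio_cs = cs_syn / cs_ant
```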
You're right, I skipped the sum. The formula is
CS(x.word, x.synonym) / CS(x.word, x.antonym) < RCS(x.word, x.synonym, dictionary) / RCS(x.word, x.antonym, dictionary)
where *dictionary* corresponds to topn.
@viplexke Can you please check again? It seems the same to me.
You're right, sorry about that.
Nice, thanks @rsdel2007, I like the current PR. When @gojomo is satisfied, I'll merge it (ping me, Gordon).
gensim/models/keyedvectors.py
Outdated
@@ -195,6 +195,7 @@ class Vocab(object):
    and for constructing binary trees (incl. both word leaves and inner nodes).

    """

Unrelated changes, please revert all of them (keep the PR compact).

Done.
gensim/models/keyedvectors.py
Outdated
@@ -1384,12 +1387,42 @@ def init_sims(self, replace=False):
        else:
            self.vectors_norm = (self.vectors / sqrt((self.vectors ** 2).sum(-1))[..., newaxis]).astype(REAL)

    def relative_cosine_similarity(self, wa, wb, topn=10):
        """Compute the relative cosine similarity between two words given top-n similar words,
        proposed by Artuur Leeuwenberg, Mihaela Vela, Jon Dehdari, Josef van Genabith
To make a proper link in the doc, please use

by `Artuur Leeuwenberg, ... <https://ufal.mff.cuni.cz/pbml/105/art-leeuwenberg-et-al.pdf>`_

Okay.. Done.
gensim/models/keyedvectors.py
Outdated
        Parameters
        ----------
        wa: str
            word for which we have to look top-n similar word.
Sentence should start with an uppercase letter.

Done.
gensim/models/keyedvectors.py
Outdated
        """
        sims = self.similar_by_word(wa, topn)
        assert sims, "Failed code invariant: list of similar words must never be empty."
        rcs = (self.similarity(wa, wb)) / (sum(result[1] for result in sims))
No need to wrap the left part with ().

Actually, prepend float if this is meant to be a float division. Both to avoid potential errors due to integer operands in python2, and to make the intent clear. Also, can you please unpack result into appropriately named variables, instead of writing result[1]?
- rcs = (self.similarity(wa, wb)) / (sum(result[1] for result in sims))
+ rcs = float(self.similarity(wa, wb)) / sum(sim for _, sim in sims)
@menshikh-iv cool! You have to teach me how to do that :)
Done.
gensim/test/test_keyedvectors.py
Outdated
    def test_relative_cosine_similarity(self):
        """Test relative_cosine_similarity returns expected results with an input of a word pair and topn"""
        wordnet_syn = ['good', 'goodness', 'commodity', 'trade_good', 'full', 'estimable', 'honorable',
                       'respectable', 'beneficial', 'just', 'upright', 'adept', 'expert', 'practiced', 'proficient',
Format it properly, please, like

wordnet_syn = [
    'good', 'goodness', 'commodity', 'trade_good', 'full', 'estimable', 'honorable',
    'respectable', 'beneficial', 'just', 'upright', 'adept', 'expert', 'practiced', 'proficient',
    'skillful', 'skilful', 'dear', 'near', 'dependable', 'safe', 'secure', 'right', 'ripe', 'well',
    'effective', 'in_effect', 'in_force', 'serious', 'sound', 'salutary', 'honest', 'undecomposed',
    'unspoiled', 'unspoilt', 'thoroughly', 'soundly',
]

Done.
gensim/models/keyedvectors.py
Outdated
@@ -1385,6 +1385,36 @@ def init_sims(self, replace=False):
            logger.info("precomputing L2-norms of word weight vectors")
            self.vectors_norm = _l2_norm(self.vectors, replace=replace)

    def relative_cosine_similarity(self, wa, wb, topn=10):
        """Compute the relative cosine similarity between two words given top-n similar words,
        by Artuur Leeuwenberg, ... "A Minimally Supervised Approach for Synonym Extraction with Word Embeddings"
Should be a proper link rendered by sphinx, in this case

by `Artuur ..... <https://ufal.mff.cuni.cz/pbml/105/art-leeuwenberg-et-al.pdf>`_

Done. 👍
No, this was an example, still incorrect (+ still missing ` and _)

by `Artur <another authors, paper name> <URL_LINK>`_

Do I have to include the other authors and the paper name in < > and add the link in a separate < >?
Just replace with this:

`Artuur Leeuwenberga, Mihaela Velab, Jon Dehdaribc, Josef van Genabithbc "A Minimally Supervised Approach for Synonym Extraction with Word Embeddings" <https://ufal.mff.cuni.cz/pbml/105/art-leeuwenberg-et-al.pdf>`_.
Thanks 😄
@gojomo how do you feel about this PR? Looks good to me; waiting for approval from you (or notes about what still should be fixed).

Though the difficulty in demonstrating/testing the value of this calculation has me more doubtful than initially of its value, I suspect the simple implementation is correct, and the test is OK/passing, and there's little risk of harm to users who don't use it, so merging is OK with me!

congratz @rsdel2007 👍
Fixes #2175.
I have implemented relative_cosine_similarity as a function according to the paper, and as @gojomo suggested in the #2175 discussion.
According to the paper:
rcs(top-n) = cosine_similarity(wordA, wordB) / (sum of cosine_similarities of the top-n words most similar to wordA)
For finding the top-n similar words I have used the method similar_by_word(word, topn).
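A self-contained sketch of that formula with toy values; the `relative_cosine_similarity` helper below is an illustrative re-implementation (not gensim's actual method), and `similar` stands in for a `similar_by_word(word, topn)` result.

```python
def relative_cosine_similarity(cos_ab, similar):
    """rcs(top-n) = cosine_similarity(wordA, wordB) / sum of the top-n
    cosine similarities to wordA. Inputs here are made up for illustration."""
    return cos_ab / sum(cos for _, cos in similar)

# Hypothetical (word, cosine_similarity) pairs for wordA's top-n neighbours.
similar = [("great", 0.81), ("nice", 0.79), ("fine", 0.60)]
rcs = relative_cosine_similarity(0.79, similar)  # 0.79 / (0.81 + 0.79 + 0.60)
```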