Added Function relative_cosine_similarity in keyedvectors.py #2307

Merged on Jan 15, 2019 (18 commits)
Changes from 2 commits
35 changes: 35 additions & 0 deletions gensim/models/keyedvectors.py
@@ -1384,7 +1384,42 @@ def init_sims(self, replace=False):
            else:
                self.vectors_norm = (self.vectors / sqrt((self.vectors ** 2).sum(-1))[..., newaxis]).astype(REAL)

    def relative_cosine_similarity(self, wa, wb, topn=10):
        """Compute the relative cosine similarity between two words given top-n similar words,
        proposed by Artuur Leeuwenberg,Mihaela Vela,Jon Dehdari,Josef van Genabith
Collaborator

Spaces after commas.

Contributor Author

Done.

        "A Minimally Supervised Approach for Synonym Extraction with Word Embeddings"
        <https://ufal.mff.cuni.cz/pbml/105/art-leeuwenberg-et-al.pdf>.
        To calculate relative cosine similarity between two words, equation (1) of the paper is used.
        For WordNet synonyms, if rcs(topn=10) is greater than 0.10 than wa and wb are more similar than
Collaborator

Second of three 'than's on this line should actually be 'then' (consequently) not 'than' (comparative).

Contributor Author

Yeah...Thanks:).

        any arbitrary word pairs.
        Parameters
        ----------
        wa: str
            word for which we have to look top-n similar word.
        wb: str
            word for which we evaluating relative cosine similarity with wa.
        topn: int, optional
            Number of top-n similar words to look with respect to wa.
        Returns
        -------
        numpy.float64
            relative cosine similarity between wa and wb.
        """

        result = self.similar_by_word(wa, topn)
Collaborator

As this is a list of results, using a plural variable name would be slightly better. Also, it's common in the existing gensim code to call the list-of-most-similar-items sims (short for 'similars'), so I'd recommend that variable name here.

Contributor Author

Done. sims is used as variable name.

        topn_words = []
Collaborator

topn_words is never needed to calculate results - so no good reason to create.

Contributor Author

Okay...Done.

        topn_cosine = []
        for i in range(topn):
            topn_words.append(result[i][0])
            topn_cosine.append(result[i][1])

        topn_cosine = np.array(topn_cosine)

        norm = np.sum(topn_cosine)
Collaborator

norm isn't a good name here, as it usually means something other than a sum.

But, there's not really a need to loop-append, convert-to-np-array, or put the sum calculation in a local variable. The sum can be a short, idiomatic calculation at the place where it's needed as the denominator of the final return-value calculation, for example just: sum(result[1] for result in sims).

Contributor Author

Done.


        rcs = (self.similarity(wa, wb)) / norm

        return rcs
Collaborator

Need blank line before next method.

Contributor Author

Done.
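
For reference, the review suggestions in the threads above (name the result list `sims`, drop the unused `topn_words`, and compute the sum inline as the denominator) could be combined roughly as follows. This is only a sketch, not the code that was merged: `ToyKeyedVectors` and `cosine` are hypothetical stand-ins written so the example runs without gensim, and only `similarity`, `similar_by_word` and `relative_cosine_similarity` mirror names that appear in the diff. The returned value is the cosine similarity of `wa` and `wb` divided by the sum of the cosine similarities between `wa` and its `topn` most similar words.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


class ToyKeyedVectors:
    """Hypothetical stand-in for a keyed-vectors class; not gensim's implementation."""

    def __init__(self, vectors):
        self.vectors = vectors  # dict mapping word -> list of floats

    def similarity(self, wa, wb):
        return cosine(self.vectors[wa], self.vectors[wb])

    def similar_by_word(self, word, topn=10):
        # List of (other_word, cosine) pairs, most similar first, excluding the word itself.
        sims = [(other, self.similarity(word, other)) for other in self.vectors if other != word]
        sims.sort(key=lambda pair: pair[1], reverse=True)
        return sims[:topn]

    def relative_cosine_similarity(self, wa, wb, topn=10):
        # Plural name, as suggested in review; no intermediate lists or arrays.
        sims = self.similar_by_word(wa, topn)
        if not sims:
            raise ValueError("no similar words found for %r" % wa)
        # The sum is computed inline, right where it is needed as the denominator.
        return self.similarity(wa, wb) / sum(sim for _, sim in sims)


# Usage with made-up two-dimensional vectors.
kv = ToyKeyedVectors({"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0], "d": [3.0, 4.0]})
print(kv.relative_cosine_similarity("a", "b", topn=2))
```

The generator expression avoids the loop-append and array conversion that the reviewer flagged, and it sums however many results `similar_by_word` returns instead of indexing `range(topn)`, so a vocabulary smaller than `topn` does not raise an `IndexError`.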

class Word2VecKeyedVectors(WordEmbeddingsKeyedVectors):
    """Mapping between words and vectors for the :class:`~gensim.models.Word2Vec` model.
    Used to perform operations on the vectors such as vector lookup, distance, similarity etc.