-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] Fix similarity bug in NMSLIB indexer + documentation fixes #2899
Changes from 1 commit
b70c826
a81e547
78fe1c4
a0e40ca
4cf4da0
161ad55
31d2b87
c834e06
172e37f
3919b68
d40f685
872c8ed
4c1b3f7
b39eec2
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -165,19 +165,18 @@ | |
import sys | ||
import itertools | ||
import warnings | ||
from itertools import chain | ||
from numbers import Integral | ||
|
||
from numpy import dot, float32 as REAL, \ | ||
double, array, zeros, vstack, \ | ||
ndarray, sum as np_sum, prod, argmax, dtype, ascontiguousarray, \ | ||
frombuffer | ||
from numpy import ( | ||
dot, float32 as REAL, double, array, zeros, vstack, | ||
ndarray, sum as np_sum, prod, argmax, dtype, ascontiguousarray, frombuffer, | ||
) | ||
import numpy as np | ||
from scipy import stats | ||
|
||
from gensim import utils, matutils # utility fnc for pickling, common scipy operations etc | ||
from gensim.corpora.dictionary import Dictionary | ||
from gensim.utils import deprecated | ||
from scipy import stats | ||
|
||
|
||
logger = logging.getLogger(__name__) | ||
|
@@ -519,9 +518,17 @@ def rank(self, key1, key2): | |
"""Rank of the distance of `key2` from `key1`, in relation to distances of all keys from `key1`.""" | ||
return len(self.closer_than(key1, key2)) + 1 | ||
|
||
# backward compatibility; some would be annotated `@deprecated` if that stacked with @property/.setter | ||
@property | ||
def vectors_norm(self): | ||
raise ValueError( | ||
"The vectors_norm attribute became a get_normed_vectors() method in Gensim 4.0.0. " | ||
"See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#init_sims" | ||
) | ||
|
||
def get_normed_vectors(self): | ||
# TODO: what's the way for users to get from a matrix index (integer) to the | ||
# corresponding key (string)? | ||
# Shouldn't we return this as a mapping (dict), or even a new KeyedVectors instance? | ||
self.fill_norms() | ||
return self.vectors / self.norms[..., np.newaxis] | ||
|
||
|
@@ -1215,8 +1222,8 @@ def evaluate_word_analogies(self, analogies, restrict_vocab=300000, case_insensi | |
|
||
total = { | ||
'section': 'Total accuracy', | ||
'correct': list(chain.from_iterable(s['correct'] for s in sections)), | ||
'incorrect': list(chain.from_iterable(s['incorrect'] for s in sections)), | ||
'correct': list(itertools.chain.from_iterable(s['correct'] for s in sections)), | ||
'incorrect': list(itertools.chain.from_iterable(s['incorrect'] for s in sections)), | ||
} | ||
|
||
oov_ratio = float(oov) / quadruplets_no * 100 | ||
|
@@ -1345,7 +1352,10 @@ def evaluate_word_pairs(self, pairs, delimiter='\t', restrict_vocab=300000, | |
self.log_evaluate_word_pairs(pearson, spearman, oov_ratio, pairs) | ||
return pearson, spearman, oov_ratio | ||
|
||
@deprecated("use fill_norms() instead") | ||
@deprecated( | ||
"Use fill_norms() instead. " | ||
"See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#init_sims" | ||
) | ||
def init_sims(self, replace=False): | ||
"""Precompute data helpful for bulk similarity calculations. | ||
|
||
|
@@ -1454,7 +1464,7 @@ def save_word2vec_format(self, fname, fvocab=None, binary=False, total_vec=None, | |
if not (i == val): | ||
break | ||
index_id_count += 1 | ||
keys_to_write = chain(range(0, index_id_count), store_order_vocab_keys) | ||
keys_to_write = itertools.chain(range(0, index_id_count), store_order_vocab_keys) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Elsewhere, I think I've followed a growing convention I've seen elsewhere to consider There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. No preference from me. This one I changed merely because I saw two side-by-side imports |
||
|
||
with utils.open(fname, mode) as fout: | ||
if write_header: | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gojomo FYI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get_vector(key_or_int, use_norm=True)
returns an individual normed vector, by key or raw integer slot. If for some reason a user wants a KeyedVectors that only has the normed vectors, the methodunit_normalize_all()
(with a suitable warning about the destructiveness of this operation) is available below. (That thus matches the previousinit_sims(replace=True)
behavior, for those that really need it, but I believe most users of that call were just trying to save memory in a way that's no longer necessary.)We could add a convenience method for creating a separate new KV instance with unit-normed vectors, but as that goes above-and-beyond prior behavior, & I don't offhand know anyplace that's necessary, I'd rather not bloat this class with more "maybe someone somewhere needs it" functions. (I'd much rather move functions out, first, as per concerns described in #2873.) And if this were a real need, the right way to support it would likely be a generic deep-copy of the KV followed by a
unit_normalize_all()
on the new KV.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. The idea here was to fuse
unit_normalize_all()
withget_normed_vectors()
: have a single function which returns a new KV instance, normalized.I think that's cleaner than several in-situ modifiers/getters, more functional less state, but it's true the use-case is quite hypothetical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I don't see that in the code. How does the "raw integer slot" option work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the
int
option from the docstring, because I don't think it exists.