Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Fix similarity bug in NMSLIB indexer + documentation fixes #2899

Merged
merged 14 commits into from
Jul 30, 2020
3 changes: 2 additions & 1 deletion gensim/models/doc2vec.py
Original file line number Diff line number Diff line change
Expand Up @@ -747,7 +747,8 @@ def save_word2vec_format(self, fname, doctag_vec=False, word_vec=True, prefix='*

@deprecated(
"Gensim 4.0.0 implemented internal optimizations that make calls to init_sims() unnecessary. "
"init_sims() is now obsoleted and will be completely removed in future versions."
"init_sims() is now obsoleted and will be completely removed in future versions. "
"See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#init_sims"
)
def init_sims(self, replace=False):
"""
Expand Down
3 changes: 2 additions & 1 deletion gensim/models/fasttext.py
Original file line number Diff line number Diff line change
Expand Up @@ -750,7 +750,8 @@ def train(self, corpus_iterable=None, corpus_file=None, total_examples=None, tot

@deprecated(
"Gensim 4.0.0 implemented internal optimizations that make calls to init_sims() unnecessary. "
"init_sims() is now obsoleted and will be completely removed in future versions."
"init_sims() is now obsoleted and will be completely removed in future versions. "
"See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#init_sims"
)
def init_sims(self, replace=False):
"""
Expand Down
32 changes: 21 additions & 11 deletions gensim/models/keyedvectors.py
Original file line number Diff line number Diff line change
Expand Up @@ -165,19 +165,18 @@
import sys
import itertools
import warnings
from itertools import chain
from numbers import Integral

from numpy import dot, float32 as REAL, \
double, array, zeros, vstack, \
ndarray, sum as np_sum, prod, argmax, dtype, ascontiguousarray, \
frombuffer
from numpy import (
dot, float32 as REAL, double, array, zeros, vstack,
ndarray, sum as np_sum, prod, argmax, dtype, ascontiguousarray, frombuffer,
)
import numpy as np
from scipy import stats

from gensim import utils, matutils # utility fnc for pickling, common scipy operations etc
from gensim.corpora.dictionary import Dictionary
from gensim.utils import deprecated
from scipy import stats


logger = logging.getLogger(__name__)
Expand Down Expand Up @@ -519,9 +518,17 @@ def rank(self, key1, key2):
"""Rank of the distance of `key2` from `key1`, in relation to distances of all keys from `key1`."""
return len(self.closer_than(key1, key2)) + 1

# backward compatibility; some would be annotated `@deprecated` if that stacked with @property/.setter
@property
def vectors_norm(self):
raise ValueError(
"The vectors_norm attribute became a get_normed_vectors() method in Gensim 4.0.0. "
"See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#init_sims"
)

def get_normed_vectors(self):
# TODO: what's the way for users to get from a matrix index (integer) to the
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gojomo FYI

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_vector(key_or_int, use_norm=True) returns an individual normed vector, by key or raw integer slot. If for some reason a user wants a KeyedVectors that only has the normed vectors, the method unit_normalize_all() (with a suitable warning about the destructiveness of this operation) is available below. (That thus matches the previous init_sims(replace=True) behavior, for those that really need it, but I believe most users of that call were just trying to save memory in a way that's no longer necessary.)

We could add a convenience method for creating a separate new KV instance with unit-normed vectors, but as that goes above-and-beyond prior behavior, & I don't offhand know anyplace that's necessary, I'd rather not bloat this class with more "maybe someone somewhere needs it" functions. (I'd much rather move functions out, first, as per concerns described in #2873.) And if this were a real need, the right way to support it would likely be a generic deep-copy of the KV followed by a unit_normalize_all() on the new KV.

Copy link
Owner Author

@piskvorky piskvorky Jul 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The idea here was to fuse unit_normalize_all() with get_normed_vectors(): have a single function which returns a new KV instance, normalized.

I think that's cleaner than several in-situ modifiers/getters, more functional less state, but it's true the use-case is quite hypothetical.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_vector(key_or_int, use_norm=True) returns an individual normed vector, by key or raw integer slot

Hmm, I don't see that in the code. How does the "raw integer slot" option work?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the int option from the docstring, because I don't think it exists.

# corresponding key (string)?
# Shouldn't we return this as a mapping (dict), or even a new KeyedVectors instance?
self.fill_norms()
return self.vectors / self.norms[..., np.newaxis]

Expand Down Expand Up @@ -1215,8 +1222,8 @@ def evaluate_word_analogies(self, analogies, restrict_vocab=300000, case_insensi

total = {
'section': 'Total accuracy',
'correct': list(chain.from_iterable(s['correct'] for s in sections)),
'incorrect': list(chain.from_iterable(s['incorrect'] for s in sections)),
'correct': list(itertools.chain.from_iterable(s['correct'] for s in sections)),
'incorrect': list(itertools.chain.from_iterable(s['incorrect'] for s in sections)),
}

oov_ratio = float(oov) / quadruplets_no * 100
Expand Down Expand Up @@ -1345,7 +1352,10 @@ def evaluate_word_pairs(self, pairs, delimiter='\t', restrict_vocab=300000,
self.log_evaluate_word_pairs(pearson, spearman, oov_ratio, pairs)
return pearson, spearman, oov_ratio

@deprecated("use fill_norms() instead")
@deprecated(
"Use fill_norms() instead. "
"See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#init_sims"
)
def init_sims(self, replace=False):
"""Precompute data helpful for bulk similarity calculations.

Expand Down Expand Up @@ -1454,7 +1464,7 @@ def save_word2vec_format(self, fname, fvocab=None, binary=False, total_vec=None,
if not (i == val):
break
index_id_count += 1
keys_to_write = chain(range(0, index_id_count), store_order_vocab_keys)
keys_to_write = itertools.chain(range(0, index_id_count), store_order_vocab_keys)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Elsewhere, I think I've followed a growing convention I've seen elsewhere to consider itertools as a package with a 'well known standard abbreviation' of it, so import itertools as it. I'd prefer it if gensim's project guide specifically recommended this for a small set of packages: np, pd, it, and maybe pkl, as it matches wider practices & simplifies import management. (That's especially true for numpy & itertools, where the prior practice of naming exactly which functions were used meant a jump-around just-typing burden of changing the imports every time some single function was used or removed-from-use.) But whatever project style can be unambiguously declared somewhere authoritative is OK.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No preference from me. This one I changed merely because I saw two side-by-side imports import itertools; from itertools import chain, which looked weird, so I merged them into a single import itertools.


with utils.open(fname, mode) as fout:
if write_header:
Expand Down
3 changes: 2 additions & 1 deletion gensim/models/word2vec.py
Original file line number Diff line number Diff line change
Expand Up @@ -809,7 +809,8 @@ def update_weights(self):

@deprecated(
"Gensim 4.0.0 implemented internal optimizations that make calls to init_sims() unnecessary. "
"init_sims() is now obsoleted and will be completely removed in future versions."
"init_sims() is now obsoleted and will be completely removed in future versions. "
"See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#init_sims"
)
def init_sims(self, replace=False):
piskvorky marked this conversation as resolved.
Show resolved Hide resolved
"""
Expand Down