Implement Levenshtein term similarity matrix and fast SCM between corpora #2016

Merged
merged 66 commits into from
Jan 14, 2019
Changes from 63 commits
ccadc8d
Wrap docstring for WordEmbeddingsKeyedVectors.similarity_matrix
Witiko Mar 26, 2018
517bcc8
Add the gensim.models.levenshtein module
Witiko Mar 26, 2018
e71b6ff
Add projected density to term similarity matrix logs
Witiko Mar 27, 2018
b8425af
Add tests for the gensim.models.levenshtein.similarity_matrix function
Witiko Apr 3, 2018
80c13ef
Separate similarity_matrix methods into director and builder classes.
Witiko Apr 4, 2018
6f6cdb7
Add symmetric parameter to SparseTermSimilarityMatrix
Witiko Apr 4, 2018
7274fac
Add corpus support to SparseTermSimilarityMatrix.inner_product
Witiko Apr 4, 2018
27e76b8
Replace scipy.sparse.dok_matrix.has_key with the in operator
Witiko Apr 5, 2018
739383a
Fix handling of unicode in Python 3 in levsim
Witiko Apr 5, 2018
9ecae3c
Remove temporary method similarity of LevenshteinSimilarityIndex
Witiko Apr 5, 2018
49a2160
Move models.term_similarity, and levenshtein to similarities
Witiko Apr 11, 2018
c5669fc
Make python-Levenshtein a conditional import
Witiko Apr 11, 2018
7b774dd
Add default values to gensim.similarities.levenshtein.levsim arguments
Witiko Apr 11, 2018
2e8d4fa
Remove extraneous addition operators from @deprecated annotations
Witiko Apr 11, 2018
a6e295f
Remove @deprecated annotation from tests
Witiko Apr 11, 2018
13948dc
Merge test_term_similarity, and test_levenshtein with test_similarities
Witiko Apr 11, 2018
a9706de
Reword TermSimilarityIndex docstring
Witiko Apr 11, 2018
5e3e948
Consume no more than topn similarities produced by a TermSimilarityIndex
Witiko Apr 11, 2018
4b895ff
Use short uints (<64b) for dok_matrix keys and num_nonzero array
Witiko Apr 12, 2018
5c100a9
Write to matrix_nonzero only when building a symmetric matrix
Witiko Apr 16, 2018
0efed5e
Ensure UniformTermSimilarityIndex does not yield only topn - 1 values
Witiko Apr 16, 2018
0c3549b
Document _shortest_uint_dtype
Witiko Apr 16, 2018
ee33db8
Add soft cosine measure benchmark, part 1
Witiko Apr 22, 2018
da6e6dd
Add soft cosine measure benchmark, part 2
Witiko Apr 23, 2018
d4053b2
Make similarity_matrix support non-contiguous dictionaries
Witiko May 13, 2018
093d569
Support fast inner product between a document and a corpus
Witiko May 20, 2018
c2888b4
Support fast inner product between a document and a corpus (python 2.7)
Witiko May 20, 2018
32cb4d7
Add faster sparse matrix slicing
Witiko Jul 1, 2018
099d768
Make Soft Cosine Measure support non-contiguous dictionaries
Witiko Jul 1, 2018
dd4561d
Merge remote-tracking branch 'upstream/develop' into levenshtein-soft…
Witiko Jul 1, 2018
c8f6ef5
Remove gensim::similarities::levenshtein::similarity_matrix facade
Witiko Jul 1, 2018
8f026cc
Implement SoftCosineSimilarity using the inner_product method
Witiko Jul 1, 2018
227d09e
Fix flake8 warnings
Witiko Jul 1, 2018
9f8d0e8
Make Soft Cosine Measure support non-contiguous dictionaries (cont)
Witiko Jul 1, 2018
c316b95
Remove parallelization in gensim::similarities::levenshtein
Witiko Jul 2, 2018
d6b9bd4
Document future work
Witiko Jul 2, 2018
5e52477
Update Soft Cosine Measure benchmark after commits 093d569, and c316b95
Witiko Jul 12, 2018
4b46597
Update SCM tutorial after PR 2016
Witiko Jul 12, 2018
ce95fd9
Add example to gensim::similarities::termsim::SparseTermSimilarityMatrix
Witiko Jul 12, 2018
f8ff4c7
Merge remote-tracking branch 'upstream/develop' into levenshtein-soft…
Witiko Jul 12, 2018
ac60615
Add max_distance kwarg to gensim::similarities::levenshtein::levsim
Witiko Jul 13, 2018
5154569
Replace max_distance kwarg in levsim with min_similarity, add tests
Witiko Jul 22, 2018
729d185
Remove conditional expression from levsim
Witiko Jul 23, 2018
155dc58
Use less confusing wording in docsting for min_similarity / max_distance
Witiko Jul 23, 2018
7e52ef8
Defer thresholding in LevenshteinSimilarityIndex.most_similar to levsim
Witiko Jul 23, 2018
3866bc9
Merge remote-tracking branch 'upstream/develop' into levenshtein-soft…
Witiko Jul 30, 2018
a7ee779
Allow None value of nonzero_limit parameter in SparseTermSimilarityMa…
Witiko Aug 16, 2018
e4395e0
Add positive_definite parameter to SparseTermSimilarityMatrix
Witiko Aug 16, 2018
98f3f3d
Split test_building test into a number of atomic unit tests
Witiko Aug 16, 2018
2a55786
Presort dictionary keys in UniformTermSimilarityIndex constructor
Witiko Aug 17, 2018
4d8dc48
Make documentation of SparseTermSimilarityMatrix more accurate
Witiko Aug 25, 2018
d7fd3f1
Make SparseTermSimilarityMatrix expect negative similarities
Witiko Aug 25, 2018
46a477e
Avoid expensive array copying in dot_product
Witiko Sep 9, 2018
583c9c7
Update SCM tutorial, and benchmark after PR 2016
Witiko Sep 11, 2018
4f26de0
Merge branch 'develop' into levenshtein-softcossim
Witiko Sep 11, 2018
4d8338e
Merge remote-tracking branch 'upstream/develop' into levenshtein-soft…
Witiko Jan 9, 2019
1cc4a49
Remove fluff from stderr in the SCM tutorial notebook
Witiko Jan 11, 2019
9ede310
Add a paper reference to the SCM tutorial notebook
Witiko Jan 11, 2019
c523aa5
Directly import Levenshtein package in levdist
Witiko Jan 11, 2019
e031630
Use embedded URI instead of indirect hyperlink target in documentation
Witiko Jan 11, 2019
19bedf1
Assume that max of lens is always an integer
Witiko Jan 11, 2019
83a07af
Make LevenshteinSimilarityIndex.most_similar easier to read
Witiko Jan 11, 2019
f3258d9
Merge remote-tracking branch 'upstream/develop' into levenshtein-soft…
Witiko Jan 11, 2019
16ff7ef
Make LevenshteinSimilarityIndex.most_similar easier to read
Witiko Jan 12, 2019
12ee910
Add an ordering test for LevenshteinSimilarityIndex.most_similar
Witiko Jan 12, 2019
3f04940
Make WordEmbeddingSimilarityIndex.most_similar easier to read
Witiko Jan 12, 2019
4,605 changes: 4,605 additions & 0 deletions docs/notebooks/soft_cosine_benchmark.ipynb

Large diffs are not rendered by default.

125 changes: 72 additions & 53 deletions docs/notebooks/soft_cosine_tutorial.ipynb

Large diffs are not rendered by default.

10 changes: 8 additions & 2 deletions gensim/matutils.py
@@ -14,6 +14,7 @@
import math

from gensim import utils
from gensim.utils import deprecated

import numpy as np
import scipy.sparse
@@ -796,6 +797,9 @@ def cossim(vec1, vec2):
return result


@deprecated(
"Function will be removed in 4.0.0, use "
"gensim.similarities.termsim.SparseTermSimilarityMatrix.inner_product instead")
def softcossim(vec1, vec2, similarity_matrix):
"""Get Soft Cosine Measure between two vectors given a term similarity matrix.

@@ -816,8 +820,10 @@ def softcossim(vec1, vec2, similarity_matrix):
vec2 : list of (int, float)
A document vector in the BoW format.
similarity_matrix : {:class:`scipy.sparse.csc_matrix`, :class:`scipy.sparse.csr_matrix`}
A term similarity matrix, typically produced by
:meth:`~gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.similarity_matrix`.
A term similarity matrix. If the matrix is :class:`scipy.sparse.csr_matrix`, it is going
to be transposed. If you rely on the fact that there is at most a constant number of
non-zero elements in a single column, it is your responsibility to ensure that the matrix
is symmetric.

Returns
-------
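For readers unfamiliar with the measure that `softcossim` computes, here is a minimal sketch of the Soft Cosine Measure for two BoW vectors, written directly against NumPy and SciPy. The helper names `bow_to_dense` and `soft_cosine` are hypothetical and are not part of gensim; the sketch densifies the documents for clarity, which the library code does not need to do.

```python
import numpy as np
from scipy import sparse


def bow_to_dense(bow, num_terms):
    """Turn a list of (term_id, weight) pairs into a dense vector."""
    vec = np.zeros(num_terms)
    for term_id, weight in bow:
        vec[term_id] = weight
    return vec


def soft_cosine(bow1, bow2, term_similarity):
    """Compute x^T S y / (sqrt(x^T S x) * sqrt(y^T S y)) for two BoW documents."""
    num_terms = term_similarity.shape[0]
    x = bow_to_dense(bow1, num_terms)
    y = bow_to_dense(bow2, num_terms)
    xsy = x @ (term_similarity @ y)
    xsx = x @ (term_similarity @ x)
    ysy = y @ (term_similarity @ y)
    if xsx <= 0 or ysy <= 0:
        return 0.0  # empty document, or a matrix that is not positive definite
    return float(xsy / (np.sqrt(xsx) * np.sqrt(ysy)))


# Two one-word documents whose words are 50% similar.
S = sparse.csc_matrix(np.array([[1.0, 0.5], [0.5, 1.0]]))
score = soft_cosine([(0, 1.0)], [(1, 1.0)], S)
```

With an identity matrix in place of `S`, the same function reduces to the ordinary cosine similarity.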
2 changes: 1 addition & 1 deletion gensim/models/__init__.py
@@ -13,7 +13,7 @@
from .logentropy_model import LogEntropyModel # noqa:F401
from .word2vec import Word2Vec # noqa:F401
from .doc2vec import Doc2Vec # noqa:F401
from .keyedvectors import KeyedVectors # noqa:F401
from .keyedvectors import KeyedVectors, WordEmbeddingSimilarityIndex # noqa:F401
from .ldamulticore import LdaMulticore # noqa:F401
from .phrases import Phrases # noqa:F401
from .normmodel import NormModel # noqa:F401
140 changes: 64 additions & 76 deletions gensim/models/keyedvectors.py
@@ -160,7 +160,6 @@

from __future__ import division # py3 "true division"

from collections import deque
from itertools import chain
import logging

@@ -173,11 +172,12 @@
double, array, zeros, vstack, sqrt, newaxis, integer, \
ndarray, sum as np_sum, prod, argmax
import numpy as np

from gensim import utils, matutils # utility fnc for pickling, common scipy operations etc
from gensim.corpora.dictionary import Dictionary
from six import string_types, integer_types
from six.moves import zip, range
from scipy import sparse, stats
from scipy import stats
from gensim.utils import deprecated
from gensim.models.utils_any2vec import (
_save_word2vec_format,
@@ -186,6 +186,7 @@
_ft_hash,
_ft_hash_broken
)
from gensim.similarities.termsim import TermSimilarityIndex, SparseTermSimilarityMatrix

logger = logging.getLogger(__name__)

@@ -606,6 +607,9 @@ def similar_by_vector(self, vector, topn=10, restrict_vocab=None):
"""
return self.most_similar(positive=[vector], topn=topn, restrict_vocab=restrict_vocab)

@deprecated(
"Method will be removed in 4.0.0, use "
"gensim.models.keyedvectors.WordEmbeddingSimilarityIndex instead")
def similarity_matrix(self, dictionary, tfidf=None, threshold=0.0, exponent=2.0, nonzero_limit=100, dtype=REAL):
"""Construct a term similarity matrix for computing Soft Cosine Measure.

@@ -615,24 +619,21 @@ def similarity_matrix(self, dictionary, tfidf=None, threshold=0.0, exponent=2.0,
Parameters
----------
dictionary : :class:`~gensim.corpora.dictionary.Dictionary`
A dictionary that specifies a mapping between words and the indices of rows and columns
of the resulting term similarity matrix.
tfidf : :class:`gensim.models.tfidfmodel.TfidfModel`, optional
A model that specifies the relative importance of the terms in the dictionary. The rows
of the term similarity matrix will be build in a decreasing order of importance of terms,
or in the order of term identifiers if None.
A dictionary that specifies the considered terms.
tfidf : :class:`gensim.models.tfidfmodel.TfidfModel` or None, optional
A model that specifies the relative importance of the terms in the dictionary. The
columns of the term similarity matrix will be built in a decreasing order of importance
of terms, or in the order of term identifiers if None.
threshold : float, optional
Only pairs of words whose embeddings are more similar than `threshold` are considered
when building the sparse term similarity matrix.
Only embeddings more similar than `threshold` are considered when retrieving word
embeddings closest to a given word embedding.
exponent : float, optional
The exponent applied to the similarity between two word embeddings when building the term similarity matrix.
Take the word embedding similarities larger than `threshold` to the power of `exponent`.
nonzero_limit : int, optional
The maximum number of non-zero elements outside the diagonal in a single row or column
of the term similarity matrix. Setting `nonzero_limit` to a constant ensures that the
time complexity of computing the Soft Cosine Measure will be linear in the document
length rather than quadratic.
The maximum number of non-zero elements outside the diagonal in a single column of the
sparse term similarity matrix.
dtype : numpy.dtype, optional
Data-type of the term similarity matrix.
Data-type of the sparse term similarity matrix.

Returns
-------
@@ -654,66 +655,10 @@ def similarity_matrix(self, dictionary, tfidf=None, threshold=0.0, exponent=2.0,
<http://www.aclweb.org/anthology/S/S17/S17-2051.pdf>`_.

"""
logger.info("constructing a term similarity matrix")
matrix_order = len(dictionary)
matrix_nonzero = [1] * matrix_order
matrix = sparse.identity(matrix_order, dtype=dtype, format="dok")
num_skipped = 0
# Decide the order of rows.
if tfidf is None:
word_indices = deque(sorted(dictionary.keys()))
else:
assert max(tfidf.idfs) < matrix_order
word_indices = deque([
index for index, _
in sorted(tfidf.idfs.items(), key=lambda x: (x[1], -x[0]), reverse=True)
])

# Traverse rows.
for row_number, w1_index in enumerate(list(word_indices)):
word_indices.popleft()
if row_number % 1000 == 0:
logger.info(
"PROGRESS: at %.02f%% rows (%d / %d, %d skipped, %.06f%% density)",
100.0 * (row_number + 1) / matrix_order, row_number + 1, matrix_order,
num_skipped, 100.0 * matrix.getnnz() / matrix_order**2)
w1 = dictionary[w1_index]
if w1 not in self.vocab:
num_skipped += 1
continue # A word from the dictionary is not present in the word2vec model.

# Traverse upper triangle columns.
if matrix_order <= nonzero_limit + 1: # Traverse all columns.
columns = (
(w2_index, self.similarity(w1, dictionary[w2_index]))
for w2_index in word_indices
if dictionary[w2_index] in self.vocab)
else: # Traverse only columns corresponding to the embeddings closest to w1.
num_nonzero = matrix_nonzero[w1_index] - 1
columns = (
(dictionary.token2id[w2], similarity)
for _, (w2, similarity)
in zip(
range(nonzero_limit - num_nonzero),
self.most_similar(positive=[w1], topn=nonzero_limit - num_nonzero)
)
if w2 in dictionary.token2id
)
columns = sorted(columns, key=lambda x: x[0])

for w2_index, similarity in columns:
# Ensure that we don't exceed `nonzero_limit` by mirroring the upper triangle.
if similarity > threshold and matrix_nonzero[w2_index] <= nonzero_limit:
element = similarity**exponent
matrix[w1_index, w2_index] = element
matrix_nonzero[w1_index] += 1
matrix[w2_index, w1_index] = element
matrix_nonzero[w2_index] += 1
logger.info(
"constructed a term similarity matrix with %0.6f %% nonzero elements",
100.0 * matrix.getnnz() / matrix_order**2
)
return matrix.tocsc()
index = WordEmbeddingSimilarityIndex(self, threshold=threshold, exponent=exponent)
similarity_matrix = SparseTermSimilarityMatrix(
index, dictionary, tfidf=tfidf, nonzero_limit=nonzero_limit, dtype=dtype)
return similarity_matrix.matrix
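The removed loop above and its replacement both build a symmetric sparse matrix while capping the number of off-diagonal entries per column. The sketch below illustrates that greedy idea only, under stated assumptions: `ToyTermSimilarityIndex` and `build_symmetric_matrix` are hypothetical names, term ids are assumed to be contiguous from zero, and the real `SparseTermSimilarityMatrix` constructor (not shown in this diff) may differ in its details.

```python
from scipy import sparse


class ToyTermSimilarityIndex:
    """Hypothetical stand-in for an index with a most_similar(term, topn) method."""

    def __init__(self, neighbours):
        self.neighbours = neighbours  # term -> [(term, similarity), ...], best first

    def most_similar(self, term, topn=10):
        return self.neighbours.get(term, [])[:topn]


def build_symmetric_matrix(index, id2token, nonzero_limit=100):
    """Greedily fill a symmetric matrix with at most nonzero_limit off-diagonal entries per column."""
    token2id = {token: term_id for term_id, token in id2token.items()}
    order = len(id2token)
    entries = {(i, i): 1.0 for i in range(order)}  # every term is fully similar to itself
    off_diagonal = [0] * order  # off-diagonal entries used so far, per column
    for t1_id in range(order):
        budget = nonzero_limit - off_diagonal[t1_id]
        if budget <= 0:
            continue  # earlier mirrored entries already filled this column
        for t2, similarity in index.most_similar(id2token[t1_id], topn=budget):
            t2_id = token2id.get(t2)
            if t2_id is None or t2_id == t1_id or (t1_id, t2_id) in entries:
                continue
            if off_diagonal[t2_id] >= nonzero_limit:
                continue  # the mirrored entry would overflow the other column
            entries[t1_id, t2_id] = entries[t2_id, t1_id] = similarity
            off_diagonal[t1_id] += 1
            off_diagonal[t2_id] += 1
    rows, cols = zip(*entries)
    data = list(entries.values())
    return sparse.coo_matrix((data, (rows, cols)), shape=(order, order)).tocsc()


index = ToyTermSimilarityIndex({
    "a": [("b", 0.9), ("c", 0.8)],
    "b": [("a", 0.9), ("c", 0.7)],
    "c": [("a", 0.8), ("b", 0.7)],
})
matrix = build_symmetric_matrix(index, {0: "a", 1: "b", 2: "c"}, nonzero_limit=1)
```

Because each accepted similarity is written to both `(t1, t2)` and `(t2, t1)`, the order in which terms are visited decides which pairs survive once columns fill up, which is why the `tfidf` parameter is used to visit important terms first.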

def wmdistance(self, document1, document2):
"""Compute the Word Mover's Distance between two documents.
@@ -1386,6 +1331,49 @@ def init_sims(self, replace=False):
self.vectors_norm = _l2_norm(self.vectors, replace=replace)


class WordEmbeddingSimilarityIndex(TermSimilarityIndex):
"""
Computes cosine similarities between word embeddings and retrieves the closest word embeddings
by cosine similarity for a given word embedding.

Parameters
----------
keyedvectors : :class:`~gensim.models.keyedvectors.WordEmbeddingsKeyedVectors`
The word embeddings.
threshold : float, optional
Only embeddings more similar than `threshold` are considered when retrieving word embeddings
closest to a given word embedding.
exponent : float, optional
Take the word embedding similarities larger than `threshold` to the power of `exponent`.
kwargs : dict or None
A dict with keyword arguments that will be passed to the `keyedvectors.most_similar` method
when retrieving the word embeddings closest to a given word embedding.

See Also
--------
:class:`~gensim.similarities.termsim.SparseTermSimilarityMatrix`
Build a term similarity matrix and compute the Soft Cosine Measure.

"""
def __init__(self, keyedvectors, threshold=0.0, exponent=2.0, kwargs=None):
assert isinstance(keyedvectors, WordEmbeddingsKeyedVectors)
self.keyedvectors = keyedvectors
self.threshold = threshold
self.exponent = exponent
self.kwargs = kwargs or {}
super(WordEmbeddingSimilarityIndex, self).__init__()

def most_similar(self, t1, topn=10):
if t1 not in self.keyedvectors.vocab:
logger.debug('an out-of-dictionary term "%s"', t1)
else:
for _, (t2, similarity) in zip(
range(topn), self.keyedvectors.most_similar(
positive=[t1], topn=topn, **self.kwargs)):
if similarity > self.threshold:
yield (t2, similarity**self.exponent)
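To make the thresholding and exponentiation in `most_similar` concrete, here is a self-contained sketch over a toy dictionary of NumPy vectors. The function name `most_similar_terms` and the toy embeddings are hypothetical; the real class delegates the nearest-neighbour search to `keyedvectors.most_similar` instead of scanning every vector.

```python
import numpy as np


def most_similar_terms(term, embeddings, topn=10, threshold=0.0, exponent=2.0):
    """Yield (term, cosine**exponent) for the topn nearest terms whose cosine exceeds threshold."""
    if term not in embeddings:
        return  # out-of-vocabulary terms yield nothing
    query = embeddings[term] / np.linalg.norm(embeddings[term])
    scored = [
        (other, float(query @ vector / np.linalg.norm(vector)))
        for other, vector in embeddings.items()
        if other != term
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    for other, similarity in scored[:topn]:
        if similarity > threshold:
            yield other, similarity ** exponent


embeddings = {
    "cat": np.array([1.0, 0.0]),
    "dog": np.array([1.0, 1.0]),
    "car": np.array([-1.0, 0.0]),
}
neighbours = list(most_similar_terms("cat", embeddings))
```

Filtering before exponentiation matters: with an even exponent, a negative cosine would otherwise turn into a positive similarity.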


class Word2VecKeyedVectors(WordEmbeddingsKeyedVectors):
"""Mapping between words and vectors for the :class:`~gensim.models.Word2Vec` model.
Used to perform operations on the vectors such as vector lookup, distance, similarity etc.
2 changes: 2 additions & 0 deletions gensim/similarities/__init__.py
@@ -4,3 +4,5 @@

# bring classes directly into package namespace, to save some typing
from .docsim import Similarity, MatrixSimilarity, SparseMatrixSimilarity, SoftCosineSimilarity, WmdSimilarity # noqa:F401
from .termsim import TermSimilarityIndex, UniformTermSimilarityIndex, SparseTermSimilarityMatrix # noqa:F401
from .levenshtein import LevenshteinSimilarityIndex # noqa:F401
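The `gensim/similarities/levenshtein.py` module itself is not rendered in this part of the diff. As a rough illustration of what a Levenshtein term similarity can look like, the sketch below uses a pure-Python edit distance and the form `alpha * (1 - distance / max_length) ** beta`. Both the formula and the default constants are assumptions on my part, and the names `levenshtein_distance` and `levsim_sketch` are hypothetical; the PR relies on the python-Levenshtein package rather than a hand-written distance.

```python
def levenshtein_distance(t1, t2):
    """Edit distance computed with the classic two-row dynamic programme."""
    previous = list(range(len(t2) + 1))
    for i, c1 in enumerate(t1, start=1):
        current = [i]
        for j, c2 in enumerate(t2, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (c1 != c2),  # substitution (free on a match)
            ))
        previous = current
    return previous[-1]


def levsim_sketch(t1, t2, alpha=1.8, beta=5.0):
    """Assumed form: alpha * (1 - distance / max(len(t1), len(t2))) ** beta."""
    max_length = max(len(t1), len(t2))
    if max_length == 0:
        return alpha  # two empty strings are identical
    distance = levenshtein_distance(t1, t2)
    return alpha * (1.0 - distance / max_length) ** beta


similarity = levsim_sketch("kitten", "sitting", alpha=1.0, beta=1.0)
```

With `alpha=1.0` and `beta=1.0` the value is simply one minus the normalised edit distance; a larger `beta` pushes dissimilar pairs towards zero faster.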
77 changes: 35 additions & 42 deletions gensim/similarities/docsim.py
@@ -77,6 +77,7 @@
import scipy.sparse

from gensim import interfaces, utils, matutils
from .termsim import SparseTermSimilarityMatrix
from six.moves import map, range, zip


@@ -272,8 +273,6 @@ class Similarity(interfaces.SimilarityABC):
Index similarity (dense with cosine distance).
:class:`~gensim.similarities.docsim.SparseMatrixSimilarity`
Index similarity (sparse with cosine distance).
:class:`~gensim.similarities.docsim.SoftCosineSimilarity`
Index similarity (with soft-cosine distance).
:class:`~gensim.similarities.docsim.WmdSimilarity`
Index similarity (with word-mover distance).

@@ -866,20 +865,18 @@ class SoftCosineSimilarity(interfaces.SimilarityABC):

>>> from gensim.test.utils import common_texts
>>> from gensim.corpora import Dictionary
>>> from gensim.models import Word2Vec
>>> from gensim.similarities import SoftCosineSimilarity
>>> from gensim.models import Word2Vec, WordEmbeddingSimilarityIndex
>>> from gensim.similarities import SoftCosineSimilarity, SparseTermSimilarityMatrix
>>>
>>> model = Word2Vec(common_texts, size=20, min_count=1) # train word-vectors
>>> termsim_index = WordEmbeddingSimilarityIndex(model.wv)
>>> dictionary = Dictionary(common_texts)
>>> bow_corpus = [dictionary.doc2bow(document) for document in common_texts]
>>> similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary) # construct similarity matrix
>>> docsim_index = SoftCosineSimilarity(bow_corpus, similarity_matrix, num_best=10)
>>>
>>> similarity_matrix = model.wv.similarity_matrix(dictionary) # construct similarity matrix
>>> index = SoftCosineSimilarity(bow_corpus, similarity_matrix, num_best=10)
>>>
>>> # Make a query.
>>> query = 'graph trees computer'.split()
>>> # calculate similarity between query and each doc from bow_corpus
>>> sims = index[dictionary.doc2bow(query)]
>>> query = 'graph trees computer'.split() # make a query
>>> sims = docsim_index[dictionary.doc2bow(query)] # calculate similarity of query to each doc from bow_corpus

Check out `Tutorial Notebook
<https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/soft_cosine_tutorial.ipynb>`_
@@ -893,24 +890,32 @@ def __init__(self, corpus, similarity_matrix, num_best=None, chunksize=256):
----------
corpus: iterable of list of (int, float)
A list of documents in the BoW format.
similarity_matrix : :class:`scipy.sparse.csc_matrix`
A term similarity matrix, typically produced by
:meth:`~gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.similarity_matrix`.
similarity_matrix : :class:`gensim.similarities.SparseTermSimilarityMatrix`
A term similarity matrix.
num_best : int, optional
The number of results to retrieve for a query, if None - return similarities with all elements from corpus.
chunksize: int, optional
Size of one corpus chunk.

See Also
--------
:meth:`gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.similarity_matrix`
A term similarity matrix produced from term embeddings.
:func:`gensim.matutils.softcossim`
The Soft Cosine Measure.
:class:`gensim.similarities.SparseTermSimilarityMatrix`
A sparse term similarity matrix built using a term similarity index.
:class:`gensim.similarities.LevenshteinSimilarityIndex`
A term similarity index that computes Levenshtein similarities between terms.
:class:`gensim.models.WordEmbeddingSimilarityIndex`
A term similarity index that computes cosine similarities between word embeddings.

"""
if scipy.sparse.issparse(similarity_matrix):
logger.warn(
"Support for passing an unencapsulated sparse matrix will be removed in 4.0.0, pass "
"a SparseTermSimilarityMatrix instance instead")
self.similarity_matrix = SparseTermSimilarityMatrix(similarity_matrix)
else:
self.similarity_matrix = similarity_matrix

self.corpus = corpus
self.similarity_matrix = similarity_matrix
self.num_best = num_best
self.chunksize = chunksize

@@ -943,31 +948,19 @@ def get_similarities(self, query):
Similarity matrix.

"""
if not self.corpus:
return numpy.array()

is_corpus, query = utils.is_corpus(query)
if not is_corpus:
if isinstance(query, numpy.ndarray):
# Convert document indexes to actual documents.
query = [self.corpus[i] for i in query]
else:
query = [query]

result = []
for query_document in query:
# Compute similarity for each query.
qresult = [matutils.softcossim(query_document, corpus_document, self.similarity_matrix)
for corpus_document in self.corpus]
qresult = numpy.array(qresult)

# Append single query result to list of all results.
result.append(qresult)

if is_corpus:
result = numpy.array(result)
else:
result = result[0]

return result
if not is_corpus and isinstance(query, numpy.ndarray):
query = [self.corpus[i] for i in query] # convert document indexes to actual documents
result = self.similarity_matrix.inner_product(query, self.corpus, normalized=True)

if scipy.sparse.issparse(result):
return numpy.asarray(result.todense())
if numpy.isscalar(result):
return numpy.array(result)
return numpy.asarray(result)[0]

def __str__(self):
return "%s<%i docs, %i features>" % (self.__class__.__name__, len(self), self.similarity_matrix.shape[0])
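The new `get_similarities` replaces a per-pair Python loop with a single call to `inner_product(query, self.corpus, normalized=True)`. The sketch below shows one way such a batched, normalised inner product can be expressed with sparse matrix products. It is an illustration under assumptions, not the library's implementation: `corpus_to_csc` and `normalized_inner_product` are hypothetical helpers.

```python
import numpy as np
from scipy import sparse


def corpus_to_csc(corpus, num_terms):
    """Stack BoW documents as the columns of a (num_terms x num_docs) sparse matrix."""
    rows, cols, data = [], [], []
    for doc_id, bow in enumerate(corpus):
        for term_id, weight in bow:
            rows.append(term_id)
            cols.append(doc_id)
            data.append(weight)
    return sparse.csc_matrix((data, (rows, cols)), shape=(num_terms, len(corpus)))


def normalized_inner_product(queries, corpus, term_similarity):
    """Return a dense (num_queries x num_docs) array of Soft Cosine Measures."""
    S = sparse.csc_matrix(term_similarity)
    X = corpus_to_csc(queries, S.shape[0])
    Y = corpus_to_csc(corpus, S.shape[0])
    # diag(X^T S X) equals the column sums of X * (S X), taken element-wise.
    x_norm = np.sqrt(np.asarray(X.multiply(S @ X).sum(axis=0)).ravel())
    y_norm = np.sqrt(np.asarray(Y.multiply(S @ Y).sum(axis=0)).ravel())
    x_norm[x_norm == 0] = 1.0  # leave all-zero documents at similarity 0
    y_norm[y_norm == 0] = 1.0
    scores = (X.T @ S @ Y).toarray()
    return scores / np.outer(x_norm, y_norm)


S = np.array([[1.0, 0.5], [0.5, 1.0]])
corpus = [[(0, 1.0)], [(1, 2.0)]]
result = normalized_inner_product([[(0, 1.0)]], corpus, S)
```

Computing the norms from column sums avoids materialising the full `X^T S X` and `Y^T S Y` products, of which only the diagonals are needed.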