Implement Levenshtein term similarity matrix and fast SCM between corpora #2016
Conversation
gensim/matutils.py (Outdated)

```python
@@ -775,6 +776,9 @@ def cossim(vec1, vec2):
    return result


@deprecated(
    "Function will be removed in 4.0.0, use " +
```
nitpick: no need to use `+` for concatenation if this happens in `()`.
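For illustration, adjacent string literals inside parentheses are concatenated by the Python parser, so the explicit `+` is redundant (the deprecation target named below is just a placeholder, not the final wording):

```python
# Adjacent string literals inside parentheses are joined at compile time,
# so no runtime "+" concatenation is needed.
message = (
    "Function will be removed in 4.0.0, use "
    "gensim.similarities.termsim instead"  # placeholder target, not the final message
)
print(message)
```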
I will fix this once we figure out what to actually deprecate.
setup.py (Outdated)

```python
@@ -309,6 +309,7 @@ def finalize_options(self):
        'scipy >= 0.18.1',
        'six >= 1.5.0',
        'smart_open >= 1.2.1',
        'python-Levenshtein >= 0.10.2'
```
Sorry, but for adding a new core dependency we should have serious reasons.
CC: @piskvorky
I see 2 ways:
- implement this functionality yourself
- add it as a "conditional" import (and move it to test dependencies)
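For context, a conditional import could look roughly like this (a sketch only, not the actual gensim code; the error message and function placement are illustrative):

```python
try:
    from Levenshtein import distance as levenshtein_distance
except ImportError:
    levenshtein_distance = None


def levdist(t1, t2):
    """Compute the Levenshtein distance, failing with a clear message if the
    optional python-Levenshtein dependency is not installed."""
    if levenshtein_distance is None:
        raise ImportError(
            "Levenshtein distance requires the optional python-Levenshtein package; "
            "install it with: pip install python-Levenshtein"
        )
    return levenshtein_distance(t1, t2)
```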
This dependency might be temporary, since the current Levenshtein distance implementation is inefficient. We discussed a more efficient bulk Levenshtein distance implementation in #1955.
Yes, introducing new core dependencies is not desirable. Especially if it's only used in new, experimental modules.
I agree that naive implementations that just call `dist(s1, s2)` repeatedly are not very useful in practice, but it's good as a baseline and for checking regressions.
For "non-experimental" code we'd want something that takes the concrete problem constraints into account: pre-calculating static parts, tries, automata, indexes, early-out when the maximum acceptable distance is exceeded, etc.
gensim/test/test_similarities.py (Outdated)

```python
@@ -371,6 +372,7 @@ def testIter(self):


class TestSoftCosineSimilarity(unittest.TestCase, _TestSimilarityABC):
    @deprecated("Method will be removed in 4.0.0")
```
I don't think it is a good idea to deprecate something in tests. Anyway, if we remove old code, the tests will break and we'll see it clearly.
After we have decided what to actually deprecate, I will remove the annotations from tests. They are currently there just to make it clear what parts of the code are proposed for deprecation.
gensim/test/test_levenshtein.py (Outdated)

```python
@@ -0,0 +1,131 @@
#!/usr/bin/env python
```
why not place it in `test_similarities` (same for `test_term_similarity`)?
gensim/models/term_similarity.py (Outdated)

```python
logger = logging.getLogger(__name__)


class TermSimilarityIndex(SaveLoad):
```
I don't think that `gensim.models` is the right place for it (same for `levenshtein`); we have `gensim.similarities` for this kind of stuff (exception only for `KeyedVectors`).
CC: @piskvorky
The module placement is one of the things that I hoped we could discuss, since there is little documentation about where things should go. Both modules implement term similarity models; if you think they would feel more at home in `gensim.similarities`, then to `gensim.similarities` they will go.
Yes, I think it's better to place it in `gensim.similarities`, because
- this is purely about similarity (and we already have a submodule for it)
- this doesn't "quack" like a standard gensim model
gensim/models/term_similarity.py (Outdated)

```python
        self.dictionary = dictionary
        self.term_similarity = term_similarity

    def most_similar(self, t1, topn=10):
```
Looks like this isn't ready, am I right?
This is actually quite ready. `UniformTermSimilarityIndex` assigns a constant similarity to any pair of distinct words; the main use is for testing `SparseTermSimilarityMatrix`.
It is also quite useful for benchmarking the maximum throughput of the `SparseTermSimilarityMatrix` constructor.
Another use would be the construction of an identity term similarity matrix by setting `term_similarity` to zero.
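A minimal sketch of that idea (the import path and parameter names follow this thread and may differ from the final code):

```python
from gensim.corpora import Dictionary
# Assumed final location; in this PR the class still lives in gensim.models.term_similarity.
from gensim.similarities import SparseTermSimilarityMatrix, UniformTermSimilarityIndex

dictionary = Dictionary([["strong", "powerful", "weak"]])

# Every pair of distinct terms gets the same similarity (here 0.0),
# so the resulting term similarity matrix is the identity matrix.
index = UniformTermSimilarityIndex(dictionary, term_similarity=0.0)
matrix = SparseTermSimilarityMatrix(index, dictionary)
```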
@Witiko please write a TO-DO list for this PR (what else needs to change) for simpler navigation / better understanding
I have finished the description and the TODO list as requested, and I inserted both into the first post for improved legibility. Any comments, especially on the deprecation TODO list items, are welcome.
@menshikh-iv @piskvorky Just a reminder that this PR:
In other words, this PR should be good to go for the upcoming release. What are your thoughts?
@Witiko please resolve the merge conflict
@menshikh-iv I merged the
Looks nice @Witiko 👍
Please resolve the current review and I'll merge it.
Awesome work @Witiko, I hope your research is going well :)
Below, you will find the suggested changelog.

🌟 New Features

👍 Improvements
Introduction
This is a follow-up of #1827 (Implement Soft Cosine Measure). The original implementation included only a single term similarity matrix based on word embeddings. The new commits add a Levenshtein term similarity matrix. To prevent code duplication and to reduce complexity, I have separated the matrix building algorithm from the code that retrieves the most similar terms. In reaction to #1955, the Soft Cosine Measure (SCM) can now be computed not only between a pair of vectors, but also between a corpus and a vector, and between a pair of corpora. This last point is also the future work suggested in #1827. The issues to discuss concern the placement of new code, the deprecation of old code, and speeding up the Levenshtein distance implementation.
The `gensim.similarities.termsim` module

A major structural change is the addition of the `gensim.similarities.termsim` module. Before this addition, there existed the `WordEmbeddingsKeyedVectors.similarity_matrix` method that contained both the matrix building algorithm and the algorithm for retrieving the most similar terms for a given term. Now the matrix building algorithm has been moved into a separate `SparseTermSimilarityMatrix` director class, and the algorithm for retrieving the most similar terms has been moved into a separate `WordEmbeddingSimilarityIndex` builder class that implements the `TermSimilarityIndex` interface. This change follows the single responsibility principle.

Two new classes implementing the `TermSimilarityIndex` interface have also been added. The `LevenshteinSimilarityIndex` retrieves the most similar terms according to the "Levenshtein similarity" described by Charlet and Damnati, 2017 [1, sec. 2.2]. The `UniformTermSimilarityIndex` assumes all distinct terms are equally similar; its main use is in testing `SparseTermSimilarityMatrix`.

The following UML class diagram captures the new structure:
The `WordEmbeddingsKeyedVectors.similarity_matrix` method and the `similarity_matrix` function in the `gensim.similarities.levenshtein` module currently serve as facades that construct a `SparseTermSimilarityMatrix` using the appropriate `TermSimilarityIndex` behind the scenes. This keeps the code backwards-compatible. I marked the functions for deprecation in 4.0.0, but if the Gensim policies allow, we can get rid of them sooner. Besides backwards compatibility, the facades are also convenient for the users, but there is an associated maintenance cost if we decide to keep them.
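To make the builder/director split concrete, here is a minimal sketch of how the pieces compose (keyword arguments and the `normalized` flag are assumptions based on this description rather than a verbatim copy of the final API):

```python
from gensim.corpora import Dictionary
from gensim.models import TfidfModel, Word2Vec
from gensim.similarities import SparseTermSimilarityMatrix, WordEmbeddingSimilarityIndex

documents = [["strong", "cat"], ["powerful", "dog"]]
dictionary = Dictionary(documents)
bow_corpus = [dictionary.doc2bow(document) for document in documents]
tfidf = TfidfModel(dictionary=dictionary)

# Builder: knows how to retrieve the most similar terms for a given term.
w2v_model = Word2Vec(documents, min_count=1, seed=42)
termsim_index = WordEmbeddingSimilarityIndex(w2v_model.wv)

# Director: builds the sparse term similarity matrix from any TermSimilarityIndex.
similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary, tfidf)

# Soft Cosine Measure between two documents: inner product of the normalized vectors.
score = similarity_matrix.inner_product(bow_corpus[0], bow_corpus[1], normalized=True)
```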
The `gensim.similarities.levenshtein` module

The `gensim.similarities.levenshtein` module contains code for computing the "Levenshtein similarity" described by Charlet and Damnati, 2017 [1]. See the benchmark for a detailed performance analysis.
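As a rough sketch of what "Levenshtein similarity" means here (following the formula from [1]; the `alpha` and `beta` values are illustrative defaults, not a fixed decision):

```python
from Levenshtein import distance as levenshtein_distance


def levenshtein_similarity(t1, t2, alpha=1.8, beta=5.0):
    """Levenshtein similarity in the sense of Charlet and Damnati, 2017 [1]:
    the length-normalized Levenshtein distance flipped into a similarity,
    scaled by alpha and raised to the power beta."""
    normalized_distance = levenshtein_distance(t1, t2) / max(len(t1), len(t2))
    return alpha * (1.0 - normalized_distance) ** beta
```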
The `SparseTermSimilarityMatrix.inner_product` method

The `SparseTermSimilarityMatrix.inner_product` method contains code for computing the inner product between two vectors or corpora expressed in a non-orthogonal basis.

There is a bit of "smart" linear algebra involved in computing the inner product between two L2-normalized m×n corpus matrices X and Y, which I will briefly describe here. We need to normalize each column document vector x in X by sqrt(xᵀ ⋅ S ⋅ x), which is equivalent to the entrywise (Hadamard) division of each row in X by the diagonal of sqrt(Xᵀ ⋅ S ⋅ X), where S is the m×m term similarity matrix. However, sqrt(Xᵀ ⋅ S ⋅ X) is an O(mn² ≈ m⁵) operation. We can instead directly compute the column vector of the diagonal as sqrt(Xᵀ ⋅ S * Xᵀ) summed along the row axis, where * is the entrywise product, which is an O(m²n ≈ m⁴) operation.
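A sketch of that trick in SciPy terms (variable names are mine, not taken from the gensim source):

```python
import numpy as np
from scipy import sparse

m, n = 100, 20                                 # terms × documents
X = sparse.random(m, n, density=0.3, format="csc")
S = sparse.identity(m, format="csr")           # stand-in for the term similarity matrix

# Naive: build the full n×n matrix Xᵀ·S·X only to read off its diagonal.
naive_norms = np.sqrt(X.T.dot(S).dot(X).diagonal())

# Cheaper: the j-th diagonal entry is xⱼᵀ·S·xⱼ, i.e. the row sums of (Xᵀ·S) ∘ Xᵀ.
fast_norms = np.sqrt(np.asarray(X.T.dot(S).multiply(X.T).sum(axis=1)).ravel())

assert np.allclose(naive_norms, fast_norms)

# Scaling each column by 1 / norm (equivalently, dividing each row entrywise
# by the vector of norms) L2-normalizes every document vector.
X_normalized = X.dot(sparse.diags(1.0 / fast_norms))
```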
The `SparseTermSimilarityMatrix.inner_product` method is a more general variant of the `gensim.matutils.softcossim` function, which only computes the inner product between two normalized vectors, not general vectors or corpora. I marked the `matutils.softcossim` function for deprecation in 4.0.0, but if the Gensim policies allow, we can get rid of it sooner.

Since the `gensim.similarities.termsim` module imports the `corpus2csc` function from the `matutils` module, importing `SparseTermSimilarityMatrix.inner_product` from `gensim.matutils` would result in a cross-dependency. Therefore, the `softcossim` function carries all the old code instead of calling `inner_product`.

Future work
The `SparseTermSimilarityMatrix` class constructor uses the dictionary of keys (DOK) sparse matrix format to incrementally build a sparse matrix. This is convenient, but, as explained by @maciejkula, space-inefficient. I observed a 10-fold increase in RAM usage compared to using three dynamic arrays (two for the indices, and one for the data) with the shortest possible unsigned integer data types. A similar technique will be implemented in the `SparseTermSimilarityMatrix` class constructor.
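A sketch of the dynamic-array approach (using the standard library `array` module; the exact typecodes are illustrative, not a decision that has been made):

```python
from array import array

import numpy as np
from scipy import sparse

matrix_order = 5  # number of terms in the dictionary

# Three compact, append-only arrays instead of a dict-of-keys structure.
rows = array('I')     # unsigned int row indices
columns = array('I')  # unsigned int column indices
data = array('f')     # single-precision similarities


def add_cell(row, column, value):
    """Record a single nonzero cell of the term similarity matrix."""
    rows.append(row)
    columns.append(column)
    data.append(value)


for term_index in range(matrix_order):
    add_cell(term_index, term_index, 1.0)  # ones on the diagonal
add_cell(0, 3, 0.5)                        # an off-diagonal similarity
add_cell(3, 0, 0.5)                        # keep the matrix symmetric

matrix = sparse.coo_matrix(
    (np.array(data, dtype=np.float32),
     (np.array(rows, dtype=np.uint32), np.array(columns, dtype=np.uint32))),
    shape=(matrix_order, matrix_order),
).tocsc()
```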
The `levdist` function admits a `max_distance` parameter that allows us to terminate the computation of the Levenshtein distance early. This optimization will be introduced to the python-Levenshtein module from Antti Haapala (see also the discussion below).
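To illustrate the early-out idea, here is a plain-Python sketch of the capped distance (python-Levenshtein itself is a C extension; this is only a reference implementation of the technique):

```python
def levdist_capped(t1, t2, max_distance):
    """Levenshtein distance that gives up and returns max_distance + 1 as soon
    as the true distance is guaranteed to exceed max_distance."""
    if abs(len(t1) - len(t2)) > max_distance:
        return max_distance + 1  # the length difference alone already exceeds the cap
    previous_row = list(range(len(t2) + 1))
    for i, char1 in enumerate(t1, start=1):
        current_row = [i]
        for j, char2 in enumerate(t2, start=1):
            current_row.append(min(
                previous_row[j] + 1,                     # deletion
                current_row[j - 1] + 1,                  # insertion
                previous_row[j - 1] + (char1 != char2),  # substitution
            ))
        if min(current_row) > max_distance:
            # Values never decrease along the dynamic programme, so no later
            # row can drop back below the cap: terminate early.
            return max_distance + 1
        previous_row = current_row
    return previous_row[-1]
```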
The `LevenshteinSimilarityIndex.most_similar` method currently uses the pointwise Levenshtein distance to retrieve the most similar terms for a given term in average time O(mn²), where m is the number of terms in a dictionary and n is the average word length. When used to compute a term similarity matrix, this results in average time O(m ⋅ mn² = m²n²). A more time-efficient procedure for computing the Levenshtein distance between all terms in a dictionary will be implemented.

References

[1] Delphine Charlet and Géraldine Damnati. SimBow at SemEval-2017 Task 3: Soft-Cosine Semantic Similarity between Questions for Community Question Answering. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), 2017.