Implement Levenshtein term similarity matrix and fast SCM between corpora #2016
Conversation
gensim/matutils.py (Outdated)

```python
@@ -775,6 +776,9 @@ def cossim(vec1, vec2):
    return result


@deprecated(
    "Function will be removed in 4.0.0, use " +
```
nitpick: no need to use `+` for concatenation if this happens in `()`.
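For illustration, adjacent string literals inside parentheses are concatenated by the Python parser, so the explicit `+` is redundant (the deprecation target named below is just a placeholder, not the final wording):

```python
# Adjacent string literals inside parentheses are joined at compile time,
# so no runtime "+" concatenation is needed.
message = (
    "Function will be removed in 4.0.0, use "
    "gensim.similarities.termsim instead"  # placeholder target, not the final message
)
print(message)
```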
I will fix this once we figure out what to actually deprecate.
setup.py (Outdated)

```python
@@ -309,6 +309,7 @@ def finalize_options(self):
        'scipy >= 0.18.1',
        'six >= 1.5.0',
        'smart_open >= 1.2.1',
        'python-Levenshtein >= 0.10.2'
```
Sorry, but for adding a new core dependency we should have serious reasons.
CC: @piskvorky
I see 2 ways:
- implement this functionality yourself
- add it as a "conditional" import (and move it to test dependencies)
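For context, a conditional import could look roughly like this (a sketch only, not the actual gensim code; the error message and function placement are illustrative):

```python
try:
    from Levenshtein import distance as levenshtein_distance
except ImportError:
    levenshtein_distance = None


def levdist(t1, t2):
    """Compute the Levenshtein distance, failing with a clear message if the
    optional python-Levenshtein dependency is not installed."""
    if levenshtein_distance is None:
        raise ImportError(
            "Levenshtein distance requires the optional python-Levenshtein package; "
            "install it with: pip install python-Levenshtein"
        )
    return levenshtein_distance(t1, t2)
```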
This dependency might be temporary, since the current Levenshtein distance implementation is inefficient. We discussed a more efficient bulk Levenshtein distance implementation in #1955.
Yes, introducing new core dependencies is not desirable. Especially if it's only used in new, experimental modules.
I agree that naive implementations that just call `dist(s1, s2)` repeatedly are not very useful in practice, but it's good as a baseline and for checking regressions.
For "non-experimental" code we'd want something that takes the concrete problem constraints into account: pre-calculating static parts, tries, automata, indexes, early-out when the maximum acceptable distance is exceeded, etc.
gensim/test/test_similarities.py (Outdated)

```python
@@ -371,6 +372,7 @@ def testIter(self):


class TestSoftCosineSimilarity(unittest.TestCase, _TestSimilarityABC):
    @deprecated("Method will be removed in 4.0.0")
```
I don't think it is a good idea to deprecate something in tests. Anyway, if we remove old code, the tests will break and we'll see it clearly.
After we have decided what to actually deprecate, I will remove the annotations from tests. They are currently there just to make it clear what parts of the code are proposed for deprecation.
gensim/test/test_levenshtein.py (Outdated)

```python
@@ -0,0 +1,131 @@
#!/usr/bin/env python
```
why not place it in `test_similarities` (same for `test_term_similarity`)?
gensim/models/term_similarity.py (Outdated)

```python
logger = logging.getLogger(__name__)


class TermSimilarityIndex(SaveLoad):
```
I don't think that `gensim.models` is the right place for it (same for `levenshtein`); we have `gensim.similarities` for this kind of stuff (exception only for `KeyedVectors`).
CC: @piskvorky
The module placement is one of the things that I hoped we could discuss, since there is little documentation about where things should go. Both modules implement term similarity models; if you think they would feel more at home in `gensim.similarities`, then to `gensim.similarities` they will go.
Yes, I think it's better to place it in `gensim.similarities`, because
- this is purely about similarity (and we already have a submodule for it)
- this doesn't "quack" like a standard gensim model
gensim/models/term_similarity.py (Outdated)

```python
        self.dictionary = dictionary
        self.term_similarity = term_similarity

    def most_similar(self, t1, topn=10):
```
Looks like this isn't ready, am I right?
This is actually quite ready. `UniformTermSimilarityIndex` assigns a constant similarity to any pair of distinct words; the main use is for testing `SparseTermSimilarityMatrix`.
It is also quite useful for benchmarking the maximum throughput of the `SparseTermSimilarityMatrix` constructor.
Another use would be the construction of an identity term similarity matrix by setting `term_similarity` to zero.
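A minimal sketch of that idea (the import path and parameter names follow this thread and may differ from the final code):

```python
from gensim.corpora import Dictionary
# Assumed final location; in this PR the class still lives in gensim.models.term_similarity.
from gensim.similarities import SparseTermSimilarityMatrix, UniformTermSimilarityIndex

dictionary = Dictionary([["strong", "powerful", "weak"]])

# Every pair of distinct terms gets the same similarity (here 0.0),
# so the resulting term similarity matrix is the identity matrix.
index = UniformTermSimilarityIndex(dictionary, term_similarity=0.0)
matrix = SparseTermSimilarityMatrix(index, dictionary)
```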
@Witiko please write a TO-DO list for this PR (what else needs to change) for simpler navigation / better understanding
I have finished the description and the TODO list as requested, and I inserted both into the first post for improved legibility. Any comments, especially on the deprecation TODO list items, are welcome.
@menshikh-iv @piskvorky Just a reminder that this PR:
In other words, this PR should be good to go for the upcoming release. What are your thoughts?
@Witiko please resolve the merge conflict
@menshikh-iv I merged the
Looks nice @Witiko 👍
Please resolve the current review and I'll merge it.
Awesome work @Witiko, I hope your research is going well :)
Below, you will find the suggested changelog.

🌟 New Features

👍 Improvements
Introduction
This is a follow-up of #1827 (Implement Soft Cosine Measure). The original implementation included only a single term similarity matrix based on word embeddings. The new commits add a Levenshtein term similarity matrix. To prevent code duplication and to reduce complexity, I have separated the matrix building algorithm from the code that retrieves the most similar terms. In reaction to #1955, the Soft Cosine Measure (SCM) can now be computed not only between a pair of vectors, but also between a corpus and a vector, and between a pair of corpora. This last point is also the future work suggested in #1827. The issues to discuss concern the placement of new code, the deprecation of old code, and speeding up the Levenshtein distance implementation.
The `gensim.similarities.termsim` module

A major structural change is the addition of the `gensim.similarities.termsim` module. Before this addition, there existed the `WordEmbeddingsKeyedVectors.similarity_matrix` method that contained both the matrix building algorithm and the algorithm for retrieving the most similar terms for a given term. Now the matrix building algorithm has been moved into a separate `SparseTermSimilarityMatrix` director class, and the algorithm for retrieving the most similar terms has been moved into a separate `WordEmbeddingSimilarityIndex` builder class that implements the `TermSimilarityIndex` interface. This change follows the single responsibility principle.

Two new classes implementing the `TermSimilarityIndex` interface have also been added. The `LevenshteinSimilarityIndex` retrieves the most similar terms according to the "Levenshtein similarity" described by Charlet and Damnati, 2017 [1, sec. 2.2]. The `UniformTermSimilarityIndex` assumes all distinct terms are equally similar; its main use is in testing `SparseTermSimilarityMatrix`.

The following UML class diagram captures the new structure:
The `WordEmbeddingsKeyedVectors.similarity_matrix` method and the `similarity_matrix` function in the `gensim.similarities.levenshtein` module currently serve as facades that construct a `SparseTermSimilarityMatrix` using the appropriate `TermSimilarityIndex` behind the scenes. This keeps the code backwards-compatible. I marked the functions for deprecation in 4.0.0, but if the Gensim policies allow, we can get rid of them sooner. Besides backwards compatibility, the facades are also convenient for the users, but there is an associated maintenance cost if we decide to keep them.
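To make the builder/director split concrete, here is a minimal sketch of how the pieces compose (keyword arguments and the `normalized` flag are assumptions based on this description rather than a verbatim copy of the final API):

```python
from gensim.corpora import Dictionary
from gensim.models import TfidfModel, Word2Vec
from gensim.similarities import SparseTermSimilarityMatrix, WordEmbeddingSimilarityIndex

documents = [["strong", "cat"], ["powerful", "dog"]]
dictionary = Dictionary(documents)
bow_corpus = [dictionary.doc2bow(document) for document in documents]
tfidf = TfidfModel(dictionary=dictionary)

# Builder: knows how to retrieve the most similar terms for a given term.
w2v_model = Word2Vec(documents, min_count=1, seed=42)
termsim_index = WordEmbeddingSimilarityIndex(w2v_model.wv)

# Director: builds the sparse term similarity matrix from any TermSimilarityIndex.
similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary, tfidf)

# Soft Cosine Measure between two documents: inner product of the normalized vectors.
score = similarity_matrix.inner_product(bow_corpus[0], bow_corpus[1], normalized=True)
```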
The `gensim.similarities.levenshtein` module

The `gensim.similarities.levenshtein` module contains code for computing the "Levenshtein similarity" described by Charlet and Damnati, 2017 [1]. See the benchmark for a detailed performance analysis.
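As a rough sketch of what "Levenshtein similarity" means here (following the formula from [1]; the `alpha` and `beta` values are illustrative defaults, not a fixed decision):

```python
from Levenshtein import distance as levenshtein_distance


def levenshtein_similarity(t1, t2, alpha=1.8, beta=5.0):
    """Levenshtein similarity in the sense of Charlet and Damnati, 2017 [1]:
    the length-normalized Levenshtein distance flipped into a similarity,
    scaled by alpha and raised to the power beta."""
    normalized_distance = levenshtein_distance(t1, t2) / max(len(t1), len(t2))
    return alpha * (1.0 - normalized_distance) ** beta
```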
The `SparseTermSimilarityMatrix.inner_product` method

The `SparseTermSimilarityMatrix.inner_product` method contains code for computing the inner product between two vectors or corpora expressed in a non-orthogonal basis.

There is a bit of "smart" linear algebra involved in computing the inner product between two L2-normalized m×n corpus matrices X and Y, which I will briefly describe here. We need to normalize each column document vector x in X by sqrt(xᵀ ⋅ S ⋅ x), which is equivalent to the entrywise (Hadamard) division of each row in X by the diagonal of sqrt(Xᵀ ⋅ S ⋅ X), where S is the m×m term similarity matrix. However, sqrt(Xᵀ ⋅ S ⋅ X) is an O(mn² ≈ m⁵) operation. We can instead directly compute the column vector of the diagonal as sqrt(Xᵀ ⋅ S * Xᵀ) summed along the row axis, where * is the entrywise product, which is an O(m²n ≈ m⁴) operation.
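A sketch of that trick in SciPy terms (variable names are mine, not taken from the gensim source):

```python
import numpy as np
from scipy import sparse

m, n = 100, 20                                 # terms × documents
X = sparse.random(m, n, density=0.3, format="csc")
S = sparse.identity(m, format="csr")           # stand-in for the term similarity matrix

# Naive: build the full n×n matrix Xᵀ·S·X only to read off its diagonal.
naive_norms = np.sqrt(X.T.dot(S).dot(X).diagonal())

# Cheaper: the j-th diagonal entry is xⱼᵀ·S·xⱼ, i.e. the row sums of (Xᵀ·S) ∘ Xᵀ.
fast_norms = np.sqrt(np.asarray(X.T.dot(S).multiply(X.T).sum(axis=1)).ravel())

assert np.allclose(naive_norms, fast_norms)

# Scaling each column by 1 / norm (equivalently, dividing each row entrywise
# by the vector of norms) L2-normalizes every document vector.
X_normalized = X.dot(sparse.diags(1.0 / fast_norms))
```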
The `SparseTermSimilarityMatrix.inner_product` method is a more general variant of the `gensim.matutils.softcossim` function, which only computes the inner product between two normalized vectors, not general vectors or corpora. I marked the `matutils.softcossim` function for deprecation in 4.0.0, but if the Gensim policies allow, we can get rid of it sooner.

Since the `gensim.similarities.termsim` module imports the `corpus2csc` function from the `matutils` module, importing `SparseTermSimilarityMatrix.inner_product` from `gensim.matutils` would result in a cross-dependency. Therefore, the `softcossim` function carries all the old code instead of calling `inner_product`.

Future work
The `SparseTermSimilarityMatrix` class constructor uses the dictionary of keys (DOK) sparse matrix format to incrementally build a sparse matrix. This is convenient, but, as explained by @maciejkula, space-inefficient. I observed a 10-fold increase in RAM usage compared to using three dynamic arrays (two for the indices, and one for the data) with the shortest possible unsigned integer data types. A similar technique will be implemented in the `SparseTermSimilarityMatrix` class constructor.
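A sketch of the dynamic-array approach (using the standard library `array` module; the exact typecodes are illustrative, not a decision that has been made):

```python
from array import array

import numpy as np
from scipy import sparse

matrix_order = 5  # number of terms in the dictionary

# Three compact, append-only arrays instead of a dict-of-keys structure.
rows = array('I')     # unsigned int row indices
columns = array('I')  # unsigned int column indices
data = array('f')     # single-precision similarities


def add_cell(row, column, value):
    """Record a single nonzero cell of the term similarity matrix."""
    rows.append(row)
    columns.append(column)
    data.append(value)


for term_index in range(matrix_order):
    add_cell(term_index, term_index, 1.0)  # ones on the diagonal
add_cell(0, 3, 0.5)                        # an off-diagonal similarity
add_cell(3, 0, 0.5)                        # keep the matrix symmetric

matrix = sparse.coo_matrix(
    (np.array(data, dtype=np.float32),
     (np.array(rows, dtype=np.uint32), np.array(columns, dtype=np.uint32))),
    shape=(matrix_order, matrix_order),
).tocsc()
```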
The `levdist` function admits a `max_distance` parameter that allows us to terminate the computation of the Levenshtein distance early. This optimization will be introduced to the python-Levenshtein module from Antti Haapala (see also the discussion below).
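To illustrate the early-out idea, here is a plain-Python sketch of the capped distance (python-Levenshtein itself is a C extension; this is only a reference implementation of the technique):

```python
def levdist_capped(t1, t2, max_distance):
    """Levenshtein distance that gives up and returns max_distance + 1 as soon
    as the true distance is guaranteed to exceed max_distance."""
    if abs(len(t1) - len(t2)) > max_distance:
        return max_distance + 1  # the length difference alone already exceeds the cap
    previous_row = list(range(len(t2) + 1))
    for i, char1 in enumerate(t1, start=1):
        current_row = [i]
        for j, char2 in enumerate(t2, start=1):
            current_row.append(min(
                previous_row[j] + 1,                     # deletion
                current_row[j - 1] + 1,                  # insertion
                previous_row[j - 1] + (char1 != char2),  # substitution
            ))
        if min(current_row) > max_distance:
            # Values never decrease along the dynamic programme, so no later
            # row can drop back below the cap: terminate early.
            return max_distance + 1
        previous_row = current_row
    return previous_row[-1]
```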
The `LevenshteinSimilarityIndex.most_similar` method currently uses the pointwise Levenshtein distance to retrieve the most similar terms for a given term in average time O(mn²), where m is the number of terms in a dictionary and n is the average word length. When used to compute a term similarity matrix, this results in average time O(m ⋅ mn² = m²n²). A more time-efficient procedure for computing the Levenshtein distance between all terms in a dictionary will be implemented.

References

[1] Delphine Charlet and Géraldine Damnati. SimBow at SemEval-2017 Task 3: Soft-Cosine Semantic Similarity between Questions for Community Question Answering. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), 2017.