Use FastSS for fast kNN over Levenshtein distance #3146
@Witiko what's the result of the benchmark, what am I looking at here?
>>> SparseTermSimilarityMatrix(embedding_index, dictionary)
100%|████████████████████████████| 10781/10781 [00:01<00:00, 6401.08it/s]
We are going over all 10,781 words in a dictionary and looking for the 100 nearest neighbors of every word.
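To make the measured workload concrete, here is a minimal sketch of what "the k nearest neighbors of every word" means under the Levenshtein distance when done by brute force. The helper names (`levenshtein`, `brute_force_knn`) and the toy vocabulary are made up for illustration; this is not Gensim's code, only the naive baseline that the indexes discussed in this thread try to beat.

```python
import heapq


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, O(len(a) * len(b))."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def brute_force_knn(query, vocabulary, k):
    """Return the k (distance, term) pairs closest to `query`, excluding itself."""
    candidates = (
        (levenshtein(query, term), term)
        for term in vocabulary
        if term != query
    )
    return heapq.nsmallest(k, candidates)


# Toy vocabulary (hypothetical); a real run iterates over every dictionary term.
vocabulary = ["kitten", "sitting", "mitten", "bitten", "written", "cat"]
neighbors = {term: brute_force_knn(term, vocabulary, k=2) for term in vocabulary}
```

Every query compares against the whole vocabulary, so the total work grows quadratically with the vocabulary size, which is why the progress bars for the brute-force variant look so different on larger dictionaries.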
Following our discussion in #2541 (comment), I implemented the DAWG for approximate kNN search over the Levenshtein distance.
Speed comparison
Let's see how the different techniques measure up. This time, I used the ca. 200× larger text8 dataset (100 MB), which should be more representative. The rest of the experimental setup is unaltered from #3146 (comment).
Brute-force kNN search
$ pip install gensim==4.0.1 python-Levenshtein
$ wget http://mattmahoney.net/dc/text8.zip
$ unzip text8.zip
$ python
>>> from gensim.corpora import Dictionary
>>> from gensim.models.word2vec import LineSentence, Word2Vec
>>> from gensim.similarities import (
... SparseTermSimilarityMatrix,
... WordEmbeddingSimilarityIndex,
... LevenshteinSimilarityIndex,
... )
>>>
>>> corpus = LineSentence('text8')
>>> dictionary = Dictionary(corpus)
>>> w2v_model = Word2Vec(sentences=corpus)
>>> embedding_index = WordEmbeddingSimilarityIndex(w2v_model.wv)
>>> levenshtein_index = LevenshteinSimilarityIndex(dictionary)
>>>
>>> SparseTermSimilarityMatrix(embedding_index, dictionary)
4%|████▌ | 9949/253854 [00:33<13:51, 293.16it/s]
>>> SparseTermSimilarityMatrix(levenshtein_index, dictionary)
0%| | 20/253854 [00:22<79:30:25, 1.13s/it]
VP-Tree index
$ pip install vptree git+https://github.com/witiko/gensim@af5833d
$ python
>>> # Same as above
>>>
>>> SparseTermSimilarityMatrix(levenshtein_index, dictionary)
0%| | 20/253854 [00:17<61:20:21, 1.15it/s]
DAWG index
$ pip install lexpy git+https://github.com/witiko/gensim@fb98a435
$ python
>>> # Same as above
>>>
>>> levenshtein_index = LevenshteinSimilarityIndex(dictionary, max_distance=1)
>>> SparseTermSimilarityMatrix(levenshtein_index, dictionary)
4%|████▌ | 9993/253854 [01:55<44:54, 90.50it/s]
>>>
>>> levenshtein_index = LevenshteinSimilarityIndex(dictionary, max_distance=2)
>>> SparseTermSimilarityMatrix(levenshtein_index, dictionary)
0%| | 20/253854 [00:02<6:36:07, 10.67it/s]
>>>
>>> levenshtein_index = LevenshteinSimilarityIndex(dictionary, max_distance=3)
>>> SparseTermSimilarityMatrix(levenshtein_index, dictionary)
0%| | 20/253854 [00:08<28:04:10, 2.51it/s]
>>>
>>> levenshtein_index = LevenshteinSimilarityIndex(dictionary, max_distance=10)
>>> SparseTermSimilarityMatrix(levenshtein_index, dictionary)
0%| | 20/253854 [00:56<198:45:19, 2.82s/it]
Conclusion
Compared to the brute-force kNN and the VP-Tree metric index, the DAWG approaches the speed of the word embedding kNN retrieval, at the expense of retrieving only the terms at Levenshtein distance 1 from the query. Retrieving terms at Levenshtein distance 10 or less from the query using the DAWG is already slower than brute force, but that's unlikely to be a problem with larger-than-toy dictionaries.
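To illustrate why the cost grows so quickly with max_distance, here is a minimal sketch of bounded-distance search over a plain trie, carrying one row of the edit-distance table per trie node. This is an assumption-laden stand-in: a trie is simpler than a DAWG, and this is not the lexpy implementation; `TrieNode`, `build_trie` and `search` are hypothetical names.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set when a dictionary word ends at this node


def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word
    return root


def search(root, query, max_distance):
    """Return (word, distance) pairs with distance <= max_distance."""
    results = []
    first_row = list(range(len(query) + 1))
    if root.word is not None and first_row[-1] <= max_distance:
        results.append((root.word, first_row[-1]))  # empty word in the dictionary

    def walk(node, ch, previous_row):
        # Extend the edit-distance table by one row for character `ch`.
        row = [previous_row[0] + 1]
        for j in range(1, len(query) + 1):
            row.append(min(
                row[j - 1] + 1,
                previous_row[j] + 1,
                previous_row[j - 1] + (query[j - 1] != ch),
            ))
        if node.word is not None and row[-1] <= max_distance:
            results.append((node.word, row[-1]))
        # Prune: no descendant can get back under the bound once every cell exceeds it.
        if min(row) <= max_distance:
            for next_ch, child in node.children.items():
                walk(child, next_ch, row)

    for ch, child in root.children.items():
        walk(child, ch, first_row)
    return results


trie = build_trie(["cat", "cart", "car", "dog", "cot"])
matches = sorted(search(trie, "cat", max_distance=1))
```

With a small bound, most branches are pruned after a character or two. As the bound grows, the pruning condition rarely fires and the search degenerates towards visiting the whole structure, which is in line with the slowdown seen above for max_distance=10.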
@Witiko I tried with the first Python implementation of automata that Google gave me:
This implementation seems pure-Python, so I expect 10-100x speed-up after compiling / optimizing. Which doesn't seem too bad… am I missing something? Why do you consider this problem so hard? |
Very nice! The DAWG implementation from lexpy is also pure Python, but your implementation is a fair bit faster! 🙂
In my experience, kNN over the Levenshtein distance is difficult to optimize. I stand corrected.
I regret I can't volunteer the time. The best I can offer is code with an external dependency. |
@piskvorky Actually, have you tried comparing your speed with the |
A pair of eyes to sanity-check the solution would be even better :) I just plugged the code in, didn't even look at the results: 4eeeae6 @Witiko could you double check 4eeeae6 please? You're much more involved in this code than I am.
That particular automaton lib has an MIT license, so IMO it'd make more sense to include it in Gensim directly, rather than as an external dependency. Especially since it's not even a package / on PyPI. But two points:
On the theoretical side, we should be able to do even better, because this algo doesn't take into account the fact that the query words are the words from the dictionary itself. But that seems a very useful fact. This algo is more general, the query can be anything, but that also means the algo is not as optimal as it could be. But before the theoretical algo optimization, let's clear up its practical value / potential impact. |
Good idea, to anchor / sync our results. My machine seems actually ~30% slower than yours (measured at 4eeeae6): |
@piskvorky I compare 4eeeae6 (yours) and fb98a43 (mine) on text8 below.
$ git clone https://github.com/antoinewdg/pyffs
$ cd pyffs
$ mkdir generated
$ pip install git+https://github.com/RaRe-Technologies/gensim@4eeeae6
$ python
>>> from gensim.corpora import Dictionary
>>> from gensim.models.word2vec import LineSentence
>>> from gensim.similarities import (
... SparseTermSimilarityMatrix,
... LevenshteinSimilarityIndex,
... )
>>>
>>> corpus = LineSentence('text8')
>>> dictionary = Dictionary(corpus)
>>>
>>> levenshtein_index = LevenshteinSimilarityIndex(dictionary, max_distance=2)
>>> SparseTermSimilarityMatrix(levenshtein_index, dictionary)
0%| | 178/253854 [00:07<2:53:27, 24.37it/s]
$ pip install lexpy git+https://github.com/witiko/gensim@fb98a435
$ python
>>> # Same as above
>>>
>>> levenshtein_index = LevenshteinSimilarityIndex(dictionary, max_distance=2)
>>> SparseTermSimilarityMatrix(levenshtein_index, dictionary)
0%| | 20/253854 [00:02<6:36:07, 10.67it/s]
The speed difference is not so large. Perhaps not large enough to justify using a library outside PyPI that requires disk storage?
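As a sanity check on the two progress bars above, the projected total runtime follows directly from the throughput and the vocabulary size. The sketch below only redoes that arithmetic with the figures quoted in this comment; `estimated_runtime` is a hypothetical helper, and tqdm's own estimate is smoothed and counts only the remaining items, so small differences are expected.

```python
from datetime import timedelta

VOCABULARY_SIZE = 253_854  # number of terms in the text8 dictionary, from the output above


def estimated_runtime(items_per_second, total_items=VOCABULARY_SIZE):
    """Project the full runtime from a measured throughput."""
    return timedelta(seconds=round(total_items / items_per_second))


pyffs_runtime = estimated_runtime(24.37)  # throughput reported for 4eeeae6
lexpy_runtime = estimated_runtime(10.67)  # throughput reported for fb98a43
speed_ratio = 24.37 / 10.67

print(pyffs_runtime, lexpy_runtime, round(speed_ratio, 2))
```

Both projections land within a minute of the remaining-time estimates shown by tqdm, and the ratio works out to roughly 2.3×.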
This is not an issue if we use the lexpy implementation from this PR.
At the moment, it is only used to produce a word similarity matrix for soft similarity search. However, users may also use it in isolation for their own term similarity queries.
That is true, but it is sadly not something I can devote my time to at the moment.
This is not enforced by the |
2.5x faster is worth some effort. But I'm still not clear about the impact of the whole thing. @Witiko can you think of some kick ass demonstration for this functionality? Something to drive home its usefulness, make people excited it exists? Because I have doubts anyone is using I mean it would be fun to optimize, but it looks like neither of us has the capacity. So either we optimize with a splash (blog post, promo article, demo), or not at all. |
I compiled pyffs with Cython (no tweaking, just a dumb compile) and got from 15it/s to 20it/s => 33% improvement. Which is not much. So optimizing wouldn't be as trivial as running Cython over the existing Python code; we'd have to look deeper & profile. If it's worth it at all, that is.
The But, as it's really just using a set of lookup keys, no reason to limit the interface to a I think gradually adding the new type-hints, where easy to do so & in line with the contributors' style/preferences, is OK. The official Gensim project style might mention type hints as "welcome where helpful but not currently required".
We used no pruning there, so the dictionary is actually ~1/4 million unique words (word types). I'd say that's excessive for most applications, and would generally recommend culling to ~50k or so. So, not "smaller". |
@piskvorky That is often true in the general case, but when we work with larger word embedding vocabs (2 million word types seem standard), using a larger dictionary can be useful to make the most of our word embeddings. With respect to OCR text retrieval, the typos are also going to increase the size of our dictionary, so we need to be careful with pruning if we wish to capture the typos in the word similarity matrix and still keep the interesting RaRe words. |
@gojomo Gotcha. We still need to know the number of words, so it should not be general iterables, but we can work with anything that is sized. (Sadly, there does not seem to be an easy way to type this; we can either have |
If that's the case, we have to be careful about RAM. FastSS is quite memory hungry. FYI, since I already Cythonized FastSS, I also rewrote the Levenshtein distance function in C. Mostly for my pleasure – the impact of this optimization on the overall TermSimilarity runtime is negligible (benchmark), because the previous version was already fast. But |
Looks like there are no reviews forthcoming, so let me merge this. We can discuss the promo for this functionality (article, demo) separately. |
@piskvorky Which test corpus did you use to reach this conclusion? At least on random strings of equal length without a max_dist I could not reproduce your results. In my tests the I performed the following test:
setup = """
from rapidfuzz import string_metric
import Levenshtein
import gensim.similarities
import string
import random
random.seed(18)
characters = string.ascii_letters + string.digits + string.whitespace + string.punctuation
a = ''.join(random.choice(characters) for _ in range({0}))
b_list = [''.join(random.choice(characters) for _ in range({0})) for _ in range({1})]
"""
lengths = list(range(1,256,2))
count = 200
time_gensim = benchmark("gensim",
'[gensim.similarities.levenshtein.editdist(a, b) for b in b_list]',
setup, lengths, count)
time_python_levenshtein = benchmark("python-Levenshtein",
'[Levenshtein.distance(a, b) for b in b_list]',
setup, lengths, count)
time_rapidfuzz = benchmark("rapidfuzz",
'[string_metric.levenshtein(a, b) for b in b_list]',
setup, lengths, count)
with the following results:
So I am quite interested in the kind of benchmarks you performed.
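The snippet above calls a `benchmark` helper whose definition is not shown in this thread. A minimal sketch of what such a helper could look like, built on the standard-library `timeit` module, follows. Its signature is inferred from the calls above, and the body (best of three runs, normalised per compared string) is purely an assumption; the demo at the bottom uses plain string equality so that it runs without any third-party package.

```python
import timeit


def benchmark(name, func, setup, lengths, count):
    """Hypothetical helper: seconds per comparison for each string length.

    `setup` is a template with {0} = string length and {1} = number of strings,
    matching the placeholders used in the setup string above.
    """
    print(f"benchmarking {name}")
    timings = []
    for length in lengths:
        runs = timeit.repeat(
            stmt=func,
            setup=setup.format(length, count),
            number=1,
            repeat=3,
        )
        timings.append(min(runs) / count)  # best run, per compared string
    return timings


# Stand-in workload using only the standard library.
demo_setup = """
import random
random.seed(18)
a = ''.join(random.choice('abcd') for _ in range({0}))
b_list = [''.join(random.choice('abcd') for _ in range({0})) for _ in range({1})]
"""
demo_timings = benchmark("str equality", "[a == b for b in b_list]", demo_setup, [1, 8, 64], 50)
```

Taking the minimum over repeats is a common way to reduce noise from other processes; whether the original measurements did the same is not stated here.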
The benchmark code is included in the same place as its results: #3146 (comment). The last item is the final version = what we merged in the end. The merged code gets ~1318it/s; replacing I never heard of ^ EDITED: I originally posted a wrong number here (forgot to recompile |
@maxbachmann I plugged in |
Yes, that's definitely a big plus. AFAIK all these solutions are pretty much bound by the time it takes to call the functions. In RapidFuzz (I am the author) I have some functions which calculate the similarity of one string to multiple other strings. This can be more than 10 times as fast due to the following two factors:
I have completely missed the button to open those 60 comments in between ... Thanks for pointing me towards the correct comment.
By the way, is there any accepted standard for benchmarks in this area? I personally mostly test on randomized strings of similar length (simply because I do not know any better approach). However, I often find papers which e.g. propose filters for string similarity based on the length (which are generally a valid approach, since they are fast to calculate). In those, the authors often choose datasets with big length differences. Then their algorithm suddenly performs better than the current best algorithm for one specific similarity threshold at which more than 99% of the elements can be filtered out in constant time. Apparently, my way of benchmarking does not really represent real datasets either, since those usually contain strings of different lengths whose characters are not uniformly random (some characters are more common than others).
I am asking because I developed an implementation of a string matching algorithm that is significantly faster than the current implementation (O([N/w]M) instead of O(N*M)) and started to write a paper on it. However, looking through different papers, I could not really find a common test they choose. It appears as if each of the papers tests datasets until it finds one that makes its algorithm appear slightly faster than the alternatives (or tweaks its algorithm to fit this one specific dataset). Since I do not have too much knowledge in this space, I would like to make sure I am not doing the same. Especially since I write the paper in my free time simply because I believe that it is a big improvement, while I would not bother about it for a couple of percent improvement on a specific dataset ;)
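For readers wondering where a bound like O([N/w]M) can come from: one well-known family with this shape is bit-parallel edit distance (Myers 1999, Hyyrö 2003), which packs a whole column of the dynamic-programming table into machine words of width w. The sketch below is an illustration of that general idea only. It is not claimed to be the algorithm referred to above, nor RapidFuzz's code. It relies on Python's arbitrary-precision integers in place of fixed-width words, and the function names are hypothetical.

```python
def levenshtein_bitparallel(pattern, text):
    """Edit distance via bit-vector column updates (Myers/Hyyrö style)."""
    m = len(pattern)
    if m == 0:
        return len(text)
    # peq[ch] has bit i set when pattern[i] == ch.
    peq = {}
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)
    mask = (1 << m) - 1
    last = 1 << (m - 1)
    pv, mv = mask, 0  # vertical +1 / -1 differences of the current column
    score = m
    for ch in text:
        eq = peq.get(ch, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask  # horizontal +1 differences
        mh = pv & xh                   # horizontal -1 differences
        if ph & last:
            score += 1
        elif mh & last:
            score -= 1
        ph = ((ph << 1) | 1) & mask
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
    return score


def levenshtein_reference(a, b):
    """Plain O(N*M) dynamic programming, used only to cross-check the result."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


pairs = [("kitten", "sitting"), ("flaw", "lawn"), ("ab", "ba"), ("", "abc"), ("abc", "abc")]
distances = [levenshtein_bitparallel(a, b) for a, b in pairs]
```

Each character of the text costs a constant number of word operations per w pattern characters, which is where the [N/w] factor comes from. In CPython the interpreter overhead hides most of that gain, so the sketch is about the structure of the computation, not its speed.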
I checked both polyleven and rapidfuzz and found the algos really exciting (and their implementations impressive). I wasn't aware so much effort went into this field. Now I'm glad I qualified my "this is the fastest implementation" with "that I know of" :-) I don't know about datasets for benchmarking edit distance specifically. My preference is always to start at the top (find a problem worth solving) and optimize down from there. So the choice of algorithms and datasets follows the problem – such as the FastSS algo + vocab from the English Wikipedia that fell out of this PR. I find it puts big-O claims and solution constraints into perspective, because the context and constants always matter. In other words, sorry, I don't know that much about the string matching space in general :-) Maybe @Witiko could help.
Which function is that? |
They both use very similar implementations. E.g. I took the mbleven idea from polyleven. Since RapidFuzz used C++ I was able to make use of templates for more specialised algorithms e.g. when strings only use extended ASCII.
Me neither. I started to work on this when I used FuzzyWuzzy in an MIT Licensed project and realised it was GPL licensed. Then the performance junkie in me was forcing me to optimize the algorithm further 😅
So far |
@piskvorky Rereading the implementation I think it should be possible to further improve the performance of FastSS by quite a bit without any major code changes and will test them this evening. Edit: is it fine to use C++ when it allows a cleaner implementation of things than C? https://github.com/RaRe-Technologies/gensim/blob/2feef89a24c222e4e0fc6e32ac7c6added752c26/gensim/models/word2vec_corpusfile.pxd already uses C++ so I think this should be fine. Also are you sure
really better than dynamic allocation? Allocating 80kb of Stack appears like a pretty bad idea. |
Thanks for the offer! The next big move is on @Witiko and his team now, motivating this functionality with an impactful demo / promo. Further performance optimizations are fun but diminishing returns. Also, I suspect getting deeper into the app side may open up new opportunities for optimization. Because we'll better understand the required / desired parameter choices, the problem space. Maybe the whole thing will yet prove useless (strictly inferior to vector similarity search).
Yeah, I was considering how much to use. I had
I just didn't want to bother. I chose stack mostly for convenience. |
I did just run some benchmarks using the text8 dataset. The edit distance speed is by far not the most important factor for this algorithm. It appears to be mostly bound by the time it takes to generate the set of possible candidates. With some quick modifications I was able to achieve:
So in case there is a real use case for this, those performance optimizations are probably still worth the effort.
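The remark about candidate generation is easier to follow with the FastSS idea spelled out. In the sketch below, every word is indexed under all strings obtainable by deleting up to max_distance characters. A query generates its own deletion variants, collects the words sharing any variant, and only then verifies the survivors with a real edit-distance computation. This is a minimal sketch of the general technique under those assumptions, not the Cython code merged in this PR; all names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def levenshtein(a, b):
    """Plain dynamic-programming edit distance, used for verification."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def deletion_variants(word, max_distance):
    """All strings obtained from `word` by deleting up to max_distance characters."""
    variants = {word}
    for k in range(1, min(max_distance, len(word)) + 1):
        for positions in combinations(range(len(word)), k):
            removed = set(positions)
            variants.add("".join(ch for i, ch in enumerate(word) if i not in removed))
    return variants


def build_index(words, max_distance):
    """Map every deletion variant to the set of words that produce it."""
    index = defaultdict(set)
    for word in words:
        for variant in deletion_variants(word, max_distance):
            index[variant].add(word)
    return index


def query(index, word, max_distance):
    """Return {neighbor: distance} for indexed words within max_distance."""
    candidates = set()
    for variant in deletion_variants(word, max_distance):
        candidates |= index.get(variant, set())
    verified = {}
    for candidate in candidates:  # candidates are a superset, so verify each one
        distance = levenshtein(word, candidate)
        if distance <= max_distance:
            verified[candidate] = distance
    return verified


index = build_index(["cat", "cart", "car", "dog", "cot", "coat"], max_distance=1)
found = query(index, "cat", max_distance=1)
```

The number of variants per word grows combinatorially with max_distance and word length, which is consistent both with the memory appetite mentioned earlier in the thread and with candidate generation, not the final distance check, dominating the runtime.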
Nice! 10x is significant, definitely worth a discussion. How complex are the changes? |
If all goes well, I will be getting back to you with a draft for the demo by the end of this weekend. |
@piskvorky The summer is over, the researchers are returning from their vacations, and I have received information about the copyright of the books in our dataset of scanned OCR texts from the Hussite era (#3146 (comment)). Therefore, the legal hurdles in producing an information retrieval tutorial are behind us and we can now get to the practicals. Here is my plan of action for the rest of 2021:
I have produced a smaller dataset of OCR texts from books with expired copyright and released it into the public domain. I am afraid that the remaining items will have to wait until the next year. Happy holidays! |
Introduction
The LevenshteinSimilarityIndex term similarity index in the termsim.levenshtein module implements the lexical text similarity search technique described by Charlet and Damnati (2017) in their paper describing their winning system at SemEval-2017 Task 3: Community Question Answering.
We are showing a related semantic similarity search technique using the WordEmbeddingSimilarityIndex term similarity index in our Soft Cosine Similarity autoexample, which enjoys some popularity among our users. We would like to also advertise LevenshteinSimilarityIndex, which provides a different but equally useful kind of search. However, the current implementation uses brute-force kNN search over the Levenshtein distance to produce a term similarity matrix, which is so slow that it can take years to produce a matrix even for medium-sized corpora such as the English Wikipedia.
Following the discussion in #2541, @piskvorky and I implemented indexing using the FastSS algorithm for kNN search over the Levenshtein distance in hopes that this would speed LevenshteinSimilarityIndex up by at least three orders of magnitude (1,000×), so that it can compete with WordEmbeddingSimilarityIndex. As an added bonus, using the FastSS algorithm allows us to remove our external dependence on the python-Levenshtein library.
Speed comparison
Below, I will show a before-and-after speed comparison of LevenshteinSimilarityIndex compared to the standard WordEmbeddingSimilarityIndex shown in the Soft Cosine Similarity autoexample. We are measuring how many kNN searches per second, k = 100, a term similarity index can perform. To produce my dictionary (253,854 terms) and word embeddings, I will use the text8 corpus (100 MB). I am running the code on a Dell Inspiron 15 7559.
Before the change
We can see that even with our tiny corpus, the LevenshteinSimilarityIndex takes over three days to find the 100 nearest neighbors for all 253,854 terms in our vocabulary. Contrast this with the WordEmbeddingSimilarityIndex, which finishes in under four minutes even though we are using exact nearest-neighbor search and we could get further speed-up by using e.g. the Annoy index.
After the change
With the FastSS algorithm, the LevenshteinSimilarityIndex receives a 1,500× speed-up and is now not only not slower than the WordEmbeddingSimilarityIndex, but 1.5× faster. Both term similarity indexes now find the 100 nearest neighbors for all 253,854 terms in our vocabulary in under 4 minutes.
Conclusion
Using the FastSS algorithm for kNN search over the Levenshtein distance, we managed to increase the speed of the LevenshteinSimilarityIndex term similarity index by three orders of magnitude (1,500×) on the text8 corpus. As an added bonus, using the FastSS algorithm allowed us to remove our external dependence on the python-Levenshtein library.
Closes #2541.