Use FastSS for fast kNN over Levenshtein distance #3146

Merged (27 commits) on May 20, 2021

Witiko
Contributor

@Witiko Witiko commented May 15, 2021

Introduction

The LevenshteinSimilarityIndex term similarity index in the termsim.levenshtein module implements the lexical text similarity search technique described by Charlet and Damnati (2017) in their paper describing their winning system at SemEval-2017 Task 3: Community Question Answering.

We are showing a related semantic similarity search technique using the WordEmbeddingSimilarityIndex term similarity index in our Soft Cosine Similarity autoexample, which enjoys some popularity among our users. We would like to also advertise LevenshteinSimilarityIndex, which provides a different but equally useful kind of search. However, the current implementation uses brute-force kNN search over the Levenshtein distance to produce a term similarity matrix, which is so slow that it can take years to produce a matrix even for medium-sized corpora such as the English Wikipedia.

Following the discussion in #2541, @piskvorky and I implemented indexing using the FastSS algorithm for kNN search over the Levenshtein distance in hopes that this would speed LevenshteinSimilarityIndex up by at least three orders of magnitude (1,000×), so that it can compete with WordEmbeddingSimilarityIndex. As an added bonus, using the FastSS algorithm allows us to remove our external dependence on the python-Levenshtein library.
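To make the idea concrete, here is a minimal pure-Python sketch of a FastSS-style index. It is not the code from this PR: the names `ToyFastSS` and `deletion_variants` are made up for illustration, and the sketch assumes the usual formulation of FastSS, in which every word is indexed under all strings obtained by deleting up to `max_dist` characters, and candidates sharing a deletion variant with the query are then verified with the real edit distance.

```python
from collections import defaultdict
from itertools import combinations


def deletion_variants(word, max_dist):
    """Return the word plus every string obtained by deleting up to max_dist characters."""
    variants = {word}
    for k in range(1, min(max_dist, len(word)) + 1):
        for kept in combinations(range(len(word)), len(word) - k):
            variants.add(''.join(word[i] for i in kept))
    return variants


def levenshtein(a, b):
    """Plain dynamic-programming edit distance, used only to verify candidates."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


class ToyFastSS:
    """Hypothetical toy index: deletion variant -> set of dictionary words."""

    def __init__(self, words, max_dist=2):
        self.max_dist = max_dist
        self.index = defaultdict(set)
        for word in words:
            for variant in deletion_variants(word, max_dist):
                self.index[variant].add(word)

    def query(self, word):
        # Two words within max_dist edits always share at least one deletion variant,
        # so only the words stored under the query's variants need to be verified.
        candidates = set()
        for variant in deletion_variants(word, self.max_dist):
            candidates |= self.index.get(variant, set())
        scored = ((levenshtein(word, c), c) for c in candidates)
        return sorted((d, c) for d, c in scored if d <= self.max_dist)


index = ToyFastSS(['fast', 'fest', 'feast', 'vast', 'slow'], max_dist=2)
print(index.query('fast'))
```

The trade-off is memory: the index stores many variants per word, which is why the number of deletions has to stay small.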

Speed comparison

Below, I show a before-and-after speed comparison of LevenshteinSimilarityIndex against the standard WordEmbeddingSimilarityIndex shown in the Soft Cosine Similarity autoexample. We measure how many kNN searches per second (k = 100) a term similarity index can perform. To produce my dictionary (253,854 terms) and word embeddings, I use the text8 corpus (100 MB). I am running the code on a Dell Inspiron 15 7559.

Before the change

We can see that even with our tiny corpus, the LevenshteinSimilarityIndex would take over three days (an estimated 80 hours) to find the 100 nearest neighbors for all 253,854 terms in our vocabulary. Contrast this with the WordEmbeddingSimilarityIndex, which finishes in just over four minutes even though we are using exact nearest-neighbor search and could get a further speed-up by using e.g. the Annoy index.

$ pip install gensim==4.0.1 python-Levenshtein
$ wget http://mattmahoney.net/dc/text8.zip
$ unzip text8.zip
$ python
>>> from gensim.corpora import Dictionary
>>> from gensim.models.word2vec import LineSentence, Word2Vec
>>> from gensim.similarities import (
...     SparseTermSimilarityMatrix,
...     WordEmbeddingSimilarityIndex,
...     LevenshteinSimilarityIndex,
... )
>>> 
>>> corpus = LineSentence('text8')
>>> dictionary = Dictionary(corpus)
>>> w2v_model = Word2Vec(sentences=corpus)
>>> embedding_index = WordEmbeddingSimilarityIndex(w2v_model.wv)
>>> levenshtein_index = LevenshteinSimilarityIndex(dictionary)
>>>
>>> SparseTermSimilarityMatrix(embedding_index, dictionary)
100%|███████████████████████████████| 253854/253854 [04:04<00:00, 1037.97it/s]
>>> SparseTermSimilarityMatrix(levenshtein_index, dictionary)
  0%|                               | 124/253854 [02:24<80:18:05,  1.14s/it]
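For reference, brute-force kNN over the Levenshtein distance amounts to something like the sketch below. This is an illustration under my own naming (`brute_force_knn` is hypothetical), not the code shipped in gensim 4.0.1. Every query computes one edit distance per vocabulary term, so filling the whole matrix for 253,854 terms needs roughly 253,854² ≈ 6.4 × 10¹⁰ distance computations.

```python
import heapq


def levenshtein(a, b):
    """Plain dynamic-programming edit distance with two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def brute_force_knn(query, vocabulary, k=100):
    """Return the k terms closest to the query as (distance, term) pairs."""
    # One distance computation per vocabulary term, for every single query.
    scored = ((levenshtein(query, term), term) for term in vocabulary if term != query)
    return heapq.nsmallest(k, scored)


vocabulary = ['kitten', 'sitten', 'sitting', 'mitten', 'dog']
print(brute_force_knn('kitten', vocabulary, k=2))
```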

After the change

With the FastSS algorithm, the LevenshteinSimilarityIndex receives a 1,500× speed-up and is now not merely as fast as the WordEmbeddingSimilarityIndex, but 1.5× faster. Both term similarity indexes now find the 100 nearest neighbors for all 253,854 terms in our vocabulary in under four minutes.

$ pip install lexpy git+https://github.com/witiko/gensim@7054f90
$ python
>>> from gensim.corpora import Dictionary
>>> from gensim.models.word2vec import LineSentence, Word2Vec
>>> from gensim.similarities import (
...     SparseTermSimilarityMatrix,
...     WordEmbeddingSimilarityIndex,
...     LevenshteinSimilarityIndex,
... )
>>> 
>>> corpus = LineSentence('text8')
>>> dictionary = Dictionary(corpus)
>>> w2v_model = Word2Vec(sentences=corpus)
>>> embedding_index = WordEmbeddingSimilarityIndex(w2v_model.wv)
>>> levenshtein_index = LevenshteinSimilarityIndex(dictionary)
>>>
>>> SparseTermSimilarityMatrix(embedding_index, dictionary)
100%|███████████████████████████████| 253854/253854 [03:57<00:00, 1070.14it/s]
>>> SparseTermSimilarityMatrix(levenshtein_index, dictionary)
100%|███████████████████████████████| 253854/253854 [02:34<00:00, 1639.23it/s]

Conclusion

Using the FastSS algorithm for kNN search over the Levenshtein distance, we managed to increase the speed of the LevenshteinSimilarityIndex term similarity index by three orders of magnitude (1,500×) on the text8 corpus. As an added bonus, using the FastSS algorithm allowed us to remove our external dependence on the python-Levenshtein library. Closes #2541.

@Witiko Witiko marked this pull request as draft May 15, 2021 20:50
@Witiko Witiko force-pushed the levenshtein-ball-tree branch from 58efeef to d680af4 Compare May 15, 2021 20:52
@Witiko Witiko changed the title from "Use sklearn.neighbors.BallTree for fast kNN over Levenshtein distance" to "Use sklearn.neighbors.VPTree for fast kNN over Levenshtein distance" May 15, 2021
@Witiko Witiko changed the title from "Use sklearn.neighbors.VPTree for fast kNN over Levenshtein distance" to "Use VP-Tree for fast kNN over Levenshtein distance" May 15, 2021
@piskvorky
Owner

@Witiko what's the result of the benchmark, what am I looking at here?

>>> SparseTermSimilarityMatrix(embedding_index, dictionary)
100%|████████████████████████████| 10781/10781 [00:01<00:00, 6401.08it/s]

@Witiko
Contributor Author

Witiko commented May 15, 2021

We are going over all 10,781 words in a dictionary and looking for the 100 nearest neighbors of every word.
WordEmbeddingSimilarityIndex is able to do 6400 kNN searches per second.

@Witiko Witiko force-pushed the levenshtein-ball-tree branch 3 times, most recently from b18d608 to bf904eb Compare May 15, 2021 23:36
@Witiko Witiko changed the title from "Use VP-Tree for fast kNN over Levenshtein distance" to "Use DAWG for fast kNN over Levenshtein distance" May 15, 2021
@Witiko Witiko force-pushed the levenshtein-ball-tree branch from 4294577 to df261b3 Compare May 15, 2021 23:48
@Witiko Witiko force-pushed the levenshtein-ball-tree branch 2 times, most recently from 8dcf88d to e24e3e9 Compare May 16, 2021 00:09
@Witiko
Contributor Author

Witiko commented May 16, 2021

Following our discussion in #2541 (comment), I implemented the DAWG for approximate kNN search over the Levenshtein distance.

Speed comparison

Let's see how the different techniques measure up. This time, I used the ca 200× larger text8 dataset (100 MB), which should be more representative. The rest of the experimental setup is unaltered from #3146 (comment).

Brute-force kNN search

$ pip install gensim==4.0.1 python-Levenshtein
$ wget http://mattmahoney.net/dc/text8.zip
$ unzip text8.zip
$ python
>>> from gensim.corpora import Dictionary
>>> from gensim.models.word2vec import LineSentence, Word2Vec
>>> from gensim.similarities import (
...     SparseTermSimilarityMatrix,
...     WordEmbeddingSimilarityIndex,
...     LevenshteinSimilarityIndex,
... )
>>> 
>>> corpus = LineSentence('text8')
>>> dictionary = Dictionary(corpus)
>>> w2v_model = Word2Vec(sentences=corpus)
>>> embedding_index = WordEmbeddingSimilarityIndex(w2v_model.wv)
>>> levenshtein_index = LevenshteinSimilarityIndex(dictionary)
>>>
>>> SparseTermSimilarityMatrix(embedding_index, dictionary)
  4%|████▌                       | 9949/253854 [00:33<13:51, 293.16it/s]
>>> SparseTermSimilarityMatrix(levenshtein_index, dictionary)
  0%|                            | 20/253854 [00:22<79:30:25,  1.13s/it]

VP-Tree index

$ pip install vptree git+https://github.com/witiko/gensim@af5833d
$ python
>>> # Same as above
>>>
>>> SparseTermSimilarityMatrix(levenshtein_index, dictionary)
  0%|                            | 20/253854 [00:17<61:20:21,  1.15it/s]

DAWG index

$ pip install lexpy git+https://github.com/witiko/gensim@fb98a435
$ python
>>> # Same as above
>>>
>>> levenshtein_index = LevenshteinSimilarityIndex(dictionary, max_distance=1)
>>> SparseTermSimilarityMatrix(levenshtein_index, dictionary)
  4%|████▌                       | 9993/253854 [01:55<44:54, 90.50it/s]
>>>
>>> levenshtein_index = LevenshteinSimilarityIndex(dictionary, max_distance=2)
>>> SparseTermSimilarityMatrix(levenshtein_index, dictionary)
  0%|                            | 20/253854 [00:02<6:36:07, 10.67it/s]
>>>
>>> levenshtein_index = LevenshteinSimilarityIndex(dictionary, max_distance=3)
>>> SparseTermSimilarityMatrix(levenshtein_index, dictionary)
  0%|                            | 20/253854 [00:08<28:04:10,  2.51it/s]
>>>
>>> levenshtein_index = LevenshteinSimilarityIndex(dictionary, max_distance=10)
>>> SparseTermSimilarityMatrix(levenshtein_index, dictionary)
  0%|                            | 20/253854 [00:56<198:45:19,  2.82s/it]

Conclusion

Compared to the brute force kNN and the VP-Tree metric index, the DAWG approaches the speed of the word embedding kNN retrieval at the expense of only retrieving the terms with the Levenshtein distance 1 from the query. Retrieving terms with the Levenshtein distance 10 or less from the query using the DAWG is already slower than brute-force, but that's unlikely to be a problem with larger-than-toy dictionaries.

@Witiko Witiko force-pushed the levenshtein-ball-tree branch from e24e3e9 to fb98a43 Compare May 16, 2021 00:23
@Witiko Witiko marked this pull request as ready for review May 16, 2021 01:06
@piskvorky
Owner

piskvorky commented May 16, 2021

@Witiko I tried with the first Python implementation of automata that Google gave me:

max_distance=1:
100%| | 10781/10781 [00:34<00:00, 310.78it/s]
2021-05-16 03:38:35,834 : INFO : constructed a sparse term similarity matrix with 0.025356% density
CPU times: user 34.6 s, sys: 65.1 ms, total: 34.6 s
Wall time: 34.7 s

max_distance=2:
100%| | 10781/10781 [01:57<00:00, 91.82it/s]
2021-05-16 03:35:15,161 : INFO : constructed a sparse term similarity matrix with 0.124872% density
CPU times: user 1min 56s, sys: 283 ms, total: 1min 56s
Wall time: 1min 57s

This implementation seems pure-Python, so I expect 10-100x speed-up after compiling / optimizing.

Which doesn't seem too bad… am I missing something? Why do you consider this problem so hard?

@Witiko
Contributor Author

Witiko commented May 16, 2021

Which doesn't seem too bad… am I missing something?

Very nice! The DAWG implementation from lexpy is also pure Python, but your implementation is a fair bit faster! 🙂
Sadly, it does not seem published to pypi, so it's difficult to use as a dependency and we would need to adopt it.

Why do you consider this problem so hard?

In my experience, kNN over the Levenshtein distance is difficult to optimize. I stand corrected.

I expect 10-100x speed-up after compiling / optimizing.

I regret I can't volunteer the time. The best I can offer is code with an external dependency.

@Witiko Witiko marked this pull request as draft May 16, 2021 09:21
@Witiko
Contributor Author

Witiko commented May 16, 2021

[...] your implementation is a fair bit faster! 🙂

@piskvorky Actually, have you tried comparing your speed with the WordEmbeddingSimilarityIndex?
You may well be using a faster machine and our results would then be incomparable.

@piskvorky
Owner

piskvorky commented May 16, 2021

I regret I can't volunteer the time. The best I can offer is code with an external dependency.

A pair of eyes to sanity-check the solution would be even better :) I just plugged the code in, didn't even look at the results: 4eeeae6

@Witiko could you double check 4eeeae6 please? You're much more involved in this code than I am.

Sadly, it does not seem published to pypi, so it's difficult to use as a dependency and we would need to adopt it.

That particular automaton lib has an MIT license, so IMO it'd make more sense to include it in Gensim directly, rather than as an external dependency. Especially since it's not even a package / on PyPI.

But two points:

  1. How important is this whole LevenshteinSimilarityIndex functionality? What are its use-cases, motivation, user base? Basically, is it worth all the hassle.
  2. I'd hope a more thorough googling will reveal a better (faster, packaged, maintained) implementation. Or failing that, we could code it up ourselves – I didn't check in detail but the algo seems straightforward: prebuild a trie of the static dictionary and then intersect it dynamically with each automaton built from the dynamic query string.
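The trie idea from point 2 can be sketched as follows. This is a rough illustration, not the pyffs or lexpy implementation, and the names `TrieNode`, `build_trie` and `search` are hypothetical. Instead of building an explicit Levenshtein automaton, the sketch carries one dynamic-programming row down the trie. The row plays the role of the automaton state, and a branch is pruned as soon as no entry of the row is within `max_dist`.

```python
class TrieNode:
    __slots__ = ('children', 'word')

    def __init__(self):
        self.children = {}
        self.word = None  # set to the full word on nodes that end a dictionary word


def build_trie(words):
    """Prebuild the trie once from the static dictionary."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word
    return root


def search(root, query, max_dist):
    """Return (distance, word) pairs for all words within max_dist edits of the query."""
    results = []
    # Each stack entry pairs a trie node with the DP row for the prefix it spells.
    stack = [(root, list(range(len(query) + 1)))]
    while stack:
        node, row = stack.pop()
        if node.word is not None and row[-1] <= max_dist:
            results.append((row[-1], node.word))
        for ch, child in node.children.items():
            new_row = [row[0] + 1]
            for j, qc in enumerate(query, 1):
                new_row.append(min(new_row[j - 1] + 1, row[j] + 1, row[j - 1] + (qc != ch)))
            # Row minima never decrease along a path, so dead branches can be dropped.
            if min(new_row) <= max_dist:
                stack.append((child, new_row))
    return sorted(results)


trie = build_trie(['fast', 'fest', 'feast', 'vast', 'slow'])
print(search(trie, 'fast', max_dist=1))
```

The query here can be any string, not just a dictionary word, which matches the more general setting discussed in this thread.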

On the theoretical side, we should be able to do even better, because this algo doesn't take into account the fact that the query words are the words from the dictionary itself. But that seems a very useful fact. This algo is more general, the query can be anything, but that also means the algo is not as optimal as it could be.

But before the theoretical algo optimization, let's clear up its practical value / potential impact.

@piskvorky
Owner

piskvorky commented May 16, 2021

Actually, have you tried comparing your speed with the WordEmbeddingSimilarityIndex?

Good idea, to anchor / sync our results.

My machine seems actually ~30% slower than yours (measured at 4eeeae6):

[Screenshot: benchmark output, taken 2021-05-16 12:38:40]

@Witiko
Contributor Author

Witiko commented May 16, 2021

@piskvorky I compare 4eeeae6 (yours) and fb98a43 (mine) on text8 below.

$ git clone https://github.com/antoinewdg/pyffs
$ cd pyffs
$ mkdir generated
$ pip install git+https://github.com/RaRe-Technologies/gensim@4eeeae6
$ python
>>> from gensim.corpora import Dictionary
>>> from gensim.models.word2vec import LineSentence
>>> from gensim.similarities import (
...     SparseTermSimilarityMatrix,
...     LevenshteinSimilarityIndex,
... )
>>> 
>>> corpus = LineSentence('text8')
>>> dictionary = Dictionary(corpus)
>>>
>>> levenshtein_index = LevenshteinSimilarityIndex(dictionary, max_distance=2)
>>> SparseTermSimilarityMatrix(levenshtein_index, dictionary)
  0%|                            | 178/253854 [00:07<2:53:27, 24.37it/s] 

$ pip install lexpy git+https://github.com/witiko/gensim@fb98a435
$ python
>>> # Same as above
>>>
>>> levenshtein_index = LevenshteinSimilarityIndex(dictionary, max_distance=2)
>>> SparseTermSimilarityMatrix(levenshtein_index, dictionary)
  0%|                            | 20/253854 [00:02<6:36:07, 10.67it/s]

The speed difference is not so large. Perhaps not large enough to justify using a library outside PyPi that requires disk storage?

@Witiko
Contributor Author

Witiko commented May 16, 2021

That particular automaton lib has an MIT license, so IMO it'd make more sense to include it in Gensim directly, rather than as an external dependency. Especially since it's not even a package / on PyPI.

This is not an issue if we use the lexpy implementation from this PR.

But two points:

How important is this whole LevenshteinSimilarityIndex functionality? What are its use-cases, motivation, user base? Basically, is it worth all the hassle.

At the moment, it is only used to produce a word similarity matrix for soft similarity search. However, users may also use it in isolation for their own term similarity queries.

I'd hope a more thorough googling will reveal a better (faster, packaged, maintained) implementation. Or failing that, we could code it up ourselves – I didn't check in detail but the algo seems straightforward: prebuild a trie of the static dictionary and then intersect it dynamically with each automaton built from the dynamic query string.

That is true, but it is sadly not something I can devote my time to at the moment.

On the theoretical side, we should be able to do even better, because this algo doesn't take into account the fact that the query words are the words from the dictionary itself. But that seems a very useful fact. This algo is more general, the query can be anything, but that also means the algo is not as optimal as it could be.

This is not enforced by the TermSimilarityIndex class and in general, we may receive words from outside the dictionary. This increases the usefulness of the standalone term similarity index.

@piskvorky
Owner

piskvorky commented May 16, 2021

The speed difference is not so large. Perhaps not large enough to justify using a library outside PyPi that requires disk storage?
I am thinking of adding a couple of tunables

2.5x faster is worth some effort.

But I'm still not clear about the impact of the whole thing. @Witiko can you think of some kick ass demonstration for this functionality? Something to drive home its usefulness, make people excited it exists?

Because I have doubts anyone is using LevenshteinSimilarityIndex at all now, and optimizing something nobody uses is a waste of time.

I mean it would be fun to optimize, but it looks like neither of us has the capacity. So either we optimize with a splash (blog post, promo article, demo), or not at all.

@piskvorky
Owner

piskvorky commented May 16, 2021

I compiled pyffs with cython (no tweaking, just a dumb compile) and got from 15it/s to 20it/s => 33% improvement.

Which is not much. So optimizing wouldn't be as trivial as running Cython over the existing Python code, we'd have to look deeper & profile. If worth at all, that is.

@gojomo
Collaborator

gojomo commented May 19, 2021

The materialize/vectorize_dictionary functionality seems interesting for pre-composing a mix (or subset) of in-vocab & OOV word vectors for later steps; I agree it'd make sense as a FTKV utility function.

But, as it's really just using a set of lookup keys, no reason to limit the interface to a gensim.Dictionary – it might as well work with a list (or set, etc), and let the caller pull the keys from their Dictionary or dict or other source as needed. And thus I might name the method vectors_for_all() or keyedvectors_for_all().

I think gradually adding the new type-hints, where easy to do so & in line with the contributors' style/preferences, is OK. The official Gensim project style might mention type hints as "welcome where helpful but not currently required".

@piskvorky
Owner

piskvorky commented May 19, 2021

smaller dictionaries such as text8

We used no pruning there, so the dictionary is actually ~1/4 million unique words (word types). I'd say that's excessive for most applications, and would generally recommend culling to ~50k or so. So, not "smaller".

@Witiko
Contributor Author

Witiko commented May 19, 2021

We used no pruning there, so the dictionary is actually ~1/4 million unique words (word types). I'd say that's excessive for most applications, and would generally recommend culling to ~50k or so. So, not "smaller".

@piskvorky That is often true in the general case, but when we work with larger word embedding vocabs (2 million word types seem standard), using a larger dictionary can be useful to make the most of our word embeddings. With respect to OCR text retrieval, the typos are also going to increase the size of our dictionary, so we need to be careful with pruning if we wish to capture the typos in the word similarity matrix and still keep the interesting RaRe words.

@Witiko
Contributor Author

Witiko commented May 19, 2021

But, as it's really just using a set of lookup keys, no reason to limit the interface to a gensim.Dictionary – it might as well work with a list (or set, etc), and let the caller pull the keys from their Dictionary or dict or other source as needed. And thus I might name the method vectors_for_all() or keyedvectors_for_all().

@gojomo Gotcha. We still need to know the number of words, so it should not be general iterables, but we can work with anything that is sized. (Sadly, there does not seem to be an easy way to type this; we can either have Sized or Iterable[str], but not both. There is some talk about adding an Intersection type: words: Sized & Iterable[str], but no decision yet: python/typing#213).
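One possible workaround while there is no Intersection type is a small structural protocol that requires both `__len__` and `__iter__`. The sketch below is only an illustration: the names `SizedIterable` and `describe` are hypothetical, `typing.Protocol` needs Python 3.8 or newer (which may be newer than what Gensim supports), and `typing.Collection[str]` would be a stricter alternative that also demands `__contains__`.

```python
from typing import Iterator, List, Protocol, Tuple, runtime_checkable


@runtime_checkable
class SizedIterable(Protocol):
    """Anything that can report its length and be iterated over as strings."""

    def __len__(self) -> int: ...

    def __iter__(self) -> Iterator[str]: ...


def describe(words: SizedIterable) -> Tuple[int, List[str]]:
    # len() is what a plain Iterable[str] could not guarantee.
    return len(words), sorted(words)


print(describe(['banana', 'apple']))
print(isinstance(iter(['apple']), SizedIterable))  # an iterator has no __len__
```

Note that the `isinstance` check only looks for the presence of the two methods; it does not verify that the items are strings.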

@piskvorky
Owner

piskvorky commented May 19, 2021

typos are also going to increase the size of our dictionary, so we need to be careful with pruning if we wish to capture the typos

If that's the case, we have to be careful about RAM. FastSS is quite memory hungry.

FYI, since I already Cythonized FastSS, I also rewrote the Levenshtein distance function in C. Mostly for my pleasure – the impact of this optimization on the overall TermSimilarity runtime is negligible (benchmark), because the previous version was already fast.

But gensim.similarities.levenshtein.editdist is now the fastest Python implementation that I know of. Faster than Levenshtein.distance from the python-Levenshtein package, especially if you set the max_dist=N early-out parameter when you don't care about large distances. python-Levenshtein doesn't even have that parameter.
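To illustrate what an early-out parameter buys, here is a pure-Python sketch. It is not the C implementation of `editdist` from this PR: the function name `editdist_capped` is hypothetical, and returning `max_dist + 1` for "too far" is an assumed convention. The sketch relies on the fact that the minimum of each dynamic-programming row never decreases from one row to the next, so the computation can stop once a whole row exceeds the cap.

```python
def editdist_capped(a, b, max_dist):
    """Return the Levenshtein distance of a and b, or max_dist + 1 if it exceeds max_dist."""
    # The length difference is a lower bound on the distance.
    if abs(len(a) - len(b)) > max_dist:
        return max_dist + 1
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        # Early out: no later row can contain a value below this row's minimum.
        if min(curr) > max_dist:
            return max_dist + 1
        prev = curr
    return min(prev[-1], max_dist + 1)


print(editdist_capped('kitten', 'sitting', max_dist=5))
print(editdist_capped('kitten', 'sitting', max_dist=2))
```

With a cap of 5 the exact distance is returned; with a cap of 2 the function only reports that the distance is larger than 2.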

@piskvorky piskvorky force-pushed the levenshtein-ball-tree branch 2 times, most recently from 5481288 to 0fcfae8 Compare May 20, 2021 09:13
@piskvorky piskvorky force-pushed the levenshtein-ball-tree branch from 0fcfae8 to 7655d75 Compare May 20, 2021 09:58
@piskvorky
Owner

piskvorky commented May 20, 2021

Looks like there are no reviews forthcoming, so let me merge this.

We can discuss the promo for this functionality (article, demo) separately.

@maxbachmann

maxbachmann commented May 21, 2021

But gensim.similarities.levenshtein.editdist is now the fastest Python implementation that I know of. Faster than Levenshtein.distance from the python-Levenshtein package, especially if you set the max_dist=N early-out parameter when you don't care about large distances

@piskvorky Which test corpus did you use to reach this conclusion? At least on random strings of equal length without a max_dist I could not reproduce your results. In my tests the gensim implementation appears to perform worse than python-Levenshtein, which again performs a lot worse than some other implementations like the one in RapidFuzz.

I performed the following test:

setup ="""
from rapidfuzz import string_metric
import Levenshtein
import gensim.similarities
import string
import random
random.seed(18)
characters = string.ascii_letters + string.digits + string.whitespace + string.punctuation
a = ''.join(random.choice(characters) for _ in range({0}))
b_list = [''.join(random.choice(characters) for _ in range({0})) for _ in range({1})]
"""

lengths = list(range(1,256,2))
count = 200

time_gensim = benchmark("gensim",
        '[gensim.similarities.levenshtein.editdist(a, b) for b in b_list]',
        setup, lengths, count)

time_python_levenshtein = benchmark("python-Levenshtein",
        '[Levenshtein.distance(a, b) for b in b_list]',
        setup, lengths, count)

time_rapidfuzz = benchmark("rapidfuzz",
        '[string_metric.levenshtein(a, b) for b in b_list]',
        setup, lengths, count)

with the following results:

[Figure: benchmark results plot]

So I am quite interested in the kind of benchmarks you performed

@piskvorky
Owner

piskvorky commented May 21, 2021

The benchmark code is included in the same place as its results: #3146 (comment). The last item is the final version = what we merged in the end.

The merged code gets ~1318it/s; replacing editdist with python-Levenshtein's distance gets 1208it/s = a few percent slower^. But that's because the candidate strings that come out of FastSS are already "close", so the max_distance clipping doesn't do much. The difference thanks to clipping should be much more pronounced for arbitrary inputs.

I never heard of rapidfuzz, I'll check it out, thanks. Someone also pointed me to polyleven, another alternative package.

^ EDITED: I originally posted a wrong number here (forgot to recompile fastss before running the python-Levenshtein benchmark).

@piskvorky
Owner

piskvorky commented May 21, 2021

@maxbachmann I plugged in levenshtein(max=max_dist) from rapidfuzz: 1230it/s. Pretty good too. Any of these would be a great choice TBH, but having the implementation self-contained (no external dependencies) is a big plus of the merged version. This PR actually removed Gensim's existing dependency on python-Levenshtein :)

@maxbachmann

maxbachmann commented May 21, 2021

I plugged in levenshtein(max=max_dist) from rapidfuzz: 1230it/s. Pretty good too. Any of these would be a great choice TBH, but having the implementation self-contained (no external dependencies) is a big plus of the merged version.

Yes, that's definitely a big plus. AFAIK all these solutions are pretty much bound by the time it takes to call the functions. In RapidFuzz (I am the author) I have some functions which calculate the similarity of one string to multiple other strings. This can be more than 10 times as fast due to the following two factors:

  1. It saves repeated Python function calls and conversions between Python and C/C++ types, which take a lot of time in these fast algorithms.
  2. The fastest Levenshtein algorithms are based on bit parallelism. They create bitvectors for one string and then use bitwise operations to calculate the similarity to the second string in O(⌈N/w⌉M) time, with w as the word size, which is 64 (higher if the implementation makes use of SIMD). So when comparing one string to multiple strings it is possible to reuse the bitvectors, which can save a significant amount of time.
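The bit-parallel approach from the second point can be sketched in pure Python. This is an illustration, not RapidFuzz code: the function names are hypothetical, the recurrence follows the bit-vector formulation usually attributed to Myers and Hyyrö as I understand it, and Python's arbitrary-precision integers stand in for the ⌈N/w⌉ machine words, so the sketch shows the logic rather than the speed.

```python
def build_bitvectors(pattern):
    """Map each character to a bitmask of the positions where it occurs in the pattern."""
    peq = {}
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)
    return peq


def bitparallel_levenshtein(pattern, text, peq=None):
    """Levenshtein distance computed one text character (one DP column) at a time."""
    m = len(pattern)
    if m == 0:
        return len(text)
    if peq is None:
        peq = build_bitvectors(pattern)
    mask = (1 << m) - 1
    last = 1 << (m - 1)
    pv, mv = mask, 0  # vertical +1 / -1 differences of the current DP column
    score = m         # bottom cell of the current DP column
    for ch in text:
        eq = peq.get(ch, 0)
        xv = eq | mv
        xh = (((eq & pv) + pv) ^ pv) | eq
        ph = (mv | ~(xh | pv)) & mask  # horizontal +1 differences
        mh = pv & xh                   # horizontal -1 differences
        if ph & last:
            score += 1
        elif mh & last:
            score -= 1
        ph = ((ph << 1) | 1) & mask
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
    return score


query = 'kitten'
peq = build_bitvectors(query)  # built once, reused for every candidate
print([bitparallel_levenshtein(query, t, peq) for t in ['sitting', 'mitten', 'kitten']])
```

Passing the precomputed `peq` mirrors the reuse of bitvectors described above when one string is compared against many.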

The benchmark code is included in the same place as its results: #3146 (comment). The last item is the final version = what we merged in the end.

I have completely missed the button to open those 60 comments in between ... Thanks for pointing me towards the correct comment

Btw is there any accepted standard for benchmarks in this area? I personally mostly test on randomized strings with a similar length (simply because I do not know any better approach). However I often find papers which e.g. propose filters for string similarity based on the length (which are generally a valid approach, since they are fast to calculate). In those the authors often choose datasets with big length differences. Then their algorithm suddenly performs better than the current best algorithm for one specific similarity threshold in which more than 99% of the elements can be filtered out in constant time. Apparently my way of benchmarking does not really represent real datasets either, since they will usually contain strings of different lengths whose characters are not uniformly random (some characters are more common than others).

I am asking because I developed an implementation of a string matching algorithm, which is significantly faster than the current implementation (O(⌈N/w⌉M) instead of O(N*M)) and started to write a paper on it. However looking through different papers I could not really find a common test they choose. It appears that each of the papers tests datasets until they find one that makes their algorithm appear slightly faster than the alternatives (or tweaks their algorithm to fit this one specific dataset). Since I do not have too much knowledge in this space, I would like to make sure I am not doing the same. Especially since I write the paper in my free time simply because I believe that it is a big improvement, while I would not bother about it for a couple of percent improvement on a specific dataset ;)

@piskvorky
Owner

piskvorky commented May 21, 2021

I checked both polyleven and rapidfuzz and found the algos really exciting (and their implementations impressive). I wasn't aware so much effort went into this field. Now I'm glad I qualified my "this is the fastest implementation" with "that I know of" :-)

I don't know about datasets for benchmarking edit distance specifically. My preference is always to start at the top (find a problem worth solving) and optimize down from there. So the choice of algorithms and datasets follows the problem – such as the FastSS algo + vocab from the English Wikipedia that fell out of this PR. I find it puts big-O claims and solution constraints into perspective, because the context and constants always matter.

In other words, sorry, I don't know that much about the string matching space in general :-) Maybe @Witiko could help.

I have some functions which calculate the similarity of one string to multiple other strings

Which function is that?

@maxbachmann

maxbachmann commented May 22, 2021

I checked both polyleven and rapidfuzz and found the algos really exciting (and their implementations impressive)

They both use very similar implementations. E.g. I took the mbleven idea from polyleven. Since RapidFuzz uses C++ I was able to make use of templates for more specialised algorithms e.g. when strings only use extended ASCII.

In other words, sorry, I don't know that much about the string matching space in general :-) Maybe @Witiko could help.

Me neither. I started to work on this when I used FuzzyWuzzy in an MIT-licensed project and realised it was GPL-licensed. Then the performance junkie in me was forcing me to optimize the algorithm further 😅

Which function is that?

So far there is process.extract, which will return the best matches up to a limit provided by the user, and process.extractOne, which will return the result with the best similarity. I will probably add more functions to compare lists of strings to each other and return result numpy matrices. Note that process.extract will get relatively slow when a lot of results are requested. The reason for this is that I return the results as a list of tuples (because I re-implemented the API of FuzzyWuzzy). This list creation takes far longer than all string comparisons combined (I will probably return a numpy array after the next major release, which would fix this). Using random 10-character ASCII strings I am able to process 6.5 million strings per second when I directly call the string metric for each string, while I can process around 70 million strings per second using the processor functions.

@maxbachmann

maxbachmann commented May 22, 2021

@piskvorky Rereading the implementation I think it should be possible to further improve the performance of FastSS by quite a bit without any major code changes and will test them this evening.

Edit: is it fine to use C++ when it allows a cleaner implementation of things than C? https://github.com/RaRe-Technologies/gensim/blob/2feef89a24c222e4e0fc6e32ac7c6added752c26/gensim/models/word2vec_corpusfile.pxd already uses C++ so I think this should be fine.

Also are you sure

DEF MAX_WORD_LENGTH = 10000  # Maximum allowed word length, in characters. Must fit in the C `int` range.

WIDTH row1[MAX_WORD_LENGTH + 1];
WIDTH row2[MAX_WORD_LENGTH + 1];

really better than dynamic allocation? Allocating 80kb of Stack appears like a pretty bad idea.
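The 80 kB figure can be checked with a few lines of arithmetic. The sketch assumes that `WIDTH` is a C `int` (4 bytes on common platforms) and that exactly the two rows shown above live on the stack; the helper name `two_row_buffer_bytes` is made up for illustration.

```python
import ctypes


def two_row_buffer_bytes(max_word_length, width_bytes):
    """Bytes taken by two DP rows of max_word_length + 1 cells each."""
    return 2 * (max_word_length + 1) * width_bytes


int_bytes = ctypes.sizeof(ctypes.c_int)    # assumption: WIDTH is a C int
char_bytes = ctypes.sizeof(ctypes.c_ubyte)

print(two_row_buffer_bytes(10000, int_bytes))  # about 80 kB where int is 4 bytes
print(two_row_buffer_bytes(250, char_bytes))   # a few hundred bytes with unsigned char cells
```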

@piskvorky
Owner

piskvorky commented May 22, 2021

Thanks for the offer! The next big move is on @Witiko and his team now, motivating this functionality with an impactful demo / promo. Further performance optimizations are fun but diminishing returns.

Also, I suspect getting deeper into the app side may open up new opportunities for optimization. Because we'll better understand the required / desired parameter choices, the problem space. Maybe the whole thing will yet prove useless (strictly inferior to vector similarity search).

Allocating 80kb of Stack appears like a pretty bad idea.

Yeah, I was considering how much to use. I had MAX_WORD_LENGTH=250 and WIDTH=unsigned char here originally, but 10k and int seemed OK too. But I agree it's pushing it. We may reduce to MAX_WORD_LENGTH=1000 or something if that proves a problem :)

really better than dynamic allocation?

I just didn't want to bother. I chose stack mostly for convenience.

@maxbachmann

Thanks for the offer! The next big move is on @Witiko and his team now, motivating this functionality with an impactful demo / promo. Further performance optimizations are fun but diminishing returns.

I did just run some benchmarks using the text8 dataset. The edit distance speed is by far not the most important factor for this algorithm. It appears to be mostly bound by the time it takes to generate the set of possible candidates. With some quick modifications I was able to achieve:

  • around a 10x performance improvement (on my machine from ~1700it/s to ~17000it/s)
  • slightly lower memory usage (the Python console that ran the benchmark required 3.2 GB instead of 3.3 GB)

So in case there is a real use case for this, those performance optimizations are probably still worth the effort.

@piskvorky
Owner

Nice! 10x is significant, definitely worth a discussion. How complex are the changes?

@Witiko
Contributor Author

Witiko commented May 28, 2021

Thanks for the offer! The next big move is on @Witiko and his team now, motivating this functionality with an impactful demo / promo. Further performance optimizations are fun but diminishing returns.

If all goes well, I will be getting back to you with a draft for the demo by the end of this weekend.
Hussites! 🪓😉🛡️

@Witiko
Contributor Author

Witiko commented Sep 13, 2021

@piskvorky The summer is over, the researchers are returning from their vacations, and I have received information about the copyright of the books in our dataset of scanned OCR texts from the Hussite era (#3146 (comment)). Therefore, the legal hurdles in producing an information retrieval tutorial are behind us and we can now get to the practicals. Here is my plan of action for the rest of 2021:

@Witiko
Contributor Author

Witiko commented Dec 27, 2021

I have produced a smaller dataset of OCR texts from books with expired copyright and released it into the public domain. I am afraid that the remaining items will have to wait until the next year. Happy holidays!

Development

Successfully merging this pull request may close these issues.

min_similarity & max_distance does not work in levsim
4 participants