Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Benchmark #358

Open
mirix opened this issue Jan 16, 2024 · 4 comments
Open

Benchmark #358

mirix opened this issue Jan 16, 2024 · 4 comments

Comments

@mirix
Copy link

mirix commented Jan 16, 2024

This is not an issue, but to also tell about the positive things, when comparing the execution time on 10 million string pairs (2-22 character long) stored on a Pandas dataframe, RapidFuzz's Damerau-Levenshtein beats both pyxDamerauLevenshtein's and jelyfish's.

Interestingly, if you run one string pair only, pyxDamerauLevenshtein is considerably faster, but, after only a few string pairs, RapidFuzz catches up and beats it.

@maxbachmann
Copy link
Member

Glad to hear that the library proves to be useful :)

Do I understand it correctly, that you have essentially two columns A and B, where in each row you compare A with B?

@mirix
Copy link
Author

mirix commented Feb 22, 2024

Not exactly. On the one hand, I have lists of named entities extracted from different text. On the other, multiple columns from a PostgreSQL table, each containing a dictionary with multiple combinations of keywords. The algorithm needs to find the best match and provide a similarity score.

If the we are talking about written text, I am using the Damerau-Levenshtein similarity score for glyphs only.

If one the other hand, the text are transcripts, I am using the Levenshtein similarity score both for glyphs and metaphone3.

In summary, it is a many vs many comparison run in parallel.

This is the real scenario. The Pandas dataframe was just for testing, but it was also many vs many.

@cyrusmsk
Copy link

Did you try to compare with https://github.com/ashvardanian/StringZilla ?

@maxbachmann
Copy link
Member

Not in depth. However I just had a quick look at their two benchmarks:

Long sequences

This benchmark is mentioned in the readme and mentions them being significantly faster than other.

Several Python libraries provide edit distance computation. Most of them are implemented in C, but are not always as fast as StringZilla. Taking a 1'000 long proteins around 10'000 characters long, computing just a 100 distances:
JellyFish: 62.3s
EditDistance: 32.9s
StringZilla: 0.8s

Usually I would assume that they simply didn't know about better implementations. However their benchmark script (https://github.com/ashvardanian/StringZilla/blob/main/scripts/bench_similarity.ipynb) actually includes e.g. rapidfuzz as well. Which if I run their benchmark is significantly faster on my machine:

  • StringZilla: 8.8 sec
  • jellyfish: 38.5 sec
  • editdistance: 12.4 sec
  • rapidfuzz: 0.35 sec

In addition StringZilla is a lot slower than their reported number, while all other libraries are actually faster on my machine. So without them mentioning the CPU they tested this on (maybe one with AVX512?) I am questioning their results. In addition I find it questionable that they only publish results for the libraries they find to be much slower 🤷‍♂️

Short Strings

For their short string test they actually perform better than rapidfuzz. However this benchmark takes most time just to select the random strings, which is completely unusable as a benchmark. So as a first step I factored this out:

queries = [random.choice(words) for _ in range(1000000)]
choices = [random.choice(words) for _ in range(1000000)]

def checksum_distances(tokens, distance_function, n: int = 1000000):
    distances_sum = 0
    for a, b in zip(queries, choices):
        distances_sum += distance_function(a, b)
    return distances_sum

This has the additional advantage that everyone now compares the same strings. This gives the following results:

  • StringZilla: 310 ms
  • jellyfish: 1.8 sec
  • editdistance: 732 ms
  • rapidfuzz: 388 ms

When comparing only empty strings I get:

  • StringZilla: 230 ms
  • jellyfish: 220 sec
  • editdistance: 192 ms
  • rapidfuzz: 276 ms

Maybe there are things in the wrapping code which could be improved a bit in rapidfuzz to get the same results 🤔
Overall people should try to use the process module as much as possible when working with tons of short strings. E.g. when comparing each element of two lists with each other:

queries = [random.choice(words) for _ in range(10000)]
choices = [random.choice(words) for _ in range(10000)]

def checksum_distances(tokens, distance_function, n: int = 1000000):
    for a in queries:
        for b in choices:
            distance_function(a, b)

rapidfuzz can do a lot better by using process.cdist(queries, choices):

  • StringZilla: 25.3 sec
  • RapidFuzz: 23.8 sec
  • RapidFuzz (cdist): 1.28 sec

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants