Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Operation weights not configurable #86

Open
KthProg opened this issue Aug 2, 2022 · 5 comments
Open

Operation weights not configurable #86

KthProg opened this issue Aug 2, 2022 · 5 comments

Comments

@KthProg
Copy link

KthProg commented Aug 2, 2022

Hello.

Generally Lev allows configurable weights, I'd really like to see that here as using Lev as a substring "approximate" matcher doesn't work when the insert op is weighted the same as other operations. It's hard to tweak search results when the weights cannot be changed.

@maxbachmann
Copy link

You can use the Levenshtein fork for this

pip install Levenshtein

Which allows the following:

from Levenshtein import distance
distance(s1, s2, weights=(insertion, deletion, substitution))

A future release will allow the usage of character dependent weights as well: rapidfuzz/RapidFuzz#241

@gamesguru
Copy link

@maxbachmann

Is there any reference on which of these functions are fastest vs. slowest? We need to sort through about 10,000 search items very quickly. Or I can experiment with them.

And is there any maintenance work being done on the old version? We are supporting Python 3.4 still, and the new maintainer isn't providing backwards compatibility. Perhaps we can back port a few essentials?

@maxbachmann
Copy link

Is there any reference on which of these functions are fastest vs. slowest? We need to sort through about 10,000 search items very quickly. Or I can experiment with them.

The Levenshtein library only provides the Levenshtein distance with weights. Rapidfuzz provides a couple more metrics. When you say sort through 10,000 elements do you mean search for 1 query in these 10k elements, or do you mean a task like deduplication which will pretty much require 10k x 10k comparisions?

And is there any maintenance work being done on the old version?

No only the latest version is maintained

We are supporting Python 3.4 still, and the new maintainer isn't providing backwards compatibility.

I am this new maintainer ;)

Perhaps we can back port a few essentials?

The biggest problem is the fact, that rapidfuzz uses scikit-build for building. It is getting increasingly hard to build wheels for these versions, since tools like cibuildwheel and manylinux dropped support for them.
I currently maintain Python support as long as all of the following are true:

  • scikit-build supports the version
  • cibuildwheel supports the version (they support it as long as manylinux supports it)
  • Cython supports it

It is not really worth it to maintain support for other versions, since the user base is really slim and most of them can just use an older version of the projects. Python 3.4 is a special case, since it was never supported by the library. I did have support for Python 2.7 + Python 3.5 until rapidfuzz < 2.0.

@gamesguru
Copy link

It's only one query over 10k items. But fairly long items and has to be <100ms ideally.

The original library python-Levenshtein<=0.12.2 did work on python3.4

Python 3.5 would be a good start. I'm not sure where the distance() function is defined?

Old pip versions also don't respect python_requires=, and it's necessary to tell it specifically what bounds to use for each version of python. It seems f-strings were introduced at some point, and now it is only 3.7+? Getting specific version pins on these breaking changes would help.

@maxbachmann
Copy link

maxbachmann commented Apr 17, 2023

It's only one query over 10k items. But fairly long items and has to be <100ms ideally.

for long items the new implementation is significantly faster:

A large amount of those performance advantages did already exist prior to rapidfuzz v2.0

The original library python-Levenshtein<=0.12.2 did work on python3.4

in case the old implementation worked for you, you can always just pin it

Python 3.5 would be a good start. I'm not sure where the distance() function is defined?

This function was new in v2.0.0 of rapidfuzz. v1.9.1 is the last version which supported Python 3.5. In this older version there is rapidfuzz.string_metrics.levenshtein.

Old pip versions also don't respect python_requires=, and it's necessary to tell it specifically what bounds to use for each version of python. It seems f-strings were introduced at some point, and now it is only 3.7+? Getting specific version pins on these breaking changes would help.

Generally I mention these changes in the changelog: https://github.com/maxbachmann/RapidFuzz/blob/main/CHANGELOG.rst:

  • python 3.5 + python 2.7 for rapidfuzz < 2.0
  • python 3.6 support for rapidfuzz < 2.12

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants