This repository has been archived by the owner on Aug 26, 2024. It is now read-only.

String fuzzy-matching From R to Python #317

Open
Magic-fan opened this issue Jul 6, 2021 · 1 comment

@Magic-fan

I am trying to do fuzzy string matching in both R and Python, using two packages:

  • stringdist from R
  • fuzzywuzzy from Python

When I run amatch("PARI", c("HELLO", "WORLD"), maxDist = 2) in R, I get NA, which is intuitive. But when I try what looks like the same thing in Python, process.extract("PARI", ["HELLO", "WORLD"], limit = 2), I get [('world', 22), ('HELLO', 0)].

How can I get the same result in Python as in R?

Thanks in advance

@maxbachmann

maxbachmann commented Jul 7, 2021

There are a couple of important differences between the two packages:

  1. In FuzzyWuzzy, limit specifies how many elements extract returns; it is not a distance threshold. extract has no argument that plays the role of maxDist. For that you would have to use extractBests with the score_cutoff argument.

  2. stringdist appears to use an edit distance (lower is better), while FuzzyWuzzy only provides normalized string metrics on a 0-100 scale (higher is better). So the cutoff has to be a score, e.g. score_cutoff=90, rather than a distance. You can choose the string metric with the scorer argument.

  3. FuzzyWuzzy preprocesses strings by default in the extract function (it lowercases them and replaces non-alphanumeric characters). You can disable this with processor=None.
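
To make the difference in point 2 concrete, here is a minimal standard-library sketch that computes both kinds of value for the strings in this issue. The helper names levenshtein and similarity_score are made up for illustration, and difflib's ratio is only a stand-in for a normalized 0-100 score; it is not guaranteed to match what any particular FuzzyWuzzy scorer returns.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance: insert, delete, substitute, each costing 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def similarity_score(a: str, b: str) -> int:
    """Normalized 0-100 similarity (hypothetical stand-in, based on difflib)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


if __name__ == "__main__":
    for query in ("PARI", "HELL"):
        for choice in ("HELLO", "WORLD"):
            print(query, choice,
                  "distance:", levenshtein(query, choice),
                  "score:", similarity_score(query, choice))
```

A distance cutoff such as maxDist = 2 keeps small values, while a score cutoff such as score_cutoff=90 keeps large ones, so the two thresholds cannot be swapped directly.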

As an alternative, you could use RapidFuzz, which supports edit distances and accepts a score_cutoff parameter in the extract function:

>>> from rapidfuzz import process, string_metric
>>> process.extract("PARI", ["HELLO", "WORLD"], processor=None, scorer=string_metric.levenshtein, score_cutoff=2)
[]
>>> process.extract("HELL", ["HELLO", "WORLD"], processor=None, scorer=string_metric.levenshtein, score_cutoff=2)
[('HELLO', 1, 0)]
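
If you want something closer to R's "closest match or NA" behaviour without any third-party dependency, a rough pure-Python sketch could look like the following. The name amatch_like is hypothetical, it returns a 0-based index or None (R's amatch returns a 1-based position or NA), and it assumes plain Levenshtein distance, which may differ from the method stringdist uses by default.

```python
from typing import Optional, Sequence


def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance: insert, delete, substitute, each costing 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def amatch_like(query: str, table: Sequence[str], max_dist: int) -> Optional[int]:
    """Return the 0-based index of the closest entry within max_dist, else None.

    Hypothetical helper: ties go to the first entry, and no preprocessing
    (lowercasing, stripping) is applied to either string.
    """
    best_index: Optional[int] = None
    best_dist = max_dist + 1  # anything above max_dist is rejected
    for index, candidate in enumerate(table):
        dist = levenshtein(query, candidate)
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


if __name__ == "__main__":
    print(amatch_like("PARI", ["HELLO", "WORLD"], max_dist=2))
    print(amatch_like("HELL", ["HELLO", "WORLD"], max_dist=2))
```

The first call finds no candidate within distance 2 and returns None, mirroring the NA from R; the second returns the index of "HELLO".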
