Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add string:jaro_distance/2 #7863

Closed

Conversation

the-mikedavis
Copy link
Contributor

@garazdawi mentioned this might be a nice addition to string (erlang/rebar3#2844 (comment)). It's a translation from Elixir's String.jaro_distance/2 adapted to allow unicode:chardata() rather than only binaries.

@garazdawi also mentioned someone from the Erlang/OTP team was interested in working on this so I am submitting what I have now - please feel free to take it over or supersede this PR if you'd like!

Previously, if a case failed and expected a non-string, non-tuple value,
the 'io:format/2' call in the 'test_1/5' helper would error rather than
printing because of the '~ts' control sequence. For example, writing an
incorrect case in the `length` case like so will fail:

    ?TEST("abc", [], 4)

To fix this we swap the clauses so that we use '~ts' for binaries and
lists and '~w' for everything else.
@okeuday
Copy link
Contributor

okeuday commented Nov 15, 2023

@the-mikedavis Please consider having this functionality as string:distance/3 with the first argument as DistanceType :: jaro with the potential for adding the atoms jaro_winkler | levenshtein | damerau_levenshtein in the future (as described here) and the return value as an integer for the number of edits or characters in common. A second function string:distance_normalized/3 could be added for the [0.0 .. 1.0] return value. Providing people with a choice of the algorithm can relate to the expected string contents and the expected algorithm latency.

@dgud
Copy link
Contributor

dgud commented Nov 15, 2023

@okeuday I don't like that, they have very different properties and return values, (I have implemented most of them).
Also the Levenstein algorithms are 10 times slower then jaro, so I don't really see the need of them.

For what would you use the Levenstein algorithm when jaro is 10 times faster?

@okeuday
Copy link
Contributor

okeuday commented Nov 15, 2023

@dgud Applications of the Levenshtein algorithm are described here and most usages are likely uncommon in Erlang, but a shell feature that could provide help based on module or function spelling mistakes could be an Erlang use-case. If the strings are relatively short the latency may be justifiable with getting a better result. Not attempting to advocate for a slower shell. It just seems best to care about more than 1 string distance algorithm.

@dgud
Copy link
Contributor

dgud commented Nov 15, 2023

> a shell feature that could provide help based on module or function spelling mistakes could be an Erlang use-case.

Which is one reason why we want to add jaro, and why we considered the other ones.
I'm still not convinced, that a Levenshtein variant is significantly better at that.

@dgud dgud self-assigned this Nov 15, 2023
@dgud
Copy link
Contributor

dgud commented Nov 15, 2023

@the-mikedavis We have another variant of the code coming, we have concluded that the elixir variant calculates the transpositions "wrongly", at least it's different than the defacto standard, for example:
jaro_distance("Sunday", "Saturday") gives another answer than the algorithm on rosetta code.

The original paper also seems to agree more with rosetta algorithms, then elixirs, though the original code
does an integer division of 2 which would give the same result as elixirs of the example above.

But I'll keep this PR open for inspiration, i.e. test changes and docs.

@dgud
Copy link
Contributor

dgud commented Nov 17, 2023

Thanks, for the work.
Closed this in favor of #7879

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants