feat(lyra): compute levenshtein metric within a given tolerance #131
As discussed privately, I have noticed that the `levenshtein` distance is currently only used as a comparison metric to check whether two strings have an edit distance within a given `tolerance` threshold.

Computing the `levenshtein` metric in the general case takes time `O(m * n)` and space `O(max(m, n))`, where `m` and `n` are the lengths of the two input strings. When we're only interested in retrieving the exact edit distance when it's below or equal to a given `tolerance` threshold, more time-efficient methods exist.

Given
this PR turns the comparison
into the following:
The core algorithm this PR takes inspiration from has already been developed in the `talisman` project, which is MIT licensed.

I'd be interested to see how the changes introduced by this PR will influence the benchmarks. I expect a marginal performance increase as the size of the inputs grows.
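
For readers who want a feel for why a tolerance bound helps, here is a minimal sketch of a banded Levenshtein computation. It is not the code from this PR or from `talisman`. The function name `boundedLevenshtein` and the `-1` return convention are assumptions made for illustration. The idea is that any alignment costing at most `tolerance` edits never strays more than `tolerance` cells from the main diagonal, so only that band of the dynamic-programming table needs to be filled, and the loop can stop as soon as a whole row exceeds the bound.

```typescript
// Hypothetical helper (illustration only): returns the exact edit distance
// when it is <= tolerance, and -1 otherwise.
function boundedLevenshtein(a: string, b: string, tolerance: number): number {
  if (tolerance < 0) return -1;
  if (a === b) return 0;

  // Make `a` the shorter string so each row has min(m, n) + 1 cells.
  if (a.length > b.length) {
    const tmp = a;
    a = b;
    b = tmp;
  }
  const m = a.length;
  const n = b.length;

  // The length difference is a lower bound on the edit distance.
  if (n - m > tolerance) return -1;
  if (m === 0) return n;

  // Any value above the tolerance is clamped to INF.
  const INF = tolerance + 1;
  let prev = new Array<number>(m + 1).fill(INF);
  let curr = new Array<number>(m + 1).fill(INF);
  for (let i = 0; i <= m; i++) prev[i] = Math.min(i, INF);

  for (let j = 1; j <= n; j++) {
    // Only cells with |i - j| <= tolerance can lie on a cheap enough path.
    const lo = Math.max(1, j - tolerance);
    const hi = Math.min(m, j + tolerance);

    curr[0] = Math.min(j, INF);
    if (lo > 1) curr[lo - 1] = INF; // cell just left of the band
    let rowMin = lo === 1 ? curr[0] : INF;

    for (let i = lo; i <= hi; i++) {
      const cost = a.charCodeAt(i - 1) === b.charCodeAt(j - 1) ? 0 : 1;
      const value = Math.min(
        prev[i - 1] + cost, // substitution or match
        prev[i] + 1,        // edit that consumes b[j - 1] only
        curr[i - 1] + 1     // edit that consumes a[i - 1] only
      );
      curr[i] = Math.min(value, INF);
      if (curr[i] < rowMin) rowMin = curr[i];
    }
    if (hi < m) curr[hi + 1] = INF; // cell just right of the band

    // Costs never decrease along a path, so a row above the bound is final.
    if (rowMin > tolerance) return -1;

    const swap = prev;
    prev = curr;
    curr = swap;
  }

  return prev[m] <= tolerance ? prev[m] : -1;
}
```

Under these assumptions, a check such as "is the distance within `tolerance`?" becomes `boundedLevenshtein(a, b, tolerance) !== -1`, and the work per row is proportional to `tolerance` rather than to the length of the shorter string.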
Alternative not considered in this PR
As the size of the input scales, computing an exact comparison metric may not be needed. Luckily, the literature offers a plethora of polylogarithmic approximation algorithms to choose from. Should you be interested in this slightly less precise alternative, I'd suggest taking a look at Andoni et al.'s method.