Get LCS string from LCSseq method #270

santhalakshminarayana · 2022-09-24T10:32:06Z

Hi @maxbachmann, O(N) time complexity implementation of this LCSseq method is very cool. Can you please provide any info about this algo and also how to get the longest-common-subsequence string? Thank you.

maxbachmann · 2022-09-24T10:48:05Z

This implementation is based on the paper Bit-Parallel LCS-length Computation Revisited from Heikki Hyyrö. Note that the time complexity is O([N/w]*M) where w is the bitwidth on the corresponding platform. Since Python has integers of arbitrary size, the pure Python implementation looks like O(N). This is more visible in the C++ implementation: https://github.com/maxbachmann/rapidfuzz-cpp/blob/75e10756124a2805752d8f4713918466cd67902f/rapidfuzz/distance/LCSseq_impl.hpp#L155. Still it is a lot faster than the naive O(NM) algorithms.

You should be able to retrieve the longest common subsequence string using the editops/opcodes functions:

>>> from rapidfuzz.distance import LCSseq
>>> a = "hell orld"
>>> b = "hello world"
>>> blocks = LCSseq.editops(a, b).as_matching_blocks()
>>> blocks
[MatchingBlock(a=0, b=0, size=4), MatchingBlock(a=4, b=5, size=1), MatchingBlock(a=5, b=7, size=4), MatchingBlock(a=9, b=11, size=0)]

blocks has the same format as the matching blocks in difflib. It can e.g. be used to get the subsequence strings:

>>> [a[block.a:block.a+block.size] for block in blocks[:-1]]
['hell', ' ', 'orld']

santhalakshminarayana · 2022-09-25T04:31:21Z

Thank you @maxbachmann for your quick reply and providing info about the algorithm.
Is there any way to pass custom weightage to LCSseq similarity? Like score for ('j', 'z') = 0.7, ('k', 'q') = 0.5 ...

maxbachmann · 2022-09-25T10:52:29Z

There is no such option yet, but it is planned to add this feature to Levenshtein/DamerauLevenshtein/OSA: #241

Note when using different weights, it will need to use an O(NM) implementation as well, since the bitparallel algorithm only works with the weight 1.

santhalakshminarayana · 2022-09-28T10:30:17Z

On way is to treat weightage as '1' for all elements, and get LCS string using Hirschberg’s recursive algorithm. Then compensate the total length by subtracting weights. This may not give the optimal LCS score but gives approximate solution for faster results.

maxbachmann · 2022-09-28T11:06:08Z

On way is to treat weightage as '1' for all elements, and get LCS string using Hirschberg’s recursive algorithm

This is still a planned optimization to reduce runtime + memory consumption of the editops implementation. I already use it for when calculating the editops for the Levenshtein distance. It is generally worth it for very long sequences (e.g. 100k characters)). In this case it makes sense to use Hirschberg until the individual sequences are smaller than a couple thousand characters and then use the bitparallel algorithm.

Then compensate the total length by subtracting weights. This may not give the optimal LCS score but gives approximate solution for faster results.

Yes even though it would be a bit unfortunate to have suboptimal results. That said I am actually unsure how to implement this correctly in all cases. I am only aware of the implementation in: https://github.com/infoscout/weighted-levenshtein, which:

segfaults in some cases, which should be easy enough to fix
does not result in the correct results in some cases: Example: minimal cost is not returned infoscout/weighted-levenshtein#24

Are you aware of any implementation that generally works? It would help to have a relatively simple implementation as fallback and for tests and then to specialize whatever possible.

maxbachmann added the question Further information is requested label Sep 24, 2022

santhalakshminarayana closed this as completed Sep 28, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Get LCS string from LCSseq method #270

Get LCS string from LCSseq method #270

santhalakshminarayana commented Sep 24, 2022

maxbachmann commented Sep 24, 2022

santhalakshminarayana commented Sep 25, 2022

maxbachmann commented Sep 25, 2022 •

edited

Loading

santhalakshminarayana commented Sep 28, 2022

maxbachmann commented Sep 28, 2022

Get LCS string from LCSseq method #270

Get LCS string from LCSseq method #270

Comments

santhalakshminarayana commented Sep 24, 2022

maxbachmann commented Sep 24, 2022

santhalakshminarayana commented Sep 25, 2022

maxbachmann commented Sep 25, 2022 • edited Loading

santhalakshminarayana commented Sep 28, 2022

maxbachmann commented Sep 28, 2022

maxbachmann commented Sep 25, 2022 •

edited

Loading