Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get LCS string from LCSseq method #270

Closed
santhalakshminarayana opened this issue Sep 24, 2022 · 5 comments
Closed

Get LCS string from LCSseq method #270

santhalakshminarayana opened this issue Sep 24, 2022 · 5 comments
Labels
question Further information is requested

Comments

@santhalakshminarayana
Copy link

Hi @maxbachmann, O(N) time complexity implementation of this LCSseq method is very cool. Can you please provide any info about this algo and also how to get the longest-common-subsequence string? Thank you.

@maxbachmann maxbachmann added the question Further information is requested label Sep 24, 2022
@maxbachmann
Copy link
Member

This implementation is based on the paper Bit-Parallel LCS-length Computation Revisited from Heikki Hyyrö. Note that the time complexity is O([N/w]*M) where w is the bitwidth on the corresponding platform. Since Python has integers of arbitrary size, the pure Python implementation looks like O(N). This is more visible in the C++ implementation: https://github.com/maxbachmann/rapidfuzz-cpp/blob/75e10756124a2805752d8f4713918466cd67902f/rapidfuzz/distance/LCSseq_impl.hpp#L155. Still it is a lot faster than the naive O(NM) algorithms.

You should be able to retrieve the longest common subsequence string using the editops/opcodes functions:

>>> from rapidfuzz.distance import LCSseq
>>> a = "hell orld"
>>> b = "hello world"
>>> blocks = LCSseq.editops(a, b).as_matching_blocks()
>>> blocks
[MatchingBlock(a=0, b=0, size=4), MatchingBlock(a=4, b=5, size=1), MatchingBlock(a=5, b=7, size=4), MatchingBlock(a=9, b=11, size=0)]

blocks has the same format as the matching blocks in difflib. It can e.g. be used to get the subsequence strings:

>>> [a[block.a:block.a+block.size] for block in blocks[:-1]]
['hell', ' ', 'orld']

@santhalakshminarayana
Copy link
Author

Thank you @maxbachmann for your quick reply and providing info about the algorithm.
Is there any way to pass custom weightage to LCSseq similarity? Like score for ('j', 'z') = 0.7, ('k', 'q') = 0.5 ...

@maxbachmann
Copy link
Member

maxbachmann commented Sep 25, 2022

There is no such option yet, but it is planned to add this feature to Levenshtein/DamerauLevenshtein/OSA: #241

Note when using different weights, it will need to use an O(NM) implementation as well, since the bitparallel algorithm only works with the weight 1.

@santhalakshminarayana
Copy link
Author

On way is to treat weightage as '1' for all elements, and get LCS string using Hirschberg’s recursive algorithm. Then compensate the total length by subtracting weights. This may not give the optimal LCS score but gives approximate solution for faster results.

@maxbachmann
Copy link
Member

On way is to treat weightage as '1' for all elements, and get LCS string using Hirschberg’s recursive algorithm

This is still a planned optimization to reduce runtime + memory consumption of the editops implementation. I already use it for when calculating the editops for the Levenshtein distance. It is generally worth it for very long sequences (e.g. 100k characters)). In this case it makes sense to use Hirschberg until the individual sequences are smaller than a couple thousand characters and then use the bitparallel algorithm.

Then compensate the total length by subtracting weights. This may not give the optimal LCS score but gives approximate solution for faster results.

Yes even though it would be a bit unfortunate to have suboptimal results. That said I am actually unsure how to implement this correctly in all cases. I am only aware of the implementation in: https://github.com/infoscout/weighted-levenshtein, which:

  1. segfaults in some cases, which should be easy enough to fix
  2. does not result in the correct results in some cases: Example: minimal cost is not returned infoscout/weighted-levenshtein#24

Are you aware of any implementation that generally works? It would help to have a relatively simple implementation as fallback and for tests and then to specialize whatever possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants