-
-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Myers Heuristics #15
Comments
I found a way to easily implement the distinct-line heuristic. It doesn't modify the Myers algorithm at all, and rather builds on top of it. The algorithm works in three steps: 1. Encode the input lists of type enum Elems<T> {
UniqueRun(Vec<T>),
NormalElem(T),
} In this representation, consecutive unique lines are concatenated into 2. Perform Meyers diff on 3. Decode This solves the super common pathological case of nearly-distinct files. However, if unique and non-unique lines are mixed together, it still fails. I implemented this algorithm for the Pijul crate. I don't know how easy/hard it would be to adapt for this crate. Just leaving this here in case it's useful. |
Thank you for that @potocpav. I will have a look and evaluate this. The underlying design is still somewhat similar to pijul so it should be easy enough to adapt. |
More specifically, this project started as a fork of "diffs", right? |
Yep. See also #1 |
This substantially improves performance on text files where there are few lines in common. For example, 10,000 line files with no lines in common is more than 10x faster (8.5 seconds to 0.49 seconds on my machine), and sample_files/huge_cpp_before.cpp is nearly 2% faster. Fixes the case mentioned by @quackenbush in #236. This is inspired by the heuristics discussions at mitsuhiko/similar#15
GNU diff and others have some internal heuristics to bail if there are too many changes. There are basically two optimizations:
https://github.com/reviewboard/reviewboard/blob/master/reviewboard/diffviewer/myersdiff.py
To be more aligned with git it might make sense to implement the heuristics in the current
Algorithm::Myers
variant and have a secondaryAlgorithm::MyersMinimal
which has these heuristics disabled (git calls the variantsmyers
andminimal
).These heuristics are likely needed as currently
lcs
outperformmyers
greatly if used on completely distinct files.The text was updated successfully, but these errors were encountered: