-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switch to a custom Myers-based LCS engine #103
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #103 +/- ##
===========================================
- Coverage 85.11% 70.14% -14.98%
===========================================
Files 12 13 +1
Lines 5818 7646 +1828
Branches 475 548 +73
===========================================
+ Hits 4952 5363 +411
- Misses 846 2266 +1420
+ Partials 20 17 -3
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚨 Try these New Features:
|
GNU patch is getting confused and deciding the patch the wrong file on MacOS.
Interesting. Maybe create a BENCHMARKING.md file to document the improvement ? |
Yep! There are some details in the individual commit messages, but the most important improvement from my perspective is being able to handle larger files at all. For very short files it should be more or less the same. The worst case for this algorithm is completely different files, and for those (on files short enough for the old version to be able to process ;D) we have a performance regression: Before:
After:
Will do later today! |
Please run hyperfine once with the two commands. It produces better results and comparison |
Until now we have used the engine provided by the 'diff' crate to do the actual diffing. It implements a Longest Common Subsequence algorithm that explores all potential offsets between the two collections. This produces high quality diffs, but has the downside of requiring a huge amount of memory for the left x right lines matrix, which makes it unable to process big files (~36MB): > ./target/release/diffutils diff test-data/huge-base test-data/huge-very-similar memory allocation of 2202701222500 bytes failed fish: Job 1, './target/release/diffutils diff…' terminated by signal SIGABRT (Abort) The author has begun an implementation of the Myers algorithm, which will be offered as an alternative to the full LCS one, but has not made any progress on merging it for months, and has not been responsive. It probably makes sense for us to have our own engine, in any case, so that we can evolve it along with the tool and make any adjustments or apply any heuristics we decide could be helpful for matching GNU diff's behavior or performance. The Myers algorithm is a more efficient implementation of LCS, as it only uses a couple vectors with 2 * (m + n) positions, rather than the m * n positions used by the full LCS matrix, where m and n are the number of lines of each file. With this new engine we outperform GNU diff significantly when comparing those two big files that are largely equal, with changes at the top and bottom while producing the exact same diff and using almost exactly the same amount of memory at the peak: Benchmark 1: ./target/release/diffutils diff test-data/huge-base test-data/huge-very-similar Time (mean ± σ): 105.0 ms ± 2.5 ms [User: 62.2 ms, System: 41.6 ms] Range (min … max): 101.7 ms … 111.3 ms 28 runs Warning: Ignoring non-zero exit code. Benchmark 2: diff test-data/huge-base test-data/huge-very-similar Time (mean ± σ): 1.119 s ± 0.003 s [User: 1.068 s, System: 0.044 s] Range (min … max): 1.115 s … 1.126 s 10 runs Warning: Ignoring non-zero exit code. Summary ./target/release/diffutils diff test-data/huge-base test-data/huge-very-similar ran 10.66 ± 0.26 times faster than diff test-data/huge-base test-data/huge-very-similar It's not all flowers, however. Without heuristics we suffer on files which are very different, especially if they are large, but even if they are small. Diffing two ~36MB and completely different files may take tens of minutes - but it at least works. This is where our ability to add custom heuristics is helpful, though - we can avoid some of the most pathological cases. Those come on the next couple commits. Benchmark 1: ./target/release/diffutils diff test-data/LGPL2.1 test-data/GPL3 Time (mean ± σ): 6.5 ms ± 0.3 ms [User: 5.5 ms, System: 0.8 ms] Range (min … max): 6.1 ms … 8.0 ms 435 runs Warning: Ignoring non-zero exit code. Benchmark 2: diff test-data/LGPL2.1 test-data/GPL3 Time (mean ± σ): 1.5 ms ± 0.1 ms [User: 1.1 ms, System: 0.3 ms] Range (min … max): 1.4 ms … 4.1 ms 1968 runs Warning: Ignoring non-zero exit code. Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options. Benchmark 3: ./target/release/diffutils.old diff test-data/LGPL2.1 test-data/GPL3 Time (mean ± σ): 2.1 ms ± 0.2 ms [User: 1.2 ms, System: 0.8 ms] Range (min … max): 1.8 ms … 2.9 ms 1435 runs Warning: Ignoring non-zero exit code. Summary diff test-data/LGPL2.1 test-data/GPL3 ran 1.42 ± 0.17 times faster than ./target/release/diffutils.old diff test-data/LGPL2.1 test-data/GPL3 4.35 ± 0.43 times faster than ./target/release/diffutils diff test-data/LGPL2.1 test-data/GPL3 It is worth pointing out as well that the reason GNU diff is outperformed in that best case scenario is because it does a lot more work to enable other optimizations we do not implement such as hashing each line and separating out those that only appear on one of the files. That work adds up on big files, but allows GNU diff to outperform by a similar factor when the files are not just different by rearranging lines, but by having completely different lines.
This change adds some checks to decide that the search for the best place to split the diffing process has gone for too long, or long enough while finding a good chunk of matches. They are based on similar heuristics that GNU diff applies and will help in cases in which files are very long and have few common sequences. This brings comparing some large files (~36MB) that are very different from ~1 hour to ~8 seconds, but it will still hit some pathological cases, such as some very large cpp files I created for some benchmarking that still take 1 minute. Benchmark 1: diff test-data/huge-base test-data/huge-very-different Time (mean ± σ): 2.790 s ± 0.005 s [User: 2.714 s, System: 0.063 s] Range (min … max): 2.781 s … 2.798 s 10 runs Warning: Ignoring non-zero exit code. Benchmark 2: ./target/release/diffutils.no-heuristics diff test-data/huge-base test-data/huge-very-different Time (mean ± σ): 4755.084 s ± 172.607 s [User: 4727.169 s, System: 0.330 s] Range (min … max): 4607.522 s … 5121.135 s 10 runs Warning: Ignoring non-zero exit code. Benchmark 3: ./target/release/diffutils diff test-data/huge-base test-data/huge-very-different Time (mean ± σ): 7.197 s ± 0.099 s [User: 7.055 s, System: 0.094 s] Range (min … max): 7.143 s … 7.416 s 10 runs Warning: Ignoring non-zero exit code. Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options. Summary diff test-data/huge-base test-data/huge-very-different ran 2.58 ± 0.04 times faster than ./target/release/diffutils diff test-data/huge-base test-data/huge-very-different 1704.04 ± 61.93 times faster than ./target/release/diffutils.no-heuristics diff test-data/huge-base test-data/huge-very-different Note that the worse that should happen by heuristics causing the search to end early is a suboptimal diff, but the diff will still be correct and usable with patch.
This is the last piece of the puzzle to get somewhat comparable to GNU diff performance without implementing all of its tricks - although this one is also used by GNU diff, in its own way. It brings down a diff which still takes over a minute with the previous commit to under a second. Benchmark 1: diff test-data/b.cpp test-data/c.cpp Time (mean ± σ): 2.533 s ± 0.011 s [User: 2.494 s, System: 0.027 s] Range (min … max): 2.519 s … 2.553 s 10 runs Warning: Ignoring non-zero exit code. Benchmark 2: ./target/release/diffutils.local-heuristics diff test-data/b.cpp test-data/c.cpp Time (mean ± σ): 65.798 s ± 1.080 s [User: 65.367 s, System: 0.053 s] Range (min … max): 64.962 s … 68.137 s 10 runs Warning: Ignoring non-zero exit code. Benchmark 3: ./target/release/diffutils diff test-data/b.cpp test-data/c.cpp Time (mean ± σ): 580.6 ms ± 6.5 ms [User: 521.9 ms, System: 38.8 ms] Range (min … max): 570.7 ms … 589.6 ms 10 runs Warning: Ignoring non-zero exit code. Summary ./target/release/diffutils diff test-data/b.cpp test-data/c.cpp ran 4.36 ± 0.05 times faster than diff test-data/b.cpp test-data/c.cpp 113.33 ± 2.26 times faster than ./target/release/diffutils.local-heuristics diff test-data/b.cpp test-data/c.cpp It basically keeps track of how much work we have done overall for a diff job and enables giving up completely on trying to find ideal split points if the cost implies we had to trigger the "too expensive" heuristic too often. From that point forward it only does naive splitting of the work. This should not generate diffs which are much worse than doing the diagonal search, as it should only trigger in cases in which the files are so different it won't find good split points anyway. This is another case in which GNU diff's additional work with hashing and splitting large chunks of inclusion / deletion from the diff work and trying harder to find ideal splits seem to cause it to perform slightly poorer: That said, GNU diff probably still generates better diffs not due to this, but due to its post-processing of the results, trying to create more hunks with nearby changes staying close to each other, which we do not do (but we didn't do that before anyway).
I have yet to do a proper code review, but the approach looks sound to me, and a huge undertaking, so thank you very much for this! @sylvestre any objection to not depending on the external diff crate and re-implementing it in-house? @kov can you add a heading comment to |
Add links to the original paper, and an explanation of the overall design to the implementation file.
FWIW, I joked on Discord that this was potentially controversial because I would completely understand not wanting to maintain a custom engine, especially if the current diff crate gets its Myers implementation merged, and maybe we could instead do a simpler decision of which "backend" of the diff crate to use based on file size, for instance, and potentially not support the most niche pathological cases (diffing completely different massive files is not that common after all). I don't know if you guys may have an easier time reaching the diff.rs maintainer. I would be totally fine with whatever decision, I have had a lot of fun learning about this stuff and writing this patch, but I am not emotionally tied to it, up to you! |
fine with that but I am wondering if the algo itself could not a into a different crate (in case others would like to use it) ? |
I'd be fine with that, do you envision it being a crate of the uutils project or do yu mean external? When I pondered this as I was starting I thought it may make more sense if we exported the diffutils crate as a library because then you'd get the whole functionality, but there is a benefit to having the generic implementation that can be used for file lines or anything else exposed separately, I suppose. |
yeah, we can create https://github.com/uutils/diff-mayers-based-lcs-engine for example :) |
Let me make it into a separate crate then, I'll start on my namespace and we can move. Will give API a bit more thought for that. I think we want to add some parameters to control the heuristics and whether we only care about a difference existing, too. But feel free to comment on the current code! |
i noticed this: https://crates.io/crates/lcs-diff |
I looked at this one when I was starting my research, yeah. It's basically implementing the same algorithm as the diff.rs crate that is used now, which creates a full matrix to compute the LCS: fn create_table<T: PartialEq + Clone>(old: &[T], new: &[T]) -> Vec<Vec<u32>> {
let new_len = new.len();
let old_len = old.len();
let mut table = vec![vec![0; old_len + 1]; new_len + 1];
<snip>
table
} So it will still do a large allocation for medium files and be unable to diff very large files. I could try to reach out to the author and see if they have plans of implementing a myers-based algo. |
I have been a bit busy the last couple weeks with work, but I'll get back to this on the holidays and get the standalone crate going, I'll close this one meanwhile, thank you for the the input so far! |
Thanks Gustavo. I'm looking forward to reviewing it! |
This is a big one. It basically allows our version of diff to be used on larger files, which right now crash due to an impossibly large allocation, due to the library we use implementing the LCS algorithm using a full matrix matching lines of both files.
I looked at implementing this on the library, but the author already has a patch for months and has been unresponsive. I also think it makes sense to have our own engine that can be customized and optimized as required by the goals of the tool rather than relying on a library which may be more conservative. The author mentioned in their PR that the plan there is to make the Myers diagonal search optional due to the pathological cases - which to me means heuristics will probably not be welcome.
I split the actual feature in 3 commits:
See each commit for more details of performance, it is looking pretty good. I would prefer to keep the commits separate, like I did for the cmp optimizations, so that we have a good base for debugging and bisecting if we find some weird corner case.
Tests all pass, the diffs produced I looked manually at seem to have good quality. To weed out any corner cases (and I did find some!) I have been running the following script for the last couple of days. Since I last fired it up it's been going for over 12 hours now, basically diffing every single text and source code file I have in my home directory (over 370k files, I have multiple browser code bases fwiw) against /etc/passwd, an empty file, the previous file, all of them back and forth, then trying to apply it with patch and checking for correctness for each of those. So far no further issues found!