Implement Myers Heuristics #15

mitsuhiko · 2021-02-17T08:28:09Z

GNU diff and others have some internal heuristics to bail if there are too many changes. There are basically two optimizations:

Discard lines that appear in only one of the two files: the diffing algorithm then ignores them entirely, which greatly reduces how much has to be diffed
Bail if the snake becomes too long

https://github.com/reviewboard/reviewboard/blob/master/reviewboard/diffviewer/myersdiff.py
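As a rough illustration of the first heuristic, the sketch below finds which elements of one input have any counterpart in the other, so that everything else could be set aside before running the diff. The function name shared_indices is hypothetical and is not taken from GNU diff, the linked myersdiff.py, or this crate; it only shows the filtering idea, not how the discarded lines would be merged back into the final result.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Hypothetical helper: returns the indices of the elements of `this`
/// that also occur somewhere in `other`. Elements at all other indices
/// have no possible match and could be skipped by the diff.
fn shared_indices<T: Eq + Hash>(this: &[T], other: &[T]) -> Vec<usize> {
    // Set of every element present in the other input.
    let present: HashSet<&T> = other.iter().collect();
    this.iter()
        .enumerate()
        .filter(|(_, item)| present.contains(item))
        .map(|(idx, _)| idx)
        .collect()
}
```

For two completely distinct files the returned vector is empty on both sides, which is the case where skipping the diff pays off the most.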

To be more aligned with git it might make sense to implement the heuristics in the current Algorithm::Myers variant and have a secondary Algorithm::MyersMinimal which has these heuristics disabled (git calls the variants myers and minimal).

These heuristics are likely needed because lcs currently outperforms myers by a wide margin on completely distinct files.

potocpav · 2021-12-08T20:45:02Z

I found a way to easily implement the distinct-line heuristic. It doesn't modify the Myers algorithm at all, but rather builds on top of it. The algorithm works in three steps:

1. Encode the input lists of type Vec<T> into an "optimized" representation:

enum Elems<T> {
    UniqueRun(Vec<T>),
    NormalElem(T),
}

In this representation, consecutive unique lines are concatenated into a single UniqueRun, and each non-unique line is wrapped in NormalElem. The whole sequence is then Vec<Elems<T>>.

2. Perform Myers diff on Vec<Elems<T>>, instead of Vec<T> directly.

3. Decode Vec<Elems<T>> back into Vec<T> to get the result.

This solves the super common pathological case of nearly-distinct files. However, if unique and non-unique lines are mixed together, it still fails.
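Steps 1 and 3 above could be sketched roughly as follows. This is a minimal sketch and not the Pijul implementation: it assumes that "unique" means "does not occur anywhere in the other input", and the function names encode and decode are hypothetical. Step 2 (the Myers diff itself) is left out; it would run on the two encoded sequences and compare elements with the derived equality.

```rust
use std::collections::HashSet;
use std::hash::Hash;

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Elems<T> {
    UniqueRun(Vec<T>),
    NormalElem(T),
}

/// Step 1 (hypothetical name): collapse each maximal run of lines that
/// are absent from `other` into a single `UniqueRun`.
fn encode<T: Clone + Eq + Hash>(lines: &[T], other: &[T]) -> Vec<Elems<T>> {
    let present: HashSet<&T> = other.iter().collect();
    let mut out: Vec<Elems<T>> = Vec::new();
    for line in lines {
        if present.contains(line) {
            // The line may match something on the other side.
            out.push(Elems::NormalElem(line.clone()));
        } else if let Some(Elems::UniqueRun(run)) = out.last_mut() {
            // Extend the run of unique lines that is already open.
            run.push(line.clone());
        } else {
            // Start a new run of unique lines.
            out.push(Elems::UniqueRun(vec![line.clone()]));
        }
    }
    out
}

/// Step 3 (hypothetical name): expand an encoded sequence back into lines.
fn decode<T>(elems: Vec<Elems<T>>) -> Vec<T> {
    let mut out = Vec::new();
    for elem in elems {
        match elem {
            Elems::UniqueRun(run) => out.extend(run),
            Elems::NormalElem(line) => out.push(line),
        }
    }
    out
}
```

Under that assumption, a UniqueRun from one side can never compare equal to any element on the other side, so the diff sees a whole run as a single guaranteed mismatch instead of many. With alternating unique and non-unique lines every run has length one and nothing is saved, which matches the mixed case described above.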

I implemented this algorithm for the Pijul crate. I don't know how easy/hard it would be to adapt for this crate. Just leaving this here in case it's useful.

mitsuhiko · 2021-12-08T23:48:30Z

Thank you for that @potocpav. I will have a look and evaluate this. The underlying design is still somewhat similar to pijul so it should be easy enough to adapt.

P-E-Meunier · 2021-12-09T07:13:19Z

More specifically, this project started as a fork of "diffs", right?

mitsuhiko · 2021-12-09T17:02:20Z

Yep. See also #1

@quackenbush

This substantially improves performance on text files that have few lines in common. For example, diffing 10,000-line files with no lines in common is more than 10x faster (8.5 seconds down to 0.49 seconds on my machine), and sample_files/huge_cpp_before.cpp is nearly 2% faster. Fixes the case mentioned by @quackenbush in #236. This is inspired by the heuristics discussions at mitsuhiko/similar#15

This was referenced Feb 19, 2021

Implement deadline/bail #17

Closed

Unbearably slow for big texts rust-pretty-assertions/rust-pretty-assertions#19

Open

mitsuhiko mentioned this issue Feb 1, 2023

LCS isn't Hunt-McIlroy... and might not be LCS either? #44

Open
