Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix line diff by using rune index without a separator #136

Merged
merged 2 commits into from
Aug 2, 2023

Commits on Jan 22, 2023

  1. Add a failing line diff test case

    Current implementation produces wrong result because
    it calls `DiffMain` on the following 2 arguments:
    
    * `1,2,3,4,5,6,7,8,9,10`
    * `1,2,3,4,5,6,7,8,9,11`
    
    This numbers represent indices into the lines array.
    The algorithm finds that equal part of those
    strings is `1,2,3,4,5,6,7,8,9,1` and which is followed
    by `Delete 0` and `Insert `1`.
    kdarkhan committed Jan 22, 2023
    Configuration menu
    Copy the full SHA
    74798f5 View commit details
    Browse the repository at this point in the history

Commits on Jan 23, 2023

  1. Fix line diff by using runes without separators

    [The suggested approach](https://github.com/google/diff-match-patch/wiki/Line-or-Word-Diffs#line-mode
    ) for doing line level diffing is the following set of steps:
    
    1. `ti1, ti2, linesIdx = DiffLinesToChars(t1, t2)`
    2. `diffs = DiffMain(ti1, ti2)`
    3. `DiffCharsToLines(diff, linesIdx)`
    
    The original implementation in `google/diff-match-patch` uses
    unicode codepoints for storing indices in `ti1` and `ti2` joined by an empty string.
    Current implementation in this repo stores them as integers joined by a
    comma. While this implementation makes `ti1` and `ti2` more readable, it
    introduces bugs when trying to rely on it when doing line level diffing
    with `DiffMain`. The root cause of the issue is that an integer line
    index might span more than one character/rune, and `DiffMain` can assume
    that two different lines having the same index prefix match partially. For
    example, indices 123 and 129 will have partial match `12`. In that
    example, the diff will show lines 3 and 9 which is not correct. A simple
    failing test case demonstrating this issue is available at
    `TestDiffPartialLineIndex`.
    
    In this PR I am adjusting the algorithm to use the same approach as in
    [diff-match-patch](https://github.com/google/diff-match-patch/blob/62f2e689f498f9c92dbc588c58750addec9b1654/javascript/diff_match_patch_uncompressed.js#L508-L510
    ) by storing each line index as a rune.
    While a rune in Golang is a type alias to uint32, not every uint32
    can be a valid rune. During string to rune slice conversion invalid runes will
    be replaced with `utf.RuneError`.
    
    The integer to rune generation logic is based on the table in https://en.wikipedia.org/wiki/UTF-8#Encoding
    
    The first 127 lines will work the fastest as they are represented as a
    single bytes. Higher numbers are represented as 2-4 bytes.
    
    In addition to that, the range `U+D800 - U+DFFF` contains
    [invalid codepoints](https://en.wikipedia.org/wiki/UTF-8#Invalid_sequences_and_error_handling).
    and all codepoints higher or equal to `0xD800` are incremented by
    `0xDFFF - 0xD800`.
    
    The maximum representable integer using this approach is 1'112'060.
    This improves on Javascript implementation which currently
    [bails out](https://github.com/google/diff-match-patch/blob/62f2e689f498f9c92dbc588c58750addec9b1654/javascript/diff_match_patch_uncompressed.js#L503-L505
    ) when files have more than 65535 lines.
    kdarkhan committed Jan 23, 2023
    Configuration menu
    Copy the full SHA
    a674b30 View commit details
    Browse the repository at this point in the history