-
Notifications
You must be signed in to change notification settings - Fork 210
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Fix line diff by using runes without separators
The [suggested approach](https://github.com/google/diff-match-patch/wiki/Line-or-Word-Diffs#line-mode ) for doing line level diffing is the following set of steps: 1. `ti1, ti2, linesIdx = DiffLinesToChars(t1, t2)` 2. `diffs = DiffMain(ti1, ti2)` 3. `DiffCharsToLines(diff, linesIdx)` The original implementation in `google/diff-match-patch` uses unicode codepoints for storing indices in `ti1` and `ti2` joined by an empty string. Current implementation in this repo stores them as integers joined by a comma. While this implementation makes `ti1` and `ti2` more readable, it introduces bugs when trying to rely on it when doing line level diffing with `DiffMain`. The root cause of the issue is that an integer line index might span more than one character/rune, and `DiffMain` can assume that two different lines having the same index prefix match partially. For example, indices 123 and 129 will have partial match `12`. In that example, the diff will show lines 3 and 9 which is not correct. A simple failing test case demonstrating this issue is available at `TestDiffPartialLineIndex`. In this PR I am adjusting the algorithm to use the same approach as in [diff-match-patch](https://github.com/google/diff-match-patch/blob/62f2e689f498f9c92dbc588c58750addec9b1654/javascript/diff_match_patch_uncompressed.js#L508-L510 ) by storing each line index as a rune. While a rune in Golang is a type alias to uint32, not every uint32 can be a valid rune. We cannot pass in integer indices because string operations in Golang will replace invalid runes with `utf.RuneError`. The integer to rune generation logic is based on the table in https://en.wikipedia.org/wiki/UTF-8#Encoding The first 127 lines will work the fastest as they are represented as a single bytes. Higher numbers are represented as 2-4 bytes. In addition to that, the range `U+D800 - U+DFFF` contains [invalid codepoints](https://en.wikipedia.org/wiki/UTF-8#Invalid_sequences_and_error_handling). and all codepoints higher or equal to `0xD800` are incremented by `0xDFFF - 0xD800`. The maximum representable integer using this approach is 1'112'060. This improves on Javascript implementation which currently [bails out](https://github.com/google/diff-match-patch/blob/62f2e689f498f9c92dbc588c58750addec9b1654/javascript/diff_match_patch_uncompressed.js#L503-L505 ) when files have more than 65535 lines.
- Loading branch information
Showing
5 changed files
with
132 additions
and
38 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -8,4 +8,4 @@ require ( | |
gopkg.in/yaml.v2 v2.4.0 // indirect | ||
) | ||
|
||
go 1.12 | ||
go 1.13 |