Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix line diff by using rune index without a separator #136

Merged
merged 2 commits into from
Aug 2, 2023

Conversation

kdarkhan
Copy link
Contributor

@kdarkhan kdarkhan commented Jan 22, 2023

The suggested approach for doing line level diffing is the following set of steps:

  1. ti1, ti2, linesIdx = DiffLinesToChars(t1, t2)
  2. diffs = DiffMain(ti1, ti2)
  3. DiffCharsToLines(diff, linesIdx)

The original implementation in google/diff-match-patch uses
unicode codepoints for storing indices in ti1 and ti2 joined by an empty string.
Current implementation in this repo stores them as integers joined by a
comma. While this implementation makes ti1 and ti2 more readable, it
introduces bugs when trying to rely on it when doing line level diffing
with DiffMain. The root cause of the issue is that an integer line
index might span more than one character/rune, and DiffMain can assume
that two different lines having the same index prefix match partially. For
example, indices 123 and 129 will have partial match 12. In that
example, the diff will show lines 3 and 9 which is not correct. A simple
failing test case demonstrating this issue is available at
TestDiffPartialLineIndex.

In this PR I am adjusting the algorithm to use the same approach as in
diff-match-patch by storing each line index as a rune.
While a rune in Golang is a type alias to uint32, not every uint32
can be a valid rune. During string to rune slice conversion invalid runes will
be replaced with utf.RuneError.

The integer to rune generation logic is based on the table in https://en.wikipedia.org/wiki/UTF-8#Encoding

The first 127 lines will work the fastest as they are represented as a
single bytes. Higher numbers are represented as 2-4 bytes.

In addition to that, the range U+D800 - U+DFFF contains
invalid codepoints.
and all codepoints higher or equal to 0xD800 are incremented by
0xDFFF - 0xD800.

The maximum representable integer using this approach is 1'112'060.
This improves on Javascript implementation which currently
bails out when files have more than 65535 lines.

Current implementation produces wrong result because
it calls `DiffMain` on the following 2 arguments:

* `1,2,3,4,5,6,7,8,9,10`
* `1,2,3,4,5,6,7,8,9,11`

This numbers represent indices into the lines array.
The algorithm finds that equal part of those
strings is `1,2,3,4,5,6,7,8,9,1` and which is followed
by `Delete 0` and `Insert `1`.
diffmatchpatch/diff_test.go Outdated Show resolved Hide resolved
@kdarkhan kdarkhan force-pushed the master branch 2 times, most recently from a9d3bea to dfb3555 Compare January 23, 2023 23:27
@@ -8,4 +8,4 @@ require (
gopkg.in/yaml.v2 v2.4.0 // indirect
)

go 1.12
go 1.13
Copy link
Contributor Author

@kdarkhan kdarkhan Jan 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upgrading to go 1.13 in order to use binary and hex integer literals.

[The suggested approach](https://github.com/google/diff-match-patch/wiki/Line-or-Word-Diffs#line-mode
) for doing line level diffing is the following set of steps:

1. `ti1, ti2, linesIdx = DiffLinesToChars(t1, t2)`
2. `diffs = DiffMain(ti1, ti2)`
3. `DiffCharsToLines(diff, linesIdx)`

The original implementation in `google/diff-match-patch` uses
unicode codepoints for storing indices in `ti1` and `ti2` joined by an empty string.
Current implementation in this repo stores them as integers joined by a
comma. While this implementation makes `ti1` and `ti2` more readable, it
introduces bugs when trying to rely on it when doing line level diffing
with `DiffMain`. The root cause of the issue is that an integer line
index might span more than one character/rune, and `DiffMain` can assume
that two different lines having the same index prefix match partially. For
example, indices 123 and 129 will have partial match `12`. In that
example, the diff will show lines 3 and 9 which is not correct. A simple
failing test case demonstrating this issue is available at
`TestDiffPartialLineIndex`.

In this PR I am adjusting the algorithm to use the same approach as in
[diff-match-patch](https://github.com/google/diff-match-patch/blob/62f2e689f498f9c92dbc588c58750addec9b1654/javascript/diff_match_patch_uncompressed.js#L508-L510
) by storing each line index as a rune.
While a rune in Golang is a type alias to uint32, not every uint32
can be a valid rune. During string to rune slice conversion invalid runes will
be replaced with `utf.RuneError`.

The integer to rune generation logic is based on the table in https://en.wikipedia.org/wiki/UTF-8#Encoding

The first 127 lines will work the fastest as they are represented as a
single bytes. Higher numbers are represented as 2-4 bytes.

In addition to that, the range `U+D800 - U+DFFF` contains
[invalid codepoints](https://en.wikipedia.org/wiki/UTF-8#Invalid_sequences_and_error_handling).
and all codepoints higher or equal to `0xD800` are incremented by
`0xDFFF - 0xD800`.

The maximum representable integer using this approach is 1'112'060.
This improves on Javascript implementation which currently
[bails out](https://github.com/google/diff-match-patch/blob/62f2e689f498f9c92dbc588c58750addec9b1654/javascript/diff_match_patch_uncompressed.js#L503-L505
) when files have more than 65535 lines.
@kdarkhan
Copy link
Contributor Author

Here is the benchmark comparison with the original code and the new code from this PR:

## Original code
BenchmarkDiffCommonPrefix-16                     6884680               153.6 ns/op
BenchmarkDiffCommonSuffix-16                     7614050               152.6 ns/op
BenchmarkCommonLength/prefix/empty-16           1000000000               0.7482 ns/op
BenchmarkCommonLength/prefix/short-16           468785274                2.285 ns/op
BenchmarkCommonLength/prefix/long-16             2215281               522.5 ns/op
BenchmarkCommonLength/suffix/empty-16           935044214                1.257 ns/op
BenchmarkCommonLength/suffix/short-16           390328588                3.023 ns/op
BenchmarkCommonLength/suffix/long-16             1622370               763.2 ns/op
BenchmarkDiffHalfMatch-16                          11480            107185 ns/op
BenchmarkDiffCleanupSemantic-16                      571           2186640 ns/op
BenchmarkDiffMain-16                                   1        1166282802 ns/op
BenchmarkDiffMainLarge-16                             16          94201231 ns/op
BenchmarkDiffMainRunesLargeLines-16                  394           2833419 ns/op
BenchmarkDiffMainRunesLargeDiffLines-16               24          52753756 ns/op

## Code in this PR
BenchmarkDiffCommonPrefix-16                     7782129               151.3 ns/op
BenchmarkDiffCommonSuffix-16                     7802674               155.7 ns/op
BenchmarkCommonLength/prefix/empty-16           1000000000               0.7727 ns/op
BenchmarkCommonLength/prefix/short-16           461628601                2.292 ns/op
BenchmarkCommonLength/prefix/long-16             2258307               514.1 ns/op
BenchmarkCommonLength/suffix/empty-16           939004714                1.222 ns/op
BenchmarkCommonLength/suffix/short-16           353210576                3.125 ns/op
BenchmarkCommonLength/suffix/long-16             1621677               765.0 ns/op
BenchmarkDiffHalfMatch-16                          10000            109821 ns/op
BenchmarkDiffCleanupSemantic-16                      658           1978913 ns/op
BenchmarkDiffMain-16                                   1        1010640421 ns/op
BenchmarkDiffMainLarge-16                             10         139771776 ns/op
BenchmarkDiffMainRunesLargeLines-16                 1632            722348 ns/op
BenchmarkDiffMainRunesLargeDiffLines-16               27          43902502 ns/op

@kdarkhan
Copy link
Contributor Author

I did not realize that there is an existing PR #120 which implements somewhat similar solution.

My PR implements a solution closer to the Javascript version google/diff-match-patch.

@sergi
Copy link
Owner

sergi commented Aug 2, 2023

LGTM

@ffluk3
Copy link

ffluk3 commented Aug 24, 2023

@sergi could we publish a tag with this in it?

@IOrlandoni
Copy link

Re-ping for @sergi

atiratree added a commit to atiratree/api that referenced this pull request Aug 2, 2024
- removed exclude for github.com/sergi/go-diff as it should have been
  fixed by sergi/go-diff#136
atiratree added a commit to atiratree/api that referenced this pull request Aug 5, 2024
- removed exclude for github.com/sergi/go-diff as it should have been
  fixed by sergi/go-diff#136
atiratree added a commit to atiratree/api that referenced this pull request Aug 6, 2024
- removed exclude for github.com/sergi/go-diff as it should have been
  fixed by sergi/go-diff#136
atiratree added a commit to atiratree/api that referenced this pull request Aug 6, 2024
- removed exclude for github.com/sergi/go-diff as it should have been
  fixed by sergi/go-diff#136
atiratree added a commit to atiratree/api that referenced this pull request Aug 6, 2024
- removed exclude for github.com/sergi/go-diff as it should have been
  fixed by sergi/go-diff#136
atiratree added a commit to atiratree/api that referenced this pull request Aug 6, 2024
- removed exclude for github.com/sergi/go-diff as it should have been
  fixed by sergi/go-diff#136
atiratree added a commit to atiratree/api that referenced this pull request Aug 15, 2024
- removed exclude for github.com/sergi/go-diff as it should have been
  fixed by sergi/go-diff#136
atiratree added a commit to atiratree/api that referenced this pull request Aug 20, 2024
- removed exclude for github.com/sergi/go-diff as it should have been
  fixed by sergi/go-diff#136
bertinatto pushed a commit to bertinatto/api that referenced this pull request Aug 22, 2024
- removed exclude for github.com/sergi/go-diff as it should have been
  fixed by sergi/go-diff#136
bertinatto pushed a commit to bertinatto/api that referenced this pull request Aug 22, 2024
- removed exclude for github.com/sergi/go-diff as it should have been
  fixed by sergi/go-diff#136
bertinatto pushed a commit to bertinatto/api that referenced this pull request Aug 23, 2024
- removed exclude for github.com/sergi/go-diff as it should have been
  fixed by sergi/go-diff#136
atiratree added a commit to atiratree/api that referenced this pull request Aug 30, 2024
- removed exclude for github.com/sergi/go-diff as it should have been
  fixed by sergi/go-diff#136
atiratree added a commit to atiratree/api that referenced this pull request Sep 11, 2024
- removed exclude for github.com/sergi/go-diff as it should have been
  fixed by sergi/go-diff#136
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants