Skip to content

Conversation

@ShoshinNikita
Copy link

@ShoshinNikita ShoshinNikita commented Apr 7, 2021

This PR partially reverts commits db1b095 - 0a651d5 and fixes the issue #89 with a different approach. The current solution doesn't work well with (*DiffMatchPatch).diffMainRunes method because array indexes in the index string occupy multiple runes.

This fix is based on the previous approach. But elements of the lineArray with indexes from 0xD800 (55296) to 0xDFFF (57343) are skipped because runes in this range are invalid. It requires additional 32KB of memory ([2048]string{}) but allows us to safely encode line indexes in a string.

It doesn't completely fix #89 but increases the panic limit to ~1114111 lines (0x10FFFF is the maximum valid unicode code point). The complete fix will require a lot of changes. At the same time the current approach has a bug. So, I believe it's better to use this fix.

Fixes #115

@ShoshinNikita
Copy link
Author

There's another possible fix: https://github.com/ShoshinNikita/go-diff/commit/c8591cf97f43b198b269258a0c3f3b1fc07990df. It is also based on the previous approach but uses map[rune]string instead of []string to store lines. It doesn't require additional 32KB of memory but breaks the backward compatibility for methods DiffLinesToChars, DiffLinesToRunes and DiffCharsToLines.

@jlao
Copy link

jlao commented Apr 25, 2021

I'm also hitting this regression. Because the lineHash map is no longer shared, the diff assigns different chars/runes to lines that are the same.

@findleyr
Copy link

Hi! We (the gopls team) spoke to @sergi and will help review and test this PR. Specifically, I'll do my best to review the change, and we'll test the fix in gopls. Will probably need a couple days.

@sergi
Copy link
Owner

sergi commented Apr 26, 2021

Thanks for helping @findleyr and team!!

Copy link

@findleyr findleyr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for this!

This is my first time reading this source, so some of my questions / observations were general in nature. Will follow-up with another pass once you've responded.

// checkLineArray checks the size of the slice and ensures that the index of the next element
// will be the valid rune.
func checkLineArray(a *[]string) {
// Runes in this range are invalid, utf8.ValidRune() returns false.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very subtle. Please explain further in this comment why this is a problem.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment describes why runes must be valid: #89 (comment)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. I think it's worth explaining here (you can reference #89).

return dmp.DiffLinesToRunes(string(text1), string(text2))
}

// diffLinesToRunesMunge splits a text into an array of strings, and reduces the texts to a []rune where each Unicode character represents one line.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The words 'array', 'list', and 'slice' are used interchangeably throughout, but array is really incorrect in this context.


// checkLineArray checks the size of the slice and ensures that the index of the next element
// will be the valid rune.
func checkLineArray(a *[]string) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not have functions

// lineRune returns a valid rune value for the line at the given index
func lineRune(idx int) rune

// runeLine looks up the given rune hash in the lines slice.
func runeLine(idxRune rune, lines []string) string

Thereby avoiding the padding here. It could also be used to avoid the prepended "".

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it can break the user code because DiffLinesToRunes returns the lineArray. If a user has something like DiffCharsToLines, it can panic with index out of range.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant that DiffLinesToRunes could use lineRune to convert to rune values, and DiffCharsToLines can use runeLine to convert to line indexes. These functions would be inverses of eachother.

}

// diffLinesToStringsMunge splits a text into an array of strings, and reduces the texts to a []string.
func (dmp *DiffMatchPatch) diffLinesToStringsMunge(text string, lineArray *[]string) []uint32 {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please reorganize to minimize the diff with this refactored method.

}

func (dmp *DiffMatchPatch) diffLinesToRunes(text1, text2 []rune) ([]rune, []rune, []string) {
return dmp.DiffLinesToRunes(string(text1), string(text2))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit concerned about all this unnecessary allocation converting to string, when it really shouldn't be necessary.

Did you compare benchmark performance?


// diffLinesToRunesMunge splits a text into an array of strings, and reduces the texts to a []rune where each Unicode character represents one line.
// We use strings instead of []runes as input mainly because you can't use []rune as a map key.
func (dmp *DiffMatchPatch) diffLinesToRunesMunge(text string, lineArray *[]string, lineHash map[string]int) []rune {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm staring at this, and I don't see how the previous algorithm worked without passing in lineHash. Is that the root cause of the bug?

Can you explain somewhere the exact nature of the bug in the fix for #89?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the root cause of the bug is described in the PR description:

The current solution doesn't work well with (*DiffMatchPatch).diffMainRunes method because array indexes in the index string occupy multiple runes.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...so array indexes could be corrupted/split in the diff output, right?

It's worth explaining in a comment.

return append(diffs, diffsb...)
}

// DiffLinesToChars splits two texts into a list of strings, and educes the texts to a string of hashes where each Unicode character represents one line.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/educes/reduces

}

// DiffLinesToChars splits two texts into a list of strings, and educes the texts to a string of hashes where each Unicode character represents one line.
// It's slightly faster to call DiffLinesToRunes first, followed by DiffMainRunes.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment still accurate?

lineEnd = len(text) - 1
}

line := text[lineStart : lineEnd+1]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'd be interested to know how performance would be affected by using an actual hash to look up the rune value in the index, rather than a string. Then we could convert to []byte at API boundaries and avoid allocating again.

But that doesn't need to be done for this PR.

@findleyr
Copy link

findleyr commented May 3, 2021

@ShoshinNikita

The complete fix will require a lot of changes.

By this you mean modifying the diff algorithm to operate on []int, rather than string, right?

@ShoshinNikita
Copy link
Author

@findleyr

I am not very familiar with the codebase. I just could determine the cause of the regressions and fix #89 with another approach based on code in v1.1.0 (the second commit in this PR partially reverts the previous fix). So, the real diff is v1.1.0...d20955a, lines 431-465 (or just the last commit).

I think using []int instead of string can fix the issue completely. But it will break the backward compatibility. However, as I said before, I am not familiar with the code. So, I could be wrong.

@iambus
Copy link
Contributor

iambus commented Dec 13, 2021

I would suggest you guys look at my PR #128. It fixes a serious panic issue. It may also fix the issue you were disucssing (I'm not sure).

@kdarkhan
Copy link
Contributor

kdarkhan commented Aug 3, 2023

This PR can probably be closed now since my PR which fixes the same thing was merged yesterday #136.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

diffmatchpatch: invalid diff output with CRLF line endings Index out of range panic in DiffCharsToLines on large JSON diff

6 participants