Stop breaking surrogate pairs in toDelta()/fromDelta()
Resolves google/diff-match-patch#69 for the following languages:
- Objective-C
- Java
- JavaScript
- Python2
- Python3
Sometimes we can find a common prefix that runs into the middle of a
surrogate pair and we split that pair when building our diff groups.
This is fine as long as we are operating on UTF-16 code units. It
becomes problematic when we start trying to treat those substrings as
valid Unicode (or UTF-8) sequences.
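A minimal JavaScript sketch of the failure mode described above. The helper names (`commonPrefixLength`, `canEncode`) are illustrative and not part of the library, and `encodeURI` stands in for whatever encoder ends up consuming the substring:

```javascript
// Two emoji that share the high surrogate 0xD83D and differ only in the
// low surrogate: U+1F600 is D83D DE00, U+1F601 is D83D DE01.
const before = '\u{1F600}';
const after = '\u{1F601}';

// Hypothetical helper: common prefix length measured in UTF-16 code units.
function commonPrefixLength(a, b) {
  let n = 0;
  while (n < a.length && n < b.length &&
         a.charCodeAt(n) === b.charCodeAt(n)) {
    n++;
  }
  return n;
}

// Hypothetical helper: true if encodeURI accepts the text, false if it
// rejects it with a URIError (which it does for lone surrogates).
function canEncode(text) {
  try {
    encodeURI(text);
    return true;
  } catch (e) {
    if (e instanceof URIError) return false;
    throw e;
  }
}

// The prefix stops after one code unit, in the middle of the pair.
const splitAt = commonPrefixLength(before, after);
const prefix = before.substring(0, splitAt);

console.log(splitAt);           // 1
console.log(canEncode(before)); // true
console.log(canEncode(prefix)); // false: a lone high surrogate
```

As long as `prefix` is only compared or concatenated as code units nothing goes wrong; the error appears only once it is handed to something that expects well-formed Unicode.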
When we pass these split groups into `toDelta()` we do just that and the
library crashes. In this patch we're post-processing the diff groups
before encoding them to make sure that we un-split the surrogate pairs.
The post-processed diffs should produce the same output when applying
the diffs. The diff string itself will be different but should not change
that much - only by a single character at surrogate boundaries.
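One way the post-processing could look, as a rough sketch rather than the code in this patch. It assumes diffs are `[op, text]` pairs with `-1` for delete, `1` for insert and `0` for equal, and that both input texts were well-formed UTF-16. The function name `rejoinSplitSurrogates` and its helpers are hypothetical:

```javascript
// Assumed op codes for the [op, text] pairs.
const DELETE = -1, INSERT = 1, EQUAL = 0;

const isHigh = (c) => c >= 0xD800 && c <= 0xDBFF;
const isLow = (c) => c >= 0xDC00 && c <= 0xDFFF;

// Hypothetical sketch: move surrogate halves out of equal groups and into
// the neighbouring delete/insert groups so every group holds whole pairs.
function rejoinSplitSurrogates(diffs) {
  // Collapse the diffs into blocks: one change (del + ins) followed by
  // the equal text that comes after it.
  const blocks = [{ del: '', ins: '', eq: '' }];
  for (const [op, text] of diffs) {
    let last = blocks[blocks.length - 1];
    if (op === EQUAL) {
      last.eq += text;
      continue;
    }
    if (last.eq !== '') {
      last = { del: '', ins: '', eq: '' };
      blocks.push(last);
    }
    if (op === DELETE) last.del += text;
    else last.ins += text;
  }

  for (let i = 0; i < blocks.length; i++) {
    const b = blocks[i];
    // Equal text starting with a low surrogate: its high half sits at the
    // end of the preceding change, so the low half joins both sides.
    if (isLow(b.eq.charCodeAt(0)) && (b.del !== '' || b.ins !== '')) {
      b.del += b.eq[0];
      b.ins += b.eq[0];
      b.eq = b.eq.slice(1);
    }
    // Equal text ending with a high surrogate: its low half starts the
    // following change, so the high half is prepended to both sides.
    const next = blocks[i + 1];
    if (next && isHigh(b.eq.charCodeAt(b.eq.length - 1))) {
      const high = b.eq[b.eq.length - 1];
      next.del = high + next.del;
      next.ins = high + next.ins;
      b.eq = b.eq.slice(0, -1);
    }
  }

  // Emit the blocks as diffs again, skipping groups that became empty.
  const out = [];
  for (const b of blocks) {
    if (b.del !== '') out.push([DELETE, b.del]);
    if (b.ins !== '') out.push([INSERT, b.ins]);
    if (b.eq !== '') out.push([EQUAL, b.eq]);
  }
  return out;
}

// 'X' + U+1F600 changed to 'X' + U+1F601, split after the shared D83D.
console.log(rejoinSplitSurrogates([
  [EQUAL, 'X\uD83D'], [DELETE, '\uDE00'], [INSERT, '\uDE01'],
]));
// [[0, 'X'], [-1, '\u{1F600}'], [1, '\u{1F601}']]
```

The moved code unit is added to both the delete and the insert side, so the old and new texts reconstructed from the diffs stay the same; only the group boundaries shift by one code unit.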
Alternative approaches:
=========
- The [`dissimilar`](https://docs.rs/dissimilar/latest/dissimilar/)
library in Rust takes a more comprehensive approach with its
`cleanup_char_boundary()` method. Since that approach resolves the
issue everywhere and not just in to/from Delta, it's worth
exploring as a replacement for this patch.
Remaining work to do:
========
-[ ] Fix CPP or verify not a problem
-[ ] Fix CSharp or verify not a problem
-[ ] Fix Dart or verify not a problem
-[ ] Fix Lua or verify not a problem
-[x] Refactor to use cleanupSplitSurrogates in JavaScript
-[x] Refactor to use cleanupSplitSurrogates in Java
-[ ] Refactor to use cleanupSplitSurrogates in Objective C
-[ ] Refactor to use cleanupSplitSurrogates in Python2
-[ ] Refactor to use cleanupSplitSurrogates in Python3
-[ ] Refactor to use cleanupSplitSurrogates in CPP
-[ ] Refactor to use cleanupSplitSurrogates in CSharp
-[ ] Refactor to use cleanupSplitSurrogates in Dart
-[ ] Refactor to use cleanupSplitSurrogates in Lua
-[x] Fix patch_toText in JavaScript
-[ ] Fix patch_toText in Java
-[ ] Fix patch_toText in Objective C
-[ ] Fix patch_toText in Python2
-[ ] Fix patch_toText in Python3
-[ ] Fix patch_toText in CPP
-[ ] Fix patch_toText in CSharp
-[ ] Fix patch_toText in Dart
-[ ] Fix patch_toText in Lua
-[ ] Figure out a "minimal" set of unit tests so we can get rid of the big
chunk currently in the PR, then carry it around to all the libraries.
The triggers are well understood, so we can write targeted tests
instead of broad ones.