Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Combinations of numbers and letters no longer considered a single token in v6 #553

Closed
greysteil opened this issue Sep 3, 2024 · 19 comments · Fixed by #554
Closed

Combinations of numbers and letters no longer considered a single token in v6 #553

greysteil opened this issue Sep 3, 2024 · 19 comments · Fixed by #554
Labels

Comments

@greysteil
Copy link

First up, thanks for all your work on jsdiff, and for the v6 release.

We have a bunch of tests around diffs, and I noticed an unexpected change:

On v5:

// Diff.diffWordsWithSpace("at least $2m", "at least $4m")
Array [
  Object {
    "count": 5,
    "value": "at least $",
  },
  Object {
    "added": undefined,
    "count": 1,
    "removed": true,
    "value": "2m",
  },
  Object {
    "added": true,
    "count": 1,
    "removed": undefined,
    "value": "4m",
  },
]

On v6:

// Diff.diffWordsWithSpace("at least $2m", "at least $4m")
Array [
  Object {
    "added": false,
    "count": 5,
    "removed": false,
    "value": "at least $",
  },
  Object {
    "added": false,
    "count": 1,
    "removed": true,
    "value": "2",
  },
  Object {
    "added": true,
    "count": 1,
    "removed": false,
    "value": "4",
  },
  Object {
    "added": false,
    "count": 1,
    "removed": false,
    "value": "m",
  },
]

Not sure if this is a deliberate change or not? I think where words and numbers are combined it's probably best to treat them as a single Word.

@efstathiosntonas
Copy link

efstathiosntonas commented Sep 5, 2024

I believe this also breaks emojis in react-native TextInput (or in general).

const changes = diffChars(originalText, changedText) as CharactersDiffChange[];

will result in "��" instead of "✌️😒"

downgrading to 5.2.0 everything is fine.

@ExplodingCabbage
Copy link
Collaborator

Not sure if this is a deliberate change or not? I think where words and numbers are combined it's probably best to treat them as a single Word.

It wasn't deliberate, and I agree with you. Damn, I was really hoping I could pull off that 6.0.0 release without introducing any new bugs. :(

@ExplodingCabbage
Copy link
Collaborator

@efstathiosntonas what you describe (with diffChars) sounds like an entirely separate bug that I've introduced in the same release. Can you provide minimal instructions on how to reproduce it, either here or in a new issue?

@efstathiosntonas
Copy link

@ExplodingCabbage will create an expo repro!

ExplodingCabbage added a commit that referenced this issue Sep 5, 2024
@ExplodingCabbage
Copy link
Collaborator

Looks like the first bug here - the one reported by @greysteil - was introduced by #494. Looks like I carelessly assumed that the pre-existing extendedWordChars regex would match any characters that we should consider part of a word, when in fact (whether by design or bug - hard to say because the old logic was also broken!) it doesn't do so. On the bright side, I think it's probably a super-easy one-line fix.

I'll try to get a new release out today with both these bugs fixed.

@greysteil
Copy link
Author

You're the best @ExplodingCabbage ❤️

@efstathiosntonas
Copy link

I'll try to get a new release out today with both these bugs fixed.

@ExplodingCabbage do you still need a repro for diffChars?

@ExplodingCabbage
Copy link
Collaborator

Would be appreciated - even just posting what two strings to diff against each other to see a bug! Otherwise will try to figure it out myself.

@efstathiosntonas
Copy link

efstathiosntonas commented Sep 5, 2024

I think the fastest would be to read or implement it on an expo project: https://gist.github.com/efstathiosntonas/01142fd9243573d649caddd952f81829

Just try to enter emojis on the textinput from the native keyboard of the device/simulator, they will turn out as ?? instead of the emojis.

@ExplodingCabbage
Copy link
Collaborator

Original issue fixed by #554.

I'll have a play around with your Gist and try to figure out the diffChars thing, @efstathiosntonas. I do note though that if I just try diffing some text involving the ✌ and 😒 emojis with diffChars it appears to behave correctly on v6.0.0:

> diff.diffChars('', '✌😒')
[ { count: 2, added: true, removed: false, value: '✌😒' } ]
> diff.diffChars('✌', '😒')
[
  { count: 1, added: false, removed: true, value: '✌' },
  { count: 1, added: true, removed: false, value: '😒' }
]

By contrast, v5.2.0 gets the counts wrong, treating 😒 as two characters (because it is outside the Basic Multilingual Plane):

> diff.diffChars('', '✌😒')
[ { count: 3, added: true, removed: undefined, value: '✌😒' } ]
> diff.diffChars('✌', '😒')
[
  { count: 1, added: undefined, removed: true, value: '✌' },
  { count: 2, added: true, removed: undefined, value: '😒' }
]

So it doesn't appear to me at first glance like I have broken anything emoji-related and I am rather hoping I am going to find that whatever the problem is is not in jsdiff but in your code... but I will see if I can repro now and then get to work on establishing whether that's true.

@ExplodingCabbage
Copy link
Collaborator

Okay, I just got round to glancing at the Gist, @efstathiosntonas. Didn't try to get it running since I haven't used React Native for eons and don't remember how, but did take a quick look at the utils.ts code. I would bet that the problem is to do with the change in concept of string length/count between 5.2.0 and 6.0.0 - that before, 😒 had length 2 (the number of UTF-16 code units required to encode it in UTF-16, as returned by '😒'.length) and now it has length 1 (the number of unicode code points, as returned by Array.from('😒').length). I bet some of your stuff to do with incrementing cursor positions assumes it is dealing with string indexes when conceiving of the string as a sequence of UTF-16 code units, and is therefore now incompatible with jsdiff v6.0.0's count attribute returning lengths as counts of Unicode code points.

Replacing all use of the .count property of change objects with .value.length will probably restore the old behaviour.

Note the change in Unicode handling in diffChars was intentional and documented - the release-notes say

#500 diffChars now diffs Unicode code points instead of UTF-16 code units.

and the README says:

Diff.diffChars(oldStr, newStr[, options]) - diffs two blocks of text, treating each character as a token.

("Characters" here means Unicode code points ...

If after doing some more debugging you can give me a snippet of code that just invokes jsdiff and gets a result you think is wrong, let me know, but I'm gonna assume diffChars is working as intended. (It is a breaking change of course, and I'm sorry it broke something for you, but that's why it was in a major release!)

ExplodingCabbage added a commit that referenced this issue Sep 5, 2024
…acters (#554)

* Add test for bug #553

* Fix bug

* Add release notes
@efstathiosntonas
Copy link

thank you Mark for taking the time to look into it. Will try your suggestions tomorrow and I’ll update the issue!

@ehoogeveen-medweb
Copy link

Interesting - worth noting that even Array.from doesn't extend to grapheme clusters such as 👩‍👩‍👦‍👦:

Array.from('👩‍👩‍👦‍👦').length
// 7

For constructs like that you need Intl.Segmenter:

const seg = new Intl.Segmenter('en', { granularity: "grapheme" });

Array.from(seg.segment('👩‍👩‍👦‍👦')).length
// 1

See e.g. here for more info.

@ExplodingCabbage
Copy link
Collaborator

ExplodingCabbage commented Sep 5, 2024

@ehoogeveen-medweb, yep!

Possibly relevant reading - I've got a (now somewhat-outdated, as it was written before Intl.Segmenter was available) SO answer about different notions of "character" in JavaScript strings at https://stackoverflow.com/a/59796758/1709587. Alas, while code to split text into UTF-16 code units or Unicode code points is not going to change or produce different results over time, the same cannot be said of code to split text into grapheme clusters. Different libraries have slightly different conceptions of what constitutes a grapheme cluster (probably due to being different amounts of up-to-date with the latest changes to Unicode) - e.g. the old grapheme-splitter library I suggest in that answer, which I'd no longer recommend, behaves differently to Intl.Segmenter. There are also issues of availability; Intl.Segmenter only got added to Firefox in August this year, so for many devs it's probably unacceptable to rely on it being natively available, but I'd also rather not couple jsdiff to a particular polyfill. A nightmare! So I decided against making diffChars do anything based on grapheme clusters.

It would be absolutely reasonable for a user of the library to shun diffChars in favour of calling diffArrays with arrays of grapheme clusters segmented as @ehoogeveen-medweb shows above, though, either using the native Intl.Segmenter implementation if their circumstances allow them to rely on its existence, or else including a polyfill.

@ExplodingCabbage
Copy link
Collaborator

Okay, I'll probably actually release the fix for the original issue here tomorrow. Before I release, I want to try to rewrite the extendedWordChars regex from first principles by determining which categories and/or properties in UCD 15 should identify a particular character as either a word character, punctuation character, or space character, then compare the regex I get to the one currently in jsdiff, and then dig into any inconsistencies I discover and figure out whether the current regex or my newly constructed one is more reasonable.

(I don't understand the UCD data very well and basically none of Unicode's data releases I've ever looked at have been well-documented, but https://www.unicode.org/Public/15.0.0/ucd/PropList.txt gives me a list of characters with the White_Space property, I can use Python's unicodedata module to get the Category for every character in existence, and https://unicode.org/reports/tr18/#General_Category_Property gives me category meanings which will hopefully all be easy to map to "word", "whitespace" or "punctuation". I guess I'll see!)

@efstathiosntonas
Copy link

@ExplodingCabbage can confirm, converting all .count to value.length fixed the ?? issue.

@ExplodingCabbage
Copy link
Collaborator

Okay, trying to generate a new extendedWordChars regex from scratch is gonna be a nightmare. I tried assigning characters to "word", "punctuation", or "space" categories based on their UCD category, with the following mapping...

word_cats = ['Cf', 'Ll', 'Lm', 'Lo', 'Lt', 'Lu', 'Mc', 'Me', 'Mn', 'Nd', 'Nl', 'No', 'Pc', 'Sk']
punc_cats = ['Cc', 'Cn', 'Co', 'Cs', 'Pd', 'Pe', 'Pf', 'Pi', 'Po', 'Ps', 'Sc', 'Sm', 'So']
space_cats = ['Zl', 'Zp', 'Zs']

... but I end up with 761 different character ranges for word characters. Definitely not at all consistent with the existing regex and I cannot realistically audit all the differences.

Since I've fixed the specific issue raised here, I'm gonna leave the pre-existing regex alone and just ship the fix.

@greysteil
Copy link
Author

Makes sense! Using unicode categories might make sense as a future change, but just the tweak to the regex fixes my problem. Thanks for your work on this @ExplodingCabbage.

@AkonXI
Copy link

AkonXI commented Nov 21, 2024

Interesting - worth noting that even Array.from doesn't extend to grapheme clusters such as 👩‍👩‍👦‍👦:

Array.from('👩‍👩‍👦‍👦').length
// 7

For constructs like that you need Intl.Segmenter:

const seg = new Intl.Segmenter('en', { granularity: "grapheme" });

Array.from(seg.segment('👩‍👩‍👦‍👦')).length
// 1

See e.g. here for more info.

useful for me, just be replace the length of value with Segmenter's length , and then I get the correct traditional position. but still need to use value.length to cut the string.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants