Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix diffWords treating numbers and underscores as not being word characters #554

Merged
merged 4 commits into from
Sep 5, 2024

Conversation

ExplodingCabbage
Copy link
Collaborator

Fixes #553 (the original issue, not the orthogonal bug posted in a comment).

The bug was introduced in #494. Previously we were splitting into words on \b, which per the linked MDN docs considers the following things to be word characters:

Letters (A–Z, a–z), numbers (0–9), and underscore (_).

Then I swapped to using the extendedWordChars regex (previously only used for stitching together some adjacent tokens into single tokens after the original splitting), but didn't ensure it actually contained all the things that \b considers to be word characters.

This should fix it!

@@ -19,7 +19,7 @@ import { longestCommonPrefix, longestCommonSuffix, replacePrefix, replaceSuffix,
// - U+02DC ˜ ˜ Small Tilde
// - U+02DD ˝ ˝ Double Acute Accent
// Latin Extended Additional, 1E00–1EFF
const extendedWordChars = 'a-zA-Z\\u{C0}-\\u{FF}\\u{D8}-\\u{F6}\\u{F8}-\\u{2C6}\\u{2C8}-\\u{2D7}\\u{2DE}-\\u{2FF}\\u{1E00}-\\u{1EFF}';
const extendedWordChars = 'a-zA-Z0-9_\\u{C0}-\\u{FF}\\u{D8}-\\u{F6}\\u{F8}-\\u{2C6}\\u{2C8}-\\u{2D7}\\u{2DE}-\\u{2FF}\\u{1E00}-\\u{1EFF}';
Copy link
Collaborator Author

@ExplodingCabbage ExplodingCabbage Sep 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH I am confused about the (pre-existing) ranges given here and by the comment explaining them above. I am gonna play around and maybe add more tests. But that can happen subsequent to just getting this PR merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Combinations of numbers and letters no longer considered a single token in v6
1 participant