Skip to content

Conversation

@ExplodingCabbage
Copy link
Collaborator

Fixes #553 (the original issue, not the orthogonal bug posted in a comment).

The bug was introduced in #494. Previously we were splitting into words on \b, which per the linked MDN docs considers the following things to be word characters:

Letters (A–Z, a–z), numbers (0–9), and underscore (_).

Then I swapped to using the extendedWordChars regex (previously only used for stitching together some adjacent tokens into single tokens after the original splitting), but didn't ensure it actually contained all the things that \b considers to be word characters.

This should fix it!

// - U+02DD ˝ ˝ Double Acute Accent
// Latin Extended Additional, 1E00–1EFF
const extendedWordChars = 'a-zA-Z\\u{C0}-\\u{FF}\\u{D8}-\\u{F6}\\u{F8}-\\u{2C6}\\u{2C8}-\\u{2D7}\\u{2DE}-\\u{2FF}\\u{1E00}-\\u{1EFF}';
const extendedWordChars = 'a-zA-Z0-9_\\u{C0}-\\u{FF}\\u{D8}-\\u{F6}\\u{F8}-\\u{2C6}\\u{2C8}-\\u{2D7}\\u{2DE}-\\u{2FF}\\u{1E00}-\\u{1EFF}';
Copy link
Collaborator Author

@ExplodingCabbage ExplodingCabbage Sep 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH I am confused about the (pre-existing) ranges given here and by the comment explaining them above. I am gonna play around and maybe add more tests. But that can happen subsequent to just getting this PR merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Combinations of numbers and letters no longer considered a single token in v6

2 participants