Does it matter that the code all works with code units, not code points? #6

idg10 · 2021-10-11T13:27:42Z

The code always progresses through text (in the glob pattern, and also in the input) one `char` at a time, with no regard for higher-level units. For example, codepoints outside the Basic Multilingual Plane (BMP) are encoded as pairs of `char` values, but instead of treating such a pair as a single character, the code handles the two halves of the surrogate pair as separate characters.

It's possible this doesn't matter, but it would be good to add some tests for cases where either the pattern, the input, or both contain non-BMP characters.
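To make the concern concrete, here is a minimal sketch in Python of how a single-character wildcard could behave differently depending on the unit of iteration. Everything in it is an assumption for illustration: `toy_match`, `utf16_code_units`, and `code_points` are hypothetical helpers, not this library's API, and the matcher only understands `?` and literals. Python strings are sequences of code points, so the sketch encodes to UTF-16 to mimic what a `char`-based loop would see.

```python
from typing import List

QUESTION_MARK = ord("?")


def utf16_code_units(text: str) -> List[int]:
    """Return the UTF-16 code units of text (the values a char-based loop would visit)."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def code_points(text: str) -> List[int]:
    """Return the Unicode code points of text."""
    return [ord(ch) for ch in text]


def toy_match(pattern: List[int], text: List[int]) -> bool:
    """Hypothetical matcher: '?' consumes exactly one element, anything else is literal."""
    if len(pattern) != len(text):
        return False
    return all(p == QUESTION_MARK or p == t for p, t in zip(pattern, text))


emoji = "\U0001F600"  # U+1F600, outside the BMP, so two UTF-16 code units

# Iterating by code unit: '?' sees only the high surrogate, and the match fails.
by_units = toy_match(utf16_code_units("?"), utf16_code_units(emoji))

# Iterating by code point: '?' consumes the whole character, and the match succeeds.
by_points = toy_match(code_points("?"), code_points(emoji))
```

In this toy model, a user would need `??` to match one non-BMP character when iterating by code unit. Whether the real matcher behaves the same way is exactly what the proposed tests would establish.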

(Another consideration is where multiple codepoints combine to form a single logical form, e.g., combining diacritics. These things raise questions of whether you want to treat `"caf\u00e9"` and `"cafe\u0301"` as equal. Both represent `café`: one with the Unicode codepoint that pre-combines `e` with an acute accent, and the other using an ordinary `e` with a combining accent. The answer, most likely, is that we do not want to support such things, but it would be good to be explicit, and possibly even to have tests that call this out.)
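The two spellings of `café` can be compared with a short Python sketch. It uses the standard `unicodedata` module; the helper names `ordinal_equal` and `nfc_equal` are hypothetical and only illustrate the two possible policies (compare as-is versus normalize first).

```python
import unicodedata

precomposed = "caf\u00e9"   # U+00E9: e with acute accent as a single codepoint
decomposed = "cafe\u0301"   # plain e followed by U+0301 combining acute accent


def ordinal_equal(a: str, b: str) -> bool:
    """Compare codepoint by codepoint, with no normalization."""
    return a == b


def nfc_equal(a: str, b: str) -> bool:
    """Compare after normalizing both strings to NFC (canonical composition)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


same_without_normalization = ordinal_equal(precomposed, decomposed)
same_with_normalization = nfc_equal(precomposed, decomposed)
```

Without normalization the strings differ (4 codepoints versus 5), so a literal pattern written one way would not match input written the other way. A test that asserts this mismatch would document the "not supported" decision explicitly.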

mwadams · 2021-10-11T13:35:16Z

This is interesting, because we do support this in JSON schema land (because it is part of the optional Unicode support) but it would be slower (you'd have to use the codepoint iterator thing).

Specs that demonstrate that it does not work (i.e. pass on failure) in their own section would be good.
