The code always progresses through text (both the glob pattern and the input) one `char` at a time, with no regard for higher-level units. For example, codepoints outside the Basic Multilingual Plane (BMP) are encoded as pairs of `char` values, but instead of being treated as a single character, the two halves of the surrogate pair are handled as separate characters.
It's possible this doesn't matter, but it would be good to add some tests for cases where either the pattern, the input, or both contain non-BMP characters.
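
To make the concern concrete, here is a minimal sketch of how the matching unit changes the result. It is not the library's code: `GlobUnitsDemo`, `matchByChar`, and `matchByCodePoint` are hypothetical names, and Java is used only as an assumption because its `char` is also a UTF-16 code unit. The toy matcher supports just `?`, `*`, and literals.

```java
// Hypothetical illustration only; not the matcher discussed in this issue.
final class GlobUnitsDemo {
    // '?' matches exactly one unit, '*' matches any run of units,
    // anything else must equal the input unit.
    static boolean match(int[] p, int pi, int[] s, int si) {
        if (pi == p.length) {
            return si == s.length;
        }
        if (p[pi] == '*') {
            // Try letting '*' consume zero or more input units.
            for (int k = si; k <= s.length; k++) {
                if (match(p, pi + 1, s, k)) {
                    return true;
                }
            }
            return false;
        }
        if (si == s.length) {
            return false;
        }
        if (p[pi] == '?' || p[pi] == s[si]) {
            return match(p, pi + 1, s, si + 1);
        }
        return false;
    }

    // Units are UTF-16 code units: a surrogate pair counts as two units.
    static boolean matchByChar(String pattern, String input) {
        return match(pattern.chars().toArray(), 0, input.chars().toArray(), 0);
    }

    // Units are Unicode code points: a surrogate pair counts as one unit.
    static boolean matchByCodePoint(String pattern, String input) {
        return match(pattern.codePoints().toArray(), 0, input.codePoints().toArray(), 0);
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00"; // U+1F600: one code point, two chars
        System.out.println(matchByChar("?", emoji));      // false
        System.out.println(matchByChar("??", emoji));     // true
        System.out.println(matchByCodePoint("?", emoji)); // true
    }
}
```

With `char`-level matching, a single `?` fails to match one non-BMP character while `??` succeeds, which is the kind of behaviour the proposed tests would pin down.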
(Another consideration is where multiple codepoints combine to form a single logical form, e.g., combining diacritics. This raises the question of whether you want to treat `"caf\u00e9"` and `"cafe\u0301"` as equal. Both represent `café`: one with the Unicode codepoint that pre-combines `e` with an acute accent, and the other using an ordinary `e` followed by a combining accent. The answer, most likely, is that we do not want to support such things, but it would be good to be explicit, and possibly even to have tests that call this out.)
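
As a sketch of what such a test could assert, the snippet below compares the two spellings directly and after NFC normalization. It uses Java's standard `java.text.Normalizer` purely for illustration; `NormalizationDemo` and `equalAfterNfc` are hypothetical names, and nothing here implies the matcher should normalize.

```java
import java.text.Normalizer;

// Hypothetical illustration only; shows why the two spellings differ.
final class NormalizationDemo {
    // Compare two strings after composing both to NFC.
    static boolean equalAfterNfc(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String precomposed = "caf\u00e9"; // 'é' as a single code point
        String decomposed = "cafe\u0301"; // 'e' followed by a combining acute accent

        System.out.println(precomposed.length());                   // 4
        System.out.println(decomposed.length());                    // 5
        System.out.println(precomposed.equals(decomposed));         // false
        System.out.println(equalAfterNfc(precomposed, decomposed)); // true
    }
}
```

A matcher that compares unit by unit without normalizing treats these as different inputs, so a test asserting that the pattern `caf\u00e9` does not match `cafe\u0301` would document the "not supported" decision explicitly.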
This is interesting, because we do support this in JSON Schema land (it is part of the optional Unicode support), but it would be slower here (you'd have to use the codepoint iterator thing).
Specs that demonstrate that it does not work (i.e. pass on failure) in their own section would be good.