Does it matter that the code all works with code units, not code points? #6

Open
idg10 opened this issue Oct 11, 2021 · 1 comment
@idg10
Contributor

idg10 commented Oct 11, 2021

The code always progresses through text (in the glob pattern, and also in the input) one `char` at a time, with no regard for higher-level units. For example, code points outside the Basic Multilingual Plane (BMP) are encoded as pairs of `char` values, but instead of treating each pair as a single character, the code handles the two halves of the surrogate pair as separate characters.
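To make the concern concrete, here is a minimal sketch in Python rather than C#. Python strings index by code point, so `utf16_units` emulates a .NET `char` sequence by encoding to UTF-16. `match_question_mark_by_unit` is a hypothetical toy matcher (literals plus `?` only), not this library's implementation. It only illustrates what matching one code unit at a time does to a non-BMP character.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text, emulating a .NET char sequence."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def match_question_mark_by_unit(pattern, text):
    """Hypothetical matcher: literals and '?', compared one code unit at a time."""
    p, t = utf16_units(pattern), utf16_units(text)
    if len(p) != len(t):
        return False
    return all(pu == ord("?") or pu == tu for pu, tu in zip(p, t))


grinning = "\U0001F600"  # a single code point outside the BMP

# One code point, but two UTF-16 code units (a surrogate pair).
print([hex(u) for u in utf16_units(grinning)])  # ['0xd83d', '0xde00']

# A single '?' does not cover the character; two are needed.
print(match_question_mark_by_unit("a?b", "a" + grinning + "b"))   # False
print(match_question_mark_by_unit("a??b", "a" + grinning + "b"))  # True
```

Whether the real matcher behaves the same way for `?`, `*`, and character ranges is exactly what the proposed tests would establish.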

It's possible this doesn't matter, but it would be good to add some tests for cases where either the pattern, the input, or both contain non-BMP characters.

(Another consideration is where multiple code points combine to form a single logical form, e.g., combining diacritics. These things raise questions of whether you want to treat `"caf\u00e9"` and `"cafe\u0301"` as equal. Both represent `café`: one with the Unicode code point that pre-combines `e` with an acute accent, and the other using an ordinary `e` with a combining accent. The answer, most likely, is that we do not want to support such things, but it would be good to be explicit, and possibly even to have tests that call this out.)
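As a small illustration of the two spellings, the sketch below uses Python's standard `unicodedata` module. The helper `equal_after_nfc` is a hypothetical name introduced here. It shows what opting in to normalization would look like, not something the library does.

```python
import unicodedata

precomposed = "caf\u00e9"   # 'é' as the single code point U+00E9
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT


def equal_after_nfc(a, b):
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Without normalization the two spellings differ, even in length.
print(precomposed == decomposed)            # False
print(len(precomposed), len(decomposed))    # 4 5

# After normalization they compare equal.
print(equal_after_nfc(precomposed, decomposed))  # True
```

A glob matcher that does not normalize would treat these as different inputs, which is the behaviour a test could pin down explicitly.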

@mwadams
Contributor

mwadams commented Oct 11, 2021

This is interesting, because we do support this in JSON schema land (because it is part of the optional Unicode support) but it would be slower (you'd have to use the codepoint iterator thing).

Specs that demonstrate that it does not work (i.e. pass on failure) in their own section would be good.
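For reference, code point iteration over UTF-16 code units can be sketched as follows. This is an illustrative Python version, not the .NET API. The name `iter_code_points` and the choice to pass unpaired surrogates through unchanged are assumptions. The per-unit range checks are the extra work that a plain `char`-by-`char` loop avoids.

```python
def iter_code_points(units):
    """Yield code points from a sequence of UTF-16 code units.

    Assumption: an unpaired surrogate is yielded unchanged rather than rejected.
    """
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Combine the high and low surrogates into one supplementary code point.
            low = units[i + 1]
            yield 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            i += 2
        else:
            yield unit
            i += 1


# 'a', U+1F600 as a surrogate pair, 'b'
units = [0x61, 0xD83D, 0xDE00, 0x62]
print([hex(cp) for cp in iter_code_points(units)])  # ['0x61', '0x1f600', '0x62']
```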
