Integer versus string representation of code points #6
Comments
We could just do both, as @mathiasbynens suggested: #1 (comment)
Note that if these string values are all you need, you can already get them by iterating over the string.
for (const symbol of string) {
  console.log(symbol);
}
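To make the point above concrete, here is a minimal sketch using the sample string from the issue description. The variable names are illustrative only; it contrasts the default iterator with a split by UTF-16 code units.

```javascript
// Sample string from the issue: "A", U+10400 (a surrogate pair in UTF-16), "B".
const sample = "\u0041\ud801\udc00\u0042";

// The default string iterator yields one string per code point.
const symbols = [...sample];

// Splitting on the empty string, by contrast, yields UTF-16 code units.
const codeUnits = sample.split("");

console.log(symbols.length);   // 3
console.log(codeUnits.length); // 4
console.log(symbols[1] === "\ud801\udc00"); // true
```

The surrogate pair stays together in `symbols` but is split into two lone surrogates in `codeUnits`.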
[Edit: I had been incorrectly remembering the default string iterator’s behavior; see https://github.com//issues/6#issuecomment-388534535.] @mathiasbynens: The default string iterator does not work for SMP/non-BMP code points (that is, code points beyond U+FFFF), such as the string in the example. The default iterator splits non-BMP code points into their surrogate pairs. …Unless I’m misunderstanding the example, which certainly is possible.
This api would have to work the same as the default iterator imo to be called “codePoints”. It sounds like you want graphemes, which would be a separate thing?
[Edit: I had been incorrectly remembering the default string iterator’s behavior; see https://github.com//issues/6#issuecomment-388534535.] @ljharb: Could you clarify what you mean by “work the same as the default iterator”? The default iterator splits strings between their UTF-16 code units, not between their code points.

In this case, I don’t want to split strings between graphemes; I want to split strings between code points. As you probably already know, splitting strings by graphemes is a much more complicated problem than simply splitting by code points (or even by combining character sequence). Grapheme segmentation is language- and culture-dependent, its general rules as defined by UAX 29 are relatively complex…and hopefully, for JavaScript,

My parsers often want simply to handle Unicode text without splitting surrogate pairs. They need to consume the input string one code point at a time, rather than by UTF-16 code units, by combining character sequences, or by graphemes of some locale. Both the UTF-16 surrogate-pair string

Yielding both in objects, as @michaelficarra mentioned in #6 (comment), would satisfy this need, I think.
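As a rough illustration of the "yield both in objects" idea mentioned above, the following sketch uses a hypothetical generator. The function name `codePointEntries` and the property names `string` and `codePoint` are assumptions made for this example and are not taken from the proposal.

```javascript
// Hypothetical helper, not part of any specification: yields both
// representations of each code point. The property names are assumptions.
function* codePointEntries(text) {
  // The default string iterator already walks the text by code point.
  for (const string of text) {
    yield { string, codePoint: string.codePointAt(0) };
  }
}

const entries = [...codePointEntries("\u0041\ud801\udc00\u0042")];
console.log(entries.map((entry) => entry.codePoint.toString(16)));
// ["41", "10400", "42"]
```

A consumer that concatenates can read `string`, while one that does range checks can read `codePoint`, at the cost of allocating an object per code point.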
@js-choi The default JS string iterator most definitely iterates over code points rather than UCS-2/UTF-16 code units. Whichever engine behaves the way you described is violating the spec.
Where are you getting this? This is false.
Ah, this is supremely embarrassing. You are of course correct; I had been recalling the default iterator’s behavior incorrectly. I’m not sure from where I got the false memory that it split between UTF-16 code units…I have been planning to port some parser code written in such languages to JavaScript for some time. I must have gotten wires crossed with some other programming languages that don’t split between code points. Heh, my sincere apologies; this issue is resolved. Thanks, everyone.
#5 (comment) reminded me that I wanted to ask: Is there a particular reason why an integer representation for code points was chosen instead of a 1-or-2-UTF16-code-unit string representation? With the latter, "\u0041\ud801\udc00\u0042".codePoints() would then yield "\u0041", then "\ud801\udc00", then "\u0042"?
I would personally find such a string representation to be more generally useful. My parsers concatenate code points into new strings much more often than they perform integer arithmetic on them. But people’s usage here may vary, I suppose.
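The trade-off described above can be sketched with standard built-ins only (`String.prototype.codePointAt`, `String.fromCodePoint`, `Array.from`). This does not use the proposed `codePoints()` method, and the variable names are illustrative.

```javascript
const input = "\u0041\ud801\udc00\u0042";

// Integer representation: convenient for range checks and arithmetic.
const integers = Array.from(input, (symbol) => symbol.codePointAt(0));
// [0x41, 0x10400, 0x42]

// String representation: convenient for concatenation.
const strings = Array.from(input);

// Rebuilding a string from integers needs an explicit conversion...
const fromIntegers = String.fromCodePoint(...integers);
// ...whereas strings concatenate directly.
const fromStrings = strings.join("");

console.log(fromIntegers === input && fromStrings === input); // true
```

Either representation can be derived from the other in one call, so the question is mainly which one the common case should get without a conversion step.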