Integer versus string representation of code points #6

Closed
js-choi opened this issue May 11, 2018 · 8 comments

@js-choi

js-choi commented May 11, 2018

#5 (comment) reminded me that I wanted to ask: Is there a particular reason why an integer representation for code points was chosen instead of a 1-or-2-UTF16-code-unit string representation? With the latter,
"\u0041\ud801\udc00\u0042".codePoints() would then yield
"\u0041" then
"\ud801\udc00" then
"\u0042"?

I would personally find such a string representation to be more generally useful. My parsers concatenate code points into new strings much more often than they perform integer arithmetic on them. But people’s usage here may vary, I suppose.

@michaelficarra
Member

We could just do both, as @mathiasbynens suggested: #1 (comment)

@mathiasbynens
Member

Note that if these string values are all you need, you can already get them by iterating over the string.

for (const symbol of string) {
  console.log(symbol);
}

@js-choi
Author

js-choi commented May 12, 2018

[Edit: I had been incorrectly remembering the default string iterator’s behavior; see https://github.com//issues/6#issuecomment-388534535.]

@mathiasbynens: The default string iterator does not work for SMP/non-BMP code points (that is, code points beyond U+FFFF), such as the string in the example. The default iterator splits non-BMP code points into their surrogate pairs. [..."\u0041\ud801\udc00\u0042"], aka [..."A𐐀B"], results in an array with four strings each with one UTF-16 code unit (["A", "\ud801", "\udc00", "B"]), rather than three strings each with one Unicode code point (["A", "𐐀", "B"]). Use cases like these are why this repository’s proposal would be useful.

…Unless I’m misunderstanding the example, which certainly is possible.

@ljharb
Member

ljharb commented May 12, 2018

This api would have to work the same as the default iterator imo to be called “codePoints”. It sounds like you want graphemes, which would be a separate thing?

@js-choi
Author

js-choi commented May 12, 2018

[Edit: I had been incorrectly remembering the default string iterator’s behavior; see https://github.com//issues/6#issuecomment-388534535.]

@ljharb: Could you clarify what you mean by “work the same as the default iterator”? The default iterator splits strings between their UTF-16 code units, not between their code points.

In this case, I don’t want to split strings between graphemes; I want to split strings between code points. As you probably already know, splitting strings by graphemes is a much more complicated problem than simply splitting by code points (or even by combining character sequences). Grapheme segmentation is language- and culture-dependent, its general rules as defined by UAX 29 are relatively complex…and hopefully, for JavaScript, Intl.Segmenter will eventually deal with the complexity of graphemes anyway. (Readers interested in this sort of thing can take a look at the Unicode FAQ.)

My parsers often want simply to handle Unicode text without splitting surrogate pairs. They need to consume each consecutive code point of the input string—rather than by UTF-16 code units, by combining character sequences, or by graphemes of some locale.

Both the UTF-16 surrogate-pair string "\ud801\udc00" and the integer 0x10400 are reasonable representations of the same code point U+10400. Sometimes the former representation is more useful and sometimes the latter is more useful. But I usually find the former more useful.
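
As a minimal sketch of the two representations described above, the snippet below converts U+10400 between its surrogate-pair string form and its integer form using the built-in `String.prototype.codePointAt` and `String.fromCodePoint`. The variable names are illustrative only and do not come from the proposal.

```javascript
// U+10400 in both representations discussed above.
const asString = "\ud801\udc00"; // surrogate-pair string (two UTF-16 code units)
const asInteger = 0x10400;       // integer code point

// String -> integer: codePointAt(0) reads the whole surrogate pair at index 0.
const toInteger = asString.codePointAt(0);

// Integer -> string: String.fromCodePoint emits the surrogate pair.
const backToString = String.fromCodePoint(asInteger);

console.log(toInteger === asInteger);    // true
console.log(backToString === asString);  // true
console.log(asString.length);            // 2 (length counts code units)
```

Either form can be recovered from the other in one call, so the choice between them is mostly about which is more convenient at the call site.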

Yielding both in objects, as @michaelficarra mentioned in #6 (comment), would satisfy this need, I think.
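
One way "yielding both in objects" might look is sketched below. The generator name `codePointEntries` and the property names `codePoint` and `string` are hypothetical, chosen for illustration; they are not taken from the proposal. The sketch relies only on the default string iterator and `codePointAt`.

```javascript
// Hypothetical helper: yields each code point of `input` in both forms.
// The name and the shape of the yielded objects are assumptions.
function* codePointEntries(input) {
  // The default string iterator yields one code point per step.
  for (const string of input) {
    yield { codePoint: string.codePointAt(0), string };
  }
}

const entries = [...codePointEntries("\u0041\ud801\udc00\u0042")];

console.log(entries.length);       // 3
console.log(entries[1].codePoint); // 66560 (0x10400)
console.log(entries[1].string);    // "𐐀"
```

A parser that mostly concatenates could read `string`, while one that does range checks could read `codePoint`.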

@mathiasbynens
Member

@js-choi The default JS string iterator most definitely iterates over code points rather than UCS-2/UTF-16 code units. Whichever engine behaves the way you described is violating the spec.
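
A quick check of this behavior, using the example string from the top of the thread: spreading a string uses the default iterator and keeps the surrogate pair together, whereas `split("")` and `length` operate on UTF-16 code units. The variable names are illustrative only.

```javascript
const text = "\u0041\ud801\udc00\u0042"; // "A𐐀B"

// Default iterator (used by spread and for...of): one element per code point.
const byCodePoint = [...text];

// split("") cuts between UTF-16 code units, separating the surrogate pair.
const byCodeUnit = text.split("");

console.log(byCodePoint.length); // 3
console.log(byCodeUnit.length);  // 4
console.log(text.length);        // 4
console.log(byCodePoint[1] === "\ud801\udc00"); // true
```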

@mathiasbynens
Member

“The default iterator splits strings between their UTF-16 code units, not between their code points.”

Where are you getting this? This is false.

@js-choi
Author

js-choi commented May 12, 2018

Ah, this is supremely embarrassing. You are of course correct; I had been recalling the default iterator’s behavior incorrectly. I’m not sure where I got the false memory that it split between UTF-16 code units… I must have gotten my wires crossed with some other programming languages that don’t split between code points; I have been planning for some time to port some parser code written in such languages to JavaScript.

Heh, my sincere apologies; this issue is resolved. Thanks, everyone.

@js-choi js-choi closed this as completed May 12, 2018