Integer versus string representation of code points #6

js-choi · 2018-05-11T22:50:21Z

#5 (comment) reminded me that I wanted to ask: Is there a particular reason why an integer representation for code points was chosen instead of a 1-or-2-UTF16-code-unit string representation? With the latter,
"\u0041\ud801\udc00\u0042".codePoints() would then yield
"\u0041" then
"\ud801\udc00" then
"\u0042"?

I would personally find such a string representation to be more generally useful. My parsers concatenate code points into new strings much more often than they perform integer arithmetic on them. But people’s usage here may vary, I suppose.

michaelficarra · 2018-05-12T00:47:14Z

We could just do both, as @mathiasbynens suggested: #1 (comment)
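One way to picture the "do both" idea is a generator that yields, for each code point, an object carrying both the integer and the string form. This is only a sketch: the function name `codePointEntries` and the property names `codePoint` and `string` are hypothetical, chosen for illustration, and are not the proposal's API.

```javascript
// Hypothetical helper: not part of the proposal or of any standard library.
function* codePointEntries(input) {
  // The default string iterator already walks the string by code point.
  for (const string of input) {
    yield { codePoint: string.codePointAt(0), string };
  }
}

const entries = [...codePointEntries("\u0041\ud801\udc00\u0042")];
console.log(entries.map((e) => e.codePoint.toString(16))); // [ '41', '10400', '42' ]
console.log(entries.map((e) => e.string.length));          // [ 1, 2, 1 ]
```

A consumer that concatenates would read `string`; one that does arithmetic would read `codePoint`.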

mathiasbynens · 2018-05-12T01:41:48Z

Note that if these string values are all you need, you can already get them by iterating over the string.

for (const symbol of string) {
  console.log(symbol);
}

js-choi · 2018-05-12T01:51:25Z

[Edit: I had been incorrectly remembering the default string iterator’s behavior; see https://github.com//issues/6#issuecomment-388534535.]

@mathiasbynens: The default string iterator does not work for SMP/non-BMP code points (that is, code points beyond U+FFFF), such as the string in the example. The default iterator splits non-BMP code points into their surrogate pairs. [..."\u0041\ud801\udc00\u0042"], aka [..."A𐐀B"], results in an array with four strings each with one UTF-16 code unit (["A", "\ud801", "\udc00", "B"]), rather than three strings each with one Unicode code point (["A", "𐐀", "B"]). Use cases like these are why this repository’s proposal would be useful.

…Unless I’m misunderstanding the example, which certainly is possible.

ljharb · 2018-05-12T01:53:14Z

This api would have to work the same as the default iterator imo to be called “codePoints”. It sounds like you want graphemes, which would be a separate thing?

js-choi · 2018-05-12T02:06:09Z

[Edit: I had been incorrectly remembering the default string iterator’s behavior; see https://github.com//issues/6#issuecomment-388534535.]

@ljharb: Could you clarify what you mean by “work the same as the default iterator”? The default iterator splits strings between their UTF-16 code units, not between their code points.

In this case, I don’t want to split strings between graphemes; I want to split strings between code points. As you probably already know, splitting strings by graphemes is a much more complicated problem than simply splitting by code points (or even by combining character sequences). Grapheme segmentation is language- and culture-dependent, and its general rules as defined by UAX 29 are relatively complex…and hopefully, for JavaScript, Intl.Segmenter will eventually deal with the complexity of graphemes anyway. (Readers interested in this sort of thing can take a look at the Unicode FAQ.)

My parsers often want simply to handle Unicode text without splitting surrogate pairs. They need to consume each consecutive code point of the input string—rather than by UTF-16 code units, by combining character sequences, or by graphemes of some locale.

Both the UTF-16 surrogate-pair string "\ud801\udc00" and the integer 0x10400 are reasonable representations of the same code point U+10400. Sometimes the former representation is more useful and sometimes the latter is more useful. But I usually find the former more useful.
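The two representations convert into each other with the standard built-ins `String.prototype.codePointAt` and `String.fromCodePoint`. A minimal sketch follows; the variable names are illustrative, and the manual surrogate arithmetic is included only to make the relationship explicit.

```javascript
const asString = "\ud801\udc00";
const asInteger = 0x10400;

// String form -> integer form.
console.log(asString.codePointAt(0) === asInteger); // true

// Integer form -> string form.
console.log(String.fromCodePoint(asInteger) === asString); // true

// The same mapping by hand, from the UTF-16 surrogate-pair encoding.
const high = asString.charCodeAt(0); // 0xD801
const low = asString.charCodeAt(1);  // 0xDC00
const decoded = (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000;
console.log(decoded === asInteger); // true
```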

Yielding both in objects, as @michaelficarra mentioned in #6 (comment), would satisfy this need, I think.

mathiasbynens · 2018-05-12T02:32:06Z

@js-choi The default JS string iterator most definitely iterates over code points rather than UCS-2/UTF-16 code units. Whichever engine behaves the way you described is violating the spec.
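The difference is easy to check with the string from the top of this thread. The sketch below uses only standard built-ins: spread over the default string iterator, and `split("")`, which does split by UTF-16 code unit. The variable names are illustrative.

```javascript
const text = "\u0041\ud801\udc00\u0042";

// Default string iterator: one string per code point.
const byCodePoint = [...text];
console.log(byCodePoint.length); // 3
console.log(byCodePoint[1] === "\ud801\udc00"); // true

// split("") and the length property work on UTF-16 code units instead.
const byCodeUnit = text.split("");
console.log(byCodeUnit.length); // 4
console.log(text.length); // 4
```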

mathiasbynens · 2018-05-12T02:33:09Z

“The default iterator splits strings between their UTF-16 code units, not between their code points.”

Where are you getting this? This is false.

js-choi · 2018-05-12T06:39:55Z

Ah, this is supremely embarrassing. You are of course correct; I had been recalling the default iterator’s behavior incorrectly. I’m not sure where I got the false memory that it split between UTF-16 code units. I must have gotten my wires crossed with some other programming languages that don’t split between code points; I have been planning for some time to port some parser code written in such languages to JavaScript.

Heh, my sincere apologies; this issue is resolved. Thanks, everyone.

js-choi closed this as completed May 12, 2018
