Lone surrogates #20

annevk · 2017-12-01T09:33:08Z

How does this API deal with lone surrogates? Are those a valid grapheme cluster?

littledan · 2017-12-04T13:26:38Z

Looks like lone surrogates are typically treated as control characters, which get a break on both sides. These are the Unicode semantics, but I haven't tested with various implementations yet.

caridy · 2017-12-04T19:35:25Z

@annevk great question. In general, I would say that ECMA-402 is underspecified when it comes to surrogates; they are only mentioned briefly for the collator. @littledan maybe this is a good opportunity to improve that area altogether.

littledan · 2017-12-05T04:12:32Z

I can't imagine how it could be better specified than those details in UTS29 above. If we weaken the definition of breaks to allow more diversity of conforming implementations, do we want to then go back and specify the behaviour for surrogates?

caridy · 2017-12-05T05:20:22Z

I will add it to my list of things to discuss in the next/first 402 meeting :) if we have time.

annevk · 2017-12-05T06:52:45Z

If you defer to UTS29, it does seem that this is defined, but it would be good to have tests, as lone surrogates are a bit of an edge case.

littledan · 2017-12-05T19:34:21Z

Looking again at the spec, I see that I've merged a patch to no longer reference UTS29; see ad6fa34. This was done based on a request from Mozilla engineers (which?) to allow this to have line breaking which matched the surrounding platform, where theirs was not based on UTS29. I guess we could be more specific just about lone surrogates, but I don't really see why that's more important than, well, everything else that we're leaving just plain undefined.

annevk · 2017-12-06T08:12:08Z

I see, it would be nice to be a little stricter than that or at least figure out what's common, but I guess nobody wants to sign up for that.

Seems that the minimum you could require in such a case is that the algorithm in question has an answer for all code points and not just scalar values or some such.
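To make the "all code points, not just scalar values" distinction concrete, here is a minimal sketch that lists the lone surrogates in a string. The helper name `findLoneSurrogates` is hypothetical and not part of any spec or proposal; it relies only on the standard string iterator, which yields an unpaired surrogate as a one-unit string.

```javascript
// Hypothetical helper (not from the proposal): report every lone surrogate.
// A scalar value is any code point outside U+D800..U+DFFF, so whatever this
// returns is exactly the input an algorithm "for scalar values only" skips.
function findLoneSurrogates(text) {
  const found = [];
  let index = 0; // offset in UTF-16 code units
  for (const ch of text) {
    // Iterating a string yields whole code points. A valid surrogate pair
    // comes out as one two-unit string whose code point is above U+FFFF;
    // an unpaired surrogate comes out alone with its own value.
    const codePoint = ch.codePointAt(0);
    if (codePoint >= 0xd800 && codePoint <= 0xdfff) {
      found.push({ index, codePoint });
    }
    index += ch.length;
  }
  return found;
}
```

For example, `findLoneSurrogates("a\uD800b")` reports one entry at index 1, while a well-formed pair such as `"\uD83D\uDE00"` yields an empty list.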

littledan · 2017-12-06T09:24:46Z

@annevk I don't understand the details of why Firefox wants to use a different line breaking algorithm (for lower memory usage?) and exactly how they compare. But this is a bigger issue to discuss mostly in CSSWG.

I think the current spec basically does say that it should have an answer for all code points. Even in the more general form, it doesn't give the algorithm an opportunity to "error out" on some strings.
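The "has an answer for every string" property can be checked without assuming where the boundaries fall. The sketch below assumes the `Intl.Segmenter` API as it later shipped (a `segment()` method returning an iterable of `{ segment }` records), which may differ from the draft discussed in this thread; `segmentTotally` and `coversInput` are hypothetical names.

```javascript
// Assumes a runtime that provides Intl.Segmenter (e.g. a recent Node.js).
// Collect the segments for a string at the given granularity.
function segmentTotally(text, granularity = "grapheme") {
  const segmenter = new Intl.Segmenter(undefined, { granularity });
  const pieces = [];
  for (const { segment } of segmenter.segment(text)) {
    pieces.push(segment);
  }
  return pieces;
}

// True when segmentation neither throws nor drops or invents code units.
// Where the boundaries land is deliberately not checked, since the thread
// leaves that implementation-dependent.
function coversInput(text, granularity = "grapheme") {
  const pieces = segmentTotally(text, granularity);
  return pieces.every((p) => p.length > 0) && pieces.join("") === text;
}
```

A test along these lines, run over inputs such as `"a\uD800b"` or `"\uDC00"`, would exercise lone surrogates without constraining how an implementation groups them.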

annevk · 2017-12-06T09:46:05Z

Hmm, so one thing that could be tested is that CSS and this use the same algorithm.

littledan · 2017-12-06T10:03:38Z

FWIW I need to fix #17 before that's a valid test.

mathiasbynens · 2017-12-10T11:01:12Z

cc @bakkot

littledan · 2018-04-17T13:35:24Z

We've discussed at a few Intl calls whether we should revert #18 or not. It seems like we're still waiting for more data from Mozilla about whether they would be able to ship this data or not. If we do decide to make normative references to UAX 29 and 14, that answers the question; otherwise, I think we should leave this implementation-dependent.

gibson042 · 2019-05-04T13:08:35Z

If there's still a question here, I think it could be resolved by adding text that just requires consistent successful handling for all input but leaves the specifics of such handling implementation-defined, inside or outside a note and with or without a non-normative reference to UAX #14.

littledan · 2019-05-06T09:02:28Z

What's the current status of Mozilla and ICU segmentation? Has anything changed? Cc @jswalden @zbraniecki

I don't see why we would reference UAX/UTS 14, as we no longer include line breaking. We could reference UTS 29, if Mozilla is OK with it.

annevk · 2019-05-07T12:12:05Z

@jfkthame you should maybe chime in here.

jfkthame · 2019-05-07T13:19:55Z

Is this only about grapheme cluster segmentation, or do we also need to consider the other types of segment breaks described in UTS 29?

For grapheme-cluster breaks, UTS 29 is clear there should be a break before and after a lone surrogate, thanks to rule GB999.

But for other kinds of segmentation, it looks like UTS 29 no longer fully addresses the issue of lone surrogate code units, as it was updated for Unicode 12 (see the Modifications section) to classify them as XX (unknown). And for word breaks, the behavior of XX is not specified in UTS 29:

> characters with the Line_Break property values of [...] Unknown (XX) are assigned Word_Break property values based on criteria outside of the scope of this annex

It seems appropriate to me that there should be a word break before and after any lone surrogate, but I don't know if there's any spec that explicitly says so at this point.

As for Mozilla's implementation: I think we're fine with using UTS 29 for the definition of grapheme cluster boundaries. If our implementation fails to follow its rules, that would simply be a bug that we should fix.

I'm unsure about other levels of segmentation: I don't think we currently ship all the ICU code to support word segmentation, and our existing code is unlikely to exactly match it. (It looks quite primitive, actually.) Whether we're ready to add all the ICU segmentation code would be a product-level decision, weighing up the cost (in added download and install size) vs the benefit.

gibson042 · 2019-11-20T12:06:53Z

We'll leave implementations to do the right thing here with respect to boundary determination in any given locale, rather than overconstrain them.

xorgy · 2021-02-17T20:49:31Z

> For grapheme-cluster breaks, UTS 29 is clear there should be a break before and after a lone surrogate, thanks to rule GB999.

It doesn't seem so clear to me that GB999 means that. There is no mention of lone surrogates here, and while lone surrogates would fall under Any (as they are in none of the ranges in GraphemeBreakProperty.txt), Any is still unconditionally joined with Prepend, and possibly with SpacingMark, Extend, and ZWJ(!).
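The rule interaction described here can be sketched over Grapheme_Cluster_Break classes rather than real characters. This is a simplified subset of the UAX #29 rules (GB3 to GB5, GB9, GB9a, GB9b, GB999), omitting the Hangul, emoji, and regional-indicator rules; GB9a and GB9b apply to extended grapheme clusters only. The function name and the string class labels are hypothetical, and which class a lone surrogate actually receives depends on the Unicode version and implementation, so both cases are shown.

```javascript
// Simplified sketch: decide whether a grapheme cluster boundary falls
// between two adjacent characters, given only their break classes.
// "Other" stands for the default class ("Any" with no specific property).
function isBreakBetween(before, after) {
  const isHardBreak = (c) => c === "Control" || c === "CR" || c === "LF";
  if (before === "CR" && after === "LF") return false; // GB3: CR x LF
  if (isHardBreak(before)) return true; // GB4: break after controls
  if (isHardBreak(after)) return true; // GB5: break before controls
  if (after === "Extend" || after === "ZWJ") return false; // GB9
  if (after === "SpacingMark") return false; // GB9a (extended clusters)
  if (before === "Prepend") return false; // GB9b (extended clusters)
  return true; // GB999: otherwise break everywhere
}
```

Under these rules a character classed as `Control` is isolated on both sides, whereas one classed as `Other` still joins with a following `Extend`, `ZWJ`, or `SpacingMark` and with a preceding `Prepend`, because GB999 only applies when no earlier rule matched.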

jfkthame · 2021-02-17T22:03:19Z

> > For grapheme-cluster breaks, UTS 29 is clear there should be a break before and after a lone surrogate, thanks to rule GB999.
>
> It doesn't seem so clear to me that GB999 means that. There is no mention of lone surrogates here, and while lone surrogates would fall under Any (as they are in none of the ranges in GraphemeBreakProperty.txt), Any is still unconditionally joined with Prepend, and possibly with SpacingMark, Extend, and ZWJ(!).

Yes - on looking again now, that appears to be correct.

annevk mentioned this issue Dec 6, 2017

Platform conventions whatwg/infra#178

Closed

gibson042 closed this as completed Nov 20, 2019
