Update grapheme cluster break rules to Unicode 15.1 #4536

makotokato · 2024-01-23T05:42:14Z

robertbastian · 2024-01-23T10:01:08Z

provider/datagen/data/segmenter/uprops/small/ExtPict.toml

observation: these now match what's in icuexportdata, but as long as the rules are not versioned with ICU, I think it's preferable to ship our own copy.

eggrobin · 2024-01-23T13:43:31Z

@sffc requested a review from @eggrobin 5 hours ago

At a glance this looks plausible; I recently realized that the new rule does not have what I call extended context to the right, i.e., it does not require what this implementation calls intermediate break states; and the new InCB property mostly refines the GCB property, so there is not too much upheaval in the tables.
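For readers following along: the new rule is presumably GB9c from UAX #29 (Unicode 15.1), which forbids a break before an InCB=Consonant when the text to the left is a Consonant followed by a run of Extend/Linker containing at least one Linker. Since the rule only looks left, one pass with a small state is enough. The following is a minimal sketch under that assumption; the names `InCb`, `State`, and `gb9c_no_break` are hypothetical and this is not the ICU4X implementation.

```rust
/// Indic_Conjunct_Break values (None covers every other code point).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum InCb {
    None,
    Consonant,
    Extend,
    Linker,
}

/// State carried while scanning left to right.
#[derive(Clone, Copy, PartialEq, Eq)]
enum State {
    /// Not inside a potential conjunct.
    Start,
    /// Seen Consonant followed by Extend* (no Linker yet).
    AfterConsonant,
    /// Seen Consonant, then Extend/Linker with at least one Linker.
    AfterLinker,
}

/// Returns, for each boundary between classes[i] and classes[i + 1],
/// whether GB9c forbids a break there. All other rules are ignored.
pub fn gb9c_no_break(classes: &[InCb]) -> Vec<bool> {
    let mut result = Vec::new();
    let mut state = State::Start;
    for (i, &c) in classes.iter().enumerate() {
        if i > 0 {
            // The decision needs only the state built from the left context.
            result.push(state == State::AfterLinker && c == InCb::Consonant);
        }
        state = match (state, c) {
            (_, InCb::Consonant) => State::AfterConsonant,
            (State::AfterConsonant, InCb::Extend) => State::AfterConsonant,
            (State::AfterConsonant, InCb::Linker) => State::AfterLinker,
            (State::AfterLinker, InCb::Extend | InCb::Linker) => State::AfterLinker,
            _ => State::Start,
        };
    }
    result
}
```

Note that the decision at each boundary is made as soon as the right-hand class is seen, which is what "no extended context to the right" amounts to in this sketch.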

I would like to throw some monkeys at the question though (I really should have done monkey testing of the 15.0 version too, but it is much more straightforward than the other segmentations and no-one has complained about it, so it may well be fine). I will almost certainly not do this before Friday, since I have UTC today through Thursday; more realistically not before next week.

Manishearth

Looks mostly fine

Manishearth · 2024-01-23T19:24:25Z

provider/datagen/src/transform/segmenter/mod.rs

@@ -255,6 +261,56 @@ fn generate_rule_break_data(
                        continue;
                    }

+                    // InCB* isn't a part of grapheme break propery.
+                    // See https://unicode.org/reports/tr44/#Indic_Conjunct_Break
+                    if p.name == "InCBConsonant" || p.name == "InCBLinker" || p.name == "InCBExtend"


thought: this derivation will probably change relatively often, and this means there's one more part of the code to tweak every Unicode release.

I'm looking at icuexportdata and I don't see InCB in it (even in the released version I downloaded). @sffc do you know why it's not included? Do we not include derived props?

Even if we do not choose to turn this into a property, we should at the very least construct a CPT from InCB.toml here and use it.
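To illustrate the "build a lookup once and use it" idea, here is a rough sketch that uses a sorted range table with binary search as a stand-in for a code point trie. `RangeTable` is a hypothetical type made up for this example; it is not the ICU4X `CodePointTrie` API, and the ranges in the usage note are illustrative rather than a full copy of InCB.toml.

```rust
use std::ops::RangeInclusive;

/// Hypothetical stand-in for a code point trie: sorted, non-overlapping
/// ranges mapped to a value, queried by binary search.
pub struct RangeTable<V> {
    ranges: Vec<(u32, u32, V)>,
    default: V,
}

impl<V: Copy> RangeTable<V> {
    /// Builds the table once; `entries` may be in any order but must not overlap.
    pub fn new(entries: &[(RangeInclusive<u32>, V)], default: V) -> Self {
        let mut ranges: Vec<(u32, u32, V)> = entries
            .iter()
            .map(|(r, v)| (*r.start(), *r.end(), *v))
            .collect();
        ranges.sort_by_key(|&(start, _, _)| start);
        Self { ranges, default }
    }

    /// Looks up the value for a single character.
    pub fn get(&self, c: char) -> V {
        let cp = c as u32;
        // Number of ranges whose start is <= cp; the candidate is the last of them.
        let idx = self.ranges.partition_point(|&(start, _, _)| start <= cp);
        match idx.checked_sub(1).map(|i| self.ranges[i]) {
            Some((_, end, v)) if cp <= end => v,
            _ => self.default,
        }
    }
}
```

With a table built from, say, `(0x0915..=0x0939, "Consonant")` and `(0x094D..=0x094D, "Linker")` and a default of `"None"`, the property name checks in the datagen loop could become lookups against one structure constructed up front.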

Is this a merged property similar to General Category Group?

No, it's a DerivedCoreProperty

https://unicode.org/reports/tr44/#Indic_Conjunct_Break

Hmm, how stable is that formula for computing Indic_Conjunct_Break?

My first reaction is that we should export a function/class that depends on the existing data (looks like it needs Script and Indic_Syllabic_Category?), and then we can use it from here in segmenter datagen.

Please note that constructing CPTs is not cheap since it uses the WASM or ICU4C code path. We may want a class that uses a runtime algorithm.

Hmm, how stable is that formula for computing Indic_Conjunct_Break?

It is not stable at all, we are currently actively in discussion for changing it.

Please note that constructing CPTs is not cheap since it uses the WASM or ICU4C code path. We may want a class that uses a runtime algorithm.

Yeah but we only have to do it once here. We'd have to do it anyway if we chose to make the property public; I'm just proposing a solution that doesn't change the public properties API.

I want to have a discussion, including with ICU4C, on whether this and similar properties should be exported as data (maybe a bit simpler through the stack) or whether they should always be derived at runtime from existing data (sharing more data but maybe slower).

That's fair.

I don't think that blocks anything I want to do here, though. I am not proposing we add InCB as a public ICU4X property with data (at the moment). For ICU4C, the norm so far has been that all derived properties have ICU4C data and APIs; I don't think it makes sense to block adding a uchar.h API on deciding whether the API is computing or automagical, and as far as I can tell any such uchar.h API can have its implementation swapped out rather easily.

I'll also note that all core derived properties currently have ICU4X data and APIs as well, except for InCB. It definitely makes sense to try and winnow data here if possible: a fair number of these are simple general category masks (others invoke normalization and such, which is probably worth precomputing). The tricky thing is figuring out if it's worth adding alternate codepaths to CodePointSetData and breaking the not-guaranteed but useful invariant that CodePointSetData is always backed by a CodePointInversionList.

Another point that was brought up in discussion yesterday was that the algorithmic approach can sometimes get out-of-sync with the data published by Unicode, in addition to being subject to change over time. It seems the source of truth for property values is the UCD, not the spec algorithm.

Given all this background I'm okay adding it as a traditional property with a traditional data pipeline.
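One way to catch the out-of-sync risk described above is to compare a hardcoded derivation against the published data. The sketch below assumes data lines shaped like `0915..0939 ; InCB; Consonant # comment`; both function names are hypothetical, and it only checks code points that the data lists explicitly, not those that are implicitly None.

```rust
/// Parses one data line of the form `0915..0939 ; InCB; Consonant # comment`
/// into (start, end, value). Returns None for blank, comment-only, or
/// non-InCB lines.
pub fn parse_incb_line(line: &str) -> Option<(u32, u32, String)> {
    let data = line.split('#').next()?.trim();
    let mut fields = data.split(';').map(str::trim);
    let range = fields.next()?;
    if fields.next()? != "InCB" {
        return None;
    }
    let value = fields.next()?.to_string();
    // A single code point is treated as a range of length one.
    let (start, end) = match range.split_once("..") {
        Some((a, b)) => (a, b),
        None => (range, range),
    };
    Some((
        u32::from_str_radix(start, 16).ok()?,
        u32::from_str_radix(end, 16).ok()?,
        value,
    ))
}

/// Returns every listed code point where `derive` disagrees with the data.
pub fn mismatches(data: &str, derive: impl Fn(u32) -> &'static str) -> Vec<u32> {
    let mut out = Vec::new();
    for (start, end, value) in data.lines().filter_map(parse_incb_line) {
        for cp in start..=end {
            if derive(cp) != value {
                out.push(cp);
            }
        }
    }
    out
}
```

An empty result from `mismatches` would mean the derivation agrees with the data on every listed code point; anything else points at where the two have drifted apart.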

Yes. I forgot to mention this yesterday but the algorithms are not even maintained in ICU4C, they're maintained in unicodetools.

explicit note: this isn't a blocking concern

Manishearth · 2024-01-23T19:39:53Z

provider/datagen/data/segmenter/grapheme.toml

+[[tables]]
+name = "Extended_Pictographic_Extend"
+left = "Extended_Pictographic"
+right = "InCBExtend"


@eggrobin looks like this accounts for InCBLinker/InCBExtend characters that are also Extend and thus need to be handled by GB11.

Is there any likelihood InCB will change such that they contain things that are not Extend?

we should probably have comments documenting this here anyway?

sffc · 2024-01-25T18:49:59Z

@aethanyc PTAL

Manishearth · 2024-01-25T18:52:07Z

This is r=me for all changes aside from the toml changes. The toml changes look right to me but I would like someone else to take a look. If we can't find anyone with time, I'm fine to merge.

aethanyc

Question: is it worth capturing the text in #4365 (comment) as a testcase, and assert the char count is 151 (same as ICU4C)?

aethanyc · 2024-01-26T01:15:06Z

components/segmenter/tests/testdata/GraphemeBreakExtraTest.txt

@@ -0,0 +1,8 @@
+÷ 0915 × 094D ÷ 1100 ÷ # ÷ DEVANAGARI LETTER KA (ConjunctLinkingScripts_LinkingConsonant) × DEVANAGARI SIGN VIRAMA (Extend_ConjunctLinkingScripts_ConjunctLinker_ExtCccZwj) ÷ HANGUL CHOSEONG KIYEOK (L) ÷


Nit: Other extra test files all have a header and issue number, so maybe add these lines.

# Additional grapheme breaking tests, not in GraphemeBreakTest.txt
#
# https://github.com/unicode-org/icu4x/issues/4365
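For reference, the test line format above can be read as follows: `÷` marks an expected break, `×` marks a position where no break is allowed, the hex tokens are code points, and everything after `#` is a comment. A minimal parsing sketch, with a hypothetical function name that is not taken from the ICU4X test harness:

```rust
/// Parses a line in the GraphemeBreakTest.txt style, e.g.
/// `÷ 0915 × 094D ÷ 1100 ÷ # comment`, into the text and the expected
/// break positions, counted in chars from the start of the text.
/// Returns None for blank or comment-only lines and for malformed input.
pub fn parse_break_test_line(line: &str) -> Option<(String, Vec<usize>)> {
    let data = line.split('#').next()?.trim();
    if data.is_empty() {
        return None;
    }
    let mut text = String::new();
    let mut breaks = Vec::new();
    let mut chars_so_far = 0;
    for token in data.split_whitespace() {
        match token {
            "÷" => breaks.push(chars_so_far),
            "×" => {}
            hex => {
                let cp = u32::from_str_radix(hex, 16).ok()?;
                text.push(char::from_u32(cp)?);
                chars_so_far += 1;
            }
        }
    }
    Some((text, breaks))
}
```

The positions here are char indices; a segmenter that reports byte offsets into a `str` would need them converted before comparing.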

sffc · 2024-01-30T20:06:32Z

@eggrobin Are you looking into running your monkey tests? Would you like to do that before merging the PR?

eggrobin · 2024-01-31T15:06:08Z

Are you looking into running your monkey tests?

Yes. They fail.

Would you like to do that before merging the PR?

Yes, because at first glance this seems to be breaking things that can occur in real text.

eggrobin · 2024-01-31T16:24:07Z

Making progress on this one; I will push some commits to this branch once I am done.

Unfortunately I found a bug in the Unicode Standard: the derivation of InCB is wrong (those who can view the PAG issue tracker will see the reference above). ICU does not use the wrong derivation and so luckily escapes the bug. I will update the derivation hardcoded in this PR according to my proposed resolution.

Manishearth · 2024-01-31T16:46:50Z

the derivation of InCB is wrong

so we have what ... three issues about fixing InCB now? oy vey

eggrobin · 2024-01-31T16:52:34Z

so we have what ... three issues about fixing InCB now? oy vey

I would consider the other ones to be about improving things. This is just flat-out wrong, I failed to make it equivalent to the ICU implementation, and I made it inconsistent with normalization. D’oh.

eggrobin

I asked a hundred thousand monkeys for their opinion, and they thought it looks good.
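For anyone unfamiliar with monkey testing: the idea is to generate random inputs and check that the implementation under test agrees with a slow but obviously correct reference. The sketch below does this for the GB9c-style condition only, with classes encoded as bytes and a small xorshift generator so it needs no external crates. All names are hypothetical; this is not the actual monkey test used for this PR.

```rust
// Classes are encoded as bytes: b'C' Consonant, b'E' Extend, b'L' Linker,
// b'N' anything else.

/// Reference: reads the rule directly by scanning backwards from each boundary.
pub fn reference(classes: &[u8]) -> Vec<bool> {
    (1..classes.len())
        .map(|i| {
            if classes[i] != b'C' {
                return false;
            }
            // Walk left over Extend/Linker, noting whether a Linker was seen.
            let mut seen_linker = false;
            let mut j = i;
            while j > 0 && matches!(classes[j - 1], b'E' | b'L') {
                seen_linker |= classes[j - 1] == b'L';
                j -= 1;
            }
            seen_linker && j > 0 && classes[j - 1] == b'C'
        })
        .collect()
}

/// Implementation under test: single left-to-right pass with a small state.
pub fn state_machine(classes: &[u8]) -> Vec<bool> {
    let mut out = Vec::new();
    // 0: outside a conjunct, 1: after Consonant Extend*, 2: Linker seen.
    let mut state = 0u8;
    for (i, &c) in classes.iter().enumerate() {
        if i > 0 {
            out.push(state == 2 && c == b'C');
        }
        state = match (state, c) {
            (_, b'C') => 1,
            (1, b'E') => 1,
            (1, b'L') | (2, b'E') | (2, b'L') => 2,
            _ => 0,
        };
    }
    out
}

/// Runs `iterations` random comparisons; returns the first failing input.
pub fn monkey_test(iterations: usize, seed: u64) -> Option<Vec<u8>> {
    let mut x = seed.max(1); // xorshift state must be non-zero
    for _ in 0..iterations {
        let mut input = Vec::new();
        for _ in 0..8 {
            x ^= x << 13;
            x ^= x >> 7;
            x ^= x << 17;
            input.push(b"CELN"[(x % 4) as usize]);
        }
        if reference(&input) != state_machine(&input) {
            return Some(input);
        }
    }
    None
}
```

Returning the failing input rather than just a boolean makes it easy to turn a failure into a fixed regression test afterwards.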

Manishearth

r+ to robin's edits

Update grapheme cluster break rules to Unicode 15.1

d21d197

makotokato requested review from sffc, robertbastian, Manishearth, aethanyc and a team as code owners January 23, 2024 05:42

sffc requested a review from eggrobin January 23, 2024 07:50

robertbastian reviewed Jan 23, 2024

Manishearth reviewed Jan 23, 2024

sffc removed their request for review January 25, 2024 18:49

aethanyc previously approved these changes Jan 26, 2024

Verbose error reporting in the GCB tests

9764000

eggrobin added 2 commits January 31, 2024 17:50

Fixes to the rules

28940b4

Bad property derivation. Egg on egg’s face.

06d9965

eggrobin added 2 commits January 31, 2024 17:55

bake the data

e17e831

A hundred monkeys

253004a

eggrobin dismissed aethanyc’s stale review via 253004a January 31, 2024 17:08

eggrobin added 3 commits January 31, 2024 18:58

cargo make testdata

08ce937

Merge remote-tracking branch 'la-vache/main' into HEAD

70c3079

cargo make testdata anew

52cfa0a

eggrobin force-pushed the gcb15_1 branch from 6c52343 to 52cfa0a Compare January 31, 2024 18:03

eggrobin previously approved these changes Jan 31, 2024

eggrobin mentioned this pull request Feb 1, 2024

ICU-22518 Export monkeys unicode-org/icu#2637

Merged

7 tasks

Restructure the property derivation based on feedback from Asmus and …

b53ebcd

…Robert

eggrobin dismissed their stale review via b53ebcd February 1, 2024 12:10

Manishearth approved these changes Feb 2, 2024

sffc merged commit 51b3719 into unicode-org:main Feb 8, 2024
30 checks passed

aethanyc mentioned this pull request Feb 9, 2024

Change the crate name from icu_unitsconversion to icu_units #4593

Merged

aethanyc added a commit to aethanyc/icu4x that referenced this pull request Mar 20, 2024

Update grapheme.toml and sentence.toml headers

d161e9d

We've updated sentence segmenter to 15.1 in unicode-org#4625 and grapheme cluster segmenter in unicode-org#4536.

aethanyc mentioned this pull request Mar 20, 2024

Update grapheme.toml and sentence.toml headers #4711

Merged

aethanyc mentioned this pull request Apr 6, 2024

Unexpected grapheme boundary with regional indicators (GB12) #4780

Closed
