Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update grapheme cluster break rules to Unicode 15.1 #4536

Merged
merged 10 commits into from
Feb 8, 2024

Conversation

makotokato
Copy link
Member

@makotokato makotokato commented Jan 23, 2024

Fixes #4365

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

observation: these now match what's in icuexportdata, but as long as the rules are not versioned with ICU, I think it's preferable to ship our own copy.

@eggrobin
Copy link
Member

@sffc requested a review from @eggrobin 5 hours ago

At a glance this looks plausible; I recently realized that the new rule does not have what I call extended context to the right, i.e., it does not require what this implementation calls intermediate break states; and the new InCB property mostly refines the GCB property, so there is not too much upheaval in the tables.

I would like to throw some monkeys at the question though (I really should have done monkey testing of the 15.0 version too, but it is much more straightforward than the other segmentations and no-one has complained about it, so it may well be fine). I will almost certainly not do his before Friday, since I have UTC today through Thursday; more realistically not before next week.

Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks mostly fine

@@ -255,6 +261,56 @@ fn generate_rule_break_data(
continue;
}

// InCB* isn't a part of grapheme break propery.
// See https://unicode.org/reports/tr44/#Indic_Conjunct_Break
if p.name == "InCBConsonant" || p.name == "InCBLinker" || p.name == "InCBExtend"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: this derivation will probably change relatively often, and this means there's one more part of the code to tweak every Unicode release.

I'm looking at icuexportdata and I don't see InCB in it (even in the released version I downloaded). @sffc do you know why it's not included? Do we not include derived props?

Even if we do not choose to turn this into a property, we should at the very least construct a CPT from InCB.toml here and use it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a merged property similar to General Category Group?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, how stable is that formula for computing Indic_Conjunct_Break?

My first reaction is that we should export a function/class that depends on the existing data (looks like it needs Script and Indic_Syllabic_Category?), and then we can use it from here in segmenter datagen.

Please note that constructing CPTs is not cheap since it uses the WASM or ICU4C code path. We may want a class that uses a runtime algorithm.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, how stable is that formula for computing Indic_Conjunct_Break?

It is not stable at all, we are currently actively in discussion for changing it.

Please note that constructing CPTs is not cheap since it uses the WASM or ICU4C code path. We may want a class that uses a runtime algorithm.

Yeah but we only have to do it once here. We'd have to do it anyway if we chose to make the property public; I'm just proposing a solution that doesn't change the public properties API.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to have a discussion, including with ICU4C, on whether this and similar properties should be exported as data (maybe a bit simpler through the stack) or whether they should always be derived at runtime from existing data (sharing more data but maybe slower).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair.

I don't think that blocks anything I want to do here, though. I am not proposing we add InCB as a public ICU4X property with data (at the moment), and for ICU4C the norm so far has been that all derived properties have ICU4C data and APIs, I don't think it makes sense to block adding a uchar.h API on deciding whether the API is computing or automagical, and as far as I can tell any such uchar.h API can have its implementation swapped out rather easily.

I'll also note that all core derived properties currently have ICU4X data and APIs as well, except for InCB. It definitely makes sense to try and winnow data here if possible, a fair number of these are simple general category masks (others invoke normalization and such, that's probably worth precomputing). The tricky thing is figuring out if it's worth adding alternate codepaths to CodePointSetData and breaking the not-guaranteed but useful invariant that CodePointSetData is always backed by a CodePointInversionList.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another point that was brought up in discussion yesterday was that the algorithmic approach can sometimes get out-of-sync with the data published by Unicode, in addition to being subject to change over time. It seems the source of truth for property values is the UCD, not the spec algorithm.

Given all this background I'm okay adding it as a traditional property with a traditional data pipeline.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I forgot to mention this yesterday but the algorithms are not even maintained in ICU4C, they're maintained in unicodetools.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explicit note: this isn't a blocking concern

[[tables]]
name = "Extended_Pictographic_Extend"
left = "Extended_Pictographic"
right = "InCBExtend"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eggrobin looks like this accounts for InCBLinker/InCBExtend characters that are also Extend and thus need to be handled by GB11.

Is there any likelihood InCB will change such that they contain things that are not Extend?

we should probably have comments documenting this here anyway?

@sffc sffc removed their request for review January 25, 2024 18:49
@sffc
Copy link
Member

sffc commented Jan 25, 2024

@aethanyc PTAL

@Manishearth
Copy link
Member

This is r=me for all changes aside from the toml changes. The toml changes look right to me but I would like someone else to take a look. If we can't find anyone with time, i'm fine to merge.

aethanyc
aethanyc previously approved these changes Jan 26, 2024
Copy link
Contributor

@aethanyc aethanyc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: is it worth capturing the text in #4365 (comment) as a testcase, and assert the char count is 151 (same as ICU4C)?

@@ -0,0 +1,8 @@
÷ 0915 × 094D ÷ 1100 ÷ # ÷ DEVANAGARI LETTER KA (ConjunctLinkingScripts_LinkingConsonant) × DEVANAGARI SIGN VIRAMA (Extend_ConjunctLinkingScripts_ConjunctLinker_ExtCccZwj) ÷ HANGUL CHOSEONG KIYEOK (L) ÷
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Other extra test files all have a header and issue number, so maybe add these lines.

# Additional grapheme breaking tests, not in GraphemeBreakTest.txt
#
# https://github.com/unicode-org/icu4x/issues/4365

@sffc
Copy link
Member

sffc commented Jan 30, 2024

@eggrobin Are you looking into running your monkey tests? Would you like to do that before merging the PR?

@eggrobin
Copy link
Member

Are you looking into running your monkey tests?

Yes. They fail.

Would you like to do that before merging the PR?

Yes, because at first glance this seems to be breaking things that can occur in real text.

@eggrobin
Copy link
Member

Making progress on this one; I will push some commits to this branch once I am done.

Unfortunately I found a bug in the Unicode Standard (the derivation of InCB is wrong, those who can view the PAG issue tracker will see the reference above; ICU does not use the wrong derivation and so luckily escapes the bug). I will update the derivation hardcoded in this PR according to my proposed resolution.

@Manishearth
Copy link
Member

the derivation of InCB is wrong

so we have what ... three issues about fixing InCB now? oy vey

@eggrobin
Copy link
Member

so we have what ... three issues about fixing InCB now? oy vey

I would consider the other ones to be about improving things. This is just flat-out wrong, I failed to make it equivalent to the ICU implementation, and I made it inconsistent with normalization. D’oh.

eggrobin
eggrobin previously approved these changes Jan 31, 2024
Copy link
Member

@eggrobin eggrobin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I asked a hundred thousand monkeys for their opinion, and they thought it looks good.

Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r+ to robin's edits

@sffc sffc merged commit 51b3719 into unicode-org:main Feb 8, 2024
30 checks passed
aethanyc added a commit to aethanyc/icu4x that referenced this pull request Mar 20, 2024
We've updated sentence segmenter to 15.1 in unicode-org#4625 and grapheme cluster segmenter
in unicode-org#4536.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Segmenter does not implement the new Unicode 15.1 extended grapheme cluster definition
6 participants