-
Notifications
You must be signed in to change notification settings - Fork 182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update grapheme cluster break rules to Unicode 15.1 #4536
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
observation: these now match what's in icuexportdata
, but as long as the rules are not versioned with ICU, I think it's preferable to ship our own copy.
At a glance this looks plausible; I recently realized that the new rule does not have what I call extended context to the right, i.e., it does not require what this implementation calls intermediate break states; and the new InCB property mostly refines the GCB property, so there is not too much upheaval in the tables. I would like to throw some monkeys at the question though (I really should have done monkey testing of the 15.0 version too, but it is much more straightforward than the other segmentations and no-one has complained about it, so it may well be fine). I will almost certainly not do his before Friday, since I have UTC today through Thursday; more realistically not before next week. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks mostly fine
@@ -255,6 +261,56 @@ fn generate_rule_break_data( | |||
continue; | |||
} | |||
|
|||
// InCB* isn't a part of grapheme break propery. | |||
// See https://unicode.org/reports/tr44/#Indic_Conjunct_Break | |||
if p.name == "InCBConsonant" || p.name == "InCBLinker" || p.name == "InCBExtend" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thought: this derivation will probably change relatively often, and this means there's one more part of the code to tweak every Unicode release.
I'm looking at icuexportdata and I don't see InCB in it (even in the released version I downloaded). @sffc do you know why it's not included? Do we not include derived props?
Even if we do not choose to turn this into a property, we should at the very least construct a CPT from InCB.toml here and use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a merged property similar to General Category Group?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it's a DerivedCoreProperty
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, how stable is that formula for computing Indic_Conjunct_Break?
My first reaction is that we should export a function/class that depends on the existing data (looks like it needs Script and Indic_Syllabic_Category?), and then we can use it from here in segmenter datagen.
Please note that constructing CPTs is not cheap since it uses the WASM or ICU4C code path. We may want a class that uses a runtime algorithm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, how stable is that formula for computing Indic_Conjunct_Break?
It is not stable at all, we are currently actively in discussion for changing it.
Please note that constructing CPTs is not cheap since it uses the WASM or ICU4C code path. We may want a class that uses a runtime algorithm.
Yeah but we only have to do it once here. We'd have to do it anyway if we chose to make the property public; I'm just proposing a solution that doesn't change the public properties API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to have a discussion, including with ICU4C, on whether this and similar properties should be exported as data (maybe a bit simpler through the stack) or whether they should always be derived at runtime from existing data (sharing more data but maybe slower).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair.
I don't think that blocks anything I want to do here, though. I am not proposing we add InCB as a public ICU4X property with data (at the moment), and for ICU4C the norm so far has been that all derived properties have ICU4C data and APIs, I don't think it makes sense to block adding a uchar.h API on deciding whether the API is computing or automagical, and as far as I can tell any such uchar.h API can have its implementation swapped out rather easily.
I'll also note that all core derived properties currently have ICU4X data and APIs as well, except for InCB. It definitely makes sense to try and winnow data here if possible, a fair number of these are simple general category masks (others invoke normalization and such, that's probably worth precomputing). The tricky thing is figuring out if it's worth adding alternate codepaths to CodePointSetData and breaking the not-guaranteed but useful invariant that CodePointSetData is always backed by a CodePointInversionList.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another point that was brought up in discussion yesterday was that the algorithmic approach can sometimes get out-of-sync with the data published by Unicode, in addition to being subject to change over time. It seems the source of truth for property values is the UCD, not the spec algorithm.
Given all this background I'm okay adding it as a traditional property with a traditional data pipeline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I forgot to mention this yesterday but the algorithms are not even maintained in ICU4C, they're maintained in unicodetools.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
explicit note: this isn't a blocking concern
[[tables]] | ||
name = "Extended_Pictographic_Extend" | ||
left = "Extended_Pictographic" | ||
right = "InCBExtend" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aethanyc PTAL |
This is r=me for all changes aside from the toml changes. The toml changes look right to me but I would like someone else to take a look. If we can't find anyone with time, i'm fine to merge. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: is it worth capturing the text in #4365 (comment) as a testcase, and assert the char count is 151 (same as ICU4C)?
@@ -0,0 +1,8 @@ | |||
÷ 0915 × 094D ÷ 1100 ÷ # ÷ DEVANAGARI LETTER KA (ConjunctLinkingScripts_LinkingConsonant) × DEVANAGARI SIGN VIRAMA (Extend_ConjunctLinkingScripts_ConjunctLinker_ExtCccZwj) ÷ HANGUL CHOSEONG KIYEOK (L) ÷ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Other extra test files all have a header and issue number, so maybe add these lines.
# Additional grapheme breaking tests, not in GraphemeBreakTest.txt
#
# https://github.com/unicode-org/icu4x/issues/4365
@eggrobin Are you looking into running your monkey tests? Would you like to do that before merging the PR? |
Yes. They fail.
Yes, because at first glance this seems to be breaking things that can occur in real text. |
Making progress on this one; I will push some commits to this branch once I am done. Unfortunately I found a bug in the Unicode Standard (the derivation of InCB is wrong, those who can view the PAG issue tracker will see the reference above; ICU does not use the wrong derivation and so luckily escapes the bug). I will update the derivation hardcoded in this PR according to my proposed resolution. |
so we have what ... three issues about fixing InCB now? oy vey |
I would consider the other ones to be about improving things. This is just flat-out wrong, I failed to make it equivalent to the ICU implementation, and I made it inconsistent with normalization. D’oh. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I asked a hundred thousand monkeys for their opinion, and they thought it looks good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
r+ to robin's edits
We've updated sentence segmenter to 15.1 in unicode-org#4625 and grapheme cluster segmenter in unicode-org#4536.
Fixes #4365