Segmenter does not implement the new Unicode 15.1 extended grapheme cluster definition #4365
Comments
This is grapheme clusters. Are the options set correctly and consistently? @aethanyc @makotokato @eggrobin Note: for line break and word break, ICU4C and ICU4X use different models for complex script segmentation. In many cases ICU4X is more correct than ICU4C. If you need identical behavior between C and X, use the dictionary constructors like |
Options are irrelevant here; this is expected. ICU4X and ICU4C are just using different definitions of EGCs: ICU4C has had a tailoring for years which has just been incorporated into Unicode 15.1, whereas ICU4X implements the 15.0 version without that tailoring. The difference is the handling of aksaras in some Indic scripts: in Unicode 15.1 (and in any recent ICU4C) क्या is one EGC, but it is two EGCs (क्, या) in untailored Unicode 15.0 (and in ICU4X).
Yep, I figured it had to do with extended grapheme clusters. Is there / should there be an option to switch between extended and standard/legacy grapheme clusters?
No legacy grapheme clusters were harmed in the making of this issue, this is EGCs on both sides. EGCs changed in 15.1 (and ICU4C was using the now-15.1 definition for years already). This is just the UAX29 part of #3255.
(For what it’s worth, क्या would be three legacy grapheme clusters, namely क्, य, and ा, see Table 1a of UAX29, whereas it is two 15.0 extended grapheme clusters and a single 15.1 extended grapheme cluster.)
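To make the three counts concrete, here is a minimal toy sketch of the relevant rules. It is not ICU4X or ICU4C code: the `Class`, `Rules`, `classify`, and `segment` names are hypothetical, and the character table is hand-written to cover only a few Devanagari code points (consonants U+0915..U+0939, the virama U+094D, and the vowel sign U+093E). It models only GB9 (no break before Extend), GB9a (no break before SpacingMark, extended clusters only), and the 15.1 rule GB9c (no break between a consonant-plus-linker sequence and the next consonant); everything else is treated as a break.

```rust
/// Toy character classes for the few Devanagari code points used here.
/// A real implementation reads these from Unicode property data.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Class {
    Consonant,   // InCB=Consonant, e.g. KA (U+0915), YA (U+092F)
    Linker,      // InCB=Linker and GCB=Extend: the virama (U+094D)
    SpacingMark, // GCB=SpacingMark: vowel sign AA (U+093E)
    Other,
}

/// Which grapheme cluster definition to apply (hypothetical names).
#[derive(Clone, Copy, PartialEq, Debug)]
enum Rules {
    Legacy,
    Extended150,
    Extended151,
}

fn classify(c: char) -> Class {
    match c {
        '\u{0915}'..='\u{0939}' => Class::Consonant,
        '\u{094D}' => Class::Linker,
        '\u{093E}' => Class::SpacingMark,
        _ => Class::Other,
    }
}

fn segment(text: &str, rules: Rules) -> Vec<String> {
    let mut clusters: Vec<String> = Vec::new();
    let mut in_conjunct = false; // a consonant, then only linkers so far
    let mut linked = false; // ...and at least one of them was a linker
    for c in text.chars() {
        let class = classify(c);
        // Decide whether `c` attaches to the previous cluster.
        let join = !clusters.is_empty()
            && match class {
                // GB9: no break before Extend (all variants).
                Class::Linker => true,
                // GB9a: no break before SpacingMark (extended only).
                Class::SpacingMark => rules != Rules::Legacy,
                // GB9c: no break after consonant + linker (15.1 only).
                Class::Consonant => rules == Rules::Extended151 && linked,
                Class::Other => false,
            };
        // Update the conjunct state after the decision for `c`.
        match class {
            Class::Consonant => {
                in_conjunct = true;
                linked = false;
            }
            Class::Linker => linked = in_conjunct,
            _ => {
                in_conjunct = false;
                linked = false;
            }
        }
        if join {
            clusters.last_mut().unwrap().push(c);
        } else {
            clusters.push(c.to_string());
        }
    }
    clusters
}

fn main() {
    // KA, VIRAMA, YA, VOWEL SIGN AA: the example word from this thread.
    let text = "\u{0915}\u{094D}\u{092F}\u{093E}";
    for rules in [Rules::Legacy, Rules::Extended150, Rules::Extended151] {
        let clusters = segment(text, rules);
        println!("{:?}: {} cluster(s) {:?}", rules, clusters.len(), clusters);
    }
}
```

Under these assumptions the sketch yields three, two, and one cluster for the legacy, 15.0, and 15.1 rules respectively, matching the counts above. It deliberately ignores Control/CR/LF, Hangul, emoji, and every other script, so it should not be used to predict real segmenter output.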
Fixes #4365. Co-authored-by: Robin Leroy <egg.robin.leroy@gmail.com>
If I use icu4c I get a different result
https://icu4c-demos.unicode.org/icu-bin/icusegments#0/en
icu4c: 151
rust: 161