Segmenter does not implement the new Unicode 15.1 extended grapheme cluster definition #4365

Closed
nikitadol opened this issue Nov 27, 2023 · 5 comments · Fixed by #4536

@nikitadol

If I use ICU4C, I get a different result:
https://icu4c-demos.unicode.org/icu-bin/icusegments#0/en

fn main() {
    let text = "as `নমস্কাৰ, আপোনাৰ কি খবৰ?`
hi `हैलो, क्या हाल हैं?`
mai `नमस्ते अहाँ केना छथि?`
mr `नमस्कार, कसे आहात?`
ne `नमस्ते, कस्तो हुनुहुन्छ?`
or `ନମସ୍କାର ତୁମେ କେମିତି ଅଛ?`
sa `हे त्वं किदं असि?`
te `హాయ్, ఎలా ఉన్నారు?`";
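    let segmenter = icu_segmenter::GraphemeClusterSegmenter::new();
    use itertools::Itertools;
    // segment_str yields the byte offsets of the boundaries; pair up
    // consecutive offsets to slice out each grapheme cluster.
    let grapheme_clusters: Vec<&str> = segmenter
        .segment_str(text)
        .tuple_windows()
        .map(|(i, j)| &text[i..j])
        .collect();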

    println!("{:?}", grapheme_clusters);
    println!("{:?}", grapheme_clusters.len());
}

icu4c: 151
rust: 161

@sffc
Member

sffc commented Nov 27, 2023

This is grapheme clusters. Are the options set correctly and consistently? @aethanyc @makotokato @eggrobin

Note: for line break and word break, ICU4C and ICU4X use different models for complex script segmentation. In many cases ICU4X is more correct than ICU4C. If you need identical behavior between C and X, use the dictionary constructors like try_new_dictionary in ICU4X.

@eggrobin
Member

> This is grapheme clusters. Are the options set correctly and consistently?

Options are irrelevant here; this is expected. ICU4X and ICU4C simply use different definitions of extended grapheme clusters (EGCs): ICU4C has for years had a tailoring that has just been incorporated into Unicode 15.1, whereas ICU4X implements the 15.0 version without that tailoring.

The difference is the handling of aksaras in some Indic scripts: in Unicode 15.1 (and in any recent ICU4C) क्या is one EGC, but it is two EGCs (क्, या) in untailored Unicode 15.0 (and in ICU4X).

@sffc
Member

sffc commented Nov 27, 2023

Yep I figured it had to do with extended grapheme clusters. Is there / should there be an option to switch between extended and standard/legacy grapheme clusters?

@eggrobin
Member

> Yep I figured it had to do with extended grapheme clusters. Is there / should there be an option to switch between extended and standard/legacy grapheme clusters?

No legacy grapheme clusters were harmed in the making of this issue; this is EGCs on both sides. The EGC definition changed in 15.1 (and ICU4C had already been using the now-15.1 definition for years). This is just the UAX29 part of #3255.

@eggrobin
Member

eggrobin commented Nov 27, 2023

(For what it’s worth, क्या would be three legacy grapheme clusters, namely क्, य, and ा, see Table 1a of UAX29, whereas it is two 15.0 extended grapheme clusters and a single 15.1 extended grapheme cluster.)
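To make the three counts concrete, here is a minimal, standard-library-only sketch that segments क्या under each definition. It is not how ICU4X or ICU4C implement segmentation: the `Class` enum, the `classify` table, and the `clusters` function are hypothetical names invented for this illustration, the classification is hand-written for a few Devanagari code points only, and just the rules relevant here are modelled (no break before Extend, no break before SpacingMark in extended clusters, and the 15.1 conjunct rule that keeps consonant + virama + consonant together).

```rust
/// Simplified character classes for this sketch. Real implementations read
/// Grapheme_Cluster_Break and Indic_Conjunct_Break from the Unicode data.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Class {
    Consonant,   // stands in for InCB=Consonant
    Linker,      // virama: GCB=Extend and InCB=Linker
    SpacingMark, // spacing vowel signs such as U+093E
    Other,
}

#[derive(Clone, Copy, PartialEq, Debug)]
enum Rules {
    Legacy,
    Extended150,
    Extended151,
}

// Hypothetical hand-written table covering only a few Devanagari code points.
fn classify(c: char) -> Class {
    match c {
        '\u{0915}'..='\u{0939}' => Class::Consonant,
        '\u{094D}' => Class::Linker,
        '\u{093E}'..='\u{0940}' => Class::SpacingMark,
        _ => Class::Other,
    }
}

fn clusters(text: &str, rules: Rules) -> Vec<&str> {
    let mut out = Vec::new();
    let mut start = 0;
    let mut has_base = false; // a consonant followed only by linkers so far
    let mut linked = false; // ...with at least one linker among them
    for (i, c) in text.char_indices() {
        let class = classify(c);
        // Does this character attach to the cluster that precedes it?
        let joins = i > 0
            && match class {
                // No break before Extend (all three definitions).
                Class::Linker => true,
                // No break before SpacingMark (extended clusters only).
                Class::SpacingMark => rules != Rules::Legacy,
                // 15.1 only: no break between a linked consonant and the next.
                Class::Consonant => rules == Rules::Extended151 && linked,
                Class::Other => false,
            };
        if i > 0 && !joins {
            out.push(&text[start..i]);
            start = i;
        }
        // Update the conjunct state for the next character.
        match class {
            Class::Consonant => {
                has_base = true;
                linked = false;
            }
            Class::Linker => linked = has_base,
            _ => {
                has_base = false;
                linked = false;
            }
        }
    }
    if start < text.len() {
        out.push(&text[start..]);
    }
    out
}

fn main() {
    let text = "\u{0915}\u{094D}\u{092F}\u{093E}"; // क्या
    for rules in [Rules::Legacy, Rules::Extended150, Rules::Extended151] {
        let parts = clusters(text, rules);
        println!("{:?}: {} cluster(s) {:?}", rules, parts.len(), parts);
    }
}
```

Under these assumptions the sketch yields three, two, and one cluster respectively, matching the counts above. Anything outside the tiny table is treated as `Other`, so it should not be used on general text.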

@aethanyc aethanyc added the C-segmentation Component: Segmentation label Dec 7, 2023
@sffc sffc added this to the 1.5 Blocking ⟨P1⟩ milestone Dec 7, 2023
@hsivonen hsivonen changed the title from "Segmenter does not work correctly in some languages" to "Segmenter does not implement the new Unicode 15.1 extended grapheme cluster definition" Jan 11, 2024
@sffc sffc closed this as completed in #4536 Feb 8, 2024
sffc pushed a commit that referenced this issue Feb 8, 2024
Fixes #4365

---------

Co-authored-by: Robin Leroy <egg.robin.leroy@gmail.com>