Segmenter does not implement the new Unicode 15.1 extended grapheme cluster definition #4365

nikitadol · 2023-11-27T17:26:44Z

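Segmenting the text below into grapheme clusters gives a different result in ICU4X than in ICU4C (checked with the ICU4C segmentation demo):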
https://icu4c-demos.unicode.org/icu-bin/icusegments#0/en

fn main() {
    let text = "as `নমস্কাৰ, আপোনাৰ কি খবৰ?`
hi `हैलो, क्या हाल हैं?`
mai `नमस्ते अहाँ केना छथि?`
mr `नमस्कार, कसे आहात?`
ne `नमस्ते, कस्तो हुनुहुन्छ?`
or `ନମସ୍କାର ତୁମେ କେମିତି ଅଛ?`
sa `हे त्वं किदं असि?`
te `హాయ్, ఎలా ఉన్నారు?`";
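    // Requires the `icu_segmenter` and `itertools` crates.
    use itertools::Itertools;
    let segmenter = icu_segmenter::GraphemeClusterSegmenter::new();
    // `segment_str` yields the byte offsets of the cluster boundaries;
    // pairing consecutive offsets gives the slice for each grapheme cluster.
    let grapheme_clusters: Vec<&str> = segmenter
        .segment_str(text)
        .tuple_windows()
        .map(|(i, j)| &text[i..j])
        .collect();

    println!("{:?}", grapheme_clusters);
    // Number of grapheme clusters found.
    println!("{:?}", grapheme_clusters.len());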
}

icu4c: 151 grapheme clusters
rust (icu_segmenter): 161 grapheme clusters

sffc · 2023-11-27T20:44:36Z

This is grapheme clusters. Are the options set correctly and consistently? @aethanyc @makotokato @eggrobin

Note: for line break and word break, ICU4C and ICU4X use different models for complex script segmentation. In many cases ICU4X is more correct than ICU4C. If you need identical behavior between C and X, use the dictionary constructors like try_new_dictionary in ICU4X.

eggrobin · 2023-11-27T21:10:29Z

> This is grapheme clusters. Are the options set correctly and consistently?

Options are irrelevant here; this is expected. ICU4X and ICU4C are simply using different definitions of extended grapheme clusters (EGCs): ICU4C has had a tailoring for years, which has just been incorporated into Unicode 15.1, whereas ICU4X implements the 15.0 version without that tailoring.

The difference is the handling of aksaras in some Indic scripts: in Unicode 15.1 (and in any recent ICU4C) क्या is one EGC, but it is two EGCs (क्, या) in untailored Unicode 15.0 (and in ICU4X).
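To see why the split falls where it does, it can help to look at the code points behind क्या. The helper below is a small sketch using only the standard library; the function name `code_points` is made up for illustration.

```rust
/// Formats each Unicode scalar value of `text` as `U+XXXX`.
fn code_points(text: &str) -> Vec<String> {
    text.chars()
        .map(|c| format!("U+{:04X}", c as u32))
        .collect()
}
```

For the string `"\u{0915}\u{094D}\u{092F}\u{093E}"` (क्या) this returns `["U+0915", "U+094D", "U+092F", "U+093E"]`: the consonant क, the virama, the consonant य, and the vowel sign ा. The 15.0 boundary sits between the virama and the second consonant.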

sffc · 2023-11-27T21:48:28Z

Yep I figured it had to do with extended grapheme clusters. Is there / should there be an option to switch between extended and standard/legacy grapheme clusters?

eggrobin · 2023-11-27T22:32:04Z

> Yep I figured it had to do with extended grapheme clusters. Is there / should there be an option to switch between extended and standard/legacy grapheme clusters?

No legacy grapheme clusters were harmed in the making of this issue; this is EGCs on both sides. EGCs changed in 15.1 (and ICU4C had already been using the now-15.1 definition for years). This is just the UAX29 part of #3255.

eggrobin · 2023-11-27T22:35:22Z

(For what it’s worth, क्या would be three legacy grapheme clusters, namely क्, य, and ा, see Table 1a of UAX29, whereas it is two 15.0 extended grapheme clusters and a single 15.1 extended grapheme cluster.)
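The three outcomes for क्या can be reproduced with a toy model. The sketch below is not ICU4X's implementation and not the full UAX29 algorithm: it only knows a handful of Devanagari code points and three rules (no break before the virama, which is an Extend character; no break before a spacing mark in extended clusters; and, for 15.1 only, no break between a consonant-plus-virama sequence and a following consonant). The names `Class`, `Rules`, `classify`, and `segment` are hypothetical.

```rust
/// Break classes for this toy model only; real UAX29 has many more.
#[derive(Clone, Copy, PartialEq)]
enum Class {
    Consonant,   // e.g. U+0915 क, U+092F य
    Linker,      // U+094D virama; it is also an Extend character
    SpacingMark, // e.g. U+093E ा
    Other,       // everything this sketch does not model
}

#[derive(Clone, Copy, PartialEq)]
enum Rules {
    Legacy,      // legacy grapheme clusters
    Extended150, // extended grapheme clusters, Unicode 15.0
    Extended151, // extended grapheme clusters, Unicode 15.1
}

/// Assumption: a hard-coded, incomplete classification of a few
/// Devanagari code points, enough for the example in this thread.
fn classify(c: char) -> Class {
    match c {
        '\u{0915}'..='\u{0939}' => Class::Consonant,
        '\u{094D}' => Class::Linker,
        '\u{093E}'..='\u{0940}' => Class::SpacingMark,
        _ => Class::Other,
    }
}

fn segment(text: &str, rules: Rules) -> Vec<&str> {
    let mut clusters = Vec::new();
    let mut start = 0;
    // `in_conjunct`: the text just before the current position is a
    // consonant followed only by viramas.
    // `linked`: as above, and at least one virama has been seen.
    let mut in_conjunct = false;
    let mut linked = false;
    for (i, c) in text.char_indices() {
        let class = classify(c);
        // Does this character attach to the cluster being built?
        let joins = i == 0
            || match class {
                // No break before the virama (it is Extend).
                Class::Linker => true,
                // Spacing marks attach only in extended clusters.
                Class::SpacingMark => rules != Rules::Legacy,
                // 15.1 only: consonant + virama joins the next consonant.
                Class::Consonant => rules == Rules::Extended151 && linked,
                Class::Other => false,
            };
        if !joins {
            clusters.push(&text[start..i]);
            start = i;
        }
        match class {
            Class::Consonant => {
                in_conjunct = true;
                linked = false;
            }
            Class::Linker => linked = in_conjunct,
            _ => {
                in_conjunct = false;
                linked = false;
            }
        }
    }
    if start < text.len() {
        clusters.push(&text[start..]);
    }
    clusters
}
```

With the input `"\u{0915}\u{094D}\u{092F}\u{093E}"` (क्या), `segment` returns three slices under `Rules::Legacy`, two under `Rules::Extended150`, and one under `Rules::Extended151`, matching the counts given above. Anything outside the hard-coded ranges is treated as `Other`, so the sketch should not be used on general text.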

aethanyc added the C-segmentation Component: Segmentation label Dec 7, 2023

sffc assigned eggrobin Dec 7, 2023

sffc added this to the 1.5 Blocking ⟨P1⟩ milestone Dec 7, 2023

hsivonen changed the title ~~Segmenter does not work correctly in some languages~~ Segmenter does not implement the new Unicode 15.1 extended grapheme cluster definition Jan 11, 2024

makotokato mentioned this issue Jan 23, 2024

Update grapheme cluster break rules to Unicode 15.1 #4536

Merged

sffc closed this as completed in #4536 Feb 8, 2024

sffc pushed a commit that referenced this issue Feb 8, 2024

Update grapheme cluster break rules to Unicode 15.1 (#4536)

51b3719

Fixes #4365

Co-authored-by: Robin Leroy <egg.robin.leroy@gmail.com>
