Unexpected grapheme boundary with regional indicators (GB12) #4780

Closed
anba opened this issue Apr 6, 2024 · 1 comment · Fixed by #4781
Labels
C-segmentation Component: Segmentation T-bug Type: Bad behavior, security, privacy

Comments


anba commented Apr 6, 2024

use icu::segmenter::GraphemeClusterSegmenter;

fn main() {
    let segmenter = GraphemeClusterSegmenter::new();
    let text = "🇺🇸🏴󠁧󠁢󠁥󠁮󠁧󠁿";

    segmenter
        .segment_str(text)
        .for_each(|i| println!("{}", i));
}

The program reports the following break points (byte offsets):

0
4
8
36

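That is, "🇺🇸" is split into two grapheme clusters at byte offset 4. GB12 disallows a break between the two regional indicators of a pair, so the expected break points are 0, 8, and 36.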
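
For reference, here is a minimal sketch of the regional-indicator pairing behind GB12/GB13, written against the standard library only. It is not ICU4X's implementation: `is_regional_indicator` and `ri_pair_breaks` are hypothetical helpers made up for illustration, and every other UAX #29 rule is ignored (for example GB9, which keeps the tag characters of the second flag attached), so any boundary that is not inside a regional-indicator pair is simply treated as a break.

```rust
/// Returns true if `c` is a Regional Indicator symbol (U+1F1E6..=U+1F1FF).
fn is_regional_indicator(c: char) -> bool {
    ('\u{1F1E6}'..='\u{1F1FF}').contains(&c)
}

/// Hypothetical helper: byte offsets of break points, taking only the
/// GB12/GB13 pairing rule into account. All other boundaries are treated
/// as breaks, which is NOT what a full UAX #29 segmenter does.
fn ri_pair_breaks(text: &str) -> Vec<usize> {
    let mut breaks = vec![0]; // break at start of text
    let mut ri_run = 0usize; // regional indicators seen in the current run
    let mut prev: Option<char> = None;

    for (offset, c) in text.char_indices() {
        if let Some(p) = prev {
            // No break between two regional indicators when an odd number
            // of them precedes the boundary, i.e. inside a pair.
            let inside_pair =
                is_regional_indicator(p) && is_regional_indicator(c) && ri_run % 2 == 1;
            if !inside_pair {
                breaks.push(offset);
            }
        }
        ri_run = if is_regional_indicator(c) { ri_run + 1 } else { 0 };
        prev = Some(c);
    }

    if !text.is_empty() {
        breaks.push(text.len()); // break at end of text
    }
    breaks
}
```

Under these assumptions, `ri_pair_breaks("\u{1F1FA}\u{1F1F8}")` returns `[0, 8]`: the two regional indicators stay in one cluster, with no break at offset 4. Three regional indicators in a row give `[0, 8, 12]`, because the third one has no partner.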

CC: @aethanyc and @makotokato


Original bug report: https://bugzilla.mozilla.org/show_bug.cgi?id=1890007

@aethanyc aethanyc added T-bug Type: Bad behavior, security, privacy C-segmentation Component: Segmentation labels Apr 6, 2024
Contributor

aethanyc commented Apr 6, 2024

This is fixed by #4536. Let me add the example as a test.

aethanyc added a commit to aethanyc/icu4x that referenced this issue Apr 6, 2024