Update grapheme cluster break rules to Unicode 15.1 #4536

Merged
merged 10 commits into from
Feb 8, 2024
52 changes: 47 additions & 5 deletions components/segmenter/tests/spec_test.rs
@@ -206,15 +206,47 @@ fn run_word_break_test() {
     }
 }

-#[test]
-fn run_grapheme_break_test() {
-    let test_iter = TestContentIterator::new(include_str!("testdata/GraphemeBreakTest.txt"));
+fn grapheme_break_test(file: &'static str) {
+    let test_iter = TestContentIterator::new(file);
     let segmenter = GraphemeClusterSegmenter::new();
-    for test in test_iter {
+    for (i, test) in test_iter.enumerate() {
         let s: String = test.utf8_vec.into_iter().collect();
         let iter = segmenter.segment_str(&s);
         let result: Vec<usize> = iter.collect();
-        assert_eq!(result, test.break_result_utf8, "{}", test.original_line);
+        if result != test.break_result_utf8 {
+            let gcb = icu::properties::maps::grapheme_cluster_break();
+            let gcb_name = icu::properties::GraphemeClusterBreak::enum_to_long_name_mapper();
+            let mut iter = segmenter.segment_str(&s);
+            // TODO(egg): It would be really nice to have Name here.
+            println!(" | A | E | Code pt. | GCB | State | Literal");
+            for (i, c) in s.char_indices() {
+                let expected_break = test.break_result_utf8.contains(&i);
+                let actual_break = result.contains(&i);
+                if actual_break {
+                    iter.next();
+                }
+                println!(
+                    "{}| {} | {} | {:>8} | {:>14} | {} | {}",
+                    if actual_break != expected_break {
+                        "😭"
+                    } else {
+                        " "
+                    },
+                    if actual_break { "÷" } else { "×" },
+                    if expected_break { "÷" } else { "×" },
+                    format!("{:04X}", c as u32),
+                    gcb_name
+                        .get(gcb.get(c))
+                        .unwrap_or(&format!("{:?}", gcb.get(c))),
+                    // Placeholder for logging the state if exposed.
+                    // Not "?????" to hide from clippy.
+                    "?".repeat(5),
+                    c
+                )
+            }
+            println!("Test case #{}", i);
+            panic!()
+        }

         let iter = segmenter.segment_utf16(&test.utf16_vec);
         let result: Vec<usize> = iter.collect();
@@ -237,6 +269,16 @@ fn run_grapheme_break_test() {
     }
 }

+#[test]
+fn run_grapheme_break_test() {
+    grapheme_break_test(include_str!("testdata/GraphemeBreakTest.txt"));
+}
+
+#[test]
+fn run_grapheme_break_extra_test() {
+    grapheme_break_test(include_str!("testdata/GraphemeBreakExtraTest.txt"));
+}
+
 fn sentence_break_test(file: &'static str) {
     let test_iter = TestContentIterator::new(file);
     let segmenter = SentenceSegmenter::new();
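
The failure report added in this diff prints one row per code point and flags every offset where the actual and expected break decisions disagree. The following is a minimal, std-only sketch of that comparison, not the code from the PR: `mismatch_rows` is a hypothetical helper, the GCB and State columns are omitted because they need the `icu` crate, the flag is a plain `!`, and the sample offsets are invented for illustration.

```rust
/// Builds one table row per code point of `s`. `expected` and `actual` hold
/// UTF-8 byte offsets of break opportunities. A row starts with "!" when the
/// two disagree about a break immediately before that code point.
/// (Hypothetical helper; only the comparison logic mirrors the diff.)
fn mismatch_rows(s: &str, expected: &[usize], actual: &[usize]) -> Vec<String> {
    s.char_indices()
        .map(|(i, c)| {
            let expected_break = expected.contains(&i);
            let actual_break = actual.contains(&i);
            let code_point = format!("{:04X}", c as u32);
            format!(
                "{}| {} | {} | {:>8} | {}",
                if actual_break != expected_break { "!" } else { " " },
                if actual_break { "÷" } else { "×" },
                if expected_break { "÷" } else { "×" },
                code_point,
                c.escape_unicode()
            )
        })
        .collect()
}

fn main() {
    // Invented sample: "e", U+0301 COMBINING ACUTE ACCENT, "x".
    let s = "e\u{301}x";
    let expected = [0, 3, 4];
    // Pretend the segmenter wrongly broke before the combining mark.
    let actual = [0, 1, 3, 4];
    println!(" | A | E | Code pt. | Escaped");
    for row in mismatch_rows(s, &expected, &actual) {
        println!("{row}");
    }
}
```

Like the loop in the diff, this only inspects offsets at which a code point starts, so the final boundary at `s.len()` never gets a row of its own.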
115 changes: 115 additions & 0 deletions components/segmenter/tests/testdata/GraphemeBreakExtraTest.txt

Large diffs are not rendered by default.
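
The test data itself is not rendered on this page. Assuming both files use the line format of the Unicode `GraphemeBreakTest.txt` data file (`÷` marks a break opportunity, `×` marks no break, code points are written in hexadecimal, and everything after `#` is a comment), one line can be turned into a string and its expected UTF-8 break offsets roughly as sketched here. `parse_break_test_line` is a hypothetical helper, not the crate's `TestContentIterator`, and the sample line is made up rather than copied from either file.

```rust
/// Parses one test line into the decoded string and the UTF-8 byte offsets
/// at which a break ("÷") is expected. Returns None for blank or
/// comment-only lines and for tokens that are not valid scalar values.
/// (Hypothetical helper; the line format is an assumption.)
fn parse_break_test_line(line: &str) -> Option<(String, Vec<usize>)> {
    // Everything after '#' is a human-readable comment.
    let data = line.split('#').next()?.trim();
    if data.is_empty() {
        return None;
    }
    let mut s = String::new();
    let mut breaks = Vec::new();
    for token in data.split_whitespace() {
        match token {
            // A break is expected at the current byte offset.
            "÷" => breaks.push(s.len()),
            // No break here; nothing to record.
            "×" => {}
            // Anything else must be a hexadecimal code point.
            hex => s.push(char::from_u32(u32::from_str_radix(hex, 16).ok()?)?),
        }
    }
    Some((s, breaks))
}

fn main() {
    let line = "÷ 0065 × 0301 ÷ 0078 ÷\t# made-up sample line";
    let (s, breaks) = parse_break_test_line(line).expect("line should parse");
    println!("{s:?} breaks at {breaks:?}");
}
```

The offsets produced this way include 0 and the string length, which is the shape a comparison against a collected `Vec<usize>` of segmenter boundaries would need.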

727 changes: 656 additions & 71 deletions components/segmenter/tests/testdata/GraphemeBreakTest.txt

Large diffs are not rendered by default.

174 changes: 167 additions & 7 deletions provider/datagen/data/segmenter/grapheme.toml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

4 changes: 2 additions & 2 deletions provider/datagen/data/segmenter/uprops/small/ExtPict.toml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

4 changes: 2 additions & 2 deletions provider/datagen/data/segmenter/uprops/small/GCB.toml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
