Don't break word by MidNumLet with Extend. (#4550)
partial progress on #4417
makotokato authored Mar 6, 2024
1 parent db1772d commit bebc6f9
Showing 7 changed files with 202 additions and 47 deletions.
15 changes: 12 additions & 3 deletions components/segmenter/tests/spec_test.rs
@@ -175,9 +175,8 @@ fn run_line_break_extra_test() {
     line_break_test(include_str!("testdata/LineBreakExtraTest.txt"));
 }

-#[test]
-fn run_word_break_test() {
-    let test_iter = TestContentIterator::new(include_str!("testdata/WordBreakTest.txt"));
+fn word_break_test(file: &'static str) {
+    let test_iter = TestContentIterator::new(file);
     let segmenter = WordSegmenter::new_dictionary();
     for test in test_iter {
         let s: String = test.utf8_vec.into_iter().collect();
@@ -206,6 +205,16 @@ fn run_word_break_test() {
     }
 }

+#[test]
+fn run_word_break_test() {
+    word_break_test(include_str!("testdata/WordBreakTest.txt"));
+}
+
+#[test]
+fn run_word_break_extra_test() {
+    word_break_test(include_str!("testdata/WordBreakExtraTest.txt"));
+}
+
 fn grapheme_break_test(file: &'static str) {
     let test_iter = TestContentIterator::new(file);
     let segmenter = GraphemeClusterSegmenter::new();
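The test helper in this file feeds each line of a break-test file through TestContentIterator. In that file format, a data line lists hexadecimal code points separated by ÷ (break) or × (no break), and everything after # is a comment. The sketch below is a minimal stand-in for illustration, not the actual TestContentIterator: the names BreakTestCase and parse_break_test_line are hypothetical, and it assumes only the line layout visible in WordBreakExtraTest.txt.

```rust
/// One parsed test case (hypothetical type, not part of ICU4X).
#[derive(Debug, PartialEq)]
struct BreakTestCase {
    /// The text built from the hexadecimal code points on the line.
    text: String,
    /// Byte offsets into `text` where a break (÷) is expected.
    breaks: Vec<usize>,
}

/// Parses a single line of a Unicode break-test file.
/// Returns `None` for blank lines, comment-only lines, or malformed input.
fn parse_break_test_line(line: &str) -> Option<BreakTestCase> {
    // Everything after '#' is a human-readable comment.
    let data = line.split('#').next()?.trim();
    if data.is_empty() {
        return None;
    }
    let mut text = String::new();
    let mut breaks = Vec::new();
    for token in data.split_whitespace() {
        match token {
            // A break is expected at the current end of the text.
            "÷" => breaks.push(text.len()),
            // No break is expected here; nothing to record.
            "×" => {}
            // Anything else must be a hexadecimal code point.
            hex => {
                let cp = u32::from_str_radix(hex, 16).ok()?;
                text.push(char::from_u32(cp)?);
            }
        }
    }
    Some(BreakTestCase { text, breaks })
}
```

For the line `÷ 0041 × 002E × 0308 × 0042 ÷`, this yields the text "A.\u{308}B" with expected breaks at byte offsets 0 and 5, meaning the whole string is one word segment.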
30 changes: 30 additions & 0 deletions components/segmenter/tests/testdata/WordBreakExtraTest.txt
@@ -0,0 +1,30 @@
# Additional word breaking tests, not in WordBreakTest.txt
#
# https://github.com/unicode-org/icu4x/issues/4417
÷ 0041 × 002E × 00AD × 0042 ÷ # ÷ ALetter × MidNumLet × Format × ALetter ÷
÷ 0041 × 002E × 0308 × 0042 ÷ # ÷ ALetter × MidNumLet × Extend × ALetter ÷
÷ 0041 × 002E × 200D × 0042 ÷ # ÷ ALetter × MidNumLet × ZWJ × ALetter ÷
÷ 0041 × 002E × 00AD × 05D0 ÷ # ÷ ALetter × MidNumLet × Format × Hebrew_Letter ÷
÷ 0041 × 002E × 0308 × 05D0 ÷ # ÷ ALetter × MidNumLet × Extend × Hebrew_Letter ÷
÷ 0041 × 002E × 200D × 05D0 ÷ # ÷ ALetter × MidNumLet × ZWJ × Hebrew_Letter ÷
÷ 0041 × 0027 × 00AD × 0042 ÷ # ÷ ALetter × Single_Quote × Format × ALetter ÷
÷ 0041 × 0027 × 0308 × 0042 ÷ # ÷ ALetter × Single_Quote × Extend × ALetter ÷
÷ 0041 × 0027 × 200D × 0042 ÷ # ÷ ALetter × Single_Quote × ZWJ × ALetter ÷
÷ 0041 × 0027 × 00AD × 05D0 ÷ # ÷ ALetter × Single_Quote × Format × Hebrew_Letter ÷
÷ 0041 × 0027 × 0308 × 05D0 ÷ # ÷ ALetter × Single_Quote × Extend × Hebrew_Letter ÷
÷ 0041 × 0027 × 200D × 05D0 ÷ # ÷ ALetter × Single_Quote × ZWJ × Hebrew_Letter ÷
÷ 05D0 × 002E × 00AD × 05D0 ÷ # ÷ Hebrew_Letter × MidNumLet × Format × Hebrew_Letter ÷
÷ 05D0 × 002E × 0308 × 05D0 ÷ # ÷ Hebrew_Letter × MidNumLet × Extend × Hebrew_Letter ÷
÷ 05D0 × 002E × 200D × 05D0 ÷ # ÷ Hebrew_Letter × MidNumLet × ZWJ × Hebrew_Letter ÷
÷ 05D0 × 002E × 00AD × 0041 ÷ # ÷ Hebrew_Letter × MidNumLet × Format × ALetter ÷
÷ 05D0 × 002E × 0308 × 0041 ÷ # ÷ Hebrew_Letter × MidNumLet × Extend × ALetter ÷
÷ 05D0 × 002E × 200D × 0041 ÷ # ÷ Hebrew_Letter × MidNumLet × ZWJ × ALetter ÷
÷ 05D0 × 0027 × 00AD × 0041 ÷ # ÷ Hebrew_Letter × Single_Quote × Format × ALetter ÷
÷ 05D0 × 0027 × 0308 × 0041 ÷ # ÷ Hebrew_Letter × Single_Quote × Extend × ALetter ÷
÷ 05D0 × 0027 × 200D × 0041 ÷ # ÷ Hebrew_Letter × Single_Quote × ZWJ × ALetter ÷
÷ 05D0 × 0027 × 00AD × 05D0 ÷ # ÷ Hebrew_Letter × Single_Quote × Format × Hebrew_Letter ÷
÷ 05D0 × 0027 × 0308 × 05D0 ÷ # ÷ Hebrew_Letter × Single_Quote × Extend × Hebrew_Letter ÷
÷ 05D0 × 0027 × 200D × 05D0 ÷ # ÷ Hebrew_Letter × Single_Quote × ZWJ × Hebrew_Letter ÷
÷ 05D0 × 0022 × 00AD × 05D0 ÷ # ÷ Hebrew_Letter × Double_Quote × Format × Hebrew_Letter ÷
÷ 05D0 × 0022 × 0308 × 05D0 ÷ # ÷ Hebrew_Letter × Double_Quote × Extend × Hebrew_Letter ÷
÷ 05D0 × 0022 × 200D × 05D0 ÷ # ÷ Hebrew_Letter × Double_Quote × ZWJ × Hebrew_Letter ÷
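Every case above expects no break inside the sequence, because UAX #29 rule WB4 makes Extend, Format and ZWJ transparent to the rules that follow, so WB6/WB7 (letter, mid-punctuation, letter) and WB7a to WB7c (Hebrew letter with quotes) still apply. The following is a minimal sketch of that logic under stated assumptions, not ICU4X's table-driven implementation: Wb, classify and word_breaks are hypothetical names, the classifier knows only the code points used in this file, only rules WB4 to WB7c are modeled, and the text is assumed not to start with an Extend, Format or ZWJ character.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Wb { ALetter, HebrewLetter, MidNumLet, SingleQuote, DoubleQuote, Ignorable, Other }

/// Toy classifier (assumption): covers only the code points in the tests above.
fn classify(c: char) -> Wb {
    match c {
        'A'..='Z' | 'a'..='z' => Wb::ALetter,
        '\u{05D0}'..='\u{05EA}' => Wb::HebrewLetter,
        '.' => Wb::MidNumLet,
        '\'' => Wb::SingleQuote,
        '"' => Wb::DoubleQuote,
        // Format (U+00AD), Extend (U+0308), ZWJ (U+200D)
        '\u{00AD}' | '\u{0308}' | '\u{200D}' => Wb::Ignorable,
        _ => Wb::Other,
    }
}

fn is_letter(w: Wb) -> bool { matches!(w, Wb::ALetter | Wb::HebrewLetter) }
// MidNumLetQ in UAX #29 terms; MidLetter is omitted from this sketch.
fn is_mid(w: Wb) -> bool { matches!(w, Wb::MidNumLet | Wb::SingleQuote) }

/// Returns the byte offsets of word breaks, including start and end of text.
fn word_breaks(text: &str) -> Vec<usize> {
    // WB4: drop Extend/Format/ZWJ so the later rules see only base characters.
    let bases: Vec<(usize, Wb)> = text
        .char_indices()
        .map(|(i, c)| (i, classify(c)))
        .filter(|&(_, w)| w != Wb::Ignorable)
        .collect();
    let mut breaks = vec![0]; // WB1: break at start of text
    for k in 1..bases.len() {
        let prev = bases[k - 1].1;
        let cur = bases[k].1;
        let before_prev = if k >= 2 { Some(bases[k - 2].1) } else { None };
        let next = bases.get(k + 1).map(|b| b.1);
        let no_break = (is_letter(prev) && is_letter(cur)) // WB5
            || (is_letter(prev) && is_mid(cur) && next.map_or(false, is_letter)) // WB6
            || (before_prev.map_or(false, is_letter) && is_mid(prev) && is_letter(cur)) // WB7
            || (prev == Wb::HebrewLetter && cur == Wb::SingleQuote) // WB7a
            || (prev == Wb::HebrewLetter
                && cur == Wb::DoubleQuote
                && next == Some(Wb::HebrewLetter)) // WB7b
            || (before_prev == Some(Wb::HebrewLetter)
                && prev == Wb::DoubleQuote
                && cur == Wb::HebrewLetter); // WB7c
        if !no_break {
            // Break before the base character; skipped characters stay attached
            // to the preceding base character.
            breaks.push(bases[k].0);
        }
    }
    if !text.is_empty() {
        breaks.push(text.len()); // WB2: break at end of text
    }
    breaks
}
```

Filtering the ignorable characters up front keeps each rule a plain comparison of neighbouring classes; a production segmenter would instead fold this into its state machine and would also handle the start-of-text and newline exceptions to WB4.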

111 changes: 105 additions & 6 deletions provider/datagen/data/segmenter/word.toml
