`icu_segmenter::LineSegmenter` incorrectly applies rule LB8a #4146

tingerrr · 2023-10-12T15:45:21Z

The following code using icu_segmenter = "1.3.2" returns 3 breaks:

LineBreak::Unknown at 0
LineBreak::ZWJ at 10
LineBreak::Ideographic at 14

use icu_segmenter::LineSegmenter;

fn main() {
    assert_eq!(
        vec![0, 10, 14],
        LineSegmenter::new_auto()
            .segment_str("🏳️‍🌈")
            .collect::<Vec<_>>()
    );
}

According to the documentation, the segmenter should only return LB3 and LB7, not LB8a (LineBreak::ZWJ).
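To see why the break reported at offset 10 is the problematic one, it helps to map the byte offsets back to code points. The following is a minimal sketch using only the standard library; the helper name `code_point_offsets` is made up for illustration and is not part of `icu_segmenter`.

```rust
/// Returns `(byte offset, code point)` for every scalar value in `s`.
/// Hypothetical helper for illustration only.
fn code_point_offsets(s: &str) -> Vec<(usize, u32)> {
    s.char_indices().map(|(i, c)| (i, c as u32)).collect()
}

fn main() {
    // The rainbow flag sequence from the report, spelled out as escapes:
    // WHITE FLAG, VARIATION SELECTOR-16, ZERO WIDTH JOINER, RAINBOW.
    let flag = "\u{1F3F3}\u{FE0F}\u{200D}\u{1F308}";
    for (offset, cp) in code_point_offsets(flag) {
        println!("byte {:>2}: U+{:04X}", offset, cp);
    }
    // The string is 14 bytes long, and U+200D occupies bytes 7..10,
    // so offset 10 sits immediately after the ZERO WIDTH JOINER.
    assert_eq!(flag.len(), 14);
}
```

Since LB8a forbids a break after a zero width joiner, an opportunity at offset 10 should not be reported for this string.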


sffc · 2023-10-12T16:07:27Z

@makotokato @aethanyc @eggrobin

eggrobin · 2023-10-12T16:54:06Z

I see what’s going on: the way the state machine works effectively means that we apply LB9 before LB8a.
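The effect of that ordering can be shown with a toy model. This is only a sketch, not the icu4x state machine: it knows three line break classes and three rules (LB8a, LB9, and "break everywhere else"), it ignores the LB9 exceptions for bases such as SP or BK, and the class assignments for the flag sequence are my reading of the property data. The names `Class` and `break_positions` are made up for illustration.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Class {
    Id,  // ideographic, e.g. the two emoji in the flag sequence
    Cm,  // combining mark, e.g. U+FE0F
    Zwj, // zero width joiner, U+200D
}

/// Returns every index `i` such that a break is allowed before `classes[i]`.
/// `lb8a_before_lb9` selects the rule order being modelled.
fn break_positions(classes: &[Class], lb8a_before_lb9: bool) -> Vec<usize> {
    let mut breaks = Vec::new();
    for i in 1..classes.len() {
        let prev = classes[i - 1];
        let next = classes[i];
        // LB8a (ZWJ ×), checked against the real preceding code point.
        if lb8a_before_lb9 && prev == Class::Zwj {
            continue;
        }
        // LB9: never break before CM or ZWJ ...
        if matches!(next, Class::Cm | Class::Zwj) {
            continue;
        }
        // ... and treat `X (CM | ZWJ)*` as `X` on the left-hand side.
        let left = classes[..i]
            .iter()
            .rev()
            .copied()
            .find(|c| !matches!(c, Class::Cm | Class::Zwj))
            .unwrap_or(prev);
        // If LB8a only runs now, the ZWJ has already been absorbed into
        // its base, so the rule can no longer see it.
        if !lb8a_before_lb9 && left == Class::Zwj {
            continue;
        }
        // Everything else: break (ID ÷ ID in this toy model).
        breaks.push(i);
    }
    breaks
}

fn main() {
    use Class::*;
    let flag = [Id, Cm, Zwj, Id];
    // Order from the specification: no break inside the sequence.
    assert_eq!(break_positions(&flag, true), Vec::<usize>::new());
    // LB9 first: a break appears before the last code point.
    assert_eq!(break_positions(&flag, false), vec![3]);
}
```

Index 3 in this model corresponds to byte offset 10 in the original report.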

The current implementation was attempting the LB25 tailoring recommended in Example 7 of [Section 8.2](https://www.unicode.org/reports/tr14/tr14-49.html#Examples) in UAX14 version 15.0; however, this requires more than one code point of lookahead\* because of `(PR | PO) × ( OP | HY )? NU`, which the current implementation of the line segmenter cannot do. Instead, this pull request goes back to the untailored LB25 from Unicode 15.0.

The implementation was tested with two million test cases; I last encountered a failure somewhere in the nine thousands. I should probably do an overnight run. Only 200 test cases are included here; as usual, anyone working on the rules should try very long monkey test runs.

This fixes #4146.

\* This will be needed for 15.1 line segmentation too. While we have that capability in the other segmenters, used in the sentence segmenter (the relevant rules are called intermediate match rules or interm(ediate) break states in this implementation), straightforwardly reusing that code would run into issues, as we have so many states in line breaking that we cannot dedicate a whole bit to that property of the state. This can probably be worked around (as far as I can tell we use the sign bit for a property of two special states, so we could probably be a bit more sparing), but will come later.
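The lookahead requirement can be made concrete with a small sketch. It covers only the single sub-rule `(PR | PO) × ( OP | HY )? NU` over a handful of classes; the enum and the function name `prefix_rule_forbids_break` are made up for illustration and do not mirror how icu4x encodes its rules.

```rust
#[derive(Clone, Copy, PartialEq)]
enum Class {
    Pr, // prefix numeric, e.g. '$'
    Po, // postfix numeric, e.g. '%'
    Op, // open punctuation, e.g. '('
    Hy, // hyphen-minus
    Nu, // digit
    Al, // ordinary letter
}

/// Returns true if `(PR | PO) × ( OP | HY )? NU` forbids a break between
/// `classes[i]` and `classes[i + 1]`.
fn prefix_rule_forbids_break(classes: &[Class], i: usize) -> bool {
    if !matches!(classes.get(i), Some(Class::Pr | Class::Po)) {
        return false;
    }
    match (classes.get(i + 1), classes.get(i + 2)) {
        // PR/PO directly followed by a digit: one code point of lookahead.
        (Some(Class::Nu), _) => true,
        // PR/PO, then OP or HY, then a digit: the decision about the break
        // after position `i` depends on the class at `i + 2`.
        (Some(Class::Op | Class::Hy), Some(Class::Nu)) => true,
        _ => false,
    }
}

fn main() {
    use Class::*;
    // "$(1": the break after '$' is forbidden because a digit follows '('.
    assert!(prefix_rule_forbids_break(&[Pr, Op, Nu], 0));
    // "$(a": same first two classes, but this rule does not apply.
    assert!(!prefix_rule_forbids_break(&[Pr, Op, Al], 0));
    // "%-1": the optional element may also be HY.
    assert!(prefix_rule_forbids_break(&[Po, Hy, Nu], 0));
}
```

The first two cases share their first two classes and differ only at the third, which is why a segmenter that looks at a single following code point cannot implement this rule directly.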

Emoji clusters, such as '🏳️‍🌈', do not render properly in Servo. This is because xi-unicode is inserting a line break opportunity between components of the cluster (see xi-editor/xi-editor#1322). This change adds a workaround for this issue. `xi-unicode` is fast, but supports an older version of the Unicode standard than libraries like `icu4x`. In addition, `icu4x` does not support non-contiguous segmentation, which Servo currently depends on. Finally, the currently released version of `icu4x` has the same issue (unicode-org/icu4x#4146).
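One way a caller could work around the problem is to post-filter the reported opportunities. The sketch below is a guess at the general shape of such a workaround, not the code from the Servo or typst changes: it drops any break opportunity that immediately follows U+200D, except at the end of the text. The function name `drop_breaks_after_zwj` is hypothetical.

```rust
/// Removes break opportunities that directly follow a ZERO WIDTH JOINER,
/// as LB8a requires. The end-of-text break is always kept.
/// Hypothetical post-filter for illustration only.
fn drop_breaks_after_zwj(text: &str, breaks: &[usize]) -> Vec<usize> {
    breaks
        .iter()
        .copied()
        .filter(|&b| {
            // `get` returns None for offsets that are out of range or not
            // on a char boundary; such offsets are left untouched here.
            let after_zwj = text
                .get(..b)
                .map_or(false, |prefix| prefix.ends_with('\u{200D}'));
            !(b < text.len() && after_zwj)
        })
        .collect()
}

fn main() {
    let flag = "\u{1F3F3}\u{FE0F}\u{200D}\u{1F308}";
    // The offsets reported in this issue for icu_segmenter 1.3.2.
    let reported = [0, 10, 14];
    assert_eq!(drop_breaks_after_zwj(flag, &reported), vec![0, 14]);
}
```

A filter like this only addresses the LB8a case described in this issue; it does not correct other rule interactions.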

This was referenced Oct 12, 2023

Line breaks are allowed in composite emojis typst/typst#80

Closed

Filter out prohibited line breaks in par typst/typst#2376

Merged

sffc added the C-segmentation Component: Segmentation label Oct 12, 2023

sffc added this to the 1.4 Blocking ⟨P1⟩ milestone Oct 19, 2023

sffc assigned eggrobin Oct 19, 2023

eggrobin added the T-bug Type: Bad behavior, security, privacy label Oct 19, 2023

sffc modified the milestones: 1.4 Blocking ⟨P1⟩, 1.5 Blocking ⟨P1⟩ Nov 14, 2023

eggrobin mentioned this issue Nov 30, 2023

Fix Unicode 15.0 line breaking #4389

Merged

eggrobin changed the title ~~icu_segmenter::LineSegmenter returns breaks other than LB3 and LB7~~ icu_segmenter::LineSegmenter incorrectly applies rule LB8a Nov 30, 2023

eggrobin added a commit to eggrobin/icu4x that referenced this issue Dec 1, 2023

doc test for unicode-org#4146

8a1cb70

eggrobin closed this as completed in #4389 Dec 1, 2023

mrobinson mentioned this issue May 23, 2024

fonts: Fix rendering of emoji clusters servo/servo#32350

Closed

