`icu_segmenter::LineSegmenter` incorrectly applies rule LB8a #4146
This was referenced Oct 12, 2023
I see what’s going on: the way the state machine works effectively means that we apply LB9 before LB8a.
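To make the ordering problem concrete, here is a toy sketch, not the actual `icu_segmenter` state machine. It assumes only three line break classes (`Id`, `Cm`, `Zwj`) and treats every position not covered by LB8a or LB9 as a break opportunity (a stand-in for LB31). The enum and both function names are hypothetical, chosen for illustration.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Class {
    Id,  // ideographic (e.g. an emoji)
    Cm,  // combining mark (e.g. a variation selector)
    Zwj, // zero width joiner
}

/// Break positions when LB8a is checked before LB9, as UAX #14 orders them.
/// A position `i` means "between classes[i - 1] and classes[i]".
fn breaks_lb8a_first(classes: &[Class]) -> Vec<usize> {
    let mut out = Vec::new();
    for i in 1..classes.len() {
        let (prev, next) = (classes[i - 1], classes[i]);
        // LB8a: ZWJ × (never break after a ZWJ).
        if prev == Class::Zwj {
            continue;
        }
        // LB9: X (CM | ZWJ)* is treated as X, so never break before CM or ZWJ.
        if next == Class::Cm || next == Class::Zwj {
            continue;
        }
        // Toy fallback: break everywhere else.
        out.push(i);
    }
    out
}

/// Break positions when LB9 absorption happens first: a ZWJ that follows a
/// base has already been folded into that base by the time LB8a is consulted.
fn breaks_lb9_first(classes: &[Class]) -> Vec<usize> {
    let mut out = Vec::new();
    if classes.is_empty() {
        return out;
    }
    // Class of the preceding text after LB9 absorption.
    let mut effective = classes[0];
    for i in 1..classes.len() {
        let next = classes[i];
        if next == Class::Cm || next == Class::Zwj {
            // Absorbed into the preceding base; `effective` is unchanged.
            continue;
        }
        // LB8a only fires if the ZWJ was not absorbed (e.g. at start of text).
        if effective != Class::Zwj {
            out.push(i);
        }
        effective = next;
    }
    out
}
```

For the sequence `[Id, Cm, Zwj, Id]` (the shape of an emoji ZWJ sequence with a variation selector), `breaks_lb8a_first` returns no breaks, while `breaks_lb9_first` returns a break at position 3, directly after the ZWJ.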
eggrobin changed the title from "`icu_segmenter::LineSegmenter` returns breaks other than LB3 and LB7" to "`icu_segmenter::LineSegmenter` incorrectly applies rule LB8a" on Nov 30, 2023
eggrobin added a commit to eggrobin/icu4x that referenced this issue Dec 1, 2023
eggrobin added a commit that referenced this issue Dec 1, 2023
The current implementation was attempting the LB25 tailoring recommended in Example 7 of [Section 8.2](https://www.unicode.org/reports/tr14/tr14-49.html#Examples) in UAX14 version 15.0; however, this requires more than one code point of lookahead\* because of `(PR | PO) × ( OP | HY )? NU`, which the current implementation of the line segmenter cannot do. Instead, this pull request goes back to the untailored LB25 from Unicode 15.0.

The implementation was tested with two million test cases; I last encountered a failure somewhere in the nine thousands. I should probably do an overnight run. Only 200 test cases are included here; as usual, anyone working on the rules should try very long monkey test runs.

This fixes #4146.

\* This will be needed for 15.1 line segmentation too. We have that capability in the other segmenters, and it is used in the sentence segmenter (the relevant rules are called intermediate match rules or interm(ediate) break states in this implementation). However, straightforwardly reusing that code would run into issues, because line breaking has so many states that we cannot dedicate a whole bit to that property of the state. This can probably be worked around (as far as I can tell we use the sign bit for a property of two special states, so we could probably be a bit more sparing), but that will come later.
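The lookahead requirement can be seen in a minimal sketch of the single tailored rule `(PR | PO) × ( OP | HY )? NU`. This is an illustration only, not code from `icu_segmenter`: the `Class` enum and the function name are hypothetical, all other UAX14 rules (including combining mark handling under LB9) are ignored, and a `false` result only means that this one rule does not apply.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Class {
    Pr, // prefix numeric
    Po, // postfix numeric
    Op, // open punctuation
    Hy, // hyphen
    Nu, // numeric
    Al, // alphabetic (stands in for "anything else")
}

/// Returns true if `(PR | PO) × (OP | HY)? NU` forbids a break between
/// `classes[i]` and `classes[i + 1]`.
fn prefix_rule_forbids_break(classes: &[Class], i: usize) -> bool {
    // The rule only applies when the left side of the boundary is PR or PO.
    if !matches!(classes.get(i), Some(Class::Pr | Class::Po)) {
        return false;
    }
    match (classes.get(i + 1), classes.get(i + 2)) {
        // (PR | PO) × NU: decided by the two classes adjacent to the boundary.
        (Some(Class::Nu), _) => true,
        // (PR | PO) × (OP | HY) NU: the decision at this boundary depends on
        // the class one position beyond the boundary's right-hand neighbour.
        (Some(Class::Op | Class::Hy), Some(Class::Nu)) => true,
        _ => false,
    }
}
```

With this sketch, `[Pr, Op, Nu]` and `[Pr, Op, Al]` agree on the two classes adjacent to the first boundary, yet the rule applies to the first and not the second. A state machine that sees only one class past the boundary cannot tell them apart.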
mrobinson added a commit to mrobinson/servo that referenced this issue May 23, 2024
Emoji clusters, such as '🏳️🌈', do not render properly in Servo. This is because `xi-unicode` is inserting a linebreak opportunity between components of the cluster (see xi-editor/xi-editor#1322). This change adds a workaround for this issue. `xi-unicode` is fast, but supports an older version of the Unicode standard than libraries like `icu4x`. In addition, `icu4x` does not support non-contiguous segmentation, which Servo currently depends on. Finally, the currently released version of `icu4x` has the same issue (unicode-org/icu4x#4146).
The following code using `icu_segmenter = "1.3.2"` returns 3 breaks:

- `LineBreak::Unknown` at 0
- `LineBreak::ZWJ` at 10
- `LineBreak::Ideographic` at 14

According to the documentation, the segmenter should only return LB3 and LB7, not LB8a (`LineBreak::ZWJ`).