Linebreak generated before CL #4523
Comments
Related downstream issue: typst/typst#3082
(Copied from typst/typst#3082 (comment))
Well, it seems that breakpoints are counted in bytes. The following version might be better.

    use icu_segmenter::LineSegmenter;

    fn main() {
        let examples = vec![
            "念姐遠米巴急(abcd),松黃貫誰。",
            "念姐遠米巴急(abc0),松黃貫誰。",
            "念姐遠米巴急(0000),松黃貫誰。",
            "念姐遠米巴急(8888),松黃貫誰。",
        ];
        let segmenter = LineSegmenter::new_auto();
        examples.iter().for_each(|line| {
            let breakpoints: Vec<usize> = segmenter.segment_str(line).collect();
            println!("{}\n{:?}", line, breakpoints);
            for i in 1..breakpoints.len() {
                print!(
                    "|{}",
                    line.get(breakpoints[i - 1]..breakpoints[i])
                        .expect("Breakpoints should be at characters' boundaries")
                );
            }
            println!("|");
        });
    }
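To make the byte-versus-code-point distinction concrete, here is a minimal sketch that uses only the standard library. The helper name `byte_offsets_to_char_indices` is hypothetical and not part of ICU4X; it assumes the breakpoints are UTF-8 byte offsets, as observed above.

```rust
/// Hypothetical helper (not an ICU4X API): map UTF-8 byte offsets to
/// code point indices. An offset that does not fall on a character
/// boundary (or lies past the end of the string) maps to `None`.
fn byte_offsets_to_char_indices(text: &str, byte_offsets: &[usize]) -> Vec<Option<usize>> {
    byte_offsets
        .iter()
        .map(|&offset| {
            if text.is_char_boundary(offset) {
                // Count the code points that precede this byte offset.
                Some(text[..offset].chars().count())
            } else {
                None
            }
        })
        .collect()
}

fn main() {
    // Short sample containing the full-width comma (U+FF0C).
    let line = "急,松";
    // Assumed byte offsets, in the style of what `segment_str` yields.
    let offsets = [0, 3, 6, 9];
    let indices = byte_offsets_to_char_indices(line, &offsets);
    println!("bytes: {:?}\nchars: {:?}", offsets, indices);
}
```

Slicing the string with byte offsets, as the version above does with `line.get(..)`, avoids the conversion entirely; the conversion is only needed when an index has to be reported in code points.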
As noted in #4523 (comment), "念姐遠米巴急(abcd),松黃貫誰。" is segmented just fine; the snippet in the OP simply confuses code point indices with UTF-8 code unit indices. Indeed, it is broken correctly in the screenshot in the downstream issue. But @YDX-2147483647 does show examples of bad segmentation, such as
That release is dated Nov 16, 2023. #4389 was merged on Dec 1, 2023, so we know that line breaking is broken in 1.4.0. At main the example from #4523 (comment) prints
so this has been fixed by #4389. (I suspect what we are seeing, namely a break between
thank you! so i think this issue can be closed once a new version is released? (or it can be closed because it is already fixed in master)
I'll close this as fixed in 1.5. Thank you @eggrobin! If you need the functionality sooner, you can use ICU4X from Git in your Cargo.toml.
This code with icu=1.4.0 produces output in which a breakpoint is produced before ,.
, is the full-width comma, U+FF0C. It belongs to CL: Close Punctuation. Per LB13 (× CL), we shouldn't produce that breakpoint.

Update: It seems that this bug happens on some strings, but not all of them. 念姐遠米巴急(abcd),松黃貫誰。 is a randomly generated one.
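A quick way to test for this kind of LB13 violation is to look at the character that follows each reported breakpoint. The sketch below uses only the standard library; the function name `breaks_before_char` and the breakpoint list in `main` are hypothetical and are not output captured from ICU4X. It assumes the breakpoints are UTF-8 byte offsets.

```rust
/// Hypothetical helper (not an ICU4X API): return every breakpoint that is
/// immediately followed by `target`. Offsets that are out of range or not
/// on a character boundary are ignored.
fn breaks_before_char(text: &str, breakpoints: &[usize], target: char) -> Vec<usize> {
    breakpoints
        .iter()
        .copied()
        .filter(|&bp| {
            // `get` returns None for invalid offsets instead of panicking.
            text.get(bp..).and_then(|rest| rest.chars().next()) == Some(target)
        })
        .collect()
}

fn main() {
    let line = "念姐遠米巴急(abcd),松黃貫誰。";
    // Assumed breakpoints for illustration only: the end of the string plus
    // the byte offset of the first full-width comma, if there is one.
    let mut breakpoints = vec![0, line.len()];
    if let Some(pos) = line.find('\u{FF0C}') {
        breakpoints.insert(1, pos);
    }
    let offending = breaks_before_char(line, &breakpoints, '\u{FF0C}');
    if offending.is_empty() {
        println!("no break before U+FF0C");
    } else {
        println!("breaks before U+FF0C at byte offsets {:?}", offending);
    }
}
```

Running the same check over the real `segment_str` output for several strings would show which inputs trigger the unwanted break and which do not.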