Unexpected word boundary with WB4 and WB6/WB7 rules #4417

Closed
anba opened this issue Dec 7, 2023 · 3 comments

anba commented Dec 7, 2023

Word segmentation over the string "a.\xADb" returns unexpected word breaks: the returned word segments are "a", ".\xAD", and "b". But I think WB4 should have prevented these word breaks, right? By comparison, "a.b" and "a\xAD.b" are both correctly classified as a single word.
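
To make the expectation concrete: U+00AD (SOFT HYPHEN) has Word_Break=Format, so under WB4 it should be transparent to the later rules, and WB6/WB7 should then keep the letter, full stop, letter sequence together. Below is a minimal Python sketch of just that reasoning. It is not ICU4X code and not a full UAX #29 implementation: the `WORD_BREAK` table and the `segment` function are hypothetical names, the table is hand-written and covers only the four characters in this report, and only WB4, WB5, WB6, WB7, and WB999 are modelled.

```python
# Illustrative sketch only: a tiny subset of the UAX #29 word boundary rules.
# Hand-written Word_Break values for just the characters in this issue.
WORD_BREAK = {
    "a": "ALetter",
    "b": "ALetter",
    ".": "MidNumLet",
    "\u00ad": "Format",  # SOFT HYPHEN
}


def word_break(ch):
    """Look up the (hand-written) Word_Break value; anything else is Other."""
    return WORD_BREAK.get(ch, "Other")


def segment(text):
    """Split text into segments using only WB4, WB5, WB6, WB7 and WB999."""
    if not text:
        return []

    # WB4: a Format character is absorbed into the preceding character,
    # so the rules below only ever look at the remaining "base" characters.
    # Each unit is (property, index of the unit's first character).
    units = []
    for i, ch in enumerate(text):
        prop = word_break(ch)
        if prop == "Format" and units:
            continue  # absorbed into the previous unit
        units.append((prop, i))

    breaks = [0]
    for k in range(1, len(units)):
        prev2 = units[k - 2][0] if k >= 2 else None
        prev = units[k - 1][0]
        cur = units[k][0]
        nxt = units[k + 1][0] if k + 1 < len(units) else None

        # WB5: do not break between letters.
        if prev == "ALetter" and cur == "ALetter":
            continue
        # WB6: ALetter x MidNumLet ALetter
        if prev == "ALetter" and cur == "MidNumLet" and nxt == "ALetter":
            continue
        # WB7: ALetter MidNumLet x ALetter
        if prev2 == "ALetter" and prev == "MidNumLet" and cur == "ALetter":
            continue
        # WB999: otherwise, break.
        breaks.append(units[k][1])

    breaks.append(len(text))
    return [text[s:e] for s, e in zip(breaks, breaks[1:])]


if __name__ == "__main__":
    for sample in ["a.b", "a\u00ad.b", "a.\u00adb"]:
        print(ascii(sample), "->", [ascii(s) for s in segment(sample)])
```

Under these assumptions all three inputs come back as a single segment, which is the behaviour I would expect for "a.\xADb" as well.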

CC: @aethanyc and @makotokato

sffc (Member) commented Dec 7, 2023

@eggrobin

@aethanyc aethanyc added the C-segmentation Component: Segmentation label Dec 7, 2023
@sffc sffc added this to the 1.5 Blocking ⟨P1⟩ milestone Dec 7, 2023
@Manishearth Manishearth moved this to Unclaimed for sprint in icu4x 2.0 Feb 23, 2024

Manishearth (Member) commented

#4550 fixes some of the issues here but not all of them; see #4550 (comment).

Manishearth pushed a commit that referenced this issue Mar 6, 2024

robertbastian (Member) commented

Tracking remaining changes in #5015

@github-project-automation github-project-automation bot moved this from Unclaimed for sprint to Done in icu4x 2.0 Jun 6, 2024