-
Notifications
You must be signed in to change notification settings - Fork 182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Unicode 15.0 line breaking #4389
Conversation
The tests are failing because LineBreakTest.txt tests for LB25 as tailored by Example 7, not for the untailored version. |
# https://github.com/eggrobin/icu/tree/export-monkeys-15.0-untailor-lb | ||
# (specifically, at ea318304775e0a5194785f51120672aafac7b2bd) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please get this change checked in. I'm not able to review the raw rule changes, so the test passing is all I can go by to confirm that they are correct. We also want to be able to regenerate this in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot check this in, because it is in the past (forked from Unicode 15.0), and because removing the tailoring is not something we want to do in ICU.
Once we get to parity with ICU, this will be generated from the real ICU at its release tag; but I do not want to go from where we are to 15.1 with all ICU tailorings in one giant step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course the change that makes it possible to export the monkey tests will be checked in, as part of unicode-org/icu#2637 (whose reviewers I really need to poke); but that will be on top of 15.1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tailorings could be behind a flag, and it could be on the 73 maintenance branch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly that is not really feasible, as discussed.
# LineBreakTest-15.0.0.txt | ||
# Date: 2022-02-26, 00:38:39 GMT | ||
# © 2022 Unicode®, Inc. | ||
# THIS IS NOT LineBreakTest-15.0.0.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these manual changes to the test file? Please document its provenance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Documented (though you won’t like it :-p).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🏳️🌈🏳️🌈🏳️🌈
…g and unused (#4400) Something should still be done about #1637 eventually, but since the higher states have been renumbered by #4389, let’s not leave them around for someone to trip over. Since we have not yet migrated those properties to 15.1, the states that correspond to LB property values have not changed.
this is changelog worthy |
The current implementation was attempting the LB25 tailoring recommended in Example 7 of Section 8.2 in UAX14 version 15.0; however, this requires more than one code point of lookahead* because of
(PR | PO) × ( OP | HY )? NU
, which the current implementation of the line segmenter cannot do. Instead this pull request goes back to the untailored LB25 from Unicode 15.0.The implementation was tested with two million test cases; I last encountered a failure somewhere in the nine thousands. I should probably do an overnight run. Only 200 test cases are included here; as usual, anyone working on the rules should try very long monkey test runs.
This fixes #4146.
—
* This will be needed for 15.1 line segmentation too. While we have that capability in the other segmenters, used in the sentence segmenter (the relevant rules are called intermediate match rules or interm(ediate) break states in this implementation), straightforwardly reusing that code would run into into issues as we have so many states in line breaking that we cannot dedicate a whole bit to that property of the state. This can probably be worked around (as far as I can tell we use the sign bit for a property of two special states, so we could probably be a bit more sparing), but will come later.