feat(core): normalization per spec for transforms/etc 🙀 #9468
NFD marker tricks:
The upstream CLDR normalization ticket was merged, but basically we don't need the ticket, we need the behavior. So this is shovel ready.
So, at this moment I'm leaning toward not trying to normalize in kmc at all. The reason is that the core side will already need to be able to normalize not just all strings in the compiled data, but also the context. Secondly, it gets us out of having to even consider what version of node (or browser!) kmc is running under. Normalizing in kmc could even lead to a class of non-determinism in the compiler, where two runs of kmc give different kmx depending on the node version. With a 'leave it alone' approach, we just write into kmx exactly whatever is in the xml.
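To illustrate the version dependence described above, here is a minimal sketch in Python (used only as a stand-in; kmc itself runs on node). It relies on the standard `unicodedata` module, which also exposes a frozen Unicode 3.2 database as `unicodedata.ucd_3_2_0`. The sample character and the helper names `nfd_current` and `nfd_unicode_3_2` are chosen for illustration and are not part of kmc or core.

```python
import unicodedata

# U+1B06 BALINESE LETTER AKARA TEDUNG was added in Unicode 5.0 and has the
# canonical decomposition U+1B05 U+1B35. It is unassigned in Unicode 3.2.
SAMPLE = "\u1b06"


def nfd_current(text: str) -> str:
    """NFD using whatever Unicode data this runtime ships with."""
    return unicodedata.normalize("NFD", text)


def nfd_unicode_3_2(text: str) -> str:
    """NFD using the frozen Unicode 3.2 tables, standing in for an older runtime."""
    return unicodedata.ucd_3_2_0.normalize("NFD", text)


def code_points(text: str) -> list[str]:
    """Render a string as a list of U+XXXX labels for display."""
    return [f"U+{ord(ch):04X}" for ch in text]


if __name__ == "__main__":
    print("runtime Unicode version:", unicodedata.unidata_version)
    print("current NFD:    ", code_points(nfd_current(SAMPLE)))
    print("Unicode 3.2 NFD:", code_points(nfd_unicode_3_2(SAMPLE)))
```

The same input yields different output depending on the tables in use, which is the kind of difference that would leak into kmx if the compiler normalized. Writing the xml strings through unchanged sidesteps it.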
Hmm. I hear you on non-determinism. It should only impact unencoded scripts though, right? Given stability rules? But... how will you do regex matching? It seems to me it would be difficult to normalize regexes post-construction.
Previously unencoded, yes.
The regex pattern is just a string and can be normalized before constructing the matcher, I'd think? (He says, confidently)
And even worse,
Ref https://unicode.org/reports/tr18/#Canonical_Equivalents Note the magical step 2: "Having the user design the regular expression pattern to match against that defined normalization form."
if |
may need to parse and process such a range.
characters can be checked for perhaps… sequences may be more challenging.
Precisely. If I have a transform from
Could be a reason for limiting the size of ranges… if need be. We may end up from all of this needing to say that the regexes must be written in NFD, and see the TR…
<transform from="[a][\u{0300}][\u{0320}]" />
<transform from="a\u{0300}\u{0320}" />
<transform from="à̠" />
<transform from="a\u{0320}\u{0300}" /> etc.
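A rough sketch of why the bracketed form is the awkward one, again in Python as a stand-in rather than the core implementation. It assumes the "normalize the pattern string, then build the matcher" approach floated above, and assumes the precomposed example is encoded as U+00E0 followed by U+0320. The names `nfd`, `matches_after_nfd`, `LITERAL_FORMS`, `BRACKETED`, and `CONTEXT` are hypothetical.

```python
import re
import unicodedata


def nfd(text: str) -> str:
    """Canonical decomposition plus canonical reordering of combining marks."""
    return unicodedata.normalize("NFD", text)


# Three spellings of the same canonically equivalent text.
LITERAL_FORMS = [
    "a\u0300\u0320",   # grave (ccc 230) then macron below (ccc 220)
    "\u00e0\u0320",    # precomposed a-grave, then macron below (assumed encoding)
    "a\u0320\u0300",   # already in canonical order
]

# The bracketed spelling: each combining mark sits in its own character class.
BRACKETED = "[a][\u0300][\u0320]"

# The context as it would look once normalized to NFD.
CONTEXT = nfd("a\u0300\u0320")


def matches_after_nfd(pattern: str, context: str) -> bool:
    """Normalize the pattern as a plain string, then match the whole context."""
    return re.fullmatch(nfd(pattern), context) is not None


if __name__ == "__main__":
    for pattern in LITERAL_FORMS + [BRACKETED]:
        print(ascii(pattern), matches_after_nfd(pattern, CONTEXT))
```

The literal spellings all collapse to the same NFD string, so normalizing the pattern is enough. The bracketed spelling survives NFD untouched, because the brackets separate the marks and block reordering, so it still expects grave before macron below and fails against the NFD context. Handling it would mean parsing the pattern, or requiring authors to write patterns in NFD as TR18 suggests.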
Let's discuss at our meeting tomorrow.
- some failing marker tests For: #9468
- add a new remove_markers(std::u32string) function - add test cases for text utils - update (failing) test cases for transform - improve documentation of append process - support KM_CORE_BT_UNKNOWN in ldml test - remove_markers with a map - update normalize test For: #9468
- km::kbp is soooo last month! - test_transforms can run NFD with markers, with some caveats. For: #9468
- refactor out backspace processing into a function. - for now, just drop any markers in the context when we're lopping off the end For: #9468
- fix a cast - fix some test cases For: #9468
- go back to NFD for the context, for now - anticipating when the privatecontext is NFD but the public context is NFC - also update the test cases For: #9468
- a little further - couple places where "it wasn't plugged in" - adding some LDML-TODOs - marker creep - fixed one unnecessary alloc/dealloc For: #9468
- literally a bad assert. the error case is handled below, in fact the unit test tests for it. - for some reason, assert.h wasn't included in some cases locally. For: #9468
- don't skip markers when calling context_to_string()! Oops. - update docs on ldml_processor::remove_text() - update remove_text() to handle markers in the context string. This is really: #9468
- the intermediate stages of transforms also need to use marker-safe normalization - re-enable a test that was failing previously due to this For: #9468
CLDR-16943 details (or will detail) SC consensus about the role of Unicode normalization. Implement it.
Split out to remaining issues under m:normalization