feat(core): normalization per spec for transforms/etc 🙀 #9468

srl295 · 2023-08-15T16:40:36Z

CLDR-16943 details (or will detail) SC consensus about the role of Unicode normalization. Implement it.

Split out into the remaining issues under m:normalization

srl295 · 2023-10-03T02:19:02Z

Cached Context needs to stay in NFD
Engine needs to be able to do a normalization-insensitive compare
Normalization should be available to KMN but not switched on now - KeyboardProcessor
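For illustration only, a normalization-insensitive compare can be sketched by bringing both sides to NFD before comparing. This is a minimal Python sketch, not the Core implementation; `canonically_equal` and the two sample contexts are hypothetical names and values.

```python
import unicodedata


def nfd(text: str) -> str:
    """Return the canonical decomposition (NFD) of text."""
    return unicodedata.normalize("NFD", text)


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings while ignoring differences in normalization form."""
    return nfd(a) == nfd(b)


# Hypothetical example: the cached context is held in NFD,
# while the app hands back the same text in NFC.
cached_context = "a\u0320\u0300"   # a + combining minus below + combining grave
app_context = "\u00e0\u0320"       # precomposed a-grave + combining minus below

assert cached_context != app_context             # code point sequences differ
assert canonically_equal(cached_context, app_context)
```

Comparing in NFD matches the idea of keeping the cached context in NFD: only the app-side string needs converting at compare time.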

mcdurdin · 2023-10-03T02:28:01Z

NFD marker tricks:

U+0061 \m{marker1} U+0300 gives NFC of à. So what happens to the marker in the cached context? Is cached context NFD but app context is NFC? How do we sync?
U+0061 \m{marker1} U+0300 \m{marker2} U+0320 goes to NFD U+0061 U+0320 U+0300 which breaks markers, because we no longer know where they go. Should this be a compiler error? Or is there some clever way we can work around this (e.g. normalize both app context and cached context before comparison for equality?) Do markers glue to previous codepoint, e.g. U+0061 \m{marker1} U+0320 U+0300 \m{marker2}
keyboard rule gives U+00e0 U+0320 \m{marker1}, which we NFD to U+0061 U+0320 \m{marker1} U+0300??
keyboard rule gives U+00e0 \m{marker1} U+0320, which we NFD to U+0061 U+0320 U+0300 \m{marker1}??
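To make the reordering problem concrete, here is a rough Python sketch. The first part only shows that NFD swaps U+0300 and U+0320 (combining classes 230 and 220). The second part tries out one possible answer to the gluing question, where each marker travels with the code point that follows it. That policy, the `Marker` type, and `reorder_with_markers` are assumptions made up for this example; nothing here is Keyman's actual marker encoding or algorithm, and the sketch only handles text that is already decomposed.

```python
import unicodedata
from typing import List, NamedTuple, Union


class Marker(NamedTuple):
    """Hypothetical stand-in for a marker; not Keyman's real encoding."""
    name: str


Token = Union[str, Marker]


def ccc(ch: str) -> int:
    """Canonical combining class of a single code point."""
    return unicodedata.combining(ch)


def reorder_with_markers(tokens: List[Token]) -> List[Token]:
    """Canonically reorder already-decomposed text, carrying each marker
    along with the code point that follows it (an assumed policy)."""
    units = []     # (code point, markers that preceded it)
    pending = []   # markers seen since the last code point
    for tok in tokens:
        if isinstance(tok, Marker):
            pending.append(tok)
        else:
            units.append((tok, pending))
            pending = []

    out = []
    run = []       # current run of combining marks (ccc != 0)
    for unit in units:
        if ccc(unit[0]) == 0:
            # A starter ends the run; sorted() is stable, as UAX #15 requires.
            out.extend(sorted(run, key=lambda u: ccc(u[0])))
            run = []
            out.append(unit)
        else:
            run.append(unit)
    out.extend(sorted(run, key=lambda u: ccc(u[0])))

    flat: List[Token] = []
    for ch, markers in out:
        flat.extend(markers)
        flat.append(ch)
    return flat + pending   # markers with nothing after them stay at the end


# Without markers, NFD reorders the two combining marks.
assert unicodedata.normalize("NFD", "\u0061\u0300\u0320") == "\u0061\u0320\u0300"

tokens = ["a", Marker("marker1"), "\u0300", Marker("marker2"), "\u0320"]
result = reorder_with_markers(tokens)
```

Under this policy `result` is `a \m{marker2} U+0320 \m{marker1} U+0300`: the markers survive, but their relative order changes, which is exactly the kind of outcome the questions above are asking about.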

srl295 · 2023-10-04T21:15:58Z

upstream CLDR normalization ticket was merged, but basically, we don't need the ticket, we need the behavior. So this is shovel ready.

srl295 · 2023-10-07T16:58:56Z

So, at the moment I'm thinking about not trying to normalize in kmc at all. First, the core side will already need to be able to normalize not just all strings in the compiled data, but also the context. Second, it gets us out of having to even consider what version of node (or browser!) kmc is running under. Normalizing in kmc could even lead to a class of non-determinism in the compiler, where two runs of kmc give different kmx depending on the node version. With a 'leave it alone' approach, we just write into kmx exactly whatever is in the xml.

mcdurdin · 2023-10-07T22:43:21Z

Hmm. I hear you on non-determinism. It should only impact unencoded scripts though, right? Given stability rules?

But... how will you do regex matching? It seems to me it would be difficult to normalize regexes post-construction.

srl295 · 2023-10-08T00:10:45Z

> Hmm. I hear you on non-determinism. It should only impact unencoded scripts though, right? Given stability rules?

previously unencoded, yes.

> But... how will you do regex matching? It seems to me it would be difficult to normalize regexes post-construction.

The regex pattern is just a string and can be normalized before constructing the matcher, I'd think? (He says, confidently)

mcdurdin · 2023-10-08T00:12:06Z

> The regex pattern is just a string and can be normalized before constructing the matcher, I'd think? (He says, confidently)

\uxxxx says otherwise.

mcdurdin · 2023-10-08T00:12:42Z

And even worse, [a-z]?

mcdurdin · 2023-10-08T00:15:48Z

Ref https://unicode.org/reports/tr18/#Canonical_Equivalents

Note the magical step 2: "Having the user design the regular expression pattern to match against that defined normalization form."

srl295 · 2023-10-08T00:17:08Z

> The regex pattern is just a string and can be normalized before constructing the matcher, I'd think? (He says, confidently)

> \uxxxx says otherwise.

\u{xxxx} is already processed by KMC.
So \u{00E9} will already be E9 00 in UTF-16 in the .kmx.
When core pulls it in, it can be normalized.

if \u{xxxx} was processed later by the regex engine, then yes. but that should only be necessary for syntactical elements.
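As a rough illustration of that ordering (escapes expanded first, normalization applied afterwards), here is a small Python sketch. The `expand_escapes` and `normalize_literal` helpers are hypothetical, the escape syntax handled is only `\u{XXXX}`, and the approach is only safe for literal text: it does nothing sensible for syntactical elements such as character classes.

```python
import re
import unicodedata

# Matches \u{XXXX} with 1 to 6 hex digits (a simplification for this sketch).
ESCAPE = re.compile(r"\\u\{([0-9A-Fa-f]{1,6})\}")


def expand_escapes(pattern: str) -> str:
    """Replace each \\u{XXXX} escape with the code point it names."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), pattern)


def normalize_literal(pattern: str) -> str:
    """Expand escapes, then normalize the resulting literal text to NFD."""
    return unicodedata.normalize("NFD", expand_escapes(pattern))


# The compiler-side step: \u{00E9} becomes a real code point,
# stored as the bytes E9 00 in little-endian UTF-16.
stored = expand_escapes(r"\u{00E9}")
assert stored.encode("utf-16-le") == b"\xe9\x00"

# The core-side step: the stored text can then be normalized.
assert normalize_literal(r"\u{00E9}") == "e\u0301"
```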

srl295 · 2023-10-08T00:20:31Z

> [a-z]

may need to parse and process such a range.

> For example, the pattern should contain no characters that would not occur in that normalization form, nor sequences that would not occur.

characters can be checked for perhaps… sequences may be more challenging.

mcdurdin · 2023-10-08T00:21:56Z

Precisely. If I have a transform from [\u{00E8}-\u{00EB}] (èéêë) to [\u{00EC}-\u{00EF}] (ìíîï), how will that work with decomposition? Do we need to expand ranges (beware 0020-10FFFF)?

srl295 · 2023-10-08T01:16:21Z

> Precisely. If I have a transform from [\u{00E8}-\u{00EB}] (èéêë) to [\u{00EC}-\u{00EF}] (ìíîï), how will that work with decomposition? Do we need to expand ranges (beware 0020-10FFFF)?

Could be a reason for limiting the size of ranges…if need be

We may end up from all of this needing to say, the regexes must be written in NFD and see the TR…
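One way to picture range expansion with a size limit is the Python sketch below. It turns a small range into an alternation of the NFD form of each member and refuses ranges above an arbitrary cap. `MAX_RANGE`, its value of 64, and `expand_range_nfd` are all assumptions for this example, not anything decided in this thread.

```python
import re
import unicodedata

MAX_RANGE = 64  # hypothetical cap; the thread only suggests that a limit may be needed


def expand_range_nfd(first: str, last: str) -> str:
    """Expand the range first-last into a non-capturing alternation
    of the NFD form of every code point in the range."""
    lo, hi = ord(first), ord(last)
    if hi < lo:
        raise ValueError("range is reversed")
    if hi - lo + 1 > MAX_RANGE:
        raise ValueError("range too large to expand")
    alternatives = [unicodedata.normalize("NFD", chr(cp)) for cp in range(lo, hi + 1)]
    return "(?:" + "|".join(re.escape(alt) for alt in alternatives) + ")"


# [\u{00E8}-\u{00EB}] becomes e+U+0300 | e+U+0301 | e+U+0302 | e+U+0308
pattern = re.compile(expand_range_nfd("\u00e8", "\u00eb"))

assert pattern.fullmatch("e\u0302") is not None   # decomposed input matches
assert pattern.fullmatch("\u00ea") is None        # precomposed input does not
```

A range such as 0020-10FFFF is rejected by the cap before any expansion is attempted.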

srl295 · 2023-10-08T03:20:10Z

maybe the spec should say that the transforms are actually in NFD space (may make the most sense). That is, the pattern becomes NFD.
In other words, this would never match, because the input text would always be normalized to a\u{0320}\u{0300}

<transform from="[a][\u{0300}][\u{0320}]" />

however, all of these could match, because the string would be normalized

<transform from="a\u{0300}\u{0320}" />
<transform from="à̠" />
<transform from="a\u{0320}\u{0300}" />

etc
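A quick way to check this reasoning is the Python sketch below. It uses Python's `re` and `unicodedata` modules as stand-ins for whatever the engine does, and it writes the character-class pattern by hand instead of parsing the XML, so treat it as an illustration of the NFD-space idea only.

```python
import re
import unicodedata


def nfd(text: str) -> str:
    """Return the canonical decomposition (NFD) of text."""
    return unicodedata.normalize("NFD", text)


# Whatever the user typed, the context is normalized before matching.
context = nfd("a\u0300\u0320")
assert context == "a\u0320\u0300"

# The three literal from= strings all normalize to the same NFD sequence.
literal_forms = ["a\u0300\u0320", "\u00e0\u0320", "a\u0320\u0300"]
assert all(nfd(form) == context for form in literal_forms)

# The class-based pattern fixes the order grave, then minus below,
# which never occurs in NFD text, so it cannot match.
class_pattern = re.compile("[a][\u0300][\u0320]")
assert class_pattern.fullmatch(context) is None
```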

mcdurdin · 2023-10-09T04:52:13Z

Let's discuss at our meeting tomorrow

- some failing marker tests For: #9468

For: #9468

- add a new remove_markers(std::u32string) function - add test cases for text utils - update (failing) test cases for transform - improve documentation of append process - support KM_CORE_BT_UNKNOWN in ldml test - remove_markers with a map - update normalize test For: #9468

- km::kbp is soooo last month! - test_transforms can run NFD with markers, with some caveats. For: #9468

- refactor out backspace processing into a function. - for now, just drop any markers in the context when we're lopping off the end For: #9468

- fix a cast - fix some test cases For: #9468

- go back to NFD for the context, for now - anticipating when the privatecontext is NFD but the public context is NFC - also update the test cases For: #9468

- a little further - couple places where "it wasn't plugged in" - adding some LDML-TODOs - marker creep - fixed one unnecessary alloc/dealloc For: #9468

- temporarily remove a backspace test until #9450 is fixed. For: #9468

- test fix For: #9468

- literally a bad assert. the error case is handled below, in fact the unit test tests for it. - for some reason, assert.h wasn't included in some cases locally. For: #9468

- KM_CORE_BT_UNKNOWN should only be used with an empty context For: #9468 but related to #9450

- also affects markers, feat(core): normalization per spec for transforms/etc 🙀 #9468 - keep markers in nfd context string - fix ldml test harness to handle context reset - update test case - still issues with overproduction of markers in the context For: #9451

- don't skip markers when calling context_to_string()! Oops. - update docs on ldml_processor::remove_text() - update remove_text() to handle markers in the context string. This is really: #9468

For #9468

- one disabled for now For #9468

- the intermediate stages of transforms also need to use marker-safe normalization - re-enable a test that was failing previously due to this For: #9468

srl295 added the epic-ldml label Aug 15, 2023

srl295 added this to the A17S20 milestone Aug 15, 2023

srl295 self-assigned this Aug 15, 2023

srl295 mentioned this issue Aug 15, 2023

feat(core): for KMW/wasm, devolve regex/normalization back to JS 🙀 #9467

Closed

mcdurdin added the core/ Keyman Core label Aug 16, 2023

mcdurdin modified the milestones: A17S20, A17S21 Sep 1, 2023

srl295 mentioned this issue Sep 7, 2023

feat: support normalization in keyboards #5809

Open

mcdurdin modified the milestones: A17S21, A17S22 Sep 15, 2023

mcdurdin mentioned this issue Sep 20, 2023

feat(developer): Test normalization of lexical models at build #2880

Closed

mcdurdin added the m:normalization label Sep 21, 2023

keymanapp-test-bot bot added the feat label Sep 21, 2023

mcdurdin modified the milestones: A17S22, A17S23 Sep 30, 2023

srl295 added a commit that referenced this issue Nov 14, 2023

feat(core): ldml normalization 🙀

efc8ef0

- some failing marker tests For: #9468

srl295 added a commit that referenced this issue Nov 14, 2023

feat(core): ldml dx: dump vkey and modifier 🙀

255960e

For: #9468

srl295 added a commit that referenced this issue Nov 14, 2023

feat(core): ldml marker normalization 🙀

a69feb5

- km::kbp is soooo last month! - test_transforms can run NFD with markers, with some caveats. For: #9468

srl295 added a commit that referenced this issue Nov 14, 2023

feat(core): ldml marker normalization 🙀

78c2ac4

- refactor out backspace processing into a function. - for now, just drop any markers in the context when we're lopping off the end For: #9468

srl295 added a commit that referenced this issue Nov 14, 2023

feat(core): test fix 🙀

7b48194

- fix a cast - fix some test cases For: #9468

srl295 added a commit that referenced this issue Nov 14, 2023

feat(core): marker normalization 🙀

2f843d9

- go back to NFD for the context, for now - anticipating when the privatecontext is NFD but the public context is NFC - also update the test cases For: #9468

srl295 added a commit that referenced this issue Nov 15, 2023

feat(core): marker normalization 🙀

07a4821

- a little further - couple places where "it wasn't plugged in" - adding some LDML-TODOs - marker creep - fixed one unnecessary alloc/dealloc For: #9468

srl295 added a commit that referenced this issue Nov 15, 2023

feat(core): marker normalization 🙀

5a04a81

- temporarily remove a backspace test until #9450 is fixed. For: #9468

srl295 added a commit that referenced this issue Nov 15, 2023

feat(core): marker normalization 🙀

e28f359

- test fix For: #9468

srl295 added a commit that referenced this issue Nov 16, 2023

feat(core): marker normalization 🙀

4bbafa6

- literally a bad assert. the error case is handled below, in fact the unit test tests for it. - for some reason, assert.h wasn't included in some cases locally. For: #9468

srl295 added a commit that referenced this issue Nov 20, 2023

feat(developer): ldml fix testcase processing 🙀

f524519

- KM_CORE_BT_UNKNOWN should only be used with an empty context For: #9468 but related to #9450

darcywong00 modified the milestones: A17S26, A17S27 Nov 27, 2023

srl295 mentioned this issue Dec 7, 2023

feat(core): ldml improve key-not-found 🙀 #10090

Merged

mcdurdin modified the milestones: A17S27, A17S28 Dec 8, 2023

srl295 added a commit that referenced this issue Dec 21, 2023

chore(core): uncomment some more tests 🙀

0b3e7a9

For #9468

srl295 mentioned this issue Dec 21, 2023

fix(core): ldml fixes for normalization between transform groups 🙀 #10290

Merged

srl295 added a commit that referenced this issue Dec 21, 2023

chore(core): add some more marker tests 🙀

12aa9a5

For #9468

srl295 added a commit that referenced this issue Dec 22, 2023

chore(core): add some more marker tests 🙀

9d1599d

- one disabled for now For #9468

srl295 modified the milestones: A17S28, A17S29 Dec 22, 2023

This was referenced Jan 4, 2024

feat(developer): regex: support ranges 🙀 #10316

Closed

feat(developer): normalization on the developer side 🙀 #10317

Closed

feat(core): Normalization needs to support multiple markers 🙀 #10320

Closed

srl295 closed this as completed Jan 4, 2024
