Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(core): normalization per spec for transforms/etc 🙀 #9468

Closed
srl295 opened this issue Aug 15, 2023 · 22 comments
Closed

feat(core): normalization per spec for transforms/etc 🙀 #9468

srl295 opened this issue Aug 15, 2023 · 22 comments
Assignees
Milestone

Comments

@srl295
Copy link
Member

srl295 commented Aug 15, 2023

CLDR-16943 details (or will detail) SC consensus about the role of Unicode normalization. Implement it.

Split out to remaining issues under)m:normalization

@srl295
Copy link
Member Author

srl295 commented Oct 3, 2023

  • Cached Context needs to stay in NFD
  • Engine needs to be able to do a normalization-insensitive compare
  • Normalization should be available to KMN but not switched on now - KeyboardProcessor

@mcdurdin
Copy link
Member

mcdurdin commented Oct 3, 2023

NFD marker tricks:

  1. U+0061 \m{marker1} U+0300 gives NFC of à. So what happens to the marker in the cached context? Is cached context NFD but app context is NFC? How do we sync?
  2. U+0061 \m{marker1} U+0300 \m{marker2} U+0320 goes to NFD U+0061 U+0320 U+0300 which breaks markers, because we no longer know where they go. Should this be a compiler error? Or is there some clever way we can work around this (e.g. normalize both app context and cached context before comparison for equality?) Do markers glue to previous codepoint, e.g. U+0061 \m{marker1} U+0320 U+0300 \m{marker2}
  3. keyboard rule gives U+00e0 U+0320 \m{marker1}, which we NFD to U+0061 U+0320 \m{marker1} U+0300??
  4. keyboard rule gives U+00e0 \m{marker1} U+0320, which we NFD to U+0061 U+0320 U+0300 \m{marker1}??

@srl295
Copy link
Member Author

srl295 commented Oct 4, 2023

upstream CLDR normalization ticket was merged, but basically, we don't need the ticket, we need the behavior. So this is shovel ready.

@srl295
Copy link
Member Author

srl295 commented Oct 7, 2023

So, I'm kind of thinking at this moment about not trying to normalize in kmc at all. The reason is, because the core side will already need to be able to normalize not just all strings in the compiled data, but also the context. Secondly, it gets us out of having to even consider what version of node (or browser!) kmc is running under. This could even lead to a class of non-determinism in the compiler, where two runs of kmc give different kmx depending on the node version. By a 'leave it alone' approach, we just write into kmx exactly whatever is in the xml.

@mcdurdin
Copy link
Member

mcdurdin commented Oct 7, 2023

Hmm. I hear you on non-determinism. It should only impact unencoded scripts though, right? Given stability rules?

But... how will you do regex matching? It seems to me it would be difficult to normalize regexes post-construction.

@srl295
Copy link
Member Author

srl295 commented Oct 8, 2023

Hmm. I hear you on non-determinism. It should only impact unencoded scripts though, right? Given stability rules?

previously unencoded, yes.

But... how will you do regex matching? It seems to me it would be difficult to normalize regexes post-construction.

The regex pattern is just a string and can be normalized before constructing the matcher, I'd think? (He says, confidently)

@mcdurdin
Copy link
Member

mcdurdin commented Oct 8, 2023

The regex pattern is just a string and can be normalized before constructing the matcher, I'd think? (He says, confidently)

\uxxxx says otherwise.

@mcdurdin
Copy link
Member

mcdurdin commented Oct 8, 2023

And even worse, [a-z]?

@mcdurdin
Copy link
Member

mcdurdin commented Oct 8, 2023

Ref https://unicode.org/reports/tr18/#Canonical_Equivalents

Note the magical step 2: "Having the user design the regular expression pattern to match against that defined normalization form."

@srl295
Copy link
Member Author

srl295 commented Oct 8, 2023

The regex pattern is just a string and can be normalized before constructing the matcher, I'd think? (He says, confidently)

\uxxxx says otherwise.

  • \u{xxxx} is already processed by KMC.
  • So \u{00E9} will already be E9 00 in UTF-16 in .kmx
  • when core pulls it in, it can be normalized

if \u{xxxx} was processed later by the regex engine, then yes. but that should only be necessary for syntactical elements.

@srl295
Copy link
Member Author

srl295 commented Oct 8, 2023

[a-z]

may need to parse and process such a range.

For example, the pattern should contain no characters that would not occur in that normalization form, nor sequences that would not occur.

characters can be checked for perhaps… sequences may be more challenging.

@mcdurdin
Copy link
Member

mcdurdin commented Oct 8, 2023

Precisely. If I have a transform from [\u{00E8}-\u{00EB}] (èéêë) to [\u{00EC}-\u{00EF}] (ìíîï), how will that work with decomposition? Do we need to expand ranges (beware 0020-10FFFF)?

@srl295
Copy link
Member Author

srl295 commented Oct 8, 2023

Precisely. If I have a transform from [\u{00E8}-\u{00EB}] (èéêë) to [\u{00EC}-\u{00EF}] (ìíîï), how will that work with decomposition? Do we need to expand ranges (beware 0020-10FFFF)?

Could be a reason for limiting the size of ranges…if need be

We may end up from all of this needing to say, the regexes must be written in NFD and see the TR…

@srl295
Copy link
Member Author

srl295 commented Oct 8, 2023

  • maybe the spec should say that the transforms are actually in NFD space (may make the most sense). That is, the pattern becomes NFD.
  • In other words, this would never match, because the input text would always be normalized to a\u{0320}\u{300}
<transform from="[a][\u{0300}][\u{0320}]" />
  • however, all of these could match, because the string would be normalized
<transform from="a\u{0300}\u{0320}" />
<transform from="à̠" />
<transform from="a\u{0320}\u{0300}" />

etc

@mcdurdin
Copy link
Member

mcdurdin commented Oct 9, 2023

Let's discuss at our meeting tomorrow

srl295 added a commit that referenced this issue Nov 14, 2023
- some failing marker tests

For: #9468
srl295 added a commit that referenced this issue Nov 14, 2023
srl295 added a commit that referenced this issue Nov 14, 2023
- add a new remove_markers(std::u32string) function
- add test cases for text utils
- update (failing) test cases for transform
- improve documentation of append process
- support KM_CORE_BT_UNKNOWN in ldml test
- remove_markers with a map
- update normalize test

For: #9468
srl295 added a commit that referenced this issue Nov 14, 2023
- km::kbp is soooo last month!
- test_transforms can run NFD with markers, with some caveats.

For: #9468
srl295 added a commit that referenced this issue Nov 14, 2023
- refactor out backspace processing into a function.
- for now, just drop any markers in the context when we're lopping off the end

For: #9468
srl295 added a commit that referenced this issue Nov 14, 2023
- fix a cast
- fix some test cases

For: #9468
srl295 added a commit that referenced this issue Nov 14, 2023
- go back to NFD for the context, for now
- anticipating when the privatecontext is NFD but the public context is NFC
- also update the test cases

For: #9468
srl295 added a commit that referenced this issue Nov 15, 2023
- a little further
- couple places where "it wasn't plugged in"
- adding some LDML-TODOs - marker creep
- fixed one unnecessary alloc/dealloc

For: #9468
srl295 added a commit that referenced this issue Nov 15, 2023
- temporarily remove a backspace test until #9450 is fixed.

For: #9468
srl295 added a commit that referenced this issue Nov 15, 2023
srl295 added a commit that referenced this issue Nov 16, 2023
- literally a bad assert. the error case is handled below, in fact the unit test tests for it.
- for some reason, assert.h wasn't included in some cases locally.

For: #9468
srl295 added a commit that referenced this issue Nov 20, 2023
- KM_CORE_BT_UNKNOWN should only be used with an empty context

For: #9468  but related to #9450
@darcywong00 darcywong00 modified the milestones: A17S26, A17S27 Nov 27, 2023
srl295 added a commit that referenced this issue Dec 7, 2023
- also affects markers, feat(core): normalization per spec for transforms/etc 🙀  #9468
- keep markers in nfd context string
- fix ldml test harness to handle context reset
- update test case
- still issues with overproduction of markers in the context

For: #9451
srl295 added a commit that referenced this issue Dec 7, 2023
- don't skip markers when calling context_to_string()! Oops.
- update docs on ldml_processor::remove_text()
- update remove_text() to handle markers in the context string.

This is really: #9468
@mcdurdin mcdurdin modified the milestones: A17S27, A17S28 Dec 8, 2023
srl295 added a commit that referenced this issue Dec 21, 2023
srl295 added a commit that referenced this issue Dec 21, 2023
srl295 added a commit that referenced this issue Dec 22, 2023
@srl295 srl295 modified the milestones: A17S28, A17S29 Dec 22, 2023
srl295 added a commit that referenced this issue Dec 22, 2023
- the intermediate stages of transforms also need to use marker-safe normalization
- re-enable a test that was failing previously due to this

For: #9468
@srl295 srl295 closed this as completed Jan 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants