Provide UTS 46 mapping / disallow operations fused with normalization #2850
Comments
Should we even provide transitional processing, since it looks like the transition period is finally ending?
What timing with the Mustafa email! As discussed in #42, I prefer for IDNA to be its own crate that can be powered with ICU4X data. I think the This is a concern largely driven by governance: ICU4X is already a large project, and I don't wish to take ownership of components for which we don't have sufficient knowledge to maintain. We accepted Collator and Normalizer because they are integral to ECMA-402; this required a fairly substantial effort between yourself and @echeran as co-owner, and future changes will require one of you as a reviewer.
After exploring the Rust crate space around this topic a tiny bit, I'm not particularly happy to find how much code is needed to match How about this: We put a data struct carrying the disallowed & ignored data in a crate called This would put all the relevant data into ICU4X but would leave Punycode and the rest out of ICU4X for now. (
@markusicu @echeran can you provide input here? What sensible primitive can ICU4X expose that enables IDNA to be implemented externally?
Note: My assumption is that UTS 46 section 4.1 Validity Criteria items 1 and 6 would be checked by applying the Map & Normalize primitive proposed above and checking the result for error or equality with input. This means that in Processing step 4, in the xn-- case, the Map & Normalize primitive would run again, and in the non-xn-- case items 1 and 6 would already be OK by construction. Please let me know if I've missed something and this wouldn't hold.
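To make the "error or equality with input" check concrete, here is a minimal sketch. `toy_map_normalize` is a hypothetical stand-in for the Map & Normalize primitive (it is not an ICU4X API), and its three rules are toy assumptions chosen only so the example runs without Unicode data.

```rust
/// Hypothetical stand-in for the "Map & Normalize" primitive: returns
/// `None` on disallowed input, otherwise the mapped label.
/// The rules below are toy assumptions, not UTS 46 data.
fn toy_map_normalize(label: &str) -> Option<String> {
    let mut out = String::with_capacity(label.len());
    for c in label.chars() {
        match c {
            // Toy rule: treat ASCII space as disallowed.
            ' ' => return None,
            // Toy rule: treat SOFT HYPHEN as ignored (mapped to nothing).
            '\u{00AD}' => {}
            // Toy rule: ASCII lowercasing stands in for the real mapping.
            _ => out.push(c.to_ascii_lowercase()),
        }
    }
    Some(out)
}

/// A label passes if the primitive succeeds and leaves it unchanged.
fn is_stable<F>(label: &str, map_normalize: F) -> bool
where
    F: Fn(&str) -> Option<String>,
{
    match map_normalize(label) {
        Some(mapped) => mapped == label,
        None => false,
    }
}
```

In the xn-- case, the decoded label would be passed through `is_stable`; in the non-xn-- case the label is already the output of the primitive, so the check holds by construction.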
Oh, and the data needs to support flagging disallowed-if-UseSTD3ASCIIRules=true (and
We need a trie with two bits per scalar value. We could pack the bits for four scalar values in one
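A rough sketch of the two-bits-per-scalar-value packing, under stated assumptions: the storage unit is taken to be a `u8` (four values times two bits), and the four `Status` names are hypothetical labels for the states discussed in this thread, not the actual data layout.

```rust
/// Two-bit classification for one scalar value (names are hypothetical).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u8)]
enum Status {
    Valid = 0,
    Ignored = 1,
    Disallowed = 2,
    DisallowedStd3 = 3,
}

impl Status {
    /// Decodes the low two bits of `bits`.
    fn from_bits(bits: u8) -> Status {
        match bits & 0b11 {
            0 => Status::Valid,
            1 => Status::Ignored,
            2 => Status::Disallowed,
            _ => Status::DisallowedStd3,
        }
    }
}

/// Packs the statuses of four consecutive scalar values into one byte,
/// with the lowest scalar value in the lowest two bits.
fn pack(statuses: [Status; 4]) -> u8 {
    statuses
        .iter()
        .enumerate()
        .fold(0u8, |acc, (i, s)| acc | ((*s as u8) << (2 * i)))
}

/// Reads the status of `c` from the byte covering the aligned block of
/// four scalar values that contains `c`.
fn unpack(packed: u8, c: char) -> Status {
    let slot = (c as u32) & 0b11;
    Status::from_bits(packed >> (2 * slot))
}
```

With this layout, a trie would be indexed by the scalar value shifted right by two, and the low two bits of the scalar value would select the slot within the returned byte.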
Ping @markusicu @echeran
Current notes:
From merely looking at the data with knowledge of the relevant data structure, making everyone carry the STD3 disallowed info does not seem particularly nice, but I haven't actually measured yet.
Now that I've looked at this some more, it seems to me that it's a bad idea to check the input scalar values for being STD3-disallowed; instead, it makes sense to check the output for STD3-prohibited ASCII if the STD3 check is in effect. Am I missing something?
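Checking the output rather than the input could look roughly like the sketch below. The function names are hypothetical, and the sketch assumes the label has already been mapped, so uppercase ASCII is treated as prohibited because the mapping would have lowercased it.

```rust
/// Returns true if `c` is an ASCII character that STD3 rules do not permit
/// inside a label: anything other than a-z, 0-9, and hyphen-minus.
/// Assumes `c` comes from already-mapped output.
fn is_std3_prohibited_ascii(c: char) -> bool {
    c.is_ascii() && !matches!(c, 'a'..='z' | '0'..='9' | '-')
}

/// Checks one mapped label (no label separators expected inside it).
/// Non-ASCII characters are left to the other validity checks.
fn label_passes_std3(mapped_label: &str) -> bool {
    !mapped_label.chars().any(is_std3_prohibited_ascii)
}
```

Because this only inspects ASCII in the output, it needs no per-scalar-value STD3 flag in the data; whether that trade-off is acceptable is the open question in this comment.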
About UseSTD3ASCIIRules:
* Bake ignored/disallow data into the normalization data after all.
* Make public operations available via a dedicated wrapper type instead of the main normalizer types.

Closes unicode-org#2850
Gecko implements most of its IDNA processing on its own, but it currently uses ICU4C's uidna_labelToUnicode for UTS 46 processing.

Code that wraps ComposingNormalizer::try_new_uts46_without_ignored_and_disallowed_unstable and augments it to provide the functionality of ICU4C's uidna_labelToUnicode needs to go somewhere: since it needs some data for ignored and disallowed that is in sync with ICU4X data, it would make sense to start a new ICU4X component for it. (We already reserve the name.)