Unicode Normalization #2512
base: main
Conversation
At this point, it probably makes sense to put the pitch up sooner rather than fully incorporate the feedback below. The feedback below might make sense as an "Additional Design Considerations" section. Some of the API feedback we could incorporate right away.

Normal forms and data tables

The stdlib currently ships data tables which support the NFC and NFD normal forms. It does not currently ship NFKC and NFKD data tables. NFKC and NFKD serve purposes other than checking canonical equivalence, so it may make sense to relegate those APIs to another library, such as swift-foundation, instead of the stdlib.

Stability over Unicode versions

The concept of stability and "any version of Unicode" might be too nuanced for these doc comments. Broadly speaking, it doesn't make much sense to talk about Unicode prior to 2.0 outside of extremely niche, archeological use cases. The stdlib shouldn't really bother with any Unicode processing prior to 3.0, as those versions permitted overlong encodings in UTF-8. Canonical equivalence only really makes sense for Unicode 3.1 and later, and the modern notion of normalization stability is a 4.1-and-later guarantee.

API feedback

Stability seems a little more niche. It's relevant for domains that may have an invariant over content that is held across different processes running different versions of Swift or using different Unicode implementations. It seems like some kind of

We can also make initializers that will decode-and-normalize the content in one go, i.e. a

I'm not entirely sold on the normalized API on Character. I suppose it makes some kind of algebraic sense, but it seems like you should normalize the String itself instead.

I think we can come out and say that the preferred form is NFC. We pitch an extension on

For

In addition to

Another future direction would be protocols that abstract the normalization itself, so that other libraries can plug in their own data tables or provide their own normal forms.
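To make the shape of the discussion concrete, here is a rough sketch of the kind of API surface being talked about. All names here (`Unicode.CanonicalNormalizationForm`, `normalized(_:)`, the decode-and-normalize initializer) are hypothetical illustrations of the direction, not the pitched API.

```swift
// Hypothetical sketch only: these names and shapes are illustrative,
// not the pitched stdlib API.
extension Unicode {
    /// Canonical normal forms backed by the stdlib's data tables.
    /// (NFKC/NFKD would live in another library, e.g. swift-foundation.)
    public enum CanonicalNormalizationForm {
        case nfc
        case nfd
    }
}

extension String {
    /// Returns a copy of this string in the given canonical normal form.
    public func normalized(_ form: Unicode.CanonicalNormalizationForm) -> String {
        // Placeholder body: a real implementation would decompose,
        // canonically reorder, and (for NFC) recompose the scalars
        // using the stdlib's normalization data tables.
        fatalError("illustrative sketch only")
    }

    /// Decode-and-normalize in one go, e.g. from raw UTF-8 input.
    public init(
        normalizing codeUnits: some Sequence<UInt8>,
        as form: Unicode.CanonicalNormalizationForm
    ) {
        self = String(decoding: Array(codeUnits), as: UTF8.self).normalized(form)
    }
}
```

Whether Character should get a matching `normalized(_:)` is exactly the question debated in the rest of this thread.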
Added some prose, alternatives, closure-taking API, and future direct…
I'm also wondering about this. From what I can tell, Unicode does not make any promises about the interaction between grapheme breaking and normalization. In fact, it encourages implementations to tailor the grapheme breaking rules to their liking, which would probably render all such promises toothless. (Swift itself implements non-standard rules, although hopefully they'll go away when we upgrade to Unicode 15.1. Until we decide to add more custom rules, that is -- for example, to match behavior between the Swift stdlib and the macOS/iOS text display/processing frameworks.) Is it guaranteed that the position/number of grapheme breaks will not be affected by normalization? If not, we probably cannot provide normalization APIs on Character.
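(As a purely empirical spot check, and not a substitute for a spec-level guarantee, Foundation's existing canonical mappings can be used to compare grapheme counts for a composed/decomposed pair:)

```swift
import Foundation

// "é" spelled as one precomposed scalar vs. "e" + combining acute accent.
let composed = "\u{00E9}"
let decomposed = "e\u{0301}"

// Foundation's existing canonical mappings (not the pitched stdlib API):
let nfc = decomposed.precomposedStringWithCanonicalMapping   // -> NFC
let nfd = composed.decomposedStringWithCanonicalMapping      // -> NFD

// The two spellings are canonically equivalent, so they already compare
// equal as Swift Strings.
print(composed == decomposed)            // true

// In this example, normalization does not change the number of
// grapheme clusters -- the open question is whether that holds in general.
print(composed.count, decomposed.count)  // 1 1
print(nfc.count, nfd.count)              // 1 1
```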
The motivation for adding normalization to Character
Unfortunately, Characters are usually small strings, which don't have an

Canonical normalisation should be safe to add to Character even though grapheme-breaking rules are not stable. Since a Character is just a String with length 1, the question can be rephrased: will grapheme-breaking ever see a different number of characters in two canonically-equivalent strings? And I think it's clear that, whichever rules we use (and even allowing for tailoring), that should never happen. If it ever did, it would indicate a bug in grapheme breaking.
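(Spelled out in code, the argument is that a Character round-trip through the hypothetical `String.normalized(_:)` sketched earlier can only fail if normalization splits a grapheme cluster:)

```swift
extension Character {
    /// Hypothetical sketch building on the String.normalized(_:) sketch above.
    /// The precondition encodes the assumption under discussion: normalizing
    /// a single grapheme cluster never introduces a new grapheme break.
    func normalized(_ form: Unicode.CanonicalNormalizationForm) -> Character {
        let normalized = String(self).normalized(form)
        precondition(normalized.count == 1,
                     "normalization changed the number of grapheme clusters")
        return normalized.first!
    }
}
```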
I do agree that it would very much be desirable if we could simply assume that (canonical) normalization will not affect grapheme breaking boundaries. But this isn't about hopes and dreams -- can we rely on current and future versions of Unicode to ensure this, or not? If not, then the stdlib must not make such guarantees at its API level. I could not find a place anywhere in the Unicode standard where this is called out as a feature of the current definitions (is it?), much less any indication that this property is covered by a Unicode stability policy. Please prove me wrong -- it would be a huge relief if this were an invariant we could trust. I would much prefer if equal

FWIW, as of 15.1, the baseline grapheme cluster boundary rules rely on properties such as Indic_Conjunct_Break (or Indic_Syllabic_Category) that don't even seem to be subject to any stability constraints.
While I've always found it to be true in practice that grapheme break boundaries are also normalization segment boundaries, and that normalizing a grapheme cluster would not introduce new grapheme breaks, I don't think we can rely on this. We should probably remove the normalization API on Character.
@lorentey - Citations, as requested 😇

UAX#29 - Unicode Text Segmentation
I think we can rely on this - it seems the specifications go out of their way to ensure it holds. As it says, "using a grapheme cluster as the fundamental unit of matching thus provides a very clear and easily explained basis for canonically equivalent matching" -- this is only possible if boundaries are the same for canonically-equivalent strings. I appreciate that it is less common/useful than normalising a String, but the conversion

That said, if we're still unconvinced, I don't mind deferring it.
TIL. That sounds like the guarantee we need. @lorentey, what do you think?