Do NFKC normalization in lexer #2253
Comments
Actually, I've changed my mind on this and no longer think we should NFKC-normalize here. But check the Python Unicode identifiers PEP to confirm they do not either.

Python does normalize. See http://bugs.python.org/issue10952 for some relevant discussion.
Sadness! And double-sadness on the filesystem thing. Though I don't really understand what the commenter is getting at in http://bugs.python.org/issue10952#msg126592 ... wouldn't the filename in the filesystem (however it's encoded) NFKC-normalize to the NFKC-normalized string the user asked for? In any case, I think we can probably dodge whatever bullet is there (if there is one), since we do our linkage by relatively open-ended scanning plus matching metadata tags. I don't think it should ever be a bug to consider "too many crates" in the search path; including a directory in the search path should open up the possibility of the compiler looking at all crates in that dir as part of its scan.
Revisiting for triage; all I have to add is the sparkling new A-unicode tag.

Visiting for triage, email from 2013-08-26. (Added a link to the definition of NFKC in the Unicode spec.)
Regarding Graydon's question about the filesystem problem: my interpretation of the problem is that you have a scenario roughly like this: a file on disk has a name F that is not in NFKC form; the source code refers to the corresponding module with an identifier spelled the same way; the lexer NFKC-normalizes that identifier to F'; and the compiler's filesystem lookup for F' then fails, because the file is stored under F, not F'.
Hypothetically, I guess the compiler could attempt to invert the normalization process during the filesystem lookup and query the filesystem for every filename E such that E normalizes to F'? (Maybe that's what Graydon meant when he said he did not understand the problem; maybe he was assuming that whatever component does the filesystem query should be able to walk over a relatively small set of candidate files and NFKC-normalize their names during the lookup?)

In any case, don't we always have the option of putting explicit names for the modules we import into attribute strings, which (AFAIK) should not be NFKC-normalized? So someone should always be able to work around this problem if it arises on their particular filesystem? (Isn't that right? We're only talking here about NFKC normalization of identifier tokens, not of literal strings, right? That's certainly the only place the FIXME occurs in lexer.rs.)

(Also, the above scenario may just be an artificial problem. That argument was certainly made on that thread on Python issue 10952.)

There is also discussion of NFKC/NFC normalization for identifiers, in the context of Scheme, here:
My personal instinct has been to do NFC normalization, not NFKC. Yet it seems that experts in both the Python community and the Scheme community have argued for NFKC normalization of identifiers. (As did @nikomatsakis during the triage meeting, I think.) Well, maybe for Scheme that makes more sense, at least on implementations that also case-fold; that was what came to my mind when I read the statement here: http://bugs.python.org/issue10952#msg182911 According to that note, fullwidth and halfwidth variants are commonly used and recognized as different characters in Japanese environments, and the commenter thought it odd to distinguish upper- and lower-case while conflating full- and half-width. (Anyone here familiar with Japanese environments? I'd love to get a better understanding of this.)
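(For concreteness, a tiny illustration of the fullwidth point, not taken from the discussion above: it uses the `unicode-normalization` crate purely as one convenient way to apply the two normalization forms; any conformant implementation gives the same answers.)

```rust
// NFC leaves compatibility variants alone; NFKC folds them together.
use unicode_normalization::UnicodeNormalization;

fn main() {
    let fullwidth = "\u{FF21}"; // FULLWIDTH LATIN CAPITAL LETTER A
    let nfc: String = fullwidth.nfc().collect();
    let nfkc: String = fullwidth.nfkc().collect();
    assert_eq!(nfc, "\u{FF21}"); // canonical normalization keeps the fullwidth form
    assert_eq!(nfkc, "A");       // compatibility normalization folds it to ASCII 'A'
}
```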
(also, added reference to UAX-31 to the description.)
I actually don't claim to argue either way. I feel like this is not …
I am not familiar with Japanese environments (tagging @gifnksm), but for Korean, my understanding is that it is desirable to have U+1100 HANGUL CHOSEONG KIYEOK and U+3131 HANGUL LETTER KIYEOK be equivalent, which means compatibility normalization. The reason is that the same keypress will result in U+3131 on Windows but U+1100 on a Mac.
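(As a concrete check of the claim above, a small sketch, again assuming the `unicode-normalization` crate as a stand-in for whatever normalization routine is used.)

```rust
// The two Korean code points stay distinct under canonical (NFC) normalization,
// but compatibility (NFKC) normalization maps them to the same code point.
use unicode_normalization::UnicodeNormalization;

fn main() {
    let compat_jamo = "\u{3131}"; // HANGUL LETTER KIYEOK (typical Windows input)
    let choseong    = "\u{1100}"; // HANGUL CHOSEONG KIYEOK (typical Mac input)

    let a: String = compat_jamo.nfkc().collect();
    let b: String = choseong.nfkc().collect();
    assert_eq!(a, b); // NFKC: equivalent

    let c: String = compat_jamo.nfc().collect();
    assert_ne!(c, choseong); // NFC: still distinct
}
```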
My initial feeling was that we should do NFC and not NFKC, but the fullwidth-vs-halfwidth issue and @sanxiyn's example Korean characters have changed my mind, such that I now think NFKC is the appropriate choice.

Regarding the filesystem issue: the only filesystem I know of offhand that normalizes filenames is HFS+, which is fine because it normalizes to NFD, and AFAIK Unicode isn't adding any new composed characters (which means HFS+ won't encounter a composed character that it doesn't know how to decompose). Strictly speaking it normalizes to a variant of NFD that doesn't match current Unicode (I don't know if it's pegged at an older Unicode standard or merely has modifications), for backwards-compatibility reasons, but regardless, on HFS+ we shouldn't have an issue.

Are there in fact filesystems out there that impose other character sets on filenames? If so, trying to reverse the normalization process seems implausible: besides there being no pre-existing mapping from NFC/NFD output back to all the composed characters that could produce it (note: there are composed characters that NFC will actually decompose), the way normalization orders combining marks means that something like Zalgo text would produce a combinatorial explosion of potential un-normalized names. Instead, if this is actually found to be an issue in practice, I would recommend simply reading the directory, passing all encountered filenames through NFKC normalization, and picking the file that matches (sketched in code after this comment). If multiple files match, consider that an error. Naturally this would only be done if there is no file that already matches the normalized identifier.

In any case, I think the right approach is to NFKC-normalize identifiers during lexing; then, if we don't want to solve the filesystem issue, we can introduce either a lint or a feature gate for non-ASCII module names and extern crate names (as no other identifier will map to a filename).

Another point to consider is future compatibility for XID_Start/XID_Continue. There are three approaches we can take:
1. Use the XID_Start/XID_Continue sets from whichever Unicode version rustc currently implements, so characters added in later Unicode versions become valid identifiers only once rustc updates its tables.
2. Fix identifier syntax to one particular Unicode version and never change it.
3. Use "immutable identifiers", i.e. also accept currently unassigned code points, so the set of allowed identifiers never changes as Unicode evolves.
These three approaches are all suggested by UAX #31 (Unicode Identifier and Pattern Syntax). Approach #3 is the most forward-compatible, but it has the downside of allowing identifiers that would otherwise be invalid. #1 is obviously the easiest, and personally I think #1 is probably fine: if you want to use identifier characters added in future versions of Unicode in your project, you just have to be OK with requiring a minimum version of rustc to compile it. And if this is a big worry, we can always add a lint to catch such newly-added characters.
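(The directory-scan fallback described in the previous comment, sketched as hypothetical code. The function name, error handling, and the use of the `unicode-normalization` crate are illustrative assumptions, not anything rustc actually does.)

```rust
// Look up a module file by its NFKC-normalized name. Fast path: the file is
// already stored under the normalized name. Fallback: scan the directory,
// NFKC-normalize every filename, and accept a unique match; treat multiple
// matches as an error.
use std::fs;
use std::path::{Path, PathBuf};
use unicode_normalization::UnicodeNormalization;

fn find_module_file(dir: &Path, normalized: &str) -> Result<PathBuf, String> {
    let direct = dir.join(normalized);
    if direct.exists() {
        return Ok(direct); // filename already matches the normalized identifier
    }
    let mut matches = Vec::new();
    for entry in fs::read_dir(dir).map_err(|e| e.to_string())? {
        let entry = entry.map_err(|e| e.to_string())?;
        let name = entry.file_name().to_string_lossy().into_owned();
        let folded: String = name.nfkc().collect();
        if folded == normalized {
            matches.push(entry.path());
        }
    }
    match matches.len() {
        1 => Ok(matches.pop().unwrap()),
        0 => Err(format!("no file in {} normalizes to `{}`", dir.display(), normalized)),
        _ => Err(format!("multiple files in {} normalize to `{}`", dir.display(), normalized)),
    }
}
```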
Non-ASCII identifiers are feature-gated, so I believe this is not 1.0. Nominating.
And by not 1.0, I mean P-backcompat-lang.

Assigning P-low, not 1.0.
The reference manual said that code is interpreted as UTF-8 text and an implementation will normalize it to NFKC. However, rustc doesn't do any normalization now. We may want to do some normalization for symbols, but normalizing the whole text seems harmful, because doing so loses some information even if we choose a non-K variant of normalization. I'd suggest removing the "normalized to Unicode normalization form NFKC" phrase for the present, so that the manual represents the current state properly. When we address the problem (with an RFC?), the manual should then be updated. Closes #12388. Reference: #2253
I'm pulling a massive triage effort to get us ready for 1.0. As part of this, I'm moving stuff that's wishlist-like to the RFCs repo, as that's where major new things should get discussed/prioritized. This issue has been moved to the RFCs repo: rust-lang/rfcs#802
This is a somewhat controversial thing, but I have made a decision: identifiers should be normalized to fold visual ambiguity, and the normalization form should be NFKC. Rationale:

1. Compatibility decomposition is favored over canonical decomposition because it provides useful folding for letter ligatures, fullwidth forms, certain CJK ideographs, etc.
2. Compatibility decomposition is favored over canonical decomposition because it provides more protection from visual spoofing.
3. A standard Unicode transformation should be favored over anything ad hoc because it's predictable and more mature.
4. Normalization is a compromise between freedom of expression and ease of implementation. Source code is not prose; there are rules.

Here are some references to other languages:
SRFI 52: http://srfi-email.schemers.org/srfi-52/
Julia: JuliaLang/julia#5434
Python: http://bugs.python.org/issue10952
Rust: rust-lang/rust#2253

Unfortunately, there aren't very many precedents and open discussions about Unicode usage in programming languages, especially in languages with very permissive identifier syntax (like Scheme). Aside from identifiers, there are more places where Unicode can be used:

* Characters are not normalized, not even to NFC. That might have been useful, for example, to recompose combining marks, but unfortunately NFC may do more transformations than that, so it is a no-go. We preserve the exact Unicode character.
* Character names, on the other hand, are case-sensitive identifiers, so they are normalized as such.
* Strings and escaped identifiers are left untouched in order to preserve the exact spelling from the source code.
* Directives are case-insensitive identifiers and are normalized as such.
* Numbers should be composed from ASCII only, so they are not normalized. Sometimes this produces weird parses because characters that look like signs are not treated as such. However, those characters are invalid in numbers anyway, so it's somewhat justified.
* Peculiar identifiers are shit. I'm sorry. Because of NFKC it is possible to write a plain, unescaped identifier that will parse as a number after going through NFKC. It may even look exactly like a number without being one. There is not much we can do about this, so we produce a warning just in case. (A small illustration of this follows below.)
* Datum labels are mostly numbers, so they are not normalized either. Note that sometimes they can be treated as numbers with an invalid prefix.
* Comments are ignored.
* Delimiters should be ASCII-only. No discussion on this. Unicode has various fancy whitespaces and line separators, but this is source code, not a rich text document in a word processor.

Also, currently case-folding is performed only for the ASCII range. Identifiers should use the NFKC_Casefold transformation; it will be implemented later.
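(A minimal illustration of the "peculiar identifier" caveat above, again assuming the `unicode-normalization` crate as a stand-in for whatever normalization routine is used: text that is not a number as written maps onto an ASCII digit sequence under NFKC.)

```rust
// Superscript and fullwidth digits are not ASCII digits, so as written these
// tokens would not lex as numbers; after NFKC they are plain "123".
use unicode_normalization::UnicodeNormalization;

fn main() {
    let superscripts = "\u{00B9}\u{00B2}\u{00B3}"; // ¹²³
    let folded: String = superscripts.nfkc().collect();
    assert_eq!(folded, "123");

    let fullwidth = "\u{FF11}\u{FF12}\u{FF13}"; // １２３
    let folded: String = fullwidth.nfkc().collect();
    assert_eq!(folded, "123");
}
```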
`next_token_inner` in the lexer has a comment saying to do NFKC normalization. I have no idea what that is, but I guess we should do it.
reference: NFKC is one of four Unicode Normalization Forms.
reference: UAX-31 supplies guidelines for use of normalization with identifiers.