[RFC] Remove "normalized to NFKC" clause from the reference manual, section 3.1 #12388

Closed
omasanori opened this issue Feb 19, 2014 · 6 comments · Fixed by #16216

Comments

@omasanori
Contributor

From The Rust Reference Manual:

Rust input is interpreted as a sequence of Unicode codepoints encoded in UTF-8, normalized to Unicode normalization form NFKC.

However, NFKC requires transforming some characters into different ones, even inside strings and comments, so we would get different results in such cases. Even NFC has some problems if we have to preserve text strictly.
(Yes, the word "different" is ambiguous: under NFKC the characters are treated as the same, but their glyphs differ, although that sometimes depends on the font.)
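
To make the concern concrete, here is a minimal sketch in Python, chosen only because its standard `unicodedata` module implements the normalization forms; it says nothing about what rustc does. The sample characters and variable names are illustrative picks, not taken from the manual.

```python
import unicodedata

# Compatibility characters that NFKC folds into plain ones.
ligature = "\ufb01"         # LATIN SMALL LIGATURE FI
circled_one = "\u2460"      # CIRCLED DIGIT ONE
superscript_two = "\u00b2"  # SUPERSCRIPT TWO

nfkc_results = {
    ch: unicodedata.normalize("NFKC", ch)
    for ch in (ligature, circled_one, superscript_two)
}

# NFC leaves the characters above untouched, but it still replaces
# "singleton" characters such as ANGSTROM SIGN with their canonical form.
angstrom_sign = "\u212b"    # ANGSTROM SIGN
nfc_angstrom = unicodedata.normalize("NFC", angstrom_sign)

for original, normalized in nfkc_results.items():
    print(f"NFKC: U+{ord(original):04X} -> {normalized!r}")
print(f"NFC:  U+{ord(angstrom_sign):04X} -> U+{ord(nfc_angstrom):04X}")
```

If a lexer applied either form to a whole source file, a string literal containing one of these characters would no longer hold the bytes the programmer typed.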

I'd suggest removing the "normalized to NFKC" clause and leaving the input as it is, like golang does. From The Go Programming Language Specification:

The text is not canonicalized, so a single accented code point is distinct from the same character constructed from combining an accent and a letter; those are treated as two code points.
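
A small sketch of what "not canonicalized" means in practice, written in Python for convenience (the variable names are hypothetical). Without normalization, the two spellings of "é" are simply different code point sequences; they only compare equal after an explicit normalization step.

```python
import unicodedata

precomposed = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT, two code points

# Compared as raw code point sequences, the two spellings differ.
same_without_normalization = precomposed == decomposed

# After canonical composition (NFC) they become identical.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

print(len(precomposed), len(decomposed))
print(same_without_normalization, same_after_nfc)
```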

@Kimundi
Member

Kimundi commented Feb 19, 2014

We actually don't do any normalization of source input right now, the manual is just plain wrong in claiming it. A PR removing that sentence would probably be welcomed.

@omasanori
Contributor Author

@Kimundi IMHO the text requires that we programmers normalize our source code before compilation, so I would ask whether that is really needed.

@emberian
Member

That's not what the manual is saying; the manual is saying that it normalizes input. We haven't decided yet whether normalizing input is the correct thing to do, though.

@omasanori
Contributor Author

@cmr Thank you for clarifying. Why does the spec say so?

@pnkfelix
Member

@omasanori probably because there is precedent in other languages for doing such normalization, at least for identifiers. (Though if the spec implies doing it in string constants, then that is probably just sloppiness in the writing of the spec.)

Related bug: #2253

update: To clarify, graydon originally wanted to do NFKC normalization in the lexer (as noted in the bug above), but he changed his mind and so we have been in a bit of a state of limbo ever since. But as I said above, the scope of that normalization was, I think, intended to be restricted to identifiers, not all lexical syntax (i.e. not the interior of string constants).
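
As an illustration of that kind of precedent, Python converts identifiers to NFKC while parsing but leaves string contents untouched. The sketch below relies on that documented Python behaviour only; it is not a description of rustc, and `source` and `namespace` are hypothetical names.

```python
# The identifier and the string literal both use U+FF58,
# FULLWIDTH LATIN SMALL LETTER X.
source = '\uff58 = "\uff58"'

namespace = {}
exec(source, namespace)

# The identifier was normalized to plain ASCII "x" by the parser...
identifier_normalized = "x" in namespace and "\uff58" not in namespace

# ...while the string constant kept its original code point.
string_preserved = namespace["x"] == "\uff58"

print(identifier_normalized, string_preserved)
```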

@omasanori
Contributor Author

@pnkfelix Thank you.

I agree with normalizing identifiers; I think that is acceptable. Certainly, some similar but distinct identifiers would then be treated as the same, but we should not rely on such tricks anyway. (The NFKC vs. NFC question remains, though.)
I think it would be good to update the spec to say that we may normalize identifiers for ease of comparison/collation.

bors added a commit that referenced this issue Aug 5, 2014
The reference manual said that code is interpreted as UTF-8 text and that an implementation will normalize it to NFKC. However, rustc doesn't do any normalization now.

We may want to do some normalization for symbols, but normalizing the whole text seems harmful because doing so loses information even if we choose a non-K variant of normalization.

I'd suggest removing the "normalized to Unicode normalization form NFKC" phrase for the present so that the manual represents the current state properly. When we address the problem (with an RFC?), the manual should be updated.

Closes #12388.

Reference: #2253
bors added a commit to rust-lang-ci/rust that referenced this issue Jul 25, 2022
internal: Make use of the statusBarItem colors in VSCode

Fixes rust-lang/rust-analyzer#7736