uncommon_codepoints is only checked post-NFC #120697

Open
Jules-Bertholet opened this issue Feb 6, 2024 · 4 comments
Labels
A-diagnostics (Area: Messages for errors, warnings, and lints)
A-lints (Area: Lints (warnings about flaws in source code) such as unused_mut.)
A-Unicode (Area: Unicode)
C-bug (Category: This is a bug.)
L-uncommon_codepoints (Lint: uncommon_codepoints)
T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.)

Comments

@Jules-Bertholet
Contributor

Jules-Bertholet commented Feb 6, 2024

Code

#![forbid(uncommon_codepoints)]
pub const L·L: u32 = 7;

Current output

(compiles successfully)

Desired output

error: identifier contains an uncommon Unicode codepoint: '·'
 --> src/lib.rs:2:11
  |
2 | pub const L·L: u32 = 7;
  |           ^^^
  |

Rationale and extra context

The · in the above code snippet is U+0387 GREEK ANO TELEIA, which has an Identifier_Status of Restricted and should therefore trigger the uncommon_codepoints lint. However, U+0387 has a canonical (singleton) decomposition to U+00B7 ( · ) MIDDLE DOT, so NFC normalization replaces it with U+00B7. U+00B7 has an Identifier_Status of Allowed and is therefore not flagged by the lint. Because the compiler applies NFC normalization to identifiers before checking uncommon_codepoints, the lint never sees the Restricted character that was actually written and incorrectly fails to fire.
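To illustrate the ordering problem, here is a minimal toy model. It is not rustc's implementation: `toy_nfc`, `toy_status`, and `first_restricted` are hypothetical helpers that know about only the single character discussed here, standing in for real NFC normalization and the real UTS #39 Identifier_Status data.

```rust
/// The two Identifier_Status values relevant to this issue.
#[derive(Debug, Clone, Copy, PartialEq)]
enum IdentifierStatus {
    Allowed,
    Restricted,
}

/// Toy stand-in for NFC (assumption: handles ONLY the singleton mapping
/// U+0387 -> U+00B7; real NFC covers far more than this).
fn toy_nfc(ident: &str) -> String {
    ident
        .chars()
        .map(|c| if c == '\u{0387}' { '\u{00B7}' } else { c })
        .collect()
}

/// Toy status table (assumption: treats everything except U+0387 as Allowed).
fn toy_status(c: char) -> IdentifierStatus {
    if c == '\u{0387}' {
        IdentifierStatus::Restricted
    } else {
        IdentifierStatus::Allowed
    }
}

/// Returns the first Restricted character in `ident`, if any.
fn first_restricted(ident: &str) -> Option<char> {
    ident
        .chars()
        .find(|&c| toy_status(c) == IdentifierStatus::Restricted)
}
```

In this model, `first_restricted("L\u{0387}L")` finds U+0387, while `first_restricted(&toy_nfc("L\u{0387}L"))` finds nothing, which mirrors the behavior reported above: checking after normalization hides the character the user typed.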

Rust Version

rustc 1.75.0 (82e1608df 2023-12-21)
binary: rustc
commit-hash: 82e1608dfa6e0b5569232559e3d385fea5a93112
commit-date: 2023-12-21
host: x86_64-unknown-linux-gnu
release: 1.75.0
LLVM version: 17.0.6

@rustbot label A-unicode

Jules-Bertholet added the A-diagnostics and T-compiler labels Feb 6, 2024
rustbot added the A-Unicode label Feb 6, 2024
@Jules-Bertholet
Contributor Author

As an alternative to fixing this, or in addition to it, perhaps rustfmt should NFC-normalize identifiers?

@workingjubilee
Member

Should we lint against non-NFC-normalized idents?

@workingjubilee
Member

Opened rust-lang/rustfmt#6058

workingjubilee added the A-lints and C-bug labels Feb 6, 2024
@workingjubilee
Member

A program that makes the difference visible to human eyes: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=8a408f3aba0d4a3c99a1fbf85f9c4473

fn main() {
    for id in ["·", stringify!(L·L), "L·L", stringify!(), "_·"] {
        println!("printing {len} bytes of: {id}", len = id.len());
        for c in id.chars() {
            let mut bytes = [0; 4];
            c.encode_utf8(&mut bytes);
            for byte in bytes.iter().take(c.len_utf8()) {
                print!("{:x} ", byte);
            }
        }
        println!();
    }
}
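Because the two dots render identically, a variant that spells the characters with escapes may be easier to follow. This is only a sketch: `utf8_hex` and `code_points` are hypothetical helper names, not part of the playground program above.

```rust
/// Render the UTF-8 bytes of `s` as space-separated lowercase hex.
fn utf8_hex(s: &str) -> String {
    s.bytes()
        .map(|b| format!("{b:02x}"))
        .collect::<Vec<_>>()
        .join(" ")
}

/// Render each Unicode scalar value of `s` in U+XXXX notation.
fn code_points(s: &str) -> String {
    s.chars()
        .map(|c| format!("U+{:04X}", c as u32))
        .collect::<Vec<_>>()
        .join(" ")
}
```

With these helpers, `utf8_hex("\u{0387}")` gives `ce 87` (GREEK ANO TELEIA) and `utf8_hex("\u{00B7}")` gives `c2 b7` (MIDDLE DOT), so the byte dump shows which of the two characters a given string actually contains.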
