From ec728b3e24a2ea2ad802328e4a6d1ac628731ee6 Mon Sep 17 00:00:00 2001 From: Pyfisch Date: Sun, 3 Jun 2018 18:52:55 +0200 Subject: [PATCH 01/26] Initial draft of unicode-idents RFC --- text/0000-unicode-idents.md | 145 ++++++++++++++++++++++++++++++++++++ 1 file changed, 145 insertions(+) create mode 100644 text/0000-unicode-idents.md diff --git a/text/0000-unicode-idents.md b/text/0000-unicode-idents.md new file mode 100644 index 00000000000..6fad20b1bff --- /dev/null +++ b/text/0000-unicode-idents.md @@ -0,0 +1,145 @@ +- Feature Name: unicode_idents +- Start Date: 2018-06-03 +- RFC PR: (leave this empty) +- Rust Issue: (leave this empty) + +# Summary +[summary]: #summary + +Allow non-ASCII letters (such as accented characters, Cyrillic, Greek, Kanji, etc.) in Rust identifiers. + +# Motivation +[motivation]: #motivation + +Rust is written by many people who are not fluent in the English language. Using identifiers in ones native language eases writing and reading code for these developers. + +The rationale from [PEP 3131] nicely explains it: + +> ~~Python~~ *Rust* code is written by many people in the world who are not familiar with the English language, or even well-acquainted with the Latin writing system. Such developers often desire to define classes and functions with names in their native languages, rather than having to come up with an (often incorrect) English translation of the concept they want to name. By using identifiers in their native language, code clarity and maintainability of the code among speakers of that language improves. +> +> For some languages, common transliteration systems exist (in particular, for the Latin-based writing systems). For other languages, users have larger difficulties to use Latin to write their native words. + +Additionally some math oriented projects may want to use identifiers closely resembling mathematical writing. 
+ +# Guide-level explanation +[guide-level-explanation]: #guide-level-explanation + +Identifiers include variable names, function and trait names and module names. They start with a letter or an underscore and may be followed by more letters, digits and some connecting punctuation. + +Examples of valid identifiers are: + +* English language words: `color`, `image_width`, `line2`, `Photo`, `_unused`, ... +* ASCII words in foreign languages: `die_eisenbahn`, `el_tren`, `artikel_1_grundgesetz` +* words containing accented characters: `garçon`, `hühnervögel` +* identifiers in other scripts: `Москва`, `東京`, ... + +Examples of invalid identifiers are: + +* Keywords: `impl`, `fn`, `_` (underscore), ... +* Identifiers starting with numbers or "non letters": `42_the_answer`, `third√of7`, `◆◆◆`, ... +* Emojis: 🙂, 🦀, 💩, ... + +Similar Unicode identifiers are normalized: `a1` and `a₁` refer to the same variable. This also applies to accented characters which can be represented in different ways. + +To disallow any Unicode identifiers in a project (for example to ease collaboration or for security reasons) limiting the accepted identifiers to ASCII add this lint to the `lib.rs` or `main.rs` file of your project: + +```rust +#![forbid(unicode_idents)] +``` + +Some Unicode character look confusingly similiar to each other or even identical like the Latin **A** and the Cyrillic **А**. The compiler may warn you about easy to confuse names in the same scope. If needed (but not recommended) this warning can be silenced with a `#[allow(confusable_unicode_idents)]` annotation on the enclosing function or module. + +# Reference-level explanation +[reference-level-explanation]: #reference-level-explanation + +Identifiers in Rust are based on the [Unicode® Standard Annex #31 Unicode Identifier and Pattern Syntax][TR31]. Rust compilers shall use at least Revision 27 of the standard. 
+ +The lexer defines identifiers as: + +> **Lexer:** +> IDENTIFIER_OR_KEYWORD: +>    XID_Start XID_Continue\* +>    | `_` XID_Continue+ +> +> IDENTIFIER : +> IDENTIFIER_OR_KEYWORD *Except a [strict] or [reserved] keyword* + +`XID_Start` and `XID_Continue` are used as defined in the aforementioned standard. The definition of identifiers is forward compatible with each successive release of Unicode as only appropriate new characters are added to the classes but none are removed. + +Two identifiers X, Y are considered to be equal if there [NFKC forms][TR15] are equal: NFKC(X) = NFKC(Y). + +A `unicode_idents` lint is added to the compiler. This lint is `allow` by default. The lint checks if any identifier in the current context contains a codepoint with a value equal to or greater than 0x80 (outside ASCII range). Not only locally defined identifiers are checked but also those imported from other crates and modules into the current context. + +## Confusable detection + +Rust compilers should detect confusingly similar Unicode identifiers and warn the user about it. + +Note: This is *not* a mandatory for all Rust compilers as it requires considerable implementation effort and is not related to the core function of the compiler. It rather is a tool to detect accidental misspellings and intentional homograph attacks. + +A new `confusable_unicode_idents` lint is added to the compiler. The default setting is `warn`. + +Note: The confusable detection is set to `warn` instead of `deny` to enable forward compatibility. The list of confusable characters will be extended in the future and programs that were once valid would fail to compile. + +The confusable detection algorithm is based on [Unicode® Technical Standard #39 Unicode Security Mechanisms Section 4 Confusable Detection][TR39Confusable]. For every distinct identifier X in the current scope execute the function `skeleton(X)`. 
If there exist two distinct identifiers X and Yin the same crate where `skeleton(X) = skeleton(Y)` report it. + +# Drawbacks +[drawbacks]: #drawbacks + +* "ASCII is enough for anyone." As source code should be written in English and in English only (source: various people) no charactes outside the ASCII range are needed to express identifiers. Therefore support for Unicode identifiers introduces unnecceray complexity to the compiler. +* "Foreign characters are hard to type." Usually computer keyboards provide access to the US-ASCII printable characters and the local language characters. Characters from other scripts are difficult to type, require entering numeric codes or are not available at all. These characters either need to be copy-pasted or entered with an alternative input method. +* "Foreign characters are hard to read." If one is not familiar with the characters used it can be hard to tell them apart (e.g. φ and ψ) and one may not be able refer to the identifiers in an appropriate way (e.g. "loop" and "trident" instead of phi and psi) +* Homoglyph attacks are possible. Without confusable detection identifiers can be distinct for the compiler but visually the same. Even with confusable detection there are still similar looking characters that may be confused by the casual reader. + +# Rationale and alternatives +[alternatives]: #alternatives + +As stated in [Motivation](#motivation) allowing Unicode identifiers outside the ASCII range improves Rusts accessiability for developers not working in English. Especially in teaching and when the application domain vocabulary is not in English it can be beneficial to use names from the native language. To facilitate this it is necessary to allow a wide range of Unicode character in identifiers. The proposed implementation based on the Unicode TR31 is already used by other programming languages (e.g. Python 3) and is implemented behind the `non_ascii_idents` in *rustc* but lacks the NFKC normalization proposed. 
+ +Possible variants: + +1. Require all identifiers to be in NFKC or NFC form. +2. Two identifiers are only equal if their codepoints are equal. +3. Perform NFC mapping instead of NFKC mapping for identifiers. +4. Only a number of common scripts could be supported. + +An alternative design would use [Immutable Identifiers][TR31Alternative] as done in [C++]. In this case a list of Unicode codepoints is reserved for syntax (ASCII operators, braces, whitespace) and all other codepoints (including currently unassigned codepoints) are allowed in identifiers. The advantages are that the compiler does not need to know the Unicode character classes XID_Start and XID_Continue for each character and that the set of allowed identifiers never changes. It is disadvantageous that all not explicitly excluded characters at the time of creation can be used in identifiers. This allows developers to create identifiers that can't be recognized as such. It also impedes other uses of Unicode in Rust syntax like custom operators if they were not initially reserved. + +It always a possibility to do nothing and limit identifiers to ASCII. + +It has been suggested that Unicode identifiers should be opt-in instead of opt-out. The proposal chooses opt-out to benefit the international Rust community. New Rust users should not need to search for the configuration option they may not even know exists. Additionally it simplifies tutorials in other languages as they can omit an annotation in every code snippet. + +## Confusable detection + +The current design was chosen because the algorithm and list of similar characters are already provided by the Unicode Consortium. A different algorithm and list of characters could be created. I am not aware of any other programming language implementing confusable detection. The confusable detection was primarily included because homoglyph attacks are a huge concern for some member of the community. 
+ +Instead of offering confusable detection the lint `forbid(unicode_idents)` is sufficient to protect project written in English from homoglyph attacks. Projects using different languages are probably either written by students, by a small group or inside a regional company. These projects are not threatened as much as large open source projects by homoglyph attacks but still benefit from the easier debugging of typos. + +# Prior art +[prior-art]: #prior-art + +"[Python PEP 3131][PEP 3131]: Supporting Non-ASCII Identifiers" is the Python equivalent to this proposal. The proposed identifier grammar **XID_Start XID_Continue\*** is identical to the one used in Python 3. + +[JavaScript] supports Unicode identifiers based on the same Default Identifier Syntax but does not apply normalization. + +The [CPP reference][C++] describes the allowed Unicode identifiers it is based on the immutable identifier principle. + +[Java] also supports Unicode identifiers. Character must belong to a number of Unicode character classes similar to XID_start and XID_continue used in Python. Unlike in Python no normalization is performed. + +# Unresolved questions +[unresolved]: #unresolved-questions + +* Which context is adequate for confusable detection: file, current scope, crate? +* Are Unicode characters allowed in `no_mangle` and `extern fn`s? +* How do Unicode names interact with the file system? +* Are crates with Unicode names allowed and can they be published to crates.io? +* Are `unicode_idents` and `confusable_unicode_idents` good names? 
+ +[PEP 3131]: https://www.python.org/dev/peps/pep-3131/ +[TR15]: https://www.unicode.org/reports/tr15/ +[TR31]: http://www.unicode.org/reports/tr31/ +[TR31Alternative]: http://unicode.org/reports/tr31/#Alternative_Identifier_Syntax +[TR39Confusable]: https://www.unicode.org/reports/tr39/#Confusable_Detection +[C++]: https://en.cppreference.com/w/cpp/language/identifiers +[Julia Unicode PR]: https://github.com/JuliaLang/julia/pull/19464 +[Java]: https://docs.oracle.com/javase/specs/jls/se10/html/jls-3.html#jls-3.8 +[JavaScript]: http://www.ecma-international.org/ecma-262/6.0/#sec-names-and-keywords \ No newline at end of file From 4c1bda90b334a3ccae278fda6f111b5a414f6419 Mon Sep 17 00:00:00 2001 From: Pyfisch Date: Mon, 4 Jun 2018 09:22:37 +0200 Subject: [PATCH 02/26] Include expected Usage Notes and minor changes Raise two more questions. Suggest restriction levels as an alternative design. Describe the Go language identifier syntax. --- text/0000-unicode-idents.md | 27 ++++++++++++++++++++++----- 1 file changed, 22 insertions(+), 5 deletions(-) diff --git a/text/0000-unicode-idents.md b/text/0000-unicode-idents.md index 6fad20b1bff..8ab66cc1f84 100644 --- a/text/0000-unicode-idents.md +++ b/text/0000-unicode-idents.md @@ -36,10 +36,10 @@ Examples of valid identifiers are: Examples of invalid identifiers are: * Keywords: `impl`, `fn`, `_` (underscore), ... -* Identifiers starting with numbers or "non letters": `42_the_answer`, `third√of7`, `◆◆◆`, ... +* Identifiers starting with numbers or containing "non letters": `42_the_answer`, `third√of7`, `◆◆◆`, ... * Emojis: 🙂, 🦀, 💩, ... -Similar Unicode identifiers are normalized: `a1` and `a₁` refer to the same variable. This also applies to accented characters which can be represented in different ways. +Similar Unicode identifiers are normalized: `a1` and `a₁` (a<subscript 1>) refer to the same variable. This also applies to accented characters which can be represented in different ways. 
To disallow any Unicode identifiers in a project (for example to ease collaboration or for security reasons) limiting the accepted identifiers to ASCII add this lint to the `lib.rs` or `main.rs` file of your project: @@ -47,7 +47,15 @@ To disallow any Unicode identifiers in a project (for example to ease collaborat #![forbid(unicode_idents)] ``` -Some Unicode character look confusingly similiar to each other or even identical like the Latin **A** and the Cyrillic **А**. The compiler may warn you about easy to confuse names in the same scope. If needed (but not recommended) this warning can be silenced with a `#[allow(confusable_unicode_idents)]` annotation on the enclosing function or module. +Some Unicode character look confusingly similar to each other or even identical like the Latin **A** and the Cyrillic **А**. The compiler may warn you about easy to confuse names in the same scope. If needed (but not recommended) this warning can be silenced with a `#[allow(confusable_unicode_idents)]` annotation on the enclosing function or module. + +## Usage notes + +All code written in the Rust Language Organization (*rustc*, tools, std, common crates) will continue to only use ASCII identifiers and the English language. + +For open source crates it is recommended to write them in English and use ASCII-only. An exception should be made if the application domain (e.g. math) benefits from Unicode and the target audience (e.g. for a crate interfacing with Russian passports) is comfortable with the used language and characters. Additionally crates should provide an ASCII-only API. + +Private projects can use any script and language the developer(s) desire. It is still a good idea (as with any language feature) not to overuse it. # Reference-level explanation [reference-level-explanation]: #reference-level-explanation @@ -88,12 +96,13 @@ The confusable detection algorithm is based on [Unicode® Technical Standard #39 * "ASCII is enough for anyone." 
As source code should be written in English and in English only (source: various people) no charactes outside the ASCII range are needed to express identifiers. Therefore support for Unicode identifiers introduces unnecceray complexity to the compiler. * "Foreign characters are hard to type." Usually computer keyboards provide access to the US-ASCII printable characters and the local language characters. Characters from other scripts are difficult to type, require entering numeric codes or are not available at all. These characters either need to be copy-pasted or entered with an alternative input method. * "Foreign characters are hard to read." If one is not familiar with the characters used it can be hard to tell them apart (e.g. φ and ψ) and one may not be able refer to the identifiers in an appropriate way (e.g. "loop" and "trident" instead of phi and psi) +* "My favorite terminal/text editor/web browser" has incomplete Unicode support." Even in 2018 some characters are not widely supported in all places where source code is usually displayed. * Homoglyph attacks are possible. Without confusable detection identifiers can be distinct for the compiler but visually the same. Even with confusable detection there are still similar looking characters that may be confused by the casual reader. # Rationale and alternatives [alternatives]: #alternatives -As stated in [Motivation](#motivation) allowing Unicode identifiers outside the ASCII range improves Rusts accessiability for developers not working in English. Especially in teaching and when the application domain vocabulary is not in English it can be beneficial to use names from the native language. To facilitate this it is necessary to allow a wide range of Unicode character in identifiers. The proposed implementation based on the Unicode TR31 is already used by other programming languages (e.g. Python 3) and is implemented behind the `non_ascii_idents` in *rustc* but lacks the NFKC normalization proposed. 
+As stated in [Motivation](#motivation) allowing Unicode identifiers outside the ASCII range improves Rusts accessibility for developers not working in English. Especially in teaching and when the application domain vocabulary is not in English it can be beneficial to use names from the native language. To facilitate this it is necessary to allow a wide range of Unicode character in identifiers. The proposed implementation based on the Unicode TR31 is already used by other programming languages (e.g. Python 3) and is implemented behind the `non_ascii_idents` in *rustc* but lacks the NFKC normalization proposed. Possible variants: @@ -101,6 +110,7 @@ Possible variants: 2. Two identifiers are only equal if their codepoints are equal. 3. Perform NFC mapping instead of NFKC mapping for identifiers. 4. Only a number of common scripts could be supported. +5. A [restriction level][TR39Restriction] is specified allowing only a subset of scripts and limit script-mixing within an identifier. An alternative design would use [Immutable Identifiers][TR31Alternative] as done in [C++]. In this case a list of Unicode codepoints is reserved for syntax (ASCII operators, braces, whitespace) and all other codepoints (including currently unassigned codepoints) are allowed in identifiers. The advantages are that the compiler does not need to know the Unicode character classes XID_Start and XID_Continue for each character and that the set of allowed identifiers never changes. It is disadvantageous that all not explicitly excluded characters at the time of creation can be used in identifiers. This allows developers to create identifiers that can't be recognized as such. It also impedes other uses of Unicode in Rust syntax like custom operators if they were not initially reserved. @@ -125,6 +135,8 @@ The [CPP reference][C++] describes the allowed Unicode identifiers it is based o [Java] also supports Unicode identifiers. 
Character must belong to a number of Unicode character classes similar to XID_start and XID_continue used in Python. Unlike in Python no normalization is performed. +The [Go language][Go] allows identifiers in the form **Letter (Letter | Number)\*** where **Letter** is a Unicode letter and **Number** is a Unicode decimal number. This is more restricted than the proposed design mainly as it does not allow combining characters needed to write some languages such as Hindi. + # Unresolved questions [unresolved]: #unresolved-questions @@ -133,13 +145,18 @@ The [CPP reference][C++] describes the allowed Unicode identifiers it is based o * How do Unicode names interact with the file system? * Are crates with Unicode names allowed and can they be published to crates.io? * Are `unicode_idents` and `confusable_unicode_idents` good names? +* Should [ZWNJ and ZWJ be allowed in identifiers][TR31Layout]? +* Should *rustc* accept files in a different encoding than *UTF-8*? [PEP 3131]: https://www.python.org/dev/peps/pep-3131/ [TR15]: https://www.unicode.org/reports/tr15/ [TR31]: http://www.unicode.org/reports/tr31/ [TR31Alternative]: http://unicode.org/reports/tr31/#Alternative_Identifier_Syntax +[TR31Layout]: https://www.unicode.org/reports/tr31/#Layout_and_Format_Control_Characters [TR39Confusable]: https://www.unicode.org/reports/tr39/#Confusable_Detection +[TR39Restriction]: https://www.unicode.org/reports/tr39/#Restriction_Level_Detection [C++]: https://en.cppreference.com/w/cpp/language/identifiers [Julia Unicode PR]: https://github.com/JuliaLang/julia/pull/19464 [Java]: https://docs.oracle.com/javase/specs/jls/se10/html/jls-3.html#jls-3.8 -[JavaScript]: http://www.ecma-international.org/ecma-262/6.0/#sec-names-and-keywords \ No newline at end of file +[JavaScript]: http://www.ecma-international.org/ecma-262/6.0/#sec-names-and-keywords +[Go]: https://golang.org/ref/spec#Identifiers \ No newline at end of file From 619f5b4ed000dffaa504aaf1502ac232d7bd2e52 Mon Sep 17 00:00:00
2001 From: Pyfisch Date: Mon, 4 Jun 2018 21:54:06 +0200 Subject: [PATCH 03/26] Improve descriptions and fix typos Thanks to SimonSapin for the suggestions. --- text/0000-unicode-idents.md | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/text/0000-unicode-idents.md b/text/0000-unicode-idents.md index 8ab66cc1f84..0841b130e3e 100644 --- a/text/0000-unicode-idents.md +++ b/text/0000-unicode-idents.md @@ -28,8 +28,7 @@ Identifiers include variable names, function and trait names and module names. T Examples of valid identifiers are: -* English language words: `color`, `image_width`, `line2`, `Photo`, `_unused`, ... -* ASCII words in foreign languages: `die_eisenbahn`, `el_tren`, `artikel_1_grundgesetz` +* ACII letters and digits: `image_width`, `line2`, `Photo`, `el_tren`, `_unused` * words containing accented characters: `garçon`, `hühnervögel` * identifiers in other scripts: `Москва`, `東京`, ... @@ -53,9 +52,9 @@ Some Unicode character look confusingly similar to each other or even identical All code written in the Rust Language Organization (*rustc*, tools, std, common crates) will continue to only use ASCII identifiers and the English language. -For open source crates it is recommended to write them in English and use ASCII-only. An exception should be made if the application domain (e.g. math) benefits from Unicode and the target audience (e.g. for a crate interfacing with Russian passports) is comfortable with the used language and characters. Additionally crates should provide an ASCII-only API. +For open source crates it is suggested to write them in English and use ASCII-only. An exception can be made if the application domain (e.g. math) benefits from Unicode and the target audience (e.g. for a crate interfacing with Russian passports) is comfortable with the used language and characters. Additionally crates should consider to provide an ASCII-only API. -Private projects can use any script and language the developer(s) desire. 
It is still a good idea (as with any language feature) not to overuse it. +Private projects can use any script and language the developer(s) desire. It is still a good idea (as with any language feature) not to overdo it. # Reference-level explanation [reference-level-explanation]: #reference-level-explanation @@ -74,7 +73,7 @@ The lexer defines identifiers as: `XID_Start` and `XID_Continue` are used as defined in the aforementioned standard. The definition of identifiers is forward compatible with each successive release of Unicode as only appropriate new characters are added to the classes but none are removed. -Two identifiers X, Y are considered to be equal if there [NFKC forms][TR15] are equal: NFKC(X) = NFKC(Y). +Two identifiers X, Y are considered to be equal if their [NFKC forms][TR15] are equal: NFKC(X) = NFKC(Y). A `unicode_idents` lint is added to the compiler. This lint is `allow` by default. The lint checks if any identifier in the current context contains a codepoint with a value equal to or greater than 0x80 (outside ASCII range). Not only locally defined identifiers are checked but also those imported from other crates and modules into the current context. 
@@ -159,4 +158,4 @@ The [Go language][Go] allows identifiers in the form **Letter (Letter | Number)\ [Julia Unicode PR]: https://github.com/JuliaLang/julia/pull/19464 [Java]: https://docs.oracle.com/javase/specs/jls/se10/html/jls-3.html#jls-3.8 [JavaScript]: http://www.ecma-international.org/ecma-262/6.0/#sec-names-and-keywords -[Go]: https://golang.org/ref/spec#Identifiers \ No newline at end of file +[Go]: https://golang.org/ref/spec#Identifiers From 142d0bc3ece44bc0773c269fafd9e5077b3bf287 Mon Sep 17 00:00:00 2001 From: Pyfisch Date: Mon, 4 Jun 2018 21:55:20 +0200 Subject: [PATCH 04/26] ACII -> ASCII --- text/0000-unicode-idents.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/0000-unicode-idents.md b/text/0000-unicode-idents.md index 0841b130e3e..ee10d27ee3b 100644 --- a/text/0000-unicode-idents.md +++ b/text/0000-unicode-idents.md @@ -28,7 +28,7 @@ Identifiers include variable names, function and trait names and module names. T Examples of valid identifiers are: -* ACII letters and digits: `image_width`, `line2`, `Photo`, `el_tren`, `_unused` +* ASCII letters and digits: `image_width`, `line2`, `Photo`, `el_tren`, `_unused` * words containing accented characters: `garçon`, `hühnervögel` * identifiers in other scripts: `Москва`, `東京`, ... 
From 6b2a94a58ef93bb80b7f9b859a2de6a39ec86431 Mon Sep 17 00:00:00 2001 From: Pyfisch Date: Thu, 7 Jun 2018 17:19:04 +0200 Subject: [PATCH 05/26] Typos, renames and a minor reference change unicode_idents -> non_ascii_idents Remove mention of exact spec revision Describe more how to implement confusable detection and remove mention of scope fix typo --- ...ode-idents.md => 0000-non-ascii-idents.md} | 26 +++++++++++-------- 1 file changed, 15 insertions(+), 11 deletions(-) rename text/{0000-unicode-idents.md => 0000-non-ascii-idents.md} (86%) diff --git a/text/0000-unicode-idents.md b/text/0000-non-ascii-idents.md similarity index 86% rename from text/0000-unicode-idents.md rename to text/0000-non-ascii-idents.md index ee10d27ee3b..059e9e62b8c 100644 --- a/text/0000-unicode-idents.md +++ b/text/0000-non-ascii-idents.md @@ -1,4 +1,4 @@ -- Feature Name: unicode_idents +- Feature Name: non_ascii_idents - Start Date: 2018-06-03 - RFC PR: (leave this empty) - Rust Issue: (leave this empty) @@ -36,17 +36,17 @@ Examples of invalid identifiers are: * Keywords: `impl`, `fn`, `_` (underscore), ... * Identifiers starting with numbers or containing "non letters": `42_the_answer`, `third√of7`, `◆◆◆`, ... -* Emojis: 🙂, 🦀, 💩, ... +* Many Emojis: 🙂, 🦀, 💩, ... Similar Unicode identifiers are normalized: `a1` and `a₁` (a<subscript 1>) refer to the same variable. This also applies to accented characters which can be represented in different ways. To disallow any Unicode identifiers in a project (for example to ease collaboration or for security reasons) limiting the accepted identifiers to ASCII add this lint to the `lib.rs` or `main.rs` file of your project: ```rust -#![forbid(unicode_idents)] +#![forbid(non_ascii_idents)] ``` -Some Unicode character look confusingly similar to each other or even identical like the Latin **A** and the Cyrillic **А**. The compiler may warn you about easy to confuse names in the same scope. 
If needed (but not recommended) this warning can be silenced with a `#[allow(confusable_unicode_idents)]` annotation on the enclosing function or module. +Some Unicode character look confusingly similar to each other or even identical like the Latin **A** and the Cyrillic **А**. The compiler may warn you about easy to confuse names in the same scope. If needed (but not recommended) this warning can be silenced with a `#[allow(confusable_non_ascii_idents)]` annotation on the enclosing function or module. ## Usage notes @@ -59,7 +59,9 @@ Private projects can use any script and language the developer(s) desire. It is # Reference-level explanation [reference-level-explanation]: #reference-level-explanation -Identifiers in Rust are based on the [Unicode® Standard Annex #31 Unicode Identifier and Pattern Syntax][TR31]. Rust compilers shall use at least Revision 27 of the standard. +Identifiers in Rust are based on the [Unicode® Standard Annex #31 Unicode Identifier and Pattern Syntax][UAX31]. + +Note: The supported Unicode version should be stated in the documentation. The lexer defines identifiers as: @@ -75,7 +77,7 @@ The lexer defines identifiers as: Two identifiers X, Y are considered to be equal if their [NFKC forms][TR15] are equal: NFKC(X) = NFKC(Y). -A `unicode_idents` lint is added to the compiler. This lint is `allow` by default. The lint checks if any identifier in the current context contains a codepoint with a value equal to or greater than 0x80 (outside ASCII range). Not only locally defined identifiers are checked but also those imported from other crates and modules into the current context. +A `non_ascii_idents` lint is added to the compiler. This lint is `allow` by default. The lint checks if any identifier in the current context contains a codepoint with a value equal to or greater than 0x80 (outside ASCII range). Not only locally defined identifiers are checked but also those imported from other crates and modules into the current context. 
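The check described for the `non_ascii_idents` lint can be sketched roughly as follows. This is only an illustration of the "codepoint equal to or greater than 0x80" rule; the helper name `has_non_ascii` is hypothetical and is not a rustc API.

```rust
/// Hypothetical helper (not part of rustc): returns true if the identifier
/// contains any code point with a value of 0x80 or greater, i.e. any
/// character outside the ASCII range.
fn has_non_ascii(ident: &str) -> bool {
    ident.chars().any(|c| (c as u32) >= 0x80)
}

fn main() {
    // Example identifiers taken from the guide-level explanation.
    for ident in ["image_width", "garçon", "Москва"] {
        println!("{}: non-ASCII = {}", ident, has_non_ascii(ident));
    }
}
```

A real lint would also have to visit identifiers imported from other crates and modules, as the paragraph above states; this sketch only covers the per-identifier test.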
## Confusable detection @@ -83,11 +85,13 @@ Rust compilers should detect confusingly similar Unicode identifiers and warn th Note: This is *not* a mandatory for all Rust compilers as it requires considerable implementation effort and is not related to the core function of the compiler. It rather is a tool to detect accidental misspellings and intentional homograph attacks. -A new `confusable_unicode_idents` lint is added to the compiler. The default setting is `warn`. +A new `confusable_non_ascii_idents` lint is added to the compiler. The default setting is `warn`. Note: The confusable detection is set to `warn` instead of `deny` to enable forward compatibility. The list of confusable characters will be extended in the future and programs that were once valid would fail to compile. -The confusable detection algorithm is based on [Unicode® Technical Standard #39 Unicode Security Mechanisms Section 4 Confusable Detection][TR39Confusable]. For every distinct identifier X in the current scope execute the function `skeleton(X)`. If there exist two distinct identifiers X and Yin the same crate where `skeleton(X) = skeleton(Y)` report it. +The confusable detection algorithm is based on [Unicode® Technical Standard #39 Unicode Security Mechanisms Section 4 Confusable Detection][TR39Confusable]. For every distinct identifier X execute the function `skeleton(X)`. If there exist two distinct identifiers X and Y in the same crate where `skeleton(X) = skeleton(Y)` report it. + +Note: A fast way to implement this is to compute `skeleton` for each identifier once and place the result in a hashmap as a key. If one tries to insert a key that already exists check if the two identifiers differ from each other. If so report the two confusable identifiers. 
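The hashmap approach from the note above might look like the following sketch. It assumes a toy `toy_skeleton` function that maps only three Cyrillic letters to their Latin look-alikes; the real `skeleton` function from UTS #39 relies on normalization and the full confusables data, neither of which is reproduced here. The names `toy_skeleton` and `find_confusables` are hypothetical.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Toy stand-in for the UTS #39 `skeleton` function (an assumption for
/// illustration only). It maps three Cyrillic letters to Latin look-alikes
/// and leaves every other character unchanged.
fn toy_skeleton(ident: &str) -> String {
    ident
        .chars()
        .map(|c| match c {
            '\u{0410}' => 'A', // Cyrillic capital A
            '\u{0430}' => 'a', // Cyrillic small a
            '\u{043E}' => 'o', // Cyrillic small o
            other => other,
        })
        .collect()
}

/// Returns pairs of distinct identifiers that share the same skeleton.
/// Each skeleton is computed once and used as a hashmap key.
fn find_confusables<'a>(idents: &[&'a str]) -> Vec<(&'a str, &'a str)> {
    let mut seen: HashMap<String, &'a str> = HashMap::new();
    let mut report = Vec::new();
    for &ident in idents {
        match seen.entry(toy_skeleton(ident)) {
            // Key already present: report only if the identifiers differ.
            Entry::Occupied(entry) => {
                let first = *entry.get();
                if first != ident {
                    report.push((first, ident));
                }
            }
            // First identifier with this skeleton: remember it.
            Entry::Vacant(entry) => {
                entry.insert(ident);
            }
        }
    }
    report
}

fn main() {
    // The second identifier starts with Cyrillic А (U+0410).
    let idents = ["Apple", "\u{0410}pple", "Apple", "pear"];
    for (a, b) in find_confusables(&idents) {
        println!("confusable: {:?} and {:?}", a, b);
    }
}
```

Repeated occurrences of the same identifier are not reported, which matches the "check if the two identifiers differ" step in the note.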
# Drawbacks [drawbacks]: #drawbacks @@ -121,7 +125,7 @@ It has been suggested that Unicode identifiers should be opt-in instead of opt-o The current design was chosen because the algorithm and list of similar characters are already provided by the Unicode Consortium. A different algorithm and list of characters could be created. I am not aware of any other programming language implementing confusable detection. The confusable detection was primarily included because homoglyph attacks are a huge concern for some member of the community. -Instead of offering confusable detection the lint `forbid(unicode_idents)` is sufficient to protect project written in English from homoglyph attacks. Projects using different languages are probably either written by students, by a small group or inside a regional company. These projects are not threatened as much as large open source projects by homoglyph attacks but still benefit from the easier debugging of typos. +Instead of offering confusable detection the lint `forbid(non_ascii_idents)` is sufficient to protect project written in English from homoglyph attacks. Projects using different languages are probably either written by students, by a small group or inside a regional company. These projects are not threatened as much as large open source projects by homoglyph attacks but still benefit from the easier debugging of typos. # Prior art [prior-art]: #prior-art @@ -143,13 +147,13 @@ The [Go language][Go] allows identifiers in the form **Letter (Letter | Number)\ * Are Unicode characters allowed in `no_mangle` and `extern fn`s? * How do Unicode names interact with the file system? * Are crates with Unicode names allowed and can they be published to crates.io? -* Are `unicode_idents` and `confusable_unicode_idents` good names? +* Are `non_ascii_idents` and `confusable_non_ascii_idents` good names? * Should [ZWNJ and ZWJ be allowed in identifiers][TR31Layout]? * Should *rustc* accept files in a different encoding than *UTF-8*? 
[PEP 3131]: https://www.python.org/dev/peps/pep-3131/ +[UAX31]: http://www.unicode.org/reports/tr31/ [TR15]: https://www.unicode.org/reports/tr15/ -[TR31]: http://www.unicode.org/reports/tr31/ [TR31Alternative]: http://unicode.org/reports/tr31/#Alternative_Identifier_Syntax [TR31Layout]: https://www.unicode.org/reports/tr31/#Layout_and_Format_Control_Characters [TR39Confusable]: https://www.unicode.org/reports/tr39/#Confusable_Detection From 3e19d26e6e5998a795326a0a24e8338fe779f766 Mon Sep 17 00:00:00 2001 From: Pyfisch Date: Fri, 8 Jun 2018 20:27:13 +0200 Subject: [PATCH 06/26] Update Reference-level explanation --- text/0000-non-ascii-idents.md | 24 +++++++++++++++--------- 1 file changed, 15 insertions(+), 9 deletions(-) diff --git a/text/0000-non-ascii-idents.md b/text/0000-non-ascii-idents.md index 059e9e62b8c..a530192c432 100644 --- a/text/0000-non-ascii-idents.md +++ b/text/0000-non-ascii-idents.md @@ -38,7 +38,7 @@ Examples of invalid identifiers are: * Identifiers starting with numbers or containing "non letters": `42_the_answer`, `third√of7`, `◆◆◆`, ... * Many Emojis: 🙂, 🦀, 💩, ... -Similar Unicode identifiers are normalized: `a1` and `a₁` (a<subscript 1>) refer to the same variable. This also applies to accented characters which can be represented in different ways. +[Composed characters] like those used in the word `ḱṷṓn` can be represented in different ways with Unicode. These different representations are all the same identifier in Rust. To disallow any Unicode identifiers in a project (for example to ease collaboration or for security reasons) limiting the accepted identifiers to ASCII add this lint to the `lib.rs` or `main.rs` file of your project: @@ -75,7 +75,7 @@ The lexer defines identifiers as: `XID_Start` and `XID_Continue` are used as defined in the aforementioned standard. 
The definition of identifiers is forward compatible with each successive release of Unicode as only appropriate new characters are added to the classes but none are removed. -Two identifiers X, Y are considered to be equal if their [NFKC forms][TR15] are equal: NFKC(X) = NFKC(Y). +Parsers for Rust syntax normalize identifiers to [NFC][UAX15]. Every API accepting raw identifiers (such as `proc_macro::Ident::new` normalizes them to NFC and APIs returning them as strings (like `proc_macro::Ident::to_string`) return the normalized form. This means two identifiers are equal if their NFC forms are equal. A `non_ascii_idents` lint is added to the compiler. This lint is `allow` by default. The lint checks if any identifier in the current context contains a codepoint with a value equal to or greater than 0x80 (outside ASCII range). Not only locally defined identifiers are checked but also those imported from other crates and modules into the current context. @@ -105,13 +105,19 @@ Note: A fast way to implement this is to compute `skeleton` for each identifier # Rationale and alternatives [alternatives]: #alternatives -As stated in [Motivation](#motivation) allowing Unicode identifiers outside the ASCII range improves Rusts accessibility for developers not working in English. Especially in teaching and when the application domain vocabulary is not in English it can be beneficial to use names from the native language. To facilitate this it is necessary to allow a wide range of Unicode character in identifiers. The proposed implementation based on the Unicode TR31 is already used by other programming languages (e.g. Python 3) and is implemented behind the `non_ascii_idents` in *rustc* but lacks the NFKC normalization proposed. +As stated in [Motivation](#motivation) allowing Unicode identifiers outside the ASCII range improves Rusts accessibility for developers not working in English. 
Especially in teaching and when the application domain vocabulary is not in English it can be beneficial to use names from the native language. To facilitate this it is necessary to allow a wide range of Unicode character in identifiers. The proposed implementation based on the Unicode TR31 is already used by other programming languages and is implemented behind the `non_ascii_idents` in *rustc* but lacks the NFC normalization proposed. + +NFC normalization was chosen over NFKC normalization for the following reasons: + +* [Mathematicians want to use symbols mapped to the same NFKC form](https://github.com/rust-lang/rfcs/pull/2457#issuecomment-394928432) like π and ϖ in the same context. +* [Some words are mangled by NFKC](https://github.com/rust-lang/rfcs/pull/2457#issuecomment-394922103) in surprising ways. +* Naive (search) tools can't find different variants of the same NFKC identifier. As most text is already in NFC form search tools work well. Possible variants: -1. Require all identifiers to be in NFKC or NFC form. +1. Require all identifiers to be already in NFC form. 2. Two identifiers are only equal if their codepoints are equal. -3. Perform NFC mapping instead of NFKC mapping for identifiers. +3. Perform NFKC mapping instead of NFC mapping for identifiers. 4. Only a number of common scripts could be supported. 5. A [restriction level][TR39Restriction] is specified allowing only a subset of scripts and limit script-mixing within an identifier. @@ -123,9 +129,9 @@ It has been suggested that Unicode identifiers should be opt-in instead of opt-o ## Confusable detection -The current design was chosen because the algorithm and list of similar characters are already provided by the Unicode Consortium. A different algorithm and list of characters could be created. I am not aware of any other programming language implementing confusable detection. 
The confusable detection was primarily included because homoglyph attacks are a huge concern for some member of the community. +The current design was chosen because the algorithm and list of similar characters are already provided by the Unicode Consortium. A different algorithm and list of characters could be created. I am not aware of any other programming language implementing confusable detection. The confusable detection was primarily included because homoglyph attacks are a huge concern for some members of the community. -Instead of offering confusable detection the lint `forbid(non_ascii_idents)` is sufficient to protect project written in English from homoglyph attacks. Projects using different languages are probably either written by students, by a small group or inside a regional company. These projects are not threatened as much as large open source projects by homoglyph attacks but still benefit from the easier debugging of typos. +Instead of offering confusable detection the lint `forbid(non_ascii_idents)` is sufficient to protect a project written in English from homoglyph attacks. Projects using different languages are probably either written by students, by a small group or inside a regional company. These projects are not threatened as much as large open source projects by homoglyph attacks but still benefit from the easier debugging of typos. # Prior art [prior-art]: #prior-art @@ -149,11 +155,10 @@ The [Go language][Go] allows identifiers in the form **Letter (Letter | Number)\ * Are crates with Unicode names allowed and can they be published to crates.io? * Are `non_ascii_idents` and `confusable_non_ascii_idents` good names? * Should [ZWNJ and ZWJ be allowed in identifiers][TR31Layout]? -* Should *rustc* accept files in a different encoding than *UTF-8*? 
[PEP 3131]: https://www.python.org/dev/peps/pep-3131/ [UAX31]: http://www.unicode.org/reports/tr31/ -[TR15]: https://www.unicode.org/reports/tr15/ +[UAX15]: https://www.unicode.org/reports/tr15/ [TR31Alternative]: http://unicode.org/reports/tr31/#Alternative_Identifier_Syntax [TR31Layout]: https://www.unicode.org/reports/tr31/#Layout_and_Format_Control_Characters [TR39Confusable]: https://www.unicode.org/reports/tr39/#Confusable_Detection @@ -163,3 +168,4 @@ The [Go language][Go] allows identifiers in the form **Letter (Letter | Number)\ [Java]: https://docs.oracle.com/javase/specs/jls/se10/html/jls-3.html#jls-3.8 [JavaScript]: http://www.ecma-international.org/ecma-262/6.0/#sec-names-and-keywords [Go]: https://golang.org/ref/spec#Identifiers +[Composed characters]: https://en.wikipedia.org/wiki/Precomposed_character From a4830a13a98e3832e96a11d5a58556af836defc7 Mon Sep 17 00:00:00 2001 From: Pyfisch Date: Sat, 9 Jun 2018 17:58:29 +0200 Subject: [PATCH 07/26] Consider identifiers for confusable detection Rewrite the Motivation section. --- text/0000-non-ascii-idents.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/text/0000-non-ascii-idents.md b/text/0000-non-ascii-idents.md index a530192c432..fac6a5a6af4 100644 --- a/text/0000-non-ascii-idents.md +++ b/text/0000-non-ascii-idents.md @@ -11,7 +11,7 @@ Allow non-ASCII letters (such as accented characters, Cyrillic, Greek, Kanji, et # Motivation [motivation]: #motivation -Rust is written by many people who are not fluent in the English language. Using identifiers in ones native language eases writing and reading code for these developers. +Writing code using domain-specific terminology simplifies implementation and discussion as opposed to translating words from the project requirements. 
When the code is only intended for a limited audience such as with in-house projects or in teaching it can be beneficial to write code in the group's language as it boosts communication and helps people not fluent in English to participate and write Rust code themselves. The rationale from [PEP 3131] nicely explains it: @@ -89,7 +89,7 @@ A new `confusable_non_ascii_idents` lint is added to the compiler. The default s Note: The confusable detection is set to `warn` instead of `deny` to enable forward compatibility. The list of confusable characters will be extended in the future and programs that were once valid would fail to compile. -The confusable detection algorithm is based on [Unicode® Technical Standard #39 Unicode Security Mechanisms Section 4 Confusable Detection][TR39Confusable]. For every distinct identifier X execute the function `skeleton(X)`. If there exist two distinct identifiers X and Y in the same crate where `skeleton(X) = skeleton(Y)` report it. +The confusable detection algorithm is based on [Unicode® Technical Standard #39 Unicode Security Mechanisms Section 4 Confusable Detection][TR39Confusable]. For every distinct identifier X execute the function `skeleton(X)`. If there exist two distinct identifiers X and Y in the same crate where `skeleton(X) = skeleton(Y)` report it. The compiler uses the same mechanism to check if an identifier is too similar to a keyword. Note: A fast way to implement this is to compute `skeleton` for each identifier once and place the result in a hashmap as a key. If one tries to insert a key that already exists check if the two identifiers differ from each other. If so report the two confusable identifiers. 
From 12d0623ea47b9afe2002aa170df9f546823286aa Mon Sep 17 00:00:00 2001 From: Pyfisch Date: Sun, 10 Jun 2018 10:36:50 +0200 Subject: [PATCH 08/26] Note difference between Python and Rust --- text/0000-non-ascii-idents.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/0000-non-ascii-idents.md b/text/0000-non-ascii-idents.md index fac6a5a6af4..db7063ff52d 100644 --- a/text/0000-non-ascii-idents.md +++ b/text/0000-non-ascii-idents.md @@ -136,7 +136,7 @@ Instead of offering confusable detection the lint `forbid(non_ascii_idents)` is # Prior art [prior-art]: #prior-art -"[Python PEP 3131][PEP 3131]: Supporting Non-ASCII Identifiers" is the Python equivalent to this proposal. The proposed identifier grammar **XID_Start XID_Continue\*** is identical to the one used in Python 3. +"[Python PEP 3131][PEP 3131]: Supporting Non-ASCII Identifiers" is the Python equivalent to this proposal. The proposed identifier grammar **XID_Start XID_Continue\*** is identical to the one used in Python 3. While Python uses KC normalization this proposes to use normalization form C. [JavaScript] supports Unicode identifiers based on the same Default Identifier Syntax but does not apply normalization. From 79bbc8e3ac2fce26fbd9acba5a1506b0f5a71ac4 Mon Sep 17 00:00:00 2001 From: Pyfisch Date: Sun, 10 Jun 2018 13:40:34 +0200 Subject: [PATCH 09/26] Remove mention of scope from guide explanation --- text/0000-non-ascii-idents.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/0000-non-ascii-idents.md b/text/0000-non-ascii-idents.md index db7063ff52d..884e414bee7 100644 --- a/text/0000-non-ascii-idents.md +++ b/text/0000-non-ascii-idents.md @@ -46,7 +46,7 @@ To disallow any Unicode identifiers in a project (for example to ease collaborat #![forbid(non_ascii_idents)] ``` -Some Unicode character look confusingly similar to each other or even identical like the Latin **A** and the Cyrillic **А**. 
The compiler may warn you about easy to confuse names in the same scope. If needed (but not recommended) this warning can be silenced with a `#[allow(confusable_non_ascii_idents)]` annotation on the enclosing function or module. +Some Unicode character look confusingly similar to each other or even identical like the Latin **A** and the Cyrillic **А**. The compiler may warn you about names that are easy to confuse with keywords, names from the same crate and imported items. If needed (but not recommended) this warning can be silenced with a `#[allow(confusable_non_ascii_idents)]` annotation on the enclosing function or module. ## Usage notes From 41f07232f9bcab41f394b1f2a8bd8fb6fd270369 Mon Sep 17 00:00:00 2001 From: Pyfisch Date: Sun, 10 Jun 2018 21:03:07 +0200 Subject: [PATCH 10/26] Rename confusable_non_ascii_idents to confusable_idents --- text/0000-non-ascii-idents.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/text/0000-non-ascii-idents.md b/text/0000-non-ascii-idents.md index 884e414bee7..5acfd8a2797 100644 --- a/text/0000-non-ascii-idents.md +++ b/text/0000-non-ascii-idents.md @@ -46,7 +46,7 @@ To disallow any Unicode identifiers in a project (for example to ease collaborat #![forbid(non_ascii_idents)] ``` -Some Unicode character look confusingly similar to each other or even identical like the Latin **A** and the Cyrillic **А**. The compiler may warn you about names that are easy to confuse with keywords, names from the same crate and imported items. If needed (but not recommended) this warning can be silenced with a `#[allow(confusable_non_ascii_idents)]` annotation on the enclosing function or module. +Some Unicode character look confusingly similar to each other or even identical like the Latin **A** and the Cyrillic **А**. The compiler may warn you about names that are easy to confuse with keywords, names from the same crate and imported items. 
If needed (but not recommended) this warning can be silenced with a `#[allow(confusable_idents)]` annotation on the enclosing function or module. ## Usage notes @@ -85,7 +85,7 @@ Rust compilers should detect confusingly similar Unicode identifiers and warn th Note: This is *not* a mandatory for all Rust compilers as it requires considerable implementation effort and is not related to the core function of the compiler. It rather is a tool to detect accidental misspellings and intentional homograph attacks. -A new `confusable_non_ascii_idents` lint is added to the compiler. The default setting is `warn`. +A new `confusable_idents` lint is added to the compiler. The default setting is `warn`. Note: The confusable detection is set to `warn` instead of `deny` to enable forward compatibility. The list of confusable characters will be extended in the future and programs that were once valid would fail to compile. @@ -153,7 +153,7 @@ The [Go language][Go] allows identifiers in the form **Letter (Letter | Number)\ * Are Unicode characters allowed in `no_mangle` and `extern fn`s? * How do Unicode names interact with the file system? * Are crates with Unicode names allowed and can they be published to crates.io? -* Are `non_ascii_idents` and `confusable_non_ascii_idents` good names? +* Are `non_ascii_idents` and `confusable_idents` good names? * Should [ZWNJ and ZWJ be allowed in identifiers][TR31Layout]? 
[PEP 3131]: https://www.python.org/dev/peps/pep-3131/ From 3c96d812b47fe1f9cdedefdaf00750b6e0cdb9bb Mon Sep 17 00:00:00 2001 From: Pyfisch Date: Tue, 12 Jun 2018 22:11:40 +0200 Subject: [PATCH 11/26] Conformance statement --- text/0000-non-ascii-idents.md | 15 +++++++++++++-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git a/text/0000-non-ascii-idents.md b/text/0000-non-ascii-idents.md index 5acfd8a2797..2184ec6a1b1 100644 --- a/text/0000-non-ascii-idents.md +++ b/text/0000-non-ascii-idents.md @@ -91,12 +91,23 @@ Note: The confusable detection is set to `warn` instead of `deny` to enable forw The confusable detection algorithm is based on [Unicode® Technical Standard #39 Unicode Security Mechanisms Section 4 Confusable Detection][TR39Confusable]. For every distinct identifier X execute the function `skeleton(X)`. If there exist two distinct identifiers X and Y in the same crate where `skeleton(X) = skeleton(Y)` report it. The compiler uses the same mechanism to check if an identifier is too similar to a keyword. -Note: A fast way to implement this is to compute `skeleton` for each identifier once and place the result in a hashmap as a key. If one tries to insert a key that already exists check if the two identifiers differ from each other. If so report the two confusable identifiers. +Note: A fast way to implement this is to compute `skeleton` for each identifier once and place the result in a hashmap as a key. If one tries to insert a key that already exists check if the two identifiers differ from each other. If so report the two confusable identifiers. + +## Conformance Statement + +* UAX31-C1: The Rust language is conforms to the Unicode® Standard Annex #31 for Unicode Version 10.0.0. +* UAX31-C2: It observes the following requirements: + * UAX31-R1. 
Default Identifiers: To determine whether a string is an identifier it uses UAX31-D1 with the following profile: + * Start := XID_Start, plus `_` + * Continue := XID_Continue + * Medial := empty + * UAX31-R1b. Stable Identifiers: Once a string qualifies as an identifier, it does so in all future versions. + * UAX31-R4. Equivalent Normalized Identifiers: All identifiers are normalized according to normalization form C before comparison. # Drawbacks [drawbacks]: #drawbacks -* "ASCII is enough for anyone." As source code should be written in English and in English only (source: various people) no charactes outside the ASCII range are needed to express identifiers. Therefore support for Unicode identifiers introduces unnecceray complexity to the compiler. +* "ASCII is enough for anyone." As source code should be written in English and in English only (source: various people) no characters outside the ASCII range are needed to express identifiers. Therefore support for Unicode identifiers introduces unnecessary complexity to the compiler. * "Foreign characters are hard to type." Usually computer keyboards provide access to the US-ASCII printable characters and the local language characters. Characters from other scripts are difficult to type, require entering numeric codes or are not available at all. These characters either need to be copy-pasted or entered with an alternative input method. * "Foreign characters are hard to read." If one is not familiar with the characters used it can be hard to tell them apart (e.g. φ and ψ) and one may not be able refer to the identifiers in an appropriate way (e.g. "loop" and "trident" instead of phi and psi) * "My favorite terminal/text editor/web browser" has incomplete Unicode support." Even in 2018 some characters are not widely supported in all places where source code is usually displayed. 
From 940dab5843d23e51308361b31751c405f4d04018 Mon Sep 17 00:00:00 2001 From: Pyfisch Date: Tue, 12 Jun 2018 23:09:26 +0200 Subject: [PATCH 12/26] Remove stray "is" --- text/0000-non-ascii-idents.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/0000-non-ascii-idents.md b/text/0000-non-ascii-idents.md index 2184ec6a1b1..ba662934495 100644 --- a/text/0000-non-ascii-idents.md +++ b/text/0000-non-ascii-idents.md @@ -95,7 +95,7 @@ Note: A fast way to implement this is to compute `skeleton` for each identifier ## Conformance Statement -* UAX31-C1: The Rust language is conforms to the Unicode® Standard Annex #31 for Unicode Version 10.0.0. +* UAX31-C1: The Rust language conforms to the Unicode® Standard Annex #31 for Unicode Version 10.0.0. * UAX31-C2: It observes the following requirements: * UAX31-R1. Default Identifiers: To determine whether a string is an identifier it uses UAX31-D1 with the following profile: * Start := XID_Start, plus `_` From da43d09137ff9fa492e595c659d1d60b0be89048 Mon Sep 17 00:00:00 2001 From: Pyfisch Date: Fri, 15 Jun 2018 14:56:37 +0200 Subject: [PATCH 13/26] Add that non-ASCII idents observe UAX31-R3 Thanks to eggrobin for checking. --- text/0000-non-ascii-idents.md | 1 + 1 file changed, 1 insertion(+) diff --git a/text/0000-non-ascii-idents.md b/text/0000-non-ascii-idents.md index ba662934495..3d044b966f6 100644 --- a/text/0000-non-ascii-idents.md +++ b/text/0000-non-ascii-idents.md @@ -102,6 +102,7 @@ Note: A fast way to implement this is to compute `skeleton` for each identifier * Continue := XID_Continue * Medial := empty * UAX31-R1b. Stable Identifiers: Once a string qualifies as an identifier, it does so in all future versions. + * UAX31-R3. Pattern_White_Space and Pattern_Syntax Characters: Rust only uses characters from these categories for whitespace and syntax. Other characters may or may not be allowed in identifiers. * UAX31-R4. 
Equivalent Normalized Identifiers: All identifiers are normalized according to normalization form C before comparison. # Drawbacks From 0e0ca66e4d895d26c00610ab1d1e3cbaaaae69f9 Mon Sep 17 00:00:00 2001 From: Pyfisch Date: Sat, 16 Jun 2018 21:25:36 +0200 Subject: [PATCH 14/26] Add details for fs, extern, lints Postpone file system issues Forbid non-ASCII characters in extern names. Describe "bad style" lints. Remove now resolved questions. --- text/0000-non-ascii-idents.md | 40 ++++++++++++++++++++++++++++++----- 1 file changed, 35 insertions(+), 5 deletions(-) diff --git a/text/0000-non-ascii-idents.md b/text/0000-non-ascii-idents.md index 3d044b966f6..b3a81f4d8fe 100644 --- a/text/0000-non-ascii-idents.md +++ b/text/0000-non-ascii-idents.md @@ -77,7 +77,23 @@ The lexer defines identifiers as: Parsers for Rust syntax normalize identifiers to [NFC][UAX15]. Every API accepting raw identifiers (such as `proc_macro::Ident::new` normalizes them to NFC and APIs returning them as strings (like `proc_macro::Ident::to_string`) return the normalized form. This means two identifiers are equal if their NFC forms are equal. -A `non_ascii_idents` lint is added to the compiler. This lint is `allow` by default. The lint checks if any identifier in the current context contains a codepoint with a value equal to or greater than 0x80 (outside ASCII range). Not only locally defined identifiers are checked but also those imported from other crates and modules into the current context. +A `non_ascii_idents` lint is added to the compiler. This lint is `allow` by default. The lint checks if any identifier in the current context contains a codepoint with a value equal to or greater than 0x80 (outside ASCII range). Not only locally defined identifiers are checked but also those imported from other crates and modules into the current context. 
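Note: The per-identifier test behind `non_ascii_idents` amounts to a codepoint range check. A minimal sketch, where `has_non_ascii_codepoint` is a hypothetical helper name and not part of *rustc*:

```rust
/// Returns true if `ident` contains a codepoint with a value equal to
/// or greater than 0x80, i.e. outside the ASCII range.
/// ASSUMPTION: this only illustrates the check itself; walking over
/// local and imported identifiers is left to the lint pass.
fn has_non_ascii_codepoint(ident: &str) -> bool {
    ident.chars().any(|c| (c as u32) >= 0x80)
}
```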
+ +## Remaining ASCII-only names + +Only ASCII identifiers are allowed within an external block and in the signature of a function declared `#[no_mangle]`. +Otherwise an error is reported. + +Note: These functions interface with other programming languages +and these may allow different characters or may not apply normalization to identifiers. +As this is a niche use-case it is excluded from this RFC. +A future RFC may lift the restriction. + +This RFC keeps out-of-line modules without a `#[path]` attribute ASCII-only. +The allowed character set for names on crates.io is not changed. + +Note: This is to avoid dealing with file systems on different systems *right now*. +A future RFC may allow non-ASCII characters after the file system issues are resolved. ## Confusable detection @@ -93,6 +109,23 @@ The confusable detection algorithm is based on [Unicode® Technical Standard #39 Note: A fast way to implement this is to compute `skeleton` for each identifier once and place the result in a hashmap as a key. If one tries to insert a key that already exists check if the two identifiers differ from each other. If so report the two confusable identifiers. +## Adjustments to the "bad style" lints + +Rust [RFC 0430] establishes naming conventions for Rust ASCII identifiers. The *rustc* compiler includes lints to promote these recommendations. + +The following names refer to Unicode character categories: + +* `Ll`: Letter, Lowercase +* `Lu`: Letter, Uppercase + +These are the three different naming conventions and how their corresponding lints are specified to accommodate non-ASCII codepoints: + +* UpperCamelCase/`non_camel_case_types`: The first codepoint must not be in `Ll`. Underscores are not allowed except as a word separator between two codepoints from neither `Lu` or `Ll`. +* snake_case/`non_snake_case`: Must not contain `Lu` codepoints. +* SCREAMING_SNAKE_CASE/`non_upper_case_globals`: Must not contain `Ll` codepoints. 
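Note: The three rules above could be checked roughly as sketched below. This is an illustration under stated assumptions, not the lint implementation. The function names are hypothetical. `char::is_lowercase` and `char::is_uppercase` test the Unicode `Lowercase` and `Uppercase` properties, which are used here as an approximation of the `Ll` and `Lu` categories. The underscore rule is read literally, so leading and trailing underscores are rejected in UpperCamelCase.

```rust
// ASSUMPTION: the std methods approximate the Ll / Lu general categories.
fn is_ll(c: char) -> bool {
    c.is_lowercase()
}

fn is_lu(c: char) -> bool {
    c.is_uppercase()
}

fn is_cased(c: char) -> bool {
    is_ll(c) || is_lu(c)
}

/// snake_case: must not contain `Lu` codepoints.
fn is_snake_case(ident: &str) -> bool {
    !ident.chars().any(is_lu)
}

/// SCREAMING_SNAKE_CASE: must not contain `Ll` codepoints.
fn is_screaming_snake_case(ident: &str) -> bool {
    !ident.chars().any(is_ll)
}

/// UpperCamelCase: the first codepoint must not be in `Ll`, and an
/// underscore is accepted only between two caseless codepoints.
fn is_upper_camel_case(ident: &str) -> bool {
    let chars: Vec<char> = ident.chars().collect();
    if chars.first().map_or(false, |&c| is_ll(c)) {
        return false;
    }
    chars.iter().enumerate().all(|(i, &c)| {
        if c != '_' {
            return true;
        }
        let prev = i.checked_sub(1).and_then(|j| chars.get(j));
        let next = chars.get(i + 1);
        match (prev, next) {
            (Some(&p), Some(&n)) => {
                p != '_' && n != '_' && !is_cased(p) && !is_cased(n)
            }
            // Literal reading: an underscore at either end separates nothing.
            _ => false,
        }
    })
}
```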
+ +Note: Scripts with upper- and lowercase variants ("bicameral scripts") behave similar to ASCII. Scripts without this distinction ("unicameral scripts") are also usable but all identifiers look the same regardless if they refer to a type, variable or constant. Underscores can be used to separate words in unicameral scripts even in UpperCamelCase contexts. + ## Conformance Statement * UAX31-C1: The Rust language conforms to the Unicode® Standard Annex #31 for Unicode Version 10.0.0. @@ -162,10 +195,6 @@ The [Go language][Go] allows identifiers in the form **Letter (Letter | Number)\ [unresolved]: #unresolved-questions * Which context is adequate for confusable detection: file, current scope, crate? -* Are Unicode characters allowed in `no_mangle` and `extern fn`s? -* How do Unicode names interact with the file system? -* Are crates with Unicode names allowed and can they be published to crates.io? -* Are `non_ascii_idents` and `confusable_idents` good names? * Should [ZWNJ and ZWJ be allowed in identifiers][TR31Layout]? 
[PEP 3131]: https://www.python.org/dev/peps/pep-3131/ @@ -181,3 +210,4 @@ The [Go language][Go] allows identifiers in the form **Letter (Letter | Number)\ [JavaScript]: http://www.ecma-international.org/ecma-262/6.0/#sec-names-and-keywords [Go]: https://golang.org/ref/spec#Identifiers [Composed characters]: https://en.wikipedia.org/wiki/Precomposed_character +[RFC 0430]: http://rust-lang.github.io/rfcs/0430-finalizing-naming-conventions.html From 935c91774777d8f03553159e6288f2698525a26f Mon Sep 17 00:00:00 2001 From: Pyfisch Date: Tue, 10 Jul 2018 21:53:44 +0200 Subject: [PATCH 15/26] Add two questions about debuggers and name mangling --- text/0000-non-ascii-idents.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/text/0000-non-ascii-idents.md b/text/0000-non-ascii-idents.md index b3a81f4d8fe..e72b04a4163 100644 --- a/text/0000-non-ascii-idents.md +++ b/text/0000-non-ascii-idents.md @@ -196,6 +196,8 @@ The [Go language][Go] allows identifiers in the form **Letter (Letter | Number)\ * Which context is adequate for confusable detection: file, current scope, crate? * Should [ZWNJ and ZWJ be allowed in identifiers][TR31Layout]? +* How are non-ASCII idents best supported in debuggers? +* Which name mangling scheme is used by the compiler? 
[PEP 3131]: https://www.python.org/dev/peps/pep-3131/ [UAX31]: http://www.unicode.org/reports/tr31/ From 8d548d41d314c12225014da0c4dd2e19c9cb1dad Mon Sep 17 00:00:00 2001 From: Pyfisch Date: Wed, 15 Aug 2018 13:01:29 +0200 Subject: [PATCH 16/26] Add exotic codepoint detection and mixed script lints --- text/0000-non-ascii-idents.md | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/text/0000-non-ascii-idents.md b/text/0000-non-ascii-idents.md index e72b04a4163..02f3538f16d 100644 --- a/text/0000-non-ascii-idents.md +++ b/text/0000-non-ascii-idents.md @@ -109,6 +109,22 @@ The confusable detection algorithm is based on [Unicode® Technical Standard #39 Note: A fast way to implement this is to compute `skeleton` for each identifier once and place the result in a hashmap as a key. If one tries to insert a key that already exists check if the two identifiers differ from each other. If so report the two confusable identifiers. +## Exotic codepoint detection + +A new `less_used_codepoints` lint is added to the compiler. The default setting is to `warn`. + +The lint is triggered by identifiers that contain a codepoint that is not part of the set of "Allowed" codepoints as described by [Unicode® Technical Standard #39 Unicode Security Mechanisms Section 3.1 General Security Profile for Identifiers][TR39Allowed]. + +Note: New Unicode versions update the set of allowed codepoints. Additionally the compiler authors may decide to allow more codepoints or warn about those that have been found to cause confusion. + +## Mixed script detection + +A new `mixed_script_idents` lint is added to the compiler. The default setting is to `warn`. + +The lint is triggered by identifiers that do not qualify for the "Moderately Restrictive" identifier profile specified in [Unicode® Technical Standard #39 Unicode Security Mechanisms Section 5.2 Restriction-Level Detection][TR39RestrictionLevel]. 
+ +Note: The definition of "Moderately Restrictive" can be changed by future versions of the Unicode standard to reflect changes in the natural languages used or for other reasons. + ## Adjustments to the "bad style" lints Rust [RFC 0430] establishes naming conventions for Rust ASCII identifiers. The *rustc* compiler includes lints to promote these recommendations. @@ -198,6 +214,7 @@ The [Go language][Go] allows identifiers in the form **Letter (Letter | Number)\ * Should [ZWNJ and ZWJ be allowed in identifiers][TR31Layout]? * How are non-ASCII idents best supported in debuggers? * Which name mangling scheme is used by the compiler? +* Is there a better name for the `less_used_codepoints` lint? [PEP 3131]: https://www.python.org/dev/peps/pep-3131/ [UAX31]: http://www.unicode.org/reports/tr31/ @@ -213,3 +230,5 @@ The [Go language][Go] allows identifiers in the form **Letter (Letter | Number)\ [Go]: https://golang.org/ref/spec#Identifiers [Composed characters]: https://en.wikipedia.org/wiki/Precomposed_character [RFC 0430]: http://rust-lang.github.io/rfcs/0430-finalizing-naming-conventions.html +[TR39Allowed]: https://www.unicode.org/reports/tr39/#General_Security_Profile +[TR39RestrictionLevel]: https://www.unicode.org/reports/tr39/#Restriction_Level_Detection From 9356fc1cac4c34aeebdfc76f0c6e2dc8401086ec Mon Sep 17 00:00:00 2001 From: Manish Goregaokar Date: Mon, 15 Oct 2018 13:21:25 -0700 Subject: [PATCH 17/26] + Reusability --- text/0000-non-ascii-idents.md | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/text/0000-non-ascii-idents.md b/text/0000-non-ascii-idents.md index 02f3538f16d..1d16ca3705d 100644 --- a/text/0000-non-ascii-idents.md +++ b/text/0000-non-ascii-idents.md @@ -142,6 +142,17 @@ These are the three different naming conventions and how their corresponding lin Note: Scripts with upper- and lowercase variants ("bicameral scripts") behave similar to ASCII. 
Scripts without this distinction ("unicameral scripts") are also usable but all identifiers look the same regardless if they refer to a type, variable or constant. Underscores can be used to separate words in unicameral scripts even in UpperCamelCase contexts. +## Reusability + +The code used for implementing the various lints and checks will be released to crates.io. This includes: + + - Testing validity of an identifier + - Testing for `less_used_codepoints` ([UTS #39 Section 3.1][TR39Allowed]) + - Script identification and comparison for `mixed_script_detection` ([UTS #39 Section 5.2][TR39RestrictionLevel]) + - `skeleton(X)` algorithm for confusable detection ([UTS #39 Section 4][TR39Confusable]) + + + ## Conformance Statement * UAX31-C1: The Rust language conforms to the Unicode® Standard Annex #31 for Unicode Version 10.0.0. From 40d53f5158d1e9056ed76711862834e92963155c Mon Sep 17 00:00:00 2001 From: Manish Goregaokar Date: Mon, 15 Oct 2018 14:10:33 -0700 Subject: [PATCH 18/26] Global mixed script confusables lint --- text/0000-non-ascii-idents.md | 26 +++++++++++++++++++++++++- 1 file changed, 25 insertions(+), 1 deletion(-) diff --git a/text/0000-non-ascii-idents.md b/text/0000-non-ascii-idents.md index 1d16ca3705d..c4b7aaa975c 100644 --- a/text/0000-non-ascii-idents.md +++ b/text/0000-non-ascii-idents.md @@ -117,6 +117,8 @@ The lint is triggered by identifiers that contain a codepoint that is not part o Note: New Unicode versions update the set of allowed codepoints. Additionally the compiler authors may decide to allow more codepoints or warn about those that have been found to cause confusion. +For reference, a list of all the code points allowed by this lint can be found [here][unicode-set-allowed], with the script group mentioned on the right. + ## Mixed script detection A new `mixed_script_idents` lint is added to the compiler. The default setting is to `warn`. 
@@ -125,6 +127,23 @@ The lint is triggered by identifiers that do not qualify for the "Moderately Res Note: The definition of "Moderately Restrictive" can be changed by future versions of the Unicode standard to reflect changes in the natural languages used or for other reasons. +## Global mixed script detection with confusables + +As an additional measure, we try to detect cases where a codebase primarily using a certain script has identifiers from a different script confusable with that script. + +During `mixed_script_idents` computation, keep track of how often identifiers from various script groups crop up. If an identifier is from a less-common script group (say, <1% of identifiers), _and_ it is entirely confusable with the majority script in use (e.g. the string `"арр"` or `"роре"` in Cyrillic) + +This can trigger `confusable_idents`, `mixed_script_idents`, or a new lint. + +We identify sets of characters which are entirely confusable: For example, for Cyrillic-Latin, we have `а, е, о, р, с, у, х, ѕ, і, ј, ԛ, ԝ, ѐ, ё, ї, ӱ, ӧ, ӓ, ӕ, ӑ` amongst the lowercase letters (and more amongst the capitals). This list likely can be programmatically derived from the confusables data that Unicode already has. It may be worth filtering for exact confusables. For example, Cyrillic, Greek, and Latin have a lot of confusables that are almost indistinguishable in most fonts, whereas `ھ` and `ס` are noticeably different-looking from `o` even though they're marked as a confusables. + +The main confusable script pairs we have to worry about are Cyrillic/Latin/Greek, Armenian/Ethiopic, and a couple Armenian characters mapping to Greek/Latin. We can implement this lint conservatively at first by dealing with a blacklist of known confusables for these script pairs, and expand it if there is a need. 
+ +There are many confusables _within_ scripts -- Arabic has a bunch of these as does Han (both with other Han characters and and with kana), but since these are within the same language group this is outside the scope of this RFC. Such confusables are equivalent to `l` vs `I` being confusable in some fonts. + +For reference, a list of all possible Rust identifier characters that do not trip `less_used_codepoints` but have confusables can be found [here][unicode-set-confusables], with their confusable skeleton and script group mentioned on the right. Note that in many cases the confusables are visually distinguishable, or are diacritic marks. + + ## Adjustments to the "bad style" lints Rust [RFC 0430] establishes naming conventions for Rust ASCII identifiers. The *rustc* compiler includes lints to promote these recommendations. @@ -151,7 +170,7 @@ The code used for implementing the various lints and checks will be released to - Script identification and comparison for `mixed_script_detection` ([UTS #39 Section 5.2][TR39RestrictionLevel]) - `skeleton(X)` algorithm for confusable detection ([UTS #39 Section 4][TR39Confusable]) - +Confusables detection works well when there are other identifiers to compare against, but in some cases there's only one instance of an identifier in the code. For example we have crates that use proc macros to expose command line options or REST endpoints. Crates that do things like these can use such algorithms to ensure better error handling; for example if we accidentally end up having an `/арр` endpoint (in Cyrillic) because of a `#[annotation] fn арр()`, visiting `/app` (in Latin) may show a comprehensive error (or pass-through, based on requirements) ## Conformance Statement @@ -165,6 +184,8 @@ The code used for implementing the various lints and checks will be released to * UAX31-R3. Pattern_White_Space and Pattern_Syntax Characters: Rust only uses characters from these categories for whitespace and syntax. 
Other characters may or may not be allowed in identifiers. * UAX31-R4. Equivalent Normalized Identifiers: All identifiers are normalized according to normalization form C before comparison. + + # Drawbacks [drawbacks]: #drawbacks @@ -226,6 +247,7 @@ The [Go language][Go] allows identifiers in the form **Letter (Letter | Number)\ * How are non-ASCII idents best supported in debuggers? * Which name mangling scheme is used by the compiler? * Is there a better name for the `less_used_codepoints` lint? +* Which lint should the global mixed scripts confusables detection trigger? [PEP 3131]: https://www.python.org/dev/peps/pep-3131/ [UAX31]: http://www.unicode.org/reports/tr31/ @@ -243,3 +265,5 @@ The [Go language][Go] allows identifiers in the form **Letter (Letter | Number)\ [RFC 0430]: http://rust-lang.github.io/rfcs/0430-finalizing-naming-conventions.html [TR39Allowed]: https://www.unicode.org/reports/tr39/#General_Security_Profile [TR39RestrictionLevel]: https://www.unicode.org/reports/tr39/#Restriction_Level_Detection +[unicode-set-confusables]: https://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%5B%3AIdentifier_Status%CE%B2%3DAllowed%3A%5D%26%5B%3AXID_Continue%3DYes%3A%5D%26%5B%3AConfMA%CE%B2%3A%5D%5D&g=&i=ConfMA%CE%B2%2CScript_Extensions +[unicode-set-allowed]: https://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%5B%3AIdentifier_Status%CE%B2%3DAllowed%3A%5D%26%5B%3AXID_Continue%3DYes%3A%5D%5D&g=&i=Script_Extensions \ No newline at end of file From 77328101d81cfd2c12a4dee13e00f41dc4460637 Mon Sep 17 00:00:00 2001 From: Manish Goregaokar Date: Mon, 15 Oct 2018 14:25:11 -0700 Subject: [PATCH 19/26] notable code points for less_used_codepoints --- text/0000-non-ascii-idents.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/text/0000-non-ascii-idents.md b/text/0000-non-ascii-idents.md index c4b7aaa975c..e043a7f738f 100644 --- a/text/0000-non-ascii-idents.md +++ b/text/0000-non-ascii-idents.md @@ -119,6 +119,12 @@ Note: New Unicode versions update 
the set of allowed codepoints. Additionally th For reference, a list of all the code points allowed by this lint can be found [here][unicode-set-allowed], with the script group mentioned on the right. +There are some specific interesting code points that we feel necessary to call out here: + + - `less_used_codepoints` will warn on U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER, despite these being useful in the Perso-Arabic and some Indic scripts. In Indic scripts these characters force different visual forms, which is not very necessary for programming. These have further semantic meaning in Arabic where they can be used to mark prefixes or mixed-script words, which will not crop up so often in programming (we're not able to use `-` in identifiers for marking pre/suffixes in Latin-script identifiers and it's fine). Persian seems to make the most use of these, with some compound words requiring use of these. For now this RFC does not attempt to deal with this and follows the recommendation of the specification, if there is a need for it in the future we can add this for Persian users. + - `less_used_codepoints` will not warn about U+02BB MODIFIER LETTER TURNED COMMA or U+02BC MODIFIER LETTER APOSTROPHE. These look somewhat like punctuation relevant to Rust's syntax, so they're a bit tricky. However, these code points are important in Ukranian, Hawaiian, and a bunch of other languages (U+02BB is considered a full-fledged letter in Hawaiian). For now this RFC follows the recommendation of the specification and allows these, however we can change this in the future. The hope is that syntax highlighting is enough to deal with confusions caused by such characters. + + ## Mixed script detection A new `mixed_script_idents` lint is added to the compiler. The default setting is to `warn`. 
From e3f3692c28d0c83a4246e7344afb3da7a7f2f389 Mon Sep 17 00:00:00 2001 From: Manish Goregaokar Date: Tue, 16 Oct 2018 09:49:21 -0700 Subject: [PATCH 20/26] Mention user-supplied strings --- text/0000-non-ascii-idents.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/0000-non-ascii-idents.md b/text/0000-non-ascii-idents.md index e043a7f738f..baae0defbb4 100644 --- a/text/0000-non-ascii-idents.md +++ b/text/0000-non-ascii-idents.md @@ -176,7 +176,7 @@ The code used for implementing the various lints and checks will be released to - Script identification and comparison for `mixed_script_detection` ([UTS #39 Section 5.2][TR39RestrictionLevel]) - `skeleton(X)` algorithm for confusable detection ([UTS #39 Section 4][TR39Confusable]) -Confusables detection works well when there are other identifiers to compare against, but in some cases there's only one instance of an identifier in the code. For example we have crates that use proc macros to expose command line options or REST endpoints. Crates that do things like these can use such algorithms to ensure better error handling; for example if we accidentally end up having an `/арр` endpoint (in Cyrillic) because of a `#[annotation] fn арр()`, visiting `/app` (in Latin) may show a comprehensive error (or pass-through, based on requirements) +Confusables detection works well when there are other identifiers to compare against, but in some cases there's only one instance of an identifier in the code, and it's compared with user-supplied strings. For example we have crates that use proc macros to expose command line options or REST endpoints. 
Crates that do things like these can use such algorithms to ensure better error handling; for example if we accidentally end up having an `/арр` endpoint (in Cyrillic) because of a `#[annotation] fn арр()`, visiting `/app` (in Latin) may show a comprehensive error (or pass-through, based on requirements) ## Conformance Statement From d389a9cbc4df14d6e7c9d32150962b99dd1212bc Mon Sep 17 00:00:00 2001 From: "Felix S. Klock II" Date: Fri, 19 Oct 2018 11:38:37 +0200 Subject: [PATCH 21/26] Add unresolved Q regarding const pat confusion (rust-lang/rust#7526). --- text/0000-non-ascii-idents.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/text/0000-non-ascii-idents.md b/text/0000-non-ascii-idents.md index baae0defbb4..a33edafaf56 100644 --- a/text/0000-non-ascii-idents.md +++ b/text/0000-non-ascii-idents.md @@ -254,6 +254,10 @@ The [Go language][Go] allows identifiers in the form **Letter (Letter | Number)\ * Which name mangling scheme is used by the compiler? * Is there a better name for the `less_used_codepoints` lint? * Which lint should the global mixed scripts confusables detection trigger? +* How badly do non-ASCII idents exacerbate const pattern confusion + (rust-lang/rust#7526, rust-lang/rust#49680)? + Can we improve precision of linting here? 
+ [PEP 3131]: https://www.python.org/dev/peps/pep-3131/ [UAX31]: http://www.unicode.org/reports/tr31/ From 70297a9f7762536c5392583fc61a48fb7c05f7ad Mon Sep 17 00:00:00 2001 From: Manish Goregaokar Date: Fri, 19 Oct 2018 12:32:17 -0700 Subject: [PATCH 22/26] Remove old mixed scripts lints --- text/0000-non-ascii-idents.md | 55 +++++++++++++++++++---------------- 1 file changed, 30 insertions(+), 25 deletions(-) diff --git a/text/0000-non-ascii-idents.md b/text/0000-non-ascii-idents.md index a33edafaf56..381be3d4a0b 100644 --- a/text/0000-non-ascii-idents.md +++ b/text/0000-non-ascii-idents.md @@ -125,31 +125,6 @@ There are some specific interesting code points that we feel necessary to call o - `less_used_codepoints` will not warn about U+02BB MODIFIER LETTER TURNED COMMA or U+02BC MODIFIER LETTER APOSTROPHE. These look somewhat like punctuation relevant to Rust's syntax, so they're a bit tricky. However, these code points are important in Ukranian, Hawaiian, and a bunch of other languages (U+02BB is considered a full-fledged letter in Hawaiian). For now this RFC follows the recommendation of the specification and allows these, however we can change this in the future. The hope is that syntax highlighting is enough to deal with confusions caused by such characters. -## Mixed script detection - -A new `mixed_script_idents` lint is added to the compiler. The default setting is to `warn`. - -The lint is triggered by identifiers that do not qualify for the "Moderately Restrictive" identifier profile specified in [Unicode® Technical Standard #39 Unicode Security Mechanisms Section 5.2 Restriction-Level Detection][TR39RestrictionLevel]. - -Note: The definition of "Moderately Restrictive" can be changed by future versions of the Unicode standard to reflect changes in the natural languages used or for other reasons. 
- -## Global mixed script detection with confusables - -As an additional measure, we try to detect cases where a codebase primarily using a certain script has identifiers from a different script confusable with that script. - -During `mixed_script_idents` computation, keep track of how often identifiers from various script groups crop up. If an identifier is from a less-common script group (say, <1% of identifiers), _and_ it is entirely confusable with the majority script in use (e.g. the string `"арр"` or `"роре"` in Cyrillic) - -This can trigger `confusable_idents`, `mixed_script_idents`, or a new lint. - -We identify sets of characters which are entirely confusable: For example, for Cyrillic-Latin, we have `а, е, о, р, с, у, х, ѕ, і, ј, ԛ, ԝ, ѐ, ё, ї, ӱ, ӧ, ӓ, ӕ, ӑ` amongst the lowercase letters (and more amongst the capitals). This list likely can be programmatically derived from the confusables data that Unicode already has. It may be worth filtering for exact confusables. For example, Cyrillic, Greek, and Latin have a lot of confusables that are almost indistinguishable in most fonts, whereas `ھ` and `ס` are noticeably different-looking from `o` even though they're marked as a confusables. - -The main confusable script pairs we have to worry about are Cyrillic/Latin/Greek, Armenian/Ethiopic, and a couple Armenian characters mapping to Greek/Latin. We can implement this lint conservatively at first by dealing with a blacklist of known confusables for these script pairs, and expand it if there is a need. - -There are many confusables _within_ scripts -- Arabic has a bunch of these as does Han (both with other Han characters and and with kana), but since these are within the same language group this is outside the scope of this RFC. Such confusables are equivalent to `l` vs `I` being confusable in some fonts. 
- -For reference, a list of all possible Rust identifier characters that do not trip `less_used_codepoints` but have confusables can be found [here][unicode-set-confusables], with their confusable skeleton and script group mentioned on the right. Note that in many cases the confusables are visually distinguishable, or are diacritic marks. - - ## Adjustments to the "bad style" lints Rust [RFC 0430] establishes naming conventions for Rust ASCII identifiers. The *rustc* compiler includes lints to promote these recommendations. @@ -232,6 +207,36 @@ The current design was chosen because the algorithm and list of similar characte Instead of offering confusable detection the lint `forbid(non_ascii_idents)` is sufficient to protect a project written in English from homoglyph attacks. Projects using different languages are probably either written by students, by a small group or inside a regional company. These projects are not threatened as much as large open source projects by homoglyph attacks but still benefit from the easier debugging of typos. + +## Alternative mixed script lints + +These are previously-proposed lints attempting to prevent problems caused by mixing scripts, which were ultimately replaced by the current mixed script confusables lint. + +### Mixed script detection + +A new `mixed_script_idents` lint would be added to the compiler. The default setting is to `warn`. + +The lint is triggered by identifiers that do not qualify for the "Moderately Restrictive" identifier profile specified in [Unicode® Technical Standard #39 Unicode Security Mechanisms Section 5.2 Restriction-Level Detection][TR39RestrictionLevel]. + +Note: The definition of "Moderately Restrictive" can be changed by future versions of the Unicode standard to reflect changes in the natural languages used or for other reasons. 
+ +### Global mixed script detection with confusables + +As an additional measure, we would try to detect cases where a codebase primarily using a certain script has identifiers from a different script confusable with that script. + +During `mixed_script_idents` computation, keep track of how often identifiers from various script groups crop up. If an identifier is from a less-common script group (say, <1% of identifiers), _and_ it is entirely confusable with the majority script in use (e.g. the string `"арр"` or `"роре"` in Cyrillic) + +This can trigger `confusable_idents`, `mixed_script_idents`, or a new lint. + +We identify sets of characters which are entirely confusable: For example, for Cyrillic-Latin, we have `а, е, о, р, с, у, х, ѕ, і, ј, ԛ, ԝ, ѐ, ё, ї, ӱ, ӧ, ӓ, ӕ, ӑ` amongst the lowercase letters (and more amongst the capitals). This list likely can be programmatically derived from the confusables data that Unicode already has. It may be worth filtering for exact confusables. For example, Cyrillic, Greek, and Latin have a lot of confusables that are almost indistinguishable in most fonts, whereas `ھ` and `ס` are noticeably different-looking from `o` even though they're marked as a confusables. + +The main confusable script pairs we have to worry about are Cyrillic/Latin/Greek, Armenian/Ethiopic, and a couple Armenian characters mapping to Greek/Latin. We can implement this lint conservatively at first by dealing with a blacklist of known confusables for these script pairs, and expand it if there is a need. + +There are many confusables _within_ scripts -- Arabic has a bunch of these as does Han (both with other Han characters and and with kana), but since these are within the same language group this is outside the scope of this RFC. Such confusables are equivalent to `l` vs `I` being confusable in some fonts. 
+ +For reference, a list of all possible Rust identifier characters that do not trip `less_used_codepoints` but have confusables can be found [here][unicode-set-confusables], with their confusable skeleton and script group mentioned on the right. Note that in many cases the confusables are visually distinguishable, or are diacritic marks. + + # Prior art [prior-art]: #prior-art From 9bf90dfe501d5ef6678f9f2fc33f520dbcd82106 Mon Sep 17 00:00:00 2001 From: Manish Goregaokar Date: Fri, 19 Oct 2018 12:48:28 -0700 Subject: [PATCH 23/26] Add new mixed_script_confusables lint --- text/0000-non-ascii-idents.md | 18 +++++++++++++++++- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/text/0000-non-ascii-idents.md b/text/0000-non-ascii-idents.md index 381be3d4a0b..05dfeb5f10b 100644 --- a/text/0000-non-ascii-idents.md +++ b/text/0000-non-ascii-idents.md @@ -142,13 +142,28 @@ These are the three different naming conventions and how their corresponding lin Note: Scripts with upper- and lowercase variants ("bicameral scripts") behave similar to ASCII. Scripts without this distinction ("unicameral scripts") are also usable but all identifiers look the same regardless if they refer to a type, variable or constant. Underscores can be used to separate words in unicameral scripts even in UpperCamelCase contexts. +## Mixed script confusables lint + +We keep track of the script groups in use in a document using the comparison heuristics in [Unicode® Technical Standard #39 Unicode Security Mechanisms Section 5.2 Restriction-Level Detection][TR39RestrictionLevel]. + +We identify lists of code points which are `Allowed` by [UTS 39 section 3.1][TR39Allowed] (i.e., code points not already linted by `less_used_codepoints`) and are "exact" confusables between code points from other `Allowed` scripts. This is stuff like Cyrillic `о` (confusable with Latin `o`), but does not include things like Hebrew `ס` which is somewhat distinguishable from Latin `o`. 
This list of exact confusables can be modified in the future. + +We expect most of these to be between Cyrillic-Latin-Greek and some in Ethiopic-Armenian, but a proper review can be done before stabilization. There are also confusable modifiers between many scripts. + +In a code base, if the _only_ code points from a given script group (aside from `Latin`, `Common`, and `Inherited`) are such exact confusables, lint about it with `mixed_script_confusables` (lint name can be finalized later). + +As an implementation note, it may be worth dealing with confusable modifiers via a separate lint check -- if a modifier is from a different (non-`Common`/`Inherited`) script group from the thing preceding it. This has some behavioral differences but should not increase the chance of false positives. + +The exception for `Latin` is made because the standard library is Latin-script. It could potentially be removed since a code base using the standard library (or any Latin-using library) is likely to be using enough of it that there will be non-confusable characters in use. (This is listed in the unresolved questions.) + + ## Reusability The code used for implementing the various lints and checks will be released to crates.io. This includes: - Testing validity of an identifier - Testing for `less_used_codepoints` ([UTS #39 Section 3.1][TR39Allowed]) - - Script identification and comparison for `mixed_script_detection` ([UTS #39 Section 5.2][TR39RestrictionLevel]) + - Script identification and comparison for `mixed_script_confusables` ([UTS #39 Section 5.2][TR39RestrictionLevel]) - `skeleton(X)` algorithm for confusable detection ([UTS #39 Section 4][TR39Confusable]) Confusables detection works well when there are other identifiers to compare against, but in some cases there's only one instance of an identifier in the code, and it's compared with user-supplied strings. For example we have crates that use proc macros to expose command line options or REST endpoints. 
Crates that do things like these can use such algorithms to ensure better error handling; for example if we accidentally end up having an `/арр` endpoint (in Cyrillic) because of a `#[annotation] fn арр()`, visiting `/app` (in Latin) may show a comprehensive error (or pass-through, based on requirements) @@ -262,6 +277,7 @@ The [Go language][Go] allows identifiers in the form **Letter (Letter | Number)\ * How badly do non-ASCII idents exacerbate const pattern confusion (rust-lang/rust#7526, rust-lang/rust#49680)? Can we improve precision of linting here? +* In `mixed_script_confusables`, do we actually need to make an exception for `Latin` identifiers? [PEP 3131]: https://www.python.org/dev/peps/pep-3131/ From a6da03a37780c7b1ee6aa79080bee9ddafc2f7da Mon Sep 17 00:00:00 2001 From: Manish Goregaokar Date: Fri, 19 Oct 2018 22:08:10 -0700 Subject: [PATCH 24/26] Add unresolved questions for RTL and terminal width --- text/0000-non-ascii-idents.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/text/0000-non-ascii-idents.md b/text/0000-non-ascii-idents.md index 05dfeb5f10b..0fb5efc10ee 100644 --- a/text/0000-non-ascii-idents.md +++ b/text/0000-non-ascii-idents.md @@ -278,6 +278,8 @@ The [Go language][Go] allows identifiers in the form **Letter (Letter | Number)\ (rust-lang/rust#7526, rust-lang/rust#49680)? Can we improve precision of linting here? * In `mixed_script_confusables`, do we actually need to make an exception for `Latin` identifiers? +* Terminal width is tricky with Unicode. Some characters are long, some have lengths dependent on the fonts installed (e.g. emoji sequences), and modifiers are a thing. The concept of monospace font doesn't generalize to other scripts as well. How does rustfmt deal with this when determining line width? +* Right-to-left scripts can lead to weird rendering in mixed contexts (depending on the software used), especially when mixed with operators. 
This is not something that should block stabilization, however we feel it is important to explicitly call out. Future RFCs (preferably put forth by RTL-using communities) may attempt to improve this situation (e.g. by allowing bidi control characters in specific contexts). [PEP 3131]: https://www.python.org/dev/peps/pep-3131/ From c4dff649985d802012cab7a661921601727859f9 Mon Sep 17 00:00:00 2001 From: Manish Goregaokar Date: Sat, 20 Oct 2018 15:04:05 -0700 Subject: [PATCH 25/26] Allow bare underscore identifiers --- text/0000-non-ascii-idents.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/text/0000-non-ascii-idents.md b/text/0000-non-ascii-idents.md index 0fb5efc10ee..84d4f1ec326 100644 --- a/text/0000-non-ascii-idents.md +++ b/text/0000-non-ascii-idents.md @@ -68,12 +68,12 @@ The lexer defines identifiers as: > **Lexer:** > IDENTIFIER_OR_KEYWORD: >    XID_Start XID_Continue\* ->    | `_` XID_Continue+ +>    | `_` XID_Continue* > > IDENTIFIER : > IDENTIFIER_OR_KEYWORD *Except a [strict] or [reserved] keyword* -`XID_Start` and `XID_Continue` are used as defined in the aforementioned standard. The definition of identifiers is forward compatible with each successive release of Unicode as only appropriate new characters are added to the classes but none are removed. +`XID_Start` and `XID_Continue` are used as defined in the aforementioned standard. The definition of identifiers is forward compatible with each successive release of Unicode as only appropriate new characters are added to the classes but none are removed. We effectively are using UAX 31's default definition of valid identifier, with a tailoring that underscores are included with `XID_Start`. (Note that this allows bare underscores to be identifiers, that is currently also the case with `_` in identifier contexts being a reserved keyword) Parsers for Rust syntax normalize identifiers to [NFC][UAX15]. 
Every API accepting raw identifiers (such as `proc_macro::Ident::new` normalizes them to NFC and APIs returning them as strings (like `proc_macro::Ident::to_string`) return the normalized form. This means two identifiers are equal if their NFC forms are equal. From 0c7863140147a5b2e14d4c6f325d14896849a3f6 Mon Sep 17 00:00:00 2001 From: Mazdak Farrokhzad Date: Mon, 29 Oct 2018 11:29:15 +0100 Subject: [PATCH 26/26] RFC 2457 --- ...{0000-non-ascii-idents.md => 2457-non-ascii-idents.md} | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) rename text/{0000-non-ascii-idents.md => 2457-non-ascii-idents.md} (99%) diff --git a/text/0000-non-ascii-idents.md b/text/2457-non-ascii-idents.md similarity index 99% rename from text/0000-non-ascii-idents.md rename to text/2457-non-ascii-idents.md index 84d4f1ec326..0d7d0a42117 100644 --- a/text/0000-non-ascii-idents.md +++ b/text/2457-non-ascii-idents.md @@ -1,7 +1,7 @@ -- Feature Name: non_ascii_idents +- Feature Name: `non_ascii_idents` - Start Date: 2018-06-03 -- RFC PR: (leave this empty) -- Rust Issue: (leave this empty) +- RFC PR: [rust-lang/rfcs#2457](https://github.com/rust-lang/rfcs/pull/2457) +- Rust Issue: [rust-lang/rust#55467](https://github.com/rust-lang/rust/issues/55467) # Summary [summary]: #summary @@ -299,4 +299,4 @@ The [Go language][Go] allows identifiers in the form **Letter (Letter | Number)\ [TR39Allowed]: https://www.unicode.org/reports/tr39/#General_Security_Profile [TR39RestrictionLevel]: https://www.unicode.org/reports/tr39/#Restriction_Level_Detection [unicode-set-confusables]: https://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%5B%3AIdentifier_Status%CE%B2%3DAllowed%3A%5D%26%5B%3AXID_Continue%3DYes%3A%5D%26%5B%3AConfMA%CE%B2%3A%5D%5D&g=&i=ConfMA%CE%B2%2CScript_Extensions -[unicode-set-allowed]: https://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%5B%3AIdentifier_Status%CE%B2%3DAllowed%3A%5D%26%5B%3AXID_Continue%3DYes%3A%5D%5D&g=&i=Script_Extensions \ No newline at end of file 
+[unicode-set-allowed]: https://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%5B%3AIdentifier_Status%CE%B2%3DAllowed%3A%5D%26%5B%3AXID_Continue%3DYes%3A%5D%5D&g=&i=Script_Extensions
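
Editorial sketch (not part of the patch series): the Reusability section's example of a Cyrillic `/арр` endpoint being requested as Latin `/app` can be illustrated roughly as follows. This is not the `skeleton(X)` algorithm from UTS #39. The mapping is a tiny hand-picked subset of the Cyrillic-Latin pairs the RFC lists, and the names `latin_lookalike`, `toy_skeleton` and `is_confusable_with` are hypothetical, chosen only for this illustration. A real implementation would presumably be driven by the full Unicode confusables data.

```rust
/// Hypothetical helper: maps a few Cyrillic letters to the Latin letters
/// they are easily confused with. Illustrative subset only, NOT the
/// Unicode confusables table.
fn latin_lookalike(c: char) -> char {
    match c {
        '\u{0430}' => 'a', // CYRILLIC SMALL LETTER A
        '\u{0435}' => 'e', // CYRILLIC SMALL LETTER IE
        '\u{043E}' => 'o', // CYRILLIC SMALL LETTER O
        '\u{0440}' => 'p', // CYRILLIC SMALL LETTER ER
        '\u{0441}' => 'c', // CYRILLIC SMALL LETTER ES
        '\u{0443}' => 'y', // CYRILLIC SMALL LETTER U
        '\u{0445}' => 'x', // CYRILLIC SMALL LETTER HA
        other => other,
    }
}

/// Toy stand-in for a skeleton function: replaces every character that has
/// an entry in the table above with its Latin look-alike.
fn toy_skeleton(ident: &str) -> String {
    ident.chars().map(latin_lookalike).collect()
}

/// Two distinct strings are treated as confusable here when their toy
/// skeletons are equal.
fn is_confusable_with(a: &str, b: &str) -> bool {
    a != b && toy_skeleton(a) == toy_skeleton(b)
}
```

Under these assumptions, a routing crate could compare an unmatched request path against its registered endpoints with `is_confusable_with` and report a more helpful error when the two differ only by look-alike characters.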