From b35b0d8167b603506b25269c55cc8ede29a41ec2 Mon Sep 17 00:00:00 2001
From: Richard Smith
Date: Thu, 13 Aug 2020 18:44:32 -0700
Subject: [PATCH 01/10] Unicode source files.

---
 proposals/README.md |   2 +
 proposals/p0142.md  | 182 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 184 insertions(+)
 create mode 100644 proposals/p0142.md

diff --git a/proposals/README.md b/proposals/README.md
index 61e3bcf2bc214..1316d0a0b2241 100644
--- a/proposals/README.md
+++ b/proposals/README.md
@@ -38,6 +38,8 @@ request:
 - [0107 - Code and name organization](p0107.md)
 - [0120 - Add idiomatic code performance and developer-facing docs to goals](p0120.md)
   - [Decision](p0120_decision.md)
+- [0142 - Unicode source files](p0142.md)
+  - [Decision](p0142_decision.md)
 - [0143 - Numeric literals](p0143.md)
   - [Decision](p0143_decision.md)
 - [0149 - Change documentation style guide](p0149.md)

diff --git a/proposals/p0142.md b/proposals/p0142.md
new file mode 100644
index 0000000000000..6d1e82aa4d1e1
--- /dev/null
+++ b/proposals/p0142.md
@@ -0,0 +1,182 @@
# Unicode source files

[Pull request](https://github.com/carbon-language/carbon-lang/pull/142)

## Table of contents

- [Problem](#problem)
- [Background](#background)
- [Proposal](#proposal)
- [Details](#details)
  - [Character encoding](#character-encoding)
  - [Source files](#source-files)
  - [Normalization](#normalization)
  - [Characters in identifiers and whitespace](#characters-in-identifiers-and-whitespace)
- [Alternatives considered](#alternatives-considered)
  - [Character encoding](#character-encoding-1)
  - [Byte order marks](#byte-order-marks)
  - [Normalization forms](#normalization-forms)

## Problem

Portable use and maintenance of Carbon source files requires a common
understanding of how they are encoded on disk. Further, the decisions as to
what characters are valid in names and what constitutes whitespace are a
complex area in which we do not expect to have local expertise.
## Background

[Unicode](https://www.unicode.org/versions/latest/) is a universal character
encoding, maintained by the
[Unicode Consortium](https://home.unicode.org/basic-info/overview/). It is the
canonical encoding used for textual information interchange across all modern
technology.

The [Unicode Standard Annex 31](https://www.unicode.org/reports/tr31/), "Unicode
Identifier and Pattern Syntax", provides recommendations for the use of Unicode
in the definitions of general-purpose identifiers.

## Proposal

Carbon programs are represented as a sequence of Unicode code points. Carbon
source files are encoded in UTF-8.

Carbon will follow lexical conventions for identifiers and whitespace based on
Unicode Annex 31.

## Details

### Character encoding

Before being divided into tokens, a program starts as a sequence of characters.
Those characters are a sequence of Unicode code units -- integer values between
0 and 10FFFF<sub>16</sub> -- whose meaning as characters or non-characters is
defined by the Unicode standard.

Carbon is based on Unicode 13.0, which is currently the latest version of the
Unicode standard. Newer versions should be considered for adoption as they are
released.

### Source files

Program text can come from a variety of sources, such as an interactive
programming environment (a so-called "Read-Evaluate-Print-Loop" or REPL), a
database, or a command-line argument. However, the typical representation for
Carbon programs is in source files stored on disk, and such files are expected
to be encoded in UTF-8.

Carbon source files may begin with an optional UTF-8 BOM, that is, the byte
sequence EF<sub>16</sub>,BB<sub>16</sub>,BF<sub>16</sub>. This prefix, if
present, is ignored.

### Normalization

Carbon source files, outside comments and string literals, are required to be in
Unicode Normalization Form C ("NFC"). The Carbon source formatting tool will
convert source files to NFC as necessary to satisfy this constraint.
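The NFC requirement above can be sketched with Python's standard `unicodedata`
module. This is only an illustration of the Unicode operations involved, not
Carbon tooling; `require_nfc` is a hypothetical helper name:

```python
import unicodedata

def require_nfc(text: str) -> str:
    """Return `text` in NFC, converting only when necessary, as the
    formatting tool is described as doing."""
    # `is_normalized` is the cheap detection step; `normalize` performs
    # the actual conversion when the input is not already in NFC.
    if unicodedata.is_normalized("NFC", text):
        return text
    return unicodedata.normalize("NFC", text)

# "ō" written as 'o' + U+0304 (COMBINING MACRON) is not in NFC; normalizing
# it yields the single pre-composed code point U+014D.
assert require_nfc("o\u0304") == "\u014d"
```

Note that detection alone is much cheaper than conversion, which is the
asymmetry the proposal relies on when requiring pre-normalized input.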
+ +### Characters in identifiers and whitespace + +We will largely follow Unicode Annex 31 in our selection of identifier and +whitespace characters. This Annex does not provide specific rules on lexical +syntax, instead providing a framework that permits a selection of choices of +concrete rules. + +This proposal does not specify concrete choices, nor that we will not deviate +from Annex 31 in any concrete area. We may find cases where we wish to take a +different direction than that of the Annex. However, we should use Annex 31 as a +basis for our decisions, and should expect strong justification for deviations +from it. + +## Alternatives considered + +### Character encoding + +We could restrict programs to ASCII. + +Pro: + +- Reduced implementation complexity. +- Avoids all problems relating to normalization, homoglyphs, text + directionality, and so on. +- We have no intention of using non-ASCII symbols outside Carbon programs. +- Provides assurance that all names in libraries can reliably be typed by all + developers -- we already require that keywords, and thus all ASCII letters, + can be typed. + +Con: + +- An overarching goal of the Carbon project is to provide a language that is + inclusive and welcoming. A language that does not permit names in programs + to be expressed in the developer's native language will not meet that goal + for at least some of our developers. + +### Byte order marks + +We could disallow byte order marks. + +Pro: + +- Marginal implementation simplicity. + +Con: + +- Several major editors, particularly on the Windows platform, insert UTF-8 + BOMs and use them to identify file encoding. + +### Normalization forms + +We could require a different normalization form. + +Pro: + +- Some environments might more naturally produce a different normalization + form. 
+- Normalization Form D is more uniform, in that characters are always + maximally decomposed into combining characters; in NFC, characters may or + may not be decomposed depending on whether a composed form is available. + +Con: + +- The C++ standard and community is moving towards using NFC: + - WG21 is in the process of adopting a NFC requirement for C++ + identifiers. + - GCC warns on C++ identifiers that aren't in NFC. + +We could require no normalization form, and normalize identifiers ourselves. + +Pro: + +- We could treat source text identically regardless of the normalization form. + +Con: + +- There is substantially more implementation cost involved in normalizing + identifiers than in detecting whether they are in normal form. + - This proposal would require the implementation complexity of converting + into NFC in the formatting tool, but would not require the conversion + cost to be paid during compilation. + +We could require no normalization form and compare identifiers by code point +sequence. + +Pro: + +- This is the rule currently in use in C++. + +Con: + +- This is not the rule planned for the near future of C++. +- Different representations of the same character may result in different + identifiers, in a way that is likely to be invisible in most programming + environments. From 29c8c45ac546615b1097a9405ee9039e244bb139 Mon Sep 17 00:00:00 2001 From: Richard Smith Date: Fri, 14 Aug 2020 11:53:56 -0700 Subject: [PATCH 02/10] Address review comments from @gribozavr. --- proposals/p0142.md | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/proposals/p0142.md b/proposals/p0142.md index 6d1e82aa4d1e1..d58f354fc96cd 100644 --- a/proposals/p0142.md +++ b/proposals/p0142.md @@ -59,7 +59,7 @@ Unicode Annex 31. ### Character encoding Before being divided into tokens, a program starts as a sequence of characters. 
-Those characters are a sequence of Unicode code units -- integer values between
+Those characters are a sequence of Unicode code points -- integer values between
 0 and 10FFFF<sub>16</sub> -- whose meaning as characters or non-characters is
 defined by the Unicode standard.
@@ -109,7 +109,8 @@ Pro:
 - Reduced implementation complexity.
 - Avoids all problems relating to normalization, homoglyphs, text
   directionality, and so on.
-- We have no intention of using non-ASCII symbols outside Carbon programs.
+- We have no intention of using non-ASCII characters in the language syntax or
+  in any library name.
 - Provides assurance that all names in libraries can reliably be typed by all
   developers -- we already require that keywords, and thus all ASCII letters,
   can be typed.
@@ -162,17 +163,17 @@ Pro:
 Con:

 - There is substantially more implementation cost involved in normalizing
-  identifiers than in detecting whether they are in normal form.
-  - This proposal would require the implementation complexity of converting
-    into NFC in the formatting tool, but would not require the conversion
-    cost to be paid during compilation.
+  identifiers than in detecting whether they are in normal form. While this
+  proposal would require the implementation complexity of converting into NFC
+  in the formatting tool, it would not require the conversion cost to be paid
+  during compilation.

 We could require no normalization form and compare identifiers by code point
 sequence.

 Pro:

-- This is the rule currently in use in C++.
+- This is the rule in use in C++20 and before.

 Con:

From 64c390465f208e6a1e5f26b8fb4e92fdd4b41be0 Mon Sep 17 00:00:00 2001
From: Richard Smith
Date: Fri, 21 Aug 2020 16:33:52 -0700
Subject: [PATCH 03/10] Address more code review comments.
--- proposals/p0142.md | 88 ++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 78 insertions(+), 10 deletions(-)

diff --git a/proposals/p0142.md b/proposals/p0142.md
index d58f354fc96cd..69bd2d194a5ad 100644
--- a/proposals/p0142.md
+++ b/proposals/p0142.md
@@ -71,20 +71,60 @@ released.

 Program text can come from a variety of sources, such as an interactive
 programming environment (a so-called "Read-Evaluate-Print-Loop" or REPL), a
-database, or a command-line argument. However, the typical representation for
-Carbon programs is in source files stored on disk, and such files are expected
-to be encoded in UTF-8.
+database, a memory buffer of an IDE, or a command-line argument.

-Carbon source files may begin with an optional UTF-8 BOM, that is, the byte
-sequence EF<sub>16</sub>,BB<sub>16</sub>,BF<sub>16</sub>. This prefix, if
+The canonical representation for Carbon programs is in files stored as a
+sequence of bytes in a file system on disk, and such files are expected to be
+encoded in UTF-8. Such files may begin with an optional UTF-8 BOM, that is, the
+byte sequence EF<sub>16</sub>,BB<sub>16</sub>,BF<sub>16</sub>. This prefix, if
 present, is ignored.

+Regardless of how program text is concretely stored, the first step in
+processing any such text is to convert it to a sequence of Unicode code points
+-- although such conversion may be purely notional. The result of this
+conversion is a Carbon _source file_. Depending on the needs of the language, we
+may require each such source file to have an associated file name, even if the
+source file does not originate in anything resembling a file system.
+
 ### Normalization

+Background:
+
+- [wikipedia article on Unicode normal forms](https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms)
+- [Unicode Standard Annex #15: Unicode Normalization Forms](https://www.unicode.org/reports/tr15/tr15-50.html)
+
 Carbon source files, outside comments and string literals, are required to be in
 Unicode Normalization Form C ("NFC").
The Carbon source formatting tool will
 convert source files to NFC as necessary to satisfy this constraint.

+The choice to require NFC is really three choices:
+
+1. Equivalence classes: we use a canonical normalization form rather than a
+   compatibility normalization form or no normalization form at all.
+
+   - If we use no normalization, invisibly-different ways of representing the
+     same glyph, such as with pre-combined diacritics versus with diacritics
+     expressed as separate combining characters, or with combining characters
+     in a different order, would be considered different characters.
+   - If we use a canonical normalization form, all ways of encoding diacritics
+     are considered to form the same character, but ligatures such as `ffi` are
+     considered distinct from the character sequence that they decompose into.
+   - If we use a compatibility normalization form, ligatures are considered
+     equivalent to the character sequence that they decompose into.
+
+   For a fixed-width font, a canonical normalization form is most likely to
+   consider characters to be the same if they look the same.
+
+2. Composition: we use a composed normalization form rather than a decomposed
+   normalization form. For example, `ō` is encoded as U+014D (LATIN SMALL
+   LETTER O WITH MACRON) in a composed form and as U+006F (LATIN SMALL LETTER
+   O), U+0304 (COMBINING MACRON) in a decomposed form. The composed form results
+   in smaller representations whenever the two differ, but the decomposed form
+   is a little easier for algorithmic processing (for example, typo correction).
+
+3. We require source files to be in our chosen form, rather than converting to
+   that form as necessary.
+
 ### Characters in identifiers and whitespace

 We will largely follow Unicode Annex 31 in our selection of identifier and
 whitespace characters. This Annex does not provide specific rules on lexical
 syntax, instead providing a framework that permits a selection of choices of
 concrete rules.
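The composed-versus-decomposed trade-off in choice 2 can be seen directly with
Python's standard `unicodedata` module; this demonstrates the Unicode forms
themselves, not any Carbon implementation:

```python
import unicodedata

composed = "\u014d"  # ō as the single code point U+014D
decomposed = unicodedata.normalize("NFD", composed)

# NFD splits the character into base letter plus combining mark...
assert decomposed == "o\u0304"  # U+006F, U+0304 -- two code points
# ...while NFC re-composes it, giving the smaller representation.
assert unicodedata.normalize("NFC", decomposed) == composed
assert len(composed) == 1 and len(decomposed) == 2
```

Both strings render as the same glyph; only the underlying code point sequence
differs, which is exactly what a normalization requirement pins down.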
-This proposal does not specify concrete choices, nor that we will not deviate
-from Annex 31 in any concrete area. We may find cases where we wish to take a
-different direction than that of the Annex. However, we should use Annex 31 as a
-basis for our decisions, and should expect strong justification for deviations
-from it.
+The framework provided by Annex 31 includes suggested sets of characters that
+may appear in identifiers, including uppercase and lowercase ASCII letters,
+along with reasonable extensions to many non-ASCII letters, with some characters
+restricted to not appear as the first character. For example, this list includes
+U+30EA (KATAKANA LETTER RI), but not U+2603 (SNOWMAN), both of which are
+permitted in identifiers in C++20. Similarly, it indicates which characters
+should be classified as whitespace, including all the ASCII whitespace
+characters plus some non-ASCII whitespace characters. It also supports
+language-specific "profiles" to alter these baseline character sets for the
+needs of a particular language -- for instance, to permit underscores in
+identifiers, or to include non-breaking spaces as whitespace characters.
+
+This proposal does not specify concrete choices for lexical rules, nor that we
+will not deviate from conformance to Annex 31 in any concrete area. We may find
+cases where we wish to take a different direction than that of the Annex.
+However, we should use Annex 31 as a basis for our decisions, and should expect
+strong justification for deviations from it.

 ## Alternatives considered

+There are a number of different design choices we could make, as divergences
+from the above proposal. Those choices, along with the arguments that led to
+choosing the proposed design rather than each alternative, are presented below.
+
 ### Character encoding
@@ -146,14 +202,26 @@ Pro: - Normalization Form D is more uniform, in that characters are always maximally decomposed into combining characters; in NFC, characters may or may not be decomposed depending on whether a composed form is available. + - NFD may be more suitable for certain uses such as typo correction or + code completion. Con: - The C++ standard and community is moving towards using NFC: + - WG21 is in the process of adopting a NFC requirement for C++ identifiers. - GCC warns on C++ identifiers that aren't in NFC. + As a consequence, we should expect that the tooling and development + environments that C++ developers are using will provide good support for + authoring NFC-encoded source files. + +- The W3C recommends using NFC for all content, so code samples distributed on + webpages may be canonicalized into NFC by some web authoring tools. + +- NFC produces smaller encodings than NFD in all cases where they differ. + We could require no normalization form, and normalize identifiers ourselves. Pro: From 1ed8c19a3a5e8bfbfe60ef22724ca25dd9158404 Mon Sep 17 00:00:00 2001 From: Richard Smith Date: Fri, 21 Aug 2020 16:44:47 -0700 Subject: [PATCH 04/10] Addressing more review comments. --- proposals/p0142.md | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/proposals/p0142.md b/proposals/p0142.md index 69bd2d194a5ad..9767ee1ec9260 100644 --- a/proposals/p0142.md +++ b/proposals/p0142.md @@ -174,9 +174,11 @@ Pro: Con: - An overarching goal of the Carbon project is to provide a language that is - inclusive and welcoming. A language that does not permit names in programs - to be expressed in the developer's native language will not meet that goal - for at least some of our developers. + inclusive and welcoming. A language that does not permit names and comments + in programs to be expressed in the developer's native language will not meet + that goal for at least some of our developers. 
+- Quoted strings will be substantially less readable if non-ASCII printable + characters are required to be written as escape sequences. ### Byte order marks @@ -235,6 +237,11 @@ Con: proposal would require the implementation complexity of converting into NFC in the formatting tool, it would not require the conversion cost to be paid during compilation. + - Caveat: a high-quality implementation may choose to accept this cost + anyway, in order to better recover from errors. Moreover, it is possible + to + [detect NFC on a fast path](http://unicode.org/reports/tr15/#NFC_QC_Optimization) + and do the conversion only when necessary. We could require no normalization form and compare identifiers by code point sequence. From 7dcfa95862b36b08d00ad6ca550b93173770a34f Mon Sep 17 00:00:00 2001 From: Richard Smith Date: Wed, 26 Aug 2020 15:00:16 -0700 Subject: [PATCH 05/10] Move "NFC versus normalize it ourselves" to an open question. Add a section on homoglyph attacks, demonstrating how the problem could be solved under the rules proposed here. --- proposals/p0142.md | 92 ++++++++++++++++++++++++++++++++++------------ 1 file changed, 68 insertions(+), 24 deletions(-) diff --git a/proposals/p0142.md b/proposals/p0142.md index 9767ee1ec9260..6b650397d0593 100644 --- a/proposals/p0142.md +++ b/proposals/p0142.md @@ -19,7 +19,9 @@ SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception - [Character encoding](#character-encoding) - [Source files](#source-files) - [Normalization](#normalization) + - [Open question](#open-question) - [Characters in identifiers and whitespace](#characters-in-identifiers-and-whitespace) + - [Homoglyphs](#homoglyphs) - [Alternatives considered](#alternatives-considered) - [Character encoding](#character-encoding-1) - [Byte order marks](#byte-order-marks) @@ -113,7 +115,12 @@ The choice to require NFC is really three choices: equivalent to the character sequence that they decompose into. 
For a fixed-width font, a canonical normalization form is most likely to - consider characters to be the same if they look the same. + consider characters to be the same if they look the same. Unicode annexes + [UAX#15](https://www.unicode.org/reports/tr15/tr15-18.html#Programming%20Language%20Identifiers) + and + [UAX#31](https://www.unicode.org/reports/tr31/tr31-33.html#normalization_and_case) + both recommend the use of Normalization Form C for case-sensitive + identifiers in programming languages. 2. Composition: we use a composed normalization form rather than a decomposed normalization form. For example, `ō` is encooded as U+014D (LATIN SMALL @@ -125,12 +132,47 @@ The choice to require NFC is really three choices: 3. We require source files to be in our chosen form, rather than converting to that form as necessary. +#### Open question + +As an alternative to the rule proposed above, we could require no normalization +form, and normalize identifiers ourselves: + +```diff +-Carbon source files, outside comments and string literals, are required to be in +-Unicode Normalization Form C ("NFC"). The Carbon source formatting tool will +-convert source files to NFC as necessary to satisfy this constraint. ++Carbon source files, outside comments and string literals, are converted to ++Unicode Normalization Form C ("NFC"). The Carbon source formatting tool should ++also convert identifiers in source files to NFC. +``` + +Pro: + +- We would treat source text identically regardless of the normalization form. +- Developers would not be responsible for ensuring that their editing + environment produces and preserves the proper normalization form. + +Con: + +- There is substantially more implementation cost involved in normalizing + identifiers than in detecting whether they are in normal form. 
While this
+  proposal would require the implementation complexity of converting into NFC
+  in the formatting tool, it would not require the conversion cost to be paid
+  during compilation.
+
+A high-quality implementation may choose to accept this cost anyway, in order to
+better recover from errors. Moreover, it is possible to
+[detect NFC on a fast path](http://unicode.org/reports/tr15/#NFC_QC_Optimization)
+and do the conversion only when necessary. However, if non-canonical source is
+formally valid, there are more stringent performance constraints on such
+conversion than if it is only done for error recovery.

 ### Characters in identifiers and whitespace

-We will largely follow Unicode Annex 31 in our selection of identifier and
-whitespace characters. This Annex does not provide specific rules on lexical
-syntax, instead providing a framework that permits a selection of choices of
-concrete rules.
+We will largely follow [Unicode Annex 31](https://www.unicode.org/reports/tr31/)
+in our selection of identifier and whitespace characters. This Annex does not
+provide specific rules on lexical syntax, instead providing a framework that
+permits a selection of choices of concrete rules.

 The framework provided by Annex 31 includes suggested sets of characters that
 may appear in identifiers, including uppercase and lowercase ASCII letters,
@@ -150,6 +192,27 @@ cases where we wish to take a different direction than that of the Annex.
 However, we should use Annex 31 as a basis for our decisions, and should expect
 strong justification for deviations from it.

+#### Homoglyphs
+
+The sets of identifier characters suggested by Annex 31's `ID_Start` /
+`XID_Start` / `ID_Continue` / `XID_Continue` characters include many pairs of
+homoglyphs and near-homoglyphs -- characters that would be interpreted
+differently but may render identically or very similarly.
This problem would +also be present if we restricted the character set to ASCII -- for example, +`kBa11Offset` and `kBall0ffset` may be very hard to distinguish in some fonts -- +but there are many more ways to introduce such problems with the broader +identifier character set suggested by Annex 31. + +One way to handle this problem would be by adding a restriction to name lookup: +if a lookup for a name is performed in a scope and that lookup would have found +nothing, but there is a confusable identifier, as defined by +[UAX#39](http://www.unicode.org/reports/tr39/#Confusable_Detection), in the same +scope, the program is ill-formed. However, this idea is only provided as weak +guidance to future proposals and to demonstrate that UAX#31's approach is +compatible with at least one possible solution for the homoglyph problem. The +concrete rules for handling homoglyphs are considered out of scope for this +proposal. + ## Alternatives considered There are a number of different design choices we could make, as divergences @@ -224,25 +287,6 @@ Con: - NFC produces smaller encodings than NFD in all cases where they differ. -We could require no normalization form, and normalize identifiers ourselves. - -Pro: - -- We could treat source text identically regardless of the normalization form. - -Con: - -- There is substantially more implementation cost involved in normalizing - identifiers than in detecting whether they are in normal form. While this - proposal would require the implementation complexity of converting into NFC - in the formatting tool, it would not require the conversion cost to be paid - during compilation. - - Caveat: a high-quality implementation may choose to accept this cost - anyway, in order to better recover from errors. Moreover, it is possible - to - [detect NFC on a fast path](http://unicode.org/reports/tr15/#NFC_QC_Optimization) - and do the conversion only when necessary. 
- We could require no normalization form and compare identifiers by code point sequence. From 31ee9af6cb0d3836a0c0fa263b23fe940cc36000 Mon Sep 17 00:00:00 2001 From: Richard Smith Date: Wed, 26 Aug 2020 15:35:43 -0700 Subject: [PATCH 06/10] Add homoglyph detection to the list of things that might be easier in NFD. --- proposals/p0142.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/proposals/p0142.md b/proposals/p0142.md index 6b650397d0593..53a5cfe450f82 100644 --- a/proposals/p0142.md +++ b/proposals/p0142.md @@ -127,7 +127,8 @@ The choice to require NFC is really three choices: LETTER O WITH MACRON) in a composed form and as U+006F (LATIN SMALL LETTER O), U+0304 (COMBINING MACRON) in a decomposed form. The composed form results in smaller representations whenever the two differ, but the decomposed form - is a little easier for algorithmic processing (for example, typo correction). + is a little easier for algorithmic processing (for example, typo correction + and homoglyph detection). 3. We require source files to be in our chosen form, rather than converting to that form as necessary. @@ -267,8 +268,8 @@ Pro: - Normalization Form D is more uniform, in that characters are always maximally decomposed into combining characters; in NFC, characters may or may not be decomposed depending on whether a composed form is available. - - NFD may be more suitable for certain uses such as typo correction or - code completion. + - NFD may be more suitable for certain uses such as typo correction, + homoglyph detection, or code completion. Con: From 0763099cf151204ac0187c9e2462159a561674c8 Mon Sep 17 00:00:00 2001 From: Richard Smith Date: Thu, 3 Sep 2020 15:30:32 -0700 Subject: [PATCH 07/10] Expand and clarify discussion of the open question regarding use of NFC. 
--- proposals/p0142.md | 33 ++++++++++++++++++++------------- 1 file changed, 20 insertions(+), 13 deletions(-) diff --git a/proposals/p0142.md b/proposals/p0142.md index 53a5cfe450f82..19592aad4bff9 100644 --- a/proposals/p0142.md +++ b/proposals/p0142.md @@ -147,13 +147,7 @@ form, and normalize identifiers ourselves: +also convert identifiers in source files to NFC. ``` -Pro: - -- We would treat source text identically regardless of the normalization form. -- Developers would not be responsible for ensuring that their editing - environment produces and preserves the proper normalization form. - -Con: +Arguments in favor of requiring pre-normalized inputs (as proposed): - There is substantially more implementation cost involved in normalizing identifiers than in detecting whether they are in normal form. While this @@ -161,12 +155,25 @@ Con: in the formatting tool, it would not require the conversion cost to be paid during compilation. -A high-quality implementation may choose to accept this cost anyway, in order to -better recover from errors. Moreover, it is possible to -[detect NFC on a fast path](http://unicode.org/reports/tr15/#NFC_QC_Optimization) -and do the conversion only when necessary. However, if non-canonical source is -formally valid, there are more stringent performance constraints on such -conversion than if it is only done for error recovery. + A high-quality implementation may choose to accept this cost anyway, in + order to better recover from errors. Moreover, it is possible to + [detect NFC on a fast path](http://unicode.org/reports/tr15/#NFC_QC_Optimization) + and do the conversion only when necessary. However, if non-canonical source + is formally valid, there are more stringent performance constraints on such + conversion than if it is only done for error recovery. + +- Tools such as `grep` do not perform normalization themselves, and so would + be unreliable when applied to a codebase with inconsistent normalization. 
+- GCC already diagnoses identifiers that are not in NFC, and WG21 is in the
+  process of adopting an NFC requirement for C++ identifiers, so development
+  environments should be expected to increasingly accommodate production of
+  text in NFC.
+
+Arguments in favor of performing normalization ourselves (alternative):
+
+- We would treat source text identically regardless of the normalization form.
+- Developers would not be responsible for ensuring that their editing
+  environment produces and preserves the proper normalization form.

 ### Characters in identifiers and whitespace

From a1cdba04d5d4a5be4b70ee3d5e7645b5bb5134b2 Mon Sep 17 00:00:00 2001
From: Richard Smith
Date: Tue, 6 Oct 2020 19:12:43 -0700
Subject: [PATCH 08/10] Respond to feedback from discussion:

- Based on established consensus, apply normalization restriction to the
  entire source file, including comments and string literals.
- Move 'open question' to 'alternatives considered' based on consensus that we
  want to require pre-normalized source files.
--- proposals/p0142.md | 108 ++++++++++++++++++++++++++------------------- 1 file changed, 63 insertions(+), 45 deletions(-) diff --git a/proposals/p0142.md b/proposals/p0142.md index 19592aad4bff9..5df9aec1f31f0 100644 --- a/proposals/p0142.md +++ b/proposals/p0142.md @@ -19,7 +19,6 @@ SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception - [Character encoding](#character-encoding) - [Source files](#source-files) - [Normalization](#normalization) - - [Open question](#open-question) - [Characters in identifiers and whitespace](#characters-in-identifiers-and-whitespace) - [Homoglyphs](#homoglyphs) - [Alternatives considered](#alternatives-considered) @@ -95,11 +94,11 @@ Background: - [wikipedia article on Unicode normal forms](https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms) - [Unicode Standard Annex #15: Unicode Normalization Forms](https://www.unicode.org/reports/tr15/tr15-50.html) -Carbon source files, outside comments and string literals, are required to be in -Unicode Normalization Form C ("NFC"). The Carbon source formatting tool will +Carbon source files, including comments and string literals, are required to be +in Unicode Normalization Form C ("NFC"). The Carbon source formatting tool will convert source files to NFC as necessary to satisfy this constraint. -The choice to require NFC is really three choices: +The choice to require NFC is really four choices: 1. Equivalence classes: we use a canonical normalization form rather than a compatibility normalization form or no normalization form at all. @@ -133,47 +132,9 @@ The choice to require NFC is really three choices: 3. We require source files to be in our chosen form, rather than converting to that form as necessary. 
-#### Open question - -As an alternative to the rule proposed above, we could require no normalization -form, and normalize identifiers ourselves: - -```diff --Carbon source files, outside comments and string literals, are required to be in --Unicode Normalization Form C ("NFC"). The Carbon source formatting tool will --convert source files to NFC as necessary to satisfy this constraint. -+Carbon source files, outside comments and string literals, are converted to -+Unicode Normalization Form C ("NFC"). The Carbon source formatting tool should -+also convert identifiers in source files to NFC. -``` - -Arguments in favor of requiring pre-normalized inputs (as proposed): - -- There is substantially more implementation cost involved in normalizing - identifiers than in detecting whether they are in normal form. While this - proposal would require the implementation complexity of converting into NFC - in the formatting tool, it would not require the conversion cost to be paid - during compilation. - - A high-quality implementation may choose to accept this cost anyway, in - order to better recover from errors. Moreover, it is possible to - [detect NFC on a fast path](http://unicode.org/reports/tr15/#NFC_QC_Optimization) - and do the conversion only when necessary. However, if non-canonical source - is formally valid, there are more stringent performance constraints on such - conversion than if it is only done for error recovery. - -- Tools such as `grep` do not perform normalization themselves, and so would - be unreliable when applied to a codebase with inconsistent normalization. -- GCC already diagnoses identifiers that are not in NFC, and WG21 is in the - process of adopting an NFC requirement for C++ identifiers, so development - environments should be expected to increasingly accommodate production of - text in NFC. 
- -Arguments in favor of performing noralization ourselves (alternative): - -- We would treat source text identically regardless of the normalization form. -- Developers would not be responsible for ensuring that their editing - environment produces and preserves the proper normalization form. +4. We require that the entire contents of the file be normalized, rather than + restricting our attention to only identifiers, or only identifiers and string + literals. ### Characters in identifiers and whitespace @@ -308,3 +269,60 @@ Con: - Different representations of the same character may result in different identifiers, in a way that is likely to be invisible in most programming environments. + +We could require no normalization form, and normalize the source code ourselves: + +Pro: + +- We would treat source text identically regardless of the normalization form. +- Developers would not be responsible for ensuring that their editing + environment produces and preserves the proper normalization form. + +Con: + +- There is substantially more implementation cost involved in normalizing + identifiers than in detecting whether they are in normal form. While this + proposal would require the implementation complexity of converting into NFC + in the formatting tool, it would not require the conversion cost to be paid + during compilation. + + A high-quality implementation may choose to accept this cost anyway, in + order to better recover from errors. Moreover, it is possible to + [detect NFC on a fast path](http://unicode.org/reports/tr15/#NFC_QC_Optimization) + and do the conversion only when necessary. However, if non-canonical source + is formally valid, there are more stringent performance constraints on such + conversion than if it is only done for error recovery. + +- Tools such as `grep` do not perform normalization themselves, and so would + be unreliable when applied to a codebase with inconsistent normalization. 
+- GCC already diagnoses identifiers that are not in NFC, and WG21 is in the
+  process of adopting an NFC requirement for C++ identifiers, so development
+  environments should be expected to increasingly accommodate production of
+  text in NFC.
+- The byte representation of a source file may be unstable if different
+  editing environments make different normalization choices, creating problems
+  for revision control systems, patch files, and the like.
+- Normalizing the contents of string literals, rather than using their
+  contents unaltered, will introduce a risk of user surprise.
+
+We could require only identifiers, or only identifiers and string literals, to
+be normalized, rather than the entire input file.
+
+Pro:
+
+- This would provide more freedom in comments to use arbitrary text.
+- String literals could contain intentionally non-normalized text in order to
+  represent non-normalized strings.
+
+Con:
+
+- Within string literals, this would result in invisible semantic differences:
+  strings that render identically can have different meanings.
+- The semantics of the program could vary if its sources are normalized, which
+  an editing environment might do invisibly and automatically.
+- If an editing environment were to automatically normalize text, it would
+  introduce spurious diffs into changes.
+- We would need to be careful to ensure that no string or comment delimiter
+  ends with a code point sequence that is a prefix of a decomposition of
+  another code point, otherwise different normalizations of the same source
+  file could tokenize differently.
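The fast-path argument in the Con list above (detect NFC cheaply, convert only on the slow path) can be sketched with Python's standard `unicodedata` module. This is purely illustrative: the helper name `ensure_nfc` is invented for the example and is not part of the proposal or of any Carbon implementation.

```python
import unicodedata

def ensure_nfc(source: str) -> str:
    """Return `source` in NFC, taking a detection-only fast path."""
    # Fast path: is_normalized() implements the NFC quick-check, so
    # already-normalized input is verified without building a copy.
    if unicodedata.is_normalized("NFC", source):
        return source
    # Slow path: only reached for non-NFC input, e.g. when a formatting
    # tool converts a file to satisfy the NFC requirement.
    return unicodedata.normalize("NFC", source)

# 'o' followed by U+0304 COMBINING MACRON is not NFC; NFC recomposes it
# to the single code point U+014D LATIN SMALL LETTER O WITH MACRON.
decomposed = "o\u0304"
assert not unicodedata.is_normalized("NFC", decomposed)
assert ensure_nfc(decomposed) == "\u014d"
```

A compiler that rejects non-NFC input (as proposed) would replace the slow path with a diagnostic, paying only the quick-check cost on valid programs.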
From ed03f2475d0378a6066a812d1b5b229cc9a952f9 Mon Sep 17 00:00:00 2001
From: Richard Smith
Date: Tue, 13 Oct 2020 09:23:21 -0700
Subject: [PATCH 09/10] Apply editorial simplification from code review

Co-authored-by: austern
---
 proposals/p0142.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/proposals/p0142.md b/proposals/p0142.md
index 5df9aec1f31f0..e933395b9a082 100644
--- a/proposals/p0142.md
+++ b/proposals/p0142.md
@@ -59,8 +59,8 @@ Unicode Annex 31.

 ### Character encoding

-Before being divided into tokens, a program starts as a sequence of characters.
-Those characters are a sequence of Unicode code points -- integer values between
+Before being divided into tokens, a program starts as a sequence of
+Unicode code points -- integer values between
 0 and 10FFFF<sub>16</sub> -- whose meaning as characters or non-characters is
 defined by the Unicode standard.

From 1022aa682ece1985d12fffd8b39045e34fde607f Mon Sep 17 00:00:00 2001
From: Richard Smith
Date: Wed, 28 Oct 2020 17:17:38 -0700
Subject: [PATCH 10/10] Add some links as requested by review comments.

---
 proposals/p0142.md | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/proposals/p0142.md b/proposals/p0142.md
index e933395b9a082..d55b69be43b0b 100644
--- a/proposals/p0142.md
+++ b/proposals/p0142.md
@@ -59,10 +59,9 @@ Unicode Annex 31.

 ### Character encoding

-Before being divided into tokens, a program starts as a sequence of
-Unicode code points -- integer values between
-0 and 10FFFF<sub>16</sub> -- whose meaning as characters or non-characters is
-defined by the Unicode standard.
+Before being divided into tokens, a program starts as a sequence of Unicode code
+points -- integer values between 0 and 10FFFF<sub>16</sub> -- whose meaning as
+characters or non-characters is defined by the Unicode standard.

 Carbon is based on Unicode 13.0, which is currently the latest version of the
 Unicode standard. 
Newer versions should be considered for adoption as they are
@@ -121,6 +120,8 @@ The choice to require NFC is really four choices:
    both recommend the use of Normalization Form C for case-sensitive
    identifiers in programming languages.

+   See also the discussion of [homoglyphs](#homoglyphs) below.
+
 2. Composition: we use a composed normalization form rather than a decomposed
    normalization form. For example, `ō` is encoded as U+014D (LATIN SMALL
    LETTER O WITH MACRON) in a composed form and as U+006F (LATIN SMALL LETTER
@@ -161,6 +162,9 @@ cases where we wish to take a different direction than that of the Annex.
 However, we should use Annex 31 as a basis for our decisions, and should expect
 strong justification for deviations from it.

+Note that this aligns with the current direction for C++, as described in WG21
+paper [P1949R6](http://wg21.link/P1949R6).
+
 #### Homoglyphs

 The sets of identifier characters suggested by Annex 31's `ID_Start` /
@@ -296,9 +300,10 @@ Con:

 - Tools such as `grep` do not perform normalization themselves, and so would
   be unreliable when applied to a codebase with inconsistent normalization.
 - GCC already diagnoses identifiers that are not in NFC, and WG21 is in the
-  process of adopting an NFC requirement for C++ identifiers, so development
-  environments should be expected to increasingly accommodate production of
-  text in NFC.
+  process of adopting an
+  [NFC requirement for C++ identifiers](http://wg21.link/P1949R6), so
+  development environments should be expected to increasingly accommodate
+  production of text in NFC.
 - The byte representation of a source file may be unstable if different
   editing environments make different normalization choices, creating problems
   for revision control systems, patch files, and the like.
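To make the composition and homoglyph points above concrete: NFC equates different encodings of the same character, but deliberately does not unify distinct characters that merely render alike. The following sketch uses Python's standard `unicodedata` module for illustration only; it is not part of the proposal.

```python
import unicodedata

# NFC equates two spellings of the same character: `ō` typed as one code
# point, and as 'o' followed by a combining macron, normalize identically.
composed = "\u014d"     # U+014D LATIN SMALL LETTER O WITH MACRON
decomposed = "o\u0304"  # U+006F + U+0304 COMBINING MACRON
assert unicodedata.normalize("NFC", decomposed) == composed

# NFC does not unify homoglyphs: Latin 'a' and Cyrillic 'а' render alike
# but remain distinct code points after normalization. This is why
# homoglyph handling is a separate concern from the NFC requirement.
latin_a = "a"           # U+0061 LATIN SMALL LETTER A
cyrillic_a = "\u0430"   # U+0430 CYRILLIC SMALL LETTER A
assert unicodedata.normalize("NFC", latin_a) != unicodedata.normalize("NFC", cyrillic_a)
```

In other words, normalization removes invisible encoding differences within one character, while homoglyph confusion between different characters must be addressed by other means (such as the `ID_Start`/`ID_Continue` restrictions discussed under "Homoglyphs").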