Proposal: Unicode source files #142
# Unicode source files

<!--
Part of the Carbon Language project, under the Apache License v2.0 with LLVM
Exceptions. See /LICENSE for license information.
SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-->

[Pull request](https://github.com/carbon-language/carbon-lang/pull/142)

## Table of contents

<!-- toc -->

- [Problem](#problem)
- [Background](#background)
- [Proposal](#proposal)
- [Details](#details)
  - [Character encoding](#character-encoding)
  - [Source files](#source-files)
  - [Normalization](#normalization)
  - [Characters in identifiers and whitespace](#characters-in-identifiers-and-whitespace)
    - [Homoglyphs](#homoglyphs)
- [Alternatives considered](#alternatives-considered)
  - [Character encoding](#character-encoding-1)
  - [Byte order marks](#byte-order-marks)
  - [Normalization forms](#normalization-forms)

<!-- tocstop -->

## Problem

Portable use and maintenance of Carbon source files requires a common
understanding of how they are encoded on disk. Further, the decisions as to
which characters are valid in names and what constitutes whitespace are a
complex area in which we do not expect to have local expertise.

## Background

[Unicode](https://www.unicode.org/versions/latest/) is a universal character
encoding, maintained by the
[Unicode Consortium](https://home.unicode.org/basic-info/overview/). It is the
canonical encoding used for textual information interchange across all modern
technology.

The [Unicode Standard Annex 31](https://www.unicode.org/reports/tr31/), "Unicode
Identifier and Pattern Syntax", provides recommendations for the use of Unicode
in the definitions of general-purpose identifiers.

## Proposal

Carbon programs are represented as a sequence of Unicode code points. Carbon
source files are encoded in UTF-8.

Carbon will follow lexical conventions for identifiers and whitespace based on
Unicode Annex 31.

## Details

### Character encoding

Before being divided into tokens, a program starts as a sequence of Unicode code
points -- integer values between 0 and 10FFFF<sub>16</sub> -- whose meaning as
characters or non-characters is defined by the Unicode standard.

Carbon is based on Unicode 13.0, which is currently the latest version of the
Unicode standard. Newer versions should be considered for adoption as they are
released.

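As a concrete illustration of this model (a Python sketch for exposition only;
nothing in this proposal prescribes an implementation language), decoding UTF-8
input yields exactly such a sequence of integer code point values:

```python
# Sketch: an implementation conceptually turns raw bytes into a sequence of
# Unicode code points (integers in 0..0x10FFFF) before any lexing happens.
raw = b"caf\xc3\xa9"                  # UTF-8 bytes for "café"
text = raw.decode("utf-8")            # bytes -> sequence of code points
code_points = [ord(c) for c in text]

assert code_points == [0x63, 0x61, 0x66, 0xE9]   # 'c' 'a' 'f' U+00E9
assert all(0 <= cp <= 0x10FFFF for cp in code_points)
```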
### Source files

Program text can come from a variety of sources, such as an interactive
programming environment (a so-called "Read-Evaluate-Print-Loop" or REPL), a
database, a memory buffer of an IDE, or a command-line argument.

The canonical representation for Carbon programs is in files stored as a
sequence of bytes in a file system on disk, and such files are expected to be
encoded in UTF-8. Such files may begin with an optional UTF-8 BOM, that is, the
byte sequence EF<sub>16</sub>,BB<sub>16</sub>,BF<sub>16</sub>. This prefix, if
present, is ignored.

Regardless of how program text is concretely stored, the first step in
processing any such text is to convert it to a sequence of Unicode code points
-- although such conversion may be purely notional. The result of this
conversion is a Carbon _source file_. Depending on the needs of the language, we
may require each such source file to have an associated file name, even if the
source file does not originate in anything resembling a file system.

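A minimal sketch of the BOM rule, assuming a hypothetical `read_source` helper
(not part of the proposal): decode UTF-8 after dropping an optional leading BOM.

```python
# Hypothetical helper: load a source file as text, ignoring an optional
# UTF-8 BOM (the byte sequence EF BB BF) as this proposal requires.
UTF8_BOM = b"\xef\xbb\xbf"

def read_source(data: bytes) -> str:
    if data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]  # the BOM prefix, if present, is ignored
    return data.decode("utf-8")      # raises UnicodeDecodeError on bad UTF-8

assert read_source(b"\xef\xbb\xbffn main()") == "fn main()"
assert read_source(b"fn main()") == "fn main()"
```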
### Normalization

Background:

- [Wikipedia article on Unicode normal forms](https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms)
- [Unicode Standard Annex #15: Unicode Normalization Forms](https://www.unicode.org/reports/tr15/tr15-50.html)

Carbon source files, including comments and string literals, are required to be
in Unicode Normalization Form C ("NFC"). The Carbon source formatting tool will
convert source files to NFC as necessary to satisfy this constraint.

The choice to require NFC is really four choices:

1. Equivalence classes: we use a canonical normalization form rather than a
   compatibility normalization form or no normalization form at all.

   - If we use no normalization, invisibly-different ways of representing the
     same glyph, such as with pre-combined diacritics versus with diacritics
     expressed as separate combining characters, or with combining characters
     in a different order, would be considered different characters.
   - If we use a canonical normalization form, all ways of encoding diacritics
     are considered to form the same character, but ligatures such as `ffi` are
     considered distinct from the character sequence that they decompose into.
   - If we use a compatibility normalization form, ligatures are considered
     equivalent to the character sequence that they decompose into.

   For a fixed-width font, a canonical normalization form is most likely to
   consider characters to be the same if they look the same. Unicode annexes
   [UAX#15](https://www.unicode.org/reports/tr15/tr15-18.html#Programming%20Language%20Identifiers)
   and
   [UAX#31](https://www.unicode.org/reports/tr31/tr31-33.html#normalization_and_case)
   both recommend the use of Normalization Form C for case-sensitive
   identifiers in programming languages.

   See also the discussion of [homoglyphs](#homoglyphs) below.

2. Composition: we use a composed normalization form rather than a decomposed
   normalization form. For example, `ō` is encoded as U+014D (LATIN SMALL
   LETTER O WITH MACRON) in a composed form and as U+006F (LATIN SMALL LETTER
   O), U+0304 (COMBINING MACRON) in a decomposed form. The composed form
   results in smaller representations whenever the two differ, but the
   decomposed form is a little easier for algorithmic processing (for example,
   typo correction and homoglyph detection).

3. We require source files to be in our chosen form, rather than converting to
   that form as necessary.

4. We require that the entire contents of the file be normalized, rather than
   restricting our attention to only identifiers, or only identifiers and
   string literals.

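The composed/decomposed distinction in item 2 can be demonstrated with Python's
standard `unicodedata` module (an illustration only, using the `ō` example from
the text):

```python
import unicodedata

# "ō" in composed (NFC) form is the single code point U+014D; in decomposed
# (NFD) form it is U+006F followed by U+0304 COMBINING MACRON. The two are
# canonically equivalent, so NFC maps both spellings to the same sequence.
nfc = unicodedata.normalize("NFC", "o\u0304")
nfd = unicodedata.normalize("NFD", "\u014d")

assert nfc == "\u014d"       # one code point: LATIN SMALL LETTER O WITH MACRON
assert nfd == "o\u0304"      # two code points: base letter + combining mark
assert nfc == unicodedata.normalize("NFC", nfd)  # equivalence under NFC
```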
### Characters in identifiers and whitespace

We will largely follow [Unicode Annex 31](https://www.unicode.org/reports/tr31/)
in our selection of identifier and whitespace characters. This Annex does not
provide specific rules on lexical syntax, instead providing a framework that
permits a selection of choices of concrete rules.

The framework provided by Annex 31 includes suggested sets of characters that
may appear in identifiers, including uppercase and lowercase ASCII letters,
along with reasonable extensions to many non-ASCII letters, with some
characters restricted from appearing as the first character. For example, this
list includes U+30EA (KATAKANA LETTER RI), but not U+2603 (SNOWMAN), both of
which are permitted in identifiers in C++20. Similarly, it indicates which
characters should be classified as whitespace, including all the ASCII
whitespace characters plus some non-ASCII whitespace characters. It also
supports language-specific "profiles" to alter these baseline character sets
for the needs of a particular language -- for instance, to permit underscores
in identifiers, or to include non-breaking spaces as whitespace characters.

This proposal does not specify concrete choices for lexical rules, nor does it
commit to strict conformance to Annex 31 in every area. We may find cases where
we wish to take a different direction than that of the Annex. However, we
should use Annex 31 as a basis for our decisions, and should expect strong
justification for deviations from it.

Note that this aligns with the current direction for C++, as described in WG21
paper [P1949R6](http://wg21.link/P1949R6).

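For illustration, Python's own identifier rules are an existing Annex 31-style
profile (`XID_Start`/`XID_Continue`, extended to permit a leading underscore),
so `str.isidentifier()` can show the kind of classification the Annex suggests,
including the KATAKANA LETTER RI / SNOWMAN example above:

```python
# Illustration only: Python's str.isidentifier() implements a UAX #31-style
# profile, similar in spirit to what this proposal suggests for Carbon.
assert "リスト".isidentifier()       # U+30EA KATAKANA LETTER RI may start a name
assert not "☃snow".isidentifier()   # U+2603 SNOWMAN is not an identifier char
assert not "1abc".isidentifier()    # digits may continue but not start a name
assert "_count1".isidentifier()     # "_" allowed by Python's profile extension
```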
#### Homoglyphs

The sets of identifier characters suggested by Annex 31's `ID_Start` /
`XID_Start` / `ID_Continue` / `XID_Continue` characters include many pairs of
homoglyphs and near-homoglyphs -- characters that would be interpreted
differently but may render identically or very similarly. This problem would
also be present if we restricted the character set to ASCII -- for example,
`kBa11Offset` and `kBall0ffset` may be very hard to distinguish in some fonts --
but there are many more ways to introduce such problems with the broader
identifier character set suggested by Annex 31.

One way to handle this problem would be by adding a restriction to name lookup:
if a lookup for a name is performed in a scope and that lookup would have found
nothing, but there is a confusable identifier, as defined by
[UAX#39](http://www.unicode.org/reports/tr39/#Confusable_Detection), in the same
scope, the program is ill-formed. However, this idea is only provided as weak
guidance to future proposals and to demonstrate that UAX#31's approach is
compatible with at least one possible solution for the homoglyph problem. The
concrete rules for handling homoglyphs are considered out of scope for this
proposal.

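A small sketch of the hazard, using Python's `unicodedata` for illustration: a
Latin/Cyrillic homoglyph pair remains distinct even after NFC normalization, so
normalization alone does not address confusability:

```python
import unicodedata

# "а" below is U+0430 CYRILLIC SMALL LETTER A, which typically renders
# identically to U+0061 LATIN SMALL LETTER A. The two identifiers differ as
# code point sequences, and canonical normalization does not unify them.
latin = "apple"
mixed = "\u0430pple"

assert latin != mixed
assert unicodedata.normalize("NFC", latin) != unicodedata.normalize("NFC", mixed)
```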
## Alternatives considered

There are a number of different design choices we could make, as divergences
from the above proposal. Those choices, along with the arguments that led to
choosing the proposed design rather than each alternative, are presented below.

### Character encoding

We could restrict programs to ASCII.

Pro:

- Reduced implementation complexity.
- Avoids all problems relating to normalization, homoglyphs, text
  directionality, and so on.
- We have no intention of using non-ASCII characters in the language syntax or
  in any library name.
- Provides assurance that all names in libraries can reliably be typed by all
  developers -- we already require that keywords, and thus all ASCII letters,
  can be typed.

Con:

- An overarching goal of the Carbon project is to provide a language that is
  inclusive and welcoming. A language that does not permit names and comments
  in programs to be expressed in the developer's native language will not meet
  that goal for at least some of our developers.
- Quoted strings will be substantially less readable if non-ASCII printable
  characters are required to be written as escape sequences.

### Byte order marks

We could disallow byte order marks.

Pro:

- Marginal implementation simplicity.

Con:

- Several major editors, particularly on the Windows platform, insert UTF-8
  BOMs and use them to identify file encoding.

### Normalization forms

We could require a different normalization form.

Pro:

- Some environments might more naturally produce a different normalization
  form.
  - Normalization Form D is more uniform, in that characters are always
    maximally decomposed into combining characters; in NFC, characters may or
    may not be decomposed depending on whether a composed form is available.
  - NFD may be more suitable for certain uses such as typo correction,
    homoglyph detection, or code completion.

Con:

- The C++ standard and community is moving towards using NFC:

  - WG21 is in the process of adopting an NFC requirement for C++
    identifiers.
  - GCC warns on C++ identifiers that aren't in NFC.

  As a consequence, we should expect that the tooling and development
  environments that C++ developers are using will provide good support for
  authoring NFC-encoded source files.

- The W3C recommends using NFC for all content, so code samples distributed on
  webpages may be canonicalized into NFC by some web authoring tools.
- NFC produces smaller encodings than NFD in all cases where they differ.

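The size claim can be spot-checked with Python's `unicodedata` (illustrative
only; "élève" is an arbitrary example word with two precomposed characters):

```python
import unicodedata

# NFC keeps "é" and "è" as single precomposed code points (2 UTF-8 bytes
# each); NFD splits each into a base letter plus a combining accent (3 bytes
# each), so the NFC encoding is strictly smaller here.
word = "\u00e9l\u00e8ve"  # "élève" in NFC
nfc_bytes = unicodedata.normalize("NFC", word).encode("utf-8")
nfd_bytes = unicodedata.normalize("NFD", word).encode("utf-8")

assert len(nfc_bytes) < len(nfd_bytes)
```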
We could require no normalization form and compare identifiers by code point
sequence.

Pro:

- This is the rule in use in C++20 and before.

Con:

- This is not the rule planned for the near future of C++.
- Different representations of the same character may result in different
  identifiers, in a way that is likely to be invisible in most programming
  environments.

We could require no normalization form, and normalize the source code ourselves:

Pro:

- We would treat source text identically regardless of the normalization form.
- Developers would not be responsible for ensuring that their editing
  environment produces and preserves the proper normalization form.

Con:

- There is substantially more implementation cost involved in normalizing
  identifiers than in detecting whether they are in normal form. While this
  proposal would require the implementation complexity of converting into NFC
  in the formatting tool, it would not require the conversion cost to be paid
  during compilation.

  A high-quality implementation may choose to accept this cost anyway, in
  order to better recover from errors. Moreover, it is possible to
  [detect NFC on a fast path](http://unicode.org/reports/tr15/#NFC_QC_Optimization)
  and do the conversion only when necessary. However, if non-canonical source
  is formally valid, there are more stringent performance constraints on such
  conversion than if it is only done for error recovery.

- Tools such as `grep` do not perform normalization themselves, and so would
  be unreliable when applied to a codebase with inconsistent normalization.
- GCC already diagnoses identifiers that are not in NFC, and WG21 is in the
  process of adopting an
  [NFC requirement for C++ identifiers](http://wg21.link/P1949R6), so
  development environments should be expected to increasingly accommodate
  production of text in NFC.
- The byte representation of a source file may be unstable if different
  editing environments make different normalization choices, creating problems
  for revision control systems, patch files, and the like.
- Normalizing the contents of string literals, rather than using their
  contents unaltered, will introduce a risk of user surprise.

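The detect-rather-than-convert strategy can be sketched with Python's
`unicodedata.is_normalized` (available since Python 3.8), which uses the
UAX #15 quick-check optimization internally; `check_source` here is a
hypothetical helper, not part of the proposal:

```python
import unicodedata

# Sketch: cheaply verify that input is already NFC (the common case) and
# reject it otherwise, instead of paying to normalize during compilation.
def check_source(text: str) -> str:
    if unicodedata.is_normalized("NFC", text):
        return text                            # fast path: already NFC
    raise ValueError("source file is not in NFC")

assert check_source("caf\u00e9") == "caf\u00e9"  # precomposed é: accepted
try:
    check_source("cafe\u0301")                   # e + combining acute: rejected
except ValueError:
    pass
```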
We could require only identifiers, or only identifiers and comments, to be
normalized, rather than the entire input file.

Pro:

- This would provide more freedom in comments to use arbitrary text.
- String literals could contain intentionally non-normalized text in order to
  represent non-normalized strings.

Con:

- Within string literals, this would result in invisible semantic differences:
  strings that render identically can have different meanings.
- The semantics of the program could vary if its sources are normalized, which
  an editing environment might do invisibly and automatically.
- If an editing environment were to automatically normalize text, it would
  introduce spurious diffs into changes.
- We would need to be careful to ensure that no string or comment delimiter
  ends with a code point sequence that is a prefix of a decomposition of
  another code point, otherwise different normalizations of the same source
  file could tokenize differently.