diff --git a/proposals/p1964.md b/proposals/p1964.md new file mode 100644 index 0000000000000..35a3d77fe8842 --- /dev/null +++ b/proposals/p1964.md @@ -0,0 +1,407 @@ +# Character literals + + + +[Pull request](https://github.com/carbon-language/carbon-lang/pull/1964) + + + +## Table of contents + +- [Abstract](#abstract) +- [Problem](#problem) +- [Background](#background) +- [Proposal](#proposal) +- [Details](#details) + - [Types](#types) + - [Operations](#operations) +- [Rationale](#rationale) +- [Alternatives considered](#alternatives-considered) + - [No distinct character types](#no-distinct-character-types) + - [No distinct character literal](#no-distinct-character-literal) + - [Supporting prefix declarations](#supporting-prefix-declarations) + - [Allowing numeric escape sequences](#allowing-numeric-escape-sequences) + - [Supporting formulations of grapheme clusters and non-code-point code-units](#supporting-formulations-of-grapheme-clusters-and-non-code-point-code-units) +- [Future Work](#future-work) + - [UTF code unit types proposal](#utf-code-unit-types-proposal) + + + +## Abstract + +This proposal specifies lexical rules for constant characters in Carbon: + +Put character literals in single quotes, like `'a'`. Character literals work +like numeric literals: + +- Every different literal value has its own type. +- The literal itself doesn't have a bit width as a consequence. Instead, + variables use explicitly sized character types and character literals can be + converted to these types when representable. +- A character literal must contain exactly one code point. + +Follows the plan from open design idea +[#1934: Character Literals](https://github.com/carbon-language/carbon-lang/issues/1934). + +## Problem + +Carbon currently has no lexical syntax for character literals, and only provides +string literals and numeric literals. We wish to provide a distinct lexical +syntax for character literals versus string literals. 
+ +The advantage of having an explicit character type fundamentally comes down to +characters being represented as integers whereas strings are represented as +buffers. This allows characters to support different operations and to be more +familiar to use. For example: + +``` +if (c >= 'A' and c <= 'Z') { + c += 'a' - 'A'; +} +``` + +The example above shows how characters can be used with operations similar to +those on integers. Supporting comparison and arithmetic operations provides an +intuitive approach to using characters. It also removes the type conversions +and extra control flow that would otherwise be needed to work with a +single-element string. See [Rationale](#rationale) for more examples where +characters are a better fit than strings. + +## Background + +A character literal is a kind of literal that represents a single character's +value within the source code of a computer program. Character literals differ +in minor ways between languages but are fundamentally designed for the same +purpose. Languages that have a dedicated character data type, such as C++, +Java, and Swift, generally include character literals. Other languages that +lack a distinct character type, such as Python, use strings of length one to +serve the same purpose as a character data type. For more information, see +[Character Literals Wiki](https://en.wikipedia.org/wiki/Character_literal) and +[Character Literals DBpedia](https://dbpedia.org/page/Character_literal). + +## Proposal + +Put character literals in single quotes, like `'a'`. Character literals work +like numeric literals: + +- Every different literal value has its own type. +- The literal itself doesn't have a bit width as a consequence. Instead, + variables use explicitly sized character types and character literals can be + converted to these types when representable. This follows the plan from #1934. 
+- A character literal models a single Unicode code point that has a single + concrete numerical representation. We will not support other formulations + like code unit sequences or grapheme clusters, as these will be modeled with + normal string literals. + +## Details + +- A character literal is a sequence of UTF-8 code units, enclosed in single + quote delimiters (`'`), that must form a valid encoding. This matches + [the UTF-8 encoding of Carbon source files](https://github.com/carbon-language/carbon-lang/blob/trunk/proposals/p0142.md#character-encoding). +- A character literal must encode exactly one code point. +- It supports addition and subtraction, [as described below](#operations). +- Character literals will support the relevant subset of the backslash (`\`) + escape sequences in string literals, including `\t`, `\n`, `\r`, `\"`, `\'`, + `\\`, `\0`, and `\u{HHHH...}`. See + [String Literals: Escape sequence](https://github.com/carbon-language/carbon-lang/blob/trunk/proposals/p0199.md#escape-sequences). + - Escape sequences which would result in non-UTF-8 encodings or more than + one code point are not included. + - The escape of an embedded newline is also excluded as it isn't expected + to be relevant for character literals. + +We will not support: + +- character literals that don't contain exactly one Unicode code point; +- multi-line literals; +- "raw" literals (using `#'x'#`); +- `\x` escape sequences; +- character literals with a single quote (`'`) or backslash (`\`), except as + part of an escape sequence; +- empty character literals (`''`); +- a backslash followed by an (unescaped) newline; +- ASCII control codes (0...31), including whitespace characters other than + word space (tab, line feed, carriage return, form feed, and vertical tab), + except when specified with an escape sequence. + +### Types + +For the time being, Carbon will support three character types: `Char8`, +`Char16`, and `Char32`. 
These types are capable of representing both code units +and code points. Support for different UTF-encoding code unit types will be +addressed in a separate proposal. Please refer to the +[UTF code unit types proposal](#utf-code-unit-types-proposal) for more +information on that topic. + +In Carbon, the type `CharN` represents a character, where `N` corresponds to the +bit size of the character type (`8`, `16`, or `32`). We will only allow +character literals that map directly to a complete value of a code point. Here +are examples of character literals for each specific type: + +- `Char8`: The character literal consists of a single Unicode code point that + is less than or equal to `0x7F`, so that it fits in a single UTF-8 code + unit. For example: + +`let allowed: Char8 = 'a';` + +In this example, the character literal `'a'` corresponds to the Unicode code +point `97`, which is within the valid range of `Char8` since `97` is less than +or equal to `0x7F`. + +- `Char16`: The character literal represents a Unicode code point that can be + represented within 16 bits. Here's an example: + +`let smiley: Char16 = '\u{263A}';` + +The character literal `'\u{263A}'` represents the white smiling face symbol, +which has the Unicode code point `9786`. Since `9786` is less than or equal to +`0xFFFF`, it can be represented within 16 bits and assigned to a variable of +type `Char16`. By contrast, an emoji such as `'\u{1F600}'` (code point +`128512`) does not fit within 16 bits and requires `Char32`. + +- `Char32`: This character type allows the representation of Unicode code + points within 32 bits. Here's an example: + +`let musicalNote: Char32 = '🎵';` + +In this case, the character literal `'🎵'` corresponds to the musical note emoji +with the Unicode code point `127925`. Since `127925` falls within the range that +can be represented by `Char32`, it can be assigned to a variable of type +`Char32`. + +By restricting character literals to those that can be directly mapped to code +points within the specific character types, we ensure accurate representation +and compatibility with the chosen character encoding scheme. 
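The representability rule described above can be modeled outside Carbon. The following is a minimal Python sketch, not Carbon code and not part of the proposal: the name `smallest_char_type` and the `CHAR_TYPE_LIMITS` table are hypothetical, the `0x7F` and `0xFFFF` bounds are the ones stated in this proposal, and treating `Char32` as covering every code point up to `0x10FFFF` is an assumption.

```python
# Hypothetical model of which CharN type a one-code-point literal converts to.
# Char8 and Char16 bounds come from the proposal text; the Char32 bound
# (the last Unicode code point) is an assumption of this sketch.
CHAR_TYPE_LIMITS = [("Char8", 0x7F), ("Char16", 0xFFFF), ("Char32", 0x10FFFF)]


def smallest_char_type(literal: str) -> str:
    """Return the narrowest CharN type that can hold the literal's code point."""
    if len(literal) != 1:
        raise ValueError("a character literal must contain exactly one code point")
    code_point = ord(literal)
    for type_name, limit in CHAR_TYPE_LIMITS:
        if code_point <= limit:
            return type_name
    raise ValueError("not a valid Unicode code point")


print(smallest_char_type("a"))           # code point 97
print(smallest_char_type("\u00e9"))      # code point 233
print(smallest_char_type("\U0001F3B5"))  # code point 127925
```

A literal that converts to a narrow type also converts to every wider one, so the sketch only reports the narrowest match.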
+ +### Operations + +Character literals representing a single code point support the following +operators: + +- Comparison: `<`, `>`, `<=`, `>=`, `==` +- Plus: `+`. This doesn't concatenate, but allows numerically adjusting the + value: + - Only one operand may be a character literal; the other must be an + integer literal. + - The result is the character literal whose numeric value is the sum of + the numeric values of the operands. If that sum is not a valid Unicode + code point, it is an error. +- Subtract: `-`. This will subtract the value of the two characters, or a + character followed by an integer literal: + - If the `-` is used between two character literals, the result will be an + integer constant. For example, `'z' - 'a'` is equivalent to `25`. + - If the `-` is used between a character literal followed by an integer + literal, this will produce a character constant. For example, `'z' - 4` + is equivalent to `'v'`. + - If the `-` is used between an integer literal followed by a character + literal, as in `100 - 'a'`, this will be rejected unless the integer is + cast to a character. + +There is intentionally no implicit conversion from character literals to integer +types, but explicit conversions are permitted between character literals and +integer types. Carbon will separate the integer types from character types +entirely. + +## Rationale + +This proposal supports the goal of making Carbon code +[easy to read, understand, and write](/docs/project/goals.md#code-that-is-easy-to-read-understand-and-write). +Adding support for a specific character literal supports clean, readable, +concise use and is a much more familiar concept that will make it easier to +adopt Carbon coming from other languages. Having a distinct character literal +will also allow us to support useful operations designed to manipulate the +literal's value. 
When working with an explicit character type we can use operators that +have unique behavior. For example, say we wanted to advance a character to the +next literal. In other languages the `+` operator is often used for +concatenation, so using a `String` will produce a type error: `"a" + 1`. However, +with a character literal, we can support operations for these use cases: + +``` +var b: Char8; + +b = 'a' + 1; +b + 1 == 'c'; +``` + +See [Operations](#operations) and +[No Distinct Character Literal](#no-distinct-character-literal) for more +information. + +Further, this design follows other standards set in place by previous proposals. +Examples include following the +[String Literals: Escaping Sequence](https://github.com/carbon-language/carbon-lang/blob/trunk/proposals/p0199.md#escape-sequences-1) +rules and representing characters as integers with behavior in line with +[Integer Literals](https://github.com/carbon-language/carbon-lang/blob/trunk/proposals/p0143.md). + +This also supports our goal for +[Interoperability with and migration from existing C++ code](/docs/project/goals.md#interoperability-with-and-migration-from-existing-c-code) +by ensuring that every kind of character literal that exists in C++ can be +represented in a Carbon character literal. This is done in a way that is natural +to adopt, understand, and read, by having explicit character types mapped to +the C++ character types and the correct associated encoding. + +Finally, the choice to use Unicode and UTF-8 by default reflects the Carbon goal +to prioritize +[modern OS platforms, hardware architectures, and environments](/docs/project/goals.md#modern-os-platforms-hardware-architectures-and-environments). +This reflects the +[growing adoption of UTF-8](https://en.wikipedia.org/wiki/UTF-8#Adoption). + +## Alternatives considered + +### No distinct character types + +Unlike C++, Carbon will separate the integer and the character types. 
We +considered using `u8`, `u16`, and `u32` instead of `Char8`, `Char16`, and +`Char32` to reduce the number of different types users need to be aware of and +convert between. We decided against it because it came with a number of +disadvantages: + +- `u8`, `u16`, and `u32` have the wrong arithmetic semantics: we don't want + wrapping, and many `uN` operations, like multiplication, division, and + shift, are not meaningful on code units. There may be rare cases where you + want to use those operations, such as if you're implementing a conversion to + or from code units. But in those rare cases it would be reasonable for the + user to convert to an integer type to perform that operation and convert + back when done. +- Some operations want to be able to tell the difference between values that + are intended to be UTF-8 and values that have no specified encoding. +- Some operations want to be able to know that they've been given text rather + than arbitrary bytes of data. For example, `Print(0x41 as u8)` would be + expected to print `"65"` while `Print('\u{41}')` and `Print(0x41 as Char8)` + would be expected to print `"A"`. +- It's useful for developers to document the intended meaning of a value, and + using a distinct type is one way to do that. + +See [UTF code unit types proposal](#utf-code-unit-types-proposal) for more +information about UTF encoding types, which are left to a future proposal. + +### No distinct character literal + +In principle, a character literal can be represented by reusing string literals, +similar to how Python handles character literals. However, this would prevent +performing operations on characters as integers. For example, the `+` operator +on strings is used for concatenation, but `+` on a character would change its +value. + +``` +// `digit` must be in the range 0..9. 
+fn DigitToChar(digit: i32) -> Char8 { + return '0' + digit; +} +``` + +Furthermore, many properties of Unicode characters are defined on ranges of code +points, motivating supporting comparison operators on code points. + +``` +fn IsDingBatCodePoint(c: Char32) -> bool { + return c >= '\u{2700}' and c <= '\u{27BF}'; +} +``` + +### Supporting prefix declarations + +No support is proposed for prefix declarations like `u`, `U`, or `L`. In +practice they are used to specify the character literal types and their encoding +in languages like C and C++. There are several benefits to omitting prefix +declarations: improved readability, and a simpler model for how a character's +type is determined and how character literals are encoded. When declaring a +character literal, the type is based on the contents of the character, so that +in `var c: Char8 = 'a'` the literal is a valid character that can be converted +to `Char8`. To support prefix declarations we would need to extend our type +system to have other explicit type checks like in C++: UTF-16 `u'`, UTF-32 +`U'`, and wide characters `L'`. This would be more familiar for individuals +coming to Carbon from a C++ background, and would simplify our approach for C++ +interoperability, but at the cost of diverging from existing standards. For +example, +[Proposal 142](https://github.com/carbon-language/carbon-lang/blob/trunk/proposals/p0142.md#character-encoding) +states all of Carbon source code should be UTF-8 encoded. Prefix declarations +would detract from the readability of character literals and increase the +complexity of character literal [Types](#types). + +### Allowing numeric escape sequences + +This proposal does not support numeric escape sequences using `\x`. This +simplifies the design of character types and literals, making them only +represent code points and not code units. However, this does come with the +disadvantage of less consistency of character literals with string literals, +since they now accept different escape sequences. 
We don't want to remove +numeric escape sequences from string literals, so that we can support string use +cases like representing invalid encodings. + +This approach has the additional concern that if character literals don't +support numeric escape sequences, developers may choose to use numeric literals +instead, at a cost of type-safety and readability. For example, it isn't clear +in `var first_digit: Char8 = 0;` whether `0` is supposed to be a `NUL` character +or the encoding of the `'0'` character (48). We addressed this concern, and type +safety concerns about distinguishing numbers and characters, by making the +integer to character conversions explicit. + +### Supporting formulations of grapheme clusters and non-code-point code-units + +Rather than explicitly limiting character literals to a more integer-like +representation of a single Unicode code point, we could allow character +literals to represent grapheme clusters and non-code-point code units. What +humans tend to think of as a "character" corresponds to a "grapheme cluster." +The encoding of a grapheme cluster can be arbitrarily long and complex, which +would sacrifice the ability to perform integer operations. If we wanted to add +support for other character formulations, we would need to use separate +spellings to represent a small set of operations that are today expressed with +integer-based math on C++'s character literals. This includes things like +converting an integer between 0 and 9 into the corresponding digit character, or +computing the difference between two digits or two other characters. For these +reasons, we have decided to start out by representing character literals as +single Unicode code points following a more integer-like model. However, this +topic should be revisited if we find that there is a significant need for the +additional functionality and attendant complexity for these other character +formulations. 
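To make the trade-off above concrete, here is a minimal Python sketch of the integer-like model, written in Python as a stand-in rather than in Carbon. The helpers `char_plus` and `char_minus` are hypothetical names for this illustration. They mimic the literal arithmetic described under Operations and show why a grapheme cluster does not fit that model.

```python
def char_plus(c: str, n: int) -> str:
    """Model of `'c' + n`: adjust a single code point numerically."""
    if len(c) != 1:
        raise ValueError("expected exactly one code point")
    value = ord(c) + n
    if not 0 <= value <= 0x10FFFF:
        raise ValueError("result is not a valid Unicode code point")
    return chr(value)


def char_minus(a: str, b: str) -> int:
    """Model of `'a' - 'b'`: the integer distance between two code points."""
    if len(a) != 1 or len(b) != 1:
        raise ValueError("expected exactly one code point per operand")
    return ord(a) - ord(b)


print(char_plus("0", 7))     # the digit character for 7
print(char_minus("z", "a"))  # 25, matching the `'z' - 'a'` example
print(char_plus("z", -4))    # v, matching the `'z' - 4` example

# A grapheme cluster can span several code points, so no single integer
# describes it: "e" followed by a combining acute accent renders as one glyph.
cluster = "e\u0301"
print(len(cluster))  # 2 code points
```

Passing `cluster` to either helper raises `ValueError`, which mirrors the proposal's choice to leave such values to string literals.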
+ +## Future Work + +### UTF code unit types proposal + +There have been several ideas and discussions around how we would like to handle +UTF code units. This section is intended to provide some guidance for a future +proposal, when the topic is revisited, on how we would like to build out +encoding and decoding for character literals. + +We will have the types `Char8`, `Char16`, and `Char32` representing code units +in UTF-8, UTF-16, and UTF-32. We will not support all code units, only those +which map directly to the complete value of a code point. However, +character literals will use their own types distinct from these: + +- We will support value preserving implicit conversions from character + literals to code point or code unit types. In particular, a character + literal converts to a `Char8` UTF-8 code unit if it is less than or equal to + 0x7F, and `Char16` UTF-16 code unit if it is less than or equal to 0xFFFF. +- Conversions from string or character literals to a non-value-preserving + encoding must be explicit. +- Conversions from string literals to Unicode strings are implicit, even + though the numeric values of the encoding may change. + +We can tell whether a particular literal is representable in the variable's type +by looking only at the types. + +``` +let allowed: Char8 = 'a'; +``` + +The above is allowed because the type of `'a'` is the character literal +consisting of the single Unicode code point 97, which can be converted to +`Char8` since 97 is less than or equal to 0x7F. + +``` +let error1: Char8 = '😃'; +let error2: Char8 = 'AB'; +``` + +However, these should produce errors. The type of `'😃'` is the character literal +consisting of the single Unicode code point `0x1F603`, which is greater than +0x7F. The type of `'AB'` is a character literal that is a sequence of two +Unicode code points, which has no conversion to a type that only handles a +single UTF-8 code unit. 
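The `0x7F` and `0xFFFF` thresholds above line up with how many code units each UTF encoding needs for a given code point. The following Python sketch (Python standing in for Carbon; `code_unit_count` is a hypothetical helper) counts code units with Python's built-in codecs. Under the rule sketched here, a literal would convert to a code unit type only when the count for that encoding is 1.

```python
def code_unit_count(ch: str, encoding: str, unit_bytes: int) -> int:
    """Number of code units `ch` occupies in `encoding` (unit size in bytes)."""
    return len(ch.encode(encoding)) // unit_bytes


for ch in ("a", "\u00e9", "\U0001F603"):
    counts = (
        code_unit_count(ch, "utf-8", 1),      # UTF-8 code units (Char8)
        code_unit_count(ch, "utf-16-le", 2),  # UTF-16 code units (Char16)
        code_unit_count(ch, "utf-32-le", 4),  # UTF-32 code units (Char32)
    )
    print(f"U+{ord(ch):04X}", counts)
```

Only `'a'` occupies a single UTF-8 code unit, while `'😃'` needs four UTF-8 code units and two UTF-16 code units, which is consistent with `error1` being rejected.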
+ +Both `'\n'` and `'\u{A}'` represent the same character and so have the same +type. However, explicitly converting this character literal to another character +set might result in a character with a different value that still +represents the newline character.
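As an illustration of the last point, the Python sketch below (again a stand-in, not Carbon) checks that the two spellings denote the same code point, then encodes it in a different character set. Using Python's `cp037` codec, an EBCDIC code page, as the "other character set" is an assumption of this example and is not something the proposal names.

```python
newline_escape = "\n"          # spelled with the simple escape
newline_code_point = "\u000A"  # spelled with the Unicode escape

# Both spellings denote the same single code point.
print(newline_escape == newline_code_point)

# Converting to another character set may change the numeric value while the
# result still means "newline". cp037 is only an illustrative choice.
ascii_value = newline_escape.encode("ascii")[0]
ebcdic_value = newline_escape.encode("cp037")[0]
print(hex(ascii_value), hex(ebcdic_value))

# Decoding from the other character set recovers the same character.
print(bytes([ebcdic_value]).decode("cp037") == newline_escape)
```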