Character Literals (#1934) #1964

Merged
91 commits
d2538d8
Character Literals
cabmeurer Sep 4, 2022
4c38941
Update proposals/p1964.md
cabmeurer Sep 11, 2022
df98ff4
Apply suggestions from code review
cabmeurer Sep 11, 2022
05b9a18
Remove restate of 'Why' from Support prefix decl
cabmeurer Sep 11, 2022
c4e54c1
Remove reduant bullet
cabmeurer Sep 11, 2022
3fde33a
Update Prefix support section and add TODO
cabmeurer Sep 11, 2022
5d1d2f9
Update Operations section with types of operands and results
cabmeurer Sep 19, 2022
d618da8
Fix code format for Problem section
cabmeurer Sep 19, 2022
8dfd0d7
Example showing that using a string is less appropriate than a character
cabmeurer Sep 20, 2022
a72e990
Example showing that using a string is less appropriate than a character
cabmeurer Sep 20, 2022
5d0df06
Update Abstract and Problem sections
cabmeurer Sep 20, 2022
a147db0
Update Abstract and Problem sections
cabmeurer Sep 20, 2022
b700b7f
Type section describing variable's type
cabmeurer Sep 20, 2022
334a7a0
Format Type section
cabmeurer Sep 20, 2022
bdf44ea
Format Type section
cabmeurer Sep 20, 2022
b4daeb0
Update Background section
cabmeurer Sep 21, 2022
398f742
Update Alternatives considered with Disallowing numeric escape sequences
cabmeurer Sep 21, 2022
7e01c75
Update Alternatives considered with Disallowing numeric escape sequences
cabmeurer Sep 21, 2022
fbbbccb
Format Type section
cabmeurer Sep 21, 2022
2d7c899
Update operations and alternatives section
cabmeurer Sep 24, 2022
ebf2548
Update proposals/p1964.md
cabmeurer Sep 27, 2022
d95e3b1
Update proposals/p1964.md
cabmeurer Sep 27, 2022
d8111ba
Update proposals/p1964.md
cabmeurer Sep 27, 2022
b4d3ec2
Update proposals/p1964.md
cabmeurer Sep 27, 2022
b44c751
Update proposals/p1964.md
cabmeurer Sep 27, 2022
aafeac0
Update proposals/p1964.md
cabmeurer Sep 27, 2022
def0273
Update proposals/p1964.md
cabmeurer Sep 27, 2022
21e53b3
Update proposals/p1964.md
cabmeurer Sep 27, 2022
f1d5602
Update proposals/p1964.md
cabmeurer Sep 27, 2022
091ba9c
Update proposals/p1964.md
cabmeurer Sep 27, 2022
50470e8
Update proposals/p1964.md
cabmeurer Sep 27, 2022
7c0829e
Update operations section
cabmeurer Sep 27, 2022
e8454f7
Update operations section
cabmeurer Sep 27, 2022
a2c0a90
Add link to design idea
cabmeurer Sep 27, 2022
9e5ecb9
Explicit disallow other whitespace characters other than word space
cabmeurer Sep 27, 2022
5f2ce1e
Provide details for No Distinct Character Literal alternative
cabmeurer Sep 28, 2022
951b6b7
Fix typo; tilde -> acute accent mark
cabmeurer Sep 28, 2022
b6345d9
Fix spacing
cabmeurer Sep 28, 2022
4f7d678
Better example for Rationale section
cabmeurer Sep 28, 2022
09ca59e
Fix grammer
cabmeurer Sep 28, 2022
7f91aa4
Provide details for not supporting prefix declarations alternative
cabmeurer Sep 28, 2022
4dcc633
Typo
cabmeurer Sep 28, 2022
153e1a5
Grammer
cabmeurer Sep 28, 2022
8b15cbc
Provide details for Disallowing Numeric Escape Sequences in Alternati…
cabmeurer Oct 3, 2022
d321b37
Provide details for Disallowing Numeric Escape Sequences in Alternati…
cabmeurer Oct 3, 2022
8eacb58
Update proposals/p1964.md
cabmeurer Oct 5, 2022
377096c
Update proposals/p1964.md
cabmeurer Oct 5, 2022
8d34805
Update proposals/p1964.md
cabmeurer Oct 5, 2022
0caa49f
Update proposals/p1964.md
cabmeurer Oct 5, 2022
df0fb9c
Update proposals/p1964.md
cabmeurer Oct 5, 2022
0b4451c
Update proposals/p1964.md
cabmeurer Oct 5, 2022
1a53051
Formatting
cabmeurer Oct 5, 2022
2f4eb91
Apply suggestions from code review
cabmeurer Oct 8, 2022
61180df
Apply suggestions from review
cabmeurer Oct 8, 2022
174a551
Update details section with disucssion from discord
cabmeurer Oct 10, 2022
d9a1b99
Update details section with disucssion from discord
cabmeurer Oct 10, 2022
4d3204d
Fix operations statment
cabmeurer Oct 10, 2022
98864dd
Update Details section from discord conslusion
cabmeurer Oct 14, 2022
4aff5f2
Apply suggestions from code review
cabmeurer Oct 20, 2022
680ed7c
Elaborate on 'No distinct character types'
cabmeurer Oct 20, 2022
c119ee9
Add suggestion from review
cabmeurer Oct 20, 2022
be7cbf0
Remove encoding section and add to details section
cabmeurer Oct 20, 2022
5418553
Remove encoding section and add to details section
cabmeurer Oct 20, 2022
abea9b7
Fix typo
cabmeurer Oct 20, 2022
051b100
Add example of comparison
cabmeurer Oct 20, 2022
7c7950e
Typo
cabmeurer Oct 20, 2022
c53a4ce
Apply suggestions from code review
cabmeurer Oct 22, 2022
223aaa5
Apply suggestions from code review
cabmeurer Oct 22, 2022
63f2f41
Add suggestions from review
cabmeurer Oct 22, 2022
7667e25
Apply suggestions from code review
cabmeurer Nov 15, 2022
d35c1f6
Update to for more integer based approach, disallow numeric escape se…
cabmeurer Nov 15, 2022
e1810da
Update details
cabmeurer Nov 16, 2022
10b660d
Consolidate proposla, update details, add alternatives section for su…
cabmeurer Nov 22, 2022
606219e
Apply suggestions from code review
cabmeurer Nov 23, 2022
1abad92
Format header
cabmeurer Nov 23, 2022
d165370
Add suggestions from review
cabmeurer Nov 23, 2022
78192d3
Update types section: Value code point representation
cabmeurer Dec 3, 2022
f6eeff6
Apply suggestions from code review
cabmeurer Dec 9, 2022
cb7e896
Apply suggestions from code review
cabmeurer Dec 18, 2022
7a94dfc
Update types section, create future work section
cabmeurer Jan 2, 2023
afad667
Update operators section
cabmeurer Jan 2, 2023
2393390
Update rationale section
cabmeurer Jan 2, 2023
e222263
Apply suggestions from code review
cabmeurer Jan 2, 2023
d862fdb
Update alternatives considered
cabmeurer Jan 2, 2023
cb1c1e5
Update alternatives considered format
cabmeurer Jan 2, 2023
5d15fec
Apply suggestions from code review
cabmeurer Mar 11, 2023
2ebf115
Update format
cabmeurer Mar 11, 2023
d1af3a7
Apply suggestions from code review
cabmeurer Jun 3, 2023
44ff724
Apply suggestions from code review
cabmeurer Jun 15, 2023
a86d571
Apply suggestions from review - examples for Types
cabmeurer Jun 15, 2023
842ab4d
Apply suggestions from review - format
cabmeurer Jun 15, 2023
154 changes: 85 additions & 69 deletions proposals/p1964.md
@@ -26,6 +26,8 @@ SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
- [Supporting prefix declarations](#supporting-prefix-declarations)
- [Allowing numeric escape sequences](#allowing-numeric-escape-sequences)
- [Supporting formulations of grapheme clusters and non-code-point code-units](#supporting-formulations-of-grapheme-clusters-and-non-code-point-code-units)
- [Future Work](#future-work)
- [UTF code unit types proposal](#utf-code-unit-types-proposal)

<!-- tocstop -->

@@ -105,8 +107,6 @@ like numeric literals:
- Character literals support some back-slash (`\`) escape sequences, including
`\t`, `\n`, `\r`, `\"`, `\'`, `\\`, `\0`, and `\u{HHHH...}`. See
[String Literals: Escape sequence](https://github.com/carbon-language/carbon-lang/blob/trunk/proposals/p0199.md#escape-sequences).
- Character literals implicitly convert to Unicode strings, even though the
numeric values of the encoding may change.

We will not support:

@@ -115,89 +115,52 @@ We will not support:
- "raw" literals (using #'x'#);
- `\x` escape sequences;
- character literals with a single quote (`'`) or back-slash (`\`), except as
part of an escape sequence;
part of an escape sequence
- empty character literals (`''`);
- ASCII control codes (0...31), including whitespace characters other than
word space (tab, line feed, carriage return, form feed, and vertical tab),
except when specified with an escape sequence.

### Types

We will have the types `Char8`, `Char16`, and `Char32` representing code units
in UTF-8, UTF-16, and UTF-32, but we will not support all code units, but only
those which map directly to the complete value of a code point. However,
character literals will use their own types distinct from these:
For the time being we will support a type `CharN` that will hold both code units
and code points, and will leave the different UTF-encoding code unit types to
another proposal. See
[UTF code unit types proposal](#utf-code-unit-types-proposal).

- We will support value preserving implicit conversions from character
literals to code point or code unit types. In particular, a character
literal converts to a `Char8` UTF-8 code unit if it is less than or equal to
0x7F, and `Char16` UTF-16 code unit if it is less than or equal to 0xFFFF.
- Conversions from string or character literals to a non-value-preserving
encoding must be explicit.
- Conversions from string literals to Unicode strings are implicit, even
though the numeric values of the encoding may change.

We can see whether the particular literal is represented in the variable's type
by only looking at the types.
We will have the type `CharN` and only support literals that map directly to the
complete value of a code point.

```
let allowed: Char8 = 'a';
let allowed: CharN = 'a';
```

The above is allowed because the type of `'a'` is the character literal
consisting of the single Unicode code point 97, which can be converted to
`Char8` since 97 is less than or equal to 0x7F.

```
let error1: Char8 = '😃';
let error2: Char8 = 'AB';
```

However these should produce errors. The type of `'😃'` is the character literal
consisting of the single Unicode code point `0x1F603`, which is greater than
0x7F. The type of `'AB'` is a character literal that is a sequence of two
Unicode code points, which is not valid.

Literals `'\n'` and `'\u{A}'` represent the same character and so have the same
type. However, explicitly converting this character literal to another character
set might result in a character with a different value, but that still
represents the newline character.

```
// 0x15 is the new line character in EBCDIC.
Assert('\n' as EBCDICChar == EBCDICChar.Make(0x15));
Assert('\u{A}' as EBCDICChar != EBCDICChar.Make(0x0A));
```
`CharN` since 97 is less than or equal to 0x7F.

### Operations

Character literals representing a single code point support the following
operators:

- Comparison: `<`, `>`, `<=`, `>=`, `==`
- Plus: `+`. However the behavior of this operator is different from its use
with strings. Rather than a concatenation, this will add the value of the
two characters:
- If the `+` is used between a character literal representing a single
Unicode code point and an integer literal, this should produce a
character literal if the result fits in a Unicode code point.
- If the `+` is used between two character literals this will produce an
error.
- Subtract: `-` Similar to the plus operator, this will subtract the value of
the two characters.

- If the `-` is used between a character literal representing a single
code point or code unit and an integer literal, this should produce a
character literal as long as the result of the difference is in range.
- If the `-` is used between two character literals, which are both a
single code point, this should produce an integer literal representing
the difference between the code points.

In cases where the `+` and `-` operators are used with a character literal
and a non-integer non-literal, the character literal should be converted to
the type of the other operand if possible. For example, in the expression
`w - 'a'`, where `w` is of type `Char8`, the `'a'` literal will be converted
to type `Char8`.
- Plus: `+`. This doesn't concatenate, but allows numerically adjusting the
value:
- Only one operand may be a character literal, the other must be an
integer literal.
Comment on lines +185 to +186, from a contributor:

One consequence appears to be that this is invalid:

fn Digit(n: i8) -> Char32 {
  return '0' + n;
}

... and something like `return ('0' as Char32) + n;` would be needed instead. I think I'm OK with that, but I expect it to be a minor source of friction as that kind of usage is fairly common in C++.

Contributor reply:

Given that '0' + n has a value that is only known at runtime, what type should it be? Using the type of n here is a bit worrisome, due to overflow. I would be fine with saying the result would be Char32, but maybe that would only make sense for some types of n? For example if n: i64, a Char32 result would be surprising.

Contributor reply:

I think I like keeping this super-explicit for now (requiring a cast to a specific sized type). We can try to add better defaults if in practice this friction is something users dislike. I'm somewhat hopeful that instead we can have a really easily discovered (and targeted in migration) API for mapping to digits, and avoid how much this comes up in practice. But it seems easy to address if it does come up.

- The result is the character literal whose numeric value is the sum of
the numeric values of the operands. If that sum is not a valid Unicode code
point, it is an error.
- Subtract: `-`. This will subtract the value of the two characters, or a
character followed by an integer literal:
- If the `-` is used between two character literals, the result will be an
integer constant, for example `'z' - 'a'`.
- If the `-` is used between a character literal followed by an integer
literal, this will produce a character constant, for example `'z' - 4`.
- If the `-` is used between an integer literal followed by a character
literal, as in `100 - 'a'`, this will be rejected unless the integer is cast to
a character.
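
The operator rules in this section can be modeled outside Carbon to check how they combine. The following is only a sketch in Python, not Carbon code and not the toolchain's implementation. `CharLit` is a hypothetical stand-in for a character literal holding one code point, and it assumes that `'z' - 'a'` yields an integer, that `'z' - 4` yields a character, that `+` accepts the integer on either side, and that "valid Unicode code point" means the range 0 to 0x10FFFF (surrogate handling is left out).

```python
from dataclasses import dataclass

MAX_CODE_POINT = 0x10FFFF  # upper end of the Unicode code point range


@dataclass(frozen=True, order=True)
class CharLit:
    """Hypothetical model of a character literal: a single code point."""

    cp: int

    def __post_init__(self):
        # Results outside the code point range are errors.
        if not 0 <= self.cp <= MAX_CODE_POINT:
            raise ValueError(f"{self.cp:#x} is not a Unicode code point")

    def __add__(self, other):
        # character + integer -> character; character + character is rejected.
        if isinstance(other, int):
            return CharLit(self.cp + other)
        raise TypeError("'+' needs one character literal and one integer literal")

    # Assumption: the integer may appear on either side of '+'.
    __radd__ = __add__

    def __sub__(self, other):
        if isinstance(other, CharLit):
            return self.cp - other.cp  # character - character -> integer
        if isinstance(other, int):
            return CharLit(self.cp - other)  # character - integer -> character
        raise TypeError("unsupported operand for '-'")

    def __rsub__(self, other):
        # integer - character is rejected unless the integer is cast first.
        raise TypeError("integer - character is rejected; cast the integer first")
```

In this model `CharLit(ord('z')) - CharLit(ord('a'))` is the integer 25, `CharLit(ord('z')) - 4` is the character `'v'`, and `100 - CharLit(ord('a'))` raises an error, mirroring the three subtraction cases. The comparison operators come from `order=True` and compare code point values.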

There is intentionally no implicit conversion from character literals to integer
types, but explicit conversions are permitted between character literals and
@@ -212,11 +175,11 @@ Adding support for a specific character literal supports clean, readable,
concise use and is a much more familiar concept that will make it easier to
adopt Carbon coming from other languages. Having a distinct character literal will
also allow us to support useful operations designed to manipulate the literal's
value. When working with `String`, we use the `+` operator to concatenate
multiple `String`s, but say we wanted to advance a character to the next
literal. Using a `String` will produce a type error, as we are misusing the `+`
operator: `"a" + 1`. However with a character literal, we can support operations
for these use cases:
value. When working with an explicit character type we can use operators that
have unique behavior; for example, say we wanted to advance a character to the
next literal. In other languages the `+` operator is often used for
concatenation, so using a `String` will produce a type error: `"a" + 1`. However,
with a character literal, we can support operations for these use cases:

```
var b: u8;
@@ -275,6 +238,9 @@ disadvantages:
- It's useful for developers to document the intended meaning of a value, and
using a distinct type is one way to do that.

See [UTF code unit types proposal](#utf-code-unit-types-proposal) for more
information about UTF encoding types, which are left to a future proposal.

### No distinct character literal

In principle, a character literal can be represented by reusing string literals
@@ -354,3 +320,53 @@ single Unicode code points following a more integer-like model. However this
topic should be revisited if we find that there is a significant need for the
additional functionality and attendant complexity for these other character
formulations.

## Future Work

### UTF code unit types proposal

There have been several ideas and discussions around how we would like to handle
UTF code units. This section will hopefully provide some guidance for a future
proposal when the topic is revisited for how we would like to build out
encoding/decoding for character literals.

We will have the types `Char8`, `Char16`, and `Char32` representing code units
in UTF-8, UTF-16, and UTF-32, but we will not support all code units, but only
those which map directly to the complete value of a code point. However,
character literals will use their own types distinct from these:

- We will support value preserving implicit conversions from character
literals to code point or code unit types. In particular, a character
literal converts to a `Char8` UTF-8 code unit if it is less than or equal to
0x7F, and `Char16` UTF-16 code unit if it is less than or equal to 0xFFFF.
- Conversions from string or character literals to a non-value-preserving
encoding must be explicit.
- Conversions from string literals to Unicode strings are implicit, even
though the numeric values of the encoding may change.

We can see whether the particular literal is represented in the variable's type
by only looking at the types.

```
let allowed: Char8 = 'a';
```

The above is allowed because the type of `'a'` is the character literal
consisting of the single Unicode code point 97, which can be converted to
`Char8` since 97 is less than or equal to 0x7F.

```
let error1: Char8 = '😃';
let error2: Char8 = 'AB';
```

However these should produce errors. The type of `'😃'` is the character literal
consisting of the single Unicode code point `0x1F603`, which is greater than
0x7F. The type of `'AB'` is a character literal that is a sequence of two
Unicode code points, which has no conversion to a type that only handles a
single UTF-8 code unit.
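
The value-preserving conversion rule described above can be sketched as a small range check. This is a Python model rather than Carbon. The `LIMITS` table and the `converts_implicitly` helper are hypothetical names, the `Char8` and `Char16` bounds come from the text, and the `Char32` bound of 0x10FFFF is an assumption since the text does not state one.

```python
# Largest code point each type can hold while preserving the value.
# Char8 and Char16 limits are from the text; Char32 is an assumption.
LIMITS = {"Char8": 0x7F, "Char16": 0xFFFF, "Char32": 0x10FFFF}


def converts_implicitly(literal: str, target: str) -> bool:
    """Return True if a character literal converts implicitly to `target`."""
    # A literal made of several code points, such as 'AB', has no conversion.
    if len(literal) != 1:
        return False
    return ord(literal) <= LIMITS[target]
```

Under these assumptions `converts_implicitly('a', 'Char8')` holds because 97 is at most 0x7F, while `'😃'` (code point 0x1F603) is rejected for both `Char8` and `Char16`, and `'AB'` is rejected for every target.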

Both `'\n'` and `'\u{A}'` represent the same character and so have the same
type. However, explicitly converting this character literal to another character
set might result in a character with a different value, but that still
represents the newline character.