Commit d4eac74

Merge pull request #1452 from mattheww/2024-01_string_literal_expr
String literal expressions
2 parents 8c77e8b + 00a2ac6 commit d4eac74

2 files changed

+223
-30
lines changed

Diff for: src/expressions/literal-expr.md

+212-5
@@ -26,29 +26,227 @@ Each of the lexical [literal][literal tokens] forms described earlier can make u
2626
5; // integer type
2727
```
2828

29+
In the descriptions below, the _string representation_ of a token is the sequence of characters from the input which matched the token's production in a *Lexer* grammar snippet.
30+
31+
> **Note**: this string representation never includes a character `U+000D` (CR) immediately followed by `U+000A` (LF): this pair would have been previously transformed into a single `U+000A` (LF).
32+
33+
## Escapes
34+
35+
The descriptions of textual literal expressions below make use of several forms of _escape_.
36+
37+
Each form of escape is characterised by:
38+
* an _escape sequence_: a sequence of characters, which always begins with `U+005C` (`\`)
39+
* an _escaped value_: either a single character or an empty sequence of characters
40+
41+
In the definitions of escapes below:
42+
* An _octal digit_ is any of the characters in the range \[`0`-`7`].
43+
* A _hexadecimal digit_ is any of the characters in the ranges \[`0`-`9`], \[`a`-`f`], or \[`A`-`F`].
44+
45+
### Simple escapes
46+
47+
Each sequence of characters occurring in the first column of the following table is an escape sequence.
48+
49+
In each case, the escaped value is the character given in the corresponding entry in the second column.
50+
51+
| Escape sequence | Escaped value |
52+
|-----------------|--------------------------|
53+
| `\0` | U+0000 (NUL) |
54+
| `\t` | U+0009 (HT) |
55+
| `\n` | U+000A (LF) |
56+
| `\r` | U+000D (CR) |
57+
| `\"` | U+0022 (QUOTATION MARK) |
58+
| `\'` | U+0027 (APOSTROPHE) |
59+
| `\\` | U+005C (REVERSE SOLIDUS) |
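
The table above can be checked directly against the compiler. The following is a minimal sketch; `simple_escape_value` is a hypothetical helper written for illustration and is not part of any library.

```rust
/// Hypothetical helper: maps the character that follows `\` in a simple
/// escape to its escaped value, mirroring the table above.
fn simple_escape_value(c: char) -> Option<char> {
    match c {
        '0' => Some('\u{0000}'),
        't' => Some('\u{0009}'),
        'n' => Some('\u{000A}'),
        'r' => Some('\u{000D}'),
        '"' => Some('\u{0022}'),
        '\'' => Some('\u{0027}'),
        '\\' => Some('\u{005C}'),
        _ => None, // not a simple escape
    }
}

fn main() {
    // The literals written with simple escapes have the scalar values listed in the table.
    assert_eq!('\0' as u32, 0x0000);
    assert_eq!('\t' as u32, 0x0009);
    assert_eq!('\n' as u32, 0x000A);
    assert_eq!('\r' as u32, 0x000D);
    assert_eq!('\"' as u32, 0x0022);
    assert_eq!('\'' as u32, 0x0027);
    assert_eq!('\\' as u32, 0x005C);

    // The helper agrees with the compiler's interpretation.
    assert_eq!(simple_escape_value('n'), Some('\n'));
    assert_eq!(simple_escape_value('x'), None);
}
```
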
60+
61+
### 8-bit escapes
62+
63+
The escape sequence consists of `\x` followed by two hexadecimal digits.
64+
65+
The escaped value is the character whose [Unicode scalar value] is the result of interpreting the final two characters in the escape sequence as a hexadecimal integer, as if by [`u8::from_str_radix`] with radix 16.
66+
67+
> **Note**: the escaped value therefore has a [Unicode scalar value] in the range of [`u8`][numeric types].
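
As an illustration of the rule above, the sketch below computes the value of an 8-bit escape from its escape sequence. `eight_bit_escape_value` is a hypothetical helper; the explicit digit check is an assumption added because `u8::from_str_radix` on its own also accepts a leading `+`.

```rust
/// Hypothetical helper: returns the escaped value of an 8-bit escape
/// such as `\xA0`, or `None` if the input is not of that form.
fn eight_bit_escape_value(escape: &str) -> Option<u8> {
    let digits = escape.strip_prefix(r"\x")?;
    // Exactly two hexadecimal digits are required.
    if digits.len() != 2 || !digits.chars().all(|c| c.is_ascii_hexdigit()) {
        return None;
    }
    u8::from_str_radix(digits, 16).ok()
}

fn main() {
    assert_eq!(eight_bit_escape_value(r"\x52"), Some(0x52));
    assert_eq!(eight_bit_escape_value(r"\xA0"), Some(0xA0));
    assert_eq!(eight_bit_escape_value(r"\xG0"), None);

    // The compiler gives the same values for byte literals.
    assert_eq!(b'\x52', 82);
    assert_eq!(b'\xA0', 160);
}
```
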
68+
69+
### 7-bit escapes
70+
71+
The escape sequence consists of `\x` followed by an octal digit then a hexadecimal digit.
72+
73+
The escaped value is the character whose [Unicode scalar value] is the result of interpreting the final two characters in the escape sequence as a hexadecimal integer, as if by [`u8::from_str_radix`] with radix 16.
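
A 7-bit escape differs from an 8-bit escape only in that its first digit must be an octal digit, which keeps the value at or below `0x7F`. A minimal sketch, using a hypothetical helper `seven_bit_escape_value`:

```rust
/// Hypothetical helper: returns the escaped value of a 7-bit escape
/// such as `\x52`, or `None` if the input is not of that form.
fn seven_bit_escape_value(escape: &str) -> Option<char> {
    let digits = escape.strip_prefix(r"\x")?;
    let mut chars = digits.chars();
    let (hi, lo) = (chars.next()?, chars.next()?);
    // First digit octal, second digit hexadecimal, nothing after that.
    if chars.next().is_some() || !('0'..='7').contains(&hi) || !lo.is_ascii_hexdigit() {
        return None;
    }
    u8::from_str_radix(digits, 16).ok().map(char::from)
}

fn main() {
    assert_eq!(seven_bit_escape_value(r"\x52"), Some('R'));
    assert_eq!(seven_bit_escape_value(r"\x7F"), Some('\x7F'));
    // `8` is not an octal digit, so this is not a 7-bit escape.
    assert_eq!(seven_bit_escape_value(r"\x80"), None);

    // The compiler agrees for character literals.
    assert_eq!('\x52', 'R');
}
```
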
74+
75+
### Unicode escapes
76+
77+
The escape sequence consists of `\u{`, followed by a sequence of characters each of which is a hexadecimal digit or `_`, followed by `}`.
78+
79+
The escaped value is the character whose [Unicode scalar value] is the result of interpreting the hexadecimal digits contained in the escape sequence as a hexadecimal integer, as if by `u32::from_str_radix` with radix 16.
80+
81+
> **Note**: the permitted forms of a [CHAR_LITERAL] or [STRING_LITERAL] token ensure that there is such a character.
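
The sketch below illustrates how a Unicode escape maps to a character. `unicode_escape_value` is a hypothetical helper, and it parses with `u32::from_str_radix` because scalar values can exceed the range of `u8`. It is a simplification: it does not reproduce every restriction the lexer places on where `_` may appear.

```rust
/// Hypothetical helper: returns the escaped value of a Unicode escape
/// such as `\u{00E6}`, or `None` if no such character exists.
fn unicode_escape_value(escape: &str) -> Option<char> {
    let inner = escape.strip_prefix(r"\u{")?.strip_suffix('}')?;
    // Underscores are separators only; the hexadecimal digits carry the value.
    let digits: String = inner.chars().filter(|&c| c != '_').collect();
    if digits.is_empty() || digits.len() > 6 || !digits.chars().all(|c| c.is_ascii_hexdigit()) {
        return None;
    }
    let value = u32::from_str_radix(&digits, 16).ok()?;
    // `char::from_u32` rejects surrogates and values above U+10FFFF.
    char::from_u32(value)
}

fn main() {
    assert_eq!(unicode_escape_value(r"\u{00E6}"), Some('æ'));
    assert_eq!(unicode_escape_value(r"\u{1_F600}"), Some('\u{1F600}'));
    // A surrogate code point is not a Unicode scalar value.
    assert_eq!(unicode_escape_value(r"\u{D800}"), None);

    // The compiler agrees for character literals.
    assert_eq!('\u{00E6}', 'æ');
}
```
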
82+
83+
### String continuation escapes
84+
85+
The escape sequence consists of `\` followed immediately by `U+000A` (LF), and all following whitespace characters before the next non-whitespace character.
86+
For this purpose, the whitespace characters are `U+0009` (HT), `U+000A` (LF), `U+000D` (CR), and `U+0020` (SPACE).
87+
88+
The escaped value is an empty sequence of characters.
89+
90+
> **Note**: The effect of this form of escape is that a string continuation skips following whitespace, including additional newlines.
91+
> Thus `a`, `b` and `c` are equal:
92+
> ```rust
93+
> let a = "foobar";
94+
> let b = "foo\
95+
> bar";
96+
> let c = "foo\
97+
>
98+
> bar";
99+
>
100+
> assert_eq!(a, b);
101+
> assert_eq!(b, c);
102+
> ```
103+
>
104+
> Skipping additional newlines (as in example c) is potentially confusing and unexpected.
105+
> This behavior may be adjusted in the future.
106+
> Until a decision is made, it is recommended to avoid relying on skipping multiple newlines with line continuations.
107+
> See [this issue](https://github.com/rust-lang/reference/pull/1042) for more information.
108+
29109
## Character literal expressions
30110
31111
A character literal expression consists of a single [CHAR_LITERAL] token.
32112
33-
> **Note**: This section is incomplete.
113+
The expression's type is the primitive [`char`][textual types] type.
114+
115+
The token must not have a suffix.
116+
117+
The token's _literal content_ is the sequence of characters following the first `U+0027` (`'`) and preceding the last `U+0027` (`'`) in the string representation of the token.
118+
119+
The literal expression's _represented character_ is derived from the literal content as follows:
120+
121+
* If the literal content is one of the following forms of escape sequence, the represented character is the escape sequence's escaped value:
122+
* [Simple escapes]
123+
* [7-bit escapes]
124+
* [Unicode escapes]
125+
126+
* Otherwise the represented character is the single character that makes up the literal content.
127+
128+
The expression's value is the [`char`][textual types] corresponding to the represented character's [Unicode scalar value].
129+
130+
> **Note**: the permitted forms of a [CHAR_LITERAL] token ensure that these rules always produce a single character.
131+
132+
Examples of character literal expressions:
133+
134+
```rust
135+
'R'; // R
136+
'\''; // '
137+
'\x52'; // R
138+
'\u{00E6}'; // LATIN SMALL LETTER AE (U+00E6)
139+
```
34140
35141
## String literal expressions
36142

37143
A string literal expression consists of a single [STRING_LITERAL] or [RAW_STRING_LITERAL] token.
38144

39-
> **Note**: This section is incomplete.
145+
The expression's type is a shared reference (with `static` lifetime) to the primitive [`str`][textual types] type.
146+
That is, the type is `&'static str`.
147+
148+
The token must not have a suffix.
149+
150+
The token's _literal content_ is the sequence of characters following the first `U+0022` (`"`) and preceding the last `U+0022` (`"`) in the string representation of the token.
151+
152+
The literal expression's _represented string_ is a sequence of characters derived from the literal content as follows:
153+
154+
* If the token is a [STRING_LITERAL], each escape sequence of any of the following forms occurring in the literal content is replaced by the escape sequence's escaped value.
155+
* [Simple escapes]
156+
* [7-bit escapes]
157+
* [Unicode escapes]
158+
* [String continuation escapes]
159+
160+
These replacements take place in left-to-right order.
161+
For example, the token `"\\x41"` is converted to the characters `\` `x` `4` `1`.
162+
163+
* If the token is a [RAW_STRING_LITERAL], the represented string is identical to the literal content.
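
The rules above can be observed by comparing an escaped string literal with a raw one. This is a minimal sketch; `represented_chars` is a hypothetical helper that only collects the characters of a string for comparison.

```rust
/// Hypothetical helper: collects the characters of a represented string.
fn represented_chars(s: &str) -> Vec<char> {
    s.chars().collect()
}

fn main() {
    // Both forms have type `&'static str`.
    let escaped: &'static str = "\\x41";
    let raw: &'static str = r"\x41";

    // Left-to-right replacement: `\\` is consumed first, so `x41` is not an escape.
    assert_eq!(represented_chars(escaped), vec!['\\', 'x', '4', '1']);
    // The raw literal's represented string is its literal content.
    assert_eq!(escaped, raw);

    // Without the doubled backslash, `\x41` is a 7-bit escape.
    assert_eq!("\x41", "A");

    // The value holds the UTF-8 encoding of the represented string.
    assert_eq!("\u{00E6}".as_bytes(), [0xC3u8, 0xA6u8]);
}
```
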
164+
165+
The expression's value is a reference to a statically allocated [`str`][textual types] containing the UTF-8 encoding of the represented string.
166+
167+
Examples of string literal expressions:
168+
169+
```rust
170+
"foo"; r"foo"; // foo
171+
"\"foo\""; r#""foo""#; // "foo"
172+
173+
"foo #\"# bar";
174+
r##"foo #"# bar"##; // foo #"# bar
175+
176+
"\x52"; "R"; r"R"; // R
177+
"\\x52"; r"\x52"; // \x52
178+
```
40179

41180
## Byte literal expressions
42181

43182
A byte literal expression consists of a single [BYTE_LITERAL] token.
44183

45-
> **Note**: This section is incomplete.
184+
The expression's type is the primitive [`u8`][numeric types] type.
185+
186+
The token must not have a suffix.
187+
188+
The token's _literal content_ is the sequence of characters following the first `U+0027` (`'`) and preceding the last `U+0027` (`'`) in the string representation of the token.
189+
190+
The literal expression's _represented character_ is derived from the literal content as follows:
191+
192+
* If the literal content is one of the following forms of escape sequence, the represented character is the escape sequence's escaped value:
193+
* [Simple escapes]
194+
* [8-bit escapes]
195+
196+
* Otherwise the represented character is the single character that makes up the literal content.
197+
198+
The expression's value is the represented character's [Unicode scalar value].
199+
200+
> **Note**: the permitted forms of a [BYTE_LITERAL] token ensure that these rules always produce a single character, whose Unicode scalar value is in the range of [`u8`][numeric types].
201+
202+
Examples of byte literal expressions:
203+
204+
```rust
205+
b'R'; // 82
206+
b'\''; // 39
207+
b'\x52'; // 82
208+
b'\xA0'; // 160
209+
```
46210

47211
## Byte string literal expressions
48212

49-
A string literal expression consists of a single [BYTE_STRING_LITERAL] or [RAW_BYTE_STRING_LITERAL] token.
213+
A byte string literal expression consists of a single [BYTE_STRING_LITERAL] or [RAW_BYTE_STRING_LITERAL] token.
50214

51-
> **Note**: This section is incomplete.
215+
The expression's type is a shared reference (with `static` lifetime) to an array whose element type is [`u8`][numeric types].
216+
That is, the type is `&'static [u8; N]`, where `N` is the number of bytes in the represented string described below.
217+
218+
The token must not have a suffix.
219+
220+
The token's _literal content_ is the sequence of characters following the first `U+0022` (`"`) and preceding the last `U+0022` (`"`) in the string representation of the token.
221+
222+
The literal expression's _represented string_ is a sequence of characters derived from the literal content as follows:
223+
224+
* If the token is a [BYTE_STRING_LITERAL], each escape sequence of any of the following forms occurring in the literal content is replaced by the escape sequence's escaped value.
225+
* [Simple escapes]
226+
* [8-bit escapes]
227+
* [String continuation escapes]
228+
229+
These replacements take place in left-to-right order.
230+
For example, the token `b"\\x41"` is converted to the characters `\` `x` `4` `1`.
231+
232+
* If the token is a [RAW_BYTE_STRING_LITERAL], the represented string is identical to the literal content.
233+
234+
The expression's value is a reference to a statically allocated array containing the [Unicode scalar values] of the characters in the represented string, in the same order.
235+
236+
> **Note**: the permitted forms of [BYTE_STRING_LITERAL] and [RAW_BYTE_STRING_LITERAL] tokens ensure that these rules always produce array element values in the range of [`u8`][numeric types].
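
As a sketch of the type and value rules above, the example below checks the array length `N` and the element values for escaped and raw byte string literals. `as_scalar_values` is a hypothetical helper that maps each array element back to the character with that scalar value.

```rust
/// Hypothetical helper: maps each byte to the character whose
/// Unicode scalar value equals that byte.
fn as_scalar_values(bytes: &[u8]) -> Vec<char> {
    bytes.iter().map(|&b| char::from(b)).collect()
}

fn main() {
    // `N` is the number of characters in the represented string.
    let escaped: &'static [u8; 1] = b"\x52";
    let raw: &'static [u8; 4] = br"\x52";

    assert_eq!(escaped, &[82u8]);
    assert_eq!(raw, &[b'\\', b'x', b'5', b'2']);

    // An 8-bit escape may produce a value above 0x7F.
    assert_eq!(b"\xA0", &[160u8]);
    assert_eq!(as_scalar_values(b"\xA0"), vec!['\u{A0}']);
}
```
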
237+
238+
Examples of byte string literal expressions:
239+
240+
```rust
241+
b"foo"; br"foo"; // foo
242+
b"\"foo\""; br#""foo""#; // "foo"
243+
244+
b"foo #\"# bar";
245+
br##"foo #"# bar"##; // foo #"# bar
246+
247+
b"\x52"; b"R"; br"R"; // R
248+
b"\\x52"; br"\x52"; // \x52
249+
```
52250

53251
## C string literal expressions
54252

@@ -167,6 +365,11 @@ The expression's type is the primitive [boolean type], and its value is:
167365
* false if the keyword is `false`
168366

169367

368+
[Simple escapes]: #simple-escapes
369+
[8-bit escapes]: #8-bit-escapes
370+
[7-bit escapes]: #7-bit-escapes
371+
[Unicode escapes]: #unicode-escapes
372+
[String continuation escapes]: #string-continuation-escapes
170373
[boolean type]: ../types/boolean.md
171374
[constant expression]: ../const_eval.md#constant-expressions
172375
[floating-point types]: ../types/numeric.md#floating-point-types
@@ -177,12 +380,16 @@ The expression's type is the primitive [boolean type], and its value is:
177380
[suffix]: ../tokens.md#suffixes
178381
[negation operator]: operator-expr.md#negation-operators
179382
[overflow]: operator-expr.md#overflow
383+
[textual types]: ../types/textual.md
384+
[Unicode scalar value]: http://www.unicode.org/glossary/#unicode_scalar_value
385+
[Unicode scalar values]: http://www.unicode.org/glossary/#unicode_scalar_value
180386
[`f32::from_str`]: ../../core/primitive.f32.md#method.from_str
181387
[`f32::INFINITY`]: ../../core/primitive.f32.md#associatedconstant.INFINITY
182388
[`f32::NAN`]: ../../core/primitive.f32.md#associatedconstant.NAN
183389
[`f64::from_str`]: ../../core/primitive.f64.md#method.from_str
184390
[`f64::INFINITY`]: ../../core/primitive.f64.md#associatedconstant.INFINITY
185391
[`f64::NAN`]: ../../core/primitive.f64.md#associatedconstant.NAN
392+
[`u8::from_str_radix`]: ../../core/primitive.u8.md#method.from_str_radix
186393
[`u128::from_str_radix`]: ../../core/primitive.u128.md#method.from_str_radix
187394
[CHAR_LITERAL]: ../tokens.md#character-literals
188395
[STRING_LITERAL]: ../tokens.md#string-literals

Diff for: src/tokens.md

+11-25
@@ -156,30 +156,13 @@ A _string literal_ is a sequence of any Unicode characters enclosed within two
156156
`U+0022` (double-quote) characters, with the exception of `U+0022` itself,
157157
which must be _escaped_ by a preceding `U+005C` character (`\`).
158158

159-
Line-breaks are allowed in string literals. A line-break is either a newline
160-
(`U+000A`) or a pair of carriage return and newline (`U+000D`, `U+000A`). Both
161-
byte sequences are normally translated to `U+000A`, but as a special exception,
162-
when an unescaped `U+005C` character (`\`) occurs immediately before a line
163-
break, then the line break character(s), and all immediately following
164-
` ` (`U+0020`), `\t` (`U+0009`), `\n` (`U+000A`) and `\r` (`U+0000D`) characters
165-
are ignored. Thus `a`, `b` and `c` are equal:
159+
Line-breaks are allowed in string literals.
160+
A line-break is either a newline (`U+000A`) or a pair of carriage return and newline (`U+000D`, `U+000A`).
161+
Both byte sequences are translated to `U+000A`.
166162

167-
```rust
168-
let a = "foobar";
169-
let b = "foo\
170-
bar";
171-
let c = "foo\
172-
173-
bar";
174-
175-
assert_eq!(a, b);
176-
assert_eq!(b, c);
177-
```
163+
When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
164+
See [String continuation escapes] for details.
178165

179-
> Note: Rust skipping additional newlines (like in example `c`) is potentially confusing and
180-
> unexpected. This behavior may be adjusted in the future. Until a decision is made, it is
181-
> recommended to avoid relying on this, i.e. skipping multiple newlines with line continuations.
182-
> See [this issue](https://github.com/rust-lang/reference/pull/1042) for more information.
183166

184167
#### Character escapes
185168

@@ -274,7 +257,7 @@ preceded by the characters `U+0062` (`b`) and `U+0022` (double-quote), and
274257
followed by the character `U+0022`. If the character `U+0022` is present within
275258
the literal, it must be _escaped_ by a preceding `U+005C` (`\`) character.
276259
Alternatively, a byte string literal can be a _raw byte string literal_, defined
277-
below. The type of a byte string literal of length `n` is `&'static [u8; n]`.
260+
below.
278261

279262
Some additional _escapes_ are available in either byte or non-raw byte string
280263
literals. An escape starts with a `U+005C` (`\`) and continues with one of the
@@ -479,7 +462,7 @@ An _integer literal_ has one of four forms:
479462

480463
Like any literal, an integer literal may be followed (immediately, without any spaces) by a suffix as described above.
481464
The suffix may not begin with `e` or `E`, as that would be interpreted as the exponent of a floating-point literal.
482-
See [literal expressions] for the effect of these suffixes.
465+
See [Integer literal expressions] for the effect of these suffixes.
483466

484467
Examples of integer literals which are accepted as literal expressions:
485468

@@ -576,7 +559,7 @@ A _floating-point literal_ has one of two forms:
576559
Like integer literals, a floating-point literal may be followed by a
577560
suffix, so long as the pre-suffix part does not end with `U+002E` (`.`).
578561
The suffix may not begin with `e` or `E` if the literal does not include an exponent.
579-
See [literal expressions] for the effect of these suffixes.
562+
See [Floating-point literal expressions] for the effect of these suffixes.
580563

581564
Examples of floating-point literals which are accepted as literal expressions:
582565

@@ -784,12 +767,14 @@ Similarly the `r`, `b`, `br`, `c`, and `cr` prefixes used in raw string literals
784767
[extern crates]: items/extern-crates.md
785768
[extern]: items/external-blocks.md
786769
[field]: expressions/field-expr.md
770+
[Floating-point literal expressions]: expressions/literal-expr.md#floating-point-literal-expressions
787771
[floating-point types]: types/numeric.md#floating-point-types
788772
[function pointer type]: types/function-pointer.md
789773
[functions]: items/functions.md
790774
[generics]: items/generics.md
791775
[identifier]: identifiers.md
792776
[if let]: expressions/if-expr.md#if-let-expressions
777+
[Integer literal expressions]: expressions/literal-expr.md#integer-literal-expressions
793778
[keywords]: keywords.md
794779
[lazy-bool]: expressions/operator-expr.md#lazy-boolean-operators
795780
[literal expressions]: expressions/literal-expr.md
@@ -808,6 +793,7 @@ Similarly the `r`, `b`, `br`, `c`, and `cr` prefixes used in raw string literals
808793
[raw pointers]: types/pointer.md#raw-pointers-const-and-mut
809794
[references]: types/pointer.md
810795
[sized]: trait-bounds.md#sized
796+
[String continuation escapes]: expressions/literal-expr.md#string-continuation-escapes
811797
[struct expressions]: expressions/struct-expr.md
812798
[trait bounds]: trait-bounds.md
813799
[tuple index]: expressions/tuple-expr.md#tuple-indexing-expressions
