From f4e3da2b4afeff9bbeeb439cafa6fa30a21916d8 Mon Sep 17 00:00:00 2001
From: Lukas Kalbertodt
Date: Fri, 25 Jul 2025 14:19:15 +0200
Subject: [PATCH] Fix and clarify CR LF normalization and CR in string literals
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This was slightly incorrect before. Relevant commits changing this:
- fa56fdba0e9dba35eb29d11c95c7a009ed67cb35
- 27e1ec97a75267d3c9efb9c91c7509eff98d11db

The normalization is not applied repeatedly, so CR LF pairs can still
exist after it. Further, given that the normalization happens before
lexing, the part "other than as part of such a string continuation
escape" is not useful. Either the CR LF was in the raw input and has
already been transformed (so the lexical grammar does not see the CR),
or a CR LF pair survives the normalization, in which case the CR is
disallowed anyway.

Here are two test programs showing this behavior:

    printf 'fn main() { "a\r\r\n\nb"; }' > code.rs | rustc -

Results in:

    error: bare CR not allowed in string, use `\r` instead
     --> :1:15
      |
    1 | fn main() { "a␍
      |               ^
      |
    help: escape the character
      |
    1 | fn main() { "a\r
      |               ++

And

    printf 'fn main() { "a\\\r\r\n\nb"; }' > code.rs | rustc -

Results in:

    error: unknown character escape: `\r`
     --> :1:16
      |
    1 | fn main() { "a\␍
      |                ^ unknown character escape
      |
      = help: this is an isolated carriage return; consider checking your editor and version control settings
---
 src/input-format.md | 1 +
 src/tokens.md       | 8 +++-----
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/src/input-format.md b/src/input-format.md
index d4a8fe480..afdf8ac37 100644
--- a/src/input-format.md
+++ b/src/input-format.md
@@ -24,6 +24,7 @@ r[input.crlf]
 ## CRLF normalization
 
 Each pair of characters `U+000D` (CR) immediately followed by `U+000A` (LF) is replaced by a single `U+000A` (LF).
+This happens once, not repeatedly, so after the normalization, there can still exist `U+000D` (CR) immediately followed by `U+000A` (LF) in the input (e.g. if the raw input contained "CR CR LF LF").
 
 Other occurrences of the character `U+000D` (CR) are left in place (they are treated as [whitespace]).
 
diff --git a/src/tokens.md b/src/tokens.md
index 8f8ae10df..88b4c0b5c 100644
--- a/src/tokens.md
+++ b/src/tokens.md
@@ -60,8 +60,6 @@ Literals are tokens used in [literal expressions].
 
 [^nsets]: The number of `#`s on each side of the same literal must be equivalent.
 
-> [!NOTE]
-> Character and string literal tokens never include the sequence of `U+000D` (CR) immediately followed by `U+000A` (LF): this pair would have been previously transformed into a single `U+000A` (LF).
 
 #### ASCII escapes
 
@@ -198,9 +196,9 @@ which must be _escaped_ by a preceding `U+005C` character (`\`).
 
 r[lex.token.literal.str.linefeed]
 Line-breaks, represented by the character `U+000A` (LF), are allowed in string literals.
+The character `U+000D` (CR) may not appear in a string literal.
 When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
 See [String continuation escapes] for details.
-The character `U+000D` (CR) may not appear in a string literal other than as part of such a string continuation escape.
 
 r[lex.token.literal.char-escape]
 #### Character escapes
@@ -323,9 +321,9 @@ below.
 
 r[lex.token.str-byte.linefeed]
 Line-breaks, represented by the character `U+000A` (LF), are allowed in byte string literals.
+The character `U+000D` (CR) may not appear in a byte string literal.
 When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
 See [String continuation escapes] for details.
-The character `U+000D` (CR) may not appear in a byte string literal other than as part of such a string continuation escape.
 
 r[lex.token.str-byte.escape]
 Some additional _escapes_ are available in either byte or non-raw byte string
@@ -429,9 +427,9 @@ permitted within a C string.
 
 r[lex.token.str-c.linefeed]
 Line-breaks, represented by the character `U+000A` (LF), are allowed in C string literals.
+The character `U+000D` (CR) may not appear in a C string literal.
 When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
 See [String continuation escapes] for details.
-The character `U+000D` (CR) may not appear in a C string literal other than as part of such a string continuation escape.
 
 r[lex.token.str-c.escape]
 Some additional _escapes_ are available in non-raw C string literals. An escape