From 5d1950799abab6174d8a797952889851dbe2774b Mon Sep 17 00:00:00 2001
From: John Millikin
Date: Wed, 1 Nov 2023 16:09:43 +0900
Subject: [PATCH] Document C string literal tokens.

---
 src/expressions/literal-expr.md | 10 ++++
 src/tokens.md                   | 88 ++++++++++++++++++++++++++++++---
 2 files changed, 90 insertions(+), 8 deletions(-)

diff --git a/src/expressions/literal-expr.md b/src/expressions/literal-expr.md
index e5bc2dff4..703b55880 100644
--- a/src/expressions/literal-expr.md
+++ b/src/expressions/literal-expr.md
@@ -8,6 +8,8 @@
 >    | [BYTE_LITERAL]\
 >    | [BYTE_STRING_LITERAL]\
 >    | [RAW_BYTE_STRING_LITERAL]\
+>    | [C_STRING_LITERAL]\
+>    | [RAW_C_STRING_LITERAL]\
 >    | [INTEGER_LITERAL]\
 >    | [FLOAT_LITERAL]\
 >    | `true` | `false`
@@ -48,6 +50,12 @@ A string literal expression consists of a single [BYTE_STRING_LITERAL] or [RAW_B
 
 > **Note**: This section is incomplete.
 
+## C string literal expressions
+
+A C string literal expression consists of a single [C_STRING_LITERAL] or [RAW_C_STRING_LITERAL] token.
+
+> **Note**: This section is incomplete.
+
 ## Integer literal expressions
 
 An integer literal expression consists of a single [INTEGER_LITERAL] token.
@@ -182,5 +190,7 @@ The expression's type is the primitive [boolean type], and its value is:
 [BYTE_LITERAL]: ../tokens.md#byte-literals
 [BYTE_STRING_LITERAL]: ../tokens.md#byte-string-literals
 [RAW_BYTE_STRING_LITERAL]: ../tokens.md#raw-byte-string-literals
+[C_STRING_LITERAL]: ../tokens.md#c-string-literals
+[RAW_C_STRING_LITERAL]: ../tokens.md#raw-c-string-literals
 [INTEGER_LITERAL]: ../tokens.md#integer-literals
 [FLOAT_LITERAL]: ../tokens.md#floating-point-literals
diff --git a/src/tokens.md b/src/tokens.md
index 0067b647d..0e6a4a0b9 100644
--- a/src/tokens.md
+++ b/src/tokens.md
@@ -24,14 +24,16 @@ Literals are tokens used in [literal expressions].
 
 #### Characters and strings
 
-|                                              | Example         | `#` sets\* | Characters  | Escapes             |
-|----------------------------------------------|-----------------|------------|-------------|---------------------|
-| [Character](#character-literals)             | `'H'`           | 0          | All Unicode | [Quote](#quote-escapes) & [ASCII](#ascii-escapes) & [Unicode](#unicode-escapes) |
-| [String](#string-literals)                   | `"hello"`       | 0          | All Unicode | [Quote](#quote-escapes) & [ASCII](#ascii-escapes) & [Unicode](#unicode-escapes) |
-| [Raw string](#raw-string-literals)           | `r#"hello"#`    | <256       | All Unicode | `N/A`               |
-| [Byte](#byte-literals)                       | `b'H'`          | 0          | All ASCII   | [Quote](#quote-escapes) & [Byte](#byte-escapes) |
-| [Byte string](#byte-string-literals)         | `b"hello"`      | 0          | All ASCII   | [Quote](#quote-escapes) & [Byte](#byte-escapes) |
-| [Raw byte string](#raw-byte-string-literals) | `br#"hello"#`   | <256       | All ASCII   | `N/A`               |
+|                                              | Example         | `#` sets\* | Characters      | Escapes             |
+|----------------------------------------------|-----------------|------------|-----------------|---------------------|
+| [Character](#character-literals)             | `'H'`           | 0          | All Unicode     | [Quote](#quote-escapes) & [ASCII](#ascii-escapes) & [Unicode](#unicode-escapes) |
+| [String](#string-literals)                   | `"hello"`       | 0          | All Unicode     | [Quote](#quote-escapes) & [ASCII](#ascii-escapes) & [Unicode](#unicode-escapes) |
+| [Raw string](#raw-string-literals)           | `r#"hello"#`    | <256       | All Unicode     | `N/A`               |
+| [Byte](#byte-literals)                       | `b'H'`          | 0          | All ASCII       | [Quote](#quote-escapes) & [Byte](#byte-escapes) |
+| [Byte string](#byte-string-literals)         | `b"hello"`      | 0          | All ASCII       | [Quote](#quote-escapes) & [Byte](#byte-escapes) |
+| [Raw byte string](#raw-byte-string-literals) | `br#"hello"#`   | <256       | All ASCII       | `N/A`               |
+| [C string](#c-string-literals)               | `c"hello"`      | 0          | non-`NUL` ASCII | [Quote](#quote-escapes) & [Byte](#byte-escapes) |
+| [Raw C string](#raw-c-string-literals)       | `cr#"hello"#`   | <256       | non-`NUL` ASCII | `N/A`               |
 
 \* The number of `#`s on each side
of the same literal must be equivalent.
 
@@ -328,6 +330,76 @@ b"\x52"; b"R"; br"R";                // R
 b"\\x52"; br"\x52";                  // \x52
 ```
 
+### C string and raw C string literals
+
+#### C string literals
+
+> **Lexer**\
+> C_STRING_LITERAL :\
+>    `c"` ( ASCII_FOR_C_STRING | BYTE_ESCAPE | STRING_CONTINUE )\* `"` SUFFIX?
+>
+> ASCII_FOR_C_STRING :\
+>    _any non-NUL ASCII (i.e. 0x01 to 0x7F), except_ `"`, `\` _and IsolatedCR_
+
+A non-raw _C string literal_ is a sequence of ASCII characters and _escapes_,
+preceded by the characters `U+0063` (`c`) and `U+0022` (double-quote), and
+followed by the character `U+0022`. If the character `U+0022` is present within
+the literal, it must be _escaped_ by a preceding `U+005C` (`\`) character.
+Alternatively, a C string literal can be a _raw C string literal_, defined
+below. The type of a C string literal is `&core::ffi::CStr`.
+
+Some additional _escapes_ are available in non-raw C string
+literals. An escape starts with a `U+005C` (`\`) and continues with one of the
+following forms:
+
+* A _byte escape_ starts with `U+0078` (`x`) and is followed by exactly
+  two _hex digits_. It denotes the byte equal to the provided hex value. The
+  byte escape sequence `\x00` is forbidden, as C strings may not contain `NUL`.
+* A _whitespace escape_ is one of the characters `U+006E` (`n`), `U+0072`
+  (`r`), or `U+0074` (`t`), denoting the byte values `0x0A` (ASCII LF),
+  `0x0D` (ASCII CR) or `0x09` (ASCII HT) respectively.
+* The _backslash escape_ is the character `U+005C` (`\`) which must be
+  escaped in order to denote its ASCII encoding `0x5C`.
+
+#### Raw C string literals
+
+> **Lexer**\
+> RAW_C_STRING_LITERAL :\
+>    `cr` RAW_C_STRING_CONTENT SUFFIX?
+>
+> RAW_C_STRING_CONTENT :\
+>       `"` ASCII_EXCEPT_NUL* (non-greedy) `"`\
+>    | `#` RAW_C_STRING_CONTENT `#`
+>
+> ASCII_EXCEPT_NUL :\
+>    _any non-NUL ASCII (i.e. 0x01 to 0x7F)_
+
+Raw C string literals do not process any escapes.
They start with the
+character `U+0063` (`c`), followed by `U+0072` (`r`), followed by fewer than 256
+of the character `U+0023` (`#`), and a `U+0022` (double-quote) character. The
+_raw string body_ can contain any sequence of non-`NUL` ASCII characters and is terminated
+only by another `U+0022` (double-quote) character, followed by the same number of
+`U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote)
+character. A raw C string literal cannot contain any non-ASCII byte.
+
+All characters contained in the raw string body represent their ASCII encoding;
+the characters `U+0022` (double-quote) (except when followed by at least as
+many `U+0023` (`#`) characters as were used to start the raw string literal) or
+`U+005C` (`\`) do not have any special meaning.
+
+Examples for C string literals:
+
+```rust
+c"foo"; cr"foo";                     // foo
+c"\"foo\""; cr#""foo""#;             // "foo"
+
+c"foo #\"# bar";
+cr##"foo #"# bar"##;                 // foo #"# bar
+
+c"\x52"; c"R"; cr"R";                // R
+c"\\x52"; cr"\x52";                  // \x52
+```
+
 ### Number literals
 
 A _number literal_ is either an _integer literal_ or a _floating-point