diff --git a/src/lexical-structure.md b/src/lexical-structure.md index 5e1388e0d..d70e97ac3 100644 --- a/src/lexical-structure.md +++ b/src/lexical-structure.md @@ -1 +1,3 @@ # Lexical structure + + diff --git a/src/tokens.md b/src/tokens.md index d94464f9f..774a73b9f 100644 --- a/src/tokens.md +++ b/src/tokens.md @@ -1,5 +1,8 @@ # Tokens +r[lex.token] + +r[lex.token.intro] Tokens are primitive productions in the grammar defined by regular (non-recursive) languages. Rust source input can be broken down into the following kinds of tokens: @@ -18,6 +21,7 @@ table production] form, and appear in `monospace` font. ## Literals +r[lex.token.literal] Literals are tokens used in [literal expressions]. ### Examples @@ -88,13 +92,17 @@ Literals are tokens used in [literal expressions]. #### Suffixes -A suffix is a sequence of characters following the primary part of a literal (without intervening whitespace), of the same form as a non-raw identifier or keyword. +r[lex.token.literal.suffix] +r[lex.token.literal.suffix.intro] +A suffix is a sequence of characters following the primary part of a literal (without intervening whitespace), of the same form as a non-raw identifier or keyword. +r[lex.token.literal.suffix.syntax] > **Lexer**\ > SUFFIX : IDENTIFIER_OR_KEYWORD\ > SUFFIX_NO_E : SUFFIX _not beginning with `e` or `E`_ +r[lex.token.literal.suffix.validity] Any kind of literal (string, integer, etc) with any suffix is valid as a token. A literal token with any suffix can be passed to a macro without producing an error. @@ -109,6 +117,7 @@ blackhole!("string"suffix); // OK blackhole_lit!(1suffix); // OK ``` +r[lex.token.literal.suffix.parse] However, suffixes on literal tokens which are interpreted as literal expressions or patterns are restricted. Any suffixes are rejected on non-numeric literal tokens, and numeric literal tokens are accepted only with suffixes from the list below. 
@@ -121,6 +130,9 @@ and numeric literal tokens are accepted only with suffixes from the list below. #### Character literals +r[lex.token.literal.char] + +r[lex.token.literal.char.syntax] > **Lexer**\ > CHAR_LITERAL :\ >    `'` ( ~\[`'` `\` \\n \\r \\t] | QUOTE_ESCAPE | ASCII_ESCAPE | UNICODE_ESCAPE ) `'` SUFFIX? @@ -135,12 +147,16 @@ and numeric literal tokens are accepted only with suffixes from the list below. > UNICODE_ESCAPE :\ >    `\u{` ( HEX_DIGIT `_`\* )1..6 `}` +r[lex.token.literal.char.intro] A _character literal_ is a single Unicode character enclosed within two `U+0027` (single-quote) characters, with the exception of `U+0027` itself, which must be _escaped_ by a preceding `U+005C` character (`\`). #### String literals +r[lex.token.literal.str] + +r[lex.token.literal.str.syntax] > **Lexer**\ > STRING_LITERAL :\ >    `"` (\ @@ -154,10 +170,12 @@ which must be _escaped_ by a preceding `U+005C` character (`\`). > STRING_CONTINUE :\ >    `\` _followed by_ \\n +r[lex.token.literal.str.intro] A _string literal_ is a sequence of any Unicode characters enclosed within two `U+0022` (double-quote) characters, with the exception of `U+0022` itself, which must be _escaped_ by a preceding `U+005C` character (`\`). +r[lex.token.literal.str.linefeed] Line-breaks, represented by the character `U+000A` (LF), are allowed in string literals. When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token. See [String continuation escapes] for details. @@ -165,28 +183,43 @@ The character `U+000D` (CR) may not appear in a string literal other than as par #### Character escapes +r[lex.token.literal.char-escape] + +r[lex.token.literal.char-escape.intro] Some additional _escapes_ are available in either character or non-raw string literals. 
An escape starts with a `U+005C` (`\`) and continues with one of the following forms: +r[lex.token.literal.char-escape.ascii] * A _7-bit code point escape_ starts with `U+0078` (`x`) and is followed by exactly two _hex digits_ with value up to `0x7F`. It denotes the ASCII character with value equal to the provided hex value. Higher values are not permitted because it is ambiguous whether they mean Unicode code points or byte values. + +r[lex.token.literal.char-escape.unicode] * A _24-bit code point escape_ starts with `U+0075` (`u`) and is followed by up to six _hex digits_ surrounded by braces `U+007B` (`{`) and `U+007D` (`}`). It denotes the Unicode code point equal to the provided hex value. + +r[lex.token.literal.char-escape.whitespace] * A _whitespace escape_ is one of the characters `U+006E` (`n`), `U+0072` (`r`), or `U+0074` (`t`), denoting the Unicode values `U+000A` (LF), `U+000D` (CR) or `U+0009` (HT) respectively. + +r[lex.token.literal.char-escape.null] * The _null escape_ is the character `U+0030` (`0`) and denotes the Unicode value `U+0000` (NUL). + +r[lex.token.literal.char-escape.slash] * The _backslash escape_ is the character `U+005C` (`\`) which must be escaped in order to denote itself. #### Raw string literals +r[lex.token.literal.str-raw] + +r[lex.token.literal.str-raw.syntax] > **Lexer**\ > RAW_STRING_LITERAL :\ >    `r` RAW_STRING_CONTENT SUFFIX? @@ -195,13 +228,16 @@ following forms: >       `"` ( ~ _IsolatedCR_ )* (non-greedy) `"`\ >    | `#` RAW_STRING_CONTENT `#` +r[lex.token.literal.str-raw.intro] Raw string literals do not process any escapes. They start with the character `U+0072` (`r`), followed by fewer than 256 of the character `U+0023` (`#`) and a `U+0022` (double-quote) character. +r[lex.token.literal.str-raw.body] The _raw string body_ can contain any sequence of Unicode characters other than `U+000D` (CR). 
It is terminated only by another `U+0022` (double-quote) character, followed by the same number of `U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote) character. +r[lex.token.literal.str-raw.content] All Unicode characters contained in the raw string body represent themselves, the characters `U+0022` (double-quote) (except when followed by at least as many `U+0023` (`#`) characters as were used to start the raw string literal) or @@ -224,6 +260,9 @@ r##"foo #"# bar"##; // foo #"# bar #### Byte literals +r[lex.token.byte] + +r[lex.token.byte.syntax] > **Lexer**\ > BYTE_LITERAL :\ >    `b'` ( ASCII_FOR_CHAR | BYTE_ESCAPE ) `'` SUFFIX? @@ -235,6 +274,7 @@ r##"foo #"# bar"##; // foo #"# bar >       `\x` HEX_DIGIT HEX_DIGIT\ >    | `\n` | `\r` | `\t` | `\\` | `\0` | `\'` | `\"` +r[lex.token.byte.intro] A _byte literal_ is a single ASCII character (in the `U+0000` to `U+007F` range) or a single _escape_ preceded by the characters `U+0062` (`b`) and `U+0027` (single-quote), and followed by the character `U+0027`. If the character @@ -244,6 +284,9 @@ _number literal_. #### Byte string literals +r[lex.token.str-byte] + +r[lex.token.str-byte.syntax] > **Lexer**\ > BYTE_STRING_LITERAL :\ >    `b"` ( ASCII_FOR_STRING | BYTE_ESCAPE | STRING_CONTINUE )\* `"` SUFFIX? @@ -251,6 +294,7 @@ _number literal_. > ASCII_FOR_STRING :\ >    _any ASCII (i.e 0x00 to 0x7F), except_ `"`, `\` _and IsolatedCR_ +r[lex.token.str-byte.intro] A non-raw _byte string literal_ is a sequence of ASCII characters and _escapes_, preceded by the characters `U+0062` (`b`) and `U+0022` (double-quote), and followed by the character `U+0022`. If the character `U+0022` is present within @@ -258,28 +302,40 @@ the literal, it must be _escaped_ by a preceding `U+005C` (`\`) character. Alternatively, a byte string literal can be a _raw byte string literal_, defined below. 
+r[lex.token.str-byte.linefeed] Line-breaks, represented by the character `U+000A` (LF), are allowed in byte string literals. When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token. See [String continuation escapes] for details. The character `U+000D` (CR) may not appear in a byte string literal other than as part of such a string continuation escape. +r[lex.token.str-byte.escape] Some additional _escapes_ are available in either byte or non-raw byte string literals. An escape starts with a `U+005C` (`\`) and continues with one of the following forms: +r[lex.token.str-byte.escape-byte] * A _byte escape_ escape starts with `U+0078` (`x`) and is followed by exactly two _hex digits_. It denotes the byte equal to the provided hex value. + +r[lex.token.str-byte.escape-whitespace] * A _whitespace escape_ is one of the characters `U+006E` (`n`), `U+0072` (`r`), or `U+0074` (`t`), denoting the bytes values `0x0A` (ASCII LF), `0x0D` (ASCII CR) or `0x09` (ASCII HT) respectively. + +r[lex.token.str-byte.escape-null] * The _null escape_ is the character `U+0030` (`0`) and denotes the byte value `0x00` (ASCII NUL). + +r[lex.token.str-byte.escape-slash] * The _backslash escape_ is the character `U+005C` (`\`) which must be escaped in order to denote its ASCII encoding `0x5C`. #### Raw byte string literals +r[lex.token.str-byte-raw] + +r[lex.token.str-byte-raw.syntax] > **Lexer**\ > RAW_BYTE_STRING_LITERAL :\ >    `br` RAW_BYTE_STRING_CONTENT SUFFIX? @@ -291,14 +347,17 @@ following forms: > ASCII_FOR_RAW :\ >    _any ASCII (i.e. 0x00 to 0x7F) except IsolatedCR_ +r[lex.token.str-byte-raw.intro] Raw byte string literals do not process any escapes. They start with the character `U+0062` (`b`), followed by `U+0072` (`r`), followed by fewer than 256 of the character `U+0023` (`#`), and a `U+0022` (double-quote) character. 
+r[lex.token.str-byte-raw.body] The _raw string body_ can contain any sequence of ASCII characters other than `U+000D` (CR). It is terminated only by another `U+0022` (double-quote) character, followed by the same number of `U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote) character. A raw byte string literal can not contain any non-ASCII byte. +r[lex.token.str-byte-raw.content] All characters contained in the raw string body represent their ASCII encoding, the characters `U+0022` (double-quote) (except when followed by at least as many `U+0023` (`#`) characters as were used to start the raw string literal) or @@ -321,6 +380,9 @@ b"\\x52"; br"\x52"; // \x52 #### C string literals +r[lex.token.str-c] + +r[lex.token.str-c.syntax] > **Lexer**\ > C_STRING_LITERAL :\ >    `c"` (\ @@ -330,6 +392,7 @@ b"\\x52"; br"\x52"; // \x52 >       | STRING_CONTINUE\ >    )\* `"` SUFFIX? +r[lex.token.str-c.intro] A _C string literal_ is a sequence of Unicode characters and _escapes_, preceded by the characters `U+0063` (`c`) and `U+0022` (double-quote), and followed by the character `U+0022`. If the character `U+0022` is present within @@ -338,31 +401,42 @@ Alternatively, a C string literal can be a _raw C string literal_, defined below [CStr]: core::ffi::CStr +r[lex.token.str-c.null] C strings are implicitly terminated by byte `0x00`, so the C string literal `c""` is equivalent to manually constructing a `&CStr` from the byte string literal `b"\x00"`. Other than the implicit terminator, byte `0x00` is not permitted within a C string. +r[lex.token.str-c.linefeed] Line-breaks, represented by the character `U+000A` (LF), are allowed in C string literals. When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token. See [String continuation escapes] for details.
The character `U+000D` (CR) may not appear in a C string literal other than as part of such a string continuation escape. +r[lex.token.str-c.escape] Some additional _escapes_ are available in non-raw C string literals. An escape starts with a `U+005C` (`\`) and continues with one of the following forms: +r[lex.token.str-c.escape-byte] * A _byte escape_ escape starts with `U+0078` (`x`) and is followed by exactly two _hex digits_. It denotes the byte equal to the provided hex value. + +r[lex.token.str-c.escape-unicode] * A _24-bit code point escape_ starts with `U+0075` (`u`) and is followed by up to six _hex digits_ surrounded by braces `U+007B` (`{`) and `U+007D` (`}`). It denotes the Unicode code point equal to the provided hex value, encoded as UTF-8. + +r[lex.token.str-c.escape-whitespace] * A _whitespace escape_ is one of the characters `U+006E` (`n`), `U+0072` (`r`), or `U+0074` (`t`), denoting the bytes values `0x0A` (ASCII LF), `0x0D` (ASCII CR) or `0x09` (ASCII HT) respectively. + +r[lex.token.str-c.escape-slash] * The _backslash escape_ is the character `U+005C` (`\`) which must be escaped in order to denote its ASCII encoding `0x5C`. +r[lex.token.str-c.char-unicode] A C string represents bytes with no defined encoding, but a C string literal may contain Unicode characters above `U+007F`. Such characters will be replaced with the bytes of that character's UTF-8 representation. @@ -375,11 +449,15 @@ c"\u{00E6}"; c"\xC3\xA6"; ``` +r[lex.token.str-c.edition2021] > **Edition differences**: C string literals are accepted in the 2021 edition or > later. In earlier additions the token `c""` is lexed as `c ""`. #### Raw C string literals +r[lex.token.str-c-raw] + +r[lex.token.str-c-raw.syntax] > **Lexer**\ > RAW_C_STRING_LITERAL :\ >    `cr` RAW_C_STRING_CONTENT SUFFIX? 
@@ -388,18 +466,22 @@ c"\xC3\xA6"; >       `"` ( ~ _IsolatedCR_ _NUL_ )* (non-greedy) `"`\ >    | `#` RAW_C_STRING_CONTENT `#` +r[lex.token.str-c-raw.intro] Raw C string literals do not process any escapes. They start with the character `U+0063` (`c`), followed by `U+0072` (`r`), followed by fewer than 256 of the character `U+0023` (`#`), and a `U+0022` (double-quote) character. +r[lex.token.str-c-raw.body] The _raw C string body_ can contain any sequence of Unicode characters other than `U+0000` (NUL) and `U+000D` (CR). It is terminated only by another `U+0022` (double-quote) character, followed by the same number of `U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote) character. +r[lex.token.str-c-raw.content] All characters contained in the raw C string body represent themselves in UTF-8 encoding. The characters `U+0022` (double-quote) (except when followed by at least as many `U+0023` (`#`) characters as were used to start the raw C string literal) or `U+005C` (`\`) do not have any special meaning. +r[lex.token.str-c-raw.edition2021] > **Edition differences**: Raw C string literals are accepted in the 2021 > edition or later. In earlier additions the token `cr""` is lexed as `cr ""`, > and `cr#""#` is lexed as `cr #""#` (which is non-grammatical). @@ -419,11 +501,16 @@ c"\\x52"; cr"\x52"; // \x52 ### Number literals +r[lex.token.literal.num] + A _number literal_ is either an _integer literal_ or a _floating-point literal_. The grammar for recognizing the two kinds of literals is mixed. #### Integer literals +r[lex.token.literal.int] + +r[lex.token.literal.int.syntax] > **Lexer**\ > INTEGER_LITERAL :\ >    ( DEC_LITERAL | BIN_LITERAL | OCT_LITERAL | HEX_LITERAL ) @@ -449,20 +536,29 @@ literal_. The grammar for recognizing the two kinds of literals is mixed. 
> > HEX_DIGIT : \[`0`-`9` `a`-`f` `A`-`F`] +r[lex.token.literal.int.kind] An _integer literal_ has one of four forms: +r[lex.token.literal.int.kind-dec] * A _decimal literal_ starts with a *decimal digit* and continues with any mixture of *decimal digits* and _underscores_. + +r[lex.token.literal.int.kind-hex] * A _hex literal_ starts with the character sequence `U+0030` `U+0078` (`0x`) and continues as any mixture (with at least one digit) of hex digits and underscores. + +r[lex.token.literal.int.kind-oct] * An _octal literal_ starts with the character sequence `U+0030` `U+006F` (`0o`) and continues as any mixture (with at least one digit) of octal digits and underscores. + +r[lex.token.literal.int.kind-bin] * A _binary literal_ starts with the character sequence `U+0030` `U+0062` (`0b`) and continues as any mixture (with at least one digit) of binary digits and underscores. +r[lex.token.literal.int.restriction] Like any literal, an integer literal may be followed (immediately, without any spaces) by a suffix as described above. The suffix may not begin with `e` or `E`, as that would be interpreted as the exponent of a floating-point literal. See [Integer literal expressions] for the effect of these suffixes. @@ -515,13 +611,18 @@ Examples of integer literals which are not accepted as literal expressions: #### Tuple index +r[lex.token.literal.int.tuple-field] + +r[lex.token.literal.int.tuple-field.syntax] > **Lexer**\ > TUPLE_INDEX: \ >    INTEGER_LITERAL +r[lex.token.literal.int.tuple-field.intro] A tuple index is used to refer to the fields of [tuples], [tuple structs], and [tuple variants]. +r[lex.token.literal.int.tuple-field.eq] Tuple indices are compared with the literal token directly. Tuple indices start with `0` and each successive index increments the value by `1` as a decimal value. 
Thus, only decimal values will match, and the value must not @@ -541,6 +642,9 @@ let horse = example.0b10; // ERROR no field named `0b10` #### Floating-point literals +r[lex.token.literal.float] + +r[lex.token.literal.float.syntax] > **Lexer**\ > FLOAT_LITERAL :\ >       DEC_LITERAL `.` @@ -553,12 +657,14 @@ let horse = example.0b10; // ERROR no field named `0b10` > (DEC_DIGIT|`_`)\* DEC_DIGIT (DEC_DIGIT|`_`)\* > +r[lex.token.literal.float.form] A _floating-point literal_ has one of two forms: * A _decimal literal_ followed by a period character `U+002E` (`.`). This is optionally followed by another decimal literal, with an optional _exponent_. * A single _decimal literal_ followed by an _exponent_. +r[lex.token.literal.float.suffix] Like integer literals, a floating-point literal may be followed by a suffix, so long as the pre-suffix part does not end with `U+002E` (`.`). The suffix may not begin with `e` or `E` if the literal does not include an exponent. @@ -575,7 +681,7 @@ let x: f64 = 2.; ``` This last example is different because it is not possible to use the suffix -syntax with a floating point literal ending in a period. `2.f64` would attempt +syntax with a floating point literal ending in a period. `2.f64` would attempt to call a method named `f64` on `2`. Note that `-1.0`, for example, is analyzed as two tokens: `-` followed by `1.0`. @@ -594,6 +700,8 @@ Examples of floating-point literals which are not accepted as literal expression #### Reserved forms similar to number literals +r[lex.token.literal.reserved] + > **Lexer**\ > RESERVED_NUMBER :\ >       BIN_LITERAL \[`2`-`9`​]\ @@ -606,17 +714,23 @@ Examples of floating-point literals which are not accepted as literal expression >    | `0x` `_`\* _end of input or not HEX_DIGIT_\ >    | DEC_LITERAL ( . DEC_LITERAL)? (`e`|`E`) (`+`|`-`)? _end of input or not DEC_DIGIT_ +r[lex.token.literal.reserved.intro] The following lexical forms similar to number literals are _reserved forms_.
Due to the possible ambiguity these raise, they are rejected by the tokenizer instead of being interpreted as separate tokens. +r[lex.token.literal.reserved.out-of-range] * An unsuffixed binary or octal literal followed, without intervening whitespace, by a decimal digit out of the range for its radix. +r[lex.token.literal.reserved.period] * An unsuffixed binary, octal, or hexadecimal literal followed, without intervening whitespace, by a period character (with the same restrictions on what follows the period as for floating-point literals). +r[lex.token.literal.reserved.exp] * An unsuffixed binary or octal literal followed, without intervening whitespace, by the character `e` or `E`. +r[lex.token.literal.reserved.empty-with-radix] * Input which begins with one of the radix prefixes but is not a valid binary, octal, or hexadecimal literal (because it contains no digits). +r[lex.token.literal.reserved.empty-exp] * Input which has the form of a floating-point literal with no digits in the exponent. Examples of reserved forms: @@ -636,6 +750,9 @@ Examples of reserved forms: ## Lifetimes and loop labels +r[lex.token.life] + +r[lex.token.life.syntax] > **Lexer**\ > LIFETIME_TOKEN :\ >       `'` [IDENTIFIER_OR_KEYWORD][identifier] @@ -649,12 +766,16 @@ Examples of reserved forms: >    | `'_` > _(not immediately followed by `'`)_ +r[lex.token.life.intro] Lifetime parameters and [loop labels] use LIFETIME_OR_LABEL tokens. Any LIFETIME_TOKEN will be accepted by the lexer, and for example, can be used in macros. ## Punctuation +r[lex.token.punct] + +r[lex.token.punct.intro] Punctuation symbol tokens are listed here for completeness. Their individual usages and meanings are defined in the linked pages. @@ -710,6 +831,8 @@ usages and meanings are defined in the linked pages. ## Delimiters +r[lex.token.delim] + Bracket punctuation is used in various parts of the grammar. An open bracket must always be paired with a close bracket. 
Brackets and the tokens within them are referred to as "token trees" in [macros]. The three types of brackets are: @@ -722,19 +845,27 @@ them are referred to as "token trees" in [macros]. The three types of brackets ## Reserved prefixes +r[lex.token.reserved-prefix] + +r[lex.token.reserved-prefix.syntax] > **Lexer 2021+**\ > RESERVED_TOKEN_DOUBLE_QUOTE : ( IDENTIFIER_OR_KEYWORD _Except `b` or `c` or `r` or `br` or `cr`_ | `_` ) `"`\ > RESERVED_TOKEN_SINGLE_QUOTE : ( IDENTIFIER_OR_KEYWORD _Except `b`_ | `_` ) `'`\ > RESERVED_TOKEN_POUND : ( IDENTIFIER_OR_KEYWORD _Except `r` or `br` or `cr`_ | `_` ) `#` +r[lex.token.reserved-prefix.intro] Some lexical forms known as _reserved prefixes_ are reserved for future use. +r[lex.token.reserved-prefix.id] Source input which would otherwise be lexically interpreted as a non-raw identifier (or a keyword or `_`) which is immediately followed by a `#`, `'`, or `"` character (without intervening whitespace) is identified as a reserved prefix. +r[lex.token.reserved-prefix.raw-token] Note that raw identifiers, raw string literals, and raw byte string literals may contain a `#` character but are not interpreted as containing a reserved prefix. +r[lex.token.reserved-prefix.strings] Similarly the `r`, `b`, `br`, `c`, and `cr` prefixes used in raw string literals, byte literals, byte string literals, raw byte string literals, C string literals, and raw C string literals are not interpreted as reserved prefixes. +r[lex.token.reserved-prefix.edition2021] > **Edition differences**: Starting with the 2021 edition, reserved prefixes are reported as an error by the lexer (in particular, they cannot be passed to macros). > > Before the 2021 edition, reserved prefixes are accepted by the lexer and interpreted as multiple tokens (for example, one token for the identifier or keyword, followed by a `#` token). 
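Reviewer note on `lex.token.reserved-prefix.raw-token` and `lex.token.reserved-prefix.strings`: the sketch below is not part of the patch. It only illustrates, under the assumption of a toolchain that accepts these forms in any edition, that a raw identifier, a raw string, a raw byte string, and a byte literal are lexed as single tokens even though a `#`, `"`, or `'` directly follows an identifier-like prefix. The function name `raw_forms` is hypothetical.

```rust
// Returns one value per literal form so the results can be inspected.
fn raw_forms() -> (&'static str, &'static [u8], u8) {
    // Raw identifier (`r#` + keyword) bound to a raw string literal;
    // neither is treated as a reserved prefix.
    let r#match = r#"a "quoted" word"#;
    // Raw byte string: the `br` prefix may be followed by `#`.
    let bytes: &'static [u8] = br#"raw "bytes""#;
    // Byte literal: the `b` prefix may be followed by `'`.
    let byte = b'x';
    (r#match, bytes, byte)
}

fn main() {
    let (s, bytes, byte) = raw_forms();
    assert_eq!(s, "a \"quoted\" word");
    assert_eq!(bytes, &b"raw \"bytes\""[..]);
    assert_eq!(byte, 0x78);
    println!("{}", s);
}
```

The escaped literals on the right-hand side of each assertion spell out the same content without raw syntax, which makes it easy to see that the `#` delimiters are not part of the value.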
diff --git a/src/whitespace.md b/src/whitespace.md index a93bdcbdb..cd099946b 100644 --- a/src/whitespace.md +++ b/src/whitespace.md @@ -1,5 +1,8 @@ # Whitespace +r[lex.whitespace] + +r[lex.whitespace.intro] Whitespace is any non-empty string containing only characters that have the [`Pattern_White_Space`] Unicode property, namely: @@ -15,9 +18,11 @@ Whitespace is any non-empty string containing only characters that have the - `U+2028` (line separator) - `U+2029` (paragraph separator) +r[lex.whitespace.token-sep] Rust is a "free-form" language, meaning that all forms of whitespace serve only to separate _tokens_ in the grammar, and have no semantic significance. +r[lex.whitespace.replacement] A Rust program has identical meaning if each whitespace element is replaced with any other legal whitespace element, such as a single space character.
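Reviewer note on `lex.whitespace.replacement`: a minimal sketch, not part of the patch, of what the rule claims. The two functions below (hypothetical names `spaced` and `reflowed`) contain the same token sequence; only the whitespace between tokens differs, with each space in the first replaced by a line break in the second.

```rust
// Conventional layout: single spaces between tokens.
fn spaced() -> i32 {
    let x = 1 + 2;
    x * 3
}

// Same tokens in the same order; every space is replaced by a line break.
fn reflowed()
->
i32
{
let
x
=
1
+
2;
x
*
3
}

fn main() {
    // Both functions compute (1 + 2) * 3.
    assert_eq!(spaced(), reflowed());
    assert_eq!(spaced(), 9);
    println!("{}", reflowed());
}
```

Whitespace inside string or character literals is part of the literal token rather than a whitespace element, so the rule does not extend to replacing it.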