Output of the lexer #42
Comments
"Accept tuple.0.0 as tuple indexing"There is one other case where rustc currently breaks up a single token while parsing: if rustc currently analyses that as three tokens: Effectively the grammar allows Expression This isn't currently documented in the Reference (rust-lang/reference#847), or anywhere else as far as I know. Procedural and by-example macros both also see the third token as a single numeric literal. Procedural macros can return a token stream of this form (using a The history of this feature is in rust-lang/rust#71322. So I think there's a choice of whether to continue to treat this as an unusual exception, or to introduce a more fine-grained tokenisation of some numeric literals. |
In the C++ standard, there are two types of tokens: "preprocessing tokens" [lex.pptoken] and "tokens" [lex.token]. In an early phase, the source files are tokenized into preprocessing tokens, and only a few phases later will they be converted into tokens. It sounds like we should do something similar for Rust. Would it work if the first phase produces fine-grained tokens, and another phase (before parsing) will convert them to compound tokens? It'd be helpful to have very clear names for each type of token, and start working on their exact definition. (I think the …)
I suggest the practical way to start is to write enough of the lexer chapter to describe how to produce a sequence of tokens that carries all the information that the rest of the spec ought to need, then go on to drafting other chapters. Then any decisions about how to convert that sequence into the forms that the grammar and macros need can be made together with writing those chapters. It might be clearer then how much of the description of how to construct (say) the tokens that proc macros see belongs in the Lexer chapter and how much in the Proc-macros chapter. I've made a start on a description that's more rigorous than the one in the Reference, and I think "fine-grained tokens with explicit tokens for whitespace" is going to be a good form to aim for at first.
I have put that description at https://github.com/mattheww/lexeywan
If the plan is to document the lexer separately from the grammar, there are choices to make about how to describe its output.
One choice is whether the lexer's output for punctuation should be described as consisting of fine-grained (single-character) tokens (like the ones procedural macros use), compound tokens (like the `tt` fragment specifier in macros-by-example uses), or some mixture.

If the lexer doesn't use compound tokens, there's then a choice of how to allow the lexer's clients to distinguish (say) `&&` from `& &` in cases where it's necessary.

There's also a choice of whether lifetimes use fine-grained or compound tokens.
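As a rough sketch of the difference (the type names here are invented for illustration, not proposed terminology):

```rust
// Invented names, purely illustrative.

/// Fine-grained style: one token per punctuation character, with whitespace
/// made explicit so that `&&` and `& &` remain distinguishable.
#[derive(Debug)]
enum FineToken {
    Punct(char),
    Whitespace,
    Ident(String),
}

/// Compound style: multi-character punctuation such as `&&` is one token,
/// as a `tt` fragment matcher would see it.
#[derive(Debug)]
enum CompoundToken {
    AndAnd,
    Ident(String),
}

fn main() {
    // `a && b` in each style:
    let fine = [
        FineToken::Ident("a".into()),
        FineToken::Whitespace,
        FineToken::Punct('&'),
        FineToken::Punct('&'),
        FineToken::Whitespace,
        FineToken::Ident("b".into()),
    ];
    let compound = [
        CompoundToken::Ident("a".into()),
        CompoundToken::AndAnd,
        CompoundToken::Ident("b".into()),
    ];
    println!("{fine:?}");
    println!("{compound:?}");
}
```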
Clients of the lexer
There are three parts of Rust which consume the lexer's output:
- The proc-macro system works in terms of fine-grained tokens.
- Macros-by-example (when using the `tt` fragment specifier) works in terms of compound tokens.
- The parser could be specified using either fine-grained or compound tokens, or a mixture of the two.
I think in practice this means that, whatever choice is made for the main description of the lexer, the "Lexing/tokenization" chapter should have additional text describing how to convert its output to the other form(s), which the chapters for macros could refer to.
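One way to observe the compound-token view that `tt` gives is a token-tree-counting macro; this is only an illustration, not spec text:

```rust
// Counts the token trees in its input: `&&` is a single `tt`,
// while `& &` (written with a space) is two.
macro_rules! count_tts {
    () => { 0usize };
    ($head:tt $($rest:tt)*) => { 1usize + count_tts!($($rest)*) };
}

fn main() {
    assert_eq!(count_tts!(a && b), 3);  // `a`, `&&`, `b`
    assert_eq!(count_tts!(a & & b), 4); // `a`, `&`, `&`, `b`
}
```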
Input to the parser
(I'm assuming that the intent is to accurately describe rustc's current behaviour, for the "descriptive" view of the spec.)
Using compound tokens
If the grammar is defined in terms of compound tokens, there are rules which need to match both a single-character token and a compound token beginning with that character:
- `&&` in BorrowExpression, ReferencePattern, and ReferenceType
- `||` in ClosureExpression
- `<` in many places (I think everywhere it appears, except comparison expressions)
- `>` in many places (I think everywhere it appears, except comparison expressions)

Brian Leibig's grammar gives a reasonable idea of how this approach might end up.

With this approach, the description of how the parsed source is converted to an AST would also have to deal with this complication.
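For instance, each of the following lines compiles today, and in each case a compound token has to be accepted where the grammar conceptually wants a single-character token:

```rust
fn main() {
    let x = 1;
    let r: &&i32 = &&x;                    // `&&` used as two nested borrows
    let c = || 0;                          // `||` used as an empty closure parameter list
    let v: Vec<Vec<i32>> = vec![vec![x]];  // the trailing `>>` closes two generic argument lists
    assert_eq!(**r + c() + v[0][0], 2);
}
```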
Using fine-grained tokens
If the grammar is defined in terms of fine-grained tokens, its rules will sometimes need to make a distinction depending on whether punctuation tokens were separated by whitespace.
For example:
- `a << b` must be accepted as a left-shift expression, but `a < < b` must not
- `a && b` must be taken as a LazyBooleanExpression, not treated as ambiguous with `a & &b`
- `a &- b` must be treated as equivalent to `a & -b`
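These behaviours can be checked with ordinary Rust (the rejected spelling is left commented out):

```rust
fn main() {
    let (a, b) = (4i32, 1i32);
    let shift = a << b;         // accepted as a left-shift expression
    // let bad = a < < b;       // `a < < b` is rejected by the parser
    let lazy = a > 0 && b > 0;  // `&&` is a LazyBooleanExpression here
    let masked = a &- b;        // treated as `a & -b`
    assert_eq!((shift, lazy, masked), (8, true, 4));
}
```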
I think that for `<`, `>`, and `|` it's sufficient to know whether there is spacing before the next token, but the second example above shows that for `&` what matters is whether it's immediately followed by another `&`.

One approach might be to describe the lexer's output as including explicit whitespace tokens (and presumably introduce some convenience notation in the grammar so that they don't usually have to be mentioned).
Another approach might be to say that there are two distinct tokens used to represent each relevant punctuation character, as suggested by this comment:
A third approach might be to use some form of attribute grammar, and say that the input tokens have attributes like "Spacing" and "Is immediately followed by another `&`".

With any of these approaches, I think the description of how the parsed source is converted to an AST would remain reasonably simple.
Rejected joined forms when using fine-grained tokens
There are some cases where parses which might naturally be allowed must be rejected when punctuation characters are not separated by whitespace.
I know of the following:
- `SomeStruct { field:::std::i32::MAX }` must be rejected, not treated as equivalent to `SomeStruct { field: ::std::i32::MAX }`
- `fn f(s:::std::string::String)` must be rejected, not treated as equivalent to `fn f(s: ::std::string::String)`
I expect there are more of a similar sort.
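The accepted spellings compile as ordinary Rust; the joined `:::` spellings (left in comments) are rejected by the parser:

```rust
// The spaced forms below compile; the joined `:::` forms in comments do not.
#![allow(deprecated)] // `std::i32::MAX` is the legacy constant used in the examples above

struct SomeStruct {
    field: i32,
}

fn f(s: ::std::string::String) -> usize {
    s.len()
}

fn main() {
    // let bad = SomeStruct { field:::std::i32::MAX };  // rejected
    let ok = SomeStruct { field: ::std::i32::MAX };
    // fn g(s:::std::string::String) {}                 // rejected
    assert_eq!(ok.field, i32::MAX);
    assert_eq!(f(String::from("spec")), 4);
}
```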
This is perhaps an argument for using a compound token for `::` even if fine-grained tokens are used in other cases.

There's also this: `a <- b` must be rejected, not taken as equivalent to `a < -b`, but perhaps that's really analogous to `a && b` (an ambiguity is being resolved in favour of an obsolete unstable feature).

Input to procedural macros
Procedural macros see fine-grained `Punct` tokens, which have a `Spacing` property which indicates whether there was whitespace before the following token.
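Outside the compiler this can be observed with the proc-macro2 crate (assumed here as a dependency), which mirrors the `proc_macro` API:

```rust
// Assumes proc-macro2 = "1" as a dependency; it mirrors the compiler's
// proc_macro API and can run outside a macro context.
use proc_macro2::{TokenStream, TokenTree};

fn main() {
    let tokens: TokenStream = "a && b < c".parse().unwrap();
    for tt in tokens {
        if let TokenTree::Punct(p) = tt {
            // Prints: '&' Joint, '&' Alone, '<' Alone
            println!("{:?} {:?}", p.as_char(), p.spacing());
        }
    }
}
```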
Lifetime-or-label (`'foo`) is represented as a `Punct` token with joint spacing followed by an `Ident` token.

The documentation as of Rust 1.76 has this to say about `Punct`'s `Spacing` property:
Input to "Macros by example" macros
By-example macros using the `tt` fragment specifier see the following combinations of punctuation as compound tokens:

Lifetime-or-label (`'foo`) is represented as a single token.
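The lifetime claim can be checked with a small `tt` matcher (illustrative only):

```rust
// A `tt` matcher sees the lifetime-or-label `'foo` as a single token tree.
macro_rules! first_tt {
    ($first:tt $($rest:tt)*) => {
        stringify!($first)
    };
}

fn main() {
    assert_eq!(first_tt!('foo: loop { break 'foo; }), "'foo");
}
```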
Sources of information
The current rustc implementation
(As of rustc 1.76)
The lower-level lexer in `rustc_lexer` emits fine-grained tokens for punctuation (but compound tokens for lifetime-or-label). It emits explicit tokens representing whitespace.

The higher-level lexer in `rustc_parse::lexer` emits compound tokens for punctuation. It represents whitespace as 'Spacing' information describing the relationship to the following token.

The parser breaks those compound tokens up again where necessary.
(As I understand it, there is consensus that ideally rustc would switch to using fine-grained tokens internally.)
The breaking-up takes place in the following functions in `rustc_parse`, which might be used to audit for cases where a grammar based on compound tokens would need special cases:

- `expect_and()`
- `expect_or()`
- `eat_lt()`
- `expect_lt()`
- `expect_gt()`
- `eat_plus()`
In particular, rustc takes care to split `+=` when the `+` appears in generic bounds, but I think that isn't currently making any difference (I don't think a following `=` can ever parse).

Currently maintained documentation
The Reference
The Lexical structure chapter of the Reference uses compound tokens for punctuation and lifetime-or-label.
The grammar used in the Reference is written in a mixed style.
In most places it's written as if it's working with fine-grained tokens; notation like `&&` can be taken as representing two `&` tokens without intervening whitespace.

In three cases (BorrowExpression, ReferencePattern, and ClosureExpression) it lists both the fine-grained and the compound tokens explicitly, sometimes with an explanation in the text.
It treats `LIFETIME_OR_LABEL` as a single token.

The Ferrocene spec
The Ferrocene spec's lexical description and grammar are close to the Reference's.
The lexical part was forked from the Reference sometime in late 2021, and has no interesting additions.
Its grammar doesn't include the separately-listed compound tokens that the Reference has.
Older formalisations
Rustypop
Rustypop from 2016 used fine-grained tokens to deal with `<<`, `>>`, `&&`, and `||`.

Otherwise it used compound tokens (in particular, for `<=`, `<<=`, `<-`, `>=`, `>>=`, and `::`).

Its README describes its approach to dealing with multiple-punctuation symbols in the grammar.
Brian Leibig's grammar
Brian Leibig's grammar (last updated in 2017) uses compound tokens throughout.
It's the only grammar of that sort I've found that tries to deal with all cases where compound tokens need to be accepted as if they were multiple fine-grained ones.
wg-grammar
The wg-grammar grammar from 2019 appears to have used compound tokens for both punctuation and lifetimes.
I don't think that project got as far as dealing with the resulting complications. rust-lang/wg-grammar#47 has a little discussion.
rust-lang/wg-grammar#3 includes extensive discussion on how the lexer might be modelled.
One result of that discussion was https://github.com/CAD97/rust-lexical-spec, which would have output fine-grained punctuation tokens and explicit whitespace tokens.