Output of the lexer #42
Comments
"Accept tuple.0.0 as tuple indexing"There is one other case where rustc currently breaks up a single token while parsing: if rustc currently analyses that as three tokens: Effectively the grammar allows Expression This isn't currently documented in the Reference (rust-lang/reference#847), or anywhere else as far as I know. Procedural and by-example macros both also see the third token as a single numeric literal. Procedural macros can return a token stream of this form (using a The history of this feature is in rust-lang/rust#71322. So I think there's a choice of whether to continue to treat this as an unusual exception, or to introduce a more fine-grained tokenisation of some numeric literals. |
In the C++ standard, there are two types of tokens: "preprocessing tokens" [lex.pptoken] and "tokens" [lex.token]. In an early phase, the source files are tokenized into preprocessing tokens, and only a few phases later will they be converted into tokens. It sounds like we should do something similar for Rust. Would it work if the first phase produces fine-grained tokens, and another phase (before parsing) will convert them to compound tokens? It'd be helpful to have very clear names for each type of token, and start working on their exact definition. (I think the …)
I suggest the practical way to start is to write enough of the lexer chapter to describe how to produce a sequence of tokens that carries all the information that the rest of the spec ought to need, then go on to drafting other chapters. Then any decisions about how to convert that sequence into the forms that the grammar and macros need can be made together with writing those chapters. It might be clearer then how much of the description of how to construct (say) the tokens that proc macros see belongs in the Lexer chapter and how much in the Proc-macros chapter. I've made a start on a description that's more rigorous than the one in the Reference, and I think "fine-grained tokens with explicit tokens for whitespace" is going to be a good form to aim for at first.
I have put that description at https://github.com/mattheww/lexeywan
If the plan is to document the lexer separately from the grammar, there are choices to make about how to describe its output.
One choice is whether the lexer's output for punctuation should be described as consisting of fine-grained (single-character) tokens (like the ones procedural macros use), compound tokens (like the `tt` fragment specifier in macros-by-example uses), or some mixture.

If the lexer doesn't use compound tokens, there's then a choice of how to allow the lexer's clients to distinguish (say) `&&` from `& &` in cases where it's necessary.

There's also a choice of whether lifetimes use fine-grained or compound tokens.
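As a rough sketch of the difference (the type names here are invented for illustration, not proposed terminology):

```rust
// Invented names, purely illustrative.

/// Fine-grained style: one token per punctuation character, with whitespace
/// made explicit so that `&&` and `& &` remain distinguishable.
#[derive(Debug)]
enum FineToken {
    Punct(char),
    Whitespace,
    Ident(String),
}

/// Compound style: multi-character punctuation such as `&&` is one token,
/// as a `tt` fragment matcher would see it.
#[derive(Debug)]
enum CompoundToken {
    AndAnd,
    Ident(String),
}

fn main() {
    // `a && b` in each style:
    let fine = [
        FineToken::Ident("a".into()),
        FineToken::Whitespace,
        FineToken::Punct('&'),
        FineToken::Punct('&'),
        FineToken::Whitespace,
        FineToken::Ident("b".into()),
    ];
    let compound = [
        CompoundToken::Ident("a".into()),
        CompoundToken::AndAnd,
        CompoundToken::Ident("b".into()),
    ];
    println!("{fine:?}");
    println!("{compound:?}");
}
```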
Clients of the lexer
There are three parts of Rust which consume the lexer's output:
- The proc-macro system works in terms of fine-grained tokens.
- Macros-by-example (when using the `tt` fragment specifier) works in terms of compound tokens.
- The parser could be specified using either fine-grained or compound tokens, or a mixture of the two.
I think in practice this means that, whatever choice is made for the main description of the lexer, the "Lexing/tokenization" chapter should have additional text describing how to convert its output to the other form(s), which the chapters for macros could refer to.
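One way to observe the compound-token view that `tt` gives is a token-tree-counting macro; this is only an illustration, not spec text:

```rust
// Counts the token trees in its input: `&&` is a single `tt`,
// while `& &` (written with a space) is two.
macro_rules! count_tts {
    () => { 0usize };
    ($head:tt $($rest:tt)*) => { 1usize + count_tts!($($rest)*) };
}

fn main() {
    assert_eq!(count_tts!(a && b), 3);  // `a`, `&&`, `b`
    assert_eq!(count_tts!(a & & b), 4); // `a`, `&`, `&`, `b`
}
```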
Input to the parser
(I'm assuming that the intent is to accurately describe rustc's current behaviour, for the "descriptive" view of the spec.)
Using compound tokens
If the grammar is defined in terms of compound tokens, there are rules which need to match both a single-character token and a compound token beginning with that character:
- `&&` in BorrowExpression, ReferencePattern, and ReferenceType
- `||` in ClosureExpression
- `<` in many places (I think everywhere it appears, except comparison expressions)
- `>` in many places (I think everywhere it appears, except comparison expressions)

Brian Leibig's grammar gives a reasonable idea of how this approach might end up.

With this approach, the description of how the parsed source is converted to an AST would also have to deal with this complication.
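For instance, each of the following lines compiles today, and in each case a compound token has to be accepted where the grammar conceptually wants a single-character token:

```rust
fn main() {
    let x = 1;
    let r: &&i32 = &&x;                    // `&&` used as two nested borrows
    let c = || 0;                          // `||` used as an empty closure parameter list
    let v: Vec<Vec<i32>> = vec![vec![x]];  // the trailing `>>` closes two generic argument lists
    assert_eq!(**r + c() + v[0][0], 2);
}
```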
Using fine-grained tokens
If the grammar is defined in terms of fine-grained tokens, its rules will sometimes need to make a distinction depending on whether punctuation tokens were separated by whitespace.
For example:
- `a << b` must be accepted as a left-shift expression, but `a < < b` must not
- `a && b` must be taken as a LazyBooleanExpression, not treated as ambiguous with `a & &b`
- `a &- b` must be treated as equivalent to `a & -b`
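These behaviours can be checked with ordinary Rust (the rejected spelling is left commented out):

```rust
fn main() {
    let (a, b) = (4i32, 1i32);
    let shift = a << b;         // accepted as a left-shift expression
    // let bad = a < < b;       // `a < < b` is rejected by the parser
    let lazy = a > 0 && b > 0;  // `&&` is a LazyBooleanExpression here
    let masked = a &- b;        // treated as `a & -b`
    assert_eq!((shift, lazy, masked), (8, true, 4));
}
```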
I think that for `<`, `>`, and `|` it's sufficient to know whether there is spacing before the next token, but the second example above shows that for `&` what matters is whether it's immediately followed by another `&`.

One approach might be to describe the lexer's output as including explicit whitespace tokens (and presumably introduce some convenience notation in the grammar so that they don't usually have to be mentioned).
Another approach might be to say that there are two distinct tokens used to represent each relevant punctuation character, as suggested by this comment:
A third approach might be to use some form of attribute grammar, and say that the input tokens have attributes like "Spacing" and "Is immediately followed by another `&`".

With any of these approaches, I think the description of how the parsed source is converted to an AST would remain reasonably simple.
Rejected joined forms when using fine-grained tokens
There are some cases where parses which might naturally be allowed must be rejected when punctuation characters are not separated by whitespace.
I know of the following:
- `SomeStruct { field:::std::i32::MAX }` must be rejected, not treated as equivalent to `SomeStruct { field: ::std::i32::MAX }`
- `fn f(s:::std::string::String)` must be rejected, not treated as equivalent to `fn f(s: ::std::string::String)`
I expect there are more of a similar sort.
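The accepted spellings compile as ordinary Rust; the joined `:::` spellings (left in comments) are rejected by the parser:

```rust
// The spaced forms below compile; the joined `:::` forms in comments do not.
#![allow(deprecated)] // `std::i32::MAX` is the legacy constant used in the examples above

struct SomeStruct {
    field: i32,
}

fn f(s: ::std::string::String) -> usize {
    s.len()
}

fn main() {
    // let bad = SomeStruct { field:::std::i32::MAX };  // rejected
    let ok = SomeStruct { field: ::std::i32::MAX };
    // fn g(s:::std::string::String) {}                 // rejected
    assert_eq!(ok.field, i32::MAX);
    assert_eq!(f(String::from("spec")), 4);
}
```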
This is perhaps an argument for using a compound token for `::` even if fine-grained tokens are used in other cases.

There's also this: `a <- b` must be rejected, not taken as equivalent to `a < -b`, but perhaps that's really analogous to `a && b` (an ambiguity is being resolved in favour of an obsolete unstable feature).

Input to procedural macros
Procedural macros see fine-grained `Punct` tokens, which have a `Spacing` property which indicates whether there was whitespace before the following token.
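Outside the compiler this can be observed with the proc-macro2 crate (assumed here as a dependency), which mirrors the `proc_macro` API:

```rust
// Assumes proc-macro2 = "1" as a dependency; it mirrors the compiler's
// proc_macro API and can run outside a macro context.
use proc_macro2::{TokenStream, TokenTree};

fn main() {
    let tokens: TokenStream = "a && b < c".parse().unwrap();
    for tt in tokens {
        if let TokenTree::Punct(p) = tt {
            // Prints: '&' Joint, '&' Alone, '<' Alone
            println!("{:?} {:?}", p.as_char(), p.spacing());
        }
    }
}
```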
Lifetime-or-label (`'foo`) is represented as a `Punct` token with joint spacing followed by an `Ident` token.

The documentation as of Rust 1.76 has this to say about `Punct`'s `Spacing` property:
Input to "Macros by example" macros
By-example macros using the `tt` fragment specifier see the following combinations of punctuation as compound tokens:

Lifetime-or-label (`'foo`) is represented as a single token.
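The lifetime claim can be checked with a small `tt` matcher (illustrative only):

```rust
// A `tt` matcher sees the lifetime-or-label `'foo` as a single token tree.
macro_rules! first_tt {
    ($first:tt $($rest:tt)*) => {
        stringify!($first)
    };
}

fn main() {
    assert_eq!(first_tt!('foo: loop { break 'foo; }), "'foo");
}
```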
Sources of information
The current rustc implementation
(As of rustc 1.76)
The lower-level lexer in `rustc_lexer` emits fine-grained tokens for punctuation (but compound tokens for lifetime-or-label). It emits explicit tokens representing whitespace.

The higher-level lexer in `rustc_parse::lexer` emits compound tokens for punctuation. It represents whitespace as 'Spacing' information describing the relationship to the following token.

The parser breaks those compound tokens up again where necessary.
(As I understand it, there is consensus that ideally rustc would switch to using fine-grained tokens internally.)
The breaking-up takes place in the following functions in `rustc_parse`, which might be used to audit for cases where a grammar based on compound tokens would need special cases:

- `expect_and()`
- `expect_or()`
- `eat_lt()`
- `expect_lt()`
- `expect_gt()`
- `eat_plus()`
In particular, rustc takes care to split `+=` when the `+` appears in generic bounds, but I think that isn't currently making any difference (I don't think a following `=` can ever parse).

Currently maintained documentation
The Reference
The Lexical structure chapter of the Reference uses compound tokens for punctuation and lifetime-or-label.
The grammar used in the Reference is written in a mixed style.
In most places it's written as if it's working with fine-grained tokens; notation like `&&` can be taken as representing two `&` tokens without intervening whitespace.

In three cases (BorrowExpression, ReferencePattern, and ClosureExpression) it lists both the fine-grained and the compound tokens explicitly, sometimes with an explanation in the text.
It treats `LIFETIME_OR_LABEL` as a single token.

The Ferrocene spec
The Ferrocene spec's lexical description and grammar are close to the Reference's.
The lexical part was forked from the Reference sometime in late 2021, and has no interesting additions.
Its grammar doesn't include the separately-listed compound tokens that the Reference has.
Older formalisations
Rustypop
Rustypop from 2016 used fine-grained tokens to deal with `<<`, `>>`, `&&`, and `||`.

Otherwise it used compound tokens (in particular, for `<=`, `<<=`, `<-`, `>=`, `>>=`, and `::`).

Its README describes its approach to dealing with multiple-punctuation symbols in the grammar.
Brian Leibig's grammar
Brian Leibig's grammar (last updated in 2017) uses compound tokens throughout.
It's the only grammar of that sort I've found that tries to deal with all cases where compound tokens need to be accepted as if they were multiple fine-grained ones.
wg-grammar
The wg-grammar grammar from 2019 appears to have used compound tokens for both punctuation and lifetimes.
I don't think that project got as far as dealing with the resulting complications. rust-lang/wg-grammar#47 has a little discussion.
rust-lang/wg-grammar#3 includes extensive discussion on how the lexer might be modelled.
One result of that discussion was https://github.com/CAD97/rust-lexical-spec, which would have output fine-grained punctuation tokens and explicit whitespace tokens.