Allow integer suffixes starting with e #111628
This will need buy-in from the lang team. I have started a Zulip thread for discussion.
cc @rust-lang/lang, for obvious reasons. cc @matklad, in case there are any rust-analyzer considerations.
Actually, no, I think the lookahead would require a small adjustment in the code for incremental relexing: This is not super-precisely formulated (and probably buggy as-is), but the idea here is that a lot of edits modify just a single token (e.g. the user appending a letter to an identifier), so we should take advantage of that and modify the syntax tree without incremental reparsing, by just replacing a single token. We do have access to the previous token there, so running this lookahead logic there should be possible, just more code. It is perhaps worth it to move this incremental re-lexing logic over to the rustc code base (with suitable unit tests), to encode the core constraint an IDE needs: “re-lexing can be done incrementally”.
proc_macro2 will probably need to be adjusted as well.
How so? I'm no proc_macro2 expert, but won't the newly accepted tokens just be more tokens, not really any different to existing tokens? E.g.
proc_macro2 has another copy of the lexer:
Integers with arbitrary suffixes are allowed as inputs to proc macros. A number of real-world crates use this capability in interesting ways, as seen in rust-lang#103872. For example:

- Suffixes representing units, such as `8bits`, `100px`, `20ns`, `30GB`
- CSS hex colours such as `#7CFC00` (LawnGreen)
- UUIDs, e.g. `785ada2c-f2d0-11fd-3839-b3104db0cb68`

The hex cases may be surprising.

- `#7CFC00` is tokenized as a `#` followed by a `7` integer with a `CFC00` suffix.
- `785ada2c` is tokenized as a `785` integer with an `ada2c` suffix.
- `f2d0` is tokenized as an identifier.
- `3839` is tokenized as an integer literal.

A proc macro will immediately stringify such tokens and reparse them itself, and so won't care that the token types vary. All suffixes must be consumed by the proc macro, of course; the only suffixes allowed after macro expansion are the numeric ones like `u8`, `i32`, and `f64`.

Currently there is an annoying inconsistency in how integer literal suffixes are handled, which is that no suffix starting with `e` is allowed, because that is interpreted as a float literal with an exponent. For example:

- Units: `1eV` and `1em`
- CSS colours: `#90EE90` (LightGreen)
- UUIDs: `785ada2c-f2d0-11ed-3839-b3104db0cb68`

In each case, a sequence of digits followed by an 'e' or 'E' followed by a letter results in an "expected at least one digit in exponent" error. This is an annoying inconsistency in general, and a problem in practice. It's likely that some users haven't realized this inconsistency because they've gotten lucky and never used a token with an 'e' that causes problems. Other users *have* noticed; it's causing problems when embedding DSLs into proc macros, as seen in rust-lang#111615, where the CSS colours case is causing problems for two different UI frameworks (Slint and Makepad).

We can do better. This commit changes the lexer so that, when it hits a possible exponent, it looks ahead and only produces an exponent if a valid one is present. Otherwise, it produces a non-exponent form, which may be a single token (e.g. `1eV`) or multiple tokens (e.g. `1e+a`).

Consequences of this:

- All the proc macro problem cases mentioned above are fixed.
- The "expected at least one digit in exponent" error is no longer possible. A few tests that only worked in the presence of that error have been removed.
- The lexer requires unbounded lookahead due to the presence of '_' chars in exponents. E.g. to distinguish `1e+_______3` (a float literal with exponent) from `1e+_______a` (previously invalid, but now tokenised as `1e`, `+`, `_______a`).

This is a backwards compatible language change: all existing valid programs will be treated in the same way, and some previously invalid programs will become valid.

The tokens chapter of the language reference (https://doc.rust-lang.org/reference/tokens.html) will need changing to account for this. In particular, the "Reserved forms similar to number literals" section will need updating, and grammar rules involving the SUFFIX_NO_E nonterminal will need adjusting.

Fixes rust-lang#111615.
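The lookahead rule described in the commit message can be illustrated with a small standalone sketch. This is not the code from the PR or from `rustc_lexer`. The function names `has_valid_exponent` and `number_token_len` are hypothetical, and the sketch assumes a simplified literal grammar: decimal digits only, no `.` fraction, no `0x`/`0b` prefixes, and ASCII-only suffix characters.

```rust
/// Hypothetical helper: `rest` starts at the candidate `e`/`E`.
/// Returns true only if a complete exponent follows: an optional sign,
/// any number of underscores, and then at least one decimal digit.
fn has_valid_exponent(rest: &str) -> bool {
    let mut chars = rest.chars().peekable();
    match chars.next() {
        Some('e') | Some('E') => {}
        _ => return false,
    }
    if matches!(chars.peek(), Some('+') | Some('-')) {
        chars.next();
    }
    // Underscores may precede the first digit, so this scan is unbounded.
    while chars.peek() == Some(&'_') {
        chars.next();
    }
    matches!(chars.peek(), Some(c) if c.is_ascii_digit())
}

/// Hypothetical helper: byte length of the number token at the start of
/// `input`, or 0 if `input` does not start with a decimal digit.
fn number_token_len(input: &str) -> usize {
    let bytes = input.as_bytes();
    if !bytes.first().map_or(false, |b| b.is_ascii_digit()) {
        return 0;
    }
    let is_digit_or_underscore = |b: u8| b.is_ascii_digit() || b == b'_';

    // Integer part: digits and underscores.
    let mut i = 0;
    while i < bytes.len() && is_digit_or_underscore(bytes[i]) {
        i += 1;
    }

    // Consume an exponent only if the lookahead confirms it is valid.
    if has_valid_exponent(&input[i..]) {
        i += 1; // the `e` or `E`
        if i < bytes.len() && (bytes[i] == b'+' || bytes[i] == b'-') {
            i += 1;
        }
        while i < bytes.len() && is_digit_or_underscore(bytes[i]) {
            i += 1;
        }
    }

    // Suffix: identifier-like characters (ASCII-only in this sketch).
    while i < bytes.len() && (bytes[i].is_ascii_alphanumeric() || bytes[i] == b'_') {
        i += 1;
    }
    i
}
```

Under these assumptions, `1eV` is consumed as one token, while `1e+a` stops after `1e`, leaving `+` and `a` to be lexed as separate tokens, which matches the single-token and multiple-token outcomes described above.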
Looks like its own implementation of the lexer, right? cc @dtolnay, in that case, for the proc_macro2 perspective.
☔ The latest upstream changes (presumably #111858) made this pull request unmergeable. Please resolve the merge conflicts.
The main requirement from me here is for this change to be compatible with the lexer producing finer-grained tokens for floats (possibly suffixed integers, idents, and punctuation instead of whole floats), as I described on the Zulip thread and in #71322.

Step 1

So I suggest to actually implement that new behavior in the lexer first.
That would be of great help for any future work, and we could publicly expose this lexing mode from

For compatibility we'd also provide a mode that would immediately glue everything we've just lexed back into a

Step 2

Then we'd just choose in some cases to not glue everything back, thus fixing #111615.
I strongly suspect that we can unsupport the underscores after +/- thus removing
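One possible reading of this suggestion, sketched under assumptions: if underscores were not permitted directly after the sign, the signed case could be decided by inspecting at most two characters after the `e`. The function name `signed_exponent_follows` is hypothetical and this is not code from the PR. Unsigned exponents such as `1e5` are out of scope here, since every character involved already belongs to the same token.

```rust
/// Hypothetical helper: `after_e` is the input immediately after `e`/`E`.
/// Assumes a grammar in which a digit must directly follow the sign
/// (i.e. `1e+_3` would no longer be a float literal).
fn signed_exponent_follows(after_e: &str) -> bool {
    let mut chars = after_e.chars();
    // Exactly two characters are inspected: the sign, then one digit.
    matches!(chars.next(), Some('+') | Some('-'))
        && matches!(chars.next(), Some(c) if c.is_ascii_digit())
}
```

With this rule, `1e+3` would still be a float, while `1e+a` and `1e+_3` would both split after `1e`.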
@nnethercote |
waiting-on-author is still appropriate. Vadim's suggestion above is for a completely different approach, one that requires much larger changes, and I haven't gotten around to trying it.
@nnethercote any updates on this? thanks
I'd still like to fix this, but it's fair to say progress is stalled enough that closing this is reasonable.
r? @ghost