Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow integer suffixes starting with e. #111628

Closed
wants to merge 1 commit into from

Conversation

nnethercote
Copy link
Contributor

Integers with arbitrary suffixes are allowed as inputs to proc macros. A number of real-world crates use this capability in interesting ways, as seen in #103872. For example:

  • Suffixes representing units, such as 8bits, 100px, 20ns, 30GB
  • CSS hex colours such as #7CFC00 (LawnGreen)
  • UUIDs, e.g. 785ada2c-f2d0-11fd-3839-b3104db0cb68

The hex cases may be surprising.

  • #7CFC00 is tokenized as a # followed by a 7 integer with a CFC00 suffix.
  • 785ada2c is tokenized as a 785 integer with an ada2c suffix.
  • f2d0 is tokenized as an identifier.
  • 3839 is tokenized as an integer literal.

A proc macro will immediately stringify such tokens and reparse them itself, and so won't care that the token types vary. All suffixes must be consumed by the proc macro, of course; the only suffixes allowed after macro expansion are the numeric ones like u8, i32, and f64.

Currently there is an annoying inconsistency in how integer literal suffixes are handled, which is that no suffix starting with e is allowed, because that it interpreted as a float literal with an exponent. For example:

  • Units: 1eV and 1em
  • CSS colours: #90EE90 (LightGreen)
  • UUIDs: 785ada2c-f2d0-11ed-3839-b3104db0cb68

In each case, a sequence of digits followed by an 'e' or 'E' followed by a letter results in an "expected at least one digit in exponent" error. This is an annoying inconsistency in general, and a problem in practice. It's likely that some users haven't realized this inconsistency because they've gotten lucky and never used a token with an 'e' that causes problems. Other users have noticed; it's causing problems when embedding DSLs into proc macros, as seen in #111615, where the CSS colours case is causing problems for two different UI frameworks (Slint and Makepad).

We can do better. This commit changes the lexer so that, when it hits a possible exponent, it looks ahead and only produces an exponent if a valid one is present. Otherwise, it produces a non-exponent form, which may be a single token (e.g. 1eV) or multiple tokens (e.g. 1e+a).

Consequences of this:

  • All the proc macro problem cases mentioned above are fixed.
  • The "expected at least one digit in exponent" error is no longer possible. A few tests that only worked in the presence of that error have been removed.
  • The lexer requires unbounded lookahead due to the presence of '_' chars in exponents. E.g. to distinguish 1e+_______3 (a float literal with exponent) from 1e+_______a (previously invalid, but now the tokenised as 1e, +, _______a).

This is a backwards compatible language change: all existing valid programs will be treated in the same way, and some previously invalid programs will become valid. The tokens chapter of the language reference (https://doc.rust-lang.org/reference/tokens.html) will need changing to account for this. In particular, the "Reserved forms similar to number literals" section will need updating, and grammar rules involving the SUFFIX_NO_E nonterminal will need adjusting.

Fixes #111615.

r? @ghost

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels May 16, 2023
@nnethercote nnethercote marked this pull request as draft May 16, 2023 03:12
@nnethercote
Copy link
Contributor Author

This will need buy-in from the lang team. I have started a Zulip thread for discussion.

@rust-log-analyzer

This comment has been minimized.

@nnethercote
Copy link
Contributor Author

cc @rust-lang/lang, for obvious reasons.

cc @matklad, in case there are any rust-analyzer considerations.

@petrochenkov petrochenkov self-assigned this May 16, 2023
@matklad
Copy link
Member

matklad commented May 16, 2023

No IDE concerns here. Unbounded look ahead in the lexer looks suspicious, but I think it’s actually fine.

Actually, no, I think the lookahead would require a small adjustment in the code for incremental relexing:

https://github.com/rust-lang/rust-analyzer/blob/2f8cd66fb4c98026d2bdbdf17270e3472e1ca42a/crates/syntax/src/parsing/reparsing.rs#L35

This is not super-precisely formulated (and probably buggy as-is), but their idea here is that a lot of edits modify just a single token (user appending a letter to identifier), so we should take advantage of that and modify the syntax tree without incremental reparsing, by just replacing a single token.

We do have access to previous token there, so running this lookahead logic there should be possible, just more code.

It is perhaps worth it to move this incremental re-lexing logic over to rustc code base (with suitable unit tests), to encode the core constraint an IDE needs: “re-lexing can be done incrementally”.

@ogoffart
Copy link
Contributor

proc_macro2 will probably need to be adjusted as well.

@nnethercote
Copy link
Contributor Author

proc_macro2 will probably need to be adjusted as well.

How so? I'm no proc_macro2 expert, but won't the newly accepted tokens just be more tokens, not really any different to existing tokens? E.g. 1eV doesn't seem particularly different to 1mm, once the lexer accepts it.

@matklad
Copy link
Member

matklad commented May 16, 2023

Integers with arbitrary suffixes are allowed as inputs to proc macros. A
number of real-world crates use this capability in interesting ways, as
seen in rust-lang#103872. For example:
- Suffixes representing units, such as `8bits`, `100px`, `20ns`, `30GB`
- CSS hex colours such as `#7CFC00` (LawnGreen)
- UUIDs, e.g. `785ada2c-f2d0-11fd-3839-b3104db0cb68`

The hex cases may be surprising.
- `#7CFC00` is tokenized as a `#` followed by a `7` integer with a
  `CFC00` suffix.
- `785ada2c` is tokenized as a `785` integer with an `ada2c` suffix.
- `f2d0` is tokenized as an identifier.
- `3839` is tokenized as an integer literal.

A proc macro will immediately stringify such tokens and reparse them
itself, and so won't care that the token types vary. All suffixes must
be consumed by the proc macro, of course; the only suffixes allowed
after macro expansion are the numeric ones like `u8`, `i32`, and `f64`.

Currently there is an annoying inconsistency in how integer literal
suffixes are handled, which is that no suffix starting with `e` is
allowed, because that it interpreted as a float literal with an
exponent. For example:
- Units: `1eV` and `1em`
- CSS colours: `#90EE90` (LightGreen)
- UUIDs: `785ada2c-f2d0-11ed-3839-b3104db0cb68`

In each case, a sequence of digits followed by an 'e' or 'E' followed by
a letter results in an "expected at least one digit in exponent" error.
This is an annoying inconsistency in general, and a problem in practice.
It's likely that some users haven't realized this inconsistency because
they've gotten lucky and never used a token with an 'e' that causes
problems. Other users *have* noticed; it's causing problems when
embedding DSLs into proc macros, as seen in rust-lang#111615, where the CSS
colours case is causing problems for two different UI frameworks (Slint
and Makepad).

We can do better. This commit changes the lexer so that, when it hits a
possible exponent, it looks ahead and only produces an exponent if a
valid one is present. Otherwise, it produces a non-exponent form, which
may be a single token (e.g. `1eV`) or multiple tokens (e.g. `1e+a`).

Consequences of this:
- All the proc macro problem cases mentioned above are fixed.
- The "expected at least one digit in exponent" error is no longer
  possible. A few tests that only worked in the presence of that error
  have been removed.
- The lexer requires unbounded lookahead due to the presence of '_'
  chars in exponents. E.g. to distinguish `1e+_______3` (a float literal
  with exponent) from `1e+_______a` (previously invalid, but now the
  tokenised as `1e`, `+`, `_______a`).

This is a backwards compatible language change: all existing valid
programs will be treated in the same way, and some previously invalid
programs will become valid. The tokens chapter of the language reference
(https://doc.rust-lang.org/reference/tokens.html) will need changing to
account for this. In particular, the "Reserved forms similar to number
literals" section will need updating, and grammar rules involving the
SUFFIX_NO_E nonterminal will need adjusting.

Fixes rust-lang#111615.
@nnethercote
Copy link
Contributor Author

proc macro 2 has another copy of the lexer:

Looks like its own implementation of the lexer, right?

cc @dtolnay, in that case, for the proc_macro2 perspective.

@bors
Copy link
Contributor

bors commented May 26, 2023

☔ The latest upstream changes (presumably #111858) made this pull request unmergeable. Please resolve the merge conflicts.

@petrochenkov petrochenkov marked this pull request as ready for review May 29, 2023 18:07
@petrochenkov
Copy link
Contributor

petrochenkov commented May 29, 2023

The main requirement from me here is for this change to be compatible with lexer producing finer-grained tokens for floats (possibly suffixed integers, idents, and punctuation instead of whole-floats) as I described on the Zulip thread and in #71322.

Step 1

So I suggest to actually implement that new behavior in the lexer first.

1e2 -> Int(1e2)
1. -> Int(1) Punct(.)
1.2 -> Int(1) Punct(.) Int(2)
1.2e3 -> Int(1) Punct(.) Int(2e3)
1e+2 -> Int(1e) Punct(+) Int(2)
1e+_2 -> Int(1e) Punct(+) Ident(_2)
1.2e+3 -> Int(1) Punct(.) Int(2e) Punct(+) Int(3)
1.2e+_3 -> Int(1) Punct(.) Int(2e) Punct(+) Ident(_3)

That would be of great help for any future work, and we could publicly expose this lexing mode from rustc_lexer even if rustc_parser is not using it right now.

For compatibility we'd also provide a mode that would immediately glue everything we've just lexed back into a Float token.

Step 2

Then we'd just choose in some cases to not glue everything back, thus fixing #111615.

@petrochenkov petrochenkov added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 29, 2023
@petrochenkov
Copy link
Contributor

1e+_2 -> Int(1e) Punct(+) Ident(_2)
1.2e+_3 -> Int(1) Punct(.) Int(2e) Punct(+) Ident(_3)

I strongly suspect that can unsupport the underscores after +/- thus removing Idents from the equation, and leaving only punctuation and (possibly suffixed) integers.
It would be interesting to run this change through crater.

@JohnCSimon
Copy link
Member

@nnethercote
ping from triage - can you post your status on this PR? There hasn't been an update in a few months. Thanks!

@nnethercote
Copy link
Contributor Author

waiting-on-author is still appropriate. Vadim's suggestion above is for a completely different approach, one that requires much larger changes, and I haven't gotten around to trying it.

@Dylan-DPC
Copy link
Member

@nnethercote any updates on this? thanks

@nnethercote
Copy link
Contributor Author

I'd still like to fix this, but it's fair to say progress is stalled enough that closing this is reasonable.

@nnethercote nnethercote closed this Aug 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow numeric tokens containing 'e' that aren't exponents be passed to proc macros
9 participants