Allow integer suffixes starting with `e`. #111628

nnethercote · 2023-05-16T03:12:28Z

Integers with arbitrary suffixes are allowed as inputs to proc macros. A number of real-world crates use this capability in interesting ways, as seen in #103872. For example:

- Suffixes representing units, such as `8bits`, `100px`, `20ns`, `30GB`
- CSS hex colours such as `#7CFC00` (LawnGreen)
- UUIDs, e.g. `785ada2c-f2d0-11fd-3839-b3104db0cb68`

The hex cases may be surprising.

- `#7CFC00` is tokenized as a `#` followed by a `7` integer with a `CFC00` suffix.
- `785ada2c` is tokenized as a `785` integer with an `ada2c` suffix.
- `f2d0` is tokenized as an identifier.
- `3839` is tokenized as an integer literal.

A proc macro will immediately stringify such tokens and reparse them itself, and so won't care that the token types vary. All suffixes must be consumed by the proc macro, of course; the only suffixes allowed after macro expansion are the numeric ones like u8, i32, and f64.
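
To make the "stringify and reparse" step concrete, here is a minimal sketch of how a macro author might split a stringified integer-like token into its digits and its suffix. The helper name `split_int_suffix` is hypothetical and not part of `proc_macro`; it also ignores `_` digit separators and non-decimal prefixes, which a real macro would need to handle.

```rust
/// Splits a stringified integer-like token into its leading decimal digits
/// and whatever suffix follows.
/// Hypothetical helper for illustration only; it does not handle `_`
/// separators or `0x`/`0o`/`0b` prefixes.
fn split_int_suffix(token: &str) -> (&str, &str) {
    // Index of the first char that is not an ASCII digit, or the end of the
    // string if every char is a digit.
    let end = token
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(token.len());
    token.split_at(end)
}
```

Under these assumptions, `split_int_suffix("785ada2c")` yields `("785", "ada2c")`, matching the tokenization described above, and a token with no suffix such as `3839` yields an empty suffix.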

Currently there is an annoying inconsistency in how integer literal suffixes are handled: no suffix starting with e is allowed, because it is interpreted as the start of a float literal's exponent. For example:

- Units: `1eV` and `1em`
- CSS colours: `#90EE90` (LightGreen)
- UUIDs: `785ada2c-f2d0-11ed-3839-b3104db0cb68`

In each case, a sequence of digits followed by an 'e' or 'E' followed by a letter results in an "expected at least one digit in exponent" error. This is an annoying inconsistency in general, and a problem in practice. It's likely that some users haven't realized this inconsistency because they've gotten lucky and never used a token with an 'e' that causes problems. Other users have noticed; it's causing problems when embedding DSLs into proc macros, as seen in #111615, where the CSS colours case is causing problems for two different UI frameworks (Slint and Makepad).

We can do better. This commit changes the lexer so that, when it hits a possible exponent, it looks ahead and only produces an exponent if a valid one is present. Otherwise, it produces a non-exponent form, which may be a single token (e.g. 1eV) or multiple tokens (e.g. 1e+a).

Consequences of this:

- All the proc macro problem cases mentioned above are fixed.
- The "expected at least one digit in exponent" error is no longer possible. A few tests that only worked in the presence of that error have been removed.
- The lexer requires unbounded lookahead due to the presence of '_' chars in exponents. E.g. to distinguish `1e+_______3` (a float literal with exponent) from `1e+_______a` (previously invalid, but now tokenised as `1e`, `+`, `_______a`).
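
The lookahead described here can be sketched roughly as follows. This is not the code from the PR; the function name `valid_exponent_len` and the exact grammar it checks (`e` or `E`, an optional sign, any number of underscores, then at least one decimal digit) are assumptions made for illustration.

```rust
/// Returns the byte length of a valid exponent at the start of `rest`
/// (the text immediately after the mantissa digits), or `None` if the text
/// does not form a valid exponent.
/// Hypothetical sketch; the assumed grammar is
/// (`e` | `E`) (`+` | `-`)? `_`* digit (digit | `_`)*.
fn valid_exponent_len(rest: &str) -> Option<usize> {
    let bytes = rest.as_bytes();
    let at = |i: usize| bytes.get(i).copied();
    let mut i = 0;

    // An exponent must start with `e` or `E`.
    if !matches!(at(i), Some(b'e' | b'E')) {
        return None;
    }
    i += 1;

    // Optional sign.
    if matches!(at(i), Some(b'+' | b'-')) {
        i += 1;
    }

    // This is the unbounded part: any number of underscores may precede
    // the first digit, so we cannot decide after a fixed number of chars.
    while at(i) == Some(b'_') {
        i += 1;
    }

    // Without at least one digit there is no exponent; the caller would
    // then fall back to treating the `e...` text as a suffix.
    if !matches!(at(i), Some(b'0'..=b'9')) {
        return None;
    }

    // Consume the remaining digits and underscores.
    while matches!(at(i), Some(b'0'..=b'9' | b'_')) {
        i += 1;
    }
    Some(i)
}
```

With this sketch, the text after the `1` in `1e+_______3` is accepted as an exponent, while the text after the `1` in `1eV` or `1e+_______a` is rejected, so the lexer would emit the non-exponent form instead.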

This is a backwards compatible language change: all existing valid programs will be treated in the same way, and some previously invalid programs will become valid. The tokens chapter of the language reference (https://doc.rust-lang.org/reference/tokens.html) will need changing to account for this. In particular, the "Reserved forms similar to number literals" section will need updating, and grammar rules involving the SUFFIX_NO_E nonterminal will need adjusting.

Fixes #111615.

r? @ghost

nnethercote · 2023-05-16T03:18:24Z

This will need buy-in from the lang team. I have started a Zulip thread for discussion.

nnethercote · 2023-05-16T07:54:57Z

cc @rust-lang/lang, for obvious reasons.

cc @matklad, in case there are any rust-analyzer considerations.

matklad · 2023-05-16T11:04:44Z

~~No IDE concerns here. Unbounded look ahead in the lexer looks suspicious, but I think it’s actually fine.~~

Actually, no, I think the lookahead would require a small adjustment in the code for incremental relexing:

https://github.com/rust-lang/rust-analyzer/blob/2f8cd66fb4c98026d2bdbdf17270e3472e1ca42a/crates/syntax/src/parsing/reparsing.rs#L35

This is not super-precisely formulated (and probably buggy as-is), but the idea here is that a lot of edits modify just a single token (e.g. the user appending a letter to an identifier), so we should take advantage of that and modify the syntax tree without incremental reparsing, by just replacing a single token.

We do have access to previous token there, so running this lookahead logic there should be possible, just more code.

It is perhaps worth it to move this incremental re-lexing logic over to rustc code base (with suitable unit tests), to encode the core constraint an IDE needs: “re-lexing can be done incrementally”.

ogoffart · 2023-05-16T12:26:47Z

proc_macro2 will probably need to be adjusted as well.

nnethercote · 2023-05-16T12:29:57Z

> proc_macro2 will probably need to be adjusted as well.

How so? I'm no proc_macro2 expert, but won't the newly accepted tokens just be more tokens, not really any different to existing tokens? E.g. 1eV doesn't seem particularly different to 1mm, once the lexer accepts it.

matklad · 2023-05-16T12:45:15Z

proc macro 2 has another copy of the lexer:

https://github.com/dtolnay/proc-macro2/blob/2c1b1021cff64aa6c29dd2c82bcb87b369013d00/src/parse.rs#L325


nnethercote · 2023-05-16T23:06:25Z

> proc macro 2 has another copy of the lexer:

Looks like its own implementation of the lexer, right?

cc @dtolnay, in that case, for the proc_macro2 perspective.

bors · 2023-05-26T07:08:35Z

☔ The latest upstream changes (presumably #111858) made this pull request unmergeable. Please resolve the merge conflicts.

petrochenkov · 2023-05-29T18:23:51Z

The main requirement from me here is for this change to be compatible with lexer producing finer-grained tokens for floats (possibly suffixed integers, idents, and punctuation instead of whole-floats) as I described on the Zulip thread and in #71322.

Step 1

So I suggest actually implementing that new behavior in the lexer first.

1e2 -> Int(1e2)
1. -> Int(1) Punct(.)
1.2 -> Int(1) Punct(.) Int(2)
1.2e3 -> Int(1) Punct(.) Int(2e3)
1e+2 -> Int(1e) Punct(+) Int(2)
1e+_2 -> Int(1e) Punct(+) Ident(_2)
1.2e+3 -> Int(1) Punct(.) Int(2e) Punct(+) Int(3)
1.2e+_3 -> Int(1) Punct(.) Int(2e) Punct(+) Ident(_3)

That would be of great help for any future work, and we could publicly expose this lexing mode from rustc_lexer even if rustc_parser is not using it right now.

For compatibility we'd also provide a mode that would immediately glue everything we've just lexed back into a Float token.
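
As a rough illustration of the fine-grained mode in Step 1, the sketch below reproduces the mappings in the table above with three token kinds. It is not `rustc_lexer` code; the names `Tok` and `lex_fine` are hypothetical, and the rules (an `Int` is a digit followed by any run of ASCII alphanumerics or `_`, an `Ident` starts with a letter or `_`, anything else is a one-char `Punct`) are assumptions inferred from the table. The gluing mode is not shown.

```rust
/// Token kinds for the fine-grained mode. Hypothetical names.
#[derive(Debug, PartialEq)]
enum Tok {
    Int(String),   // digits plus any suffix, e.g. `2e` or `1e2`
    Ident(String), // e.g. `_3`
    Punct(char),   // e.g. `.` or `+`
}

/// Lexes `src` into fine-grained tokens without ever forming a float.
fn lex_fine(src: &str) -> Vec<Tok> {
    let chars: Vec<char> = src.chars().collect();
    let mut toks = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        if c.is_whitespace() {
            i += 1;
        } else if c.is_ascii_alphanumeric() || c == '_' {
            // Consume a maximal run of alphanumerics and underscores.
            let start = i;
            while i < chars.len() && (chars[i].is_ascii_alphanumeric() || chars[i] == '_') {
                i += 1;
            }
            let text: String = chars[start..i].iter().collect();
            // The first char decides whether this is an Int or an Ident.
            if c.is_ascii_digit() {
                toks.push(Tok::Int(text));
            } else {
                toks.push(Tok::Ident(text));
            }
        } else {
            // Everything else is a single punctuation char.
            toks.push(Tok::Punct(c));
            i += 1;
        }
    }
    toks
}
```

Under these rules `1.2e+_3` lexes as `Int(1) Punct(.) Int(2e) Punct(+) Ident(_3)`, as in the table, and a compatibility layer could then decide whether to glue such a run back into a single float token.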

Step 2

Then we'd just choose in some cases to not glue everything back, thus fixing #111615.

petrochenkov · 2023-05-30T23:14:11Z

> 1e+_2 -> Int(1e) Punct(+) Ident(_2)
> 1.2e+_3 -> Int(1) Punct(.) Int(2e) Punct(+) Ident(_3)

I strongly suspect that we can unsupport the underscores after +/-, thus removing Idents from the equation and leaving only punctuation and (possibly suffixed) integers.
It would be interesting to run this change through crater.

JohnCSimon · 2023-10-01T03:14:47Z

@nnethercote
ping from triage - can you post your status on this PR? There hasn't been an update in a few months. Thanks!

nnethercote · 2023-10-01T06:35:17Z

waiting-on-author is still appropriate. Vadim's suggestion above is for a completely different approach, one that requires much larger changes, and I haven't gotten around to trying it.

Dylan-DPC · 2024-07-28T06:50:28Z

@nnethercote any updates on this? thanks

nnethercote · 2024-08-01T00:20:51Z

I'd still like to fix this, but it's fair to say progress is stalled enough that closing this is reasonable.

lexer: Disallow some leading underscores in float exponents A second, scaled down, attempt at rust-lang#114567. cc rust-lang#111628 (comment)

…ustc_session, r=<try> move some invalid exponent detection into rustc_session This PR moves part of the exponent checks from `rustc_lexer`/`rustc_parser` into `rustc_session`. This change does not affect which programs are accepted by the compiler, or the diagnostics that are reported, with one main exception. That exception is that floats or ints with suffixes beginning with `e` are rejected *after* the token stream is passed to proc macros, rather than being rejected by the parser as was the case. This gives proc macro authors more consistent access to numeric literals: currently a proc macro could interpret `1m` or `30s` but not `7eggs` or `3em`. After this change all are handled the same. The lexer will still reject input if it contains `e` followed by a number, `+`/`-`, or `_` if they are not followed by a valid integer literal (number + `_`), but this doesn't affect macro authors who just want to access alpha suffixes. This PR is a continuation of rust-lang#79912. It is also solving exactly the same problem as [rust-lang#111628](rust-lang#111628). Exponents that contain arbitrarily long underscore suffixes are handled without read-ahead by tracking the exponent start in case of invalid exponent, so the suffix start is correct. This is very much an edge-case (the user would have to write something like `1e_______________23`) but nevertheless it is handled correctly. Also adds tests for various edge cases and improves diagnostics marginally. r: `@petrochenkov,` since they reviewed rust-lang#79912.

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels May 16, 2023

nnethercote marked this pull request as draft May 16, 2023 03:12

nnethercote force-pushed the allow-e-suffixes branch from 6635303 to 6f5d2f6 Compare May 16, 2023 04:25

nnethercote mentioned this pull request May 16, 2023

Allow numeric tokens containing 'e' that aren't exponents be passed to proc macros #111615

Open

nnethercote force-pushed the allow-e-suffixes branch from 6f5d2f6 to 9fa6652 Compare May 16, 2023 07:06

petrochenkov self-assigned this May 16, 2023

nnethercote force-pushed the allow-e-suffixes branch from 9fa6652 to e51cfe6 Compare May 16, 2023 23:01

petrochenkov marked this pull request as ready for review May 29, 2023 18:07

petrochenkov added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 29, 2023

petrochenkov mentioned this pull request Aug 11, 2023

Disallow leading underscores in float exponents. #114567

Closed

guofoo mentioned this pull request Jan 5, 2024

Error defining colors containing hex digit 'e' in live_design! macro makepad/makepad#343

Open

nnethercote closed this Aug 1, 2024

richard-uk1 mentioned this pull request Oct 13, 2024

lexer: Treat more floats with empty exponent as valid tokens #131656

Open

petrochenkov mentioned this pull request Feb 21, 2025

lexer: Disallow some leading underscores in float exponents #137394

Closed

nnethercote deleted the allow-e-suffixes branch May 22, 2025 00:21
