introduce unescape module #60261

matklad · 2019-04-25T13:35:01Z

A WIP PR to gauge early feedback

Currently, we deal with escape sequences twice: once when we lex a string, and a second time when we unescape literals. Note that we also produce different sets of diagnostics in these two cases.

This PR aims to remove this duplication, by introducing a new unescape module as a single source of truth for character escaping rules.

I think this would be a useful cleanup by itself, but I also need this for #59706.

In the current state, the PR has unescape module which fully (modulo bugs) deals with string and char literals. I am quite happy about the state of this module

What this PR doesn't have yet are:

handling of byte and byte string literals (should be simple to add)
good diagnostics
actual removal of code from lexer (giant scan_char_or_byte should go away completely)
performance check
general cleanup of the new code

Diagnostics will be the most labor-consuming bit here, but they are mostly a question of just correctly adjusting spans to sub-tokens. The current setup for diagnostics is that unescape produces a plain old enum with various problems, and they are rendered into Handler separately. This bit is not actually required (it is possible to just pass the Handler in), but I like the separation between diagnostics and logic this approach imposes, and such separation should again be useful for #59706

cc @eddyb , @petrochenkov

rust-highfive · 2019-04-25T13:35:07Z

r? @eddyb

(rust_highfive has picked a reviewer for you, use r? to override)

petrochenkov · 2019-04-28T18:26:40Z

Meta: for reviewing convenience it's better to update UI test outputs and satisfy tidy to make CI green, even if the changes in test results are temporarily wrong / intended to disappear.
This way it's clear how exactly they are wrong and what still needs to be fixed.

matklad · 2019-04-28T19:23:53Z

Apparently, running ./x.py test src/test/ui --bless is not enough to make tests green :(

src/libsyntax/parse/lexer/mod.rs

src/libsyntax/parse/mod.rs

src/libsyntax/parse/unescape.rs

src/libsyntax/parse/mod.rs

petrochenkov · 2019-04-28T22:44:11Z

Question: what happens if a literal is lexed, but never "parsed properly"?
For example, if it's passed to a macro that accepts tts and throws them away.
The errors for incorrect escapes, etc, should be reported in that case as well.

(P.S. I haven't reviewed everything yet, will continue tomorrow.)

petrochenkov · 2019-04-28T22:45:00Z

r? @petrochenkov

matklad · 2019-04-29T08:37:31Z

Question: what happens if a literal is lexed, but never "parsed properly"?

Good question! Given that diag: Option<(Span, &Handler)> argument to char_lit function, I was under the impression that we always parse literals properly. Turns out that even today we don't do that, so existing code is buggy.

The following compiles, while it shouldn't (the 6F literal is out of range for char):

macro_rules! erase {
    ($($tt:tt)*) => {}
}

fn main() {
    erase! {
        '\u{FFFFFF}'
    }
}

playground

If we pursue the approach in this PR, then we should run unescape_* family of functions twice: once in the lexer, where we just report errors and disregard escaped characters, and once in the parser, where we do the opposite and ignore errors, but collect unescaped literals. That means that we will be able to remove that diag: Option argument (indeed, "optionally" reporting diagnostics seems like a sure way to have bugs)

matklad · 2019-04-29T09:06:30Z

Hm, or is the above example an expected behavior? We don't check ranges of integer literals, for example:

macro_rules! erase {
    ($($tt:tt)*) => {}
}

fn main() {
    erase!(999u8);
}

for chars, we do check that there are at most six hex digits in the lexer, but we only do precise check for range and surrogates in the parser, which seems somewhat arbitrary.

eddyb · 2019-04-30T13:08:00Z

@matklad I think for out of range integer literals, we warn and truncate later in the compilation, but there's not much we can do for char (there is no meaningful truncation to speak of).

That said, maybe literals should be checked by AST -> HIR lowering, i.e. only unescaped once, to store both versions, or just the unescaped one, in HIR?

One notable exception to "AST literals can be opaque, only HIR needs unescaping" is string literals used in attributes (such as #[path = "foo\\bar\\baz.rs"] mod baz;).

matklad · 2019-04-30T14:27:30Z

That said, maybe literals should be checked by AST -> HIR lowering, i.e. only unescaped once, to store both versions, or just the unescaped one, in HIR?

I think we should do validation before macro expansion, for two reasons at least:

macro_rules that drop tokens can be used to skip AST -> HIR validation, and that seems bad
proc_macros should be able to assume that the tokens they get are well-formed

As for actual unescapeing, agree that, abstractly, it makes sense to do it as late as possible. Ideally (not necessary practically), running cargo check should not construct unescapped strings at all. However, changing the place where unescaping occurs is probably out of scope for this PR.

So my current plan is:

run the unescape_* functions twice, once for validation, in lexer, once for actual unescaping, in the parser
treat \u{fff_fff} as a lexical error (that will require lang team sign off and a crate run) (after thinking about this more, I come to the conclusion that chars and integers are really different: to validate ints, we need type inference results. chars we can validate lexically, and something like "hello \u{lone surrogate} world" just doesn't make sense to me)

bors · 2019-05-03T12:55:08Z

⌛ Trying commit 1835cbe with merge bfdcf6d...

@eddyb

introduce unescape module A WIP PR to gauge early feedback Currently, we deal with escape sequences twice: once when we [lex](https://github.com/rust-lang/rust/blob/112f7e9ac564e2cfcfc13d599c8376a219fde1bc/src/libsyntax/parse/lexer/mod.rs#L928-L1065) a string, and a second time when we [unescape](https://github.com/rust-lang/rust/blob/112f7e9ac564e2cfcfc13d599c8376a219fde1bc/src/libsyntax/parse/mod.rs#L313-L366) literals. Note that we also produce different sets of diagnostics in these two cases. This PR aims to remove this duplication, by introducing a new `unescape` module as a single source of truth for character escaping rules. I think this would be a useful cleanup by itself, but I also need this for #59706. In the current state, the PR has `unescape` module which fully (modulo bugs) deals with string and char literals. I am quite happy about the state of this module What this PR doesn't have yet are: * [x] handling of byte and byte string literals (should be simple to add) * [x] good diagnostics * [x] actual removal of code from lexer (giant `scan_char_or_byte` should go away completely) * [ ] performance check * [x] general cleanup of the new code Diagnostics will be the most labor-consuming bit here, but they are mostly a question of just correctly adjusting spans to sub-tokens. The current setup for diagnostics is that `unescape` produces a plain old `enum` with various problems, and they are rendered into `Handler` separately. This bit is not actually required (it is possible to just pass the `Handler` in), but I like the separation between diagnostics and logic this approach imposes, and such separation should again be useful for #59706 cc @eddyb , @petrochenkov

bors · 2019-05-03T15:18:00Z

☀️ Try build successful - checks-travis
Build commit: bfdcf6d

matklad · 2019-05-03T17:17:35Z

This probably should be tagged with Breaking Change and Waiting on Crater presumably?

petrochenkov · 2019-05-03T18:35:56Z

@craterbot run mode=check-only

craterbot · 2019-05-03T18:35:58Z

👌 Experiment pr-60261 created and queued.
🤖 Automatically detected try build bfdcf6d
🔍 You can check out the queue and this experiment's details.

ℹ️ Crater is a tool to run experiments across parts of the Rust ecosystem. Learn more

craterbot · 2019-05-03T18:36:02Z

🚧 Experiment pr-60261 is now running on agent aws-2.

ℹ️ Crater is a tool to run experiments across parts of the Rust ecosystem. Learn more

petrochenkov · 2019-05-03T18:37:08Z

@rust-timer build bfdcf6d

rust-timer · 2019-05-03T18:37:10Z

Success: Queued bfdcf6d with parent 1891bfa, comparison URL.

rust-timer · 2019-05-03T23:47:51Z

Finished benchmarking try commit bfdcf6d

matklad · 2019-05-03T23:54:32Z

Looks like there are no significant perf differences, let's wait what crater says

craterbot · 2019-05-05T22:20:08Z

🎉 Experiment pr-60261 is completed!
📊 0 regressed and 0 fixed (60951 total)
📰 Open the full report.

⚠️ If you notice any spurious failure please add them to the blacklist!
ℹ️ Crater is a tool to run experiments across parts of the Rust ecosystem. Learn more

petrochenkov · 2019-05-05T22:32:53Z

@bors r+

bors · 2019-05-05T22:32:55Z

📌 Commit 1835cbe has been approved by petrochenkov

bors · 2019-05-06T00:16:24Z

⌛ Testing commit 1835cbe with merge 46d0ca0...

@eddyb

introduce unescape module A WIP PR to gauge early feedback Currently, we deal with escape sequences twice: once when we [lex](https://github.com/rust-lang/rust/blob/112f7e9ac564e2cfcfc13d599c8376a219fde1bc/src/libsyntax/parse/lexer/mod.rs#L928-L1065) a string, and a second time when we [unescape](https://github.com/rust-lang/rust/blob/112f7e9ac564e2cfcfc13d599c8376a219fde1bc/src/libsyntax/parse/mod.rs#L313-L366) literals. Note that we also produce different sets of diagnostics in these two cases. This PR aims to remove this duplication, by introducing a new `unescape` module as a single source of truth for character escaping rules. I think this would be a useful cleanup by itself, but I also need this for #59706. In the current state, the PR has `unescape` module which fully (modulo bugs) deals with string and char literals. I am quite happy about the state of this module What this PR doesn't have yet are: * [x] handling of byte and byte string literals (should be simple to add) * [x] good diagnostics * [x] actual removal of code from lexer (giant `scan_char_or_byte` should go away completely) * [x] performance check * [x] general cleanup of the new code Diagnostics will be the most labor-consuming bit here, but they are mostly a question of just correctly adjusting spans to sub-tokens. The current setup for diagnostics is that `unescape` produces a plain old `enum` with various problems, and they are rendered into `Handler` separately. This bit is not actually required (it is possible to just pass the `Handler` in), but I like the separation between diagnostics and logic this approach imposes, and such separation should again be useful for #59706 cc @eddyb , @petrochenkov

bors · 2019-05-06T03:11:07Z

☀️ Test successful - checks-travis, status-appveyor
Approved by: petrochenkov
Pushing 46d0ca0 to master...

matklad · 2019-05-07T16:44:41Z

FWIW, this is now used by rust-analyzer: rust-lang/rust-analyzer#1253

rust-highfive assigned eddyb Apr 25, 2019

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Apr 25, 2019

petrochenkov self-assigned this Apr 25, 2019

This comment has been minimized.

Sign in to view

matklad changed the title ~~introduce unescape module~~ WIP: introduce unescape module Apr 25, 2019

This comment has been minimized.

Sign in to view

matklad force-pushed the one-escape branch 2 times, most recently from a39c555 to f91fbdc Compare April 28, 2019 20:20

This comment has been minimized.

Sign in to view

petrochenkov reviewed Apr 28, 2019

View reviewed changes

src/libsyntax/parse/lexer/mod.rs Show resolved Hide resolved

petrochenkov reviewed Apr 28, 2019

View reviewed changes

src/libsyntax/parse/lexer/mod.rs Show resolved Hide resolved

petrochenkov reviewed Apr 28, 2019

View reviewed changes

src/libsyntax/parse/mod.rs Outdated Show resolved Hide resolved

petrochenkov reviewed Apr 28, 2019

View reviewed changes

src/libsyntax/parse/unescape.rs Show resolved Hide resolved

petrochenkov reviewed Apr 28, 2019

View reviewed changes

src/libsyntax/parse/mod.rs Show resolved Hide resolved

rust-highfive unassigned eddyb Apr 28, 2019

matklad force-pushed the one-escape branch from f91fbdc to 072c0fa Compare April 29, 2019 09:52

This comment has been minimized.

Sign in to view

matklad force-pushed the one-escape branch from 072c0fa to a3fd9bd Compare April 30, 2019 15:31

This comment has been minimized.

Sign in to view

craterbot added S-waiting-on-crater Status: Waiting on a crater run to be completed. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels May 3, 2019

craterbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-crater Status: Waiting on a crater run to be completed. labels May 5, 2019

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 5, 2019

bors added the merged-by-bors This PR was explicitly merged by bors. label May 6, 2019

bors mentioned this pull request May 6, 2019

Preserve byte string literal raw-ness in AST #60506

Closed

bors merged commit 1835cbe into rust-lang:master May 6, 2019

matklad deleted the one-escape branch May 6, 2019 07:24

petrochenkov mentioned this pull request May 15, 2019

lexer: Disallow bare CR in raw byte strings #60793

Merged

bvanjoi mentioned this pull request Mar 8, 2023

fix(lexer): use is_asscii_whitespace to find first not space pos #108403

Closed

introduce unescape module #60261

introduce unescape module #60261

Uh oh!

Conversation

matklad commented Apr 25, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rust-highfive commented Apr 25, 2019

Uh oh!

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

petrochenkov commented Apr 28, 2019

Uh oh!

This comment has been minimized.

matklad commented Apr 28, 2019

Uh oh!

This comment has been minimized.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

petrochenkov commented Apr 28, 2019

Uh oh!

petrochenkov commented Apr 28, 2019

Uh oh!

matklad commented Apr 29, 2019

Uh oh!

matklad commented Apr 29, 2019

Uh oh!

This comment has been minimized.

eddyb commented Apr 30, 2019

Uh oh!

matklad commented Apr 30, 2019

Uh oh!

This comment has been minimized.

bors commented May 3, 2019

Uh oh!

bors commented May 3, 2019

Uh oh!

matklad commented May 3, 2019

Uh oh!

petrochenkov commented May 3, 2019

Uh oh!

craterbot commented May 3, 2019

Uh oh!

craterbot commented May 3, 2019

Uh oh!

petrochenkov commented May 3, 2019

Uh oh!

rust-timer commented May 3, 2019

Uh oh!

rust-timer commented May 3, 2019

Uh oh!

matklad commented May 3, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

craterbot commented May 5, 2019

Uh oh!

petrochenkov commented May 5, 2019

Uh oh!

bors commented May 5, 2019

Uh oh!

bors commented May 6, 2019

Uh oh!

bors commented May 6, 2019

Uh oh!

matklad commented May 7, 2019

Uh oh!

Uh oh!

matklad commented Apr 25, 2019 •

edited

Loading

matklad commented May 3, 2019 •

edited

Loading