Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Underscore suffixes allowed in literals #41723

Closed
qnighy opened this issue May 3, 2017 · 10 comments
Closed

Underscore suffixes allowed in literals #41723

qnighy opened this issue May 3, 2017 · 10 comments
Labels
A-grammar Area: The grammar of Rust A-parser Area: The parsing of Rust source code to an AST. T-lang Relevant to the language team, which will review and decide on the PR/issue.

Comments

@qnighy
Copy link
Contributor

qnighy commented May 3, 2017

I found these pathlogical examples pass the lexer and the parser:

fn main() {
    println!("{}", "Foo"_);
    println!("{}", 10._);
}

The reason is that

  1. scan_optional_raw_name eats _, but reports no suffix.
  2. scan_number distinguishes between decimal points and field/method syntax by is_xid_start, which in fact excludes _.

According to the reference, it seems these examples shouldn't be allowed.

Tested on stable (rustc 1.17.0 (56124baa9 2017-04-24)) and nightly (rustc 1.19.0-nightly (6a5fc9eec 2017-05-02)) on the playground.

@qnighy
Copy link
Contributor Author

qnighy commented May 3, 2017

According to the reference, decimal points are optionally followed by another "decimal literal". Since _ isn't a decimal literal, it shouldn't be allowed.

Here the parser is recognizing _ in 10._ as a suffix rather than a decimal literal followed by the point. If not, 10._f64 would be lexed as 10._+f64, but actually the parser complains _f64 is not a valid suffix.

In addition, the reference says f32 and f64 are the only valid suffixes. So _ shouldn't be allowed as a suffix either.

@nagisa nagisa added A-grammar Area: The grammar of Rust A-parser Area: The parsing of Rust source code to an AST. I-nominated T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels May 3, 2017
@nikomatsakis
Copy link
Contributor

It seems to me that _ should be considered part of the number. So I would expect 10._f64 to work, indeed, just as 10.2_f64 or 2_f64 works.

@arielb1 points out that "a"_ is also accepted, which seems like a bug.

Basically, I do not expect a suffix to begin with _, I think, but I do expect _ to be allowed in numeric literals anywhere (including the end).

@arielb1 arielb1 added the T-lang Relevant to the language team, which will review and decide on the PR/issue. label May 11, 2017
@nikomatsakis nikomatsakis removed the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label May 11, 2017
@nikomatsakis
Copy link
Contributor

Shifting nomination to T-lang for now.

@joshtriplett
Copy link
Member

The language team discussed this in today's meeting. We also considered cases like 42.method_on_numbers(), or 42._method_on_numbers(), both of which ought to be parsed as methods. The conclusion we came to was that it's ambiguous (for our mental lexers if nothing else) to allow a trailing underscore on a floating-point literal immediately following a . with no subsequent digits. Parsing _[a-z] in a way that puts the _ and the [a-z] in separate tokens feels incredibly confusing. It's not unreasonable to have to write 42.0f64 or 42.0_f64 instead if you want a suffix.

(By contrast, 42_f64 or 42_u64 seems fine, and unambiguous, though whether you should write that might be a good question for the style team.)

So, the recommendation from the language team would be to always parse 42._ident as a field or method, and never include the _ as part of the floating-point literal.

One question that didn't come up in the language team meeting: does anything rely on the current (apparently inconsistent) lexing behavior here? Would fixing this ambiguity break anything?

@qnighy
Copy link
Contributor Author

qnighy commented May 12, 2017

I'll try to write a patch for this issue.

@nikomatsakis
Copy link
Contributor

nikomatsakis commented May 12, 2017

@joshtriplett

One question that didn't come up in the language team meeting: does anything rely on the current (apparently inconsistent) lexing behavior here? Would fixing this ambiguity break anything?

I would think not, at least for numbers, given that it results in an error. Not sure about ""_, which... apparently parses but does not error? (When do we ever support suffixes on strings, besides #?)

frewsxcv added a commit to frewsxcv/rust that referenced this issue May 13, 2017
…loat, r=petrochenkov

Disallow ._ in float literal.

This patch makes lexer stop parsing number literals before `._`, as well as before `.a`. Underscore itself is still allowed like in `4_000_000.000_000_`.

Fixes a half part of rust-lang#41723. The other is `""_`.
Mark-Simulacrum added a commit to Mark-Simulacrum/rust that referenced this issue May 13, 2017
…loat, r=petrochenkov

Disallow ._ in float literal.

This patch makes lexer stop parsing number literals before `._`, as well as before `.a`. Underscore itself is still allowed like in `4_000_000.000_000_`.

Fixes a half part of rust-lang#41723. The other is `""_`.
Mark-Simulacrum added a commit to Mark-Simulacrum/rust that referenced this issue May 14, 2017
…loat, r=petrochenkov

Disallow ._ in float literal.

This patch makes lexer stop parsing number literals before `._`, as well as before `.a`. Underscore itself is still allowed like in `4_000_000.000_000_`.

Fixes a half part of rust-lang#41723. The other is `""_`.
@qnighy
Copy link
Contributor Author

qnighy commented May 15, 2017

@joshtriplett

So, the recommendation from the language team would be to always parse 42._ident as a field or method, and never include the _ as part of the floating-point literal.

Do you mean by "never include" also forbidding literals like 4_000.000_000 ?

@joshtriplett
Copy link
Member

joshtriplett commented May 16, 2017

@qnighy No, that's fine. I meant specifically that 42._ident or similar should never lex as a floating-point number 42._ with suffix ident; an underscore immediately after the dot should not lex as part of the literal.

@qnighy
Copy link
Contributor Author

qnighy commented May 16, 2017

@joshtriplett I see. Then #41946 and #41990 will close this issue.

bors added a commit that referenced this issue Jun 6, 2017
…ike-literals, r=nikomatsakis

Disallow underscore suffix for string-like literals.

This patch turns string/bytestring/char/byte literals followed by an underscore, like `"Foo"_`, to an error.

`scan_optional_raw_name` will parse `_` as a valid raw name, but it will be rejected by the parser. I also considered just stopping parsing when the suffix is `_`, but in that case `"Foo"_` will be lexed as two valid tokens.

Fixes the latter half of #41723.
@qnighy
Copy link
Contributor Author

qnighy commented Jun 6, 2017

Closing since #41946 and #41990 are merged.

@qnighy qnighy closed this as completed Jun 6, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-grammar Area: The grammar of Rust A-parser Area: The parsing of Rust source code to an AST. T-lang Relevant to the language team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

5 participants