RFC to lex binary and octal literals more eagerly. #879

Merged (2 commits, Mar 30, 2015)
text/0000-small-base-lexing.md (106 additions, 0 deletions)
- Feature Name: stable, it only restricts the language
- Start Date: 2015-02-17
- RFC PR: (leave this empty)
- Rust Issue: (leave this empty)

# Summary

Lex binary and octal literals as if they were decimal.

# Motivation

Lexing all digits (even ones not valid in the given base) allows for
improved error messages and future proofing (this is more conservative
than the current approach), and reduces confusion, with little downside.

Currently, the lexer stops lexing binary and octal literals (`0b10` and
`0o12345670`) as soon as it sees an invalid digit (2-9 or 8-9
respectively), and immediately starts lexing a new token,
e.g. `0b0123` is two tokens, `0b01` and `23`. Writing such a thing in
normal code gives a strange error message:

```rust
<anon>:2:9: 2:11 error: expected one of `.`, `;`, `}`, or an operator, found `23`
<anon>:2 0b0123
^~
```

However, it is valid to write such a thing in a macro (e.g. using the
`tt` non-terminal), and thus lexing the adjacent digits as two tokens
can lead to unexpected behaviour.

```rust
macro_rules! expr { ($e: expr) => { $e } }

macro_rules! add {
    ($($token: tt)*) => {
        0 $(+ expr!($token))*
    }
}
fn main() {
println!("{}", add!(0b0123));
}
```

prints `24` (`add` expands to `0 + 0b01 + 23`).

It would be nicer for both cases to print an error like:

```rust
error: found invalid digit `2` in binary literal
0b0123
^
```

(The non-macro case could be handled by detecting this pattern in the
lexer and special-casing the message, but this does not handle the
macro case.)

Code that wants two tokens can opt in to it by `0b01 23`, for
example. This is easy to write, and expresses the intent more clearly
anyway.
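Reusing the `expr!` and `add!` macros from the example above, the opt-in spelling might look like this sketch (the `24` result follows from the expansion `0 + 0b01 + 23`, exactly as in the earlier example):

```rust
macro_rules! expr { ($e: expr) => { $e } }

macro_rules! add {
    ($($token: tt)*) => {
        0 $(+ expr!($token))*
    }
}

fn main() {
    // Two tokens, written explicitly with a space:
    // the expansion is `0 + 0b01 + 23`.
    println!("{}", add!(0b01 23)); // prints 24
}
```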

# Detailed design

The grammar that the lexer uses becomes

```
(0b[0-9]+ | 0o[0-9]+ | [0-9]+ | 0x[0-9a-fA-F]+) suffix
```

instead of just `[01]+` and `[0-7]+` for the digits of the first two,
respectively.
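A minimal sketch of this "lex greedily, then validate" approach (hypothetical code, not the actual rustc lexer; `lex_int_literal` and its return shape are invented for illustration):

```rust
// Lex a prefixed integer literal by consuming the full digit run
// first, then validating each digit against the radix, as the RFC
// proposes. Returns the value and the number of bytes consumed.
fn lex_int_literal(src: &str) -> Result<(u32, usize), String> {
    let (radix, digits_start) = match src.get(..2) {
        Some("0b") => (2, 2),
        Some("0o") => (8, 2),
        Some("0x") => (16, 2),
        _ => (10, 0),
    };
    // Greedily take the whole run of decimal (or hex) digits, so
    // `0b0123` is one token rather than `0b01` followed by `23`.
    let body: String = src[digits_start..]
        .chars()
        .take_while(|c| {
            if radix == 16 { c.is_ascii_hexdigit() } else { c.is_ascii_digit() }
        })
        .collect();
    // Only now report digits invalid in the literal's actual radix,
    // which permits the error message the RFC asks for.
    for c in body.chars() {
        if c.to_digit(radix).is_none() {
            return Err(format!(
                "found invalid digit `{}` in base-{} literal", c, radix
            ));
        }
    }
    let value = u32::from_str_radix(&body, radix).map_err(|e| e.to_string())?;
    Ok((value, digits_start + body.len()))
}
```

With this structure, `lex_int_literal("0b0123")` consumes all four digits and reports the invalid `2`, instead of silently splitting the input into two tokens.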

However, it is always an error (in the lexer) to have invalid digits
in a numeric literal beginning with `0b` or `0o`. In particular, even
a macro invocation like

```rust
macro_rules! ignore { ($($_t: tt)*) => { {} } }

ignore!(0b0123)
```

is an error even though it doesn't use the tokens.


# Drawbacks

This adds a slightly peculiar special case, that is somewhat unique to
Rust. On the other hand, most languages do not expose the lexical
grammar so directly, and so have more freedom in this respect. That
is, in many languages it is indistinguishable if `0b1234` is one or
two tokens: it is *always* an error either way.


# Alternatives

Don't do it, obviously.

Consider `0b123` to just be `0b1` with a suffix of `23`, making it an
error or not depending on whether a suffix of `23` is valid. Handling
this uniformly would require `"foo"123` and `'a'123` also being lexed
as single tokens. (Which may be a good idea anyway.)

# Unresolved questions

None.