Lexical syntax simplification #90

Merged
merged 1 commit into from
May 29, 2014
Conversation

emberian
Member

@emberian emberian commented May 24, 2014

@emberian
Member Author

Another benefit of this is that the lexer can output only spans and their associated token types, rather than having to do any further work.
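
As a rough illustration of what "only spans and their associated token type" could look like, here is a minimal sketch. The names `TokenKind`, `Span`, and `lex` are hypothetical and the token set is deliberately tiny; it is not the grammar from this PR.

```rust
/// Hypothetical token kinds; a real lexer would have many more.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TokenKind {
    Ident,
    Integer,
    Whitespace,
    Other,
}

/// A half-open byte range `[start, end)` into the source text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

/// Split `src` into (kind, span) pairs without copying or interpreting text.
pub fn lex(src: &str) -> Vec<(TokenKind, Span)> {
    let bytes = src.as_bytes();
    let mut out = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let start = i;
        let b = bytes[i];
        let kind = if b.is_ascii_digit() {
            while i < bytes.len() && (bytes[i].is_ascii_digit() || bytes[i] == b'_') {
                i += 1;
            }
            TokenKind::Integer
        } else if b.is_ascii_alphabetic() || b == b'_' {
            while i < bytes.len() && (bytes[i].is_ascii_alphanumeric() || bytes[i] == b'_') {
                i += 1;
            }
            TokenKind::Ident
        } else if b.is_ascii_whitespace() {
            while i < bytes.len() && bytes[i].is_ascii_whitespace() {
                i += 1;
            }
            TokenKind::Whitespace
        } else {
            // Advance by one whole UTF-8 character so spans stay on boundaries.
            i += src[i..].chars().next().map_or(1, |c| c.len_utf8());
            TokenKind::Other
        };
        out.push((kind, Span { start, end: i }));
    }
    out
}
```

A consumer that needs the text of a token can slice the source with the span, for example `&src[span.start..span.end]`, so values such as integer contents are only computed when something asks for them.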


LIT_STR_RAW
: 'r' LIT_STR_RAW_INNER
| 'r' '"' .*? '"'
Contributor

This can't just be 'r' LIT_STR_RAW_INNER2? (and the inner tokens should probably be swapped).

Member Author

Indeed it can.

@emberian
Member Author

This needs to take into account rust-lang/rust#14400 still.

;

LIT_FLOAT
: [0-9][0-9_]* ('.' [0-9][0-9_]*)? ([eE] [-+]? [0-9][0-9]*)? FLOAT_SUFFIX?
Member

The exponent [0-9]* part—should it be [0-9_]*?

Also I think this will be tightening what is accepted; at present, for example, 1. is acceptable (but not 1.f32 for clear reasons), but this change will break that. Is that deliberate? Desirable? &c.

Member Author

Good catch. That was not deliberate, I didn't mean to change the float literal syntax at all.
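
To make the two review points concrete, the quoted `LIT_FLOAT` rule can be transcribed by hand into a whole-string matcher. This is only a sketch: `matches_quoted_rule` is a hypothetical name, and `FLOAT_SUFFIX` is not shown in the quoted hunk, so it is assumed here to be `f32` or `f64`.

```rust
/// Consume `[0-9]` followed by more digits (and `_` if allowed), starting at `i`.
/// Returns the index after the run, or None if no digit is present at `i`.
fn digits(b: &[u8], mut i: usize, allow_underscore: bool) -> Option<usize> {
    if i >= b.len() || !b[i].is_ascii_digit() {
        return None;
    }
    i += 1;
    while i < b.len() && (b[i].is_ascii_digit() || (allow_underscore && b[i] == b'_')) {
        i += 1;
    }
    Some(i)
}

/// Hand transcription of the rule as quoted in the diff:
/// [0-9][0-9_]* ('.' [0-9][0-9_]*)? ([eE] [-+]? [0-9][0-9]*)? FLOAT_SUFFIX?
pub fn matches_quoted_rule(s: &str) -> bool {
    let b = s.as_bytes();
    // [0-9][0-9_]*
    let mut i = match digits(b, 0, true) {
        Some(i) => i,
        None => return false,
    };
    // ('.' [0-9][0-9_]*)?  The dot is only consumed if a digit follows it.
    if i < b.len() && b[i] == b'.' {
        if let Some(j) = digits(b, i + 1, true) {
            i = j;
        }
    }
    // ([eE] [-+]? [0-9][0-9]*)?  No underscores in the exponent, as quoted.
    if i < b.len() && (b[i] == b'e' || b[i] == b'E') {
        let mut j = i + 1;
        if j < b.len() && (b[j] == b'-' || b[j] == b'+') {
            j += 1;
        }
        if let Some(k) = digits(b, j, false) {
            i = k;
        }
    }
    // FLOAT_SUFFIX?  Assumed to be f32 or f64 for this sketch.
    let rest = &s[i..];
    rest.is_empty() || rest == "f32" || rest == "f64"
}
```

Under this transcription `1.0` and `1_000.5e-3f64` match, while `1.` does not (the fraction group requires a digit after the dot) and `1e1_0` does not (the exponent digits exclude `_`), which are the two behaviours questioned above.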

@pcwalton
Contributor

+1, sounds like an improvement for rustfmt

@emberian
Member Author

@kballard is the CRLF stuff correct? I extended the places that accept newline to also accept '\r\n', but not '\r', and I've removed '\r' from the whitespace skipping.

@lilyball
Contributor

@cmr My patch actually allows bare '\r' in whitespace skipping and in non-doc comments. It only rejects it inside of strings and doc comments. That said, I don't know if it's worth trying to be that permissive. It may be better just to go ahead and treat a bare '\r' without a subsequent '\n' as a hard error anywhere in the file.
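
The stricter option mentioned here, rejecting a bare '\r' anywhere in the file, could be checked up front with something like the sketch below. The function name `first_bare_cr` is hypothetical and this is not code from either patch.

```rust
/// Return the byte offset of the first '\r' that is not immediately
/// followed by '\n', or None if every '\r' is part of a "\r\n" pair.
pub fn first_bare_cr(src: &str) -> Option<usize> {
    let b = src.as_bytes();
    (0..b.len()).find(|&i| b[i] == b'\r' && b.get(i + 1) != Some(&b'\n'))
}
```

A lexer taking the hard-error route could run this once over the input and report the offset, leaving the token rules free to accept only '\n' and '\r\n' as line endings.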

;

LIT_CHAR
: '\'' ( '\\' CHAR_ESCAPE | [^'\n\t\r] ) '\''
Contributor

Shouldn't [^'\n\t\r] be ~['\n\t\r]?

Contributor

Also the character set needs to include \\ or else invalid character escapes will end up matching anyway.
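
A small sketch of why the backslash matters. The matcher below transcribes the `LIT_CHAR` rule for whole strings, with a flag controlling whether `\` is excluded from the plain-character alternative. `CHAR_ESCAPE` is not shown in the quoted hunk, so the set in `is_escape` is an assumption made only for illustration, as are the function names.

```rust
/// Hypothetical stand-in for CHAR_ESCAPE (single-character escapes only).
fn is_escape(c: char) -> bool {
    matches!(c, 'n' | 'r' | 't' | '\\' | '\'' | '"' | '0')
}

/// Whole-string match for: '\'' ( '\\' CHAR_ESCAPE | ~[...] ) '\''
/// where the excluded set is ' \n \t \r, plus \ when `exclude_backslash` is set.
pub fn is_char_lit(s: &str, exclude_backslash: bool) -> bool {
    let cs: Vec<char> = s.chars().collect();
    match cs.as_slice() {
        // Escape alternative: quote, backslash, escape character, quote.
        ['\'', '\\', e, '\''] => is_escape(*e),
        // Plain alternative: quote, one permitted character, quote.
        ['\'', c, '\''] => {
            !matches!(*c, '\'' | '\n' | '\t' | '\r') && !(exclude_backslash && *c == '\\')
        }
        _ => false,
    }
}
```

With `exclude_backslash` set to `false`, the three-character input `'\'` is accepted through the plain alternative even though it is not a valid escape; with it set to `true`, that input is rejected while `'a'` and `'\n'` still match.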

@zwarich

zwarich commented May 27, 2014

This grammar is wildly ambiguous. Identifiers, numbers and operators can be tokenized in multiple ways.

@huonw
Member

huonw commented May 27, 2014

@zwarich do you have an example of an ambiguous sequence of tokens?

@emberian
Member Author

"12.12" could be INTEGER(12) DOT INTEGER(12)

@emberian
Member Author

etc. will be relatively easy to fix.

@lilyball
Contributor

Doesn't antlr4 pick the longest matching token?

@huonw
Member

huonw commented May 27, 2014

(Yeah, isn't the maximal munch principle the standard way to resolve "ambiguities" like this?)

@zwarich

zwarich commented May 27, 2014

@kballard @huonw Yes, that is the standard way of resolving ambiguities with lexical syntax, and apparently the 'lexer grammar' feature of ANTLR makes it choose this strategy, as opposed to what it uses for normal grammars.
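
For readers unfamiliar with the strategy, here is a minimal sketch of maximal munch applied to the "12.12" example: at each position every rule is tried and the longest match wins, with the earlier rule winning ties. The three rules and all names are simplified stand-ins, not the grammar in this PR.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Integer,
    Float,
    Dot,
}

/// Length of the leading run of ASCII digits.
fn digit_run(b: &[u8]) -> usize {
    b.iter().take_while(|c| c.is_ascii_digit()).count()
}

/// Each rule returns how many bytes it matches at the start of `b` (0 = no match).
fn match_integer(b: &[u8]) -> usize {
    digit_run(b)
}

fn match_float(b: &[u8]) -> usize {
    let n = digit_run(b);
    if n == 0 || b.get(n) != Some(&b'.') {
        return 0;
    }
    let frac = digit_run(&b[n + 1..]);
    if frac == 0 { 0 } else { n + 1 + frac }
}

fn match_dot(b: &[u8]) -> usize {
    if b.first() == Some(&b'.') { 1 } else { 0 }
}

/// Tokenize by always taking the longest match; returns None if no rule applies.
pub fn tokenize(src: &str) -> Option<Vec<(Kind, &str)>> {
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < src.len() {
        let rest = &src.as_bytes()[pos..];
        let candidates = [
            (Kind::Integer, match_integer(rest)),
            (Kind::Float, match_float(rest)),
            (Kind::Dot, match_dot(rest)),
        ];
        // Keep the longest non-empty match; on equal lengths the earlier rule stays.
        let mut best: Option<(Kind, usize)> = None;
        for &(kind, len) in candidates.iter() {
            if len > 0 && best.map_or(true, |(_, l)| len > l) {
                best = Some((kind, len));
            }
        }
        let (kind, len) = best?;
        out.push((kind, &src[pos..pos + len]));
        pos += len;
    }
    Some(out)
}
```

With these rules "12.12" comes out as a single `Float` token, because the float rule matches five bytes and the integer rule only two. "12." falls back to `Integer` followed by `Dot`, since this sketch's float rule requires a digit after the dot.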

@anasazi

anasazi commented May 27, 2014

I like the idea of keeping comments after lexing so pretty-printers / refactoring tools can use the same lexer as the compiler, but how about we just make comment dropping a micropass between the lexer and parser instead of adding to the parser workload?
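
One way to picture the suggested micropass is as an iterator adapter sitting between the two stages. This is a sketch with a hypothetical `Tok` type and function name; the lexer still emits comments, so tools can consume the unfiltered stream, while the parser is handed the filtered one.

```rust
/// Hypothetical token kinds, reduced to what the sketch needs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tok {
    Ident,
    Punct,
    Comment,
}

/// Micropass: forward every token except comments, preserving order.
pub fn drop_comments<I: Iterator<Item = Tok>>(tokens: I) -> impl Iterator<Item = Tok> {
    tokens.filter(|t| !matches!(t, Tok::Comment))
}
```

Because the pass is lazy, it adds no intermediate buffer: the parser pulls tokens through the filter one at a time.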

@emberian
Member Author

Sure, whatever.

@emberian
Member Author

Fixed most things, and verified that it works as I expect.

@emberian
Member Author

cc @nikomatsakis @pcwalton @brson I've updated this. It behaves as I expect for the code I've run it against, and accepts/rejects everything it should in the compiler/libs/testsuite/servo.

@brson
Contributor

brson commented May 29, 2014

withoutboats pushed a commit to withoutboats/rfcs that referenced this pull request Jan 15, 2017
@Centril Centril added the A-syntax Syntax related proposals & ideas label Nov 23, 2018
wycats pushed a commit to wycats/rust-rfcs that referenced this pull request Mar 5, 2019
RFC for caching results of `treeFor` hook