-
-
Notifications
You must be signed in to change notification settings - Fork 115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(compiler)!: Re-implement Grain parser #1033
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another beautiful pull request from our very own @ospencer. I haven't read through the messages
file yet, but the rest of this looks awesome! I left a couple of comments with questions, but they are very minor.
type id = loc(Identifier.t); | ||
type str = loc(string); | ||
type loc = Location.t; | ||
|
||
let default_loc_src = ref(() => Location.dummy_loc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🙌
@@ -18,18 +18,15 @@ type AReallyReallyReallyReallyReallyReallyReallyReallyReallyLongLineBreakingType | |||
type AReallyReallyReallyReallyReallyReallyReallyReallyReallyLongLineBreakingType< | |||
a, | |||
b | |||
> = String // Test comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this disappear?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, the locations of the type parameters were just completely wrong before 😶
They're correct now, and the comments were only kept coincidentally before. @marcusroberts is already working on a fix for me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is working now!
c122746
to
9b6935f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love this. So excited for the new parser. 🔥 I reviewed everything except the messages file because safari can't load it, but I wanted to submit my initial comments before I review the messages.
compiler/src/parsing/driver.re
Outdated
let env = checkpoint => | ||
switch (checkpoint) { | ||
| I.HandlingError(env) => env | ||
| _ => assert(false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use an exception so we can track this if the parser somehow gets into a weird checkpoint state?
}; | ||
|
||
let show = (text, positions) => | ||
E.extract(text, positions) |> E.sanitize |> E.compress |> E.shorten(20); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why shorten to 20?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No real reason! It can be whatever we like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do we want it to be? Why shorten at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what we want to it be. We can play around with it though... generally we'd want to shorten because I believe this is token-based, and it could get awkward if your token was very long (like a long string, etc.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright. I'm fine leaving it as 20 for now. Do you want to open a tracking issue mentioning that it might be adjusted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it's worth it? I think's probably fine as-is unless we decide that we want to mess around with it.
compiler/src/parsing/driver.re
Outdated
let get = (text, checkpoint, i) => | ||
switch (I.get(i, env(checkpoint))) { | ||
| Some(I.Element(_, _, pos1, pos2)) => show(text, (pos1, pos2)) | ||
| None => assert(false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this should have a real error too?
compiler/src/parsing/driver.re
Outdated
| None => assert(false) | ||
}; | ||
|
||
let succeed = _v => assert(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The driver for the incremental table-based (slow) parser takes a handler for when it successfully parses an input and a handler for when it fails to parse an input. In our case, it can never succeed (because we only invoked this parser since the fast one failed).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. Should we at least add a failwith "Impossible by:"
?
raise( | ||
Ast_helper.SyntaxError( | ||
location, | ||
Printf.sprintf("%s%s%!", indication, message), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does menhir use ocaml boxes? Do we want to use them so these wrap correctly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure, but we could add some here.
compiler/src/parsing/parser.mly
Outdated
| colon typ { Some $2 } | ||
|
||
annotated_expr: | ||
| non_binop_expr opt_annotation { Option.fold ~none:$1 ~some:(fun ann -> Exp.constraint_ ~loc:(to_loc $loc) $1 ann) $2 } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to not have a helper for this? I kinda tend towards having extra helpers written in Reason that we can use very simply in the ocaml-syntax of the parser.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just rewrote the rule to be simpler.
| UNDERSCORE { Pat.any ~loc:(to_loc $loc) () } | ||
| const { let (pat, loc) = $1 in Pat.constant ~loc:(to_loc loc) pat } | ||
// Allow rational numbers in patterns | ||
| dash_op? NUMBER_INT slash_op dash_op? NUMBER_INT { Pat.constant ~loc:(to_loc $sloc) @@ Const.number (PConstNumberRational ((if Option.is_some $1 then "-" ^ $2 else $2), (if Option.is_some $4 then "-" ^ $5 else $5))) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason you don't have a rule for rationals instead of using this in multiple places?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only used here!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no idea what I was seeing then
compiler/src/parsing/parser.mly
Outdated
| IMPORT lseparated_nonempty_list(comma, import_shape) comma? FROM file_path { Imp.mk ~loc:(to_loc $loc) $2 $5 } | ||
|
||
data_declaration_stmt: | ||
// TODO: Attach attributes to the node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens now if they are not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a carryover from the old parser. It's really just missing a well-formedness error if you were to actually put one. I didn't try to fix it here since I'd also need to write in that logic / make disableGC do something for it.
12c011b
to
70fd41d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Early approval from me. I've read the code which increased my understanding a lot, and I like the performance and errors we get from this new parser. Fantastic job!
f9e9697
to
728e4b8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a review of the error messages. Great work! See review comments, but one general comment: I think we should refrain from having "paren" in error messages in place of "parenthesis". I don't see a reason to abbreviate.
compiler/src/parsing/parser.messages
Outdated
## In state 112, spurious reduction of production option(typs) -> typs | ||
## | ||
|
||
Expected type parameters surrounded by carets, a comma followed by more types, a dot followed by an identifier, or a closing paren to finish the variant definition. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Expected type parameters surrounded by carets, a comma followed by more types, a dot followed by an identifier, or a closing paren to finish the variant definition. | |
Expected type parameters surrounded by angle brackets, a comma followed by more types, a dot followed by an identifier, or a closing parenthesis to finish the variant definition. |
"caret" refers to a v-shaped grapheme which points up or down (i.e. "^" or "v"). Sideways ones are "angle brackets" or "chevrons". I recommend renaming the tokens to LANGLE
and RANGLE
as well.
The latter bit of my review comments got broken? GitHub is showing the diffs based on the wrong line (at least locally for me)...the line numbers are correct, but the diffs are not |
"parser:interpret": "esy b menhir src/parsing/parser.mly --unused-tokens --interpret", | ||
"parser:interpret-error": "esy b menhir src/parsing/parser.mly --unused-tokens --interpret-error", | ||
"parser:list-errors": "esy b menhir src/parsing/parser.mly --unused-tokens --list-errors > src/parsing/parser.messages.generated", | ||
"parser:check-errors": "yarn parser:list-errors && esy b menhir src/parsing/parser.mly --unused-tokens --compare-errors src/parsing/parser.messages.generated --compare-errors src/parsing/parser.messages", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question about this. Should we move these commands into the esy.json file or keep them here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say you know best! If you think we should move it then I will.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong feeling about it. I think all of the other ones just proxy to the esy.json scripts
Co-authored-by: Blaine Bublitz <blaine.bublitz@gmail.com>
compiler/test/suites/types.re
Outdated
let foo = (x: (String, List<Number>)) => | ||
x: Foo<Number> | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above
compiler/test/suites/types.re
Outdated
let foo = (x: Foo) => | ||
x: Bar | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely heroic effort on this @ospencer! The review process has also been great (amazing description, error messages in gdoc, etc). Thank you 🙇
@ospencer One additional thing I'd like is for you to grep for any TODOs with the issue numbers you are closing and either fix or create a new issue to update that code path. For example, the formatter has a workaround for trailing type annotations (though it looks like it references #866 which then references the #783 you are closing, which was a mistake, but that one is on my mind because we need to remove the workaround). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fantastic work!
Closes #323
Closes #573
Closes #679
Closes #709
Closes #783
Closes #1062
Parse Grain programs using Menhir 🗿
Menhir is an LR(1) parser generator for OCaml. It generates parsers that are fast and gives us tools to help maintain the parser. The main motivation for this change is efficiency—this implementation parses the standard library's hefty
regex.gr
25X faster than the current implementation. In addition to the speed improvements, syntax error messages have better location information and there are now nearly 200 custom error messages for all possible parser error states.Syntax changes
There are some minor syntax changes that come along with this change. Grain ends expressions with a newline character— most of the changes are around where line breaks are allowed. These changes shouldn't affect how a majority of users use Grain. In fact, most code that is formatted with Grain's code formatter from Grain v0.4 will continue to parse with Grain v0.5.
Binary operators
Grain is taking Python's approach to binary operators—newline characters are no longer allowed before a binary operator.
✅ Accepted
❌ Not accepted
Records
{ x }
will no longer parse as a single-argument record, and instead will parse as a block with the identifierx
. Similar to a tuple, a record with a single punned argument can be made by adding a trailing comma:✅ Accepted
❌ Not accepted
Conversely, a block that starts with
x:
will parse as a record rather than an expression with a type annotation. For example,() => { foo: Bar }
will parse as a function that returns a record with keyfoo
and valueBar
rather than a block with valuefoo
annotated withBar
.Syntax error messages
Menhir's default parser generation generates a parser that is very fast, but can't provide useful error messages. Menhir can produce a slower, more compact parser that is capable of producing good error messages, so for that reason, we actually generate two parsers—a really fast parser that can generate a Grain AST, and a slower parser that doesn't generate anything, but can give good error messages. We try to parse a program using the fast parser, and if it fails, we reparse it using the slow parser and try to provide a good error message.
Menhir provides the
.messages
file format to maintain parser error messages. You can read all about it here. Our file is calledparser.messages
. It's large, but the file is mostly comments!I added some new yarn commands for working with the messages:
yarn compiler parser:list-errors
This command will generate a file,
parser.messages.generated
, which contains all of the possible error states that exist in the parser, with default error messages. You can use this to view all of the states and cross-check withparser.messages
.yarn compiler parser:check-errors
This command will generate
parser.messages.generated
(as above) and verify that there is an error message defined inparser.messages
for each error state inparser.messages.generated
. (This is also run in CI.)yarn compiler parser:interpret
This command starts an interpreter for the parser. You can type tokens and check if they're accepted by the parser:
I think if we define aliases for all of the tokens, you can type regular Grain code and it'd work. Maybe something for the future if people find the interpreter useful!
yarn compiler parser:interpret-error
This is similar to the regular interpret, except that it expects a syntax error to occur on the last token. It'll then print information about the error state, in the same format as the
.messages
file.Info for parsing nerds
Menhir is an LR(1) parser generator, and Dypgen (our current parser generator) is a GLR parser. GLR parsers have a worst-case runtime complexity of O(n^3), while LR(k) parsers are O(n). We can't completely blame the performance on the algorithm, though. GLR parsers are also O(n) for unambiguous grammars—we could take all of the work that was done to make the Grain grammar work with Menhir and patch it back into the Dypgen parser and it would run significantly faster. However, Menhir forces us to keep our grammar unambiguous and gives us tools to maintain things like error messages, so overall it makes sense to keep the Menhir implementation.
Caveats
LR parsers can only accept a subset of the languages that GLR parsers can. This is somewhat of a problem, as Grain's grammar is not actually LR—Grain has arrow functions. After encountering an open paren, an LR parser doesn't know if it should be parsing a tuple or a function. This is an important distinction since tuples contain expressions and function arguments contain destructuring patterns. In fact, an LR parser wouldn't be able to tell which it should be parsing until it encountered (or didn't encounter) an arrow after the closing paren. The 1 in LR(1) (or the k in LR(k)) represents the number of tokens the parser will look ahead, in this case, just one. To check for that arrow, the parser would need to look ahead a potentially infinite number of tokens, which is decidedly more than one.
The tools that Menhir gives us are so good that it's worth implementing a lexer hack to solve this one problem. The simple solution to the tuple/arrow function issue is to just tell the parser when it's parsing a function. We borrow this solution from ReasonML's lexer. In the lexer, when an open paren is encountered, we scan ahead to the matching closing paren and check if the next token is an arrow. If so, we inject a special
FUN
token right before the open paren. To avoid a slowdown, all of the tokens that are seen during this process are cached—work is never duplicated.