Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(compiler)!: Re-implement Grain parser #1033

Merged
merged 28 commits into from
Jan 19, 2022
Merged

feat(compiler)!: Re-implement Grain parser #1033

merged 28 commits into from
Jan 19, 2022

Conversation

ospencer
Copy link
Member

@ospencer ospencer commented Nov 19, 2021

Closes #323
Closes #573
Closes #679
Closes #709
Closes #783
Closes #1062

Parse Grain programs using Menhir 🗿

Menhir is an LR(1) parser generator for OCaml. It generates parsers that are fast and gives us tools to help maintain the parser. The main motivation for this change is efficiency—this implementation parses the standard library's hefty regex.gr 25X faster than the current implementation. In addition to the speed improvements, syntax error messages have better location information and there are now nearly 200 custom error messages for all possible parser error states.

Syntax changes

There are some minor syntax changes that come along with this change. Grain ends expressions with a newline character— most of the changes are around where line breaks are allowed. These changes shouldn't affect how a majority of users use Grain. In fact, most code that is formatted with Grain's code formatter from Grain v0.4 will continue to parse with Grain v0.5.

Binary operators

Grain is taking Python's approach to binary operators—newline characters are no longer allowed before a binary operator.

✅ Accepted

foo /
bar
foo &&
bar
foo.
bar

❌ Not accepted

foo
/ bar
foo
&& bar
foo
.bar

Records

{ x } will no longer parse as a single-argument record, and instead will parse as a block with the identifier x. Similar to a tuple, a record with a single punned argument can be made by adding a trailing comma:

✅ Accepted

{ x, }

❌ Not accepted

{ x }

Conversely, a block that starts with x: will parse as a record rather than an expression with a type annotation. For example, () => { foo: Bar } will parse as a function that returns a record with key foo and value Bar rather than a block with value foo annotated with Bar.

Syntax error messages

Menhir's default parser generation generates a parser that is very fast, but can't provide useful error messages. Menhir can produce a slower, more compact parser that is capable of producing good error messages, so for that reason, we actually generate two parsers—a really fast parser that can generate a Grain AST, and a slower parser that doesn't generate anything, but can give good error messages. We try to parse a program using the fast parser, and if it fails, we reparse it using the slow parser and try to provide a good error message.

Menhir provides the .messages file format to maintain parser error messages. You can read all about it here. Our file is called parser.messages. It's large, but the file is mostly comments!

I added some new yarn commands for working with the messages:

yarn compiler parser:list-errors
This command will generate a file, parser.messages.generated, which contains all of the possible error states that exist in the parser, with default error messages. You can use this to view all of the states and cross-check with parser.messages.

yarn compiler parser:check-errors
This command will generate parser.messages.generated (as above) and verify that there is an error message defined in parser.messages for each error state in parser.messages.generated. (This is also run in CI.)

yarn compiler parser:interpret
This command starts an interpreter for the parser. You can type tokens and check if they're accepted by the parser:

image

I think if we define aliases for all of the tokens, you can type regular Grain code and it'd work. Maybe something for the future if people find the interpreter useful!

yarn compiler parser:interpret-error
This is similar to the regular interpret, except that it expects a syntax error to occur on the last token. It'll then print information about the error state, in the same format as the .messages file.

image

Info for parsing nerds

Menhir is an LR(1) parser generator, and Dypgen (our current parser generator) is a GLR parser. GLR parsers have a worst-case runtime complexity of O(n^3), while LR(k) parsers are O(n). We can't completely blame the performance on the algorithm, though. GLR parsers are also O(n) for unambiguous grammars—we could take all of the work that was done to make the Grain grammar work with Menhir and patch it back into the Dypgen parser and it would run significantly faster. However, Menhir forces us to keep our grammar unambiguous and gives us tools to maintain things like error messages, so overall it makes sense to keep the Menhir implementation.

Caveats

LR parsers can only accept a subset of the languages that GLR parsers can. This is somewhat of a problem, as Grain's grammar is not actually LR—Grain has arrow functions. After encountering an open paren, an LR parser doesn't know if it should be parsing a tuple or a function. This is an important distinction since tuples contain expressions and function arguments contain destructuring patterns. In fact, an LR parser wouldn't be able to tell which it should be parsing until it encountered (or didn't encounter) an arrow after the closing paren. The 1 in LR(1) (or the k in LR(k)) represents the number of tokens the parser will look ahead, in this case, just one. To check for that arrow, the parser would need to look ahead a potentially infinite number of tokens, which is decidedly more than one.

The tools that Menhir gives us are so good that it's worth implementing a lexer hack to solve this one problem. The simple solution to the tuple/arrow function issue is to just tell the parser when it's parsing a function. We borrow this solution from ReasonML's lexer. In the lexer, when an open paren is encountered, we scan ahead to the matching closing paren and check if the next token is an arrow. If so, we inject a special FUN token right before the open paren. To avoid a slowdown, all of the tokens that are seen during this process are cached—work is never duplicated.

@ospencer ospencer self-assigned this Nov 19, 2021
Copy link
Member

@peblair peblair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another beautiful pull request from our very own @ospencer. I haven't read through the messages file yet, but the rest of this looks awesome! I left a couple of comments with questions, but they are very minor.

type id = loc(Identifier.t);
type str = loc(string);
type loc = Location.t;

let default_loc_src = ref(() => Location.dummy_loc);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙌

compiler/src/parsing/lexer.mll Show resolved Hide resolved
@@ -18,18 +18,15 @@ type AReallyReallyReallyReallyReallyReallyReallyReallyReallyLongLineBreakingType
type AReallyReallyReallyReallyReallyReallyReallyReallyReallyLongLineBreakingType<
a,
b
> = String // Test comment
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this disappear?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, the locations of the type parameters were just completely wrong before 😶

They're correct now, and the comments were only kept coincidentally before. @marcusroberts is already working on a fix for me!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is working now!

compiler/src/parsing/parser.mly Outdated Show resolved Hide resolved
@ospencer ospencer force-pushed the oscar/menhir branch 2 times, most recently from c122746 to 9b6935f Compare November 29, 2021 02:54
Copy link
Member

@phated phated left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love this. So excited for the new parser. 🔥 I reviewed everything except the messages file because safari can't load it, but I wanted to submit my initial comments before I review the messages.

let env = checkpoint =>
switch (checkpoint) {
| I.HandlingError(env) => env
| _ => assert(false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use an exception so we can track this if the parser somehow gets into a weird checkpoint state?

};

let show = (text, positions) =>
E.extract(text, positions) |> E.sanitize |> E.compress |> E.shorten(20);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why shorten to 20?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No real reason! It can be whatever we like.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do we want it to be? Why shorten at all?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what we want to it be. We can play around with it though... generally we'd want to shorten because I believe this is token-based, and it could get awkward if your token was very long (like a long string, etc.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright. I'm fine leaving it as 20 for now. Do you want to open a tracking issue mentioning that it might be adjusted?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it's worth it? I think's probably fine as-is unless we decide that we want to mess around with it.

let get = (text, checkpoint, i) =>
switch (I.get(i, env(checkpoint))) {
| Some(I.Element(_, _, pos1, pos2)) => show(text, (pos1, pos2))
| None => assert(false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should have a real error too?

| None => assert(false)
};

let succeed = _v => assert(false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this for?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The driver for the incremental table-based (slow) parser takes a handler for when it successfully parses an input and a handler for when it fails to parse an input. In our case, it can never succeed (because we only invoked this parser since the fast one failed).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Should we at least add a failwith "Impossible by:"?

raise(
Ast_helper.SyntaxError(
location,
Printf.sprintf("%s%s%!", indication, message),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does menhir use ocaml boxes? Do we want to use them so these wrap correctly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, but we could add some here.

| colon typ { Some $2 }

annotated_expr:
| non_binop_expr opt_annotation { Option.fold ~none:$1 ~some:(fun ann -> Exp.constraint_ ~loc:(to_loc $loc) $1 ann) $2 }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to not have a helper for this? I kinda tend towards having extra helpers written in Reason that we can use very simply in the ocaml-syntax of the parser.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just rewrote the rule to be simpler.

compiler/src/parsing/parser.mly Outdated Show resolved Hide resolved
| UNDERSCORE { Pat.any ~loc:(to_loc $loc) () }
| const { let (pat, loc) = $1 in Pat.constant ~loc:(to_loc loc) pat }
// Allow rational numbers in patterns
| dash_op? NUMBER_INT slash_op dash_op? NUMBER_INT { Pat.constant ~loc:(to_loc $sloc) @@ Const.number (PConstNumberRational ((if Option.is_some $1 then "-" ^ $2 else $2), (if Option.is_some $4 then "-" ^ $5 else $5))) }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason you don't have a rule for rationals instead of using this in multiple places?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only used here!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea what I was seeing then

| IMPORT lseparated_nonempty_list(comma, import_shape) comma? FROM file_path { Imp.mk ~loc:(to_loc $loc) $2 $5 }

data_declaration_stmt:
// TODO: Attach attributes to the node
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens now if they are not?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a carryover from the old parser. It's really just missing a well-formedness error if you were to actually put one. I didn't try to fix it here since I'd also need to write in that logic / make disableGC do something for it.

compiler/src/parsing/parser.mly Outdated Show resolved Hide resolved
Copy link
Member

@marcusroberts marcusroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Early approval from me. I've read the code which increased my understanding a lot, and I like the performance and errors we get from this new parser. Fantastic job!

Copy link
Member

@peblair peblair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a review of the error messages. Great work! See review comments, but one general comment: I think we should refrain from having "paren" in error messages in place of "parenthesis". I don't see a reason to abbreviate.

compiler/src/parsing/parser.messages Outdated Show resolved Hide resolved
## In state 112, spurious reduction of production option(typs) -> typs
##

Expected type parameters surrounded by carets, a comma followed by more types, a dot followed by an identifier, or a closing paren to finish the variant definition.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Expected type parameters surrounded by carets, a comma followed by more types, a dot followed by an identifier, or a closing paren to finish the variant definition.
Expected type parameters surrounded by angle brackets, a comma followed by more types, a dot followed by an identifier, or a closing parenthesis to finish the variant definition.

"caret" refers to a v-shaped grapheme which points up or down (i.e. "^" or "v"). Sideways ones are "angle brackets" or "chevrons". I recommend renaming the tokens to LANGLE and RANGLE as well.

compiler/src/parsing/parser.messages Outdated Show resolved Hide resolved
compiler/src/parsing/parser.messages Outdated Show resolved Hide resolved
compiler/src/parsing/parser.messages Outdated Show resolved Hide resolved
compiler/src/parsing/parser.messages Outdated Show resolved Hide resolved
compiler/src/parsing/parser.messages Outdated Show resolved Hide resolved
compiler/src/parsing/parser.messages Outdated Show resolved Hide resolved
compiler/src/parsing/parser.messages Outdated Show resolved Hide resolved
compiler/src/parsing/parser.messages Outdated Show resolved Hide resolved
@peblair
Copy link
Member

peblair commented Jan 15, 2022

The latter bit of my review comments got broken? GitHub is showing the diffs based on the wrong line (at least locally for me)...the line numbers are correct, but the diffs are not

Comment on lines +26 to +29
"parser:interpret": "esy b menhir src/parsing/parser.mly --unused-tokens --interpret",
"parser:interpret-error": "esy b menhir src/parsing/parser.mly --unused-tokens --interpret-error",
"parser:list-errors": "esy b menhir src/parsing/parser.mly --unused-tokens --list-errors > src/parsing/parser.messages.generated",
"parser:check-errors": "yarn parser:list-errors && esy b menhir src/parsing/parser.mly --unused-tokens --compare-errors src/parsing/parser.messages.generated --compare-errors src/parsing/parser.messages",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question about this. Should we move these commands into the esy.json file or keep them here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say you know best! If you think we should move it then I will.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong feeling about it. I think all of the other ones just proxy to the esy.json scripts

@ospencer ospencer marked this pull request as ready for review January 17, 2022 21:37
Comment on lines 56 to 59
let foo = (x: (String, List<Number>)) =>
x: Foo<Number>
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

Comment on lines 67 to 71
let foo = (x: Foo) =>
x: Bar
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

Copy link
Member

@phated phated left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely heroic effort on this @ospencer! The review process has also been great (amazing description, error messages in gdoc, etc). Thank you 🙇

@phated
Copy link
Member

phated commented Jan 19, 2022

@ospencer One additional thing I'd like is for you to grep for any TODOs with the issue numbers you are closing and either fix or create a new issue to update that code path. For example, the formatter has a workaround for trailing type annotations (though it looks like it references #866 which then references the #783 you are closing, which was a mistake, but that one is on my mind because we need to remove the workaround).

Copy link
Member

@peblair peblair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic work!

@ospencer
Copy link
Member Author

@phated greps are done and all references to those issues are complete, going to do #1107 right after this lands

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Status: Done
4 participants