feat(compiler)!: Re-implement Grain parser #1033

ospencer · 2021-11-19T06:47:39Z

Closes #323
Closes #573
Closes #679
Closes #709
Closes #783
Closes #1062

Parse Grain programs using Menhir 🗿

Menhir is an LR(1) parser generator for OCaml. It generates fast parsers and gives us tools to help maintain the parser. The main motivation for this change is efficiency: this implementation parses the standard library's hefty regex.gr 25x faster than the current implementation. In addition to the speed improvements, syntax error messages have better location information, and there are now nearly 200 custom error messages covering all possible parser error states.

Syntax changes

There are some minor syntax changes that come along with this change. Grain ends expressions with a newline character, so most of the changes concern where line breaks are allowed. These changes shouldn't affect how the majority of users write Grain. In fact, most code formatted with the Grain v0.4 code formatter will continue to parse with Grain v0.5.

Binary operators

Grain is taking Python's approach to binary operators—newline characters are no longer allowed before a binary operator.

✅ Accepted

foo /
bar

foo &&
bar

foo.
bar

❌ Not accepted

foo
/ bar

foo
&& bar

foo
.bar

Records

{ x } will no longer parse as a single-argument record, and instead will parse as a block with the identifier x. Similar to a tuple, a record with a single punned argument can be made by adding a trailing comma:

✅ Accepted

{ x, }

❌ Not accepted

{ x }

Conversely, a block that starts with x: will parse as a record rather than an expression with a type annotation. For example, () => { foo: Bar } will parse as a function that returns a record with key foo and value Bar rather than a block with value foo annotated with Bar.

Syntax error messages

Menhir's default backend generates a parser that is very fast but can't provide useful error messages. Menhir can also produce a slower, more compact table-based parser that is capable of producing good error messages. For that reason, we actually generate two parsers: a really fast parser that generates a Grain AST, and a slower parser that doesn't generate anything but can give good error messages. We first try to parse a program with the fast parser; if that fails, we reparse it with the slow parser and try to provide a good error message.
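As a rough illustration of the fast-then-slow strategy, here is a minimal OCaml sketch. It is not the PR's driver: `fast_parse`, `explain_error`, and `Syntax_error` are hypothetical stand-ins, and the toy "grammar" is just balanced parentheses rather than Grain.

```ocaml
(* Toy stand-ins for the two generated parsers. All names here are
   hypothetical; the "grammar" is just balanced parentheses. *)

exception Syntax_error of int * string (* offset, message *)

(* Fast path: computes a result (the maximum nesting depth) and gives up
   with no diagnostics at all if the input is malformed. *)
let fast_parse (src : string) : int option =
  let depth = ref 0 and max_depth = ref 0 and ok = ref true in
  String.iter
    (fun c ->
      if !ok then
        match c with
        | '(' -> incr depth; max_depth := max !max_depth !depth
        | ')' -> if !depth = 0 then ok := false else decr depth
        | _ -> ok := false)
    src;
  if !ok && !depth = 0 then Some !max_depth else None

(* Slow path: rescans the input only to locate and describe the error.
   It never returns a result, mirroring the error-only parser. *)
let explain_error (src : string) : 'a =
  let depth = ref 0 in
  String.iteri
    (fun i c ->
      match c with
      | '(' -> incr depth
      | ')' ->
        if !depth = 0 then
          raise (Syntax_error (i, "Unexpected ')' with no matching '('."))
        else decr depth
      | _ ->
        raise (Syntax_error (i, Printf.sprintf "Unexpected character '%c'." c)))
    src;
  if !depth > 0 then
    raise (Syntax_error (String.length src, "Expected ')' before end of input."))
  else failwith "Impossible: the slow parser accepted input the fast parser rejected"

(* Driver: only pay for the slow parser when the fast one has failed. *)
let parse (src : string) : int =
  match fast_parse src with
  | Some result -> result
  | None -> explain_error src
```

Valid programs only ever run the fast parser, so the cost of good diagnostics is paid only on inputs that already failed to parse.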

Menhir provides the .messages file format to maintain parser error messages. You can read all about it here. Our file is called parser.messages. It's large, but the file is mostly comments!

I added some new yarn commands for working with the messages:

yarn compiler parser:list-errors
This command will generate a file, parser.messages.generated, which contains all of the possible error states that exist in the parser, with default error messages. You can use this to view all of the states and cross-check with parser.messages.

yarn compiler parser:check-errors
This command will generate parser.messages.generated (as above) and verify that there is an error message defined in parser.messages for each error state in parser.messages.generated. (This is also run in CI.)

yarn compiler parser:interpret
This command starts an interpreter for the parser. You can type tokens and check if they're accepted by the parser.

I think if we define aliases for all of the tokens, you can type regular Grain code and it'd work. Maybe something for the future if people find the interpreter useful!

yarn compiler parser:interpret-error
This is similar to the regular interpret, except that it expects a syntax error to occur on the last token. It'll then print information about the error state, in the same format as the .messages file.

Info for parsing nerds

Menhir is an LR(1) parser generator, and Dypgen (our current parser generator) is a GLR parser generator. GLR parsers have a worst-case runtime complexity of O(n^3), while LR(k) parsers are O(n). We can't completely blame the performance on the algorithm, though. GLR parsers also run in O(n) on deterministic (LR) grammars, so we could take all of the work that was done to make the Grain grammar work with Menhir and patch it back into the Dypgen parser and it would run significantly faster. However, Menhir forces us to keep our grammar unambiguous and gives us tools to maintain things like error messages, so overall it makes sense to keep the Menhir implementation.

Caveats

LR parsers can only accept a subset of the languages that GLR parsers can. This is somewhat of a problem, as Grain's grammar is not actually LR: Grain has arrow functions. After encountering an open paren, an LR parser doesn't know if it should be parsing a tuple or a function. This is an important distinction, since tuples contain expressions while function arguments contain destructuring patterns. In fact, an LR parser can't tell which one it should be parsing until it encounters (or doesn't encounter) an arrow after the closing paren. The 1 in LR(1) (or the k in LR(k)) is the number of tokens the parser will look ahead, in this case just one. To check for that arrow, the parser would need to look ahead an unbounded number of tokens, which is decidedly more than one.

The tools that Menhir gives us are so good that it's worth implementing a lexer hack to solve this one problem. The simple solution to the tuple/arrow function issue is to just tell the parser when it's parsing a function. We borrow this solution from ReasonML's lexer. In the lexer, when an open paren is encountered, we scan ahead to the matching closing paren and check if the next token is an arrow. If so, we inject a special FUN token right before the open paren. To avoid a slowdown, all of the tokens that are seen during this process are cached—work is never duplicated.
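The scan-ahead described above can be sketched in OCaml as follows. This is an illustration under simplifying assumptions, not the actual lexer: the `token` type and the helper names are hypothetical, the input is an already materialized token array rather than a lexer stream, and the token cache is omitted, so nested parens are rescanned here.

```ocaml
(* Hypothetical, heavily reduced token type. *)
type token = LPAREN | RPAREN | COMMA | ARROW | FUN | IDENT of string

(* Index of the token just past the paren that closes the LPAREN at [i],
   or None if the parens are unbalanced. *)
let after_matching_paren (toks : token array) (i : int) : int option =
  let n = Array.length toks in
  let rec go j depth =
    if j >= n then None
    else
      match toks.(j) with
      | LPAREN -> go (j + 1) (depth + 1)
      | RPAREN -> if depth = 1 then Some (j + 1) else go (j + 1) (depth - 1)
      | _ -> go (j + 1) depth
  in
  go i 0

(* Emit a FUN marker before every LPAREN whose matching RPAREN is
   immediately followed by ARROW; all other tokens pass through as-is. *)
let inject_fun (toks : token array) : token list =
  let n = Array.length toks in
  let out = ref [] in
  for i = 0 to n - 1 do
    (match toks.(i) with
     | LPAREN ->
       (match after_matching_paren toks i with
        | Some k when k < n && toks.(k) = ARROW -> out := FUN :: !out
        | _ -> ())
     | _ -> ());
    out := toks.(i) :: !out
  done;
  List.rev !out
```

With the marker in place, the grammar can tell the two cases apart using a single token of lookahead: a paren preceded by FUN starts a parameter list, and any other paren starts an expression.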

peblair

Another beautiful pull request from our very own @ospencer. I haven't read through the messages file yet, but the rest of this looks awesome! I left a couple of comments with questions, but they are very minor.

peblair · 2021-11-20T20:37:36Z

compiler/src/parsing/ast_helper.re

 type id = loc(Identifier.t);
 type str = loc(string);
 type loc = Location.t;

-let default_loc_src = ref(() => Location.dummy_loc);


peblair · 2021-11-20T20:45:49Z

compiler/test/formatter_outputs/aliases.gr

@@ -18,18 +18,15 @@ type AReallyReallyReallyReallyReallyReallyReallyReallyReallyLongLineBreakingType
 type AReallyReallyReallyReallyReallyReallyReallyReallyReallyLongLineBreakingType<
  a,
  b
-> = String // Test comment


Why does this disappear?

Basically, the locations of the type parameters were just completely wrong before 😶

They're correct now, and the comments were only kept coincidentally before. @marcusroberts is already working on a fix for me!

I think this is working now!

phated

I love this. So excited for the new parser. 🔥 I reviewed everything except the messages file because Safari can't load it, but I wanted to submit my initial comments before I review the messages.

phated · 2021-11-24T16:41:34Z

compiler/src/parsing/driver.re

+let env = checkpoint =>
+  switch (checkpoint) {
+  | I.HandlingError(env) => env
+  | _ => assert(false)


Can you use an exception so we can track this if the parser somehow gets into a weird checkpoint state?

phated · 2021-11-24T16:42:27Z

compiler/src/parsing/driver.re

+  };
+
+let show = (text, positions) =>
+  E.extract(text, positions) |> E.sanitize |> E.compress |> E.shorten(20);


Why shorten to 20?

No real reason! It can be whatever we like.

What do we want it to be? Why shorten at all?

I'm not sure what we want it to be. We can play around with it though... generally we'd want to shorten because I believe this is token-based, and it could get awkward if your token was very long (like a long string, etc.)

Alright. I'm fine leaving it as 20 for now. Do you want to open a tracking issue mentioning that it might be adjusted?

Do you think it's worth it? I think it's probably fine as-is unless we decide that we want to mess around with it.

phated · 2021-11-24T16:43:02Z

compiler/src/parsing/driver.re

+let get = (text, checkpoint, i) =>
+  switch (I.get(i, env(checkpoint))) {
+  | Some(I.Element(_, _, pos1, pos2)) => show(text, (pos1, pos2))
+  | None => assert(false)


Maybe this should have a real error too?

phated · 2021-11-24T16:43:14Z

compiler/src/parsing/driver.re

+  | None => assert(false)
+  };
+
+let succeed = _v => assert(false);


What's this for?

The driver for the incremental table-based (slow) parser takes a handler for when it successfully parses an input and a handler for when it fails to parse an input. In our case, it can never succeed (because we only invoked this parser since the fast one failed).

Interesting. Should we at least add a failwith "Impossible by:"?
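One way the suggestion in this thread could look, sketched in OCaml. The `checkpoint` type below is a hypothetical stand-in for the parser's checkpoint type, and `impossible` is an assumed helper name; the code that actually landed may differ.

```ocaml
(* Hypothetical stand-in for the incremental parser's checkpoint type. *)
type 'env checkpoint =
  | InputNeeded
  | HandlingError of 'env
  | Accepted

(* Fail with a descriptive message instead of a bare assert(false), so an
   unexpected state can be traced back to the invariant that was broken. *)
let impossible (reason : string) = failwith ("Impossible by: " ^ reason)

(* Only an error-handling checkpoint carries an environment. *)
let env (checkpoint : 'env checkpoint) : 'env =
  match checkpoint with
  | HandlingError env -> env
  | InputNeeded | Accepted ->
    impossible "env is only requested after the parser reports an error"

(* Success handler for the slow parser, which only runs on input that the
   fast parser has already rejected. *)
let succeed _v = impossible "the fast parser already rejected this input"
```

Compared with `assert(false)`, the `Failure` message records which invariant was expected to hold.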

phated · 2021-11-24T16:44:22Z

compiler/src/parsing/driver.re

+  raise(
+    Ast_helper.SyntaxError(
+      location,
+      Printf.sprintf("%s%s%!", indication, message),


Does menhir use ocaml boxes? Do we want to use them so these wrap correctly?

I'm not sure, but we could add some here.

phated · 2021-11-29T04:17:01Z

compiler/src/parsing/parser.mly

+  | colon typ { Some $2 }
+
+annotated_expr:
+  | non_binop_expr opt_annotation { Option.fold ~none:$1 ~some:(fun ann -> Exp.constraint_ ~loc:(to_loc $loc) $1 ann) $2 }


Any reason to not have a helper for this? I kinda tend towards having extra helpers written in Reason that we can use very simply in the ocaml-syntax of the parser.

I just rewrote the rule to be simpler.

phated · 2021-11-29T04:20:35Z

compiler/src/parsing/parser.mly

+  | UNDERSCORE { Pat.any ~loc:(to_loc $loc) () }
+  | const { let (pat, loc) = $1 in Pat.constant ~loc:(to_loc loc) pat }
+  // Allow rational numbers in patterns
+  | dash_op? NUMBER_INT slash_op dash_op? NUMBER_INT { Pat.constant ~loc:(to_loc $sloc) @@ Const.number (PConstNumberRational ((if Option.is_some $1 then "-" ^ $2 else $2), (if Option.is_some $4 then "-" ^ $5 else $5))) }


Is there a reason you don't have a rule for rationals instead of using this in multiple places?

This is only used here!

I have no idea what I was seeing then

phated · 2021-11-29T04:29:18Z

compiler/src/parsing/parser.mly

+  | IMPORT lseparated_nonempty_list(comma, import_shape) comma? FROM file_path { Imp.mk ~loc:(to_loc $loc) $2 $5 }
+
+data_declaration_stmt:
+  // TODO: Attach attributes to the node


What happens now if they are not?

This is a carryover from the old parser. It's really just missing a well-formedness error if you were to actually put one. I didn't try to fix it here since I'd also need to write in that logic / make disableGC do something for it.

marcusroberts

Early approval from me. I've read the code which increased my understanding a lot, and I like the performance and errors we get from this new parser. Fantastic job!

peblair

Did a review of the error messages. Great work! See review comments, but one general comment: I think we should refrain from having "paren" in error messages in place of "parenthesis". I don't see a reason to abbreviate.

peblair · 2022-01-15T11:12:39Z

compiler/src/parsing/parser.messages

+## In state 112, spurious reduction of production option(typs) -> typs
+##
+
+Expected type parameters surrounded by carets, a comma followed by more types, a dot followed by an identifier, or a closing paren to finish the variant definition.


Suggested change

-Expected type parameters surrounded by carets, a comma followed by more types, a dot followed by an identifier, or a closing paren to finish the variant definition.
+Expected type parameters surrounded by angle brackets, a comma followed by more types, a dot followed by an identifier, or a closing parenthesis to finish the variant definition.

"caret" refers to a v-shaped grapheme which points up or down (i.e. "^" or "v"). Sideways ones are "angle brackets" or "chevrons". I recommend renaming the tokens to LANGLE and RANGLE as well.

peblair · 2022-01-15T12:20:44Z

The latter bit of my review comments got broken? GitHub is showing the diffs based on the wrong line (at least locally for me)...the line numbers are correct, but the diffs are not

phated · 2022-01-17T21:28:27Z

compiler/package.json

+    "parser:interpret": "esy b menhir src/parsing/parser.mly --unused-tokens --interpret",
+    "parser:interpret-error": "esy b menhir src/parsing/parser.mly --unused-tokens --interpret-error",
+    "parser:list-errors": "esy b menhir src/parsing/parser.mly --unused-tokens --list-errors > src/parsing/parser.messages.generated",
+    "parser:check-errors": "yarn parser:list-errors && esy b menhir src/parsing/parser.mly --unused-tokens --compare-errors src/parsing/parser.messages.generated --compare-errors src/parsing/parser.messages",


Question about this. Should we move these commands into the esy.json file or keep them here?

I would say you know best! If you think we should move it then I will.

I don't have a strong feeling about it. I think all of the other ones just proxy to the esy.json scripts

Co-authored-by: Blaine Bublitz <blaine.bublitz@gmail.com>

phated · 2022-01-17T21:46:00Z

compiler/test/suites/types.re

+      let foo = (x: (String, List<Number>)) =>
        x: Foo<Number>
-      }


Same as above

phated · 2022-01-17T21:46:03Z

compiler/test/suites/types.re

+      let foo = (x: Foo) =>
        x: Bar
-      }


Same as above

phated

Absolutely heroic effort on this @ospencer! The review process has also been great (amazing description, error messages in gdoc, etc). Thank you 🙇

phated · 2022-01-19T18:45:17Z

@ospencer One additional thing I'd like is for you to grep for any TODOs with the issue numbers you are closing and either fix or create a new issue to update that code path. For example, the formatter has a workaround for trailing type annotations (though it looks like it references #866 which then references the #783 you are closing, which was a mistake, but that one is on my mind because we need to remove the workaround).

peblair

Fantastic work!

ospencer · 2022-01-19T22:48:50Z

@phated greps are done and all references to those issues are complete, going to do #1107 right after this lands

ospencer self-assigned this Nov 19, 2021

peblair reviewed Nov 20, 2021

ospencer force-pushed the oscar/menhir branch 2 times, most recently from c122746 to 9b6935f Compare November 29, 2021 02:54

phated reviewed Nov 29, 2021

ospencer force-pushed the oscar/menhir branch from 12c011b to 70fd41d Compare December 9, 2021 23:47

marcusroberts approved these changes Jan 7, 2022

ospencer added 10 commits January 11, 2022 12:05

feat(compiler)!: Re-implement Grain parser

ee49fbe

Upgrade Menhir and add yarn commands

74deea5

Add CI parser errors exhaustiveness check

da5663c

Remove unnecessary %prec annotation

a716619

Make annotated_expr clearer

62012bd

Attach attributes to data delcaration nodes

c47b6d0

Add support for a newline character before 'when'

0d6cc8c

sic

50cceda

sic

8af374e

Support for if/else, newlines around arrows, rebase updates

728e4b8

ospencer force-pushed the oscar/menhir branch from f9e9697 to 728e4b8 Compare January 13, 2022 17:57

ospencer added 2 commits January 13, 2022 13:26

Add messages for new attributes

af338af

Fix formatter

374f3b4

peblair reviewed Jan 15, 2022

Apply error messages feedback

7d04bdf

phated reviewed Jan 17, 2022

ospencer marked this pull request as ready for review January 17, 2022 21:37

ospencer and others added 3 commits January 17, 2022 16:38

Fix dune-project Menhir version

0c7a29f

Co-authored-by: Blaine Bublitz <blaine.bublitz@gmail.com>

Add comment about binop helper

48be770

Use Lexing.set_filename

1735e32

phated reviewed Jan 17, 2022

ospencer added 10 commits January 17, 2022 16:47

Add failwith for error handling state

167ffaf

Remove more assert false

827f4c3

Style

ec23961

Improve final error state

834a4d8

Swap type test record/block syntax

52dc159

Swap type test record/block syntax

45be967

Update compiler/src/parsing/parser.mly

88d7fd4

Update compiler/src/parsing/parser.mly

be138e3

Sync messages to Google doc

6692dcd

Remove closing character terminology

eb23d59

phated approved these changes Jan 19, 2022

peblair approved these changes Jan 19, 2022

Remove reference to issue we're closing

e066a36

ospencer mentioned this pull request Jan 19, 2022

Update formatter to not put parens around annotated types and reformat stdlib #1107

Closed

Merge branch 'main' into oscar/menhir

de27967

ospencer merged commit 9dc3c96 into main Jan 19, 2022

ospencer deleted the oscar/menhir branch January 19, 2022 23:26
