[Frontend][Relay] Add Parser 2.0 #5932

Merged: 45 commits merged into apache:master on Jul 8, 2020

Conversation

@jroesch (Member) commented Jun 26, 2020

This PR implements a pure C++ parser for Relay's text format and starts to lay the groundwork for improved error reporting and the full-program parsing work that I will send an RFC for sometime next week. The goal is to remove the external dependency on ANTLR and make it easier for non-parsing experts to make simple modifications or tweaks to the parser.

I have implemented nearly all of the expression and definition parsing; I have some remaining work to do on parsing types and on ensuring end-to-end examples work. I am opening the PR now in draft form to solicit some initial feedback.

Features

  • graph definitions and variables
  • comments
  • integer literals
  • float literals
  • boolean literals
  • unary operations
  • binary operations
  • parens
  • operator table and operator precedence
  • let bindings
  • sequencing
  • tuple expressions
  • function literals
  • top-level global functions
  • recursive calls
  • if-then-else
  • function calls
  • incomplete types
  • builtin types
  • tuple types
  • ADT definitions
  • match expression
  • extern types
  • metadata parsing
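
To give a sense of the surface area, here is a hypothetical usage sketch in C++. The entry point name ParseModule and its signature are assumptions for illustration (the actual binding added by this PR may differ), and the embedded Relay program is only meant to exercise a few of the features listed above (a global function, a let binding, if-then-else, and a recursive call):

#include <string>

#include <tvm/ir/module.h>

// Hypothetical declaration, assumed for illustration only; the PR registers
// the parser internally and the exposed name/signature may differ.
namespace tvm {
namespace parser {
IRModule ParseModule(const std::string& file_name, const std::string& source);
}  // namespace parser
}  // namespace tvm

int main() {
  // A small Relay text-format program exercising several of the listed features.
  std::string source = R"(
    v0.0.4
    def @count_down(%i: int32) -> int32 {
      if (%i > 0) {
        let %next = %i - 1;
        @count_down(%next)
      } else {
        %i
      }
    }
  )";
  tvm::IRModule mod = tvm::parser::ParseModule("example.relay", source);
  return 0;
}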

@weberlo (Contributor) commented Jun 26, 2020

A few thoughts:
It's not clear to me that modifying this parser is any easier than modifying the current one. One could make a case that the current parser is suboptimal, because ANTLR does a sort of "covering parse" and _parser.py then does another stage of parsing that incorporates context, but I would argue there's value in this separation of concerns: you no longer need to worry about the syntactic components of parsing (e.g., precedence and associativity).

Another benefit of using a parser generator like ANTLR is that you have a specification of the language that serves as documentation and defines the parsing behavior, keeping the documentation always up to date.

I see the value in error reporting integration and removing the external dependency, but it would be good to further motivate these changes and maybe find ways to further modularize version 2.0 to make it noob-friendly.

@jroesch (Member, Author) commented Jun 29, 2020

@weberlo I think ANTLR only provides those benefits if you assume the people working on the project actually know ANTLR, which, as far as I can tell, is not true. Josh and you were pretty much the only ones to work on the previous parser. Not to mention that as we extend the parser to TIR and the rest of TVM, it will become increasingly hard for anyone to make even small tweaks.

The current parser was also incomplete and failed to handle many tricky cases, which can often be solved with a small, constant amount of token lookahead.
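
For illustration, here is a minimal, self-contained sketch of bounded lookahead over a token stream. The names Lookahead and Peek do appear in the PR's parser, but the structure and member names below are stand-ins, not the PR's actual code:

#include <cassert>
#include <vector>

// Stand-in token machinery, assumed for illustration only.
enum class TokenType { Identifier, Equal, Number };

struct Token {
  TokenType type;
};

struct SketchParser {
  std::vector<Token> tokens;
  size_t pos = 0;

  // Look at the token n positions ahead without consuming anything.
  const Token& Lookahead(size_t n) const {
    assert(pos + n < tokens.size());
    return tokens[pos + n];
  }

  // Disambiguate two productions with one extra token of lookahead,
  // instead of restructuring the whole grammar.
  bool LooksLikeBinding() const {
    return Lookahead(0).type == TokenType::Identifier &&
           Lookahead(1).type == TokenType::Equal;
  }
};

int main() {
  SketchParser p{{{TokenType::Identifier}, {TokenType::Equal}, {TokenType::Number}}};
  assert(p.LooksLikeBinding());
  return 0;
}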

Furthermore, many of the grammar gymnastics required to parse with ANTLR are complex and easy for new users to break. While a hand-written parser might still require some understanding, the ordering is explicit in code for users to read and learn from.
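
As an illustration of what "explicit in code" can look like for operator precedence, here is a generic precedence-climbing sketch. This is not the PR's op_table.h; it is a self-contained toy where the precedence table is plain data:

#include <cassert>
#include <map>
#include <sstream>
#include <string>

// A generic precedence-climbing evaluator for +, -, *, /.
// The precedence table is plain data a reader can inspect and extend.
static const std::map<char, int> kPrecedence = {{'+', 1}, {'-', 1}, {'*', 2}, {'/', 2}};

struct ExprParser {
  std::istringstream in;
  explicit ExprParser(const std::string& s) : in(s) {}

  long ParsePrimary() {
    long value = 0;
    in >> value;
    return value;
  }

  long ParseBinary(int min_prec) {
    long lhs = ParsePrimary();
    while (true) {
      char op = static_cast<char>(in.peek());
      auto it = kPrecedence.find(op);
      if (it == kPrecedence.end() || it->second < min_prec) break;
      in.get();  // consume the operator
      long rhs = ParseBinary(it->second + 1);  // binds left-associatively
      switch (op) {
        case '+': lhs += rhs; break;
        case '-': lhs -= rhs; break;
        case '*': lhs *= rhs; break;
        case '/': lhs /= rhs; break;
      }
    }
    return lhs;
  }
};

int main() {
  ExprParser parser("1+2*3-4");
  assert(parser.ParseBinary(1) == 3);  // 1 + (2*3) - 4
  return 0;
}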

ANTLR is also a painful deployment dependency, as we need Java, Python, and C++ to build the current parser. Furthermore, the parser necessitated a rewrite given that it was written in Python and needs to be in C++ or another statically linkable language.

Finally, error reporting is the main reason to write it by hand: if you look at most production-quality compilers, they have hand-written parsers, mostly for error reporting and recovery reasons. Most generated parsers fail on the first invalid token or parse issue, such as an invalid identifier. The above parser can continue even after encountering a parse error, enabling better error reporting.
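
A rough sketch of the recovery idea (this is only a toy and does not reflect the PR's diagnostic machinery): instead of aborting on the first bad character, record a diagnostic and keep scanning so every problem in the input is reported at once:

#include <cctype>
#include <iostream>
#include <string>
#include <vector>

struct Diagnostic {
  size_t position;
  std::string message;
};

// Parse a comma-separated list of single digits ("1,2,3"), collecting
// diagnostics for bad elements instead of stopping at the first one.
std::vector<Diagnostic> ParseDigitList(const std::string& src) {
  std::vector<Diagnostic> diagnostics;
  for (size_t i = 0; i < src.size(); i += 2) {
    if (!std::isdigit(static_cast<unsigned char>(src[i]))) {
      diagnostics.push_back({i, "expected a digit"});
      // Recovery: fall through and continue with the next element so
      // later errors in the same input are also reported.
    }
  }
  return diagnostics;
}

int main() {
  for (const auto& d : ParseDigitList("1,x,3,y,5")) {
    std::cout << "error at position " << d.position << ": " << d.message << "\n";
  }
  return 0;
}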

In my experience, compilers which use parser generators (e.g., OCaml or F*) have a horrible user experience when compared to languages with hand-rolled parsers such as Rust or Lean.

@jroesch marked this pull request as ready for review June 30, 2020 00:03
@jroesch (Member, Author) commented Jun 30, 2020

I just marked this as ready for review; my suggestion is that we review the existing code and land it in an experimental state. I will finish the metadata parsing and integration tests on real models in follow-up work. My fear is that if I do it all on this branch, we are looking at 4k+ lines of code to review at once. The PR is already pretty big and will take time to review. Thoughts?

@jroesch (Member, Author) commented Jun 30, 2020

cc @antinucleon and @jwfromm

@zhiics (Member) commented Jun 30, 2020

I agree we should incrementally add support for these language features to make review smoother.

@tqchen (Member) commented Jun 30, 2020, leaving inline review comments on src/parser/op_table.h, src/parser/parser.cc, and the parser tests.

@electriclilies (Contributor) left a comment

In general, more documentation would be helpful, especially in parser.cc, describing how the various classes fit together. Also, the Lookahead function in parser.cc seemed a bit weird / broken; I left some comments about it there.

@ANSHUMAN87 (Contributor) commented

@jroesch: Thanks for the PR! Great work 👍
I totally agree with the motivations you have mentioned behind this.
I am sorry, I could not find an RFC for this PR. Would it be possible to share an initial high-level design (HLD) draft?

@jroesch (Member, Author) commented Jul 6, 2020

@ANSHUMAN87 I have been super busy and will post one soon.

@jroesch (Member, Author) commented Jul 7, 2020

@ANSHUMAN87 here are some initial details: https://discuss.tvm.ai/t/rfc-meta-rfc-3-pronged-plan-for-improving-error-messages-in-tvm/7214

@jroesch (Member, Author) commented Jul 7, 2020

Okay, I addressed the vast majority of the comments directly and hopefully got everything. CI is building; it would be great if people can do another pass.

/*! Conditionally consume a token when it matches, this will never trigger an error
* as we guard against consuming the token before we do.
*
* Useful for matching optional tokens, effectively looksahead by one.
Contributor:

Suggested change
* Useful for matching optional tokens, effectively looksahead by one.
* Useful for matching optional tokens, effectively looks ahead by one.
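
For reference, a minimal sketch of what a conditional-consume helper with this contract might look like. The name WhenMatch and the member layout below are assumptions for illustration; Peek and Consume do appear elsewhere in the diff, but with different signatures:

#include <cassert>
#include <vector>

// Stand-in token machinery, assumed for illustration only.
enum class TokenType { Let, Identifier, Semicolon, EndOfFile };

struct SketchParser {
  std::vector<TokenType> tokens;
  size_t pos = 0;

  TokenType Peek() const { return tokens[pos]; }

  void Consume(TokenType expected) {
    // In the real parser this would raise a diagnostic on mismatch;
    // here we simply assert because the caller guards with Peek().
    assert(Peek() == expected);
    pos++;
  }

  // Conditionally consume a token when it matches: never an error,
  // because we only consume after checking. Effectively one token of lookahead.
  bool WhenMatch(TokenType expected) {
    if (Peek() == expected) {
      Consume(expected);
      return true;
    }
    return false;
  }
};

int main() {
  SketchParser p{{TokenType::Semicolon, TokenType::EndOfFile}};
  assert(p.WhenMatch(TokenType::Semicolon));   // optional token present: consumed
  assert(!p.WhenMatch(TokenType::Semicolon));  // absent: no error, nothing consumed
  return 0;
}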

Comment on lines 461 to 495
/*! \brief Convert a numeric token to an NDArray for embedding into the Relay program. */
NDArray NumberToNDArray(const Token& token) {
  if (token->token_type == TokenType::Integer) {
    DLContext ctx({.device_type = DLDeviceType::kDLCPU, .device_id = 0});
    auto dtype = String2DLDataType("int32");
    auto data = NDArray::Empty({}, dtype, ctx);
    auto array = reinterpret_cast<int32_t*>(data->data);
    // revisit this, literal node issue.
    int64_t value = Downcast<tvm::Integer>(token->data);
    array[0] = (int32_t)value;
    return data;
  } else if (token->token_type == TokenType::Float) {
    DLContext ctx({.device_type = DLDeviceType::kDLCPU, .device_id = 0});
    auto dtype = String2DLDataType("float32");
    auto data = NDArray::Empty({}, dtype, ctx);
    auto array = reinterpret_cast<float*>(data->data);
    // revisit this, literal node issue.
    float value = Downcast<tvm::FloatImm>(token->data)->value;
    array[0] = value;
    return data;
  } else {
    LOG(FATAL) << "internal error: should only call this function on numeric tokens";
    return NDArray();
  }
}

/*! \brief Convert a boolean value to an NDArray for embedding into the Relay program. */
NDArray BooleanToNDarray(bool value) {
  DLContext ctx({.device_type = DLDeviceType::kDLCPU, .device_id = 0});
  auto dtype = String2DLDataType("bool");
  auto data = NDArray::Empty({}, dtype, ctx);
  auto array = reinterpret_cast<bool*>(data->data);
  array[0] = value;
  return data;
}
Contributor:

does it make sense to refactor this?

Member Author:

WDYM?

Contributor:

with the method above. seems like there's shared structure at a glance

Member Author:

There isn't really any easy way to refactor this, because you can't really templatize the code cleanly due to the need to pass dtypes around and perform the correct casting based on dtype and container type.

Contributor:

sounds good. wasn't sure

Comment on lines 599 to 617
SemVer ParseSemVer() {
  // TODO(@jroesch): convert semver to module level attribute.
  auto id = Peek();
  if (id->token_type == TokenType::Identifier && id.ToString() == "v0") {
    auto id = Match(TokenType::Identifier);
    Consume(TokenType::Period);
    // CHECK_EQ(minor_and_patch)
    Consume(TokenType::Float);
  }
  // For now we only support current version.
  return SemVer{.major = 0, .minor = 0, .patch = 4};
}
Contributor:

even if we only support the current version, we should still validate the given version matches that, right?

Member Author:

There are annoying issues with how this is done right now. I would like to move away from some ugly lexing hacks, but in order to do that I need to change the semver handling. I would like to introduce module-level attributes and provide general parsing for those instead of continuing to hack this in. I will make sure this works before we purge the old parser.

@weberlo (Contributor) commented Jul 7, 2020:

Ohh, so the TODO above means we remove the semver from the text format? We can discuss whether or not to do so later, but until we do, we should at least have a hack that checks for "v0.0.4", rather than the current half measure.

@jroesch (Member, Author) commented Jul 7, 2020:

No, the problem is that that isn't a valid token, and trying to hack it in is going to be a huge hack because it's incredibly contextual and overlaps with a lot of other lexing rules. I don't really want to do it given that I WILL rip it out soon, and the old parser is still in place for now.

Contributor:

okay. if it's a big change, then we can leave it as is for now

      case TokenType::Extern: {
        Consume(TokenType::Extern);
        // TODO(@jroesch): add some validation here?
        defs.types.push_back(ParseTypeDef());
Contributor:

I'm not sure it can be validated, since it's opaque. Unless you mean something else?

Member Author:

The parser is the only place we can reject this if it has non-zero fields. I will come back to this in the final cleanup; I'm trying to land the initial infra first, and then we can do a polish pass or two.

Contributor:

Oh, my bad, you meant to ensure there aren't any constructors. That should be a two-line fix, right? Just store the parsed def, then CHECK_EQ(def->constructors.size(), 0) before pushing it back.
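
A toy sketch of the shape of that check (the Relay types are stand-ins here; in the real parser this would be Relay's TypeData with its constructors array, and the check would go through the diagnostic machinery rather than an assert):

#include <cassert>
#include <string>
#include <vector>

// Stand-ins for Relay's Constructor and TypeData, for illustration only.
struct Constructor {
  std::string name;
};

struct TypeData {
  std::string header;
  std::vector<Constructor> constructors;
};

// Shape of the suggested fix: parse the definition first, check that an
// extern type declares no constructors, and only then record it.
void AddExternType(std::vector<TypeData>* types, const TypeData& def) {
  assert(def.constructors.empty() && "extern types cannot declare constructors");
  types->push_back(def);
}

int main() {
  std::vector<TypeData> types;
  AddExternType(&types, TypeData{"MyExternType", {}});
  return 0;
}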

Member Author:

Yeah, we need to do proper error handling (which is the goal of my next PR, so I figured I would do it all in one).

Contributor:

sounds good

@zhiics (Member) left a comment

LGTM. @tqchen could you take another look?

@tqchen merged commit f9e905a into apache:master Jul 8, 2020
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Jul 14, 2020
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Jul 14, 2020
@jroesch deleted the parser-2.0 branch February 4, 2021 04:41