[Frontend][Relay] Add Parser 2.0 #5932

Merged: 45 commits merged into apache:master on Jul 8, 2020

Conversation

@jroesch (Member) commented Jun 26, 2020

This PR implements a pure C++ parser for Relay's text format and starts to lay the groundwork for improved error reporting and the full-program parsing work that I will send an RFC for sometime next week. The goal is to remove the external dependency on ANTLR and make it easier for non-parsing experts to make simple modifications or tweaks to the parser.

I have implemented nearly all of the expression and definition parsing; I have some remaining work to do on parsing types and on ensuring end-to-end examples work. I am opening the PR now in draft form to solicit some initial feedback.

Features

  • graph definitions and variables
  • comments
  • integer literals
  • float literals
  • boolean literals
  • unary operations
  • binary operations
  • parens
  • operator table and operator precedence
  • let bindings
  • sequencing
  • tuple expressions
  • function literals
  • top-level global functions
  • recursive calls
  • if-then-else
  • function calls
  • incomplete types
  • builtin types
  • tuple types
  • ADT definitions
  • match expression
  • extern types
  • metadata parsing
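
To give a sense of the surface area, here is a hypothetical usage sketch in C++. The entry point name ParseModule and its signature are assumptions for illustration (the actual binding added by this PR may differ), and the embedded Relay program is only meant to exercise a few of the features listed above (a global function, a let binding, if-then-else, and a recursive call):

#include <string>

#include <tvm/ir/module.h>

// Hypothetical declaration, assumed for illustration only; the PR registers
// the parser internally and the exposed name/signature may differ.
namespace tvm {
namespace parser {
IRModule ParseModule(const std::string& file_name, const std::string& source);
}  // namespace parser
}  // namespace tvm

int main() {
  // A small Relay text-format program exercising several of the listed features.
  std::string source = R"(
    v0.0.4
    def @count_down(%i: int32) -> int32 {
      if (%i > 0) {
        let %next = %i - 1;
        @count_down(%next)
      } else {
        %i
      }
    }
  )";
  tvm::IRModule mod = tvm::parser::ParseModule("example.relay", source);
  return 0;
}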

@weberlo (Contributor) commented Jun 26, 2020

A few thoughts:
It's not clear to me that modifying this parser is any easier than modifying the current one. One could make a case that the current parser is suboptimal, because ANTLR does a sort of "covering parse" and _parser.py then does another stage of parsing that incorporates context, but I would argue there's value in this separation of concerns: you no longer need to worry about the syntactic components of parsing (e.g., precedence and associativity).

Another benefit of using a parser generator like ANTLR is that you have a specification of the language that serves as documentation and defines the parsing behavior, keeping the documentation always up to date.

I see the value in error reporting integration and removing the external dependency, but it would be good to further motivate these changes and maybe find ways to further modularize version 2.0 to make it noob-friendly.

@jroesch (Member, Author) commented Jun 29, 2020

@weberlo I think ANTLR only provides those benefits if you assume the people working on the project actually know ANTLR, which, as far as I can tell, is not true. Josh and you were pretty much the only ones to work on the previous parser. Not to mention that as we extend the parser to TIR and the rest of TVM, it will become increasingly hard for anyone to make even small tweaks.

The current parser was also incomplete and failed to handle many tricky cases, which can often be solved with a small, constant amount of token lookahead.
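
For illustration, here is a minimal, self-contained sketch of bounded lookahead over a token stream. The names Lookahead and Peek do appear in the PR's parser, but the structure and member names below are stand-ins, not the PR's actual code:

#include <cassert>
#include <vector>

// Stand-in token machinery, assumed for illustration only.
enum class TokenType { Identifier, Equal, Number };

struct Token {
  TokenType type;
};

struct SketchParser {
  std::vector<Token> tokens;
  size_t pos = 0;

  // Look at the token n positions ahead without consuming anything.
  const Token& Lookahead(size_t n) const {
    assert(pos + n < tokens.size());
    return tokens[pos + n];
  }

  // Disambiguate two productions with one extra token of lookahead,
  // instead of restructuring the whole grammar.
  bool LooksLikeBinding() const {
    return Lookahead(0).type == TokenType::Identifier &&
           Lookahead(1).type == TokenType::Equal;
  }
};

int main() {
  SketchParser p{{{TokenType::Identifier}, {TokenType::Equal}, {TokenType::Number}}};
  assert(p.LooksLikeBinding());
  return 0;
}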

Furthermore, many of the grammar gymnastics required to parse with ANTLR are complex and easy for new users to break. While a hand-written parser might still require some understanding, the ordering is explicit in code for users to read and learn from.
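
As an illustration of what "explicit in code" can look like for operator precedence, here is a generic precedence-climbing sketch. This is not the PR's op_table.h; it is a self-contained toy where the precedence table is plain data:

#include <cassert>
#include <map>
#include <sstream>
#include <string>

// A generic precedence-climbing evaluator for +, -, *, /.
// The precedence table is plain data a reader can inspect and extend.
static const std::map<char, int> kPrecedence = {{'+', 1}, {'-', 1}, {'*', 2}, {'/', 2}};

struct ExprParser {
  std::istringstream in;
  explicit ExprParser(const std::string& s) : in(s) {}

  long ParsePrimary() {
    long value = 0;
    in >> value;
    return value;
  }

  long ParseBinary(int min_prec) {
    long lhs = ParsePrimary();
    while (true) {
      char op = static_cast<char>(in.peek());
      auto it = kPrecedence.find(op);
      if (it == kPrecedence.end() || it->second < min_prec) break;
      in.get();  // consume the operator
      long rhs = ParseBinary(it->second + 1);  // binds left-associatively
      switch (op) {
        case '+': lhs += rhs; break;
        case '-': lhs -= rhs; break;
        case '*': lhs *= rhs; break;
        case '/': lhs /= rhs; break;
      }
    }
    return lhs;
  }
};

int main() {
  ExprParser parser("1+2*3-4");
  assert(parser.ParseBinary(1) == 3);  // 1 + (2*3) - 4
  return 0;
}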

ANTLR is also a painful deployment dependency, as we need Java, Python, and C++ to build the current parser. Furthermore, the parser necessitated a rewrite given that it was written in Python and needs to be in C++ or another statically linkable language.

Finally, error reporting is the main reason to write it by hand: if you look at most production-quality compilers, they have hand-written parsers, mostly for error reporting and recovery reasons. Most generated parsers fail on the first invalid token or parse issue, such as an invalid identifier. The above parser can continue even after encountering a parse error, enabling better error reporting.
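
A rough sketch of the recovery idea (this is only a toy and does not reflect the PR's diagnostic machinery): instead of aborting on the first bad character, record a diagnostic and keep scanning so every problem in the input is reported at once:

#include <cctype>
#include <iostream>
#include <string>
#include <vector>

struct Diagnostic {
  size_t position;
  std::string message;
};

// Parse a comma-separated list of single digits ("1,2,3"), collecting
// diagnostics for bad elements instead of stopping at the first one.
std::vector<Diagnostic> ParseDigitList(const std::string& src) {
  std::vector<Diagnostic> diagnostics;
  for (size_t i = 0; i < src.size(); i += 2) {
    if (!std::isdigit(static_cast<unsigned char>(src[i]))) {
      diagnostics.push_back({i, "expected a digit"});
      // Recovery: fall through and continue with the next element so
      // later errors in the same input are also reported.
    }
  }
  return diagnostics;
}

int main() {
  for (const auto& d : ParseDigitList("1,x,3,y,5")) {
    std::cout << "error at position " << d.position << ": " << d.message << "\n";
  }
  return 0;
}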

In my experience, compilers which use parser generators (e.g., OCaml or F*) have a horrible user experience when compared to languages with hand-rolled parsers such as Rust or Lean.

@jroesch marked this pull request as ready for review June 30, 2020 00:03
@jroesch (Member, Author) commented Jun 30, 2020

I just marked this as ready for review; my suggestion is that we review the existing code and land it in an experimental state. I will finish the metadata parsing and integration tests on real models in follow-up work. My fear is that if I do it all on this branch, we are looking at 4k+ lines of code to review at once. The PR is already pretty big and will take time to review. Thoughts?

@jroesch (Member, Author) commented Jun 30, 2020

cc @antinucleon and @jwfromm

@zhiics (Member) commented Jun 30, 2020

I agree we should incrementally add support for these language features to make review smoother.

@tqchen (Member) commented Jun 30, 2020, leaving inline review comments on src/parser/op_table.h, src/parser/parser.cc, and the parser tests.

@electriclilies (Contributor) left a comment

In general, more documentation would be helpful, especially in parser.cc, describing how the various classes fit together. Also, the Lookahead function in parser.cc seemed a bit weird / broken; I left some comments about it there.

@ANSHUMAN87 (Contributor) commented

@jroesch: Thanks for the PR! Great work 👍
I totally agree with the motivations you have mentioned behind this.
I am sorry, I could not find an RFC for this PR. Would it be possible to share an initial high-level design (HLD) draft?

@jroesch (Member, Author) commented Jul 6, 2020

@ANSHUMAN87 I have been super busy and will post one soon.

@jroesch (Member, Author) commented Jul 7, 2020

@ANSHUMAN87 here are some initial details: https://discuss.tvm.ai/t/rfc-meta-rfc-3-pronged-plan-for-improving-error-messages-in-tvm/7214

@jroesch (Member, Author) commented Jul 7, 2020

Okay, I addressed the vast majority of the comments directly and hopefully got everything. CI is building; it would be great if people can do another pass.

/*! Conditionally consume a token when it matches, this will never trigger an error
* as we guard against consuming the token before we do.
*
* Useful for matching optional tokens, effectively looksahead by one.
Contributor:

Suggested change
* Useful for matching optional tokens, effectively looksahead by one.
* Useful for matching optional tokens, effectively looks ahead by one.
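
For reference, a minimal sketch of what a conditional-consume helper with this contract might look like. The name WhenMatch and the member layout below are assumptions for illustration; Peek and Consume do appear elsewhere in the diff, but with different signatures:

#include <cassert>
#include <vector>

// Stand-in token machinery, assumed for illustration only.
enum class TokenType { Let, Identifier, Semicolon, EndOfFile };

struct SketchParser {
  std::vector<TokenType> tokens;
  size_t pos = 0;

  TokenType Peek() const { return tokens[pos]; }

  void Consume(TokenType expected) {
    // In the real parser this would raise a diagnostic on mismatch;
    // here we simply assert because the caller guards with Peek().
    assert(Peek() == expected);
    pos++;
  }

  // Conditionally consume a token when it matches: never an error,
  // because we only consume after checking. Effectively one token of lookahead.
  bool WhenMatch(TokenType expected) {
    if (Peek() == expected) {
      Consume(expected);
      return true;
    }
    return false;
  }
};

int main() {
  SketchParser p{{TokenType::Semicolon, TokenType::EndOfFile}};
  assert(p.WhenMatch(TokenType::Semicolon));   // optional token present: consumed
  assert(!p.WhenMatch(TokenType::Semicolon));  // absent: no error, nothing consumed
  return 0;
}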

Comment on lines 461 to 495
/*! \brief Convert a numeric token to an NDArray for embedding into the Relay program. */
NDArray NumberToNDArray(const Token& token) {
  if (token->token_type == TokenType::Integer) {
    DLContext ctx({.device_type = DLDeviceType::kDLCPU, .device_id = 0});
    auto dtype = String2DLDataType("int32");
    auto data = NDArray::Empty({}, dtype, ctx);
    auto array = reinterpret_cast<int32_t*>(data->data);
    // revisit this, literal node issue.
    int64_t value = Downcast<tvm::Integer>(token->data);
    array[0] = (int32_t)value;
    return data;
  } else if (token->token_type == TokenType::Float) {
    DLContext ctx({.device_type = DLDeviceType::kDLCPU, .device_id = 0});
    auto dtype = String2DLDataType("float32");
    auto data = NDArray::Empty({}, dtype, ctx);
    auto array = reinterpret_cast<float*>(data->data);
    // revisit this, literal node issue.
    float value = Downcast<tvm::FloatImm>(token->data)->value;
    array[0] = value;
    return data;
  } else {
    LOG(FATAL) << "internal error: should only call this function on numeric tokens";
    return NDArray();
  }
}

/*! \brief Convert a boolean value to an NDArray for embedding into the Relay program. */
NDArray BooleanToNDarray(bool value) {
  DLContext ctx({.device_type = DLDeviceType::kDLCPU, .device_id = 0});
  auto dtype = String2DLDataType("bool");
  auto data = NDArray::Empty({}, dtype, ctx);
  auto array = reinterpret_cast<bool*>(data->data);
  array[0] = value;
  return data;
}
Contributor:

does it make sense to refactor this?

Member Author:

WDYM?

Contributor:

with the method above. seems like there's shared structure at a glance

Member Author:

There isn't really any easy way to refactor this, because you can't really templatize the code cleanly due to the need to pass dtypes around and perform the correct casting based on dtype and container type.

Contributor:

sounds good. wasn't sure

Comment on lines 599 to 617
SemVer ParseSemVer() {
  // TODO(@jroesch): convert semver to module level attribute.
  auto id = Peek();
  if (id->token_type == TokenType::Identifier && id.ToString() == "v0") {
    auto id = Match(TokenType::Identifier);
    Consume(TokenType::Period);
    // CHECK_EQ(minor_and_patch)
    Consume(TokenType::Float);
  }
  // For now we only support current version.
  return SemVer{.major = 0, .minor = 0, .patch = 4};
}
Contributor:

even if we only support the current version, we should still validate the given version matches that, right?

Member Author:

There are annoying issues with how this is done right now. I would like to move away from some ugly lexing hacks, but in order to do that I need to change the semver handling. I would like to introduce module-level attributes and provide general parsing for those instead of continuing to hack this in. I will make sure this works before we purge the old parser.

@weberlo (Contributor) commented Jul 7, 2020:

Ohh, so the TODO above means we remove the semver from the text format? We can discuss whether or not to do so later, but until we do, we should at least have a hack that checks for "v0.0.4", rather than the current half measure.

@jroesch (Member, Author) commented Jul 7, 2020:

No, the problem is that that isn't a valid token, and trying to hack it in is going to be a huge hack because it's incredibly contextual and overlaps with a lot of other lexing rules. I don't really want to do it given that I WILL rip it out soon, and the old parser is still in place for now.

Contributor:

okay. if it's a big change, then we can leave it as is for now

      case TokenType::Extern: {
        Consume(TokenType::Extern);
        // TODO(@jroesch): add some validation here?
        defs.types.push_back(ParseTypeDef());
Contributor:

I'm not sure it can be validated, since it's opaque. Unless you mean something else?

Member Author:

The parser is the only place we can reject this if it has non-zero fields. I will come back to this in the final cleanup; I'm trying to land the initial infra first, and then we can do a polish pass or two.

Contributor:

Oh, my bad, you meant to ensure there aren't any constructors. That should be a two-line fix, right? Just store the parsed def, then CHECK_EQ(def->constructors.size(), 0) before pushing it back.
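
A toy sketch of the shape of that check (the Relay types are stand-ins here; in the real parser this would be Relay's TypeData with its constructors array, and the check would go through the diagnostic machinery rather than an assert):

#include <cassert>
#include <string>
#include <vector>

// Stand-ins for Relay's Constructor and TypeData, for illustration only.
struct Constructor {
  std::string name;
};

struct TypeData {
  std::string header;
  std::vector<Constructor> constructors;
};

// Shape of the suggested fix: parse the definition first, check that an
// extern type declares no constructors, and only then record it.
void AddExternType(std::vector<TypeData>* types, const TypeData& def) {
  assert(def.constructors.empty() && "extern types cannot declare constructors");
  types->push_back(def);
}

int main() {
  std::vector<TypeData> types;
  AddExternType(&types, TypeData{"MyExternType", {}});
  return 0;
}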

Member Author:

Yeah, we need to do proper error handling (which is the goal of my next PR, so I figured I would do it all in one).

Contributor:

sounds good

@zhiics (Member) left a comment

LGTM. @tqchen could you take another look?

@tqchen merged commit f9e905a into apache:master Jul 8, 2020
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Jul 14, 2020
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Jul 14, 2020
@jroesch deleted the parser-2.0 branch February 4, 2021 04:41