New Parser #262

Merged
merged 65 commits into from
Jan 23, 2023

Conversation

AntonLydike
Collaborator

First of all, I want to note that none of this is final, and that I welcome any criticism and discussion on all parts of this! This is my "getting to know xDSL" project, so my understanding of the underlying principles is still basic at best. I hope the following makes sense.

The new parser is aimed at making it easier to:

  • Change existing syntax (as Mathieu isn't that happy with the current xDSL syntax, or MLIR's for that matter)
  • Reason about the parsing code, fix bugs, introduce features, etc.
  • Implement custom parsing/printing for Attributes/Types
  • Move closer to 100% MLIR<->xDSL compat
  • Produce very nice error messages

For that a couple of things were implemented:

  1. The parser now operates on Spans over the input, meaning location information stays attached to the string snippets we move around
  2. Backtracking built into the parser from the ground up, using:
    with self.tokenizer.backtracking():
       # do stuff
  3. A BNF-like meta-programming layer was introduced
    1. Complex structures can be written in a BNF-like notation in the parser to make analysis/changes easier
    2. The BNF trees should be able to handle parsing and printing
    3. In the future we want to automatically generate Attribute and Operation Parsing/Printing either from user-supplied BNF notation or from the tablegen spec
    4. Will allow custom parsers to drop into their own parsing routines whenever they want
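
Points 1 and 2 above can be illustrated with a minimal sketch. The names `Span`, `Tokenizer`, `expect` and `ParseError` are hypothetical and chosen for illustration only; they are not taken from the xDSL code in this PR. The assumption is that backtracking means "remember the position, and rewind it if parsing fails inside the `with` block".

```python
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A slice of the input that remembers where it came from."""
    start: int
    end: int
    source: str

    @property
    def text(self) -> str:
        return self.source[self.start:self.end]


class ParseError(Exception):
    pass


class Tokenizer:
    """Hypothetical tokenizer: hands out Spans and can rewind its position."""

    def __init__(self, source: str):
        self.source = source
        self.pos = 0

    def expect(self, literal: str) -> Span:
        # Skip whitespace, then require `literal` at the current position.
        while self.pos < len(self.source) and self.source[self.pos].isspace():
            self.pos += 1
        if not self.source.startswith(literal, self.pos):
            raise ParseError(f"expected {literal!r} at offset {self.pos}")
        span = Span(self.pos, self.pos + len(literal), self.source)
        self.pos = span.end
        return span

    @contextmanager
    def backtracking(self):
        # Remember the position; on a ParseError, rewind and swallow the error.
        saved = self.pos
        try:
            yield
        except ParseError:
            self.pos = saved


tok = Tokenizer("( ]")
with tok.backtracking():
    tok.expect("(")
    tok.expect(")")  # fails, so the "(" consumed above is given back
```

After the `with` block, `tok.pos` is back at the start of the input, so a different alternative can be tried from the same place. Because every `Span` carries its offsets, an error message can point at the exact location in the source.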

The BNF stuff:

There are a couple different motivating factors, and it wasn't easy to find something that satisfied them all:

  • Make it easy to reason about parsing correctness
  • Make it easy to implement custom parsers (and printers?)
  • Make it easy to convert from tablegen format

I also had the weird dream of generating printer and parser out of a single spec.

So let's begin by addressing the first three points. For that, I sketched out the parsing of a generic operation:

generic_operation = BNF.Group([  
    BNF.Nonterminal('string-literal', bind="name"),  
    BNF.Literal('('),  
    BNF.ListOf(BNF.Nonterminal('value-id'), bind='args'),  
    BNF.Literal(')'),  
    BNF.OptionalGroup([  
        BNF.Literal('['),  
        BNF.ListOf(BNF.Nonterminal('block-id'), allow_empty=False, bind='blocks'),  
        # TODO: allow for block args here?! (according to spec)
        BNF.Literal(']')  
    ]),  
    BNF.OptionalGroup([  
        BNF.Literal('('),  
        BNF.ListOf(BNF.Nonterminal('region'), bind='regions', allow_empty=False),  
        BNF.Literal(')')  
    ]),  
    BNF.Nonterminal('attr-dict', bind='attributes'),  
    BNF.Literal(':'),  
    BNF.Nonterminal('function-type', bind='type_signature')  
])

Note that a BNF.Literal represents a fixed string in the input, and is not to be confused with the parsing of e.g. a string-literal (so "some arbitrary string here")

Each Nonterminal calls an underlying function in the parser.

This is not exactly pure BNF. Instead, I provided something that I feel is easier to use. For example, Optional and Group are often combined, so there is a wrapper for that. Parsing lists is also much more comfortable now using ListOf, which takes a containing token and a regex separator, and can be configured to either allow or disallow empty lists.

There are also no plans to provide an OR (so basically ( something | something-else )), as this would blow up the parsing complexity.

Extracting parsed fields is done through the bind=<name> attributes on the nodes. After parsing is complete, you get a dictionary where the fields <name> are populated with the parsing results of that parser.
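
As a rough illustration of how bind could populate a result dictionary, here is a self-contained sketch. All classes below (`Input`, `Literal`, `Regex`, `Group`, `ListOf`) are hypothetical stand-ins written for this example; in particular, `Regex` replaces `BNF.Nonterminal`, which in the PR calls a parser function rather than matching a pattern.

```python
import re


class Input:
    """Hypothetical cursor over the source text."""

    def __init__(self, text):
        self.text, self.pos = text, 0

    def match(self, pattern):
        # Skip whitespace, then try to match `pattern` at the cursor.
        m = re.compile(r"\s*(" + pattern + ")").match(self.text, self.pos)
        if m is None:
            return None  # the cursor is left where it was
        self.pos = m.end()
        return m.group(1)


class Literal:
    def __init__(self, string):
        self.pattern = re.escape(string)

    def parse(self, inp, out):
        return inp.match(self.pattern) is not None


class Regex:
    """Stand-in for a Nonterminal: matches a regex and optionally binds it."""

    def __init__(self, pattern, bind=None):
        self.pattern, self.bind = pattern, bind

    def parse(self, inp, out):
        value = inp.match(self.pattern)
        if value is None:
            return False
        if self.bind:
            out[self.bind] = value
        return True


class Group:
    def __init__(self, children):
        self.children = children

    def parse(self, inp, out):
        return all(child.parse(inp, out) for child in self.children)


class ListOf:
    def __init__(self, element, bind=None, separator=","):
        self.element, self.bind = element, bind
        self.separator = re.escape(separator)

    def parse(self, inp, out):
        items = []
        first = inp.match(self.element.pattern)
        if first is not None:
            items.append(first)
            while inp.match(self.separator) is not None:
                item = inp.match(self.element.pattern)
                if item is None:
                    return False  # dangling separator
                items.append(item)
        if self.bind:
            out[self.bind] = items  # the whole list is bound at once
        return True


grammar = Group([
    Regex(r'"[^"]*"', bind="name"),
    Literal("("),
    ListOf(Regex(r"%\w+"), bind="args"),
    Literal(")"),
])

result = {}
ok = grammar.parse(Input('"arith.addi"(%a, %b)'), result)
```

Here `result` ends up with one entry per bound name: `name` holds the matched string literal and `args` holds the list of value ids. Binding on the `ListOf` itself, rather than on its element, is what keeps all items instead of only the last one.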

Note: There are constraints on what makes "sense" to bind to. If, for example, you bind inside a ListOf, only the last element ends up in the output dictionary. Instead, you should probably bind the ListOf itself? I am not sure though, because you can, theoretically, have arbitrary BNF inside the ListOf, which would take away all the simplicity we gained from bind. This is not a solved problem yet.

Printing with the Parser?

This whole bind stuff gave me the idea that it might now be possible to go from dict[<name>, <value>] and the BNF tree back to the source code, and implement parsing and printing in one!

On the surface it seems possible, but sadly it is a lot more complicated than this. How do we decide when to print an OptionalGroup? What do we do with a ListOf that contains a Group? Or an OptionalGroup? I have some ideas, but it's not very straightforward.
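
To make the OptionalGroup question concrete, here is one possible rule sketched in code: print an optional group only if every name bound inside it has a non-empty value. This rule is an assumption made for the example, not something decided in this PR, and the classes (`Bound`, `render`, and so on) are hypothetical.

```python
class Literal:
    def __init__(self, string):
        self.string = string

    def names(self):
        return set()

    def render(self, values):
        return self.string


class Bound:
    """Stand-in for a Nonterminal with bind=<name>; prints the bound value."""

    def __init__(self, name):
        self.name = name

    def names(self):
        return {self.name}

    def render(self, values):
        return str(values[self.name])


class ListOf(Bound):
    def render(self, values):
        return ", ".join(values[self.name])


class Group:
    def __init__(self, children):
        self.children = children

    def names(self):
        return set().union(*(c.names() for c in self.children))

    def render(self, values):
        return "".join(c.render(values) for c in self.children)


class OptionalGroup(Group):
    def render(self, values):
        # Assumed rule: emit the group only if every bound name in it has a
        # non-empty value; otherwise print nothing.
        if all(values.get(name) for name in self.names()):
            return super().render(values)
        return ""


op = Group([
    Bound("name"),
    Literal("("), ListOf("args"), Literal(")"),
    OptionalGroup([Literal(" ["), ListOf("blocks"), Literal("]")]),
])

with_blocks = op.render({"name": '"cf.br"', "args": ["%x"], "blocks": ["^bb1"]})
without_blocks = op.render({"name": '"cf.br"', "args": ["%x"], "blocks": []})
```

The rule breaks down quickly: an OptionalGroup that binds nothing would always be printed, and a ListOf wrapping a Group has no single bound value to join. Those are the cases that make a shared parser/printer spec hard.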

The best thing might be to restrict the nesting complexity of the BNF to allow for good ergonomics there. Let's see. I only just got "here" yesterday evening.

On Error Messages:

They currently don't fulfill their promise. At all. Sorry about that. It's all still very wonky right now!

@codecov

codecov bot commented Dec 9, 2022

Codecov Report

Base: 88.73% // Head: 88.53% // Decreases project coverage by -0.20% ⚠️

Coverage data is based on head (3c4349a) compared to base (b6678ee).
Patch coverage: 84.79% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #262      +/-   ##
==========================================
- Coverage   88.73%   88.53%   -0.20%     
==========================================
  Files          64       65       +1     
  Lines        7864     8017     +153     
  Branches     1286     1270      -16     
==========================================
+ Hits         6978     7098     +120     
- Misses        631      660      +29     
- Partials      255      259       +4     
Impacted Files Coverage Δ
xdsl/parser.py 83.60% <ø> (-0.47%) ⬇️
tests/test_attribute_definition.py 95.36% <61.53%> (ø)
xdsl/utils/exceptions.py 68.62% <61.53%> (-23.69%) ⬇️
xdsl/dialects/builtin.py 82.57% <73.07%> (+1.19%) ⬆️
xdsl/dialects/llvm.py 91.48% <80.00%> (ø)
tests/test_parser_error.py 93.02% <83.33%> (-6.98%) ⬇️
xdsl/ir.py 84.69% <93.33%> (-0.06%) ⬇️
tests/test_printer.py 99.07% <96.66%> (ø)
tests/test_attribute_builder.py 99.22% <100.00%> (ø)
tests/test_ir.py 100.00% <100.00%> (ø)
... and 13 more


])


class MlirParser:
Member

Suggested change
- class MlirParser:
+ class MLIRParser:

methods marked try_... will attempt to parse, and return None if they failed. If they return None
they must make sure to restore all state.

methods marked must_... will do greedy parsing, meaning they consume as much as they can. They will
Member

"must parse" reads like a boolean expression to me, it took a little while to understand that it was actually parsing the text. Maybe replace with either parse or parse_greedy to make into an active verb phrase?
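
For readers unfamiliar with the convention under discussion, the following sketch contrasts the two prefixes. `MiniParser` and `ParseError` are hypothetical, and the assumption that a must_ method raises on failure is mine, since the quoted docstring is cut off before it says so.

```python
import re
from typing import Optional


class ParseError(Exception):
    pass


class MiniParser:
    """Hypothetical parser illustrating the try_/must_ naming convention."""

    def __init__(self, text: str):
        self.text = text
        self.pos = 0

    def try_parse_integer(self) -> Optional[int]:
        # Returns None on failure and leaves self.pos untouched.
        match = re.compile(r"\s*(-?\d+)").match(self.text, self.pos)
        if match is None:
            return None
        self.pos = match.end()
        return int(match.group(1))

    def must_parse_integer(self) -> int:
        # Same parse, but failure is an error instead of a None result
        # (assumed behaviour for this sketch).
        value = self.try_parse_integer()
        if value is None:
            raise ParseError(f"expected integer at offset {self.pos}")
        return value


parser = MiniParser("42 foo")
first = parser.must_parse_integer()
second = parser.try_parse_integer()  # "foo" is not an integer
```

The try_ call returns `None` and leaves the position where it was, so the caller can attempt something else. The must_ call is for places where nothing else is acceptable.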

@webmiche
Collaborator

webmiche commented Dec 9, 2022

Puh, honestly this PR is a mouthful, so let me start at the top (or the thing I feel like I understand somewhat): The BNF description. I am reading this as the BNF description of the xDSL representation, not the MLIR one.

I am personally more familiar with EBNF which, AFAIK, is the same, but with some "syntactic sugar". Therefore, I will argue with EBNF. If something is unclear, please ask!

AFAIU, Optional and Group correspond to a choice and a list. In EBNF, we use { } in this case, so just the list, as a not-taken choice is the same as repeating 0 times in a way. Therefore, I feel like you might not even need to combine into an OptionalGroup, a Group should be enough (if it allows to be taken 0 times).

Next, I am confused that blocks are not nested into regions, and that they get the [ ] symbols. These symbols are used for the attributes in the current representation. Are you suggesting to change the representation of operations or is this an oversight?

And why is the entire thing wrapped inside a group?

Fundamentally, I feel that it really should be possible to generate parsers and printers from this, if we are a little bit careful about how we write the grammars. Not sure whether you are familiar with the concept of left/right-recursive grammars and its implications on parsing such grammars.

Anyway, will jump into the code now, but I don't believe I will be able to fully review/understand this today. I will probably revisit it on Monday :)

@AntonLydike
Collaborator Author

AntonLydike commented Dec 9, 2022

Hey, thanks @webmiche, I don't think you want to/should jump into the code right now. This PR is very WIP. It's more meant to be a "Hey, I'm working on this feature right now and am thinking about these concepts", and to get some feedback on the concepts.

Edit: Had the wrong mention here

@math-fehr
Collaborator

Next, I am confused that blocks are not nested into regions, and that they get the [ ] symbols. These symbols are used for the attributes in the current representation. Are you suggesting to change the representation of operations or is this an oversight?

Just to quickly respond to that, I think that this is the syntax for the successors, not the blocks themselves!

@webmiche
Collaborator

webmiche commented Dec 9, 2022

Ah yes. But aren't we currently printing successors wrapped into ( )?

@math-fehr
Collaborator

I think we still wrap them in [].
Though I plan on drastically changing the IR syntax, to match almost the one from MLIR (besides 1-2 changes).
The only changes I would like to keep are that the attributes and the operation type are written before the regions, so you don't have to jump to the other side of the IR to see them.
The other thing is maybe changing the region syntax, since I feel like the MLIR one (({}, {}, {})) is confusing, especially for users used to using {}, which will be parsed as an attribute dictionary.
The hope is that this PR will make it obvious how to change it, and make it less error prone!

@webmiche
Collaborator

webmiche commented Dec 9, 2022

Looking at https://github.com/xdslproject/xdsl/blob/main/tests/filecheck/cf_ops.xdsl we use ( ), right? Am I confused? xD

Sure, I agree with you. I also don't like that regions and attributes are wrapped into { } in MLIR. So maybe we could keep the change to [ ] for attrs?

@math-fehr
Collaborator

Looking at https://github.com/xdslproject/xdsl/blob/main/tests/filecheck/cf_ops.xdsl we use ( ), right? Am I confused? xD

Sure, I agree with you. I also don't like that regions and attributes are wrapped into { } in MLIR. So maybe we could keep the change to [ ] for attrs?

Okay, forget my comments, I'm probably the most confused xD I think [] is for successors in MLIR (but is often changed for the custom constraints).
I'm still not sure what is the best syntax for attributes, since people from the Python world would prefer {}, though [] is removing the ambiguous syntax that MLIR has.

@webmiche
Collaborator

webmiche commented Dec 9, 2022

Well, at the end of the day, aren't we all confused? xD

Okay, yes I agree that { } makes sense in the python world as this basically is a dictionary attached to an op...
On the other hand, people from Java/C/C++ feel like { } implies something like a function/nesting, which is pretty much a region...

I guess people from the python world can also look at it as a list of tuples, so there is a weak argument for [ ] 😅

@georgebisbas georgebisbas added the dialects Changes on the dialects label Jan 4, 2023
@tobiasgrosser tobiasgrosser added the hackathon To be tackled at the hackathon label Jan 7, 2023
@webmiche webmiche added xdsl xdsl framework specific changes and removed dialects Changes on the dialects labels Jan 10, 2023
@wence-

wence- commented Jan 11, 2023

A flyby comment (I don't think I'm going to make it up the hill to the hackathon this week, sorry). Is there a reason that you are not using an existing package (for example lark) for the parsing infrastructure? It seems to me that would help quite a bit because you just need to define the grammars and translation of parsed trees into XDSL (using their existing tree-visitor infrastructure).

@math-fehr
Collaborator

So the reason we cannot use most parser/printer generators is that we need to use arbitrary Python for the grammar (for attributes and operations).
If Lark allows executing arbitrary Python (without too much of a hassle), then I would say it's worth looking at!

@AntonLydike AntonLydike marked this pull request as draft January 11, 2023 16:43
@AntonLydike AntonLydike force-pushed the anton/parser-printer-rework branch from 726dba6 to a69de90 Compare January 13, 2023 16:13
@AntonLydike
Collaborator Author

Current state:

Filecheck:

Testing Time: 3.92s
  Unsupported:  3
  Passed     : 19
  Failed     : 25

Pytest:
4 failed, 322 passed, 1 skipped

@AntonLydike
Collaborator Author

I removed the BNF stuff from this PR as well, the design was not ready, and the PR is big enough as is.

@AntonLydike AntonLydike force-pushed the anton/parser-printer-rework branch 3 times, most recently from 77ec5ca to 2fcfedb Compare January 18, 2023 17:19
@AntonLydike AntonLydike marked this pull request as ready for review January 18, 2023 18:17
Collaborator

@webmiche webmiche left a comment

Could this be split into multiple PRs? I just scrolled through the text in the parser file and added the most obvious comments, but I cannot review the actual functionality like this, it is just too much.

tests/test_ir.py Outdated
- parser = Parser(ctx, program_func)
- program: ModuleOp = parser.parse_op()
+ parser = XDSLParser(ctx, program_func)
+ program: ModuleOp = parser.must_parse_operation()
Collaborator

I feel that must_parse_operation sounds like an awful name for an API you want to expose. Maybe rename or wrap into a properly named function?

Collaborator Author

I think you are correct. There already is a function called begin_parsing which is meant to be called from outside to parse a file. I'll get to it!

Collaborator Author

I just realized that this breaks some tests. Specifically, begin_parse makes sure that the operation parsed is a module_op. Some tests don't wrap their input in a builtin.module, and are therefore not "valid" xdsl/mlir programs.

I changed back to using must_parse_operation, as we want to parse just a single operation here. I don't think we should expose an interface like parse_op.

Collaborator Author

I think that must_parse_op is not an API the parser wants to expose. It's only meant to be used internally by the parser. The test just has to use it because it isn't using the "whole" parser. If that makes sense?

Collaborator

If it's not an exposed API, preface it with _ please. (Python coding standards)

tests/test_mlir_printer.py Outdated
attr = DictionaryAttr.from_dict(data)

with StringIO() as io:
    Printer(io).print(attr)
Collaborator

I don't think this makes sense as this makes the Parser tests depend on the Printer. I feel that parser tests should really take strings and the data structure that is expected in order to test.

Collaborator Author

Valid point. I'll change that

Collaborator Author

I just noticed that most of the printer tests rely on the parser as well. What's up with that? Why is that more okay?

tests/test_printer.py Outdated
xdsl/ir.py Outdated
xdsl/xdsl_opt_main.py Outdated
import itertools
import re
import sys
import traceback
Collaborator

Are any of these new dependencies?

Collaborator Author

These are all builtin python modules

Contributor

I am surprised the parser needs the python ast?

Collaborator Author

We use it to evaluate string literals, which is actually quite tricky. Instead of reinventing the wheel, I looked for a stdlib function that handles that for us. That's why I used ast.literal_eval. We also could have used json.loads, or something else. I just found literal_eval first while searching/thinking about it.
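
A small sketch of what the two stdlib options do with an escaped string literal. The input string is made up for the example, and whether either function matches MLIR's string escaping rules exactly is not something this sketch checks.

```python
import ast
import json

# The raw text as it would appear in the input, including the quotes.
source = r'"line one\nline two \"quoted\""'

via_ast = ast.literal_eval(source)   # interprets Python literal syntax
via_json = json.loads(source)        # interprets JSON string syntax

# The two differ on inputs outside their common subset, for example
# single-quoted strings, which only Python accepts.
single_quoted = ast.literal_eval("'single'")
try:
    json.loads("'single'")
    json_accepts_single_quotes = True
except json.JSONDecodeError:
    json_accepts_single_quotes = False
```

For this input both calls return the same unescaped string. Note that `ast.literal_eval` also accepts non-string literals such as numbers and tuples, so a caller that expects a string would need to check the result type.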

@AntonLydike
Collaborator Author

@webmiche How would you go about splitting something like this into smaller PRs? I'm genuinely curious, I can't think of a sensible way.

I could sit down with you and give you a high-level overview over the concepts, if that would help? The plan was to do the review during the Hackathon, sadly I couldn't get it done in time :(

@webmiche
Collaborator

So I guess the issue is that a lot of tests need a full parser in order to run, right?

I could imagine having a branch without the old parser and with all tests marked as UNSUPPORTED and then basically upstream into that. So start out with an xDSLParser that can just about parse an empty module and then develop that by adding more functionality, enabling tests along the way. And once you pass a good amount of tests, we can merge that parser into the main branch and continue upstreaming there.

Or just remove everything that is around the MLIRParser. That should be relatively little anyhow. Then we can let that still flow through the old infrastructure, or maybe not support it at all. I think this would already cut down the number of lines by a lot.

Also notice that you are pretty much removing the old parser, so GitHub diffs look extremely bad. Simply renaming the parser file might already be quite an improvement on that side.

@AntonLydike
Collaborator Author

We can move the parser back to parser_ng.py (I originally developed it there), which would make the diff much more readable. The problem with that is that we would then have two completely different parsers in the same codebase. This might be a worthwhile tradeoff for git diff visibility though, as the old parser can be removed relatively easily in a follow-up PR. I can make the change if you want @webmiche

@AntonLydike AntonLydike force-pushed the anton/parser-printer-rework branch from c3b2915 to 868e509 Compare January 23, 2023 16:38
Collaborator

@math-fehr math-fehr left a comment

Nice! I think we can merge it now!

@webmiche webmiche merged commit b09e94e into main Jan 23, 2023
@tobiasgrosser tobiasgrosser deleted the anton/parser-printer-rework branch January 24, 2023 06:58
Labels
hackathon To be tackled at the hackathon xdsl xdsl framework specific changes
7 participants