Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch Parsing Algorithm From Pest(PEG) to LALRPOP(LALR(1)) #303

Merged
merged 47 commits into from
Oct 25, 2022

Conversation

InsertCreativityHere
Copy link
Member

@InsertCreativityHere InsertCreativityHere commented Oct 17, 2022

Issues Fixed

Fixes #277

Pest eagerly consumed whitespace after a doc comment giving bogus spans.
LALRPOP ignores the whitespace and only include the comment's text in it's spans.

Fixes #233

The new parser no longer has a grammar rule for block doc comments.
All places where we used/tested them have been removed.

Implements & Closes #131

Attributes can be placed on both members and their types:
op([cs::identifier("foo")] myParam: [cs::generic("bar")] sequence<MyType>);
Type attributes must be placed on the type, and entity attributes must be placed on the identifier.
IMPORTANT: This PR doesn't add validation/testing for this. We should add validation/testing for this.

Closes #53

This issue was Pest-specific and is now pointless.

Explanation

This PR switches the parsing library we use from Pest to LALRPOP.
We still use Pest specifically for parsing doc comments, but this will be removed in the future. One step at a time!

This adds 2 'folder' modules: one for preprocessing and one for slice. Each folder contains:

  • grammar.lalrpop: Defines the grammar rules used by LALRPOP.
  • grammar.rs: Pulls in the generated LALRPOP code and defines helper functions.
  • lexer.rs: Converts a String of text into a stream (iterator) of tokens. This stream is used by the parser.
  • parser.rs: Parses a token stream into something
  • tokens.rs: Defines/lists the tokens/errors that the lexer and parser can return/work with.

Control Flow:

  1. Read the contents of a file into a String.
  2. Create a preprocessor-lexer from the string. This tokenizes the string into preprocessor tokens ('#', 'undef', '&&') and source blocks (Slice source code that isn't part of a preprocessor directive).
    The lexer is just an iterator, so calling 'lexer.next()' returns the next token/sourceblock.
  3. Pass this lexer into a Preprocessor and call parse_slice_file.
    This evaluates the preprocessor directives and returns an iterator of source blocks.
    Any blocks that are conditionally compiled out are not in this iterator.
  4. Pass the source blocks to a slice-lexer. This tokenizes everything into slice tokens ('module', '?', 'identifier').
  5. Pass this lexer into a Parser and call parse_slice_file.
    This parses the slice and returns a SliceFile.

This is the source code that does this:
https://github.com/InsertCreativityHere/icerpc/blob/9b188ce56567a10d15eacbb6f13b608db2a329aa/src/parser/slice.rs#L38-L46

// Run the raw text through the preprocessor.
let mut defined_symbols = HashSet::new();
let mut preprocessor = crate::parsers::Preprocessor::new(file, &mut defined_symbols, self.diagnostic_reporter);
let preprocessed_text = preprocessor.parse_slice_file(raw_text)?;

// Run the preprocessed text through the parser.
let mut parser = crate::parsers::Parser::new(file, ast, self.diagnostic_reporter);
let (file_encoding, file_attributes, modules) = parser.parse_slice_file(preprocessed_text)?;

Other Changes

  • Adds LALRPOP as a dependency and a build script for the grammar files. They're compiled, not interpreted like Pests.
  • Removes the parent patcher. The parsing algorithm that LALRPOP uses makes it possible to set parents during parsing, instead of doing it later as a separate step.
  • Removes the new and add_x methods from grammar elements. Also no longer necessary.
  • Drops the raw_value field from Identifier. Literally wasn't used anywhere.
  • Removes the old Pest parsing code. Except for doc comment parsing. Another PR.

@ReeceHumphreys
Copy link
Contributor

Just ran and pushed a commit that runs rustfmt over the repo. All of its formatting suggestions are now in one commit so it should be nice and easy to find any bad suggestions!

src/parser/slice.rs Outdated Show resolved Hide resolved
@ReeceHumphreys
Copy link
Contributor

Looks like icerpc-csharp is not building; got the following errors:

Cannot implicitly convert type 'System.Collections.Generic.Dictionary<int, int>' to 'IceRpc.Tests.Slice.CustomDictionary<int, int>'
Cannot implicitly convert type 'IceRpc.Tests.Slice.CustomDictionary<int, int>' to 'System.Collections.Generic.Dictionary<int, int>'
Cannot implicitly convert type 'int[]' to 'IceRpc.Tests.Slice.CustomSequence<int>'
Cannot implicitly convert type 'IceRpc.Tests.Slice.CustomSequence<int>' to 'int[]'
Cannot convert from 'IceRpc.Tests.Slice.CustomSequence<int>' to 'System.ReadOnlyMemory<int>'

This occurs in DictionaryMappingTests.cs and SequenceMappingTests.cs

@InsertCreativityHere
Copy link
Member Author

Looks like icerpc-csharp is not building; got the following errors:

The new parser unintentionally fixed some bugs in the repo. I had a local patch that fixed them, and opened a PR for them here: icerpc/icerpc-csharp#1923

Co-authored-by: Reece Humphreys <reecewh@icloud.com>
tests/encoding_tests.rs Outdated Show resolved Hide resolved
tests/interfaces/operations.rs Outdated Show resolved Hide resolved
tests/scope_resolution_tests.rs Outdated Show resolved Hide resolved
tests/attribute_tests.rs Outdated Show resolved Hide resolved
tests/attribute_tests.rs Show resolved Hide resolved
InsertCreativityHere and others added 2 commits October 18, 2022 10:54
Co-authored-by: Reece Humphreys <reecewh@icloud.com>
Co-authored-by: Reece Humphreys <reecewh@icloud.com>
Copy link
Member

@externl externl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First batch of comments.

build.rs Show resolved Hide resolved
src/diagnostics/errors.rs Show resolved Hide resolved
src/diagnostics/errors.rs Outdated Show resolved Hide resolved
src/diagnostics/errors.rs Outdated Show resolved Hide resolved
src/parsers/common.rs Show resolved Hide resolved
src/parsers/preprocessor/lexer.rs Show resolved Hide resolved
src/parsers/preprocessor/lexer.rs Show resolved Hide resolved
src/parsers/preprocessor/lexer.rs Show resolved Hide resolved
src/parsers/preprocessor/mod.rs Show resolved Hide resolved
src/parsers/slice/grammar.lalrpop Outdated Show resolved Hide resolved
src/parsers/slice/grammar.lalrpop Outdated Show resolved Hide resolved
Co-authored-by: Joe George <joe@externl.com>
@InsertCreativityHere
Copy link
Member Author

InsertCreativityHere commented Oct 20, 2022

#303 (comment) > I would just fix them to expect a syntax error.

Alright! Sounds good to me!
Going to have to start including links in my quotes, it's getting so hard to find anything!

Keep 'em coming though! The more comments the more chances of catching things : v)

@ReeceHumphreys
Copy link
Contributor

Everythings looking good. I'll approve once the format action passes 😉

@InsertCreativityHere
Copy link
Member Author

I would just fix them to expect a syntax error.

The tests were changed to expect a syntax error. I included the location information even though I'm under the impression we don't actually check it. But I hope that's fine.

@InsertCreativityHere
Copy link
Member Author

I'll approve once the format action passes

I kept a majority of the changes that rustfmt proposed, but the remaining ones seem incorrect to me.
Some examples:

-    ($child:expr, Module, $module_ptr:expr, $parser:expr) => {{

+    ($child:expr,Module, $module_ptr:expr, $parser:expr) => {{

It removes the space between Module and the comma before it. There should always be a space after a comma,
and this is how we actually call the macro too (with the space I mean).

-                // Ensure the next character is also a '|' (since the whole token should be "||").

+                                       // Ensure the next character is also a '|' (since the whole token should be "||").
                 if matches!(self.buffer.peek(), Some('|')) {

It shifts these comments way to the right to try and align them with the line above,
but that's bogus here, the comment is describing the line beneath it.

-        Some(crate::parser::comments::CommentParser::parse_doc_comment(&combined, dummy_span))

+        Some(crate::parser::comments::CommentParser::parse_doc_comment(
+            &combined, dummy_span,
+        ))

I don't think this syntax is correct. Either the function goes on one line, or each parameter gets it's own line.
I couldn't find anything in the style examples or guidelines that suggests this is correct.

@ReeceHumphreys
Copy link
Contributor

ReeceHumphreys commented Oct 24, 2022

It removes the space between Module and the comma before it. There should always be a space after a comma,
and this is how we actually call the macro too (with the space I mean).

Yes this is a bug, we should open an issue on the rustfmt repo.

Either the function goes on one line, or each parameter gets it's own line.

There is an option to configure this behavior for rustfmt. We can change that rule if we think this is bad code.

@InsertCreativityHere
Copy link
Member Author

Yes this is a bug, we should open an issue on the rustfmt repo.

I've narrowed down the problem to their macro-argument-parser, when it recursively sub-parses the match arms:
https://github.com/rust-lang/rustfmt/blob/ef91154250977b3b5d05448dafbca524a1168b47/src/macros.rs#L904-L911
For some reason they only insert whitespace in between successive punctuation or successive identifiers:
https://github.com/rust-lang/rustfmt/blob/ef91154250977b3b5d05448dafbca524a1168b47/src/macros.rs#L829-L845
So:

ident,     // No whitespace, good!
,ident     // No whitespace, weird.
,,        // Inserts a space between then
ii        // Technically would insert a space between them, but I think this state is impossible, 'ii' would be treated as a single identifier.

I'll open a bug in the morning. You guys should copyedit my words for professionalism.

@InsertCreativityHere
Copy link
Member Author

InsertCreativityHere commented Oct 24, 2022

(https://github.com/rust-lang/rustfmt/blob/ef91154250977b3b5d05448dafbca524a1168b47/src/macros.rs#L992)

// This is a bit sketchy. The token rules probably need tweaking, but it works
// for some common cases. I hope the basic logic is sufficient. Note that the
// meaning of some tokens is a bit different here from usual Rust, e.g., `*`
// and `(`/`)` have special meaning.

@InsertCreativityHere
Copy link
Member Author

InsertCreativityHere commented Oct 25, 2022

I opened a bug for the whitespace issue: rust-lang/rustfmt#5573

@InsertCreativityHere
Copy link
Member Author

There is an option to configure this behavior for rustfmt. We can change that rule if we think this is bad code.

Yeah, I think this is pretty weird looking for Rust. What do you have in mind?

@InsertCreativityHere
Copy link
Member Author

I think the comment alignment might be buggy too, but that's less clear to me.
It leaves this completely alone:

    let x = 45; // Something else?
    let y = 574943834; // Thing1
    println!("{x} {y}"); // Thing2

But here it insists they must be visually aligned:

    let i = Identifier {
        value: "something".to_owned(), // Identifier
-        span: span.clone(), // Cool span
+        span: span.clone(),           // Cool span
    };

The only reason I think it's a bug is because usually rustfmt dislikes visual indentation like this.
The actual place this happens in the PR:

Diff in \\?\C:\Users\austin\Desktop\lalrpop-real\icerpc\src\parsers\slice\grammar.rs at line 388:
         data_type,
         tag,
         is_streamed,
-        is_returned: false, // Patched by its operation.
+        is_returned: false,                      // Patched by its operation.
         parent: WeakPtr::create_uninitialized(), // Patched by its container.

These comments are unrelated, so it feels weird to align them. They're only next to each other by coincedence here.
There's many other places where they aren't touching and rustfmt leaves them alone:

    let mut struct_ptr = OwnedPtr::new(Struct {
        identifier,
        members: Vec::new(),
        is_compact,
        parent: WeakPtr::create_uninitialized(), // Patched by its container.
        scope: parser.current_scope.clone(),
        attributes,
        comment,
        span,
        supported_encodings: None, // Patched by the encoding patcher.
    });

@ReeceHumphreys
Copy link
Contributor

I think the comment alignment might be buggy too, but that's less clear to me.

I think this isnt a big deal for this PR. We can just add the newline and everything is fine.

                self.advance_buffer(); // Consume the '&' character.

                 // Ensure the next character is also an '&' (since the whole token should be "&&").
                 if matches!(self.buffer.peek(), Some('&')) {
                ...

The above works and rustfmt is fine with it. For issues like the one above lets just make those minor changes (in this case a line) and open an issue with rustfmt.

Yeah, I think this is pretty weird looking for Rust. What do you have in mind?

Lets let rustfmt does its thing for now and we can open an issue once this PR is merged to change the setting as I am guessing depending on what decision we come to it could change a lot of code.

@InsertCreativityHere
Copy link
Member Author

Seems the visual alignment is a known issue: rust-lang/rustfmt#4108
Priority high even!

@InsertCreativityHere
Copy link
Member Author

lets just make those minor changes

I am of the opinion that we shouldn't intentionally mis-format our code.
Especially after the sheer amount of time and effort I've spent making sure that everything is formatted correctly.

Lets let rustfmt does its thing for now and we can open an issue once this PR is merged to change the setting as I am guessing depending on what decision we come to it could change a lot of code.

I'm not convinced the setting exists.

@InsertCreativityHere
Copy link
Member Author

@externl, @ReeceHumphreys is this okay to merge?
It passes the builds, tests, and linter.

I understand that rustfmt has issues with these changes,
and so you both do as well, but IMO these are bugs in rustfmt.

If you both feel that strongly about this, I'm fine with cargo +nightly fmting main after I merge...
Because at least, I would strongly prefer to have those changes isolated to their own commit.
That way, it's easy to go back and fix them once the bugs are fixed in rustfmt,
instead of having them lumped in with this massive PR.

@externl
Copy link
Member

externl commented Oct 25, 2022

@externl, @ReeceHumphreys is this okay to merge? It passes the builds, tests, and linter.

I understand that rustfmt has issues with these changes, and so you both do as well, but IMO these are bugs in rustfmt.

If you both feel that strongly about this, I'm fine with cargo +nightly fmting main after I merge... Because at least, I would strongly prefer to have those changes isolated to their own commit. That way, it's easy to go back and fix them once the bugs are fixed in rustfmt, instead of having them lumped in with this massive PR.

Sure, this time we can run the formatter in a separate commit.

@InsertCreativityHere InsertCreativityHere merged commit 59b444f into icerpc:main Oct 25, 2022
InsertCreativityHere added a commit to icerpc/icerpc-csharp that referenced this pull request Oct 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants