Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add error recovery #289

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open

Add error recovery #289

wants to merge 4 commits into from

Conversation

kw217
Copy link
Contributor

@kw217 kw217 commented Feb 15, 2022

When using a parser in the context of an IDE or a compiler, it is desirable to be able to recover from errors, returning a best-efforts parse anyway and reporting all errors rather than just the first. This PR extends rust-peg with support for this, based on Syntax Error Recovery in Parsing Expression Grammars and Automatic Syntax Error Reporting and Recovery in Parsing Expression Grammars.

@kevinmehall I'm not sure if multiple errors is something you want to support, and I'm not 100% confident I've chosen the best syntax, so I would be grateful for your feedback.

Adding error recovery necessarily introduces a breaking change - the public return type of a parser has to be able to include multiple errors. To ease upgrade, the new type has an .into_result() method which returns the original result type.

There are three expressions for raising an error:

  • error!{ "message" e2 } is the simplest form. It reports an error with the given message at the current location, then attempts to recover by parsing e2 instead.
  • error_if!{ e1 | "message" e2 } allows the error to be reported at an earlier location. It attempts to parse e1; if it succeeds it reports an error with the given message at the start of e1, then attempts to recover by parsing e2 instead.
  • error_unless!{ e1 | "message" e2 } is shorthand for a choice expression. It is useful in order to report an error when an expression fails to match. It attempts to parse e1; if it fails it reports an error with the given message at the current location, then attempts to recover by parsing e2 instead.

The result of the parser (ParseResults) contains the final value, failure, or error, along with a list of all errors we recovered from during the parse.

Full documentation is provided in src/lib.rs.

Changes

  • The new expression forms are represented in the AST by RecoverIfExpr and RecoverUnlessExpr (the simple error! form is sugar for error_if! with an empty e1).
  • Most test and example code has been upgraded by inserting .into_result().
  • New tests are added as recovery.rs (the example from the paper) and arithmetic_recovery.rs (a worked example of how to use the power of the new features).
  • In the parser, RuleResult now has a third alternative (beyond Matched and Failed): Error. This is handled largely following the rules from the papers cited, although a few improvements and optimizations have been added as noted in the comment in src/lib.rs.
  • Similarly ErrorState gains a set of errors recovered from so far.
  • I made a few renamings for clarity, since now "error" and "failure" are two different things - e.g., into_parse_error is now into_failure, since it creates a failure not an error, and reparsing_on_error becomes reparsing_on_failure.
  • In a few places in the code I added comments explaining the meaning of fields and methods (as I figured them out :-) ).

Let me know if you have any questions!

@Mingun
Copy link

Mingun commented Feb 16, 2022

It is quite big improvement and I think it will be better to split it on several commits since otherwise it will be difficult to review.

I suppose to split this PR to following commits:

  • Add documentation to existing stuff
  • Introduce new return type and upgrade library to use it + migration guide
  • Introduce new grammar constructs + AST upgrade

Maybe some steps can be split even more, but that is a required minimum as I think.

@kw217
Copy link
Contributor Author

kw217 commented Feb 16, 2022

Thanks @Mingun , I'll split up the PR into separate commits as you suggest.

I've fixed the bug in trace mode that was spotted by the automated tests (oops!).

I just noticed the discussion at #276, which I hadn't seen before I submitted this PR (I did most of the work a few months ago). Here are some thoughts against each of @kevinmehall's headings:

  • Producing errors as part of the normal result is the intended way of using this extension; however this extension allows you to record those errors too.
  • I love the idea of checking the parser can't fail. That's not addressed here.
  • Error localisation - this extension (following the paper) collects the points at which you threw a labelled error, rather than re-running the existing reparsing algorithm.
  • Labeled failures - that's the paper I based this work on!
  • Lossless syntax tree. This isn't addressed here either (though some principled way of capturing location/span info is clearly needed, and most parsers will reinvent it somehow).

@kw217 kw217 force-pushed the recovery-parser branch 2 times, most recently from a5aeae7 to bcc4456 Compare February 16, 2022 22:19
Changes the public return type of parsers, so clients can access
parse result information directly.

This is a breaking change, but clients can easily adapt by
invoking `.into_result()` to obtain the original return type.

This commit contains a manually-updated grammar.rs.
The next commit contains the bootstrapped one.
Previous commit exposes the parser return type, but used a
manually-updated grammar.rs. This commit autogenerates
the bootstrapped grammar.rs from grammar.rustpeg.
Adds a new expression `error!{"message" e}` which reports an error and
attempts to recover it with `e`. Also `error_if!{}` and
`error_unless!{}`.

Maintains and returns a list of errors which have been recovered.

This enables the parser to report multiple errors, which is useful
in contexts like IDEs and compilers where the user may find it easier
to be presented with all errors rather than just the first.

This commit contains a manually-updated grammar.rs,
sufficient only for bootstrapping but with no support for
parsing the new expression types.
The next commit contains the bootstrapped one.
Previous commit adds error recovery, but used a
manually-updated grammar.rs. This commit autogenerates
the bootstrapped grammar.rs from grammar.rustpeg.
@kw217
Copy link
Contributor Author

kw217 commented Feb 16, 2022

(apologies for the noise)

This is now split into five commits - @Mingun hopefully this is easier to review!

  1. Add extra comments.
  2. Expose (existing) parser return type, and require .into_result().
  3. Bootstrap grammar.rs.
  4. Add error recovery.
  5. Bootstrap grammar.rs.

This is now ready for review. Thanks.

@kevinmehall
Copy link
Owner

Yes, some kind of error recovery is definitely in-scope for rust-peg and I'm glad to see someone working on this. Error recovery is potentially a big addition and I want to make sure we've explored the design space and get it right, so this is going to take some discussion and iteration.

Some initial feedback:

Syntax

It seems that

  • error_if! { e1 | "message" e2 } is the same as &e1 error!{ "message" e2 }
  • error_unless!{ e1 | "message" e2 } is the same as e1 / error!{ "message" e2 }

I would lean towards re-using the existing control flow operators unless there's some advantage of duplicating them.

Collecting multiple errors

I'm wary of anything that returns data out of a PEG expression except via its return value, because this doesn't play well with backtracking.

For example,

rule abc() = "a" error_unless!{ "b" | "expected b" [_] }  "c"
rule asdf() = "asdf"
pub rule top() = abc() / asdf()

top("asdf") produces an "expected b" error in its errors list, even though the otherwise-successful parse matched using asdf() instead of abc().

The Medeiros paper doesn't say much about how errors should be collected. You could imagine reverting recorded errors on backtracking, but this seems like it would have a large performance penalty to implement.

This is why on #276 I proposed returning the errors as part of the AST -- you have to construct a placeholder AST node anyway, so why not make it carry the error? That also leaves the error type up to the user, instead of requiring &'static str like here, and avoids complicating the parser API.

If you really want to collect a list of errors without dealing with backtracking, you can always pass in a mutable Vec with the existing support for grammar arguments.

Error vs Failure

What constructs ParseResult::Error? It seems like this is only ever produced if recovery expression e2 fails? Giving error!{} the dual purpose of logging an error and continuing vs aborting the rest of the parse depending on the presence and success of the recovery expression seems like it's unnecessarily combining two pieces of functionality that could be orthogonal.

I also question whether the behavior of preventing further backtracking and aborting the parse is desirable here if the goal is to build an AST that is as complete as possible. You could imagine wanting layered error recovery, e.g. attempting recovery within the rule for expressions, but if that fails, the calling rule tries to recover at the following statement, and if nothing else, you replace the whole function with an error node but continue with the rest of the file.

I do support being more consistent on "error" vs "failure" terminology in the docs and code, though!


I want to re-read the papers more closely, because I feel like I'm missing something there -- the most useful piece from the first paper seems to be e.g. the strategy of the SkipToRCUR rule, but that can be done in any PEG implementation using ordered choice without any special features. Medeiros describes the error handling in terms of labeled failures, but the paper doesn't make much use of that connection -- it doesn't use the extension of the ordered choice operator to catch failures with a particular label. The labels are merely a point to attach a recovery expression, which as your implementation shows, can be done inline instead of with labels, and I suggest is basically the same as falling back to a recovery expression with ordered choice anyway.

So I'm not sure where that leaves this PR. It's been interesting to try out, at least. If you have a grammar that you've used this on, I'd like to see it.

@kw217
Copy link
Contributor Author

kw217 commented Feb 20, 2022

Thanks, really appreciate the detailed response, and definitely keen to iterate. This is a hobby project for me so I apologise in advance if it takes me a while to get back to you.

This was a fairly straightforward implementation of the paper, without taking a lot of time to consider whether it was the best approach. You are correct re ParseResult::Error. You ask very sensible questions about backtracking and collection of errors, and I'm intrigued by the idea of separating the logging and recovery/abort. I'll ponder how this might work.

I see your point about reporting errors in the AST, but having them on the side makes sense in terms of the way a compiler or IDE expects to see them - when I compile a Rust program I am presented with a linear list of errors, and in order to generate wiggly lines in an IDE I have to present a list of source ranges. This could be extracted from the AST, but that adds an extra stage which feels artificial.

You are totally right that error_if! and error_unless! are just sugar. However when you actually write a grammar they represent patterns that come up a lot. error_unless! is already in the Medeiros paper, and I found I needed error_if! in order to get errors to show up in the right place - and the repetition of &e1 error!("message" e1 e2) was unpleasant, especially if e1 was complex. See the example arithmetic_recovery.rs for how this is used. The syntax may not be ideal, but I think something like this is necessary for ergonomic reasons.

The best example I have at the moment is arithmetic_recovery.rs; I have another project that inspired this but it's not ready for sharing just yet, which provides syntax highlighting and error reporting for the Monaco editor (the editor used by VS Code) via WASM.

Thanks again. I'll get back to you with a better example of how this can be used, and I'll ponder more broadly whether there's a better approach.

@kw217 kw217 mentioned this pull request May 20, 2022
@ProphetLamb
Copy link

ProphetLamb commented Nov 18, 2022

Looking at the current mechanism, error recovery is not moved to the lowest precedence alt, which can result in ambiguous matches. So that for example if we have a rule constant and expression , for a gplang, if we match a integer constant as show below, only integer literals are matched.

rule integer() -> Literal
  = val:$(['0'..°'9') ....
  / error!{"nan" 0}

pub rule literal() -> Literal
  = integer
  / floating
  / string
  / error!{"expected a constant value" 0}

If we look at ANTLR, the parser first tries all possible alternatives, then exports the parse trace to a error handler.

ANTLR would produce a state machine like so:

const -> integer>{v} | floating>{v} | string>{v}
expr -> const | interpolation | stmt
unit -> expr | expr>(const>integer>{"nan", 0} | expr>const>{"expected a constant value" 0})

(note that all error recovery is done at the highest level of the tree in the state machine)
while the error handling would produce a "state machine" like so:

integer -> [0..=9] {v} | {"nan" 0}. //this can never fail
const -> integer>{v} | floating>{v} | string>{v} | 
expr -> const | interpolation | stmt | {"expected a constant value" 0})
unit -> expr

which is obviously invalid.

@DasLixou
Copy link

DasLixou commented Nov 6, 2023

Bump
Any updates on this? 👀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants