Skip to content

Add line/column information consistently to every parser message. #179

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
Dandandan opened this issue Jun 2, 2020 · 4 comments
Open
Labels
help wanted Extra attention is needed

Comments

@Dandandan
Copy link
Contributor

It would be useful to have clear parser errors which have the position (line and column) printed.

Now for some errors there is information of where the error happened:

Unterminated string literal at Line: {}, Col: {}

For others, there isn't:

Unexpected EOF while in a multi-line comment

I think this would be a "good first issue" to work on, maybe I find some time to work on this soon.

@nickolay
Copy link
Contributor

nickolay commented Jun 2, 2020

I agree with this goal, and location-aware errors from sqlparser is part of the larger story of good error messages (the other part is keeping track of position in the happy case, so that semantic errors can be reported with a specific location as well - https://github.com/andygrove/sqlparser-rs/issues/161#issuecomment-633619241)

Note that currently the tokenizer and the parser do two independent passes (with the tokenizer returning a Result<Vec<Token>, TokenizerError>), and only the tokenizer keeps track of the current location.

I think there are two tasks here:

  • Changing TokenizerError to store the line/column information and introducing a helper to instantiate it based on tokenizer's current position - this would indeed be a "good first issue".
  • Making ParserError report the position (possibly removing the distinction between parser/tokenizer errors) would be more involved.

@nickolay nickolay added the good first issue Good for newcomers label Jun 2, 2020
@Dandandan
Copy link
Contributor Author

Sounds great! The first indeed is the more low hanging fruit I meant with this story (TokenizerError). The second one is really valuable but probably requires much more changes.

@Dandandan
Copy link
Contributor Author

@nickolay I saw that in the fork of Materialize column / line indexes are supported.
For that, the following is done:

I think this should be relatively simply to implement on the current version, compared to the lossless syntax (although it doesn't address the comment handling, lossless parsing). What do you think?

@nickolay
Copy link
Contributor

The source range (or rather the corresponding string) is needed for the Lossless Syntax Tree as well. I used token.to_string() as a workaround, but adding proper infrastructure would certainly be helpful!

Since the LST needs a data-less enum SyntaxKind to tag its nodes (both the leafs - i.e. tokens - and intermediate ones), I planned on changing Token to store only this tag and a slice of the source text, i.e. make Token = (SyntaxKind, SomeSortOfString). The start position of each token could then be calculated by adding up the lengths of slices for all the preceding tokens. Any processing of the token's text (e.g. reading '\\' as a one-character-long string literal) would have to be moved to the parser.

Materialize instead returns (Token, Range) tuple from the tokenizer, sometimes calling such tuple a token, which I think is a bit confusing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

2 participants