Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster parsing by changing source location representation #746

Open
expipiplus1 opened this issue Nov 1, 2020 · 6 comments
Open

Faster parsing by changing source location representation #746

expipiplus1 opened this issue Nov 1, 2020 · 6 comments

Comments

@expipiplus1
Copy link
Collaborator

I noticed that the documentation for megaparsec says that getSourcePos "is not cheap, do not call it e.g. on matching of every token, that's a bad idea".

We currently call this twice for every expression. I wonder if it would be possible to change the source position representation in the parser to be a token (character) offset instead, which is very cheap (getOffset). If the user has requested a parse without source positions then this is discarded, else it's transformed into full line and column representation (sufficient laziness could even make this automatic, the location translation isn't even performed until the pretty error message needs to be printed for example).

This would also benefit downstream users who actually want the character offset and who currently have to reconstruct this from the line and column position!

Defining getSourcePos = pure (SourcePos "" (mkPos 0) (mkPos 0)) does speed things up significantly:

image

@layus
Copy link
Collaborator

layus commented Nov 16, 2020

That's a neat improvement :-). How much work does this represent ?

@expipiplus1
Copy link
Collaborator Author

I suspect that given the very pleasing type of expressions in hnix it shouldn't be too much trouble at all.

  • Very easy to switch to the character offsets, just changing a handful of lines in the parser.
  • Writing a function to convert these to line/col representation, would just be a recursive traversal of the expression tree keeping state of where we are in the file.

I think writing tests would probably be the most time-consuming part :D

Not sure if there are any performance penalties lurking where I haven't thought about though.

@Anton-Latukha
Copy link
Collaborator

Anton-Latukha commented Jan 12, 2022

In #1026 I also note that the current Pos type produces in Expr.Types 10 orphan instances, notification on which currently suppressed.

Anton-Latukha added a commit to Anton-Latukha/hnix that referenced this issue Jan 21, 2022
Anton-Latukha added a commit that referenced this issue Jan 21, 2022
This is a type & type boundary lifting groundwork to do #1026 & #746 design in this release.

The `NPos` & `NSourcePos` next can be freely shaped into what comes out of the #1026 & #746.
@Anton-Latukha
Copy link
Collaborator

@Anton-Latukha
Copy link
Collaborator

@expipiplus1
Copy link
Collaborator Author

expipiplus1 commented Jan 22, 2022 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants