
Question: any plan to address the significant constant overhead per character of input? #24

Open
lygstate opened this issue Oct 5, 2020 · 4 comments

Comments

@lygstate

lygstate commented Oct 5, 2020

After a quick read of the paper, this looks like a wonderful idea for resolving PEG's left-recursion issue, and it's suited to many areas of language parsing. Only one concern:
For large grammars, bottom-up parsing can incur a significant constant overhead per character of input, which
may make pika parsers inappropriate for some performance-sensitive uses.

What is the cause of this overhead? Any ideas for improving it?

@lukehutch
Owner

Hello Yonggang,

The reason for this is described in the paper: it is due to wasted work building parse tree fragments bottom-up for possible parses that never end up being connected to the final parse tree. It's easiest to understand with, say, commented-out code: the pika parser has to fully parse the commented-out code, but when the comment delimiters are discovered (working right to left), all of the parse tree fragments that were built inside the comment are discarded, not linked into the final parse tree. Similarly, the parser has no idea whether any given character is inside or outside a quoted string. More generally, a given sequence of characters may be a variable name or may be a keyword, so it has to be parsed as both, and so on.
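For a concrete illustration (my example here, not one from the paper):

```java
// Working right to left, the pika parser fully parses "int x = y + 1;"
// inside the comment -- creating memo-table entries for the identifiers,
// the addition expression, the assignment, and the statement -- before it
// ever reaches the opening "/*". Once the Comment rule matches, none of
// those entries are linked into the final parse tree: wasted work.
String input = "f(); /* int x = y + 1; */ g();";
```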

I worked on some mitigations to limit this, principally, as I suggest in the paper, by running a lex preprocessing stage. However, I stripped the lexing stage out of the released parser source code because, as one reviewer pointed out, lexing actually changes the language recognized by the parser. I also discovered that it is very hard to write a lexer for a given target language using PEG rules, due to the greedy nature of PEG matching, so I left this as future work for anyone who wants to attempt it. For simpler languages, lexing is not hard, but for something the size of the Java grammar, it's really hard to write a reliable lexer that does not interfere with the operation of the parser.
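To illustrate how a lex stage can change the recognized language (again my example, not one from the paper): a greedy lexer that always tokenizes ">>" as a single shift operator can never hand the parser the two separate ">" tokens needed to close nested generics:

```java
import java.util.*;

class GreedyLexExample {
    List<List<String>> xs = new ArrayList<>(); // ">>" must close two type arguments here
    int y = 1024 >> 2;                         // but here ">>" really is one shift operator
}
```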

@cal101

cal101 commented Apr 10, 2022

Hello Luke!

I wonder if a DEL parser, as used by Comby, would be sufficient as a lexer.
Even for Java that is pretty simple, and it should provide the most important classification: strings and comments.
https://comby.dev/

@lukehutch
Owner

@cal101 Writing a lexer is not the hard part; the hard part is making sure that a lex preprocessing stage doesn't semantically change the language that is recognized by the parser.

My advice in general would be to do some lightweight lexing as a simple text preprocessing step, which could be done with some simple text-processing code. For example, you could remove all comments, and you could collapse each string literal into a single token (these are the two cases in standard programming language syntax that lead to lots of wasted work for the pika parser, because the comment delimiters and the start/end quotes change the parsing context, switching the recognized language to a different subset).
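As a minimal sketch of what that preprocessing could look like (hypothetical code, not part of the pika parser codebase; it assumes C-style comments and double-quoted strings):

```java
// Replace comments with spaces (preserving offsets and newlines) and blank
// out the bodies of string literals so each literal parses as one trivial
// token. A hedged sketch only; real code would handle char literals etc.
class LexPreprocess {
    static String preprocess(String src) {
        char[] out = src.toCharArray();
        int i = 0;
        while (i < out.length) {
            if (out[i] == '"') {                            // string literal: blank its body
                int j = i + 1;
                while (j < out.length && out[j] != '"') {
                    if (out[j] == '\\' && j + 1 < out.length) out[j++] = ' '; // escape pair
                    out[j++] = ' ';
                }
                i = Math.min(j + 1, out.length);
            } else if (i + 1 < out.length && out[i] == '/' && out[i + 1] == '/') {
                while (i < out.length && out[i] != '\n') out[i++] = ' ';      // line comment
            } else if (i + 1 < out.length && out[i] == '/' && out[i + 1] == '*') {
                out[i++] = ' ';
                out[i++] = ' ';
                while (i + 1 < out.length && !(out[i] == '*' && out[i + 1] == '/')) {
                    if (out[i] != '\n') out[i] = ' ';       // keep line breaks intact
                    i++;
                }
                if (i + 1 < out.length) { out[i++] = ' '; out[i++] = ' '; }
            } else {
                i++;
            }
        }
        return new String(out);
    }

    public static void main(String[] args) {
        String src = "s = \"a /* not a comment */ b\"; /* int x = 1; */ t(); // tail";
        System.out.println(preprocess(src)); // string body and both comments blanked
    }
}
```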

I took a quick look at Comby, but it looks like there is some fairly complex parser combinator theory behind how it works, so adding it would introduce a whole lot of extra complexity.

I have since come up with a much more efficient parsing algorithm, which I haven't written up yet, but it runs faster than any other left-recursion-capable parsing algorithm. Like most other recursive descent parsers, though, it does not have error recovery capability. I'll be writing a paper on this algorithm sometime soon.

@exaexa

exaexa commented Aug 12, 2023

Hey all, just to add a comment here: I found that the grammar's terminals can be converted to a lexer quite easily, and the parsing becomes really efficient, especially if your grammar's terminals can be longer than one character (which allows many partial and outright impossible matches to be skipped).

My lexing implementation is here: https://github.com/LCSB-BioCore/PikaParser.jl/blob/master/src/parse.jl#L203-L243 -- the speedup I get in Julia is normally between 10x and 100x, depending on the language. Note the "left-to-right greediness" of the lexing: it rules out large numbers of terminals that would never get matched.
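In rough Java pseudocode, the shape of the idea is something like this (a simplified sketch of the general technique, not the actual Julia implementation linked above): pre-match the grammar's terminals left to right, so the bottom-up pass only seeds memo entries for terminals that actually occur, rather than trying every terminal at every position.

```java
import java.util.*;

class TerminalPrescan {
    // Hypothetical terminal set; a real grammar would supply its own.
    static final String[] TERMINALS = { "while", "==", "=", "(", ")" };

    /** For each input position, the terminals that match there. */
    static Map<Integer, List<String>> prescan(String input) {
        Map<Integer, List<String>> seeds = new TreeMap<>();
        for (int pos = 0; pos < input.length(); pos++) {
            for (String t : TERMINALS) {
                if (input.startsWith(t, pos)) {
                    seeds.computeIfAbsent(pos, k -> new ArrayList<>()).add(t);
                }
            }
        }
        return seeds; // only these (position, terminal) pairs seed the memo table
    }

    public static void main(String[] args) {
        System.out.println(prescan("while (x == 1)"));
        // {0=[while], 6=[(], 9=[==, =], 10=[=], 13=[)]}
    }
}
```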
