
Question: any plan to address the significant constant overhead per character of input? #24

Open
lygstate opened this issue Oct 5, 2020 · 4 comments

Comments

@lygstate

lygstate commented Oct 5, 2020

After a quick read of the paper, this looks like a wonderful idea for resolving PEG's left-recursion issue, and it's suited to many areas of language parsing. Only one concern:
For large grammars, bottom-up parsing can incur a significant constant overhead per character of input, which
may make pika parsers inappropriate for some performance-sensitive uses.

What is the cause of this overhead? Any ideas for improving it?

@lukehutch
Owner

Hello Yonggang,

The reason for this is described in the paper: it is due to wasted work building parse tree fragments bottom-up for possible parses that never end up being connected to the final parse tree. It's easiest to understand with, say, commented-out code: the pika parser has to fully parse the commented-out code, but when the comment delimiters are discovered (working right to left), all of the parse tree fragments that were built inside the comment are discarded, not linked into the final parse tree. Similarly, the parser has no idea whether any given character is inside or outside a quoted string. More generally, a given sequence of characters may be a variable name or may be a keyword, so it has to be parsed as both, and so on.
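For a concrete illustration (my example here, not one from the paper):

```java
// Working right to left, the pika parser fully parses "int x = y + 1;"
// inside the comment -- creating memo-table entries for the identifiers,
// the addition expression, the assignment, and the statement -- before it
// ever reaches the opening "/*". Once the Comment rule matches, none of
// those entries are linked into the final parse tree: wasted work.
String input = "f(); /* int x = y + 1; */ g();";
```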

I worked on some mitigations to limit this, principally, as I suggest in the paper, by running a lex preprocessing stage. However, I stripped the lexing stage out of the released parser source code because, as one reviewer pointed out, lexing actually changes the language recognized by the parser. I also discovered that it is very hard to write a lexer for a given target language using PEG rules, due to the greedy nature of PEG matching, so I left this as future work for anyone who wants to attempt it. For simpler languages, lexing is not hard, but for something the size of the Java grammar, it's really hard to write a reliable lexer that does not interfere with the operation of the parser.
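To illustrate how a lex stage can change the recognized language (again my example, not one from the paper): a greedy lexer that always tokenizes ">>" as a single shift operator can never hand the parser the two separate ">" tokens needed to close nested generics:

```java
import java.util.*;

class GreedyLexExample {
    List<List<String>> xs = new ArrayList<>(); // ">>" must close two type arguments here
    int y = 1024 >> 2;                         // but here ">>" really is one shift operator
}
```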

@cal101

cal101 commented Apr 10, 2022

Hello Luke!

I wonder if a DEL parser, as used by Comby, would be sufficient as a lexer.
Even for Java that is pretty simple, and it should provide the most important classification: strings and comments.
https://comby.dev/

@lukehutch
Owner

@cal101 Writing a lexer is not the hard part; the hard part is making sure that a lex preprocessing stage doesn't semantically change the language that is recognized by the parser.

My advice in general would be to do some lightweight lexing as a simple text preprocessing step, which could be done with some simple text-processing code. For example, you could remove all comments, and you could collapse each string literal into a single token (these are the two cases in standard programming language syntax that lead to lots of wasted work for the pika parser, because the comment delimiters and the start/end quotes change the parsing context, switching the recognized language to a different subset).
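As a minimal sketch of what that preprocessing could look like (hypothetical code, not part of the pika parser codebase; it assumes C-style comments and double-quoted strings):

```java
// Replace comments with spaces (preserving offsets and newlines) and blank
// out the bodies of string literals so each literal parses as one trivial
// token. A hedged sketch only; real code would handle char literals etc.
class LexPreprocess {
    static String preprocess(String src) {
        char[] out = src.toCharArray();
        int i = 0;
        while (i < out.length) {
            if (out[i] == '"') {                            // string literal: blank its body
                int j = i + 1;
                while (j < out.length && out[j] != '"') {
                    if (out[j] == '\\' && j + 1 < out.length) out[j++] = ' '; // escape pair
                    out[j++] = ' ';
                }
                i = Math.min(j + 1, out.length);
            } else if (i + 1 < out.length && out[i] == '/' && out[i + 1] == '/') {
                while (i < out.length && out[i] != '\n') out[i++] = ' ';      // line comment
            } else if (i + 1 < out.length && out[i] == '/' && out[i + 1] == '*') {
                out[i++] = ' ';
                out[i++] = ' ';
                while (i + 1 < out.length && !(out[i] == '*' && out[i + 1] == '/')) {
                    if (out[i] != '\n') out[i] = ' ';       // keep line breaks intact
                    i++;
                }
                if (i + 1 < out.length) { out[i++] = ' '; out[i++] = ' '; }
            } else {
                i++;
            }
        }
        return new String(out);
    }

    public static void main(String[] args) {
        String src = "s = \"a /* not a comment */ b\"; /* int x = 1; */ t(); // tail";
        System.out.println(preprocess(src)); // string body and both comments blanked
    }
}
```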

I took a quick look at Comby, but it looks like there is some fairly complex parser combinator theory behind how it works, so adding it would introduce a whole lot of extra complexity.

I have since come up with a much more efficient parsing algorithm, which I haven't written up yet, but it runs faster than any other left-recursion-capable parsing algorithm. Like most other recursive descent parsers, though, it does not have error recovery capability. I'll be writing a paper on this algorithm sometime soon.

@exaexa

exaexa commented Aug 12, 2023

Hey all, just to add a comment here: I found that the grammar's terminals can be converted to a lexer quite easily, and the parsing becomes really efficient, especially if your grammar's terminals can be longer than one character (which allows many partial and outright impossible matches to be skipped).

My lexing implementation is here: https://github.com/LCSB-BioCore/PikaParser.jl/blob/master/src/parse.jl#L203-L243 -- the speedup I get in Julia is normally between 10x and 100x, depending on the language. Note the "left-to-right greediness" of the lexing: it rules out large numbers of terminals that would never get matched.
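In rough Java pseudocode, the shape of the idea is something like this (a simplified sketch of the general technique, not the actual Julia implementation linked above): pre-match the grammar's terminals left to right, so the bottom-up pass only seeds memo entries for terminals that actually occur, rather than trying every terminal at every position.

```java
import java.util.*;

class TerminalPrescan {
    // Hypothetical terminal set; a real grammar would supply its own.
    static final String[] TERMINALS = { "while", "==", "=", "(", ")" };

    /** For each input position, the terminals that match there. */
    static Map<Integer, List<String>> prescan(String input) {
        Map<Integer, List<String>> seeds = new TreeMap<>();
        for (int pos = 0; pos < input.length(); pos++) {
            for (String t : TERMINALS) {
                if (input.startsWith(t, pos)) {
                    seeds.computeIfAbsent(pos, k -> new ArrayList<>()).add(t);
                }
            }
        }
        return seeds; // only these (position, terminal) pairs seed the memo table
    }

    public static void main(String[] args) {
        System.out.println(prescan("while (x == 1)"));
        // {0=[while], 6=[(], 9=[==, =], 10=[=], 13=[)]}
    }
}
```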
