Exponential blowup in recursive parsers #74

p-e-w · 2022-01-15T06:30:53Z

Consider this simple parser:

use chumsky::prelude::*;

fn parser() -> impl Parser<char, String, Error = Simple<char>> {
  recursive(|expr| {
    let atom = text::ident()
      .or(expr.clone().delimited_by('(', ')'));

    let expression = atom
      .clone()
      .then_ignore(just('+'))
      .then(atom.clone())
      .map(|(a, b)| format!("{}{}", a, b))
      .or(atom);

    expression
  }).then_ignore(end())
}

fn main() {
  println!("{:?}", parser().parse("((((((((((((((((((((((a+b))))))))))))))))))))))"));
}

Parsing the string

((((((((((((((((((((((a+b))))))))))))))))))))))

takes 8 seconds when compiling in release mode (around a minute in debug mode), and that time rapidly grows as more pairs of parentheses are added.

When the parser inside the recursive closure is more complex, the problem becomes even worse. With the parser I'm working on, I'm seeing multi-second parse times already for 3-4 levels of nested brackets.

Boxing parsers does not solve this issue. What can be done here?

The text was updated successfully, but these errors were encountered:

zesterer · 2022-02-07T15:58:48Z

The problem here is subtle, but is actually a more general problem with backtracking parsers, rather than an issue with chumsky itself. A similar problem will appear with a hand-written recursive descent parser.

Within your code, you have two patterns that can be parsed that both look very similar. They are:

expr

and

expr + expr

Notice something important: both patterns are ambiguous right the way up until the point at which the + is hit, if it exists. This means that chumsky (which is a recursive descent PEG parser internally) is forced to try one of the patterns first, then the next pattern if the first fails.

Normally, this is not a problem. For an expression like a + b, it's obvious - very quickly - that the second pattern is correct after just the second symbol. What really causes this to become a problem is that these patterns are defined recursively: at each level of recursion, the parser needs to try one pattern and then the other. If it guesses wrong the first time, it's just done a lot of work for nothing!

In fact, for every layer of recursion for which the wrong guess is made, the amount of work required to parse the input doubles. Your input has 21 layers of nesting, where the parser defaults to the wrong guess each time, meaning that the resulting parser is doing 2^21 = 2,097,152x the amount of work that strictly needs to be done. Hopefully, it should now be clear why we're seeing such pessimistic numbers here.

Fixing this

Fixing this problem requires restating the grammar in a way that gives the parser less work to do. While your parser is logically correct, in practice it explodes the space of possible parse routes and recursive descent isn't smart enough to handle such cases (there are techniques such as memoization that can avoid this exponential blowup, but they usually have worse performance for non-pessimistic grammars and inputs so chumsky does not implement them).

A better approach is to restate the grammar like so:

use chumsky::prelude::*;

fn parser() -> impl Parser<char, String, Error = Simple<char>> {
  recursive(|expr| {
    let atom = text::ident()
      .or(expr.clone().delimited_by(just('('), just(')')));

    let expression = atom.clone()
        .then(just('+').ignore_then(atom).or_not())
        .map(|(a, b)| if let Some(b) = b {
            format!("{}{}", a, b)
        } else {
            a
        });

    expression
  }).then_ignore(end())
}

fn main() {
  println!("{:?}", parser().parse("((((((((((((((((((((((a+b))))))))))))))))))))))"));
}

Here, we use or_not() to express the fact that + expr may or may not be present after the initial expression, allowing the parser to avoid backtracking and preventing the exponential blowup.

I hope this helps, and sorry about not replying to this earlier: GitHub has not been showing me all of my notifications for some reason.

p-e-w · 2022-02-09T07:13:43Z

Thank you for the in-depth explanation. I completely understand the underlying issue now and can easily fix it in my own parser.

I guess the recommended approach can for many cases be distilled down to "prefer conditional matching over alternation where possible, in order to prevent branching during recursion".

That being said, I can't help but feel that my original parser looks a lot more intuitive than your modified version. It would be nice if it were possible to write code like the first version that performs like the second version. Can't Chumsky do that transformation automatically somehow, at least in simple cases? That is, if there is an alternation (or) between two patterns, where one is a prefix of the other, perform parsing by conditionally matching the suffix instead?

zesterer · 2022-02-09T09:23:08Z

I'm not sure whether it's feasible to transform one into the other, but memoisation would at least prevent the exponential blowup by 'remembering' past parse failures at each position, at the cost of increased memory usage and poorer performance in less pessimistic cases.

I'll open an issue for memoisation, because it might be desirable for some particularly ambiguous grammars. It's not going to be an immediate focus for me, but it's possible that it might be worked into #82.

See zesterer/chumsky#74

zesterer closed this as completed Feb 9, 2022

zesterer mentioned this issue Feb 9, 2022

Implement memoisation #84

Closed

p-e-w added a commit to p-e-w/savage that referenced this issue Mar 1, 2022

Fix exponential blowup in parser

b562fb1

See zesterer/chumsky#74

zesterer mentioned this issue Feb 18, 2023

Added memoised parser #290

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exponential blowup in recursive parsers #74

Exponential blowup in recursive parsers #74

p-e-w commented Jan 15, 2022

zesterer commented Feb 7, 2022 •

edited

Loading

p-e-w commented Feb 9, 2022

zesterer commented Feb 9, 2022 •

edited

Loading

Exponential blowup in recursive parsers #74

Exponential blowup in recursive parsers #74

Comments

p-e-w commented Jan 15, 2022

zesterer commented Feb 7, 2022 • edited Loading

Fixing this

p-e-w commented Feb 9, 2022

zesterer commented Feb 9, 2022 • edited Loading

zesterer commented Feb 7, 2022 •

edited

Loading

zesterer commented Feb 9, 2022 •

edited

Loading