Parsing Models Cheatsheet

From Zulip:

Parsing Models and Their Computational Complexity

NOTE: Speed isn't the most important thing when considering a model; I think the more important issues are expressiveness (what can it parse), readability, and debuggability.

Basic regexes (e.g. BRE grep/sed or ERE with awk, egrep): linear time matching, constant space
Perl style regexes (now in Python, Ruby, JS, etc.): exponential in the worst case. (Russ Cox's articles are all about this.)
Arbitrary CFG: O(n^3). There are like 10 different algorithms to recognize CFGs, with complex sets of advantages and disadvantages.
- LR family
  - LALR(1) grammar (yacc): O(n), accepting all the limitations with shift-reduce conflicts and such
- LL family
  - Python's pgen: LL(1) which can be matched in O(n) time.
  - ANTLR: started out as LL(k) which I believe is also O(n), but ANTLR v4 introduced a more powerful model ALL(*) (“all-star”).
PEG: exponential backtracking or linear time memoized packrat parsing.
Turing complete code: you can write arbitrarily slow code, but people generally don't, because it's obvious when you have an O(n^2) loop or are doing exponential backtracking.
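
To make the PEG line above concrete, here is a minimal Python sketch that counts how many rule bodies run with and without packrat memoization. The grammar and every name in it (count_rule_calls, rule, lit, expr, term) are made up for this illustration and don't come from any real PEG library. The grammar is deliberately chosen so that each nesting level re-parses its contents three times when there is no memo table.

```python
def count_rule_calls(text, memoize):
    """Recognize `text` with a toy PEG and count how many rule bodies run.

    Grammar (hypothetical, chosen to provoke backtracking):
        E <- T '+' E / T '-' E / T
        T <- '(' E ')' / 'a'
    """
    calls = 0
    memo = {}  # (rule name, position) -> end position, or None on failure

    def rule(fn):
        # Wrap a rule so it is counted and, optionally, memoized (packrat).
        def wrapper(pos):
            nonlocal calls
            key = (fn.__name__, pos)
            if memoize and key in memo:
                return memo[key]
            calls += 1
            result = fn(pos)
            if memoize:
                memo[key] = result
            return result
        return wrapper

    def lit(ch, pos):
        # Match one literal character; None means "no match".
        if pos is not None and pos < len(text) and text[pos] == ch:
            return pos + 1
        return None

    @rule
    def expr(pos):
        # Ordered choice: each failed alternative re-runs term(pos).
        for op in "+-":
            after_op = lit(op, term(pos))
            if after_op is not None:
                end = expr(after_op)
                if end is not None:
                    return end
        return term(pos)

    @rule
    def term(pos):
        inner = lit("(", pos)
        if inner is not None:
            end = lit(")", expr(inner))
            if end is not None:
                return end
        return lit("a", pos)

    return expr(0) == len(text), calls


for depth in (2, 6, 10):
    text = "(" * depth + "a" + ")" * depth
    _, naive = count_rule_calls(text, memoize=False)
    _, packrat = count_rule_calls(text, memoize=True)
    print(depth, naive, packrat)
```

On nested input like ((((a)))), the unmemoized count roughly triples with each extra level of parentheses, while the memoized count grows with the length of the input. The price is the memo table, which is why packrat parsing trades space for time.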

Case Studies

sed: uses arbitrary code.
awk: LALR(1) parser with yacc [1].
Python: an LL(1) parser written with a bespoke grammar DSL pgen [2].

One of the major motivations for OSH is to test this theory that Chet Ramey wrote about [3], after 20+ years maintaining bash:

One thing I've considered multiple times, but never done, is rewriting the bash parser using straight recursive-descent rather than using bison. [...] Were I starting bash from scratch, I probably would have written a parser by hand. It certainly would have made some things easier.

Andy: In my opinion, this experiment was a big success. The OSH parser is more maintainable and less buggy than bash's parser (although it's admittedly slower, being in Python). bash is 20+ years old and they are still fixing corner cases involving matching { } and ( ).

It's because their code is a messy mix of yacc and C code, and it would be better off as well-structured code (in C, or a higher level language). The interface between the two models is messy and ill-defined (and filled with global variables).

Looking at parse.y in bash, there's much more C code there than there is yacc grammar. The grammar solves maybe 25% of the problems you have. And subst.c has a ton of ad hoc parsing code outside the grammar.

$ wc -l parse.y
6513 parse.y
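
For contrast, here is a rough sketch of what "straight recursive-descent" looks like for the { } and ( ) matching problem mentioned above. It is not OSH's or bash's actual parser. The names (parse_group, ParseError, CLOSERS) and the nested-list tree shape are assumptions made for this example, and real shell syntax has quoting, keywords, and here docs that this ignores. The point is only that the nesting state lives in ordinary function arguments and the call stack rather than in globals shared between a grammar and C actions.

```python
class ParseError(Exception):
    pass


# Hypothetical table of opening delimiters and their closers.
CLOSERS = {"{": "}", "(": ")"}


def parse_group(text, pos=0, closer=None):
    """Parse items until `closer`, or until end of input when closer is None.

    Returns (items, new_pos). Each item is either a single character or
    a two-element list [opener, nested_items].
    """
    items = []
    while pos < len(text):
        ch = text[pos]
        if ch == closer:
            return items, pos + 1  # consume the closer and return to caller
        if ch in CLOSERS:
            # One recursive call per nested group.
            nested, pos = parse_group(text, pos + 1, CLOSERS[ch])
            items.append([ch, nested])
        elif ch in CLOSERS.values():
            raise ParseError(f"unexpected {ch!r} at offset {pos}")
        else:
            items.append(ch)
            pos += 1
    if closer is not None:
        raise ParseError(f"missing {closer!r} before end of input")
    return items, pos


tree, _ = parse_group("a{b(c)}")
print(tree)
```

Because each error is raised at the point where the parser knows which closer it was expecting, messages like "unexpected '}' at offset 2" come out of the structure of the code rather than from a separate error-recovery mechanism.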

[1] I forked Kernighan's original Awk and found a couple minor bugs in it. https://github.com/andychu/bwk

[2] The "update" here is due to a private e-mail discussion I had with Guido on pgen's design. http://python-history.blogspot.com/2018/05/the-origins-of-pgen.html

[3] http://aosabook.org/en/bash.html
