Parsing Models Cheatsheet

From Zulip:

Parsing Models and Their Computational Complexity

NOTE: Speed isn't the most important thing when considering a model; I think the more important issues are expressiveness (what can it parse), readability, and debuggability.

Basic regexes (e.g. BRE grep/sed or ERE with awk, egrep): linear time matching, constant space
Perl style regexes (now in Python, Ruby, JS, etc.): exponential in the worst case. (Russ Cox's articles are all about this.)
Arbitrary CFG: O(n^3). There are like 10 different algorithms to recognize CFGs, with complex sets of advantages and disadvantages.
- LR family
  - LALR(1) grammar (yacc): O(n), accepting all the limitations with shift-reduce conflicts and such
- LL family
  - Python's pgen: LL(1) which can be matched in O(n) time.
  - ANTLR: started out as LL(k) which I believe is also O(n), but ANTLR v4 introduced a more powerful model ALL(*) (“all-star”).
PEG: exponential backtracking or linear time memoized packrat parsing.
Turing complete code: you can write arbitrarily slow code, but people generally don't, because it's obvious when you have an O(n^2) loop or are doing exponential backtracking.

Case Studies

sed: uses arbitrary code.
awk: LALR(1) parser with yacc [1].
Python: an LL(1) parser written with a bespoke grammar DSL pgen [2].

One of the major motivations for OSH it to test this theory that Chet Ramey wrote about [3], after 20+ years maintaining bash:

One thing I've considered multiple times, but never done, is rewriting the bash parser using straight recursive-descent rather than using bison. [...] Were I starting bash from scratch, I probably would have written a parser by hand. It certainly would have made some things easier.

Andy: In my opinion, this experiment was a big success. The OSH parser is more maintainable and less buggy than bash's parser (although it's admittedly slower, being in Python). bash is 20+ years old and they are still fixing corner cases involving matching { } and ( ).

It's because their code s a messy mix of yacc and C code, and it would be better off as well-structured code (in C, or a higher level language). The interface between the two models messy and ill-defined (and filled with global variables).

Looking at parse.y in bash, there's much more C code there than there is yacc grammar. The grammar solves maybe 25% of the problems you have. And subst.c has a ton of ad hoc parsing code outside the grammar.

$ wc -l parse.y
>6513 parse.y

[1] I forked Kernighan's original Awk and found a couple minor bugs in it. https://github.com/andychu/bwk

[2] The "update" here is due to a private e-mail discussion I had with Guido on pgen's design. http://python-history.blogspot.com/2018/05/the-origins-of-pgen.html

[3] http://aosabook.org/en/bash.html