\documentclass[12pt]{article}
\title{Modern Parser Combinators in Python}
\date{\today}
\usepackage[sc,osf]{mathpazo}
\usepackage[T1]{fontenc}
\usepackage{microtype}
\usepackage{hyperref}
\usepackage{listings}
% \lstset{language=Python}
\lstset{escapechar=\!}
\usepackage[backend=bibtex8,style=authoryear]{biblatex}
\bibliography{combinators}
\begin{document}
\maketitle
\section{Overview}
\label{sec:overview}
The big idea behind this library is to combine techniques from the
parsing literature, some new, some old and neglected, to create a
library that can compete with hand-written recursive-descent parsers.
The key advantages that hand-written recursive-descent parsers have
over traditional LALR parser generators are simplicity, handling a
broader range of grammars, better error reporting, and easier
introduction of custom code. A well-designed library can have most of
these, too, and some advantages over a recursive-descent parser, at
the cost of an implementation that isn't as simple.
\begin{itemize}
\item Parser combinators make parsers easier to learn and write by
embedding a parsing library into a general-purpose programming
language, making it easy to add custom code to the parser and not
requiring that anyone learn a DSL like EBNF or formal grammars.
\item GLL parsing can parse any CFG, including ambiguous and
left-recursive grammars, in $O(n^3)$ time, and unambiguous grammars
in linear time, all while remaining top-down depth-first like
recursive descent and thus easier to understand and debug.
\item Attribute grammars can allow the insertion of arbitrary code
into the parsing process, principally allowing the parsing of
context-sensitive grammars that a general CFG algorithm can't handle
alone, and with the data-dependent grammar restrictions, parse them
in $O(n^3)$ time. They also provide an easy mechanism for adding
semantic actions.
\item Non-correcting error handling can locate and provide reasonable
messages for possible errors in an input string without
providing any spurious messages.
\item Metaprogramming techniques like macros or staging can make
parser combinators almost as efficient as recursive descent.
\item Attribute grammars can be inverted to allow writing both a
parser and an unparser at the same time.
\end{itemize}
\section{Design Decisions}
\label{sec:design_decisions}
\subsection{Grammars}
\label{sec:grammar}
My goal for these parser combinators is that they should be able to
handle common languages and formats \emph{without} having to write
custom code except for extremely weird cases. LL and LR grammars are
too limited: many natural grammars for common problems are not in
LL($k$) or LR($k$) for any $k$, and LR parsers, which can handle more
grammars for any given value of $k$, are difficult to debug.
Meanwhile, there are now several algorithms capable of parsing any CFG
that can also parse all LL and LR grammars in linear time, giving the
same asymptotic performance as LL and LR parsers while handling a much
broader range of grammars.  CFGs also have better closure properties
than LL or LR grammars: they're closed under language composition,
that is, under union, concatenation, and Kleene star.
Parsing problems with one language embedded in another are becoming
increasingly common, and the ability to write parsers for two
different languages and then combine them is useful. Also, the suffix
language of a CFL is another CFL \parencite[p. 401]{grune_jacobs} and
so is its reverse language, both of which are useful for producing
better error messages (\ref{sec:errors}).
Unfortunately, CFGs are also not powerful enough to handle common
parsing problems. Real-world programming languages and data formats
are context sensitive: C has the \texttt{typedef-name: identifier}
problem, Python has context-sensitive indentation, real-world HTML has
context-sensitive features like where \texttt{<input>} tags can
appear, and \TeX's DVI files have references to byte offsets in the
file. Thus, a good parsing library needs to be able to handle context
sensitivity. Note that one example often presented as context
sensitive, a field prefixed with a byte giving the field length like a
Pascal string, is actually regular: because a byte can only hold a
finite set of values, it's possible to write a regular expression for
the whole construct like,
\begin{lstlisting}
0 | 1. | 2.. | !\ldots! | 255.{255}
\end{lstlisting}
where \texttt{.} represents any character and a number in curly braces
represents repetition as usual.
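To make this concrete, here's a sketch (Python 3) of building that
regular expression mechanically over bytes; the names are
illustrative, not part of the library:
\begin{lstlisting}
import re

# One alternative per possible length byte; re.DOTALL lets the dot
# match every byte value, including newlines.
pascal_string = re.compile(
    b"|".join(re.escape(bytes([n])) + b".{%d}" % n
              for n in range(256)),
    re.DOTALL)

assert pascal_string.fullmatch(b"\x03abc")
\end{lstlisting}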
This technique generalizes to any field prefixed with a finite-size
integer and beyond. In any language with finite symbols, any
length-type-value array (also type-length-value;
\url{https://en.wikipedia.org/wiki/Type-length-value}) can produce
only a finite language because the set of possible lengths and
possible types are limited. When the production rule for a
length-type-value array is inserted into a grammar producing an
infinite language, the length-type-value array effectively acts as a
terminal because it can only create a finite set of productions, thus
it can't introduce context-sensitivity unless it's explicitly
introduced elsewhere.  If each length is a single integer symbol,
\texttt{a}, \texttt{b}, \texttt{c}, \ldots are types, and
\texttt{A}, \texttt{B}, \texttt{C}, \ldots are the production rules
corresponding to each type, these can be expressed using the
production rule:
\begin{lstlisting}
S !$\to$! 0a | 0b | 0c | !\ldots! | 1aA | 1bB | 1cC | !\ldots! |
2aAA | 2bBB | 2cCC | !\ldots!
\end{lstlisting}
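These alternatives are easy to enumerate mechanically; an
illustrative Python snippet reproducing the rule above for lengths up
to 2:
\begin{lstlisting}
# Types a, b, c expand to nonterminals A, B, C.
types = {"a": "A", "b": "B", "c": "C"}
alternatives = ["%d%s%s" % (n, t, nt * n)
                for n in range(3)
                for t, nt in types.items()]
# ['0a', '0b', '0c', '1aA', '1bB', '1cC', '2aAA', '2bBB', '2cCC']
\end{lstlisting}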
This kind of use of large but finite productions can handle even some
weird cases like lengths of code that are then treated like their own
languages.  For instance, if nonterminal \texttt{A} always has length
1 and \texttt{B} length 2,
\begin{lstlisting}
S !$\to$! 0 | 1A | 2AA | 2B | 3AAA | 3AB | 3BA | !\ldots!
\end{lstlisting}
However, \emph{efficiently} handling these finite production rules is
another can of worms, because they can be exponential in the number of
symbols in the language and thus any DFA for them would be
impractical.
I chose as my context-sensitive grammar formalism the data-dependent
grammars (DDGs) of \textcite{yakker1}, which are an extension of CFGs
related to L-attribute grammars.  DDGs have several attractive
properties.
First, they're easier to understand and write than the primary
alternatives, PEGs and Boolean grammars, because they use familiar
concepts like variable binding and string recognition based on
semantic values, for instance treating bytes as integers rather than
abstract symbols. While it's often easy enough to write a PEG or a
Boolean grammar for simple languages like $a^nb^nc^n$, something like
a field prefixed with a length in decimal as ASCII digits requires
multiple rules like the above for length-type-value arrays.
Meanwhile, a DDG simply calls a Turing-computable function to
transform the ASCII into an integer. This touches on DDGs' second
major advantage, which is the ability to include arbitrary
Turing-computable code in a disciplined fashion.
\begin{quote}
This lack of enthousiasm in incorporating attribute grammars into
formal language and automata theory may be due partly to the
complexity of the model and partly to the (obvious) fact that
attribute grammars of a very simple type, having only one
synthesized attribute and no inherited attributes, can already
realize all possible translations [18] and can e.g. easily simulate
type 0 Chomsky grammars (cf. [19]). \parencite{1V_AGs}
\end{quote}
The power of DDGs makes analyzing them harder but means that they can
parse anything and can incorporate existing specialized parser code
when necessary. Third, they're based on CFG parsing algorithms, which
makes them easier to adapt for one of the existing efficient CFG
algorithms, with the original authors implementing them in particular
for combinators, for Earley, and for GLR \parencite{yakker2}. Fourth,
their languages can be parsed in $O(n^3)$ time in the worst case, or
only $O(n)$ for deterministic languages and $O(n^2)$ for
locally-ambiguous languages. This is a consequence of the single-pass
nature of DDGs: similar to L-attribute grammars, information flows up
and from left to right in the parse tree with no backtracking allowed.
Fifth, the parsing algorithm for DDGs is proved correct assuming the
underlying CFG parsing algorithm is correct. Sixth, DDGs inherit from
CFGs important closure properties, also being closed under union,
concatenation, and Kleene star. A similar construction to the one for
CFGs shows that the suffix language of a DDG can be described by
another DDG, with one wrinkle: the parser always has to take all
options, treating them as alternatives, when encountering a constraint
based on a variable that has yet to be defined.
TODO: work out the reverse language for a DDG and figure out
something to say about it here. Also work out a formal notion of
undefined for attributes: if a constraint tries to access an
undefined attribute, it falls back to parsing all alternatives.  If
an attribute function receives an undefined attribute as an input, it
always returns undefined.
The two major alternatives I considered were PEGs and Boolean
grammars. They both have the disadvantage that to parse arbitrary
Turing-computable properties requires further extending them. The
major advantage of PEGs is that the corresponding parsing algorithm,
the packrat parser, always runs in $O(n)$ time. However, modern CFG
algorithms can parse most CFGs in linear time, and the ones that they
can't are highly ambiguous and only found in a subset of real parsing
problems like parsing DNA. My experiments indicate that it's possible
to parse many CFGs in $O(n)$ time without any memoization, and
experiments in \textcite{packrat_is_it_worth_it} back this up, showing
that no memoization is often close to optimal. With the high
preexisting memory cost of the parse tree, the memory costs of the
packrat parser make running out of memory even more likely. Aside
from the memory issues, PEGs have three other disadvantages. Making a
PEG parser handle left recursion makes the implementation much more
complicated, particularly when dealing with indirect left recursion,
and costs the linear-time guarantee. PEGs can't be composed in the
same way that CFGs can, since PEGs ``are closed under union, but only
under a non-standard definition of what it means for a string to be in
a language (the PEG need only succeed on a prefix of the
string)'' \parencite[p. 4]{yakker1}. This and other problems are due
to the non-commutativity of alternation in PEGs. Similarly, the way
PEGs avoid ambiguity makes it hard to explicitly deal with it where it
naturally arises, for instance in unparsing the AST for a language
that ignores white space or when trying to do error handling using the
suffix language.
Boolean grammars are a promising approach because they share
properties like closure under union, concatenation, Kleene star,
suffix, and reversal.  However, there hasn't yet been enough work
done on developing practical parsers for them.  Many of the
algorithms known to work for CFGs have not been extended to Boolean
grammars.  In particular, neither of the algorithms friendliest to
parser combinators, GLL and Earley, has been extended, if they even
can be.
There's also nothing on other aspects of building a practical parser
like error handling.
\subsection{Algorithms}
\label{sec:algorithms}
TODO: Reach a new decision here.
On closer analysis, there are two major questions about algorithm
performance. First, while the worst-case performance for all general
CFG algorithms is $O(n^3)$, the set of grammars which they can handle
in linear time differs. One notable advantage of Earley's algorithm
is that with Leo's modifications it can parse all LR-regular languages
in linear time \parencites{leo, marpa}, and the set of LR-regular
languages contains the deterministic languages \parencite{lr-regular}.
Leo accomplishes this by memoizing certain cases of right recursion
that would otherwise take $O(n^2)$. It's not clear to me what
grammars GLL can handle in linear time. Scott and Johnstone claim
GLL, ``runs in linear time on LL grammars and which also allows
grammar rule factorisation, with consequential speed
up'' \parencite{gll1}, which could mean that GLL runs in linear time
on all LL($k$) for all $k$, but that set is a proper subset of the
deterministic grammars. However, Spiewak says, ``Thus, the class of
grammars which can be handled by GLL in O(n) time is precisely
equivalent to LL(1)'' \parencite{spiewak}, which is a much smaller set
than all LL($k$) grammars. Both of these sets are much smaller than
the LR-regular grammars. ANTLR3 and ANTLR4 run in linear time on a
set the authors call LL(*) \parencites{antlr3, antlr4}, which I think
is equivalent to LL-regular, more or less by directly using DFAs to
predict which alternative to pick. It's possible the same approach
might work for GLL to allow it to run in linear time on LL-regular
grammars, though LL-regular is a proper subset of the LR-regular
grammars \parencite{ll-regular}.
The second question involves constant factors. There are conflicting
reports in the literature about how different parsing algorithms
compare to each other. The most extensive comparison of modern
parsing algorithms is in \textcite{antlr4}, where the authors
benchmark their ALL(*) algorithm against GLR, GLL, packrat/PEG, and
hand-written recursive-descent implementations as well as some others.
On their test case, Java code, all of the limited parsers were much
faster than the general parsers, by orders of magnitude in some cases,
with Rascal's GLL parser performing particularly poorly. They don't
directly compare their algorithm against an Earley parser but state
that \textcite{tomita1985efficient} ``shows GLR to be 5x-10x faster
than Earley.'' It's unclear where this penalty comes from, and it
should be noted the comparison may no longer be valid because Tomita
compared the algorithms before Leo published his modifications that
make Earley run in linear time on LR-regular grammars. This still
seems to be the general impression about Earley versus GLR, but I
haven't seen any evidence backing it up outside Tomita's work. They
suggest that based on disabling the caching of lookahead DFAs, ``This
performance is in line with the high cost of GLL and GLR parsers that
also do not reduce parser speculation by memoizing parsing
decisions,'' which seems to suggest part of the problem the general
parsers were experiencing was $O(n^2)$ performance because of local
ambiguities, not constant factors, while in their testing, ALL(*) ran
in linear time on all of the grammars they tried. [TODO: Is this a
consequence of the general parsers only handling a too-limited set of
grammars in linear time? I don't know what grammars GLR can parse in
linear time or if any of these Java grammars fall into that set.]
However, part of the problem could easily be constant factors, with
even a linear GSS being slower than a standard stack, for instance.
They note, ``Of the GLR tools, Elkhound has the best performance
primarily because it relies on a linear LR(1) stack instead of a GSS
whenever possible. Further, we allowed Elkhound to disambiguate
during the parse like ALL(*).''  As ANTLR generates recursive-descent
parsers, this suggests that a parser combinator approach that uses
metaprogramming could be competitive.
TODO: rewrite this.
The backend parsing algorithm will be GLL modified to include the
DDG/L-AG features. While Valiant's algorithm, which is based on the
CYK algorithm and Boolean matrix multiplication, has the best known
worst-case run time, CYK has worse average-case run time than other
CFG parsing algorithms, and linear-time average-case performance is
more important than sub-cubic-time worst-case performance. The main
reason I favor GLL is that the alternative algorithms with linear-time
average-case performance, GLR and Earley, are both harder to
understand, while GLL is almost like a recursive-descent parser.
GLL's similarities to recursive descent make it much easier to
implement as parser combinators and easier to combine with DDGs/L-AGs
because of the top-down left-right information flow in both GLL and
DDGs/L-AGs.
Earley is the clear runner-up here, since there exist implementations
of it using parser combinators and efficient theoretical developments
of the algorithm that include all the key features I want like SPPF
creation. The original DDG paper \textcite{yakker1} implements DDGs
on top of Earley's algorithm.
As far as I know, there are no parser combinators for bottom-up
parsers like GLR, and bottom-up depth-first algorithms are harder to
understand than both top-down depth-first ones like GLL and
breadth-first algorithms like Earley. The original versions of both
GLR and Earley have theoretical problems with producing parse trees,
and for GLR, fixing those problems has created a plethora of
alternatives. There being no canonical version of GLR is a good
secondary reason to avoid it.
\subsection{Error Recovery}
\label{sec:errors}
For error handling I'm using Richter's non-correcting error recovery.
When parsing binary data, one major problem with finding errors is
that the point where the parser fails is often not the point where the
input or parser went wrong. Non-correcting error recovery brackets a
range where the error may have occurred, making it easier to find
errors. It's also more suited to the kinds of errors in binary data
in general, which are usually not like typos or other small isolated
errors but large blocks of broken data, and to Python's interactive
debugging. It uses the suffix grammar and the reverse grammar of the
original input grammar. While deterministic grammars are not closed
under suffix (I don't know about reverse), CFGs are, and because I'm
using GLL, I can use the same algorithm for both normal parsing and
error-handling.
\subsection{Architecture}
\label{sec:architecture}
The reason to write my own library in Python rather than using
Libmarpa CFFI bindings is to support Python variants. CFFI only works
to allow calls from the interpreter level to C functions, not RPython
to C. Meanwhile, Jython and IronPython don't, as far as I know, have
anything to allow calling into C at all. Writing it in Python with
metaprogramming to convert it to RPython will make it
platform-independent. Also worth noting is that after seeing the
performance benchmarks for Scala parser combinators with macros and
staging, I'm skeptical that it's necessary to write a parsing library
in a low-level language to get good performance, and that the kind of
code that's fast doesn't change much from interpreter to interpreter:
it's all simple, imperative-style code with obvious optimizations.
\section{Implementation Order}
\label{sec:implementation_order}
\begin{enumerate}
\item GLL combinators.
\item Non-correcting error recovery, for debugging.
\item Tests.
\item Performance benchmarks.
\item Macros, staging, or other metaprogramming for performance.
\item Data-dependent grammar functionality, in some as yet TBD order.
\item Unparsing.
\end{enumerate}
\section{Major Implementation Choices}
\label{sec:implementation_choices}
\subsection{GSS Representation}
\label{sec:gss_representation}
\textcite{afroozeh_izmaylova} improved the performance of GLL with a
different representation of the GSS that places information on the
edges as well as the nodes to avoid creation of duplicate nodes and
moves the sets $\mathcal{U}$ and $\mathcal{P}$ from global hash tables
to local hash tables on the nodes. This results in some clear
performance gains, and since there's no real cost---their changes only
move information around---and Python doesn't have a native linked-list
data type that would work for the GSS layout described in
\textcite{gll2}, there's no reason not to use their changes. In their
representation, GSS nodes are labeled with a nonterminal (a class
instance) and an index into the input and have links to their outgoing
edges and two local hash tables associated with $\mathcal{U}$ and
$\mathcal{P}$, the former containing descriptors as a grammar slot
plus an index and the latter containing an index. GSS edges are
labeled with a grammar slot, i.e.\ a nonterminal/class instance and
an integer representing a position within that nonterminal's rule.
Each edge has a
link to a GSS node and an SPPF node.
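A minimal Python sketch of this representation, with class and
attribute names of my own choosing:
\begin{lstlisting}
class GSSNode:
    def __init__(self, nonterminal, index):
        self.nonterminal = nonterminal  # parser combinator instance
        self.index = index              # position in the input
        self.edges = []                 # outgoing GSSEdge objects
        self.U = set()                  # local descriptors: (slot, index)
        self.P = set()                  # local pop indices

class GSSEdge:
    def __init__(self, slot, target, sppf_node):
        self.slot = slot            # grammar slot: (nonterminal, position)
        self.target = target        # destination GSS node
        self.sppf_node = sppf_node  # SPPF node carried on the edge
\end{lstlisting}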
\subsection{Return an iterator over parse trees or a shared packed
  parse forest}
\label{sec:iterator_sppf}
Spiewak chose the former option for his GLL combinators in Scala, but
it has a couple of problems. First, it involves freezing the parser
state, which means that starting a parse requires creating a new
parser state. This has some performance implications, particularly
with respect to memory, and makes cleaning up afterwards more
difficult, probably demanding some kind of context manager to close
running coroutines. Second, it means that changing the implementation
of the parse forest almost certainly involves changing the
implementation of the parser itself because the parser state is
entangled with its output. Third, iterating over subtrees as well as
alternative trees will require nested iterators, i.e.\ an iterator
that returns another iterator, which gets confusing and requires some
doubly-suspended state. Grune and Jacobs bring up another problem in
what they call the producer-consumer model: anyone wanting to use the
parse trees will have to analyze multiple trees to figure out how they
differ to figure out what to do with them. (Note that it's easy to
build an infinite iterator, so that method should be able to handle
infinitely-ambiguous grammars.  Avoiding tree-by-tree analysis will
require users to understand the SPPF, plus a good API for analyzing
it without generating trees.)  They point out that coroutines can
reduce
some of these problems by putting the parser and the consumer of the
parse tree on an equal footing.  Moura and Ierusalimschy (2009,
``Revisiting coroutines'') show that asymmetric coroutines are as
powerful as symmetric ones, so it ought to be possible to implement
this in Python, but I suspect the implementation would involve
creating a trampoline that passes data between the producer and
consumer coroutines, and I'm not sure if the added complexity is
worthwhile, especially given it would involve a performance hit.
Thus, I'm running with the SPPF. Initially, I'm going to implement
Tomita's worst-case unbounded-polynomial version rather than Scott and
Johnstone's binarized worst-case cubic version because binarization
will make the resulting parse forest harder to understand and work
with, and I doubt anything this library will ever process will
approach the worst case because that should only happen with
pathologically-ambiguous grammars. However, I should make sure to
encapsulate the SPPF so if I need to binarize the implementation later
I don't need to change anything else.
\subsection{Tree and SPPF API}
\label{sec:tree_sppf_api}
With a traditional recursive-descent parser, the definitions of the
nonterminals in the grammar and their corresponding functions provide
natural nodes for the parse tree. In a parser built from combinators,
however, because the combinators don't intrinsically have nodes, it's
not clear how to build the parse tree. As far as I can tell,
traditional parser combinators always use the sequence combinator to
build the tree out of nested lists/tuples, either by having it build a
flat list/tuple out of two or more alternatives like Construct's
Sequence (Struct, which makes an ordered map, is a variation on this)
or nested binary lists/tuples, like Spiewak's combinators.  Other
non-monadic combinator implementations follow the same general
approach: Hughes in \emph{Generalizing Monads to Arrows} observes,
\begin{quote}
For example, it is fairly clear that a library for parsing should
include a combinator to invoke two parsers in sequence, but there
are many possible ways in which such a combinator might handle the
two parsers' results. In some early parsing libraries the two
results were paired together, in others the sequencing combinator
took an extra parameter, a function to combine the results.
\end{quote}
The clearest explanation I've found of the relationship between this
traditional sequence combinator and monadic combinators is in Vegard
\O ye's article (2012, \url{https://github.com/epsil/gll}): they
define an operation bind (this is the monadic bind) that takes a
parser and a function, applies the parser to the input, applies the
function to a success to get another parser, and then applies the
resulting parser to the remaining input of that success to get the
final output.  They then
define the sequence combinator in terms of a double bind, passing the
list constructor as the function to bind; for a sequence combinator
with more than two elements, they use fold/reduce with list-append to
get a flat list rather than nested lists from one combinator. The
Hutton and Meijer paper (\emph{Monadic Parser Combinators}) also shows
how to build a sequence combinator using bind and a concatenation
function for lists, in this case specifically to avoid nested tuples.
While the standard approach to defining monads, settled on by Haskell
and seemingly imitated by everyone else, is to define a monad in terms
of return (sometimes called result) and bind, one can also define
monads in terms of three functions, return, join, and map (sometimes
fmap), and then define bind in terms of join and map.
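To make the relationship concrete, here's a sketch assuming the
simplest nondeterministic parser type: a function from a text and an
index to a list of (value, next index) results.  None of these names
come from the libraries discussed.
\begin{lstlisting}
def ret(value):                 # aka "return"/"result"
    return lambda text, i: [(value, i)]

def fmap(f, parser):            # apply f to each parse result
    return lambda text, i: [(f(v), j) for v, j in parser(text, i)]

def join(parser_of_parsers):    # flatten a parser that returns parsers
    def parse(text, i):
        return [r for p, j in parser_of_parsers(text, i)
                  for r in p(text, j)]
    return parse

def bind(parser, f):            # bind derived from join and fmap
    return join(fmap(f, parser))

def lit(c):                     # recognize one literal character
    return lambda text, i: ([(c, i + 1)] if text[i:i+1] == c else [])

def seq(p, q):                  # sequence combinator via double bind
    return bind(p, lambda x: bind(q, lambda y: ret([x, y])))

assert seq(lit("a"), lit("b"))("ab", 0) == [(["a", "b"], 2)]
\end{lstlisting}
Here \texttt{seq} pairs results into a flat list; swapping
\texttt{ret([x, y])} for a tree-node constructor is exactly the hook
discussed in the next paragraph.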
Writing monadic (or arrow-style) parsers in Python would be stupid, so
I need a sequence combinator and will follow this universal (as far as
I know) practice and define the nodes in the tree using it.  I can
let it take a constructor with which to build the parse tree,
supplying a sensible default, or expect that standard usage involves
transforming the parse tree with semantic actions.  While monadic
bind
combines the choice of parse tree with the sequence combinator, my
understanding is that separating join and map separates the two, so
having a semantic action combinator following a sequence combinator
can simulate any choice of bind for monadic combinators.
Irrespective of the internal representation, I also need to figure out
the interface the SPPF presents to the users. The SPPF itself is hard
to understand so a direct implementation would make for a bad API. My
best idea for handling this is to build a tree-centric API where users
interact with the SPPF as if it was a collection of trees. One
obvious problem after creating the SPPF is how to allow user code to
examine individual trees. Luckily, this problem has an obvious
solution: instead of returning a copy of an individual tree, I can
return an object that contains references to the correct nodes in the
SPPF. However, implementing anything more than simple views on the
SPPF is \emph{hard}. I still haven't come up with a better idea for
the tree API itself than Construct's: because trees are a recursive
data structure, allowing recursive references using Python's []
addressing provides a natural API.
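Here's a hypothetical sketch of such a view; every name, including
the assumed \texttt{sppf.children} method, is a placeholder rather
than an existing API:
\begin{lstlisting}
class TreeView:
    """One tree in the forest, held as references into the SPPF."""
    def __init__(self, sppf, label, choices):
        self.sppf = sppf        # the shared forest, never copied
        self.label = label      # (parser, start, end) node label
        self.choices = choices  # which packed node to take at each
                                # ambiguous node in this tree

    def __getitem__(self, index):
        child_labels = self.sppf.children(self.label, self.choices)
        return TreeView(self.sppf, child_labels[index], self.choices)
\end{lstlisting}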
That said, for users to deal with ambiguity in any efficient manner
will probably require users to both understand the SPPF and have
direct access to it; I have no idea how to implement such an API or to
integrate it with the hopefully-simpler tree-centric API.
How semantic actions interact with the SPPF presents some thorny
issues.  Semantic actions are essential: aside from being needed to
emulate the power of monadic combinators, they're required for
directing parsing with previously-parsed input (essential in many
common parser tasks like ignoring whitespace, and one of the better
reasons for using a top-down parser in the first place) and possibly
for performing disambiguation while the parser is running.  Because
I can't figure out how to transform the SPPF as if it was a collection
of trees, though, without brute force and maybe not even then, I still
haven't settled on how to set semantic actions up. I have some
incompletely-realized partial ideas:
\begin{itemize}
\item Higher-order functions on trees: filter, map, reduce (see the
  sketch after this list).
\item A higher-order function that converts a function acting on trees
to a function acting on SPPFs. I don't know how to write it, though.
\item The visitor pattern and other kinds of traversals with functions
acting on the tree/SPPF.
\end{itemize}
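As a sketch of the first idea, mapping a function over a single tree
view is easy; \texttt{tree.value} and \texttt{tree.children} are
assumed names, and lifting this to the whole SPPF is the open
question:
\begin{lstlisting}
def tree_map(f, tree):
    # Build a new tree bottom-up rather than mutating the SPPF.
    return (f(tree.value),
            [tree_map(f, child) for child in tree.children])
\end{lstlisting}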
\subsection{Tree and SPPF representations}
\label{sec:tree_sppf_representations}
Traditional parser combinators don't provide any natural node labeling
because of the absence of labeled nonterminals. A tree representation
based on high-level sequences and ordered mappings, in Python
lists/tuples and OrderedDicts, doesn't need these node labels, and
it's possible to implement combinators that return trees or an
iterator over trees without them. However, for a shared packed
forest, labels are required for implementing subtree sharing because
without them, combinators in different branches of the parse traversal
have no way to know when they've generated the same subtree.
Moreover, GLL needs labels to merge different branches of the parse in
the GSS, so while it's possible to omit node labels altogether in a
traditional recursive-descent parser, any GLL parser needs labels.
There's a natural labeling scheme for parse trees based on
partitioning the input that Scott and Johnstone use but never
explicitly explain. Each terminal or uninterrupted sequence of
terminals has a starting and ending offset so the corresponding leaf
node can be labeled with (terminals, starting\_offset,
ending\_offset). The leaf nodes read in offset order must cover the
entire input, and in fact reading them in order in the absence of
semantic actions will return the original input. The rest of the tree
nodes represent partitions further subdivided into subpartitions
labeled with nonterminals, with the root node corresponding to
(start\_symbol, 0, length(input) - 1) (zero-indexed). Because parser
combinators don't have nonterminal labels, Spiewak uses parser
identity instead. Like Spiewak, I intend to use parser combinator
class instances as node labels in GLL. Note that generator objects
can't be used for this, since a new generator object is created every
time a generator function is called, and neither can the method or
code objects, because they're shared by all instances of the same
parser combinator class.
Both trees and forests are fundamentally directed graphs, normally
acyclic but in the case of infinitely-ambiguous grammars cyclic. In
Python, one possible graph representation is a dict of lists, where
the lists contain the labels of other nodes. By using the labels
discussed above as the labels for the nodes, it's possible to
represent a parse tree as a digraph where the values of leaf nodes are
Python objects representing terminals and the values of non-leaf nodes
are lists of either pairs of integers, the direct labels for other
nodes, or single integers representing the subpartitions for that
node. Eliding the terminal and nonterminal symbols for clarity, for
instance:
\begin{lstlisting}
(1, 8) : [(1, 2), (2, 5), (5, 8)]
(1, 8) : [2, 5]
\end{lstlisting}
The superfluous integers correspond to the ``repeated dimensions'' that
Johnstone and Scott describe in \emph{Modelling GLL Parser
Implementations} (2010). The latter representation obviously takes
less memory but node operations will be slower because correct
offset-pairs will have to be generated from the partitions when
they're needed.
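In Python terms, the two representations and the conversion between
them look roughly like this sketch (symbols elided as above):
\begin{lstlisting}
explicit = {(1, 8): [(1, 2), (2, 5), (5, 8)]}  # child labels
compact  = {(1, 8): [2, 5]}                    # partition points only

def child_labels(node, partitions):
    # Recover explicit (start, end) pairs from partition points.
    starts = [node[0]] + partitions
    ends = partitions + [node[1]]
    return list(zip(starts, ends))

assert child_labels((1, 8), compact[(1, 8)]) == explicit[(1, 8)]
\end{lstlisting}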
The SPPF can have almost an identical representation to the parse
trees themselves because it's also fundamentally a digraph. There are
only two differences: nodes can have more than one parent,
corresponding to subtree sharing, and there has to be a way to
distinguish packed nodes from normal nodes because they have different
meanings. In the binarized SPPF, Johnstone and Scott define normal
nodes using two offsets, called "left extent" and "right extent", and
packed nodes using a single integer, called "pivot." In the
partition representation any node with only two children can be
represented with a single integer, the division between the subtrees.
A packed node is the result of combining at least two nodes, so a
minimal binarized packed node looks like:
Original nodes:
\begin{lstlisting}
(0, 4): [(0, 1), (1, 4)]
(0, 4): [(0, 3), (3, 4)]
\end{lstlisting}
Packed node:
\begin{lstlisting}
(0, 4): [((0, 3), (3, 4)), ((0, 1), (1, 4))]
\end{lstlisting}
Binarized nodes:
\begin{lstlisting}
(0, 4): [Packed(1), Packed(3)]
Packed(1): [(0, 1), (1, 4)]
Packed(3): [(0, 3), (3, 4)]
\end{lstlisting}
However, I don't understand the relationship between Scott and
Johnstone's partition model and parses that don't consume all the
input. The natural way to implement nondeterminism leads to returning
all incomplete parses as well as the complete parses. Both Spiewak
and Hutton and Meijer mention this natural ambiguity.
\begin{quote}
It is also worth noting that we do not allow \texttt{Success}(es)
which have failed to consume the entire input stream. We will
actually receive a \texttt{Success} every time a \texttt{Parser}
reduces successfully. While this is technically correct (reflecting
the ambiguity between greedy and non-greedy matching), it is almost
never the desired semantics. (Spiewak)
\end{quote}
Scott and Johnstone's constructions of parsers from recognizers seem
to rely on greedy matching, because their partitioning scheme only
works with matches of the full input. They also restrict packing to
nodes that share the same partition, ``Nodes can be packed only if
their yields correspond to the same portion of the input string''
(Scott, \emph{SPPF-Style Parsing from Earley Recognisers}). This
doesn't seem to work in cases with partial/non-greedy matching because
every nontrivial node will have partial matches that don't cover the
same partition.
A further observation that may or may not be related is that their
node labeling scheme seems to include unneeded information. A given
parser started at the same position should always produce the same
output, so all that's needed for a unique node label is (parser label,
starting index into the input). In fact, in one of their early papers
(\emph{BRNGLR: a cubic Tomita-style GLR parsing algorithm}), they use
this labeling scheme, rather than the later three-element labels. I
don't understand why, and if or how this might be connected with
greedy matching. A final note on the greedy-matching issue is
that the obvious implementation of nondeterminism applied naively
leads to the wrong recognizer, because a partial match will report
success even if the whole string is not matched.
In \emph{Modelling GLL Parser Implementations}, they use a
table-based representation
for all the data structures, including the SPPF. Following the data
structure refinement process outlined there, the declarations for the
SPPF are:
\begin{verbatim}
SPPFSymbolNode (SPPFLabel s, Int leftExtent, Int rightExtent)
SPPFPackNode (SPPFLabel s, Int pivot)
SPPF (SPPFSymbolNode symbolNode, SPPFPackNode packNode, SPPFSymbolNode left, SPPFSymbolNode right)
SPPF sppf
\end{verbatim}
Substituting:
\begin{verbatim}
sppf ((*symbolNode) SPPFLabel, Int, Int,
      (*packNode) SPPFLabel, Int,
      (*left) SPPFLabel, Int, Int,
      (*right) SPPFLabel, Int, Int)
\end{verbatim}
The left extent for the left node is the same as parent's, the left
node's right extent is the same as the left extent of the right node,
and the right extent for the right node is the same as the parent's.
The SPPFLabel for the pack node and the parent node are also the same.
\begin{verbatim}
sppf ((*symbolNode) SPPFLabel, Int, Int,
      (*packNode) Int,
      (*left) SPPFLabel, Int,
      (*right) SPPFLabel)
\end{verbatim}
This still seems like too much: it's a seven-dimensional table with
four dimensions indexed by integers, which as they note in the paper
is too large.
The clearly-fastest implementation with the best encapsulation is to
use Cython to write a specialized SPPF object, with the exact methods
needed for building it, with an appropriate internal representation,
possibly based on the ideas from the \emph{Modelling GLL Parser
Implementations} refinement discussed above.  All other approaches
will require methods implemented in Python (which will be slow) to do
the necessary transformations. To achieve anywhere near acceptable
performance, I have to use native Python data structures or C/C++
extensions. Any encapsulation has a performance overhead because I
won't be able to use comprehensions to speed up object creation, but
beyond that full encapsulation from the parsers will cost more speed
because they will have to call methods in Python. On the whole, I'm
convinced enough of the advantages of encapsulation, particularly the
possible need to rewrite the whole thing in Cython, that the first
version will involve partial encapsulation with methods written in
Python for the user API and hard-coding the parsers. On that point,
there's very little reason, as far as I can tell, to expose the
internal node labels to the user.
There's one other possible representation of the SPPF, about which I
know little, called parse-forest grammars, introduced on pp.~91--93 of
Grune and Jacobs. They have production rules labeled almost
identically to Johnstone and Scott's SPPF, a nonterminal, a starting
position, and a length (which is isomorphic to using the ending
position). I personally don't see how any of the supposed advantages
Grune and Jacobs list for them are good for writing practical parsers
and the API they'd provide would be very unintuitive for people used
to conventional parsers, but they're another possible internal
representation.
\subsection{Immutability vs. mutability}
\label{sec:immutable_mutable}
There are two different approaches to the data flow through the
parser: the functional-immutable style, where functions (or in this
case, coroutines) create objects and return them, and the
mutable-state style, where a mutable object is passed into a function
or coroutine, which then mutates it and returns nothing.  The choice
of which style
to use can be made on a per-object basis, but because in most
languages including Python functions can only return one object, and
in the case of Python coroutines can only yield and accept through
send() one object, functions that need to return more than one object
have to aggregate all the objects they need to return into one object.
Handling the aggregation and disaggregation for parsers is an example
of the monadic pattern, as even recognizer combinators need to return
a Boolean representing whether a given input is in the language
generated by a grammar and also a stream object representing the
truncation of its input. Real parsers always need to return the same
Boolean, a stream, and a parse tree. Other patterns are possible:
Construct uses a Python stream, which is a mutable object, and thus
doesn't need to return a stream. An empty parse tree or some kind of
null result (None in Python) can represent failure, but this is a bad
idea since it provides no information about where the failure occurred
or what happened.
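As a sketch with illustrative names, a recognizer in the
functional-immutable style might aggregate its outputs like so:
\begin{lstlisting}
from collections import namedtuple

# One aggregate object standing in for multiple return values.
Result = namedtuple("Result", ["success", "index", "tree"])

def match_a(text, index):
    if text[index:index + 1] == "a":
        return Result(True, index + 1, ("a", index, index + 1))
    return Result(False, index, None)  # failure still carries the index
\end{lstlisting}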
In traditional object-oriented recursive-descent parsers, the whole
parser would be one class and mutable objects could be handled as
shared state. As it is, since my combinators are classes, I have to
pass mutable objects instead. This isn't the worst thing in the world
since passing objects will avoid self look-ups, helping performance,
and is explicit rather than implicit. Several of the implementations
I've looked at do similar things. Spiewak's GLL combinators use a
functional-immutable style for the tree, returning a list in the
simplified implementation and a stream/iterator in the real
implementation, but an OO class for the trampoline. Scott and
Johnstone's example GLL implementation uses an OO class for the parser
with a mutable shared SPPF. Jim and Mandelbaum's transducers for
data-dependent grammars also use a mutable SPPF.
For GLL, the combinators could potentially return success/failure, the
GSS, the SPPF, the stream, and the pointer. Mutable objects only need
to be passed in when starting a new coroutine instance and don't need
to be yielded or passed via send() because every coroutine will
already have access to them. The advantage of mutability is speed,
and its main problem is that it always has more potential for creating
hard-to-isolate bugs. Unfortunately, Python is simply not designed
for the functional-immutable style and obviously has no optimizations
in the compiler/interpreter for handling immutable object creation for
non-built-ins. Thus, the functional-immutable style with an
encapsulated Python object will be many times over too slow.
(\url{http://www.valuedlessons.com/2008/03/why-are-my-monads-so-slow.html}
gives some numbers suggesting how much too slow.)
For the SPPF, the only way to implement sharing in a functional
fashion is to pass a memoized cache and discard it after building the
SPPF. Otherwise, combinators in different branches of the parse won't
be able to share subtrees. However, a memoized cache itself is most
of the way to an SPPF, so there's no real reason to use the former in
place of the latter. The SPPF is so large (I'm estimating a factor of
>100 times over the input, and that's optimistic) that recreating it
in every parser is wildly impractical, so it has to be mutable.
However, there's a further problem with building the SPPF as a mutable
object passed among combinators that Jim and Mandelbaum describe in
\emph{Delayed Semantic Actions in Yakker}: nodes that get added to the
SPPF during branches of the parse that eventually dead-end will
continue to exist in the final SPPF. The solution they use in OCaml
is to use a weak hash table. The best option in Python is similarly
to use a weak dictionary. This creates several knock-on effects.
First, strong references to the nodes in the SPPF have to be stored in
the SPPF \emph{somewhere} or else the whole structure will get
garbage-collected, and the only reasonable place to put them is in the
nodes themselves, thus a weak dictionary of objects that hold
references to other nodes.  Second, most Python built-in types are
not weak-referencable in CPython, though subclasses of them are:
subclasses of lists and dicts can be weak-referenced, while lists and
dicts themselves can't.  (None of this behavior is defined for all
implementations, though PyPy seems to follow CPython here.) Changing
from subclasses of tuple to list has memory and speed penalties, but
the alternative to using Python's weakref module is doing manual
object destruction myself, and that's just awful. Third, to keep
nodes alive while building the SPPF I need strong references outside
it, which means the combinators need to return strong references.
Fourth, terminals have to be enclosed in some kind of
weak-referencable proxy object.
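A minimal sketch of this scheme, with names of my own choosing: a
weak dictionary keyed by node label, whose values are
weak-referencable node objects holding strong references to their
children, so nodes from dead-end branches disappear once the last
outside strong reference goes away.
\begin{lstlisting}
import weakref

class Node:
    # A plain class (not a bare tuple or list) so it can be
    # weak-referenced in CPython.
    __slots__ = ("label", "children", "__weakref__")
    def __init__(self, label, children):
        self.label = label
        self.children = children  # strong refs keep live subtrees alive

class SPPF:
    def __init__(self):
        self.nodes = weakref.WeakValueDictionary()

    def node(self, label, children=()):
        try:
            return self.nodes[label]   # share an existing subtree
        except KeyError:
            n = self.nodes[label] = Node(label, list(children))
            return n  # the caller must hold this strong reference
\end{lstlisting}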
That said, while the SPPF has to be mutable, the nodes making it up can
be immutable. There are advantages and disadvantages to both
approaches.
\begin{itemize}
\item Mutable nodes make it easier to transform trees and reduce the
memory overhead of transformation operations since they avoid
copying. How common is it to need to transform a parse tree,
though? For writing a compiler/interpreter or a converter for a
binary data format, for instance, the parse tree is used to build
another kind of object, not mutated in-place.
\item Immutable nodes prevent a broad class of potentially
hard-to-find bugs related to mutating a tree in one node in an SPPF
and causing changes that propagate to other trees incorrectly. This
is a particular problem with semantic actions where users shouldn't
need to understand the particulars of the parsers or the SPPF.
\item Immutable objects could provide memory advantages, but Python's
implementation makes this difficult. There's no built-in immutable
mapping type, and tuples can't be weak-referenced. To get the
memory savings, I'd either have to give up speed by using
composition and methods implemented in Python to enclose tuples in a
weak-referencable object or use Cython to implement my own types
in C.
\item Immutable containers are hashable, which is essential for any
kind of elegant way of handling which packed nodes to traverse in a
tree view of an SPPF and useful for implementing packed nodes as
frozensets, gaining the automatic ability to avoid duplicates when
packing.
\end{itemize}
Spiewak passes the trampoline that contains the GSS as a mutable
object for ``convenience and efficiency,'' and I'm leaning towards
following his lead. For parsing, I'm passing the stream as an
immutable object. That leaves only success/failure, the pointer, and
a node reference as immutable objects I need to create in each combinator.
\begin{itemize}
\item GSS: mutable, passed
\item SPPF: mutable, passed
\item Stream: immutable, passed
\item Success/Failure: immutable, returned
\item Stream pointer: immutable, returned
\item Node: immutable, stored in SPPF and returned
\end{itemize}
There is one other possible way to circumvent the bad performance of
the functional-immutable style, which is to use compilation/code
generation to eliminate the intermediate data structures. This may
turn out to be the best option, and in the end is almost certain to be
the fastest, especially if I go all the way to a two-stage compiler
that compiles Python to Cython and Cython to C. At the moment, I
don't understand it well enough to enumerate its advantages and
disadvantages.
\subsection{Outer-Loop Trampoline}
\label{sec:trampoline}
In Python, the trampoline has to run as the outermost loop and call
the coroutines because Python coroutines are asymmetric: they can
only return control to their callers with yield.  If I instead try to
pass
the trampoline through the combinators, there's no way for them to
pass control to the trampoline. Conveniently, GLL is organized with a
dispatch function such that every parser returns control to it after
finishing execution; that dispatch loop is my trampoline.
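A stripped-down sketch of that shape, with each parser as a coroutine
that yields new coroutines for the loop to schedule (the real
dispatcher also carries GSS and SPPF bookkeeping):
\begin{lstlisting}
def trampoline(start, text):
    pending = [start(text, 0)]  # coroutines waiting to run
    while pending:
        current = pending.pop()
        try:
            spawned = current.send(None)  # run until the next yield
        except StopIteration:
            continue                      # this parser is finished
        pending.append(current)           # resume the yielder later
        if spawned is not None:
            pending.append(spawned)       # schedule the new parser
\end{lstlisting}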
\subsection{Yield from}
\label{sec:yield_from}
The main problem with ``yield from'' is that it's only available on
Python 3.  Also, experimentation suggests that as of 3.4, using
``yield from'' still hits the maximum recursion depth, which means
that ``yield from'' still adds a stack frame for each call, and
Python has a small stack limit.  On the other hand, without ``yield
from'', Python's coroutines are not stackful (see Moura 2009,
``Revisiting Coroutines''), which means they may not be as powerful
as one-shot continuations and might not be powerful enough for GLL.
That said, I know that ``yield from'' is internally implemented with
a trampoline, and the existence of Eby's ``fiendishly clever''
trampoline implementation for Python 2 suggests it ought to be
possible to use a trampoline to convert Python 2's stackless
coroutines into stackful coroutines by manually implementing a stack.
I also might end up needing a trampoline anyways to handle the
graph-structured stack, in which case the advantage offered by
``yield from'' may be diminished.
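The experiment is easy to reproduce (catching RuntimeError, of which
RecursionError is a subclass on Python 3.5+):
\begin{lstlisting}
def countdown(n):
    if n:
        yield from countdown(n - 1)
    yield n

try:
    # Well past the default recursion limit of 1000 frames.
    list(countdown(5000))
except RuntimeError:
    print("each delegation still consumes a stack frame")
\end{lstlisting}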
My tentative analysis is that ``yield from'' simply isn't useful for
GLL.  GLL's dispatch stack contains information about the GSS and
SPPF as well as which parser needs to resume control, so replacing
the stack with ``yield from'' isn't going to get me very far, since
I'll still need another stack to hold the GSS and SPPF information,
even if it's possible to handle the dispatching without that added
information.  I suspect that it's either impossible to use ``yield
from'' with GLL or that the added complexity from trying to integrate
the two would completely negate any performance or simplicity gains.
This, combined with the lack of ``yield from'' in 2.7, means I'm not
going to try to use it.
\subsection{Indices versus Slicing}
\label{sec:indices_slicing}
With Python 2.7+'s memoryviews, it's possible to slice the stream
without copying, embedding slices in the parse forests passed between
combinators. The main disadvantage of this approach is that it means
the combinators can't handle text, particularly Unicode, since there's
no equivalent for Python strings. There are also three other factors
to consider: all of the Python library functions (string methods,
struct, re, and so on) take an optional argument that's a starting
index anyways, there's no equivalent stringview object (I'd have to
write one), and indices provide natural labels for the GSS and SPPF
(and memoryviews are only hashable in 3.3+, so they can't take over
that role in 2.7). Primarily because of the last two considerations,
I'm going with indices.
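For reference, this is the zero-copy slicing that memoryview provides
for binary data; there is no equivalent for str:
\begin{lstlisting}
data = bytes(range(256))
view = memoryview(data)
sub = view[16:32]        # zero-copy slice; still references data
assert sub[0] == 16
assert bytes(sub) == data[16:32]
\end{lstlisting}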
To my surprise, when I profiled indices versus string slicing
directly, string slicing was slower than indices on both CPython and
PyPy 3, though as expected the indexed version consumed much, much
less memory on CPython. The extra time seemed to be spent in the
hash-heavy functions, so this might have something to do with
CPython's optimizations for hash tables with string-only keys. I
still don't understand why the cost of additional allocations didn't
overwhelm that, but this is why we profile. However, the memory
consumption issue and aforementioned considerations mean I'm sticking
with indices.
\subsection{Module for handling bits}
\label{sec:bits_module}
The main module implemented in C for bit arrays/bit vectors is
\href{https://github.com/ilanschnell/bitarray}{bitarray}
(\href{https://pypi.python.org/pypi/bitarray/}{PyPi}). I don't know
how hard it will be to make this work with PyPy, but I'm going to give
it a shot. The main two pure-Python implementations are
\href{https://engineering.purdue.edu/kak/dist/BitVector-3.3.2.html}{BitVector}
(\href{https://pypi.python.org/pypi/BitVector/3.3.2}{PyPi}) and
\href{https://code.google.com/p/python-bitstring/}{bitstring}
(\href{https://pypi.python.org/pypi/bitstring/3.1.3}{PyPi}). For the
prototype, I'm going with bitstring but will return to bitarray once I
look at performance optimizations.
\subsubsection{bitarray}
\label{sec:bitarray}
This ought to be the fastest because it's implemented in C, but by the
same token, it will probably pose the biggest compatibility problems,
particularly with PyPy. The API is clean and mimics the standard
sequence types, but has the major disadvantage of having no offset
option in any of the constructors, which would require copying without
using memoryview, and some imperfections in the handling of
conversions (it has a tostring() method, but I don't know if calling
str() works, for instance). There's no immutable bitarray type, which
would mean that I'd have to make my own. Bitarray is packaged for
Ubuntu.
\subsubsection{bitstring}
\label{sec:bitstring}
Implemented in pure Python, this ought to be compatible with
everything.  There's also a Cython version, though from looking at
it, it doesn't seem to have many optimizations over the pure-Python
version.  The API includes mutable and immutable types and supports
reading from a binary object at an arbitrary offset, though obviously
it doesn't
support the buffer protocol, and works the way it should with respect
to functions like str(). On the whole, it has clearly the best API.
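As a small usage sketch (based on my reading of the bitstring
documentation, so treat the exact format strings as unverified),
reading the length-prefixed field from earlier:
\begin{lstlisting}
from bitstring import ConstBitStream

s = ConstBitStream(bytes=b"\x03abc")
length = s.read("uint:8")            # length prefix as an integer
assert s.read("bytes:%d" % length) == b"abc"
\end{lstlisting}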
\subsubsection{BitVector}
\label{sec:bitvector}