% !TEX root = omar-thesis.tex
\section{Existing Approaches}\label{sec:existing-approaches}
The definitions in the previous section adequately encode the semantics of lists and regular expressions, but they are not particularly convenient. Our task in this section is to consider various mechanisms of syntactic control, i.e. mechanisms that can be deployed to help decrease the syntactic cost of expressions and patterns involving these constructs (without changing their meaning.)
% There are many approaches that a library provider might consider. These differ with regard to how much control they give the library provider over syntactic form, and the broader cognitive burdens that they impose on client programmers, particularly when programming in the large.
We begin in Sec. \ref{sec:standard-abstraction-mechanisms} by considering standard abstraction mechanisms available in languages like ML. We then consider a system of dynamic quotation parsing available in some dialects of ML in Sec. \ref{sec:dynamic-quotation}.
These methods give library providers only limited control over form and operate at ``run-time.'' To gain more precise control over form at ``compile-time'', a library provider, or another interested party, can define a ``library-specific'' syntax dialect using a \emph{syntax definition system}. The next several sections consider various syntax definition systems:
\begin{itemize}
\item In Sec. \ref{sec:Fixity-directives}, we consider infix operator definition systems.
\item In Sec. \ref{sec:mixfix}, we consider somewhat more expressive mixfix systems.
\item In Sec. \ref{sec:grammars}, we consider grammar-based syntax definition systems.
\item In Sec. \ref{sec:parser-combinators}, we consider parser combinator systems.
\end{itemize}
The systems in Sec. \ref{sec:grammars} and Sec. \ref{sec:parser-combinators} give essentially complete control over form to their users. We give examples of dialects that can be constructed using these systems in Sec. \ref{sec:examples-of-syntax-dialects}. Then, in Sec. \ref{sec:problems-with-syntax-dialects}, we discuss the difficulties that programmers can expect to encounter if they use these systems when programming in the large (as a follow-up to what was discussed in Section \ref{sec:problems-with-dialects}.)
An alternative approach is to leave the syntax of the language fixed but allow programmers to contextually repurpose existing forms using a \emph{term rewriting system}. We consider {non-local term rewriting systems} in Sec. \ref{sec:non-local-term-rewriting} and {local term rewriting systems}, which are also known as \emph{macro systems}, in Sec. \ref{sec:macro-systems}.
% \subsection*{More General Syntax Definition Systems}\label{sec:syntax-dialects}
% More general syntax definition systems give users more direct control over form than the infix and mixfix systems just described. In the next two subsections, we will begin by describing some notable systems, then give two examples that demonstrate their expressive power. We will
\subsection{Standard Abstraction Mechanisms}\label{sec:standard-abstraction-mechanisms}
The simplest way to decrease syntactic cost is to capture idioms using the standard abstraction mechanisms of our
language, e.g. functions and modules.
We already saw examples of this approach in the previous section. For example, we defined the list value constructors, which capture the idioms of list construction. Such definitions are common enough that VerseML generates them automatically.
We also defined a utility functor for regexes, \li{RXUtil}, in Figure \ref{fig:RXUtil}. As more idioms involving regexes arise, the library provider can capture them by adding additional definitions to this functor. For example, the library provider might add the definition of a value that matches single digits to \li{RXUtil} as follows:
\begin{lstlisting}[numbers=none]
val digit = R.Or(R.Str "SSTR0ESTR", R.Or(R.Str "SSTR1ESTR", ...))
\end{lstlisting}
Similarly, the library provider might define a function \li{repeat : R.t -> int -> R.t} that constructs a regex by sequentially repeating the given regex a given number of times (not shown.)
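Since the definition of \li{repeat} is not shown above, we sketch one possible implementation below (an assumption on our part; it treats \li{n <= 1} as the regex itself):
\begin{lstlisting}[numbers=none]
(* sketch: n-fold sequential repetition of r, for n >= 1 *)
fun repeat (r : R.t) (n : int) : R.t =
  if n <= 1 then r else R.Seq(r, repeat r (n - 1))
\end{lstlisting}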
Using these definitions, a client can define a regex that matches U.S. social security numbers (SSNs) as follows:
\begin{lstlisting}[numbers=none]
val dash = R.Str "SSTR-ESTR"
val repeat_d = RU.repeat RU.digit
val ssn = R.Seq(repeat_d 3, R.Seq(dash, R.Seq(repeat_d 2,
R.Seq(dash, repeat_d 4))))
\end{lstlisting}
The syntactic cost of this program fragment is lower than the syntactic cost of the equivalent program fragment that applies the regex value constructors directly.
One limitation of this approach is that there is no standard way to capture idioms at the level of patterns. Pattern synonyms have been informally explored in some languages, e.g. in an experimental extension of Haskell implemented by GHC \cite{GHC-extensions} and in the $\Omega$mega language \cite{DBLP:conf/cefp/SheardL07}, but these are limited in that arbitrary computations cannot be performed.
Another limitation is that this approach does not give library providers control over form. For example, we cannot ``approximate'' SML-style derived list forms using only auxiliary values like those above.
Similarly, consider the textual syntax for regexes defined in the POSIX standard \cite{STD95954}. Under this syntax, the regex that matches DNA bases is drawn as follows:
\begin{lstlisting}[numbers=none]
A|T|G|C
\end{lstlisting}
Similarly, the regex that matches SSNs is drawn:
\begin{lstlisting}[numbers=none]
\d\d\d-\d\d-\d\d\d\d
\end{lstlisting}
or
\begin{lstlisting}[numbers=none]
\d{3}-\d{2}-\d{4}
\end{lstlisting}
These drawings have substantially lower syntactic cost than the drawings of the corresponding VerseML encodings shown above. Data suggests that most professional programmers are familiar with POSIX regex forms \cite{Omar:2012:ACC:2337223.2337324}. These programmers would likely agree that the POSIX forms have lower cognitive cost as well.
\vspace{-6px}
\subsubsection{Dynamic String Parsing}\label{sec:dynamic-string-parsing}
We might attempt to approximate the POSIX standard regex syntax by defining a function \li{parse : string -> R.t} in \li{RXUtil} that parses a VerseML string representation of a POSIX regex form, producing a regular expression value or raising an exception if the input is malformed with respect to the POSIX specification.
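For intuition, such a function might be sketched as follows for a tiny fragment of the POSIX syntax, handling only alternation of single characters (the exception name is hypothetical; a real implementation would cover the full POSIX grammar):
\begin{lstlisting}[numbers=none]
exception MalformedRegex
(* sketch: parses only inputs like "A|T|G|C" *)
fun parse (s : string) : R.t =
  let
    fun go [c] = R.Str (String.str c)
      | go (c :: #"|" :: rest) = R.Or(R.Str (String.str c), go rest)
      | go _ = raise MalformedRegex
  in go (String.explode s) end
\end{lstlisting}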
Given this function, a client could construct the regex matching DNA bases as follows:
\begin{lstlisting}[numbers=none]
RU.parse "SSTRA|T|G|CESTR"
\end{lstlisting}
This approach, which we refer to as \emph{dynamic string parsing}, has several limitations:
\begin{enumerate}
\item First, there are syntactic conflicts between standard string escape sequences and standard regex escape sequences. For example, the following is not a well-formed drawing according to the textual syntax of SML (and many other languages):
\begin{lstlisting}[numbers=none,mathescape=|]
val ssn = RU.parse "SSTR\d\d\d-\d\d-\d\d\d\dESTR" (* ERROR *)
\end{lstlisting}
In practice, most parsers report an error message like the following:\footnote{This is the error message that \texttt{javac} produces. When compiling an analogous expression using SML of New Jersey (SML/NJ), we encounter a more confusing error message: \texttt{Error: unclosed string}.}
\begin{lstlisting}[numbers=none]
error: illegal escape character
\end{lstlisting}
In a small lab study, we observed that even experienced programmers made this class of mistake and could not quickly diagnose the problem and determine a workaround if they had not used a regex library recently \cite{Omar:2012:ACC:2337223.2337324}.
The workaround -- escaping all backslashes -- nearly doubles syntactic cost here:
\begin{lstlisting}[numbers=none]
val ssn = RU.parse "SSTR\\d\\d\\d-\\d\\d-\\d\\d\\d\\dESTR"
\end{lstlisting}
Some languages build in alternative ``raw'' string forms that leave escape sequences uninterpreted. For example, OCaml supports alternative string literals delimited by matching marked curly braces, e.g.
\begin{lstlisting}[numbers=none]
val ssn = RU.parse {rx|SSTR\d\d\d-\d\d-\d\d\d\dESTR|rx}
\end{lstlisting}
\item The next limitation is that dynamic string parsing does not capture the idioms of compositional regex construction.
For example, the function \li{lookup_rx} in Figure \ref{fig:lookup_rx} constructs a regex from the given string and another regex. We cannot apply \li{RU.parse} to redraw this function equivalently, but at lower syntactic cost.
\begin{figure}[h]
\begin{lstlisting}[numbers=none]
fun lookup_rx(name : string) =>
R.Seq(R.Str name, R.Seq(R.Str "SSTR: ESTR", ssn))
\end{lstlisting}
\caption{Compositional construction of a regex}
\label{fig:lookup_rx}
\end{figure}
%We needed to use both dynamic string parsing and explicit applications of pattern constructors to achieve the intended semantics.
We will describe derived forms that do capture the idioms of compositional regex construction in Sec. \ref{sec:syntax-dialects} (in particular, we will compare Figure \ref{fig:lookup_rx} to \ref{fig:derived-spliced-subexpressions}.)
Dynamic string parsing cannot capture the idioms of list construction for the same reason -- list expressions can contain sub-expressions.
%(we will see an example of syntax that does capture such idioms below).
\item Using strings to introduce regexes also creates a \emph{cognitive hazard} for programmers who are coincidentally working with other data of type \li{string}. For example, consider the following na\"ively ``more readable definition of \lstinline{lookup_rx}'', where the infix operator \li{^} means string concatenation:
\begin{lstlisting}[numbers=none,escapechar=~]
fun lookup_rx_insecure(name : string) =>
RU.parse (name ^ {rx|SSTR: \d\d\d-\d\d-\d\d\d\dESTR|rx})
\end{lstlisting}
or equivalently, given the regex \li{ssn} as above and an auxiliary function \li{RU.to_string} that can compute the string representation of a given regex:
\begin{lstlisting}[numbers=none,escapechar=~]
fun lookup_rx_insecure(name : string) =>
RU.parse (name ^ "SSTR: ESTR" ^ (RU.to_string ssn))
\end{lstlisting}
%The (unstated) intent here was to treat \lstinline{name} as a sub-pattern matching only itself, but this is not the observed behavior when \lstinline{name} contains special characters that have other meanings in patterns.
Both \lstinline{lookup_rx} and \lstinline{lookup_rx_insecure} have the same type, \li{string -> R.t}, and behave identically at many inputs, particularly the ``typical'' inputs (i.e. alphabetic strings.) It is only when \li{lookup_rx_insecure} is applied to a string that parses as a regex that matches \emph{other} strings that it behaves incorrectly (i.e. differently from \li{lookup_rx}.)
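To make the hazard concrete, here is a sketch of the difference on one such input:
\begin{lstlisting}[numbers=none]
val r1 = lookup_rx "SSTR.*ESTR"
(* matches only strings beginning with the literal text ".*: " *)
val r2 = lookup_rx_insecure "SSTR.*ESTR"
(* ".*" is parsed as a wildcard, so r2 matches ": <ssn>" after any prefix *)
\end{lstlisting}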
In applications that query sensitive data, mistakes like this lead to \emph{injection attacks}, which are among the most common and catastrophic security threats today \cite{owasp2013}.
This problem is fundamentally attributable to the programmer making a mistake in a misguided effort to decrease syntactic cost. However, the availability of a better approach for decreasing syntactic cost would help make this class of mistakes less common \cite{Bravenboer:2007:PIA:1289971.1289975}. %Given that our design philosophy is explicitly concerned with issues of cognitive cost, it is natural to consider these common cognitive hazards.
%Proving that mistakes like this have not been made involves reasoning about complex run-time data flows.
%Ultimately, of course, mistakes like this are the fault of a programmer using a flawed heuristic, and they could be avoided with discipline. The problem is once again that it is difficult to detect violations of this discipline automatically.
%Ideally, our library would be able to make it more difficult to inadvertently introduce subtle security bugs like this.
\item The fourth problem is that regex parsing does not occur until the call to \li{RU.parse} is dynamically evaluated. For example, the malformed regex form in the program fragment below will only trigger an exception when this expression is evaluated during the full moon: %Achieving this goal is an explicit goal of this proposal, so we are obviously not happy with this.
\begin{lstlisting}[numbers=none]
match moon_phase with
Full => RU.parse "SSTR(GCESTR" | _ => (* ... *)
end
\end{lstlisting}
Malformed string encodings of regexes can sometimes be discovered by testing, though empirical data gathered from large open source projects suggests that many malformed regexes remain undetected by test suites ``in the wild'' \cite{spishak2012type}.
One workaround is for the programmer to lift all such calls where the argument is a string literal out to the top level of the program, so that the exception is raised every time the program is evaluated. There is a cognitive penalty associated with moving the description of a regex away from its use site (but for statically determined regexes, this might be an acceptable trade-off.) For regexes constructed compositionally, this may not be possible. % Moreover, the dynamic cost of parsing the regex is incurred on every invocation of the program, even when the regex will never be used.
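For example, the lifted drawing of the fragment above would be sketched as follows:
\begin{lstlisting}[numbers=none]
(* lifted to the top level: the parse error is now raised on every run *)
val gc = RU.parse "SSTR(GCESTR"
(* ... later, at the original use site: *)
match moon_phase with
  Full => gc | _ => (* ... *)
end
\end{lstlisting}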
% Statically verifying that pattern formation errors will not dynamically arise requires reasoning about arbitrary dynamic behavior. This is an undecidable verification problem in general and can be difficult to even partially automate. In this example, the verification procedure would first need to be able to establish that the variable \lstinline{rxparse} is equal to the parse function \lstinline{RUtil.parse}. If the string argument had not been written literally but rather computed, e.g. as \lstinline{"SSTR(GESTR" ^ "SSTRCESTR"} where \lstinline{^} is the string concatenation function applied in infix style, it would also need to be able to establish that this expression is equivalent to the string \lstinline{"SSTR(GCESTR"}. For patterns that are dynamically constructed based on input to a function, evaluating the expression statically (or, more generally, in some earlier ``stage'' of evaluation \cite{Jones:Gomard:Sestoft:93:PartialEvaluation}) also does not suffice.
Another approach is to perform a static analysis that attempts to discover malformed statically determined regexes wherever they appear \cite{spishak2012type}.
\item Finally, to reiterate, this approach is not suitable for abbreviating patterns.
% Of course, asking the client to provide a proof of well-formedness would defeat the purpose of lowering syntactic cost.
% In contrast, were our language to support derived regex syntax, pattern parsing would occur at compile-time and so malformed patterns would produce a compile-time error, no matter where they appear in a program.
% \item Dynamic string parsing also necessarily incurs dynamic cost. Regular expression patterns are common when processing large datasets, so it is easy to inadvertently incur this cost repeatedly. For example, consider mapping over a list of strings:
% \begin{lstlisting}[numbers=none]
% map exmpl_list (fn s => rxmatch (rxparse "SSTRA|T|G|CESTR") s)
% \end{lstlisting}
% To avoid incurring the parsing cost for each element of \lstinline{exmpl_list}, the programmer or compiler must move the parsing step out of the closure (for example, by eta-reduction in this simple example).\footnote{Anecdotally, in major contemporary compilers, this optimization is not automatic.} If the programmer must do this, it can (in more complex examples) increase syntactic cost and cognitive cost by moving the pattern itself far away from its use site. Alternatively, an appropriately tuned memoization (i.e. caching) strategy could be used to amortize some of this cost, but it is difficult to reason compositionally about performance using such a strategy. %If the programmer does it, it can sometimes make the program more difficult to read.
% %This too is difficult if a portion of the pattern is dynamically generated. % Regular expressions are often used across large datasets in scientific applications, so the absolute peformance penalty can be non-trivial.
% In contrast, were our language to primitively support derived pattern syntax, the expansion would be computed at compile-time and incur no dynamic cost.
\end{enumerate}
Difficulties like these arise whenever a programmer attempts to deploy dynamic string parsing as a solution to the problem of high syntactic cost. % The reason is that syntactic cost is a property of a drawing of a program, so trying to address it by drawing a different program requires establishing that the alternative program is equivalent to the program that the client would write if syntactic cost was not a consideration (which is, at worst, an ill-posed problem, and at best, a rather difficult problem.) %Moreover, logically equivalent programs can differ in terms of performance.
(There are, of course, legitimate applications of dynamic string parsing that are not motivated by the desire to decrease syntactic cost, e.g. when parsing string encodings of regexes received as dynamic input to the program.)%Strings are, simply put, not ideally suited for this task.
\subsection{Dynamic Quotation Parsing}\label{sec:dynamic-quotation}
Some syntax dialects of ML, e.g. a syntax dialect that can be activated by toggling a compiler flag in SML/NJ \cite{SML/Quote,conf/icfp/Slind91}, define \emph{quotation literals}, which are derived forms for expressions of type \li{'a frag list} where \li{'a frag} is defined as follows:
\begin{lstlisting}[numbers=none]
datatype 'a frag = QUOTE of string | ANTIQUOTE of 'a
\end{lstlisting}
Quotation literals are delimited by backticks, e.g. \li{`SCSSA|T|G|CECSS`} is the same as writing \li{[QUOTE "SSTRA|T|G|CESTR"]}. Expressions of variable or parenthesized form that appear prefixed by a caret in the body of a quotation literal are parsed out and appear wrapped in the \li{ANTIQUOTE} constructor, e.g. \li{`SCSSGC^(ECSSdna_rxSCSS)GCECSS`} is the same as writing:
\begin{lstlisting}[numbers=none]
[QUOTE "SSTRGCESTR", ANTIQUOTE dna_rx, QUOTE "SSTRGCESTR"]
\end{lstlisting}
Unlike dynamic string parsing, \emph{dynamic quotation parsing} allows library providers to capture idioms involving subexpressions. For example:
\begin{itemize}
\item The regex library provider can define a function \li{qparse : R.t frag list -> R.t} in \li{RXUtil} that parses the given fragment list according to the POSIX standard extended to support antiquotation, producing a regex value or raising an exception if the fragment list cannot be parsed. Applying this function to the examples above produces the corresponding regex values at lower syntactic cost:
\begin{lstlisting}[numbers=none]
val dna = RU.qparse `SCSSA|T|G|CECSS`
val bisI = RU.qparse `SCSSGC^(ECSSdna_rxSCSS)GCECSS`
\end{lstlisting}
\item The list library provider can also define a function \li{qparse : 'a frag list -> 'a list} in the \li{List} module that constructs a list from a quoted list (a sketch of such a function appears after this list):
\begin{lstlisting}[numbers=none]
List.qparse `SCSS[^(ECSSx + ySCSS), ^ECSSySCSS, ^ECSSzSCSS]ECSS`
\end{lstlisting}
\end{itemize}
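As promised above, the list provider's \li{qparse} might be sketched as follows, assuming the \li{'a frag} datatype above and accepting only the bracketed, comma-separated surface syntax (the error behavior is our own assumption):
\begin{lstlisting}[numbers=none]
(* sketch: accepts [QUOTE "[", ANTIQUOTE v1, QUOTE ",", ...,
   ANTIQUOTE vn, QUOTE "]"], modulo whitespace *)
fun qparse (frags : 'a frag list) : 'a list =
  let
    fun strip s = String.translate
          (fn c => if Char.isSpace c then "SSTRESTR" else String.str c) s
    fun go [ANTIQUOTE v, QUOTE s] =
          if strip s = "SSTR]ESTR" then [v]
          else raise Fail "SSTRbad quotationESTR"
      | go (ANTIQUOTE v :: QUOTE s :: rest) =
          if strip s = "SSTR,ESTR" then v :: go rest
          else raise Fail "SSTRbad quotationESTR"
      | go _ = raise Fail "SSTRbad quotationESTR"
  in
    case frags of
      [QUOTE s] =>
        if strip s = "SSTR[]ESTR" then []
        else raise Fail "SSTRbad quotationESTR"
    | QUOTE s :: rest =>
        if strip s = "SSTR[ESTR" then go rest
        else raise Fail "SSTRbad quotationESTR"
    | _ => raise Fail "SSTRbad quotationESTR"
  end
\end{lstlisting}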
%This addresses some of the problems of dynamic string parsing, in that we no longer need to use string concatenation to emulate splicing.
There remain some problems with dynamic quotation parsing:
\begin{enumerate}
\item The library provider cannot specify alternative outer delimiters or antiquotation delimiters -- backticks and the caret, respectively, are the only choices in SML/NJ. This is problematic for regexes, for example, because the caret has a different meaning in the POSIX standard.
\item Another problem is that all antiquoted values within a quotation literal must be of the same type. If, for example, we sought to support both spliced regexes and spliced strings in quoted regexes, we would need to define an auxiliary sum type in \li{RXUtil}
and the client would need to wrap each antiquoted expression with a call to the corresponding constructor to mark its type.
For example, \li{lookup_rx} would be drawn as follows (assuming suitable definitions of \li{RU.QS} and \li{RU.QR}, not shown):
\begin{lstlisting}[numbers=none]
fun lookup_rx(name : string) =>
  RU.qparse' `SCSS^(ECSSRU.QS nameSCSS): ^(ECSSRU.QR ssnSCSS)ECSS`
\end{lstlisting}
Similarly, if we sought to support quoted lists where the tail is explicitly given by the client (following OCaml's revised syntax \cite{ocaml-manual}), clients would need to apply marking constructors to each antiquoted expression:
\begin{lstlisting}[numbers=none]
List.qparse `SCSS[^(ECSSList.V xSCSS), ^(ECSSList.V ySCSS) :: ^(ECSSList.VS zsSCSS)]ECSS`
\end{lstlisting}
Marking constructors increase syntactic cost (rather substantially in such examples.)
\item As with dynamic string parsing, parsing occurs dynamically. We cannot use the trick of lifting all calls to \li{qparse} to the top level because the arguments are not closed string literals. At best, we can lift these calls out as far as the binding structure allows, i.e. into the earliest possible ``dynamic phase.'' Parse errors are detected only when this phase is entered, and the dynamic cost of parsing is incurred each time this phase is entered. For example, \li{List.qparse} is called $n$ times below, where $n$ is the length of \li{input}:
\begin{lstlisting}[numbers=none]
List.map (fn x => List.qparse `SCSS[^ECSSxSCSS, ^(ECSS2 * xSCSS)]ECSS`) input
\end{lstlisting}
One way to detect parse errors early and reduce the dynamic cost of parsing is to use a system of \emph{staged partial evaluation} \cite{Jones:Gomard:Sestoft:93:PartialEvaluation}. For example, if we integrated Davies' temporal logic based approach into our language \cite{DBLP:conf/lics/Davies96}, we could rewrite the list example above as follows:
\begin{lstlisting}[numbers=none,morekeywords={next,prev}]
List.map (fn x => prev (List.sqparse
`SCSS[^(ECSSnext xSCSS), ^(ECSSnext (2 * x)SCSS)]ECSS`)) input
\end{lstlisting}
Here, the operator \lstinline[morekeywords={next,prev}]{prev} causes the call to \li{List.sqparse} to be evaluated in the previous stage. \li{List.sqparse} differs from \li{List.qparse} in that the antiquoted values in the input must be encapsulated expressions from the next stage, indicated by the \lstinline[morekeywords={next, prev}]{next} operator. The return value is also an encapsulated expression from the next stage. By composing this value with \lstinline[morekeywords={next,prev}]{prev}, we achieve the desired staging. Other systems, e.g. MetaML \cite{Sheard:1999:UMS} and MacroML \cite{ganz2001macros}, provide similar staging primitives.
The main problem with this approach is that it incurs substantial annotation overhead. Here, the staged call to \li{List.sqparse} has higher syntactic cost than if we had simply manually applied \li{Nil} and \li{Cons}. This problem is compounded if marking constructors like those described above are needed.
% the application of \li{qparse} is evaluated.
\item Finally, quotation parsing, like the other approaches considered so far, helps only with the problem of abbreviating expressions. It provides no solution to the problem of abbreviating patterns (because parse functions compute values, not patterns.)% The reason is simple: these approaches require applying functions, which, by nature, are expressions, not patterns.
\end{enumerate}
Due to these problems, VerseML does not build in quotation literals.\footnote{In fact, quotation syntax can be expressed using parametric TLMs, which are the topic of Chapter \ref{chap:ptsms}.}
% \subsection{Syntax Dialects}
% To gain more precise control over form, a library provider, or another interested party, might also consider defining a syntax dialect.
% A dialect of a syntax definition, $\mathcal{D}$, is a new syntax definition, $\mathcal{D}'$, that:
% \begin{enumerate}
% \item extends $\mathcal{D}$, meaning that all drawings that are well-formed in $\mathcal{D}$ are well-formed in $\mathcal{D}'$ and identify the same AST; and
% \item defines additional derived forms.
% \end{enumerate}
% We leave the notion of a ``syntax definition'' undefined here for the sake of generality -- there are many different syntax definition systems.
\subsection{Fixity Directives}\label{sec:Fixity-directives}
We will now consider various syntax definition systems.
The simplest syntax definition systems allow programmers to introduce new infix operators. For example, the syntax definition system integrated into Standard ML allows the programmer to designate \li{::} as a right-associative infix operator at precedence level 5 by placing the following directive in the program text:
\begin{lstlisting}[numbers=none]
infixr 5 ::
\end{lstlisting}
This directive causes expressions of the form \li{e1 :: e2} to desugar to \li{op:: (e1, e2)}, i.e. the variable \li{op::} is applied to the pair \li{(e1, e2)}. Given that \li{op::} is a list value constructor in SML, this expression constructs a list with head \li{e1} and tail \li{e2}.
The fixity directive above also causes patterns of the form \li{p1 :: p2} to desugar to \li{op:: (p1, p2)}, i.e. to pattern constructor application. Again, because \li{op::} is a list pattern constructor in SML, the desugaring of this pattern matches lists where the head matches \li{p1} and the tail matches \li{p2}. (If we had used the identifier \li{Cons}, rather than \li{op::}, in the definition of the \li{list} datatype, we would never be able to use the \li{::} operator in list patterns because SML does not support pattern synonyms.)%This is why the SML Basis library uses the label \li{op::} rather than \li{Cons} directly in its definition of the list datatype for this reason.
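For example, with this directive in scope, the following SML declarations (a small sketch) exercise both desugarings:
\begin{lstlisting}[numbers=none]
val xs = 1 :: 2 :: nil    (* desugars to op:: (1, op:: (2, nil)) *)
fun head (x :: _) = x     (* the pattern desugars to op:: (x, _) *)
  | head nil = raise Empty
\end{lstlisting}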
Figure \ref{fig:infix-RX} shows three fixity declarations related to our regex library together with a functor \li{RXOps} that binds the corresponding identifiers to the appropriate functions. Assuming that a library packaging system has brought the fixity declarations and the definition of \li{RXOps} from Figure \ref{fig:infix-RX} into scope, we can instantiate \li{RXOps} and then \li{open} this instantiated module to bring the necessary bindings into scope as follows:
\begin{figure}
\begin{lstlisting}
infix 5 ::
infix 6 <*>
infix 4 <|>
functor RXOps(R : RX) =
struct
structure RU = RXUtil(R)
val op:: = R.Seq
val op<*> = RU.repeat
val op<|> = R.Or
end
\end{lstlisting}
\caption{Fixity declarations and related bindings for \li{RX}}
\label{fig:infix-RX}
\vspace{-5px}
\end{figure}
\begin{lstlisting}[numbers=none]
structure ROps = RXOps(R)
open ROps
\end{lstlisting}
We can now draw the previous examples equivalently as follows:
\begin{lstlisting}[numbers=none]
val dna = (R.Str "SSTRAESTR") <|> (R.Str "SSTRTESTR") <|> (R.Str "SSTRGESTR") <|>
(R.Str "SSTRCESTR")
val ssn = (RU.digit)<*>3 :: dash :: (RU.digit)<*>2 :: dash ::
  (RU.digit)<*>4
fun lookup_rx(name : string) =>
  (R.Str name) :: (R.Str "SSTR: ESTR") :: ssn
\end{lstlisting}
This demonstrates two other problems with this approach.
First, it grants only limited control over form -- we cannot express the POSIX forms in this way, only \emph{ad hoc} (and in this case, rather poor) approximations thereof.
% as given cannot be used in patterns.%The desugaring of an infix operator application is always of function application form. In particular, it is always the operator variable (e.g. \li{op::}) applied to two arguments, first the expression on the left, then the expression on the right (with the precedence and associativity determining what these expressions are.)
Second, there can be syntactic conflicts between libraries. Here, both the list library and the regex library have defined a fixity directive for the \li{::} operator, but each specifies a different associativity. As such, clients cannot use both forms in the same scope. There is no mechanism that allows a client to explicitly qualify an infix operator as referring to the fixity directive from a particular library -- fixity directives are not exported from modules or otherwise integrated into the binding structure of SML (libraries are extralinguistic packaging constructs, distinct from modules.)
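Concretely, if both directives appear in the same scope, the later one silently overrides the earlier one (a sketch):
\begin{lstlisting}[numbers=none]
infixr 5 ::  (* from the list library *)
infix 5 ::   (* from the regex library; overrides the directive above *)
val r = r1 :: r2 :: r3  (* now parses as (r1 :: r2) :: r3 *)
\end{lstlisting}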
%This identifier-oriented approach is also rather \emph{ad hoc}, in that renaming or substituting for an identifier can break or change the meaning of the program.
Formally, each fixity directive induces a dialect of the subset of SML's textual syntax that does not allow the declared identifier to appear in prefix position. When two such dialects are combined, the resulting dialect is not necessarily a dialect of both of the constituent dialects (one fixity declaration overrides the other, according to the order in which the dialects were combined.)
Due to these limitations, VerseML does not inherit this mechanism from SML (the infix operators that are available in VerseML, like \li{^} for string concatenation, have a fixed precedence, associativity and desugaring.)
\subsection{Mixfix Syntax Definitions}\label{sec:mixfix}
Fixity directives do not give direct control over desugaring -- the desugaring of a binary operator form introduced by a fixity directive is always of function application or pattern constructor application form. ``Mixfix'' syntax definition systems generalize SML-style fixity directives in that newly defined forms can contain any number of sub-trees (rather than just two) and their desugarings are determined by a programmer-defined rewriting.
The simplest of these systems, e.g. Griffin's system of notational definitions \cite{5134}, later variations on this system with stronger theoretical properties \cite{DBLP:conf/gpce/TahaJ03}, and the syntax definition system integrated into the Agda programming language \cite{DBLP:conf/ifl/DanielssonN08}, support only forms that contain a fixed number of sub-trees, e.g. \li{if _ then _ else _}. We cannot define SML-style derived list forms using these systems, because list forms can contain any number of sub-trees.
More advanced notational definition systems support new forms that contain $n$-ary sequences of sub-trees separated by a given token. For example, Coq's notation system \cite{Coq:manual} can be used to express list syntax as follows:
\begin{lstlisting}[numbers=none]
Notation " [ ] " := nil (format "[ ]") : list_scope.
Notation " [ x ] " := (cons x nil) : list_scope.
Notation " [ x ; y ; .. ; z ] " :=
(cons x (cons y .. (cons z nil) ..)) : list_scope.
\end{lstlisting}
Here, the final declaration handles a sequence of $n > 1$ semi-colon separated trees.
Even under these systems, we cannot define POSIX-style regex syntax. The problem is that we can only extend the syntax of the existing sorts of trees, e.g. types, expressions and patterns. We cannot define new sorts of trees, with their own distinct syntax. For example, we cannot define a new sort for regular expressions, where sequences of characters are not recognized as Coq identifiers but rather as regex character sequences.
As with other mechanisms for defining syntax dialects, we cannot reason modularly about syntactic determinism. The Coq manual acknowledges this \cite{Coq:manual}:
\begin{quote}
\emph{Mixing different symbolic notations in [the] same text may cause serious parsing ambiguity.}
\end{quote}
To help library clients manage conflicts when they arise, most of these systems include various precedence mechanisms. For example, Agda supports a system of directed acyclic precedence graphs \cite{DBLP:conf/ifl/DanielssonN08} (this is related to earlier work by Aasa where a complete precedence graph was necessary \cite{DBLP:journals/tcs/Aasa95}.) In Coq, the programmer can associate notation definitions with named ``scopes'', e.g. \li{list_scope} in the example above. A scope can be activated or deactivated explicitly using scope directives to control the availability of notation definitions. The innermost scope has the highest precedence. In some situations, Coq is able to use type information to activate a scope implicitly. Mixfix syntax definition systems that use types more directly to disambiguate from several possibilities have also been developed \cite{missura1997higher,wieland2009parsing}. These only reduce the likelihood of a conflict -- they do not eliminate the possibility entirely.
Aasa et al. developed a system whereby each constructor of a datatype definition could have its own syntax \cite{DBLP:conf/lfp/AasaPS88,DBLP:conf/fpca/Aasa93}. This syntax was delimited from the rest of the language using a fixed quotation-antiquotation system like that described in Sec. \ref{sec:dynamic-quotation}. Parsing was integrated into the type inference mechanism of the language. However, this system is also not expressive enough to handle POSIX regex syntax, again because it forces an immediate, one-to-one correspondence between constructors and syntactic forms. For example, it is not possible to treat arbitrary character sequences as regex character sequences, which are governed by the \li{Str} constructor. It is also not possible to capture idioms that do not correspond immediately to datatype constructor application (e.g. idioms involving modules.)
%VerseML does not integrate a mixfix syntax definition system.
\subsection{Grammar-Based Syntax Definition Systems}\label{sec:grammars}\label{sec:syntax-dialects}
Many syntax definition systems are oriented around \emph{formal grammars} \cite{hopcroft1979introduction}. Formal grammars have been studied since at least the time of P\={a}\d{n}ini, who developed a grammar for Sanskrit in or around the 4th century BCE \cite{Ingerman:1967:LFS:363162.363165}.
\emph{Context-free grammars (CFGs)} were first used to define the textual syntax of a major programming language -- Algol 60 -- by Backus \cite{naur1963revised}. Since then, countless other syntax definition systems oriented around CFGs have emerged. In these systems a syntax definition consists of a CFG (perhaps from some restricted class of CFGs) equipped with various auxiliary definitions (e.g. a lexer definition in many systems) and logic for computing an output value (e.g. a tree) based on the determined form of the source text.% Some of the systems that we will describe operate as \emph{textual preprocessors}, transforming source text into text accepted by a language's original syntax definition. Others directly generate syntax trees.
% Some compilers, e.g. the OCaml compiler \cite{ocaml-manual}, integrate preprocessing into the build system. Other systems use a layer of directives placed in the source text to control preprocessor invocation. For example, in Racket's reader macro system, the programmer can direct the lexer (called the ``reader'') to shift control to a given parser when a designated directive or token is seen \cite{Flatt:2012:CLR:2063176.2063195}. A few systems are integrated directly into a language definition -- we will point these out when we introduce them.
Perhaps the most established CFG-based syntax definition systems within the ML ecosystem are ML-Lex and ML-Yacc, which are distributed with SML/NJ \cite{TarditiDR:mly}, and Camlp4, which until recently was integrated into the OCaml system (recent releases deprecate it in favor of a simpler system, \li{ppx}, that we discuss in the next section) \cite{ocaml-manual}. In these systems, the output is an ML value computed by ML functions associated with productions in the grammar (these functions are referred to as the \emph{semantic actions}.)
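Schematically, each ML-Yacc production is paired with an ML expression that builds the output value (the following is an illustrative fragment in ML-Yacc style, not a complete or unambiguous grammar; the token and nonterminal names are hypothetical):
\begin{lstlisting}[numbers=none]
(* fragment of an ML-Yacc-style rule; semantic actions in parentheses *)
rx : rx BAR rx   (Or (rx1, rx2))
   | rx rx       (Seq (rx1, rx2))
   | CHARS       (Str CHARS)
\end{lstlisting}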
The \emph{syntax definition formalism (SDF)} \cite{journals/sigplan/HeeringHKR89} is a syntactic formalism for describing CFGs. SDF is used by a number of syntax definition systems, e.g. the Spoofax ``language workbench'' \cite{kats2010spoofax}. These systems commonly use Stratego, a rule-based rewriting language, as the language that output logic is written in \cite{Visser-RTA01}. SugarJ is an extension of Java that allows programmers to define and combine fragments of SDF+Stratego-based syntax definitions directly from within the program text \cite{erdweg2011sugarj}. SugarHaskell is a similar system based on Haskell \cite{erdweg2012layout} and Sugar* simplifies the task of defining similar extensions of other languages \cite{erdweg2013framework}. SoundExt and SugarFOmega add the requirement that new derived forms must come equipped with derived typing rules \cite{conf/icfp/LorenzenE13}. The system must be able to verify that the rewrite rules are sound with respect to these derived typing rules (their verification system defers to the proof search facilities of PLT-Redex \cite{Felleisen-Findler-Flatt09}.) SoundX generalizes this idea to other base languages, and adds the ability to define type-dependent rewritings \cite{conf/popl/LorenzenE16}. We will say more about SoundExt/SugarFOmega and SoundX when we discuss abstract reasoning under syntax dialects below.
%Erdweg et al. have developed many non-trivial examples, including . \to
Copper implements a CFG-based syntax definition system that uses a context-aware scanner \cite{conf/gpce/WykS07}. We will say more about Copper when we discuss modular reasoning about syntactic determinism below.% Silver is an \emph{attribute grammar} system that incorporates Copper \cite{VanWyk:2010:SEA}. Attribute grammars make it easier to compositionally define output logic that requires information that is not local to each form \cite{knuth1968semantics}.
Some other syntax definition systems are instead oriented around \emph{parsing expression grammars} (PEGs) \cite{Ford04a}. PEGs are similar to CFGs, distinguished mainly in that they are deterministic by construction (by allowing only for explicitly prioritized choice between alternative parses.) \emph{Packrat parsers} implement PEGs \cite{DBLP:journals/corr/abs-cs-0603077}.
\vspace{-5px}\subsection{Parser Combinator Systems}\label{sec:parser-combinators}\vspace{-4px}
\emph{Parser combinator systems} specify a functional interface for defining parsers, together with various functions that generate new parsers from existing parsers and other values (these functions are referred to as the \emph{parser combinators}) \cite{Hutton1992d}.
In some cases, the composition of various parser combinators can be taken as definitional (as opposed to the usual view, where a parser is an {implementation} of a syntax definition.)
For example, Hutton describes a system where parsers are functions of some type in the following parametric type family:% \li{'char 'tree parser}, bound in VerseML's syntax:
\begin{lstlisting}[numbers=none]
type parser('c, 't) = list('c) -> list('t * list('c))
\end{lstlisting}
Here, a parser is a function that takes a list of (abstract) characters and returns a list of valid parses, each of which consists of an (abstract) output (e.g. a tree) and a list of the characters that were not consumed. An input is ambiguous if this function returns more than one parse. A deterministic parser is one that never returns more than one parse. The non-deterministic choice combinator \li{alt} has the following signature:
\begin{lstlisting}[numbers=none]
val alt : parser('c, 't) -> parser('c, 't) -> parser('c, 't)
\end{lstlisting}
The \li{alt} combinator combines the two given parsers by applying them both to the input and appending the lists that they return.
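A minimal sketch of this representation in SML syntax, together with \li{alt} and a primitive single-character parser (the name \li{satisfy} is our own):
\begin{lstlisting}[numbers=none]
type ('c, 't) parser = 'c list -> ('t * 'c list) list

(* run both parsers on the same input; append their parse lists *)
fun alt (p : ('c, 't) parser) (q : ('c, 't) parser) : ('c, 't) parser =
  fn cs => p cs @ q cs

(* consume one character satisfying pred *)
fun satisfy (pred : 'c -> bool) : ('c, 'c) parser =
  fn [] => []
   | c :: cs => if pred c then [(c, cs)] else []
\end{lstlisting}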
Various alternative designs that better control dynamic cost or that maintain other useful properties have also been described. For example, Hutton and Meijer describe a parser combinator system in monadic style \cite{hutton1998monadic}. Okasaki has described an alternative design that uses continuations to control cost \cite{Okasaki98b}.
% Given a parser (generated using parser combinators, or by using a CFG processed by a parser generator), the programmer must
% Some compilers, e.g. the OCaml compiler \cite{ocaml-manual}, integrate preprocessing into the build system.
Some systems use a layer of directives placed in the source text to control parser invocation. For example, in Racket's reader macro system, the programmer can direct the initial token reader to shift control to a given parser when a designated directive or token is seen \cite{Flatt:2012:CLR:2063176.2063195,DBLP:journals/jfp/FlattCDF12}. Honu is another reader based system, which uses a simple syntactic pattern language to initially ``enforest'' the token stream, i.e. to turn it into a simple tree structure, before passing it to the parser \cite{DBLP:conf/gpce/RafkindF12}. %A few systems are integrated directly into a language definition -- we will point these out when we introduce them.
%\subsubsection{Preprocessing}
%Syntax dialects implemented using parser combinators, or generated from a grammar-based syntax definition by a \emph{parser generator}, usually
%The most minimal syntax definition systems, e.g. Racket's dialect preprocessor \cite{Flatt:2012:CLR:2063176.2063195}, take any function of a type like \li{string -> exp}, where \li{exp} is a system-defined encoding of the syntax of expressions (enriched perhaps with source code locations and other ``metadata''), as a syntax definition. Programmers using these systems are free to use an implementation of some other syntax definition system to define this function.
\vspace{-8px}\subsection{Examples of Syntax Dialects}\label{sec:examples-of-syntax-dialects}\vspace{-4px}
Now that we have given an overview of a number of syntax definition systems, let us consider two specific examples of syntax dialects to motivate our subsequent discussion of the problems with the dialect oriented approach.
\vspace{-10px}\subsubsection{Example 1: $\mathcal{V}_\texttt{rx}$}
Using any of the more general syntax definition systems described in the two previous sections, we can define a dialect of VerseML's textual syntax called $\mathcal{V}_\texttt{rx}$ that builds in derived regex forms.
In particular, $\mathcal{V}_\texttt{rx}$ extends the syntax of expressions with \emph{derived regex literals}, which are delimited by forward slashes, e.g. \li{/SURLA|T|G|CEURL/}. The desugaring of this form is equivalent to the following if we assume that \li{Or} and \li{Str} stand for the corresponding constructors of the recursive labeled sum type \li{rx} that was defined in Figure \ref{fig:datatype-rx}:
\begin{lstlisting}[numbers=none]
Or(Str "SSTRAESTR", Or (Str "SSTRTESTR", Or (Str "SSTRGESTR", Str "SSTRCESTR")))
\end{lstlisting}
Of course, it is unreasonable to assume that \li{Or} and \li{Str} are bound appropriately at every use site. In order to maintain \emph{context independence}, the desugaring instead applies the explicit \li{fold} and \li{inj} operators as discussed in Sec. \ref{sec:lists}.\footnote{In SML, where datatypes are abstract and explicit fold and injection operators are not exposed, it is more difficult to maintain context independence. We would need to provide a module containing the constructors as a ``syntactic argument'' to each form -- we describe this technique as it relates to our modular encoding of regexes in Example 2 below.}
\begin{figure}
\begin{lstlisting}[numbers=none]
val ssn = SURL/\d\d\d-\d\d-\d\d\d\d/EURL
fun lookup_rx(name : string) => SURL/@EURLnameSURL: %EURLssnSURL/EURL
\end{lstlisting}
\vspace{-5px}
\caption{Derived regex expression forms in $\mathcal{V}_\texttt{rx}$}
\label{fig:derived-spliced-subexpressions}
\end{figure}
\begin{figure}
\begin{lstlisting}[numbers=none]
fun is_dna_rx(r : rx) : boolean =>
match r with
| SURL/A/EURL => True
| SURL/T/EURL => True
| SURL/G/EURL => True
| SURL/C/EURL => True
| SURL/%(EURLr1SURL)%(EURLr2SURL)/EURL => (is_dna_rx r1) andalso (is_dna_rx r2)
| SURL/%(EURLr1SURL)|%(EURLr2SURL)/EURL => (is_dna_rx r1) andalso (is_dna_rx r2)
| SURL/%(EURLr'SURL)*/EURL => is_dna_rx r'
| _ => False
end
\end{lstlisting}
\vspace{-8px}
\caption{Derived regex pattern forms in $\mathcal{V}_\texttt{rx}$}
\label{fig:derived-pattern-syntax}
\end{figure}
$\mathcal{V}_\texttt{rx}$ also supports regex literals that contain {subexpressions}. These capture the idioms that arise when constructing regex values compositionally. For example, the definition of \li{lookup_rx} in Figure \ref{fig:derived-spliced-subexpressions} is equivalent to the definition of \li{lookup_rx} that was given in Figure \ref{fig:lookup_rx}. The prefix \li{SURL@EURL} followed by the identifier \li{name} causes the expression \lstinline{name} to appear in the desugaring as if wrapped in the \li{Str} constructor, and the prefix \li{SURL%EURL}
followed by the identifier \li{ssn} causes \lstinline{ssn} to appear in the desugaring directly. We refer to the expressions that appear inside literal forms as \emph{spliced expressions}.
% The body of \li{example_rx} could equivalently be written as follows:
% \begin{lstlisting}[numbers=none]
% Seq(Str(name), Seq(Str "SSTR: ESTR", ssn))
% \end{lstlisting}
% (Again, the desugaring itself must use the explicit \li{fold} and \li{inj} operators to maintain context-independence.)
%Notice that \li{name} appears wrapped in the label \li{Str} because it was prefixed by \li{@}, whereas \li{ssn} appears unadorned because it was prefixed by \li{%}.
To splice in an expression that is not of variable form, e.g. a function application, we must delimit it with parentheses: \li{/SURL@(EURLcapitalize nameSURL)EURL/}.
Finally, $\mathcal{V}_\texttt{rx}$ extends the syntax of patterns with analogous \emph{derived regex pattern literals}. For example, the definition of \li{is_dna_rx} in Figure \ref{fig:derived-pattern-syntax} is equivalent to the definition of \li{is_dna_rx} that was given in Figure \ref{fig:is_dna_rx}. Notice that the variables bound by the patterns in Figure \ref{fig:derived-pattern-syntax} appear inside \emph{spliced patterns}.
\subsubsection{Example 2: $\mathcal{V}_\text{RX}$}
\begin{figure}
\begin{lstlisting}[numbers=none]
fun is_dna_rx'(r : R.t) : boolean =>
match R.unfold_norm r with
| SURL/A/EURL => True
| SURL/T/EURL => True
| SURL/G/EURL => True
| SURL/C/EURL => True
| SURL/%(EURLr1SURL)%(EURLr2SURL)/EURL => (is_dna_rx' r1) andalso (is_dna_rx' r2)
| SURL/%(EURLr1SURL)|%(EURLr2SURL)/EURL => (is_dna_rx' r1) andalso (is_dna_rx' r2)
| SURL/%(EURLr'SURL)*/EURL => is_dna_rx' r'
| _ => False
end
\end{lstlisting}\vspace{-5px}
\caption{Derived regex unfolding pattern forms in $\mathcal{V}_\text{RX}$}
\label{fig:VRX-pats}
\end{figure}
In Sec. \ref{sec:syntax-examples-regexps}, we also considered a more sophisticated formulation of our regex library organized around the signature \li{RX} defined in Figure \ref{fig:signature-RX}. Let us define another dialect of VerseML's textual syntax called $\mathcal{V}_\text{RX}$ that defines derived forms whose desugarings involve modules that implement \li{RX}. For this to work in a context-independent manner, these forms must take the particular module that is to appear in the desugaring as a spliced subterm. For example, in the following program fragment, the module \li{R} is ``passed into'' each derived form for use in its desugaring:
\begin{lstlisting}[numbers=none]
val ssn = RSURL./\d\d\d-\d\d-\d\d\d\d/EURL
fun lookup_rx'(name : string) => RSURL./@EURLnameSURL: %EURLssnSURL/EURL
\end{lstlisting}
The desugaring of the body of \li{lookup_rx'} is:
\begin{lstlisting}[numbers=none]
R.Seq(R.Str(name), R.Seq(R.Str "SSTR: ESTR", ssn))
\end{lstlisting}
This desugaring logic is context-independent because the constructors are explicitly qualified (i.e. \li{Seq} and \li{Str} are \emph{component labels} here, not variables.) The only variables that appear in the desugaring are \li{R}, \li{name} and \li{ssn}. All of these were specified by the client at the use site, so they are subject to renaming.
Recall that \li{RX} specifies a function \li{unfold_norm : t -> u(t)} for computing the normal unfolding of the given regex. $\mathcal{V}_\text{RX}$ defines derived forms for patterns matching values of types in the type family \li{u('a)}. These are used in the definition of \li{is_dna_rx'} given in Figure \ref{fig:VRX-pats}.
\subsection{Problems with Syntax Dialects}\label{sec:problems-with-syntax-dialects}
\subsubsection{Conservatively Combining Syntax Dialects}
Notice that the derived regex pattern forms that appear in Figure \ref{fig:VRX-pats} are identical to those that appear in Figure \ref{fig:derived-pattern-syntax}. Their desugarings are, however, different. In particular, the patterns in Figure \ref{fig:VRX-pats} match values of type \li{u('a)}, whereas the patterns in Figure \ref{fig:derived-pattern-syntax} match values of type \li{rx}.
It would be useful to have derived forms for values of type \li{rx} available even when we are working with the modular encoding of regexes, because we have defined a function \li{view : R.t -> rx} in \li{RXUtil}. This brings us to the first of the two main problems with the dialect-oriented approach, already described in Chapter \ref{chap:intro}: there is no good way to conservatively combine $\mathcal{V}_\text{rx}$ and $\mathcal{V}_\text{RX}$. In particular, any such ``combined dialect'' will either fail to conserve determinism (because the forms overlap), or the combined dialect will not be a dialect of both of the constituent dialects, i.e. some of the forms from one dialect will ``shadow'' the overlapping forms from the other dialect (depending on the order in which they were combined \cite{Ford04a}.)
In response to this problem, Schwerdfeger and Van Wyk have developed a modular analysis that accepts only deterministic extensions of a base LALR(1) grammar where all new forms must start with a ``marking'' terminal symbol and obey certain other constraints related to the follow sets of the base grammar's non-terminals \cite{conf/pldi/SchwerdfegerW09,schwerdfeger2010context}. By relying on a context-aware scanner (a feature of Copper \cite{conf/gpce/WykS07}) to transfer control when the marking terminals are seen, extensions of a base grammar that pass this analysis and specify disjoint sets of marking terminals can be combined without introducing conflict. %The analysis is ``nearly'' modular in that only a relatively simple ``combine-time'' check that the set of marking terminals is disjoint is necessary.
For the two dialects just considered, these conditions are not satisfied. If we modify the grammar of $\mathcal{V}_\text{RX}$ so that, for example, the regex literal forms are marked with \li{#\dolla#r} and the regex unfolding forms are marked with \li{#\dolla#u}, the analysis will accept both grammars, and the combine-time disjointness check will pass, solving our immediate problem at only a small cost. However, a conflict could still arise later when a client combines these extensions with another extension that also uses the marking terminals \li{#\dolla#r}, \li{#\dolla#u} or \li{/}. %There is no reason to believe that other dialect providers will avoid these marking terminals.
\begin{figure}
\begin{lstlisting}[numbers=none]
fun is_dna_rx'(r : R.t) : boolean =>
match R.unfold_norm r with
| SURL$cmu_edu_comar_rx $u/A/EURL => True
| SURL$cmu_edu_comar_rx $u/T/EURL => True
| SURL$cmu_edu_comar_rx $u/G/EURL => True
| SURL$cmu_edu_comar_rx $u/C/EURL => True
(* and so on *)
| _ => False
end
\end{lstlisting}
\vspace{-4px}
\caption{Using URI-based grammar names together with marking tokens to avoid syntactic conflicts}
\label{fig:vanwyk}
\vspace{-6px}
\end{figure}
The solution proposed by Schwerdfeger and Van Wyk \cite{conf/pldi/SchwerdfegerW09,schwerdfeger2010context} is 1) to allow for the grammar's name to be used as an additional syntactic prefix when a conflict arises, and 2) to adopt a naming convention for grammars based on the Internet domain name system (or some similar coordinating system) that makes conflicts unlikely. For example, Figure \ref{fig:vanwyk} shows how a client would need to draw \li{is_dna_rx'} if a conflict arose. Clearly, this drawing has higher syntactic cost than the drawing in Figure \ref{fig:VRX-pats}. Moreover, there is no simple way for clients to selectively control this cost by defining scoped abbreviations for marking tokens or grammar names (as one does for types, modules or values that are exported from deeply nested modules) because this mechanism is purely syntactic, i.e. agnostic to the binding structure of the language. A facility for defining unscoped abbreviations of marking tokens at combine-time could partially alleviate this cost.
Another approach aimed at making conflicts less likely, though not impossible, is to use types to choose from amongst several possible parses. Some approaches require generating the full \emph{parse forest} before typechecking proceeds, e.g. the \emph{MetaBorg} system \cite{bravenboer2005generalized}. This approach is inefficient, particularly when a large number of grammars have been composed. The method of \emph{type-oriented island parsing} integrates parsing and typechecking so that disambiguation occurs as early as possible \cite{DBLP:conf/sfp/SilkensenS12}.
A more radical approach would be to insist that programmers use a \emph{language composition editor} like Eco \cite{diekmann2014eco}. Language composition editors allow programmers to explicitly switch from one syntax to another with an editor command. This is an instance of the more general concept of \emph{structure editing} (also called \emph{structured editing}, \emph{projectional editing} or \emph{syntax-directed editing}.) This concept, pioneered by the Cornell Program Synthesizer \cite{teitelbaum_cornell_1981}, has various costs and benefits, summarized in \cite{DBLP:conf/sle/VolterSBK14}. In this work, our interest is in text-based syntax, but we consider structure editors as future work in Sec. \ref{sec:future-work}.
\subsubsection{Abstract Reasoning About Derived Forms}
In addition to the difficulties of conservatively combining syntax dialects, there are a number of other difficulties related to the fact that there is often no useful notion of syntactic abstraction that a programmer can rely on to reason about an unfamiliar derived form. The programmer may need to examine the desugaring, the desugaring logic or even the definitions of all of the constituent dialects, to definitively answer the questions given in Sec. \ref{sec:abs-reasoning-intro}. These questions were stated relative to a particular example involving the query processing language K.
Here, we generalize from that example to develop an informal classification of the properties that programmers might have difficulty reasoning about in analogous situations. In each case, we will discuss exceptional systems where these difficulties are ameliorated or avoided entirely.% We discuss some exceptions from amongst the related work above:
% \begin{enumerate}
\paragraph{Responsibility} It is not always straightforward to determine which constituent dialect is responsible for any particular derived form.
The system implemented by Copper \cite{conf/pldi/SchwerdfegerW09,schwerdfeger2010context} is an exception, in that the marking terminal (and the grammar name, if necessary) allows clients to search across the constituent dialect definitions for the corresponding declaration without needing to understand any of them deeply.
\paragraph{Segmentation} It is not always possible to segment a derived form such that each segment consists either of a spliced base language term (which we have drawn in black in the examples in this document) or a sequence of characters that are parsed otherwise (which we have drawn in color.) Even when a segmentation exists, determining it is not always straightforward.
For example, consider a production in a grammar that looks like this:
\begin{lstlisting}[numbers=none]
start <- "%(" verseml_exp ")"
\end{lstlisting}
The name of the non-terminal \li{verseml_exp} suggests that it will match any VerseML expression, but it is not certain that this is the case. Moreover, even if we know that this non-terminal matches VerseML expressions, it is not certain that the output logic will insert that expression as-is into the desugaring -- it may instead only examine its form, or transform it in some way (in which case highlighting it as a spliced expression might be misleading.)
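To illustrate, here is a toy OCaml sketch of an output logic that sometimes splices the matched expression as-is and sometimes merely transforms it; the AST type and both behaviors are hypothetical, invented only to make the segmentation problem concrete:
\begin{lstlisting}[numbers=none]
(* a toy AST for matched expressions *)
type exp = Str of string | Seq of exp * exp

let output_logic (e : exp) : exp =
  match e with
  | Str s -> Str (String.uppercase_ascii s)  (* transformed, not spliced as-is *)
  | _ -> Seq (Str "header: ", e)             (* spliced as-is under a new node *)
\end{lstlisting}
A grammar alone cannot tell the client which of these two behaviors the output logic exhibits, so coloring the matched text as a spliced term could mislead.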
Systems that support the generation of editor plug-ins, such as Spoofax \cite{kats2010spoofax} and Sugarclipse for SugarJ \cite{Erdweg:2012:GLE}, can generate syntax coloring logic from an annotated grammar definition, which often gives programmers some indication of where a spliced term occurs. However, the way the editor displays a derived form carries no definitive information about segmentation. (Moreover, these editor plug-ins can themselves conflict, even if the syntax itself is deterministic.)
\paragraph{Capture} The desugaring of a derived form might place spliced terms under binders. These binders are not visible in the program text, but can shadow those that are. As a result, variables in the spliced terms may inadvertently be captured by these expansion-internal bindings. This significantly obscures the binding structure of the program.
For derived forms that desugar to module-level definitions (e.g. to one or more \li{val} definitions), a desugaring might also introduce exported module components that are similarly invisible in the text. This can cause non-local capture when a client \li{open}s that module into scope.
In most cases, capture is inadvertent. For example, a desugaring might bind an intermediate value to some temporary variable, \li{tmp}. This can cause problems at use sites where \li{tmp} is bound. It is easy to miss this problem in testing (particularly if the types of both bindings are compatible.)
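For concreteness, consider a hypothetical derived form \li{or! e1 e2} that desugars to \li{let tmp = e1 in if tmp then tmp else e2}. The following hand-expanded OCaml sketch (the derived form and its desugaring are our invention, not drawn from any of the systems above) shows the expansion-internal binding of \li{tmp} silently capturing a spliced reference to a use-site \li{tmp}:
\begin{lstlisting}[numbers=none]
let f x = x > 0
let x = -3
let tmp = true
(* intent of  or! (f x) tmp :  (f x) || tmp, i.e. false || true = true *)
let result =
  let tmp = f x in           (* expansion-internal binding shadows use-site tmp *)
  if tmp then tmp else tmp   (* spliced tmp is captured: false || false = false *)
\end{lstlisting}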
In some syntax dialects, capture is by design. For example, in (Sugar)Haskell, \li{do} notation for monadic values operates as a new binding construct \cite{erdweg2012layout}. For programmers who are familiar with \li{do} notation, this can be useful. But when a programmer encounters an unfamiliar form, they must determine whether it, too, is designed as a new binding construct -- and a simple grammar provides no information about capture.%The point is simply that this is a double-edged sword.
In most systems, it is possible for dialect providers to generate identifiers that are guaranteed to be fresh at the use site. If dialect providers are disciplined about using this mechanism, they can prevent capture. However, this is awkward and most systems provide no guarantee that the dialect provider maintained this freshness discipline \cite{conf/ecoop/ErdwegSD14}.
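A minimal OCaml sketch of such a freshness discipline, assuming (hypothetically) that the suffix \li{__} is reserved for generated identifiers, might look as follows:
\begin{lstlisting}[numbers=none]
let counter = ref 0

(* produce a name that, by the assumed convention, cannot appear at use sites *)
let gensym (base : string) : string =
  incr counter;
  Printf.sprintf "%s__%d" base !counter

(* a disciplined desugaring of  or! e1 e2  would then emit, e.g.,
     let tmp__1 = e1 in if tmp__1 then tmp__1 else e2
   rather than hard-coding tmp *)
\end{lstlisting}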
To enforce a prohibition on capture, the system must be integrated into or otherwise made aware of the binding structure of the language. For example, some of the language-integrated mixfix systems discussed above, e.g. Coq's notation system \cite{Coq:manual}, enforce a prohibition on capture by alpha-renaming desugarings as necessary. Erdweg et al. have developed a formalism for directly describing the ``binding structure'' of program text, as well as contextual transformations that use these descriptions to rename the identifiers that appear in a desugaring to avoid capture \cite{conf/ecoop/ErdwegSD14,conf/sle/RitschelE15}.
\paragraph{Context Dependence} If the desugaring of a derived form assumes that certain identifiers are bound at the application site (e.g. to particular values, or to values of some particular type), we refer to the desugaring as being \emph{context dependent}.
%One might describe such desugarings as having ``captured'' bindings from the use site. Notice that this is distinct from the situation described above, where it is a term at the use site that ``captures'' bindings from the desugaring. %We will avoid the term ``capture''. %We use the word ``hygiene'' to refer collectively to context independence and shadowing avoidance.
Context-dependent desugarings take control over naming away from clients. Moreover, it is difficult to determine which assumptions a desugaring makes about its context. As such, it becomes difficult to reason about whether renaming an identifier or moving a binding is a meaning-preserving transformation.
In our examples above, we maintained context independence as a ``courtesy'' by explicitly applying the \li{fold} and \li{inj} operators, or by taking the module for use in the desugaring as a ``syntactic argument''.
To enforce context independence, the system must be aware of binding structure and have some way to distinguish those subterms of a desugaring that originate in the text at the use site (which should have access to bindings at the use site) from those that do not (which should only have access to bindings internal to the desugaring.)
For example, language-integrated mixfix systems, e.g. Coq's notation system, use a simple rewriting system to compute desugarings, so they satisfy these requirements and can enforce context independence. Coq gives desugarings access only to the bindings visible where the notation was defined.
More flexible systems where desugarings are computed functionally, or language-external systems that have no understanding of binding structure, do not enforce context independence.
\paragraph{Typing} Finally, it is not always clear what type an expression drawn in derived form has, or what type of value a pattern drawn in derived form matches. Similarly, it is not always straightforward to determine what type a spliced expression has, or what type of value a spliced pattern matches.
SoundExt/SugarFomega \cite{conf/popl/LorenzenE16} and SoundX \cite{conf/sle/RitschelE15} allow dialect providers to define derived typing rules alongside derived forms and desugaring rules. These systems automatically verify that the desugaring rules are sound with respect to these derived typing rules. This ensures that type errors are never reported in terms of the desugaring (which is the stated goal of their work.) However, this helps only to a limited extent in answering the questions just given. In particular, the programmer must first assign Responsibility (which is difficult for the reasons just given.) Next, the programmer must identify the spliced terms (which is difficult because these systems do not make it easy to reason about Segmentation, as just described.) Then, the programmer must construct a derivation using the relevant derived typing rules. Finally, the programmer must traverse the derivation to find where the spliced terms appear within it to answer questions about their types. Even for relatively simple base languages, like System $\mathbf{F}_\omega$, understanding a typing derivation requires significantly more effort and expertise than programmers can typically be expected to have.\footnote{At CMU, we teach ML to all first-year students (in 15-150 -- Functional Programming.) However, understanding a judgmental specification of a language like System $\mathbf{F}_\omega$ involves skills that are taught only to some third and fourth year students (in 15-312 -- Principles of Programming Languages.)} For languages like ML, the judgement forms are substantially more complex (no one has yet attempted to apply the SoundX methodology to a language as large as ML.)
Systems like MetaBorg that require that the type of a derived form be known from context so that disambiguation can occur (see above) also address the problem of determining the type of a derived expression or pattern form as a whole. However, it is not always clear what the types of the spliced terms within these derived forms should be.
% As discussed in Sec. \ref{sec:problems-with-dialects}, languages with a rich type and binding structure are designed to minimize or eliminate the cognitive cost of answering analagous questions. These reasoning principles are central to the claim that such languages are suitable for ``programming in the large'' \cite{DeRemer76}.
% Due to the problems described above, and the problem of conservatively combining syntax dialects, we do not integrate a syntax definition system into VerseML.% (There is, of course, no way to stop programmers from defining dialects of VerseML using any language-external syntax definition system of their choosing. Our goal is only to make this a less compelling option for library providers seeking only to capture idioms like those that we have discussed.)
% reasonable to obscure the type bind and binding structure by using these mechanisms.
% One approach would be to define both this encoding and the recursive labeled sum type \li{Rx} and define a parameterized module (i.e. a \emph{functor} in SML) that maps between them, given any module \li{R : RX}:
% \begin{lstlisting}[numbers=none]
% structure RxHelper(R : RX) =
% struct
% fun to_R : Rx -> R.t = (* ... *)
% fun of_R : R.t -> Rx = (* ... *)
% end
% \end{lstlisting}
% For example, given a particular module \li{R : RX}, we can generate the helper module \li{RH} as follows:
% \begin{lstlisting}[numbers=none]
% structure RH = RxHelper(R)
% \end{lstlisting}
% then, if we are using the VerseML/Rx syntax dialect, we can use the derived forms described previously in the argument position of \li{RH.to_r}:
% \begin{lstlisting}[numbers=none]
% let ssn = RH.to_R /SURL\d\d\d-\d\d\d\d-\d\d\dEURL/
% \end{lstlisting}
% One problem with this approach is that it makes using the spliced forms awkward. For example, consider writing the function \li{example_rx} in this manner:
% \begin{lstlisting}[numbers=none]
% fun example_R(name : string) => RH.to_R /SURL@EURLnameSURL: %(EURLRH.of_R ssnSURL)EURL/\end{lstlisting}
% Notice that we had to transform \li{ssn}, which is of type \li{R.t}, back into a value of type \li{Rx} in order to splice it into the expression above. The value of this expression is then immediately transformed back into a value of type \li{R.t} by \li{RH.to_R}. This is both syntactically awkward and incurs dynamic cost, i.e. it is an $\mathcal{O}(n)$ operation, where $n$ is the size of the regex being spliced. In this particular case, the cost may be negligible, but for large data structures, this may no longer be the case.
% \subsubsection{Direct Syntax Extension}\label{sec:direct-syntax-extension}
% One tempting alternative to dynamic string parsing is to use a system that gives the users of a language the power to directly extend its concrete syntax with new derived forms. %for regular expression patterns.% for patterns.
% The simplest such systems are those where the elaboration of each new syntactic form is defined by a single rewrite rule. For example, Gallina, the ``external language'' of the Coq proof assistant, supports such extensions \cite{Coq:manual}. A formal account of such a system has been developed by Griffin \cite{5134}. Unfortunately, a single equation is not enough to allow us to express pattern syntax following the usual conventions. For example, a system like Coq's cannot handle escape characters, because there is no way to programmatically examine a form when generating its expansion.
% Other syntax extension systems are more flexible. For example, many are based on context-free grammars, e.g. Sugar* \cite{erdweg2013framework} and Camlp4 \cite{ocaml-manual} (amongst many others). Other systems give library providers direct programmatic access to the parse stream, like Common Lisp's \emph{reader macros} \cite{steele1990common} (which are distinct from its term-rewriting macros, described in Sec. \ref{sec:term-rewriting} below) and Racket's preprocessor \cite{Flatt:2012:CLR:2063176.2063195}. All of these would allow us to add pattern syntax into our language's grammar, perhaps following Unix conventions and supporting splicing syntax as described above:
% \begin{lstlisting}[numbers=none]
% let val ssn = /SURL\d\d\d-\d\d-\d\d\d\dEURL/
% fun example_shorter(name : string) => /SURL@EURLnameSURL: %EURLssn/
% \end{lstlisting}
% %The body of this function elaborates to the body of \lstinline{example_fixed} as shown above.
% %Had we mistakenly written \lstinline{%name}, we would encounter only a static type error, rather than the silent injection vulnerability discussed above.
% We sidestep the problems of dynamic string parsing described above when we directly extend the syntax of our language using any of these systems. Unfortunately, direct syntax extension introduces serious new problems. First, the systems mentioned thus far cannot guarantee that {syntactic conflicts} between such extensions will not arise. As stated directly in the Coq manual: ``mixing different symbolic notations in [the] same text may cause serious parsing ambiguity''. If another library provider used similar syntax for a different implementation or variant of regular expressions, or for some other unrelated construct, then a client could not simultaneously use both libraries at the same time. So properly considered, every combination of extensions introduced using these mechanisms creates a \emph{de facto} syntactic dialect of our language. The benefit of these systems is only that they lower the implementation cost of constructing syntactic dialects. % Resolving such parsing amibiguities is left to each client of the library.
% In response to this problem, Schwerdfeger and Van Wyk developed a modular analysis that accepts only context-free grammar extensions that begin with an identifying starting token and obey certain constraints on the follow sets of base language's non-terminals \cite{conf/pldi/SchwerdfegerW09}. Extensions that specify distinct starting tokens and that satisfy these constraints can be used together in any combination without the possibility of syntactic conflict. However, the most natural starting tokens like \lstinline{rx} cannot be guaranteed to be unique. To address this problem, programmers must agree on a convention for defining ``globally unique identifiers'', e.g. the common URI convention used on the web and by the Java packaging system. However, this forces us to use a more verbose token like \lstinline{edu_cmu_VerseML_rx}. There is no simple way for clients of our extension to define scoped abbreviations for starting tokens because this mechanism operates purely at the level of the context-free grammar.
% In particular, if det(H) and det(R) and the set of marking terminals on dialects such that if OK(H) and OK(R) and starttokens(H) disjoint from starttokens(R) then det(H cup R). This is not quite modular, in that we still need to check that the start tokens are disjoint at ``combination-time''. To be confident that this check will not fail, a community might adopt a social convention, e.g. using URIs as start tokens.
% Putting this aside, we must also consider another modularity-related question: which particular module should the expansion use? Clearly, simply assuming that some module identified as \lstinline{R} matching \lstinline{RX} is in scope is a brittle solution. In fact, we should expect that the system actively prevents such capture of specific variable names to ensure that variables (here, module variables) can be freely renamed. Such a \emph{hygiene discipline} is well-understood only when performing term-to-term rewriting (discussed below) or in simple language-integrated rewrite systems like those found in Coq. For mechanisms that operate strictly at the level of context-free grammars or the parse stream, it is not clear how one could address this issue. The onus is then on the library provider to make no assumptions about variable names and instead require that the client explicitly identify the module they intend to use as an ``argument'' within the newly introduced form:
% \begin{lstlisting}[numbers=none]
% let val ssn = edu_cmu_VerseML_rx R /SURL\d\d\d-\d\d-\d\d\d\dEURL/
% \end{lstlisting}
% Another problem with the approach of direct syntax extension is that, given an unfamiliar piece of syntax, there is no straightforward method for determining what type it will have, causing difficulties for both humans (related to code comprehension) and tools.
% \todo{Related work I haven't mentioned yet:}
% \begin{itemize}
% \item Fan: http://zhanghongbo.me/fan/start.html
% \item Well-Typed Islands Parse Faster: \\\url{http://www.ccs.neu.edu/home/ejs/papers/tfp12-island.pdf}
% \item User-defined infix operators
% \item SML quote/unquote
% \item That Modularity paper
% \item Template Haskell and similar
% \end{itemize}
% \subsection{Rewriting Systems}\label{sec:term-rewriting}
\subsection{Non-Local Term Rewriting Systems}\label{sec:non-local-term-rewriting}
Another approach is to leave the textual syntax of the language fixed, but repurpose it for novel ends using a \emph{term rewriting system}. Term rewriting systems transform syntactically well-formed terms into other syntactically well-formed terms (unlike syntax definition systems, which operate on the program text.)
Non-local term rewriting systems typically operate over an entire compilation unit (e.g. a file). For example, one could define a preprocessor that rewrites every string literal that is followed by the comment \li{(*rx*)} to the corresponding expression (or pattern) of type \li{rx}. Under this scheme, the following expression would be rewritten to a regex expression, with \li{dna} treated as a spliced subexpression as described in the previous section:
\begin{lstlisting}[numbers=none]
"SSTRGC%(dna)GCESTR"(*rx*)
\end{lstlisting}
OCaml 4.02 introduced \emph{extension points} into its textual syntax \cite{ocaml-manual}. Extension points serve as markers for the benefit of a non-local term rewriting system (conventionally called a \emph{ppx preprocessor}.) They are less \emph{ad hoc} than comments, in that each extension point is associated with a single term in a well-defined way, and the compiler gives an error if any extension points remain after preprocessing is complete. For example, in the following program fragment,
\begin{lstlisting}[numbers=none]
let%lwt (x, y) = f in x + y
\end{lstlisting}
the \li{%lwt}
annotation on the let expression is recognized by a preprocessor distributed with \li{Lwt}, a lightweight threading library. This preprocessor rewrites this fragment to:
\begin{lstlisting}[numbers=none]
Lwt.bind f (fun (x, y) -> x + y)
\end{lstlisting}
The OCaml system is distributed with a library called \li{ppx_tools} that simplifies the task of writing preprocessors that operate on terms annotated with extension points.
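To give a flavor, here is a minimal sketch of such a preprocessor, written directly against the \li{Ast_mapper} API from OCaml's \li{compiler-libs} (as of OCaml 4.02) rather than \li{ppx_tools}. It rewrites \li{[%rx "..."]} nodes; \li{parse_rx}, which would translate the literal body into applications of regex constructors, is a hypothetical helper that we elide:
\begin{lstlisting}[numbers=none]
open Asttypes
open Parsetree
open Ast_mapper

(* hypothetical: statically parse the literal body into an expression
   built from regex constructors; elided here *)
let parse_rx (s : string) : expression =
  ignore s; failwith "elided"

let rx_mapper (_args : string list) : mapper =
  { default_mapper with
    expr = fun mapper e ->
      match e.pexp_desc with
      | Pexp_extension ({ txt = "rx"; _ },
          PStr [ { pstr_desc = Pstr_eval
                     ({ pexp_desc =
                          Pexp_constant (Const_string (s, _)); _ }, _);
                   _ } ]) ->
          parse_rx s                       (* replace [%rx "..."] wholesale *)
      | _ -> default_mapper.expr mapper e }

let () = register "rx" rx_mapper
\end{lstlisting}
Compiled as a standalone executable and passed to the compiler via the \li{-ppx} flag, such a rewriter is run over the AST of each compilation unit.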
There are a number of other systems that support non-local term rewriting. For example, the \li{GHC} compiler for Haskell \cite{jones2001playing} and the \li{xoc} compiler for C \cite{conf/asplos/CoxBCKK08} both support user-defined non-local rewritings.
These systems present several difficulties with abstract reasoning, many of which are directly analogous to those that syntax definition systems present:
\begin{enumerate}
\item \textbf{Conflict:} Different preprocessors may recognize the same markers or code patterns.
\item \textbf{Responsibility:} It is not always clear which preprocessor handles each rewritten form.
\item \textbf{Localization:} A non-local term rewriting system might insert code anywhere in the program, complicating local reasoning.
\item \textbf{Segmentation:} It is not always clear where spliced terms appear inside rewritten string literal forms.
\item \textbf{Capture:} The rewriting might place terms under binders that shadow bindings visible in the program text.
\item \textbf{Context Dependence:} The rewriting might assume that certain identifiers are bound at particular locations, making it difficult to reason about refactoring.
\item \textbf{Typing:} It is not always clear what type the rewriting of a marked form will have (if indeed the rewriting happens to be local.) Similarly, it is often unclear what types the terms that appear within the rewritten form should have.
\end{enumerate}
\subsection{Term-Rewriting Macro Systems}\label{sec:macro-systems}
Macro systems are language-integrated local term rewriting systems, i.e. they allow programmers to designate functions that implement rewritings as macros. Clients apply macros directly to terms (e.g. expressions, patterns and other sorts of terms.) The rewritten term is known as the \emph{expansion} of the macro application.
Macro systems do not suffer from problems related to reasoning about \textbf{Conflict}, \textbf{Responsibility} and \textbf{Localization} described above because macros are applied explicitly and operate locally.
Na\"ive macro systems, like the earliest variants of the LISP macro system \cite{Hart63a}, early compile-time quotation expanders in ML \cite{mauny1994complete}, Template Haskell macros \cite{SheardPeytonJones:Haskell-02} and GHC quasiquotes \cite{mainland2007s}, do not escape from the remaining problems described above, because they can generate arbitrary code for insertion at the macro application site. For example, it is possible in early LISP dialects and in these other less disciplined modern macro systems to define a macro \li{rx!} that can be applied to rewrite a string form containing a spliced subexpression to a regex:
\begin{lstlisting}[numbers=none]
(rx! "SSTRGC%(dna)GCESTR")
\end{lstlisting}
The problem with these systems is that without examining the macro's implementation or the generated expansion, there is no way to reason about \textbf{Segmentation}, \textbf{Capture}, \textbf{Context Dependence} or \textbf{Typing}.\footnote{It is not enough that the generated expansions be typechecked -- it must be possible for the user to reason about \emph{what the type of the expansion is}.}
The problem of \textbf{Capture} was addressed by the design of Scheme's \emph{hygienic macro system} \cite{Kohlbecker86a,DBLP:conf/popl/Adams15,Herman10:Theory,DBLP:conf/esop/HermanW08,DBLP:journals/lisp/DybvigHB92,DBLP:conf/popl/ClingerR91}, which automatically alpha-renames identifiers bound in the expansion so that they do not shadow those that appear at the macro application site. %modern macro systems (e.g. Racket's macro systems \cite{DBLP:journals/jfp/FlattCDF12} and Scala's macro system \cite{ScalaMacros2013}) incorporate such a hygiene mechanism. %Some of these allow the macro provider to override the automatic renaming in situations where shadowing is explicitly intended, though \todo{some make allowances...}
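In terms of the hand-expanded \li{or!} sketch above, a hygienic expander would instead produce the following (the renaming is done by hand here, since OCaml itself has no such mechanism):
\begin{lstlisting}[numbers=none]
let f x = x > 0
let x = -3
let tmp = true
let result =
  let tmp' = f x in            (* expansion-internal binding renamed apart *)
  if tmp' then tmp' else tmp   (* spliced tmp retains its use-site meaning: true *)
\end{lstlisting}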
The problem of \textbf{Context Dependence} is typically confronted by allowing macro expansions to explicitly refer only to those bindings in scope at the macro definition site. These references are preserved even if the identifiers involved have been shadowed at the macro application site \cite{DBLP:conf/popl/ClingerR91,DBLP:journals/lisp/DybvigHB92,DBLP:conf/popl/Adams15}. Any references to application site bindings must originate in one of the macro's arguments. There are two problems with this approach:
\begin{enumerate}
\item It does not make explicit which of the definition site bindings the expansions generated by a macro might refer to, so reasoning abstractly about the renaming of definition site bindings remains problematic.
\item Preventing access to the application site bindings makes defining a macro like \li{rx!} impossible, because spliced subexpressions (like \li{dna} above) do not appear as subexpressions of an argument to \li{rx!} -- they are parsed out of a string literal programmatically. From the perspective of the macro system, such spliced subexpressions are indistinguishable from inappropriate references to bindings tracked by the application site context.
The only choice, then, is to repurpose other forms that do contain subexpressions. For example, the macro might repurpose infix operators that usually have a different meaning, e.g. \li{^}:
\begin{lstlisting}[numbers=none]
(rx! ("SSTRGCESTR" ^ dna ^ "SSTRGCESTR"))
\end{lstlisting}
This is rather confusing, in that it appears that string concatenation is occurring when that is not the case -- \li{rx!} is simply repurposing the infix \li{^} form.
\end{enumerate}
The problem of reasoning about \textbf{Typing} is relatively understudied, because most research on macro systems has been done in languages in the LISP tradition that do not define a rich static semantics.
Herman and Wand's calculus of macros \cite{DBLP:conf/esop/HermanW08,Herman10:Theory} does use a type system to reason about the binding structure of the expansion that a macro generates, but the expansions themselves are not written in a language with rich type structure.
Some macro systems for languages with non-trivial type structure, like Template Haskell \cite{SheardPeytonJones:Haskell-02}, do not support reasoning about types in that the guarantee is only that the expansion is well-typed -- clients cannot reason about \emph{what that type is}.
Other macro systems, like MacroML \cite{ganz2001macros,Sheard:1999:UMS}, support reasoning about typing, but these systems are \emph{staging macro systems}, rather than \emph{term-rewriting macro systems}, meaning that the macro does not have access to the syntax tree of the arguments at all. Staging macros cannot be used for syntactic control -- macro application syntactically coincides with function application. These macro systems are instead motivated primarily by concerns about performance.
The Scala macro system is a notable example of a term-rewriting macro system that does allow reasoning about typing \cite{ScalaMacros2013}. In particular, Scala's ``black box'' macros declare type annotations on their arguments and on the expansions they produce, so applications are typechecked against the declared signature. We are not aware of a typed macro system that has been integrated into a language with an ML-style module system. The main problem with Scala's macro system, then, is that it does not give us enough syntactic control -- we must repurpose Scala's existing syntactic forms, as discussed in point 2 above.
% The most immediate problem with using these for our example is that we are not aware of any such statically-typed macro system that integrates cleanly with an ML-style module system. In other words, macros cannot be parameterized by modules. However, let us imagine such a macro system. We could use it to repurpose string syntax as follows:
% \begin{lstlisting}[numbers=none]
% let val ssn = rx R {rx|SSTR\d\d\d-\d\d-\d\d\d\dESTR|rx}
% \end{lstlisting}
% The definition of the macro \lstinline{rx} might look like this:
% \begin{lstlisting}
% macro rx[Q : RX](e) at Q.t {
% static fun f(e : Exp) : Exp => case(e) {
% StrLit(s) => (* regex parser here *)
% | BinOp(Caret, e1, e2) => `SQTQ.Seq(Q.Str(%EQTe1SQT), %(EQTf e2SQT))EQT`
% | BinOp(Plus, e1, e2) => `SQTQ.Seq(%(EQTf e1SQT), %(EQTf e2SQT))EQT`
% | _ => raise Error
% }
% }
% \end{lstlisting}
% Here, \lstinline{rx} is a macro parameterized by a module matching \lstinline{rx} (we identify it as \lstinline{Q} to emphasize that there is nothing special about the identifier \lstinline{R}) and taking a single argument, identified as \lstinline{e}. The macro specifies a type annotation, \lstinline{at Q.t}, which imposes the constraint that the expansion the macro statically generates must be of type \lstinline{Q.t} for the provided parameter \lstinline{Q}. This expansion is generated by a \emph{static function} that examines the syntax tree of the provided argument (syntax trees are of a type \lstinline{Exp} defined in the standard library; cf. SML/NJ's visible compiler \cite{SML/VisibleCompiler}). If it is a string literal, as in the example above, it statically parses the literal body to generate an expansion (the details of the parser, elided on line 3, would be entirely standard).
% By parsing the string statically, we avoid the problems of dynamic string parsing for statically-known patterns.
% For patterns that are constructed compositionally, we need to get more creative. For example, we might repurpose the infix operators that are normally used for other purposes to support string and pattern splicing, e.g. as follows:
% \begin{lstlisting}[numbers=none,escapechar=|]
% fun example_using_macro(name : string) =>
% rx R (name ^ "SSTR: ESTR" + ssn)
% \end{lstlisting}
% The binary operator \lstinline{^} is repurposed to indicate a spliced string and \lstinline{+} is repurposed to indicate a spliced pattern. The logic for handling these forms can be seen above on lines 4 and 5, respectively. We assume that there is derived syntax available at the type \lstinline{Exp}, i.e. \emph{quasiquotation} syntax as in Lisp \cite{Bawd99a} and Scala \cite{shabalin2013quasiquotes}, here delimited by backticks and using the prefix \lstinline{%} to indicate a spliced value (i.e. unquote).
% Having to creatively repurpose existing forms in this way limits the effect a library provider can have on cognitive cost (particularly when it would be desirable to express conventions that are quite different from the conventions adopted by the language). It also can create confusion for readers expecting parenthesized expressions to behave in a consistent manner. However, this approach is preferable to direct syntax extension because it avoids many of the problems discussed above: there cannot be syntactic conflicts (because the syntax is not extended at all), we can define macro abbreviations because macros are integrated into the language, there is a hygiene discipline that guarantees that the expansion will not capture variables inadvertently, and by using a typed macro system, programmers need not examine the expansion to know what type the expansion produced by a macro must have.
% \subsection{Active Libraries}
% The design we are proposing also has conceptual roots in earlier work on \emph{active libraries}, which similarly envisioned using compile-time computation to give library providers more control over various aspects of a programming system, including its concrete syntax (but did not take an approach rooted in the study of type systems) \cite{active-libraries-thesis}.
% TODO FLESH THIS OUT