1
1
Rust's lexical grammar is not context-free. Raw string literals are the source
2
2
of the problem. Informally, a raw string literal is an ` r ` , followed by ` N `
3
3
hashes (where N can be zero), a quote, any characters, then a quote followed
4
- by ` N ` hashes. This grammar describes this as best possible:
4
+ by ` N ` hashes. Critically, once inside the first pair of quotes,
5
+ another quote cannot be followed by ` N ` consecutive hashes. e.g.
6
+ ` r###""###"### ` is invalid.
7
+
8
+ This grammar describes this as best possible:
5
9
6
10
R -> 'r' S
7
11
S -> '"' B '"'
@@ -22,8 +26,39 @@ accepted as one by the above grammar, using the derivation:
22
26
(Where ` T : U ` means the rule ` T ` is applied, and ` U ` is the remainder of the
23
27
string.) The difficulty arises from the fact that it is fundamentally
24
28
context-sensitive. In particular, the context needed is the number of hashes.
25
- I know of no way to resolve this, but also have not come up with a proof that
26
- it is not context sensitive. Such a proof would probably use the pumping lemma
27
- for context-free languages, but I (cmr) could not come up with a proof after
28
- spending a few hours on it, and decided my time best spent elsewhere. Pull
29
- request welcome!
29
+
30
+ To prove that Rust's string literals are not context-free, we will use
31
+ the fact that context-free languages are closed under intersection with
32
+ regular languages, and the
33
+ [ pumping lemma for context-free languages] ( https://en.wikipedia.org/wiki/Pumping_lemma_for_context-free_languages ) .
34
+
35
+ Consider the regular language ` R = r#+""#*"#+ ` . If Rust's raw string literals are
36
+ context-free, then their intersection with ` R ` , ` R' ` , should also be context-free.
37
+ Therefore, to prove that raw string literals are not context-free,
38
+ it is sufficient to prove that ` R' ` is not context-free.
39
+
40
+ The language ` R' ` is ` {r#^n""#^m"#^n | m < n} ` .
41
+
42
+ Assume ` R' ` * is* context-free. Then ` R' ` has some pumping length ` p > 0 ` for which
43
+ the pumping lemma applies. Consider the following string ` s ` in ` R' ` :
44
+
45
+ ` r#^p""#^{p-1}"#^p `
46
+
47
+ e.g. for ` p = 2 ` : ` s = r##""#"## `
48
+
49
+ Then ` s = uvwxy ` for some choice of ` uvwxy ` such that ` vx ` is non-empty,
50
+ ` |vwx| < p+1 ` , and ` uv^iwx^iy ` is in ` R' ` for all ` i >= 0 ` .
51
+
52
+ Neither ` v ` nor ` x ` can contain a ` " ` or ` r ` , as the number of these characters
53
+ in any string in ` R' ` is fixed. So ` v ` and ` x ` contain only hashes.
54
+ Consequently, of the three sequences of hashes, ` v ` and ` x ` combined
55
+ can only pump two of them.
56
+ If we ever choose the central sequence of hashes, then one of the outer sequences
57
+ will not grow when we pump, leading to an imbalance between the outer sequences.
58
+ Therefore, we must pump both outer sequences of hashes. However,
59
+ there are ` p+2 ` characters between these two sequences of hashes, and ` |vwx| ` must
60
+ be less than ` p+1 ` . Therefore we have a contradiction, and ` R' ` must not be
61
+ context-free.
62
+
63
+ Since ` R' ` is not context-free, it follows that the Rust's raw string literals
64
+ must not be context-free.
0 commit comments