Skip to content

Commit 2b92b94

Browse files
committed
syntax: fix HIR printer
This fixes some corner cases in the HIR printer where it would print the concrete syntax of a regex that does not match the natural interpretation of the HIR. One such example of this is: concat(a, alt(b, c)) This would get printed as ab|c But clearly, it should be printed as: a(?:b|c) The issue here is that the printer only considers the current HirKind when determining how to print it. Sometimes a group is needed to print an alt (and even a concat, in the case of 'rep(+, concat(a, b))'), but sometimes it isn't. We could address this in a few different ways: 1) Always print concats and alts inside a non-capturing group. 2) Make the printer detect precisely the cases where a non-capturing group is needed. 3) Make the HIR smart constructors insert non-capturing groups when needed. 4) Do some other thing to change the HIR to prevent these sorts of things by construction. This patch goes with (1). The reason in favor of it is that HIR printer was always about printing an equivalent regex and never about trying to print a "nice" regex. Indeed, the HIR printer can't print a nice regex, because the HIR represents a rigorously simplifed view of a regex to make analysis easier. (The most obvious such example are Unicode character classes. For example, the HIR printer never prints '\w'.) So inserting some extra groups (which it already does) even when they aren't strictly needed is perfectly okay. But still, it's useful to say why we didn't do the other choices: 2) Modifying the printer to only print groups when they're actually needed is pretty difficult. I tried this briefly, and handling this case requires some notion of what the parent expression is. This winds up being a possible but hairy change. 3) Making the HIR more complicated to make the printer correct seems like it's optimizing for the wrong thing. Inserting extra groups in places just obfuscates HIR values that already have clear semantics. That is, use concat(a, alt(b, c)) over concat(a, group(alt(b, c))). 4) It's not clear how we would change the HIR to guarantee this sort of thing wouldn't happen. At the very least, it seems likely it would require a more complex data type. At first, I had thought (1) seemed inelegant. But the more I thought about it, the more it seemed quite consistent with how the HIR printer already worked. So that's the path I took here. Closes #516, Closes #731
1 parent 58e7b70 commit 2b92b94

File tree

1 file changed

+32
-17
lines changed

1 file changed

+32
-17
lines changed

regex-syntax/src/hir/print.rs

+32-17
Original file line numberDiff line numberDiff line change
@@ -89,10 +89,9 @@ impl<W: fmt::Write> Visitor for Writer<W> {
8989

9090
fn visit_pre(&mut self, hir: &Hir) -> fmt::Result {
9191
match *hir.kind() {
92-
HirKind::Empty
93-
| HirKind::Repetition(_)
94-
| HirKind::Concat(_)
95-
| HirKind::Alternation(_) => {}
92+
// Empty is represented by nothing in the concrete syntax, and
93+
// repetition operators are strictly suffix oriented.
94+
HirKind::Empty | HirKind::Repetition(_) => {}
9695
HirKind::Literal(hir::Literal::Unicode(c)) => {
9796
self.write_literal_char(c)?;
9897
}
@@ -162,6 +161,22 @@ impl<W: fmt::Write> Visitor for Writer<W> {
162161
self.wtr.write_str("(?:")?;
163162
}
164163
},
164+
// Why do this? Wrapping concats and alts in non-capturing groups
165+
// is not *always* necessary, but is sometimes necessary. For
166+
// example, 'concat(a, alt(b, c))' should be written as 'a(?:b|c)'
167+
// and not 'ab|c'. The former is clearly the intended meaning, but
168+
// the latter is actually 'alt(concat(a, b), c)'.
169+
//
170+
// It would be possible to only group these things in cases where
171+
// it's strictly necessary, but it requires knowing the parent
172+
// expression. And since this technique is simpler and always
173+
// correct, we take this route. More to the point, it is a non-goal
174+
// of an HIR printer to show a nice easy-to-read regex. Indeed,
175+
// its construction forbids it from doing so. Therefore, inserting
176+
// extra groups where they aren't necessary is perfectly okay.
177+
HirKind::Concat(_) | HirKind::Alternation(_) => {
178+
self.wtr.write_str(r"(?:")?;
179+
}
165180
}
166181
Ok(())
167182
}
@@ -172,9 +187,7 @@ impl<W: fmt::Write> Visitor for Writer<W> {
172187
HirKind::Empty
173188
| HirKind::Literal(_)
174189
| HirKind::Class(_)
175-
| HirKind::Look(_)
176-
| HirKind::Concat(_)
177-
| HirKind::Alternation(_) => {}
190+
| HirKind::Look(_) => {}
178191
HirKind::Repetition(ref x) => {
179192
match (x.min, x.max) {
180193
(0, Some(1)) => {
@@ -206,8 +219,10 @@ impl<W: fmt::Write> Visitor for Writer<W> {
206219
self.wtr.write_str("?")?;
207220
}
208221
}
209-
HirKind::Group(_) => {
210-
self.wtr.write_str(")")?;
222+
HirKind::Group(_)
223+
| HirKind::Concat(_)
224+
| HirKind::Alternation(_) => {
225+
self.wtr.write_str(r")")?;
211226
}
212227
}
213228
Ok(())
@@ -374,12 +389,12 @@ mod tests {
374389

375390
#[test]
376391
fn print_alternation() {
377-
roundtrip("|", "|");
378-
roundtrip("||", "||");
392+
roundtrip("|", "(?:|)");
393+
roundtrip("||", "(?:||)");
379394

380-
roundtrip("a|b", "a|b");
381-
roundtrip("a|b|c", "a|b|c");
382-
roundtrip("foo|bar|quux", "foo|bar|quux");
395+
roundtrip("a|b", "(?:a|b)");
396+
roundtrip("a|b|c", "(?:a|b|c)");
397+
roundtrip("foo|bar|quux", "(?:(?:foo)|(?:bar)|(?:quux))");
383398
}
384399

385400
// This is a regression test that stresses a peculiarity of how the HIR
@@ -415,7 +430,7 @@ mod tests {
415430
}),
416431
Hir::literal(hir::Literal::Unicode('y')),
417432
]);
418-
assert_eq!(r"x(?:ab)+y", expr.to_string());
433+
assert_eq!(r"(?:x(?:ab)+y)", expr.to_string());
419434
}
420435

421436
// Just like regression_repetition_concat, but with the repetition using
@@ -437,7 +452,7 @@ mod tests {
437452
}),
438453
Hir::literal(hir::Literal::Unicode('y')),
439454
]);
440-
assert_eq!(r"x(?:a|b)+y", expr.to_string());
455+
assert_eq!(r"(?:x(?:a|b)+y)", expr.to_string());
441456
}
442457

443458
// This regression test is very similar in flavor to
@@ -460,6 +475,6 @@ mod tests {
460475
Hir::literal(hir::Literal::Unicode('c')),
461476
]),
462477
]);
463-
assert_eq!(r"a(?:b|c)", expr.to_string());
478+
assert_eq!(r"(?:a(?:b|c))", expr.to_string());
464479
}
465480
}

0 commit comments

Comments
 (0)