Description
What version of regex are you using?
regex 1.9.1
regex-syntax 0.7.3
Describe the bug at a high level.
The literal extractor in the regex_syntax
crate is not extracting the literals I expect from some repetition nodes. Specifically, when the repetition doesn't have a max value it doesn't produce optimal literals.
For instance, the regexp ab{2,}
could produce the inexact literal I("abab")
, because ab
is guarantee to appear at least twice, however it produces I("ab")
. If a max value is explicitly specified, like in ab{2,10}
, the result is I("abab")
as expected.
The issue resides in this code snippet:
hir::Repetition { min, max: Some(max), .. } if min < max => {
assert!(min > 0); // handled above
let limit =
u32::try_from(self.limit_repeat).unwrap_or(u32::MAX);
let mut seq = Seq::singleton(Literal::exact(vec![]));
for _ in 0..cmp::min(min, limit) {
if seq.is_inexact() {
break;
}
seq = self.cross(seq, &mut subseq.clone());
}
seq.make_inexact();
seq
}
hir::Repetition { .. } => {
subseq.make_inexact();
subseq
}
The condition if min < max
could be removed if we assume that when max is None
it means that it should be >= min, which is the case in regular expressions. The snippet above could be replaced with:
hir::Repetition { min, .. } {
assert!(min > 0); // handled above
let limit =
u32::try_from(self.limit_repeat).unwrap_or(u32::MAX);
let mut seq = Seq::singleton(Literal::exact(vec![]));
for _ in 0..cmp::min(min, limit) {
if seq.is_inexact() {
break;
}
seq = self.cross(seq, &mut subseq.clone());
}
seq.make_inexact();
seq
}
If I'm not missing something obvious I can prepare a PR and send it for review.
What are the steps to reproduce the behavior?
use regex_syntax::{hir::literal::{Extractor, Seq, Literal}, parse};
fn main() {
let expected = Seq::from_iter([
Literal::inexact("abab")
]);
let hir = parse(r"(ab){2,10}").unwrap();
let got = Extractor::new().extract(&hir);
// This is OK, the literal extracted is I("abab") as expected.
assert_eq!(expected, got);
let hir = parse(r"(ab){2,}").unwrap();
let got = Extractor::new().extract(&hir);
// This is not OK, even though the regexp also guarantees that
// "ab" must appear at least twice, the literal extracted is I("ab")
// instead of I("abab") as I was expecting.
assert_eq!(expected, got);
}
What is the actual behavior?
The program above fails in the second assert_eq
statement because the literal produced for (ab){2,}
is I("ab")
instead of `I("abab")
What is the expected behavior?
Both assert_eq
statements should pass.