Skip to content

hir::literal::Extractor not producing optimal literals in some cases #1032

Closed
@plusvic

Description

@plusvic

What version of regex are you using?

regex 1.9.1
regex-syntax 0.7.3

Describe the bug at a high level.

The literal extractor in the regex_syntax crate is not extracting the literals I expect from some repetition nodes. Specifically, when the repetition doesn't have a max value it doesn't produce optimal literals.

For instance, the regexp ab{2,} could produce the inexact literal I("abab"), because ab is guarantee to appear at least twice, however it produces I("ab"). If a max value is explicitly specified, like in ab{2,10}, the result is I("abab") as expected.

The issue resides in this code snippet:

https://github.com/rust-lang/regex/blob/28e16fa5c34ab30a84b20de730cbdbe636e8a6df/regex-syntax/src/hir/literal.rs#L480C13-L497

            hir::Repetition { min, max: Some(max), .. } if min < max => {
                assert!(min > 0); // handled above
                let limit =
                    u32::try_from(self.limit_repeat).unwrap_or(u32::MAX);
                let mut seq = Seq::singleton(Literal::exact(vec![]));
                for _ in 0..cmp::min(min, limit) {
                    if seq.is_inexact() {
                        break;
                    }
                    seq = self.cross(seq, &mut subseq.clone());
                }
                seq.make_inexact();
                seq
            }
            hir::Repetition { .. } => {
                subseq.make_inexact();
                subseq
            }

The condition if min < max could be removed if we assume that when max is None it means that it should be >= min, which is the case in regular expressions. The snippet above could be replaced with:

            hir::Repetition { min, .. } {
                assert!(min > 0); // handled above
                let limit =
                    u32::try_from(self.limit_repeat).unwrap_or(u32::MAX);
                let mut seq = Seq::singleton(Literal::exact(vec![]));
                for _ in 0..cmp::min(min, limit) {
                    if seq.is_inexact() {
                        break;
                    }
                    seq = self.cross(seq, &mut subseq.clone());
                }
                seq.make_inexact();
                seq
            }

If I'm not missing something obvious I can prepare a PR and send it for review.

What are the steps to reproduce the behavior?

use regex_syntax::{hir::literal::{Extractor, Seq, Literal}, parse};


fn main() {
    let expected = Seq::from_iter([
        Literal::inexact("abab")
    ]);
    
    let hir = parse(r"(ab){2,10}").unwrap();
    let got = Extractor::new().extract(&hir);

    // This is OK, the literal extracted is I("abab") as expected.
    assert_eq!(expected, got);
    
    let hir = parse(r"(ab){2,}").unwrap();
    let got = Extractor::new().extract(&hir);

    // This is not OK, even though the regexp also guarantees that
    // "ab" must appear at least twice, the literal extracted is I("ab")
    // instead of I("abab") as I was expecting.
    assert_eq!(expected, got);
}

What is the actual behavior?

The program above fails in the second assert_eq statement because the literal produced for (ab){2,} is I("ab") instead of `I("abab")

What is the expected behavior?

Both assert_eq statements should pass.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions