Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hir::literal::Extractor not producing optimal literals in some cases #1032

Closed
plusvic opened this issue Jul 10, 2023 · 1 comment · Fixed by #1033
Closed

hir::literal::Extractor not producing optimal literals in some cases #1032

plusvic opened this issue Jul 10, 2023 · 1 comment · Fixed by #1033

Comments

@plusvic
Copy link
Contributor

plusvic commented Jul 10, 2023

What version of regex are you using?

regex 1.9.1
regex-syntax 0.7.3

Describe the bug at a high level.

The literal extractor in the regex_syntax crate is not extracting the literals I expect from some repetition nodes. Specifically, when the repetition doesn't have a max value it doesn't produce optimal literals.

For instance, the regexp ab{2,} could produce the inexact literal I("abab"), because ab is guarantee to appear at least twice, however it produces I("ab"). If a max value is explicitly specified, like in ab{2,10}, the result is I("abab") as expected.

The issue resides in this code snippet:

https://github.com/rust-lang/regex/blob/28e16fa5c34ab30a84b20de730cbdbe636e8a6df/regex-syntax/src/hir/literal.rs#L480C13-L497

            hir::Repetition { min, max: Some(max), .. } if min < max => {
                assert!(min > 0); // handled above
                let limit =
                    u32::try_from(self.limit_repeat).unwrap_or(u32::MAX);
                let mut seq = Seq::singleton(Literal::exact(vec![]));
                for _ in 0..cmp::min(min, limit) {
                    if seq.is_inexact() {
                        break;
                    }
                    seq = self.cross(seq, &mut subseq.clone());
                }
                seq.make_inexact();
                seq
            }
            hir::Repetition { .. } => {
                subseq.make_inexact();
                subseq
            }

The condition if min < max could be removed if we assume that when max is None it means that it should be >= min, which is the case in regular expressions. The snippet above could be replaced with:

            hir::Repetition { min, .. } {
                assert!(min > 0); // handled above
                let limit =
                    u32::try_from(self.limit_repeat).unwrap_or(u32::MAX);
                let mut seq = Seq::singleton(Literal::exact(vec![]));
                for _ in 0..cmp::min(min, limit) {
                    if seq.is_inexact() {
                        break;
                    }
                    seq = self.cross(seq, &mut subseq.clone());
                }
                seq.make_inexact();
                seq
            }

If I'm not missing something obvious I can prepare a PR and send it for review.

What are the steps to reproduce the behavior?

use regex_syntax::{hir::literal::{Extractor, Seq, Literal}, parse};


fn main() {
    let expected = Seq::from_iter([
        Literal::inexact("abab")
    ]);
    
    let hir = parse(r"(ab){2,10}").unwrap();
    let got = Extractor::new().extract(&hir);

    // This is OK, the literal extracted is I("abab") as expected.
    assert_eq!(expected, got);
    
    let hir = parse(r"(ab){2,}").unwrap();
    let got = Extractor::new().extract(&hir);

    // This is not OK, even though the regexp also guarantees that
    // "ab" must appear at least twice, the literal extracted is I("ab")
    // instead of I("abab") as I was expecting.
    assert_eq!(expected, got);
}

What is the actual behavior?

The program above fails in the second assert_eq statement because the literal produced for (ab){2,} is I("ab") instead of `I("abab")

What is the expected behavior?

Both assert_eq statements should pass.

@BurntSushi
Copy link
Member

Yeah I think this sounds right to me!

plusvic added a commit to plusvic/regex that referenced this issue Jul 10, 2023
When repetitions didn't have an explicit max value, like in `(ab){2,}` the literal extractor was producing sub-optimal literals, like `"ab"` instead of `"abab"`.

Close rust-lang#1032
BurntSushi pushed a commit that referenced this issue Jul 11, 2023
When repetitions didn't have an explicit max value, like in `(ab){2,}`
the literal extractor was producing sub-optimal literals, like `"ab"`
instead of `"abab"`.

Close #1032
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants