Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regex with Unicode word boundary and suffix literal and a haystack with non-ASCII codepoint is handled incorrectly #1046

Closed
BurntSushi opened this issue Jul 16, 2023 · 0 comments

Comments

@BurntSushi
Copy link
Member

This fails, but it should succeed:

use regex_automata::{meta::regex, nfa::thompson::pikevm::pikevm};

fn main() {
    env_logger::init();

    let pattern = r".+\b\n";
    let haystack = "β77\n";
    let baseline = pikevm::new(pattern).unwrap();
    let mut cache = baseline.create_cache();
    let re = regex::new(pattern).unwrap();

    let found1 = re.find(haystack);
    let found2 = baseline.find(&mut cache, haystack);
    if let some(found1) = found1 {
        let found2 = found2.expect("found in target, but not in baseline!");
        assert_eq!(found1, found2);
    }
}

From looking at RUST_LOG=trace cargo run, my guess (without looking at the code yet) is that something in the reverse suffix optimization isn't handling the DFA quit error correctly. That is, when it does a reverse scan after a literal match, it's stopping its search for the starting point early... Probably because of the beta character (a non-ASCII codepoint). The higher level code should see it as a quit error and fall back to another strategy but instead it's seeing it as a correct match. Or perhaps a quit error isn't being returned at all somehow.

Ahhhhhhhhhhhh yeah it's not returning a quit error:

} else if sid.is_quit() {
if mat.is_some() {
return Ok(mat);
}

This also afflicts the "stop at" engine which is used in the reverse inner optimization:

} else if sid.is_quit() {
if mat.is_some() {
return Ok(mat.ok_or(at));
}

And afflicts the fully compiled DFAs for both of those as well.

@BurntSushi BurntSushi added the bug label Jul 16, 2023
BurntSushi added a commit that referenced this issue Oct 6, 2023
This fixes a bug that can occur when:

1. The regex has a Unicode word boundary.
2. The haystack contains some non-ASCII Unicode scalar value.
3. An inner or suffix literal optimization is in play.

Specifically, this provokes a case where a match is detected in one of
the meta engine's ad hoc DFA search routines, but before the match
reaches its correct endpoint, a quit state is entered. (Because DFAs
can't deal with Unicode word boundaries on non-ASCII haystacks.) The
correct thing to do is to return a quit error and let the higher level
logic divert to a different engine, but it was returning the match that
it had found up until that point instead. The match returned is not
technically incorrect in the sense that a match does indeed exist, but
the offsets it reports may be shorter than what the true match actually
is.

So... if a quit state is entered, return an error regardless of whether
a match has been found.

Fixes #1046
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant