regex with Unicode word boundary and suffix literal and a haystack with non-ASCII codepoint is handled incorrectly #1046

Closed
@BurntSushi

This fails, but it should succeed:

use regex_automata::{meta::Regex, nfa::thompson::pikevm::PikeVM};

fn main() {
    env_logger::init();

    let pattern = r".+\b\n";
    let haystack = "β77\n";
    // The PikeVM serves as the baseline to compare against.
    let baseline = PikeVM::new(pattern).unwrap();
    let mut cache = baseline.create_cache();
    // The meta regex engine is the one under test.
    let re = Regex::new(pattern).unwrap();

    let found1 = re.find(haystack);
    let found2 = baseline.find(&mut cache, haystack);
    // Any match from the meta engine must agree with the baseline.
    if let Some(found1) = found1 {
        let found2 = found2.expect("found in target, but not in baseline!");
        assert_eq!(found1, found2);
    }
}

From looking at RUST_LOG=trace cargo run, my guess (without looking at the code yet) is that something in the reverse suffix optimization isn't handling the DFA quit error correctly. That is, when it does a reverse scan after a literal match, it stops searching for the starting point too early, probably because of the β character (a non-ASCII codepoint). The higher-level code should see a quit error and fall back to another strategy, but instead it sees a correct match. Or perhaps a quit error isn't being returned at all somehow.

Ahhhhhhhhhhhh yeah it's not returning a quit error:

} else if sid.is_quit() {
    if mat.is_some() {
        return Ok(mat);
    }

This also afflicts the "stop at" engine which is used in the reverse inner optimization:

} else if sid.is_quit() {
    if mat.is_some() {
        return Ok(mat.ok_or(at));
    }

It also afflicts the fully compiled DFAs for both of those.
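
To make the failure mode concrete, here is a minimal std-only sketch of the pattern described above. It is not the regex-automata code: `ScanError`, `reverse_scan`, `slow_find_start`, and `find_start` are hypothetical names, the "DFA" is a toy that treats every ASCII byte other than `\n` as matching `.`, and it quits on any non-ASCII byte, which is assumed here to mirror what the real DFA does when Unicode word boundaries are involved. The `report_partial_on_quit` flag switches between the behavior in the snippets above (return whatever match was seen so far) and always surfacing the quit so the caller can fall back.

```rust
/// Hypothetical error type for this sketch: the scanner gave up at `offset`.
#[derive(Debug, PartialEq)]
enum ScanError {
    Quit { offset: usize },
}

/// Toy reverse scan for the start of `.+` ending at `end` (a byte offset).
/// It quits on any non-ASCII byte. When `report_partial_on_quit` is true it
/// mimics the problematic branch: a match seen so far is returned as if it
/// were the final answer instead of signalling the quit.
fn reverse_scan(
    haystack: &[u8],
    end: usize,
    report_partial_on_quit: bool,
) -> Result<Option<usize>, ScanError> {
    let mut mat = None;
    let mut at = end;
    while at > 0 {
        at -= 1;
        let byte = haystack[at];
        if !byte.is_ascii() {
            if report_partial_on_quit && mat.is_some() {
                // The match start found so far may not be the leftmost one.
                return Ok(mat);
            }
            return Err(ScanError::Quit { offset: at });
        }
        if byte == b'\n' {
            // `.` does not match a line terminator, so the scan stops here.
            break;
        }
        // Every byte consumed so far is a candidate (earlier) match start.
        mat = Some(at);
    }
    Ok(mat)
}

/// Stand-in for a slower engine that can handle every byte: the start of
/// `.+` ending at `end` is just after the previous '\n' (or offset 0).
fn slow_find_start(haystack: &[u8], end: usize) -> Option<usize> {
    let start = haystack[..end]
        .iter()
        .rposition(|&b| b == b'\n')
        .map_or(0, |i| i + 1);
    if start < end { Some(start) } else { None }
}

/// Higher-level driver: trust the fast scan unless it reports a quit, in
/// which case fall back to the slow path.
fn find_start(haystack: &[u8], end: usize, report_partial_on_quit: bool) -> Option<usize> {
    match reverse_scan(haystack, end, report_partial_on_quit) {
        Ok(start) => start,
        Err(ScanError::Quit { .. }) => slow_find_start(haystack, end),
    }
}
```

With the haystack from the report, the suffix literal `\n` sits at byte offset 4 (β occupies two bytes). Calling `find_start("β77\n".as_bytes(), 4, true)` yields `Some(2)`, a start that skips β, while `find_start("β77\n".as_bytes(), 4, false)` yields `Some(0)` because the quit reaches the driver and triggers the fallback.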
