You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
use regex_automata::{meta::regex, nfa::thompson::pikevm::pikevm};fnmain(){
env_logger::init();let pattern = r".+\b\n";let haystack = "β77\n";let baseline = pikevm::new(pattern).unwrap();letmut cache = baseline.create_cache();let re = regex::new(pattern).unwrap();let found1 = re.find(haystack);let found2 = baseline.find(&mut cache, haystack);iflet some(found1) = found1 {let found2 = found2.expect("found in target, but not in baseline!");assert_eq!(found1, found2);}}
From looking at RUST_LOG=trace cargo run, my guess (without looking at the code yet) is that something in the reverse suffix optimization isn't handling the DFA quit error correctly. That is, when it does a reverse scan after a literal match, it's stopping its search for the starting point early... Probably because of the beta character (a non-ASCII codepoint). The higher level code should see it as a quit error and fall back to another strategy but instead it's seeing it as a correct match. Or perhaps a quit error isn't being returned at all somehow.
Ahhhhhhhhhhhh yeah it's not returning a quit error:
This fixes a bug that can occur when:
1. The regex has a Unicode word boundary.
2. The haystack contains some non-ASCII Unicode scalar value.
3. An inner or suffix literal optimization is in play.
Specifically, this provokes a case where a match is detected in one of
the meta engine's ad hoc DFA search routines, but before the match
reaches its correct endpoint, a quit state is entered. (Because DFAs
can't deal with Unicode word boundaries on non-ASCII haystacks.) The
correct thing to do is to return a quit error and let the higher level
logic divert to a different engine, but it was returning the match that
it had found up until that point instead. The match returned is not
technically incorrect in the sense that a match does indeed exist, but
the offsets it reports may be shorter than what the true match actually
is.
So... if a quit state is entered, return an error regardless of whether
a match has been found.
Fixes#1046
This fails, but it should succeed:
From looking at
RUST_LOG=trace cargo run
, my guess (without looking at the code yet) is that something in the reverse suffix optimization isn't handling the DFA quit error correctly. That is, when it does a reverse scan after a literal match, it's stopping its search for the starting point early... Probably because of the beta character (a non-ASCII codepoint). The higher level code should see it as a quit error and fall back to another strategy but instead it's seeing it as a correct match. Or perhaps a quit error isn't being returned at all somehow.Ahhhhhhhhhhhh yeah it's not returning a quit error:
regex/regex-automata/src/meta/limited.rs
Lines 123 to 126 in 855c5c4
This also afflicts the "stop at" engine which is used in the reverse inner optimization:
regex/regex-automata/src/meta/stopat.rs
Lines 124 to 127 in 855c5c4
And afflicts the fully compiled DFAs for both of those as well.
The text was updated successfully, but these errors were encountered: