-
Couldn't load subscription status.
- Fork 483
Description
Here is a failing test case:
#[test]
fn bm_backstop_boundary() {
let haystack = b"\
// aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
e_data.clone_created(entity_id, entity_to_add.entity_id);
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
".to_vec();
let needle = b"clone_created".to_vec();
let searcher = BoyerMooreSearch::new(needle);
let result = searcher.find(&haystack);
assert_eq!(Some(43), result);
}This was originally reported against ripgrep:
- ripgrep misses some matching lines BurntSushi/ripgrep#780
- ripgrep misses a match but succeeds with
-Sand-iBurntSushi/ripgrep#781
Both bugs are related to TBM. That is, shutting TBM off fixes the issue. The above unit test was derived by widdling down the example corpus given in BurntSushi/ripgrep#781
I've tried to fix this myself, but the code is dense. As best I can tell, the md2 shift is wrong, or this is hitting up into a boundary condition with respect to the backstop (more likely?). That is, this particular case enters the memchr skip routine, which uses _ as its guard char. This takes window_end from 12 to 44, which is positioned at the l in clone_created. This then fails check_match (which seems correct), but this in turn causes the md2_shift (which is 12) to get added to the window_end index, which causes the code to skip ahead too far and miss the match.
The part that doesn't make sense to me here is the relationship between check_match and the md2_shift. Namely, check_match appears to be used as a guard against whether md2_shift is used. If the match fails, then the code assumes we can skip md2_shift bytes. It almost seems like the case analysis is a little off. For example, this code seems to fix this particular bug:
if self.check_match(haystack, window_end) {
return Some(window_end - (self.pattern.len() - 1));
} else if window_end + self.md2_shift >= backstop {
break;
} else {
window_end += self.md2_shift;
continue;
}but I didn't arrive to this code in a principled way, so I'm not confident in it. It feels like to me the very fact that the md2_shift could overshoot the match is the problem, but it's less clear how to fix that.
@ethanpailes Do you have any ideas here? I am probably going to disable TBM for now because of this bug. Would it make sense to use a more traditional TBM variant where we don't use frequency analysis but instead always skip based on the last byte?