Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tuned boyer moore fails a match case #446

Closed
BurntSushi opened this issue Feb 8, 2018 · 5 comments
Closed

tuned boyer moore fails a match case #446

BurntSushi opened this issue Feb 8, 2018 · 5 comments

Comments

@BurntSushi
Copy link
Member

Here is a failing test case:

    #[test]
    fn bm_backstop_boundary() {
        let haystack = b"\
// aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
e_data.clone_created(entity_id, entity_to_add.entity_id);
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
".to_vec();
        let needle = b"clone_created".to_vec();

        let searcher = BoyerMooreSearch::new(needle);
        let result = searcher.find(&haystack);
        assert_eq!(Some(43), result);
    }

This was originally reported against ripgrep:

Both bugs are related to TBM. That is, shutting TBM off fixes the issue. The above unit test was derived by widdling down the example corpus given in BurntSushi/ripgrep#781

I've tried to fix this myself, but the code is dense. As best I can tell, the md2 shift is wrong, or this is hitting up into a boundary condition with respect to the backstop (more likely?). That is, this particular case enters the memchr skip routine, which uses _ as its guard char. This takes window_end from 12 to 44, which is positioned at the l in clone_created. This then fails check_match (which seems correct), but this in turn causes the md2_shift (which is 12) to get added to the window_end index, which causes the code to skip ahead too far and miss the match.

The part that doesn't make sense to me here is the relationship between check_match and the md2_shift. Namely, check_match appears to be used as a guard against whether md2_shift is used. If the match fails, then the code assumes we can skip md2_shift bytes. It almost seems like the case analysis is a little off. For example, this code seems to fix this particular bug:

                if self.check_match(haystack, window_end) {
                    return Some(window_end - (self.pattern.len() - 1));
                } else if window_end + self.md2_shift >= backstop {
                    break;
                } else {
                    window_end += self.md2_shift;
                    continue;
                }

but I didn't arrive to this code in a principled way, so I'm not confident in it. It feels like to me the very fact that the md2_shift could overshoot the match is the problem, but it's less clear how to fix that.

@ethanpailes Do you have any ideas here? I am probably going to disable TBM for now because of this bug. Would it make sense to use a more traditional TBM variant where we don't use frequency analysis but instead always skip based on the last byte?

@BurntSushi
Copy link
Member Author

I think the quickcheck tests are probably giving us a false sense of security here. My guess is that they almost always test the negative case since the odds of randomly generating two blobs where one is a substring of another are pretty low. To make those tests better, we probably need a custom Arbitrary impl that knows to provide data that tests positive matches with some probability.

BurntSushi added a commit that referenced this issue Feb 8, 2018
There is a bug in the implementation and it's not clear how to fix it. A
unit test has been added (marked to fail) that exposes the bug.

For more discussion, see: #446
@BurntSushi
Copy link
Member Author

regex 0.2.6 is released with TBM disabled.

@ethanpailes
Copy link
Contributor

@BurntSushi, I'll try to dig into this soon.

@ethanpailes
Copy link
Contributor

So it seems like the issue is that I was just going strait to the md2_shift on a match failure instead of resolving the shift through the skip table first and only defaulting back to the md2_shift on a zero skip. In particular, the patch

diff --git a/src/literals.rs b/src/literals.rs
index 852d4e76..72c210b8 100644
--- a/src/literals.rs
+++ b/src/literals.rs
@@ -627,7 +627,9 @@ impl BoyerMooreSearch {
                 if self.check_match(haystack, window_end) {
                     return Some(window_end - (self.pattern.len() - 1));
                 } else {
-                    window_end += self.md2_shift;
+                    let skip = self.skip_table[haystack[window_end] as usize];
+                    window_end +=
+                        if skip == 0 { self.md2_shift } else { skip };
                     continue;
                 }
             }
@@ -946,7 +948,6 @@ mod tests {
     }
 
     #[test]
-    #[should_panic]
     fn bm_backstop_boundary() {
         let haystack = b"\
 // aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

fixes things. That shift resolution happens in three places (the skip loop, the backstop cleanup phase, and the match failure case), so I'm going to abstract it out into a helper. PR incoming.

@ethanpailes
Copy link
Contributor

I think this bug managed to remain hidden because self.skip_table[haystack[window_end] as usize] != 0 is always true after skip_loop returns except in cases where the memchr optimization applies or the backstop has been hit.

ethanpailes pushed a commit to ethanpailes/regex that referenced this issue Feb 9, 2018
This patch fixes an issue where skip resolution would go strait
to the default value (the md2_shift) on a match failure after
the shift_loop. Now we do the right thing, and first check in
the skip table. The problem with going strait to the md2_shift
is that you can accidentally shift to far when `window_end`
actually is in the pattern (as is the case for the failing
match).

In the issue thread I promised better modularity, but it turns
out that shift resolution was a bit too decomposed in the other
places I had mentioned. Sorry :(.
@BurntSushi BurntSushi mentioned this issue Feb 9, 2018
ethanpailes pushed a commit to ethanpailes/regex that referenced this issue Feb 10, 2018
If you just generate two random strings, the odds are very high
that the shorter one won't be a substring of the longer one once
they reach any substantial length. This means that the existing
quickcheck cases were probably just testing the negative cases.
The exception would be the two cases that append the needle
to the haystack, but those only test behavior at the ends. This
patch adds a better quickcheck case that can test a needle anywhere
in the haystack.

See the comments on rust-lang#446
ethanpailes pushed a commit to ethanpailes/regex that referenced this issue Mar 7, 2018
If you just generate two random strings, the odds are very high
that the shorter one won't be a substring of the longer one once
they reach any substantial length. This means that the existing
quickcheck cases were probably just testing the negative cases.
The exception would be the two cases that append the needle
to the haystack, but those only test behavior at the ends. This
patch adds a better quickcheck case that can test a needle anywhere
in the haystack.

See the comments on rust-lang#446
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants