tuned boyer moore fails a match case #446
Comments
I think the quickcheck tests are probably giving us a false sense of security here. My guess is that they almost always test the negative case since the odds of randomly generating two blobs where one is a substring of another are pretty low. To make those tests better, we probably need a custom |
There is a bug in the implementation and it's not clear how to fix it. A unit test has been added (marked to fail) that exposes the bug. For more discussion, see: #446
@BurntSushi, I'll try to dig into this soon. |
So it seems like the issue is that I was just going straight to the
fixes things. That shift resolution happens in three places (the skip loop, the backstop cleanup phase, and the match failure case), so I'm going to abstract it out into a helper. PR incoming. |
I think this bug managed to remain hidden because |
This patch fixes an issue where skip resolution would go straight to the default value (the md2_shift) on a match failure after the shift_loop. Now we do the right thing, and first check in the skip table. The problem with going straight to the md2_shift is that you can accidentally shift too far when the byte at `window_end` actually is in the pattern (as is the case for the failing match). In the issue thread I promised better modularity, but it turns out that shift resolution was a bit too decomposed in the other places I had mentioned. Sorry :(.
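The shift resolution described in this patch can be sketched roughly as follows. This is a minimal illustration, not the crate's actual code: `build_skip_table`, `build_md2_shift`, and `resolve_shift` are hypothetical names, and the table definitions are the textbook ones, which may differ in detail from the real implementation. The idea is that the md2 shift is only safe when the byte under `window_end` is the pattern's last byte (skip value 0); for any other byte the skip table gives the correct, possibly smaller, shift.

```rust
/// Hypothetical helper: for every byte, the distance from its last
/// occurrence in `pattern` to the end of `pattern`. Bytes that do not
/// occur get `pattern.len()`; the final byte of the pattern gets 0.
fn build_skip_table(pattern: &[u8]) -> [usize; 256] {
    let mut table = [pattern.len(); 256];
    for (i, &b) in pattern.iter().enumerate() {
        table[b as usize] = pattern.len() - 1 - i;
    }
    table
}

/// Hypothetical helper (textbook definition, assumes a non-empty pattern):
/// the distance from the last byte back to its previous occurrence in the
/// pattern, or the pattern length if it does not occur earlier.
fn build_md2_shift(pattern: &[u8]) -> usize {
    let last_idx = pattern.len() - 1;
    let last = pattern[last_idx];
    pattern[..last_idx]
        .iter()
        .rposition(|&b| b == last)
        .map(|i| last_idx - i)
        .unwrap_or(pattern.len())
}

/// Consult the skip table first; fall back to the md2 shift only when the
/// byte under the window end is the pattern's last byte (skip == 0).
fn resolve_shift(skip_table: &[usize; 256], md2_shift: usize, byte: u8) -> usize {
    match skip_table[byte as usize] {
        0 => md2_shift,
        skip => skip,
    }
}
```

With the pattern `abcab`, for instance, a mismatch with `c` under the window end should advance by 2, while going straight to the md2 shift would advance by 3 and could step over a match.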
If you just generate two random strings, the odds are very high that the shorter one won't be a substring of the longer one once they reach any substantial length. This means that the existing quickcheck cases were probably just testing the negative cases. The exception would be the two cases that append the needle to the haystack, but those only test behavior at the ends. This patch adds a better quickcheck case that can test a needle anywhere in the haystack. See the comments on rust-lang#446
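A rough sketch of the kind of property described above, written without the quickcheck crate so that it stays self-contained. Everything here is illustrative: `XorShift`, `embed`, `naive_find`, and `check_embedded_needles` are hypothetical names, and the real test in the patch may be structured differently. The point is that the needle is placed between a random prefix and suffix, so the positive case is exercised on every iteration, and the searcher under test is compared against a naive reference.

```rust
/// Tiny deterministic PRNG (xorshift64) standing in for quickcheck's
/// generators. The seed must be non-zero.
struct XorShift(u64);

impl XorShift {
    fn next_u64(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }

    /// Random bytes over a small alphabet, so partial matches are common.
    fn bytes(&mut self, len: usize) -> Vec<u8> {
        (0..len).map(|_| b'a' + (self.next_u64() % 4) as u8).collect()
    }
}

/// Naive reference implementation: leftmost occurrence of `needle`.
fn naive_find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0);
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Build a haystack that is guaranteed to contain `needle`.
fn embed(prefix: &[u8], needle: &[u8], suffix: &[u8]) -> Vec<u8> {
    let mut haystack = Vec::with_capacity(prefix.len() + needle.len() + suffix.len());
    haystack.extend_from_slice(prefix);
    haystack.extend_from_slice(needle);
    haystack.extend_from_slice(suffix);
    haystack
}

/// Returns true if `search` agrees with the naive reference on `cases`
/// haystacks that each contain the needle somewhere in the middle.
fn check_embedded_needles<F>(search: F, cases: usize, seed: u64) -> bool
where
    F: Fn(&[u8], &[u8]) -> Option<usize>,
{
    let mut rng = XorShift(seed);
    for _ in 0..cases {
        let needle_len = 1 + (rng.next_u64() % 8) as usize;
        let prefix_len = (rng.next_u64() % 32) as usize;
        let suffix_len = (rng.next_u64() % 32) as usize;
        let needle = rng.bytes(needle_len);
        let prefix = rng.bytes(prefix_len);
        let suffix = rng.bytes(suffix_len);
        let haystack = embed(&prefix, &needle, &suffix);
        if search(&haystack, &needle) != naive_find(&haystack, &needle) {
            return false;
        }
    }
    true
}
```

Comparing against the naive search rather than against `prefix.len()` matters because the needle may also occur earlier in the random prefix, and the leftmost occurrence is the expected answer.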
Here is a failing test case:
This was originally reported against ripgrep, involving the `-S` and `-i` flags: BurntSushi/ripgrep#781

Both bugs are related to TBM. That is, shutting TBM off fixes the issue. The above unit test was derived by whittling down the example corpus given in BurntSushi/ripgrep#781
I've tried to fix this myself, but the code is dense. As best I can tell, the md2 shift is wrong, or this is hitting up against a boundary condition with respect to the backstop (more likely?). That is, this particular case enters the `memchr` skip routine, which uses `_` as its guard char. This takes `window_end` from `12` to `44`, which is positioned at the `l` in `clone_created`. This then fails `check_match` (which seems correct), but this in turn causes the `md2_shift` (which is `12`) to get added to the `window_end` index, which causes the code to skip ahead too far and miss the match.

The part that doesn't make sense to me here is the relationship between `check_match` and the `md2_shift`. Namely, `check_match` appears to be used as a guard against whether `md2_shift` is used. If the match fails, then the code assumes we can skip `md2_shift` bytes. It almost seems like the case analysis is a little off. For example, this code seems to fix this particular bug:

but I didn't arrive at this code in a principled way, so I'm not confident in it. It feels to me like the very fact that the `md2_shift` could overshoot the match is the problem, but it's less clear how to fix that.

@ethanpailes Do you have any ideas here? I am probably going to disable TBM for now because of this bug. Would it make sense to use a more traditional TBM variant where we don't use frequency analysis but instead always skip based on the last byte?