Is aho-corasick a good option for short patterns (2 bytes), in short text (< 30 bytes)? #142
-
I've a situation where I'm looking for very short 2-byte patterns in usually short text (under 30 bytes, under 10 in the vast majority of cases). However there are high hundreds / low thousands of haystacks, so a high one-shot setup cost might be worth it if it's not per-haystack. The patterns have a leading byte common to all patterns, and a trailing byte which varies, think string escapes. I'm wondering whether I should go with |
Beta Was this translation helpful? Give feedback.
Replies: 1 comment 2 replies
-
Interestingly, I do not. I think your instincts about match my own, or are at least what I would try first. A 10 byte haystack is rather short, for example, and it is too short for even the SSE2 implementations of Another choice, especially if your needles are two bytes, is to try using the lower level packed substring routines directly. It's very data dependent, but for example, if most of your For haystacks and needles that short, it might also be worth trying brute force as well. Finally, it could very well make sense to employ different strategies based on haystack length. For longer haystacks, I would generally expect Teddy to be your best bet. In theory, The |
Beta Was this translation helpful? Give feedback.
Interestingly, I do not. I think your instincts about match my own, or are at least what I would try first. A 10 byte haystack is rather short, for example, and it is too short for even the SSE2 implementations of
memchr
. In the 10 byte haystack case, it's likely that the SWAR approach will be used. And if you get down below 8 bytes (assuming a 64-bit target), then it will just be a byte-at-a-time loop.Another choice, especially if your needles are two bytes, is to try using the lower level packed substring routines directly. It's very data dependent, but for example, if most of your
memchr
searches produce a false positive, where as a two byte needle via Teddy (a vectorized packed subst…