Is aho-corasick a good option for short patterns (2 bytes), in short text (< 30 bytes)? #142

masklinn · 2024-04-29T11:43:48Z

masklinn
Apr 29, 2024

I've a situation where I'm looking for very short 2-byte patterns in usually short text (under 30 bytes, under 10 in the vast majority of cases). However there are high hundreds / low thousands of haystacks, so a high one-shot setup cost might be worth it if it's not per-haystack.

The patterns have a leading byte common to all patterns, and a trailing byte which varies, think string escapes.

I'm wondering whether I should go with memchr::memchr on the leading byte then check the second byte "by hand", or load all the variants in aho-corasick and search using that. I figure as the author of both you might have a good idea.
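For illustration, the "find the leader byte, then check the second byte by hand" option could look roughly like the sketch below. It is not code from either crate: `find_escape` and its parameters are hypothetical names, and the `position` call is a standard-library stand-in for `memchr::memchr(leader, &hay[start..])`, which has the same shape (it returns an `Option<usize>` offset).

```rust
/// Finds the first two-byte pattern made of `leader` followed by one of
/// `trailers`. Returns the offset of the leader and the trailer byte found.
/// (Hypothetical helper, for illustration only.)
fn find_escape(hay: &[u8], leader: u8, trailers: &[u8]) -> Option<(usize, u8)> {
    let mut start = 0;
    while start < hay.len() {
        // Candidate step. With the memchr crate this line would instead be:
        //     let i = start + memchr::memchr(leader, &hay[start..])?;
        let i = start + hay[start..].iter().position(|&b| b == leader)?;
        // Confirmation step: check the byte after the leader "by hand".
        match hay.get(i + 1) {
            Some(&t) if trailers.contains(&t) => return Some((i, t)),
            // False positive, or a leader at the very end: resume after it.
            _ => start = i + 1,
        }
    }
    None
}
```

Each false positive costs one extra trip through the candidate step, which is why the false positive rate of the leader byte matters for this approach.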

Answered by BurntSushi


BurntSushi · 2024-04-29T12:11:49Z

BurntSushi
Apr 29, 2024
Maintainer

Interestingly, I do not. I think your instincts match my own, or are at least what I would try first. A 10-byte haystack is rather short, for example, and it is too short for even the SSE2 implementations of memchr. In the 10-byte haystack case, it's likely that the SWAR approach will be used. And if you get down below 8 bytes (assuming a 64-bit target), then it will just be a byte-at-a-time loop.

Another choice, especially if your needles are two bytes, is to try using the lower-level packed substring routines directly. It's very data dependent, but for example, if most of your memchr searches produce a false positive, whereas a two-byte needle via Teddy (a vectorized packed substring search implementation) will never produce a false positive, then the latter could be faster by skipping spurious confirmation steps that take you out of the vectorized code. But... 10-byte haystacks are too short for Teddy.

For haystacks and needles that short, it might also be worth trying brute force as well.

Finally, it could very well make sense to employ different strategies based on haystack length. For longer haystacks, I would generally expect Teddy to be your best bet. In theory, memchr on a single byte will have faster throughput, but it increases the chances of false positives. Teddy, by using bigger needles, reduces those chances and makes it faster in the general case. But if you can get away with a low false positive rate using memchr for your specific data, then that's probably your best bet.

The memchr crate's substring search implementation does try pretty hard to take haystack length into account when deciding which algorithm to employ. But the aho-corasick crate doesn't do nearly as much. (Actually, it might not do anything at all based on haystack length.) There's probably a fair bit of room to improve there.

2 replies

masklinn Apr 29, 2024
Author

Thanks for the link to packed as well; I'd missed that. I guess all I can do is try to bench it and check whether the difference is at all noticeable.

Looking at the sample dataset, 45% of the samples contain a leader byte and it doesn't seem like false positives are much of a concern (there are none in the sample). Given that and what you explained, memchr seems the more likely candidate. And possibly brute force; by that, do you mean using the stdlib directly?

BurntSushi Apr 29, 2024
Maintainer

By brute force, I mean using the naive "check all positions" strategy. It's probably amenable to micro-optimization by treating the needle as a u16 and the haystack as a sequence of unaligned u16 values.
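As a rough sketch of that idea (assuming nothing beyond the standard library; `pack` and `find_brute` are hypothetical names, not part of memchr or aho-corasick), each needle can be packed into a `u16` once, and every overlapping two-byte window of the haystack can then be compared as a `u16`:

```rust
/// Packs a two-byte needle into a u16. Do this once, up front.
fn pack(needle: [u8; 2]) -> u16 {
    u16::from_le_bytes(needle)
}

/// Naive "check all positions" search over two-byte needles.
/// Returns (offset in the haystack, index of the matching needle).
fn find_brute(hay: &[u8], needles: &[u16]) -> Option<(usize, usize)> {
    // `windows(2)` yields every overlapping pair of bytes, aligned or not;
    // it yields nothing at all for haystacks shorter than two bytes.
    hay.windows(2).enumerate().find_map(|(pos, w)| {
        // Read the window with the same byte order used in `pack`.
        let v = u16::from_le_bytes([w[0], w[1]]);
        needles.iter().position(|&n| n == v).map(|idx| (pos, idx))
    })
}
```

Whether this beats the memchr-based approach on haystacks this short is something only a benchmark on the real data can show.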

And yes, I agree, if you're interested in doing this in the fastest way possible, then you'll want to implement multiple strategies and then bake them off against one another for your specific data set. If you're instead looking to do something that is "reasonable and probably fast," then I think just using AhoCorasick as-is is the answer to that.
