questions about the Rabin-Karp and memcmp implementations #124

tofutofu · 2023-08-22T18:13:36Z

tofutofu
Aug 22, 2023

Hi Andrew -

Thanks a lot for developing this repo — I’ve been reading and studying your code to get a better feeling for idiomatic Rust in the context of a full-fledged library (in a domain in which I’m also particularly interested). I found the code generally not too hard to read - even though I’m a Rust newbie - but I still have a few questions. I hope you’ll have time to reply to some of these.

Btw - before I start - In https://github.com/BurntSushi/aho-corasick/tree/master/src/packed/teddy you mention a paper "Faro and Kulekci [4d]" . This paper turns out to be no longer behind a paywall. Haven’t been able to study it yet (SIMD programming is totally new to me) but it definitely seems interesting — and it contains interesting comparisons to other algorithms.

Ok - here goes.

(1) In rabinkarp.rs you implement a version of Rabin-Karp, that is essentially the same as the one described in the link you also provided (https://www-igm.univ-mlv.fr/~lecroq/string/node5.html).  I was wondering why you manage the hash table yourself rather than only providing the hash + update functions and then using the general HashMap or HashSet. Did you do his to avoid a dependency on std? Or simply because this was easy enough to implement? Other reasons?

(2) Since Rabin-Karp is only suitable for fixed-sized, small chunks, why not use the key itself as hash value? For instance when using 8-byte chunks, you could read those as a u64 and directly use that as hash value, or not?
(I can see that this might not work out well with the implemented binning method - assuming the keys are not random byte strings but natural language strings, but I wonder if you considered other types of rolling hashes.)

(3)

aho-corasick/src/packed/pattern.rs

Line 312 in feda228

unsafe {

contains an optimization where you cast raw byte pointers to u32 (px as *const u32).read_unaligned(). I suppose this is faster than doing individual reads using u32::from_ne_bytes, is that correct?  If so, why is that?

(I've some more questions related to the iterator design, but I'll post those another time.)

tofutofu · 2023-08-25T18:46:31Z

tofutofu
Aug 25, 2023
Author

I've been reading the Faro/ Külekci paper that you mentioned (https://www.sciencedirect.com/science/article/pii/S1570866714000471). Found it a rather disappointing: the described algorithms are still only limited to a max pattern size of 32, and their SIMD-implementation actually is slower for long patterns than some non-SIMD implementations of other algorithms which are much simpler (and could potentially run even faster if ported to use SIMD instructions). It's nice that they did publish the source code of all their implementations (including of other algorithms) though: https://github.com/smart-tool/smart/tree/master/source/algos.

0 replies

BurntSushi · 2023-08-29T13:59:52Z

BurntSushi
Aug 29, 2023
Maintainer

Hiya! Sorry about the late reply. You had some high-context questions so I kind of put it on the back-burner. :-)

(1) In rabinkarp.rs you implement a version of Rabin-Karp, that is essentially the same as the one described in the link you also provided (https://www-igm.univ-mlv.fr/~lecroq/string/node5.html). .%E2%80%A8) I was wondering why you manage the hash table yourself rather than only providing the hash + update functions and then using the general HashMap or HashSet. Did you do his to avoid a dependency on std? Or simply because this was easy enough to implement? Other reasons?

Today, yes, it avoids the dependency on std. That might have been why I did it originally, although when I wrote it initially, aho-corasick had an unconditional dependency on std. (At the time, I knew I eventually wanted to make it conditional. I just can't remember whether I used that as a constraint in this particular case.)

But separately from that, it's not obvious to me that I could use a HashMap from std as efficiently. You can't look up values by the hash, and the HashMap likely has other costs due to handling the general case. In theory one could measure it, but my prior here is so strong that it's not a place where I want to focus my effort. (And of course, such things could bite me.)

(2) Since Rabin-Karp is only suitable for fixed-sized, small chunks, why not use the key itself as hash value? For instance when using 8-byte chunks, you could read those as a u64 and directly use that as hash value, or not?
(I can see that this might not work out well with the implemented binning method - assuming the keys are not random byte strings but natural language strings, but I wonder if you considered other types of rolling hashes.)

I did not put too much thought into the specific kind of hash. I did a little experimenting but mostly just copied from the existing implementation. There very well could be wins here!

contains an optimization where you cast raw byte pointers to u32 (px as *const u32).read_unaligned(). I suppose this is faster than doing individual reads using u32::from_ne_bytes, is that correct?  If so, why is that?

I wouldn't be surprised if it compiled down to the same Assembly. But I'm already working in raw pointer land in that memcmp function, so to me, it's more natural to just do a read_unaligned() versus materializing an array and going through that dance. But for something like this, it really just has to be tried and measured. I can't really predict whether it will be better or not.

(I've some more questions related to the iterator design, but I'll post those another time.)

One thing that would be helpful is if you could ask distinct questions in a separate issue thread. Basically, put yourself in the shoes of someone browsing the Q&A category. "misc questions" doesn't really tell anyone anything about what's in there. (Which is why I changed the title for this one.)

Obviously if you have a group of very related questions then those can all go into one Q&A. And it was fine to do this one as one Q&A too. I just want to nudge you toward opening new questions for other stuff. :-)

Found it a rather disappointing: the described algorithms are still only limited to a max pattern size of 32, and their SIMD-implementation actually is slower for long patterns than some non-SIMD implementations of other algorithms which are much simpler (and could potentially run even faster if ported to use SIMD instructions).

This is actually probably okay. As long as it can find occurrences very quickly, it can be used as a prefilter. And the vast majority of substring searches, in my experience, use needles far less than 32 bytes. So it's a legitimate case to target with a special case.

(I still haven't read that paper though yet.)

2 replies

tofutofu Aug 29, 2023
Author

Thanks a lot for the reply - makes sense. I will also keep your suggestion in mind. (And congrats on the new job :) )

BurntSushi Aug 29, 2023
Maintainer

Thanks. :D

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

questions about the Rabin-Karp and memcmp implementations #124

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments 2 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

questions about the Rabin-Karp and memcmp implementations #124

tofutofu Aug 22, 2023

Replies: 2 comments · 2 replies

tofutofu Aug 25, 2023 Author

BurntSushi Aug 29, 2023 Maintainer

tofutofu Aug 29, 2023 Author

BurntSushi Aug 29, 2023 Maintainer

tofutofu
Aug 22, 2023

Replies: 2 comments 2 replies

tofutofu
Aug 25, 2023
Author

BurntSushi
Aug 29, 2023
Maintainer

tofutofu Aug 29, 2023
Author

BurntSushi Aug 29, 2023
Maintainer