Replies: 2 comments 2 replies
-
I've been reading the Faro/ Külekci paper that you mentioned (https://www.sciencedirect.com/science/article/pii/S1570866714000471). Found it a rather disappointing: the described algorithms are still only limited to a max pattern size of 32, and their SIMD-implementation actually is slower for long patterns than some non-SIMD implementations of other algorithms which are much simpler (and could potentially run even faster if ported to use SIMD instructions). It's nice that they did publish the source code of all their implementations (including of other algorithms) though: https://github.com/smart-tool/smart/tree/master/source/algos. |
Beta Was this translation helpful? Give feedback.
-
Hiya! Sorry about the late reply. You had some high-context questions so I kind of put it on the back-burner. :-)
Today, yes, it avoids the dependency on But separately from that, it's not obvious to me that I could use a
I did not put too much thought into the specific kind of hash. I did a little experimenting but mostly just copied from the existing implementation. There very well could be wins here!
I wouldn't be surprised if it compiled down to the same Assembly. But I'm already working in raw pointer land in that
One thing that would be helpful is if you could ask distinct questions in a separate issue thread. Basically, put yourself in the shoes of someone browsing the Q&A category. "misc questions" doesn't really tell anyone anything about what's in there. (Which is why I changed the title for this one.) Obviously if you have a group of very related questions then those can all go into one Q&A. And it was fine to do this one as one Q&A too. I just want to nudge you toward opening new questions for other stuff. :-)
This is actually probably okay. As long as it can find occurrences very quickly, it can be used as a prefilter. And the vast majority of substring searches, in my experience, use needles far less than 32 bytes. So it's a legitimate case to target with a special case. (I still haven't read that paper though yet.) |
Beta Was this translation helpful? Give feedback.
-
Hi Andrew -
Thanks a lot for developing this repo — I’ve been reading and studying your code to get a better feeling for idiomatic Rust in the context of a full-fledged library (in a domain in which I’m also particularly interested). I found the code generally not too hard to read - even though I’m a Rust newbie - but I still have a few questions. I hope you’ll have time to reply to some of these.
Btw - before I start - In https://github.com/BurntSushi/aho-corasick/tree/master/src/packed/teddy you mention a paper "Faro and Kulekci [4d]" . This paper turns out to be no longer behind a paywall. Haven’t been able to study it yet (SIMD programming is totally new to me) but it definitely seems interesting — and it contains interesting comparisons to other algorithms.
Ok - here goes.
(1) In rabinkarp.rs you implement a version of Rabin-Karp, that is essentially the same as the one described in the link you also provided (https://www-igm.univ-mlv.fr/~lecroq/string/node5.html). I was wondering why you manage the hash table yourself rather than only providing the hash + update functions and then using the general HashMap or HashSet. Did you do his to avoid a dependency on std? Or simply because this was easy enough to implement? Other reasons?
(2) Since Rabin-Karp is only suitable for fixed-sized, small chunks, why not use the key itself as hash value? For instance when using 8-byte chunks, you could read those as a u64 and directly use that as hash value, or not?
(I can see that this might not work out well with the implemented binning method - assuming the keys are not random byte strings but natural language strings, but I wonder if you considered other types of rolling hashes.)
(3)
aho-corasick/src/packed/pattern.rs
Line 312 in feda228
contains an optimization where you cast raw byte pointers to u32
(px as *const u32).read_unaligned()
. I suppose this is faster than doing individual reads usingu32::from_ne_bytes
, is that correct? If so, why is that?(I've some more questions related to the iterator design, but I'll post those another time.)
Beta Was this translation helpful? Give feedback.
All reactions