Better Heuristics for Substring Search #72
Hi @JobLeonard! Sorry for the late response; I didn't see the issue. That's a good suggestion for the serial version! I haven't spent much time optimizing it. On most modern CPUs the forward and backward passes over memory are equally fast, AFAIK. It might be a good idea to also add a version that uses 64-bit integers, if misaligned reads are allowed. Would you be interested in trying a couple of such approaches and submitting a PR? In case you do, the …
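A rough sketch of what such a 64-bit version could look like, with plain C types and a made-up name (not code from the repo):

```c
// Sketch of a word-at-a-time equality check: compare 8 bytes per iteration,
// then finish the tail byte-by-byte. memcpy keeps it well-defined regardless
// of alignment; compilers lower it to plain 64-bit loads on targets that
// allow misaligned reads.
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static int equal_u64(char const *a, char const *b, size_t length) {
    for (; length >= 8; a += 8, b += 8, length -= 8) {
        uint64_t word_a, word_b;
        memcpy(&word_a, a, 8);
        memcpy(&word_b, b, 8);
        if (word_a != word_b) return 0;
    }
    for (; length; ++a, ++b, --length)
        if (*a != *b) return 0;
    return 1;
}
```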
I'll take a shot at it, with the following caveats:
So my benchmark numbers will be limited to a few simple variations of …
Sure @JobLeonard, you can start and I can join a bit later. For benchmarking I often take cloud instances (c7g and r7iz) to gain access to more hardware. Might be useful to you too 🤗

One thing to watch out for: on very short strings (well under 64 bytes) we are optimizing the for-loop. On longer strings, if we take the first and the last character, we end up fetching 2 cache lines from each string instead of just one.
Hi @JobLeonard, I have a big update! I've generalized the substring search methods to be able to match different characters within the needle. The method that infers the best targets is defined in StringZilla/include/stringzilla/stringzilla.h, lines 1586 to 1657 in cbdae26.
Not sure what would be the best dataset for such benchmarks, seems like this is related to #91.
Thanks for the heads-up, looks like a worthwhile change for the examples given in the doc comment. I was about to write that I'm still interested in giving this a go but have been very busy at work in the last month. Not entirely sure when I'll manage to free up some time again, but I just wanted to reassure you that I haven't forgotten about it!
So I recently had to implement a fast text filter for work (in browser JavaScript, so no use for your library (yet), I'm afraid), and used the same-ish kind of insight but inverted: given that nearby characters are strongly correlated, the first and last characters in a substring should be the least correlated. Or put in other words: suppose two (eventually) different words match in their first `n` characters so far. Of all the remaining characters, character `n+1` is the most likely to be the same, so it's actually the worst character to test next. Statistically speaking, comparing the first and last characters has a higher chance of filtering out mismatching strings early, avoiding false positives (of course, I'm well aware that linear memory access patterns might have a much bigger impact than this statistic).

This was not my original insight - Wojciech Muła also does this in one of his SIMD string search algorithms: http://0x80.pl/articles/simd-strfind.html#algorithm-1-generic-simd (without justification though, he left that as an exercise for the reader).
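For reference, a scalar analogue of that first-plus-last-character filter might look roughly like this (my own sketch with made-up names, not Muła's actual SIMD code):

```c
// Scalar sketch of the first+last filter: only when both the first and the
// last byte of a candidate window match the needle do we pay for the full
// comparison. Muła's SIMD version performs these two byte comparisons across
// many candidate offsets at once.
#include <stddef.h>
#include <string.h>

static char const *find_first_last(char const *haystack, size_t haystack_length,
                                   char const *needle, size_t needle_length) {
    if (needle_length == 0) return haystack;
    if (haystack_length < needle_length) return NULL;
    char const first = needle[0], last = needle[needle_length - 1];
    for (size_t i = 0; i + needle_length <= haystack_length; ++i)
        if (haystack[i] == first && haystack[i + needle_length - 1] == last &&
            memcmp(haystack + i, needle, needle_length) == 0)
            return haystack + i;
    return NULL;
}
```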
Looking through the code-base, I see that you first match on a four-character needle, then finish comparing on the suffixes with this `sz_equal` function: StringZilla/stringzilla/stringzilla.h, lines 94 to 98 in 14e7a78.
My idea could be (relatively) quickly benchmarked by modifying this function. I haven't written C++ in a while, but I think this would be a correct "first test the last character, then test the remaining string characters" variation:
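Something along these lines, sketched with plain C types and a made-up name rather than the actual `sz_` typedefs and signature:

```c
// Check the last byte first, then scan the remaining bytes front-to-back.
#include <stddef.h>

static int equal_last_byte_first(char const *a, char const *b, size_t length) {
    if (length && a[length - 1] != b[length - 1]) return 0;
    char const *const a_end = a + length;
    while (a != a_end && *a == *b) ++a, ++b;
    return a == a_end;
}
```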
Or a "compare back-to-front" variation:
I am aware that simple forward linear memory access is faster than random or backward memory access, so I would not be surprised if both of these functions are slower in practice, regardless of what the statistics say. But we won't know until we try, right?
For completeness' sake, here is my JS data structure, although I don't think it is particularly relevant, as it solves a slightly different (but adjacent) problem than StringZilla:
https://observablehq.com/@jobleonard/a-data-structure-for-table-row-filtering
The use-case is real-time filtering of a data table based on user input, leaving only the rows that contain a substring match for the input string in any data cell. So it's more optimized for "latency". It pre-processes the indices of all trigrams of the strings in the table, and uses these as needles. For an input string it filters on the first trigram, then filters backwards from the last trigram in the input substring. Because trigrams are looked up from a hashmap, the forward or backward access pattern concern does not apply here. It also uses previous results to narrow down the results faster (e.g. if a user typed `the`, then types a `y` to produce `they`, the filter only searches the existing results of `the` instead of starting over).
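The narrowing step boils down to something like the following, illustrated here in plain C with made-up names rather than the actual JS from the notebook:

```c
// When the new query merely extends the previous one, every matching row must
// already be among the previous matches, so only those candidates are re-checked.
#include <stddef.h>
#include <string.h>

// `rows` holds the table's cell strings; `candidates` holds row indices that
// matched the previous query and is filtered in place; returns the new count.
static size_t narrow_candidates(char const *const *rows, size_t *candidates,
                                size_t candidate_count, char const *query) {
    size_t kept = 0;
    for (size_t i = 0; i != candidate_count; ++i)
        if (strstr(rows[candidates[i]], query)) candidates[kept++] = candidates[i];
    return kept;
}
```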