-
Notifications
You must be signed in to change notification settings - Fork 12.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update substring search to use the Two Way algorithm #26327
Conversation
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @aturon (or someone else) soon. If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. The way Github handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes. Please see the contribution instructions for more information. |
cc @BurntSushi, you know substring search very well, cc @Kimundi about the pattern API |
I'll show some benchmark results. My workspace for this is temporarily hosted as a repo, if you want to check the quickcheck & benchmark implementations.
I got a bit pessimistic when I tried out the different bad cases I could find. The performance bug #25483 is the benchmark called aa_100k below. From experience, the main improvement in runtime for that benchmark comes from the Look at the Edit: Updated to current benchmarks. The old results are archived.
|
'search: loop { | ||
// Check that we have room to search in | ||
if self.position + needle.len() > haystack.len() { | ||
if start == haystack.len() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this check be done outside of the loop? We will only return Done
on the first iteration, if ever.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no real loop -- it's used to return back to this point. So this if check can be visited multiple times per call to next.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm I see now, that's a good point. it evolved into this.
Thanks for your extensive review @gereeter! I've pushed a commit to address some of this, with some cleanup. I also realized I could move the break for |
The benchmark list was very long, so here's just some improvements after the later Rejects
|
Now I feel good about the reverse algorithm, thanks @gereeter. I've split out the add-on commits to make it easier to review hopefully, so I'll squash before merge later. Edited: Now rebased to just two commits. At this point it should be algorithmically safe and sound. Short of switching algorithm, I guess followup changes can focus on performance optimizations. Sad to say, the loops are indexed, not using iterators, and they perform better the way they are than using .zip with two slice iterators.
|
|
||
// We have found a match! | ||
let match_pos = self.end - needle.len(); | ||
self.end -= needle.len(); // add self.period for all matches |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To update this comment for next_back
and make it more clear, maybe this should read "subtract self.period instead of needle.len() for overlapping matches" (similarly, the next
case could be changed, but with add, not subtract)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll update this
This looks really good! Short of doing word-level things, which is messy and requires |
@gereeter I found a paper on that kind of extension to this algorithm http://drops.dagstuhl.de/opus/volltexte/2011/3355/ |
That is the paper I had in mind when I made my comment - I saw your comment on #25483. |
167bad3
to
00e97bf
Compare
Ok, the twoway search can be faster if instead of emitting Rejects, we just search to the next match. Let the StrSearcher handle emitting the implied intermediate Rejects. I'm going to incorporate that now, because it means we don't have to change the pattern API's specification. I found this by looking at what @BurntSushi's regex does. |
Results of moving reject handling out of the algorithm, that gave it some air to optimize the loop to find a match better. The reverse case is sped up the same way, showing results just for the forward case.
|
Updated to add two new commits. Will squash to one commit after review is done, and I'll update the commit logs with the fact that we don't need to change the Pattern API. |
First, that improvement is impressive enough that I think we should keep the new coalesing of Second, it seems that the primary users who benifit from smaller |
I just looked at benchmarks for We need to implement all the trait's methods in the long run, and yes, use something else (memcmp-like) for the prefix and suffix versions. |
@bluss: Yeah, the old impls never got optimized.
If the string searcher where to be written with only Having the duplication is unfortunate, but I don't see a better way apart from ensuring the generic codegen will optimize out the right cases on its own. |
I just addressed is_prefix_of / is_suffix_of, so they are a bit faster with this change than the naive impl they replace. Finding the first reject could be done simpler too, we could special case just that part maybe. |
@bluss: It doesn't have to match the naive impl exactly. It can split the Rejects up at arbitrary points if it helps. |
I've experimented back and forth (macros, closures for specialization) and it looks like duplicating the forward and backward searches is the best unfortunately. However, StrSearcher is a very specific algorithm for finding matches, and it is a suboptimal choice for reject-only search. In that case naive search is always better and any time we spend precomputing needle properties is wasted. For that reason, it would be best if this was given its own method on the Pattern API that circumvents the I'm away for the weekend, so enjoy midsummer everyone! |
☔ The latest upstream changes (presumably #26192) made this pull request unmergeable. Please resolve the merge conflicts. |
To improve our substring search performance, revive the two way searcher and adapt it to the Pattern API. Fixes rust-lang#25483, a performance bug: that particular case now completes faster in optimized rust than in ruby (but they share the same order of magnitude). Much thanks to @gereeter who helped me understand the reverse case better and wrote the comment explaining `next_back` in the code. I had quickcheck to fuzz test forward and reverse searching thoroughly. The two way searcher implements both forward and reverse search, but not double ended search. The forward and reverse parts of the two way searcher are completely independent. The two way searcher algorithm has very small, constant space overhead, requiring no dynamic allocation. Our implementation is relatively fast, especially due to the `byteset` addition to the algorithm, which speeds up many no-match cases. A bad case for the two way algorithm is: ``` let haystack = (0..10_000).map(|_| "dac").collect::<String>(); let needle = (0..100).map(|_| "bac").collect::<String>()); ``` For this particular case, two way is not much faster than the naive implementation it replaces.
Use a trait to be able to implement both the fast search that skips to each match, and the slower search that emits `Reject` intervals regularly. The latter is important for uses of `next_reject`.
Rebased. Updated with:
|
This is needed to not drop performance, after the trait-based changes. Force separate versions of the next method to be generated for the short and long period cases.
Here are some updated benchmarks with caveats.
The cases are:
|
I'll tentatively nominate this for the beta because I had hoped I would get this in before we branched for 1.2. It's a big performance improvement, no other perks. |
I just realized that I never actually said this, so LGTM. |
@gereeter Thank you, you've provided a lot of valuable help & feedback! |
@bors: r+ Thanks for doing this! |
📌 Commit 274bb24 has been approved by |
Update substring search to use the Two Way algorithm To improve our substring search performance, revive the two way searcher and adapt it to the Pattern API. Fixes #25483, a performance bug: that particular case now completes faster in optimized rust than in ruby (but they share the same order of magnitude). Many thanks to @gereeter who helped me understand the reverse case better and wrote the comment explaining `next_back` in the code. I had quickcheck to fuzz test forward and reverse searching thoroughly. The two way searcher implements both forward and reverse search, but not double ended search. The forward and reverse parts of the two way searcher are completely independent. The two way searcher algorithm has very small, constant space overhead, requiring no dynamic allocation. Our implementation is relatively fast, especially due to the `byteset` addition to the algorithm, which speeds up many no-match cases. A bad case for the two way algorithm is: ``` let haystack = (0..10_000).map(|_| "dac").collect::<String>(); let needle = (0..100).map(|_| "bac").collect::<String>()); ``` For this particular case, two way is not much faster than the naive implementation it replaces.
triage: not accepting for a beta backport |
Update substring search to use the Two Way algorithm
To improve our substring search performance, revive the two way searcher
and adapt it to the Pattern API.
Fixes #25483, a performance bug: that particular case now completes faster
in optimized rust than in ruby (but they share the same order of magnitude).
Many thanks to @gereeter who helped me understand the reverse case
better and wrote the comment explaining
next_back
in the code.I had quickcheck to fuzz test forward and reverse searching thoroughly.
The two way searcher implements both forward and reverse search,
but not double ended search. The forward and reverse parts of the two
way searcher are completely independent.
The two way searcher algorithm has very small, constant space overhead,
requiring no dynamic allocation. Our implementation is relatively fast,
especially due to the
byteset
addition to the algorithm, which speedsup many no-match cases.
A bad case for the two way algorithm is:
For this particular case, two way is not much faster than the naive
implementation it replaces.