Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement the complete reverse case for the two way algorithm #27474

Merged
merged 4 commits into from
Aug 18, 2015

Conversation

bluss
Copy link
Member

@bluss bluss commented Aug 2, 2015

StrSearcher: Implement the complete reverse case for the two way algorithm

Fix quadratic behavior in StrSearcher in reverse search with periodic
needles.

This commit adds the missing pieces for the "short period" case in
reverse search. The short case will show up when the needle is literally
periodic, for example "abababab".

Two way uses a "critical factorization" of the needle: x = u v.

Searching matches v first, if mismatch at character k, skip k forward.
Matching u, if mismatch, skip period(x) forward.

To avoid O(mn) behavior after mismatch in u, memorize the already
matched prefix.

The short period case requires that |u| < period(x).

For the reverse search we need to compute a different critical
factorization x = u' v' where |v'| < period(x), because we are searching
for the reversed needle. A short v' also benefits the algorithm in
general.

The reverse critical factorization is computed quickly by using the same
maximal suffix algorithm, but terminating as soon as we have a location
with local period equal to period(x).

This adds extra fields crit_pos_back and memory_back for the reverse
case. The new overhead for TwoWaySearcher::new is low, and additionally
I think the "short period" case is uncommon in many applications of
string search.

The maximal_suffix methods were updated in documentation and the
algorithms updated to not use !0 and wrapping add, variable left is now
1 larger, offset 1 smaller.

Use periodicity when computing byteset: in the periodic case, just
iterate over one period instead of the whole needle.

Example before (rfind) after (twoway_rfind) benchmark shows the removal
of quadratic behavior.

needle: "ab" * 100, haystack: ("bb" + "ab" * 100) * 100

test periodic::rfind           ... bench:   1,926,595 ns/iter (+/- 11,390) = 10 MB/s
test periodic::twoway_rfind    ... bench:      51,740 ns/iter (+/- 66) = 386 MB/s

Add tests for .rfind(&str), using the reverse searcher case for
substring search.
Fix quadratic behavior in StrSearcher in reverse search with periodic
needles.

This commit adds the missing pieces for the "short period" case in
reverse search. The short case will show up when the needle is literally
periodic, for example "abababab".

Two way uses a "critical factorization" of the needle: x = u v.

Searching matches v first, if mismatch at character k, skip k forward.
Matching u, if mismatch, skip period(x) forward.

To avoid O(mn) behavior after mismatch in u, memorize the already
matched prefix.

The short period case requires that |u| < period(x).

For the reverse search we need to compute a different critical
factorization x = u' v' where |v'| < period(x), because we are searching
for the reversed needle. A short v' also benefits the algorithm in
general.

The reverse critical factorization is computed quickly by using the same
maximal suffix algorithm, but terminating as soon as we have a location
with local period equal to period(x).

This adds extra fields crit_pos_back and memory_back for the reverse
case. The new overhead for TwoWaySearcher::new is low, and additionally
I think the "short period" case is uncommon in many applications of
string search.

The maximal_suffix methods were updated in documentation and the
algorithms updated to not use !0 and wrapping add, variable left is now
1 larger, offset 1 smaller.

Use periodicity when computing byteset: in the periodic case, just
iterate over one period instead of the whole needle.

Example before (rfind) after (twoway_rfind) benchmark shows the removal
of quadratic behavior.

needle: "ab" * 100, haystack: ("bb" + "ab" * 100) * 100

```
test periodic::rfind           ... bench:   1,926,595 ns/iter (+/- 11,390) = 10 MB/s
test periodic::twoway_rfind    ... bench:      51,740 ns/iter (+/- 66) = 386 MB/s
```
@rust-highfive
Copy link
Collaborator

r? @huonw

(rust_highfive has picked a reviewer for you, use r? to override)

@bluss
Copy link
Member Author

bluss commented Aug 2, 2015

The battery of benchmarks is hosted on a workspace externally (exact commit)

Benchmarking with LTO and the before case (rfind) and after case (twoway_rfind) are almost unchanged, except for the periodic pathology I found. Testcases named periodic, periodic2.

test aaab_in_aab::rfind                     ... bench:   1,120,909 ns/iter (+/- 1,913) = 267 MB/s
test aaab_in_aab::twoway_rfind              ... bench:   1,124,913 ns/iter (+/- 50,777) = 266 MB/s
test aaabbb::rfind                          ... bench:   1,119,982 ns/iter (+/- 7,207) = 267 MB/s
test aaabbb::twoway_rfind                   ... bench:   1,120,322 ns/iter (+/- 22,937) = 267 MB/s
test allright::rfind                        ... bench:  11,145,830 ns/iter (+/- 720,105) = 160 MB/s
test allright::twoway_rfind                 ... bench:  11,379,793 ns/iter (+/- 5,088,007) = 156 MB/s
test bb_in_aa::rfind                        ... bench:       3,788 ns/iter (+/- 55) = 26399 MB/s
test bb_in_aa::twoway_rfind                 ... bench:       3,841 ns/iter (+/- 36) = 26034 MB/s
test bbbaaa::rfind                          ... bench:   1,355,646 ns/iter (+/- 2,350) = 221 MB/s
test bbbaaa::twoway_rfind                   ... bench:   1,356,275 ns/iter (+/- 1,134) = 221 MB/s
test gllright::rfind                        ... bench:   7,425,510 ns/iter (+/- 12,293) = 241 MB/s
test gllright::twoway_rfind                 ... bench:   7,427,276 ns/iter (+/- 12,227) = 241 MB/s
test naive::rfind                           ... bench:         904 ns/iter (+/- 227) = 276 MB/s
test naive::twoway_rfind                    ... bench:         919 ns/iter (+/- 187) = 272 MB/s
test naive_longpat::rfind                   ... bench:     304,007 ns/iter (+/- 500) = 328 MB/s
test naive_longpat::twoway_rfind            ... bench:     304,109 ns/iter (+/- 1,542) = 328 MB/s
test naive_longpat_reversed::rfind          ... bench:     529,984 ns/iter (+/- 427) = 188 MB/s
test naive_longpat_reversed::twoway_rfind   ... bench:     530,297 ns/iter (+/- 695) = 188 MB/s
test naive_rev::rfind                       ... bench:       1,389 ns/iter (+/- 1) = 179 MB/s
test naive_rev::twoway_rfind                ... bench:       1,400 ns/iter (+/- 6) = 178 MB/s
test pathological_two_way::rfind            ... bench:      48,215 ns/iter (+/- 68) = 1244 MB/s
test pathological_two_way::twoway_rfind     ... bench:      48,230 ns/iter (+/- 71) = 1243 MB/s
test pathological_two_way_rev::rfind        ... bench:     200,548 ns/iter (+/- 398) = 299 MB/s
test pathological_two_way_rev::twoway_rfind ... bench:     200,389 ns/iter (+/- 101) = 299 MB/s
test periodic2::rfind                       ... bench:   1,947,164 ns/iter (+/- 9,479) = 10 MB/s
test periodic2::twoway_rfind                ... bench:      51,990 ns/iter (+/- 13,125) = 384 MB/s
test periodic::rfind                        ... bench:   1,927,657 ns/iter (+/- 11,644) = 10 MB/s
test periodic::twoway_rfind                 ... bench:      51,743 ns/iter (+/- 111) = 386 MB/s
test short_1let_long::rfind                 ... bench:       6,415 ns/iter (+/- 46) = 397 MB/s
test short_1let_long::twoway_rfind          ... bench:       6,641 ns/iter (+/- 18) = 384 MB/s
test short_2let_long::rfind                 ... bench:       3,745 ns/iter (+/- 3) = 681 MB/s
test short_2let_long::twoway_rfind          ... bench:       3,663 ns/iter (+/- 13) = 696 MB/s
test short_3let_long::rfind                 ... bench:       3,530 ns/iter (+/- 343) = 722 MB/s
test short_3let_long::twoway_rfind          ... bench:       3,346 ns/iter (+/- 468) = 762 MB/s
test short_short::rfind                     ... bench:         137 ns/iter (+/- 4) = 408 MB/s
test short_short::twoway_rfind              ... bench:         140 ns/iter (+/- 4) = 399 MB/s
test short_word_long::rfind                 ... bench:       2,344 ns/iter (+/- 23) = 1088 MB/s
test short_word_long::twoway_rfind          ... bench:       2,392 ns/iter (+/- 30) = 1066 MB/s

@Gankra
Copy link
Contributor

Gankra commented Aug 2, 2015

CC @nham @shepmaster

@bluss
Copy link
Member Author

bluss commented Aug 2, 2015

I'd love to hear from you @gereeter who helped me with the last PR (#26327). This time I've sweated some to understand the algorithm a lot better. It turns out it's not as symmetrical as we had supposed, and the reverse case needs its own critical factorization (for the short period case, for periodic needles).

Fortunately computing the critical factorization for the reverse is really quick, because we can just stop when we find a period that matches. Maybe there's an even quicker way to compute it?

Either way, the already present overhead in TwoWaySearcher::new is a bit worrying, for short needles. Even then, our search speed is really good and isn't that easy to improve upon except for maybe in the single byte needle case.

@bluss
Copy link
Member Author

bluss commented Aug 2, 2015

Just for fun, not really related to the PR, here's a comparison to a fast string search: glibc::memmem, which also uses two way search but with a full 256 entry skip table. It is non-simd except for using memchr once when starting.

My conclusion is that rust is reasonably competitive, and we win on several benchmarks, for example the test case "short_word_long", which is a common kind of natural language search, and the cases short_2let_long, short_3let_long where the needle is just 2 or 3 bytes.

test aaab_in_aab::memmem                    ... bench:     823,891 ns/iter (+/- 11,406) = 363 MB/s
test aaabbb::memmem                         ... bench:      43,787 ns/iter (+/- 750) = 6851 MB/s
test allright::memmem                       ... bench:   2,832,698 ns/iter (+/- 44,008) = 635 MB/s
test bb_in_aa::memmem                       ... bench:      10,163 ns/iter (+/- 36) = 9839 MB/s
test bbbaaa::memmem                         ... bench:     531,964 ns/iter (+/- 6,033) = 563 MB/s
test gllright::memmem                       ... bench:   4,010,152 ns/iter (+/- 65,234) = 448 MB/s
test naive::memmem                          ... bench:         421 ns/iter (+/- 1) = 593 MB/s
test naive_longpat::memmem                  ... bench:     117,963 ns/iter (+/- 95) = 847 MB/s
test naive_longpat_reversed::memmem         ... bench:       5,273 ns/iter (+/- 34) = 18964 MB/s
test naive_rev::memmem                      ... bench:          26 ns/iter (+/- 0) = 9615 MB/s
test pathological_two_way::memmem           ... bench:       3,153 ns/iter (+/- 46) = 19029 MB/s
test pathological_two_way_rev::memmem       ... bench:      82,465 ns/iter (+/- 529) = 727 MB/s
test periodic2::memmem                      ... bench:      28,161 ns/iter (+/- 560) = 710 MB/s
test periodic::memmem                       ... bench:      27,773 ns/iter (+/- 135) = 720 MB/s
test short_1let_long::memmem                ... bench:         142 ns/iter (+/- 0) = 17964 MB/s
test short_2let_long::memmem                ... bench:       7,882 ns/iter (+/- 1,661) = 323 MB/s
test short_3let_long::memmem                ... bench:       8,144 ns/iter (+/- 18) = 313 MB/s
test short_short::memmem                    ... bench:          94 ns/iter (+/- 1) = 595 MB/s
test short_word_long::memmem                ... bench:       5,556 ns/iter (+/- 136) = 459 MB/s

test aaab_in_aab::twoway_find               ... bench:   1,062,874 ns/iter (+/- 7,780) = 282 MB/s
test aaabbb::twoway_find                    ... bench:   1,188,327 ns/iter (+/- 30,034) = 252 MB/s
test allright::twoway_find                  ... bench:   5,240,214 ns/iter (+/- 104,973) = 342 MB/s
test bb_in_aa::twoway_find                  ... bench:       4,185 ns/iter (+/- 16) = 23894 MB/s
test bbbaaa::twoway_find                    ... bench:     827,474 ns/iter (+/- 40,194) = 362 MB/s
test gllright::twoway_find                  ... bench:   7,410,748 ns/iter (+/- 55,368) = 241 MB/s
test naive::twoway_find                     ... bench:       1,385 ns/iter (+/- 223) = 180 MB/s
test naive_longpat::twoway_find             ... bench:     520,873 ns/iter (+/- 1,148) = 191 MB/s
test naive_longpat_reversed::twoway_find    ... bench:     200,943 ns/iter (+/- 5,180) = 497 MB/s
test naive_rev::twoway_find                 ... bench:         654 ns/iter (+/- 89) = 382 MB/s
test pathological_two_way::twoway_find      ... bench:     200,557 ns/iter (+/- 328) = 299 MB/s
test pathological_two_way_rev::twoway_find  ... bench:      50,020 ns/iter (+/- 73) = 1199 MB/s
test periodic2::twoway_find                 ... bench:      39,672 ns/iter (+/- 103) = 504 MB/s
test periodic::twoway_find                  ... bench:      40,160 ns/iter (+/- 58) = 498 MB/s
test short_1let_long::twoway_find           ... bench:       7,078 ns/iter (+/- 16) = 360 MB/s
test short_2let_long::twoway_find           ... bench:       3,821 ns/iter (+/- 4) = 667 MB/s
test short_3let_long::twoway_find           ... bench:       3,672 ns/iter (+/- 43) = 694 MB/s
test short_short::twoway_find               ... bench:         132 ns/iter (+/- 1) = 424 MB/s
test short_word_long::twoway_find           ... bench:       2,159 ns/iter (+/- 21) = 1181 MB/s

@bluss
Copy link
Member Author

bluss commented Aug 7, 2015

I found an improvement to the innerest parts of the loop in TwoWaySearcher::next (the forward search). Added that as an add-on commit, see its commit log.

Edit: Both next and next_back.

Relies on slice length bounds being inside isize, that these calculations can't land back inside the haystack range when it wraps around.

  • next: position + needle.len() - 1 is a valid haystack index or out of bounds (0 <= position <= haystack)
  • next_back: end - needle.len() is a valid haystack index or out of bounds (0 <= end <= haystack)

The next calculation should never overflow. The next_back calculation is using wrapping_sub and will wrap when we are out of bounds. The resulting wrapped index should return None from get. This combines the end check with the bounds check into one operation.

The innermost loop of TwoWaySearcher checks the boundary of the haystack
vs position + needle.len(), and it checks the last byte of the needle
against the byteset.

If these two steps are combined by using the indexing of the last
needle byte's position as bounds check, the algorithm improves its
throughput. We improve the innermost loop by reducing the number of
instructions used, and elminating the panic case for the checked
indexing that was previously used.

Selected benchmarks from the external/workspace testsuite. Benchmarks
improve across the board.

```
before:

test bb_in_aa::twoway_find                  ... bench:       4,229 ns/iter (+/- 1,305) = 23646 MB/s
test bb_in_aa::twoway_rfind                 ... bench:       3,873 ns/iter (+/- 101) = 25819 MB/s
test short_1let_long::twoway_find           ... bench:       7,075 ns/iter (+/- 29) = 360 MB/s
test short_1let_long::twoway_rfind          ... bench:       6,640 ns/iter (+/- 79) = 384 MB/s
test short_2let_long::twoway_find           ... bench:       3,823 ns/iter (+/- 16) = 667 MB/s
test short_2let_long::twoway_rfind          ... bench:       3,774 ns/iter (+/- 44) = 675 MB/s
test short_3let_long::twoway_find           ... bench:       3,582 ns/iter (+/- 47) = 712 MB/s
test short_3let_long::twoway_rfind          ... bench:       3,616 ns/iter (+/- 34) = 705 MB/s

with this commit:

test bb_in_aa::twoway_find                  ... bench:       2,952 ns/iter (+/- 20) = 33875 MB/s
test bb_in_aa::twoway_rfind                 ... bench:       2,939 ns/iter (+/- 99) = 34025 MB/s
test short_1let_long::twoway_find           ... bench:       4,593 ns/iter (+/- 4) = 555 MB/s
test short_1let_long::twoway_rfind          ... bench:       4,592 ns/iter (+/- 76) = 555 MB/s
test short_2let_long::twoway_find           ... bench:       2,804 ns/iter (+/- 3) = 909 MB/s
test short_2let_long::twoway_rfind          ... bench:       2,807 ns/iter (+/- 40) = 908 MB/s
test short_3let_long::twoway_find           ... bench:       3,105 ns/iter (+/- 120) = 821 MB/s
test short_3let_long::twoway_rfind          ... bench:       3,019 ns/iter (+/- 50) = 844 MB/s
```

- `bb_in_aa`: fast skip due to byteset filter loop improves.
- 1/2/3let: Searches for 1, 2, or 3 ascii bytes improves.
@bluss bluss added the relnotes Marks issues that should be documented in the release notes of the next release. label Aug 7, 2015
@bluss bluss changed the title Implement the full two way algorithm in reverse Implement the complete reverse case for the two way algorithm Aug 8, 2015
if is_long {
searcher.next_back::<MatchOnly>(self.haystack.as_bytes(),
self.needle.as_bytes(),
true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not is_long here instead of the branch?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was to encourage the compiler to compile two different versions of next_back (next does the same)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unsure if this is actually working to compile to two versions. Wrapping Long vs Short period cases into the strategy trait could force it instead.

This was an idea from the original code, commit 39cb5b1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I investigated this, it looks like this is what happens: it seems to work when looking at the code generated in benchmarks. The compiler transforms branches into either long period or short period outside of the main 'search loop.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I talked with @bluss on IRC and this seems plausible. Can you put a comment that briefly describes the optimization so future edits don't "fix" the code?

@huonw
Copy link
Member

huonw commented Aug 10, 2015

I don't understand this algorithm at all, and it'll take me a while to get up to speed, so I'm not sure I'm the best person to review.

@bluss
Copy link
Member Author

bluss commented Aug 10, 2015

@huonw I'm sympathetic. I understand it better now (second PR to fix this algorithm) than I did after submitting the first PR.

Last time @gereeter reviewed but he seems to be away.

@bluss
Copy link
Member Author

bluss commented Aug 13, 2015

The bulk of my testing is not in the rust source or in this PR, it's in my separate repo, a lot of quickcheck tests for interesting properties (finding correct matches, match-reject ranges add up, match-reject ranges on char boundaries etc.)

Main issues preventing tests to be included are productivity issues:

  • Quickcheck is not used in libstd, so it's huge work to replicate meaningful tests
  • Productivity in the libstd source is very low, so it's much easier to work with tests outside, even without quickcheck

@BurntSushi
Copy link
Member

OK, so I spent some time reading the relevant paper and hashing this out with @bluss, and I'm reasonably convinced that this is a good approach. Having to compute a new factorization makes sense to me, and using the previously computed period to stop the reverse period calculation early is pretty clever. :-)

I had one little nit about adding a comment, but otherwise, I'm pretty happy with this.

@bluss
Copy link
Member Author

bluss commented Aug 16, 2015

@BurntSushi Thank you for reading! Should we find some one more to review it, or is it good? Since you were so leninent on comments, I'll underline that I don't mind detailed critique.

@BurntSushi
Copy link
Member

@bluss I think I'm not the right person to answer that---I've never r+'d anything before. I'd feel better if someone else chimed in.

fn maximal_suffix(arr: &[u8], order_greater: bool) -> (usize, usize) {
let mut left = 0; // Corresponds to i in the paper
let mut right = 1; // Corresponds to j in the paper
let mut offset = 0; // Corresponds to k in the paper
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this have a note saying that the paper starts k at 1 because the array indexing in the paper starts at one, and that using 0 instead fits 0-based indexing correctly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do. You're pointing out that the new version is in a way closer to the paper than the old, I didn't notice that, it's nice.

It was rewritten to insert unchecked indexing while the code being obviously correct. Once written in this shape, unchecked indexing didn't make a difference..

@gereeter
Copy link
Contributor

Sorry for not responding earlier - I was away. I had a few comments on making things clearer and less repetitive, but otherwise this looks very good. Thanks for doing this, @bluss!

@bluss
Copy link
Member Author

bluss commented Aug 16, 2015

@gereeter Away is what you should be when it's summer!

I've added more comments, clarified some things. Thank you for reviewing!

I want to skip merging the two maximal suffix functions. Any way to do it nicely ends up with just as much code or more, but arranged differently.

Break out a separate static method to create the "byteset".
@BurntSushi
Copy link
Member

@bors r+

1 similar comment
@brson
Copy link
Contributor

brson commented Aug 17, 2015

@bors r+

@bors
Copy link
Contributor

bors commented Aug 17, 2015

📌 Commit 01e8812 has been approved by brson

@brson
Copy link
Contributor

brson commented Aug 17, 2015

@BurntSushi @bors doesn't think you are a reviewer.

@bors
Copy link
Contributor

bors commented Aug 18, 2015

⌛ Testing commit 01e8812 with merge de67d62...

bors added a commit that referenced this pull request Aug 18, 2015
StrSearcher: Implement the complete reverse case for the two way algorithm

Fix quadratic behavior in StrSearcher in reverse search with periodic
needles.

This commit adds the missing pieces for the "short period" case in
reverse search. The short case will show up when the needle is literally
periodic, for example "abababab".

Two way uses a "critical factorization" of the needle: x = u v.

Searching matches v first, if mismatch at character k, skip k forward.
Matching u, if mismatch, skip period(x) forward.

To avoid O(mn) behavior after mismatch in u, memorize the already
matched prefix.

The short period case requires that |u| < period(x).

For the reverse search we need to compute a different critical
factorization x = u' v' where |v'| < period(x), because we are searching
for the reversed needle. A short v' also benefits the algorithm in
general.

The reverse critical factorization is computed quickly by using the same
maximal suffix algorithm, but terminating as soon as we have a location
with local period equal to period(x).

This adds extra fields crit_pos_back and memory_back for the reverse
case. The new overhead for TwoWaySearcher::new is low, and additionally
I think the "short period" case is uncommon in many applications of
string search.

The maximal_suffix methods were updated in documentation and the
algorithms updated to not use !0 and wrapping add, variable left is now
1 larger, offset 1 smaller.

Use periodicity when computing byteset: in the periodic case, just
iterate over one period instead of the whole needle.

Example before (rfind) after (twoway_rfind) benchmark shows the removal
of quadratic behavior.

needle: "ab" * 100, haystack: ("bb" + "ab" * 100) * 100

```
test periodic::rfind           ... bench:   1,926,595 ns/iter (+/- 11,390) = 10 MB/s
test periodic::twoway_rfind    ... bench:      51,740 ns/iter (+/- 66) = 386 MB/s
```
@bors bors merged commit 01e8812 into rust-lang:master Aug 18, 2015
@bluss bluss deleted the twoway-reverse branch August 18, 2015 10:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
relnotes Marks issues that should be documented in the release notes of the next release.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants