Improve search/find_end perf by dropping memcmp #4654

Merged: 24 commits into microsoft:main, Jun 18, 2024

Conversation

@AlexGuteniev (Contributor) commented May 5, 2024

Resolves #2453

These benchmark results are no longer relevant, as the PR intention has changed

Before:

---------------------------------------------------------------------------------
Benchmark                                       Time             CPU   Iterations
---------------------------------------------------------------------------------
c_strstr                                      184 ns          119 ns      8960000
ranges_search<std::uint8_t>                  6276 ns         3899 ns       344615
ranges_search<std::uint16_t>                 6358 ns         3625 ns       331852
ranges_search<std::uint32_t>                 5652 ns         3692 ns       431622
ranges_search<std::uint64_t>                 7510 ns         4269 ns       373333
search_default_searcher<std::uint8_t>        1720 ns          921 ns      1120000
search_default_searcher<std::uint16_t>       2497 ns         1438 ns      1000000
search_default_searcher<std::uint32_t>       2224 ns         1001 ns      1280000
search_default_searcher<std::uint64_t>       2762 ns         1297 ns      1000000

After:

---------------------------------------------------------------------------------
Benchmark                                       Time             CPU   Iterations
---------------------------------------------------------------------------------
c_strstr                                      183 ns          102 ns     10000000
ranges_search<std::uint8_t>                   678 ns          355 ns      4266667
ranges_search<std::uint16_t>                 1285 ns          698 ns      1723077
ranges_search<std::uint32_t>                 2317 ns         1111 ns      1280000
ranges_search<std::uint64_t>                 4736 ns         2537 ns       597333
search_default_searcher<std::uint8_t>         669 ns          353 ns      4072727
search_default_searcher<std::uint16_t>       1293 ns          582 ns      1906383
search_default_searcher<std::uint32_t>       2261 ns         1400 ns       814545
search_default_searcher<std::uint64_t>       4756 ns         2511 ns       497778

strstr is given as a reference in the benchmark; it is not affected by the optimization.
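
For readers without the benchmark source at hand, a case for this scenario could look roughly like the sketch below. The function name, element values, and buffer sizes here are invented for illustration and may differ from the actual benchmarks/src/search.cpp.

// Illustrative sketch only: the real benchmark lives in benchmarks/src/search.cpp
// and may differ in data, sizes, and registration details.
#include <algorithm>
#include <cstdint>
#include <vector>
#include <benchmark/benchmark.h>

template <class T>
void ranges_search_sketch(benchmark::State& state) {
    // A haystack full of near-matches is the interesting case: every candidate
    // position matches for a while before failing.
    std::vector<T> haystack(2048, T{'a'});
    std::vector<T> needle(26, T{'a'});
    needle.back() = T{'b'};
    for (auto _ : state) {
        auto result = std::ranges::search(haystack, needle);
        benchmark::DoNotOptimize(result);
    }
}

BENCHMARK(ranges_search_sketch<std::uint8_t>);
BENCHMARK(ranges_search_sketch<std::uint64_t>);

BENCHMARK_MAIN();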

It may be impossible to reach strstr performance, as strstr uses pcmpistri (and reads beyond the last element, since pcmpistri is not very useful otherwise). We can try pcmpestri for the 8-bit and 16-bit cases, but it may still not be as efficient as strstr. I'd prefer to try that additional optimization in a follow-up PR, though.
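
For reference, a minimal sketch of the pcmpestri idea (not part of this PR, and not the STL's code): _mm_cmpestri in equal-ordered mode reports the first offset within a 16-byte haystack block at which the needle could begin. A real implementation would loop over blocks, verify candidates that run past a block boundary, and handle tails; this fragment also assumes the needle is at most 16 elements long and that a full 16 bytes are readable at both pointers.

#include <immintrin.h> // SSE4.2: _mm_cmpestri and the _SIDD_* flags

// Sketch only. Returns the first offset in [0, 15] at which `needle` could begin
// within the 16-byte block at `hay_block`, or 16 if there is no candidate.
int candidate_offset(const char* hay_block, int hay_len, const char* needle, int needle_len) {
    const __m128i n = _mm_loadu_si128(reinterpret_cast<const __m128i*>(needle));
    const __m128i h = _mm_loadu_si128(reinterpret_cast<const __m128i*>(hay_block));
    return _mm_cmpestri(n, needle_len, h, hay_len,
        _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ORDERED | _SIDD_LEAST_SIGNIFICANT);
}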

@AlexGuteniev requested a review from a team as a code owner, May 5, 2024 10:59
@AlexGuteniev (Contributor, Author):

The difference in the "before" results between ranges::search and search(..., default_searcher) is due to different implementations: search(..., default_searcher) doesn't try to use memcmp on every iteration; it does a direct comparison. Apparently, this is faster.

It is curious that search(..., default_searcher) is faster for the 64-bit type even before vectorization.

I would like someone else to confirm the results, and the maintainers to decide what to do with this.
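
To make the distinction concrete, here is a simplified sketch of the two approaches being compared; it is not the actual <xutility> code, and the function names are invented. The memcmp variant pays for a library call at every candidate position, while the direct-comparison variant typically exits after a single element comparison.

#include <cstddef>
#include <cstring>

// Sketch of the memcmp-per-candidate approach: every haystack position issues a
// memcmp call, even though most candidates fail on the very first element.
const char* search_with_memcmp(const char* hay, std::size_t hay_len, const char* nee, std::size_t nee_len) {
    for (std::size_t i = 0; i + nee_len <= hay_len; ++i) {
        if (std::memcmp(hay + i, nee, nee_len) == 0) {
            return hay + i;
        }
    }
    return nullptr;
}

// Sketch of the direct comparison that search(..., default_searcher) effectively
// does: the inner loop usually stops at the first mismatching element, with no
// per-candidate call overhead.
const char* search_plain(const char* hay, std::size_t hay_len, const char* nee, std::size_t nee_len) {
    for (std::size_t i = 0; i + nee_len <= hay_len; ++i) {
        std::size_t j = 0;
        while (j < nee_len && hay[i + j] == nee[j]) {
            ++j;
        }
        if (j == nee_len) {
            return hay + i;
        }
    }
    return nullptr;
}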

@AlexGuteniev (Contributor, Author) commented May 6, 2024

A possible way to handle this is to remove the 32-bit and 64-bit optimization/vectorization attempts entirely, so that the ranges::search case (along with the usual std::search) would behave the same as search(..., default_searcher).

@AlexGuteniev (Contributor, Author):

And if we are to keep vectorization only for 8-bit and 16-bit elements, we may want to drop the current implementation and not review/commit it in the first place, if SSE4.2 pcmpestri looks like it would be faster.

@StephanTLavavej changed the title to "vectorize search", May 7, 2024
@StephanTLavavej added the performance ("Must go faster") label, May 7, 2024
@StephanTLavavej self-assigned this, May 7, 2024
@StephanTLavavej (Member) left a comment:

Partial review - looking good so far!

Review comment threads (since resolved) on: stl/src/vector_algorithms.cpp, benchmarks/src/search.cpp, stl/inc/xutility, stl/inc/functional, stl/inc/algorithm, and tests/std/tests/VSO_0000000_vector_algorithms/test.cpp. One of them: "Who's a good search? You are! Yes you!"
@StephanTLavavej (Member):

I pushed changes to address the issues that I found, but I still need to think about whether we should be vectorizing this at all. I'm leaning towards ripping out the existing memcmp. Stay tuned...

…pred`.

`_Equal_rev_pred_unchecked` is called by classic/parallel `search`/`find_end`.

`_Equal_rev_pred` is called by ranges `search`/`find_end`.

This doesn't affect `equal` etc.
@StephanTLavavej (Member):

Ok, after looking at the benchmarks, I've taken the radical step of reverting both the vectorization changes and the existing memcmp "optimizations" that were massively pessimizing your reasonable benchmark example. My measurements:

Benchmark                                    main    Vector     Plain
c_strstr                                   144 ns    151 ns    145 ns
classic_search<std::uint8_t>              8935 ns   1726 ns   1754 ns
classic_search<std::uint16_t>             9732 ns   3017 ns   1739 ns
classic_search<std::uint32_t>             9029 ns   4829 ns   1912 ns
classic_search<std::uint64_t>             8970 ns   8471 ns   2527 ns
ranges_search<std::uint8_t>               8916 ns   1723 ns   1784 ns
ranges_search<std::uint16_t>              8932 ns   3012 ns   1785 ns
ranges_search<std::uint32_t>              8303 ns   4840 ns   2460 ns
ranges_search<std::uint64_t>              9061 ns   8577 ns   1860 ns
search_default_searcher<std::uint8_t>     1858 ns   1720 ns   1883 ns
search_default_searcher<std::uint16_t>    2525 ns   3016 ns   1861 ns
search_default_searcher<std::uint32_t>    1863 ns   4812 ns   1869 ns
search_default_searcher<std::uint64_t>    1873 ns   8440 ns   2592 ns

This compares main (with only the benchmark added), this vectorization PR before my revert, and finally "Plain" which lacks both vectorization and memcmp.

There's some noise here (e.g. search_default_searcher is unchanged between main and Plain, but there were random 2500 ns spikes), but the pattern is quite clear: memcmp is almost a 5x penalty for all element sizes, vectorization helps smaller elements, but plain code results in great performance across the board.

Note that dropping the memcmp paths from _Equal_rev_pred_unchecked and _Equal_rev_pred has very limited impact. _Equal_rev_pred_unchecked is called by classic/parallel search/find_end, and _Equal_rev_pred is called by ranges search/find_end. (This makes sense, since find_end is the "opposite" of search.)
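
For reference, after dropping memcmp such a helper is essentially a plain predicate loop. The sketch below is simplified, with invented names and parameters, and is not the actual <xutility> code.

// Simplified sketch: compare the needle [first2, last2) against the candidate
// position starting at first1, using pred for element equality. The caller has
// already checked that the haystack has enough elements left, so only the
// needle's end needs to be tested here.
template <class It1, class It2, class Pred>
bool equal_rev_pred_sketch(It1 first1, It2 first2, It2 last2, Pred pred) {
    for (; first2 != last2; ++first1, ++first2) {
        if (!pred(*first1, *first2)) {
            return false;
        }
    }
    return true;
}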

The equal etc. algorithms aren't affected by this. We should probably evaluate their use of memcmp, but I expect the behavior to be quite different (and I hope that memcmp is beneficial). It's search/find_end that are unusual because they want to repeatedly compare a needle, which has very different performance characteristics.

In addition to keeping the benchmark from this vectorization attempt, I've also kept the correctness test (even though search is no longer vectorized), since it's quick to run, and we might figure out how to vectorize this beneficially in the future.

@StephanTLavavej removed their assignment, Jun 10, 2024
@StephanTLavavej changed the title from "vectorize search" to "Improve search/find_end perf by dropping memcmp", Jun 10, 2024
@AlexGuteniev (Contributor, Author):

> The equal etc. algorithms aren't affected by this. We should probably evaluate their use of memcmp, but I expect the behavior to be quite different (and I hope that memcmp is beneficial)

Agreed. mismatch / lexicographical_compare did benefit from a similar optimization.

@StephanTLavavej self-assigned this, Jun 14, 2024
@StephanTLavavej (Member):

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej merged commit e3ed206 into microsoft:main, Jun 18, 2024
39 checks passed
@StephanTLavavej (Member):

Thanks for investigating and improving the performance here even if the final result was very different from the initial vision! 🔮 🪄 🚀
