Implement vectorized min_ / max_element for ints #2447
Resolves microsoft#2438

TODO:
* Test coverage
* Attach minmax_element
* Add AVX2 version of the same

----

<details> <summary><b>Benchmark</b></summary>

```C++
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdlib> // abort
#include <iomanip> // std::setw
#include <iostream>
#include <ranges>

#include <intrin.h>

enum class Kind {
    Min,
    Max,
};

template <typename T>
void benchmark_find(T* a, std::size_t max, size_t start, size_t pos, Kind kind, size_t rep) {
    // Fill with '0', then plant a single extreme value at pos:
    // '*' compares less than '0', and '1' compares greater than '0'.
    std::fill_n(a, max, '0');
    if (pos < max && pos >= start) {
        if (kind == Kind::Min) {
            a[pos] = '*';
        } else {
            a[pos] = '1';
        }
    }

    auto t1 = std::chrono::steady_clock::now();

    switch (kind) {
    case Kind::Min:
        for (std::size_t s = 0; s < rep; s++) {
            if (std::min_element(a + start, a + max) != a + pos) {
                abort();
            }
        }
        break;
    case Kind::Max:
        for (std::size_t s = 0; s < rep; s++) {
            // Was std::min_element, which cannot find the planted maximum.
            if (std::max_element(a + start, a + max) != a + pos) {
                abort();
            }
        }
        break;
    }

    auto t2 = std::chrono::steady_clock::now();

    const char* op_str = nullptr;
    switch (kind) {
    case Kind::Min:
        op_str = "min";
        break;
    case Kind::Max:
        op_str = "max";
        break;
    }

    std::cout << std::setw(10) << std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count()
              << "s -- "
              << "Op " << op_str << " Size " << sizeof(T) << " byte elements, array size " << max
              << " starting at " << start << " found at " << pos << "; " << rep << " repetitions \n";
}

constexpr std::size_t Nmax = 8192;

alignas(64) std::uint8_t a8[Nmax];
alignas(64) std::uint16_t a16[Nmax];
alignas(64) std::uint32_t a32[Nmax];
alignas(64) std::uint64_t a64[Nmax];

extern "C" long __isa_enabled;

int main() {
    std::cout << "Vector alg used: " << _USE_STD_VECTOR_ALGORITHMS << "\n";

    benchmark_find(a8, Nmax, 0, 3459, Kind::Min, 100000);
    benchmark_find(a16, Nmax, 0, 3459, Kind::Min, 100000);
    benchmark_find(a32, Nmax, 0, 3459, Kind::Min, 100000);
    benchmark_find(a64, Nmax, 0, 3459, Kind::Min, 100000);

    benchmark_find(a8, Nmax, 0, 3459, Kind::Max, 100000);
    benchmark_find(a16, Nmax, 0, 3459, Kind::Max, 100000);
    benchmark_find(a32, Nmax, 0, 3459, Kind::Max, 100000);
    benchmark_find(a64, Nmax, 0, 3459, Kind::Max, 100000);

    std::cout << "Done\n";
    return 0;
}
```

</details>

<details> <summary><b>Current benchmark results</b></summary>

```
**********************************************************************
** Visual Studio 2022 Developer Command Prompt v17.1.0-pre.1.1
** Copyright (c) 2021 Microsoft Corporation
**********************************************************************
[vcvarsall.bat] Environment initialized for: 'x64'

C:\Program Files\Microsoft Visual Studio\2022\Preview>cd/d C:\Project\vector_find_benchmark

C:\Project\vector_find_benchmark>set INCLUDE=C:\Project\STL\out\build\x64\out\inc;%INCLUDE%

C:\Project\vector_find_benchmark>set LIB=C:\Project\STL\out\build\x64\out\lib\amd64;%LIB%

C:\Project\vector_find_benchmark>set PATH=C:\Project\STL\out\build\x64\out\bin\amd64;%PATH%

C:\Project\vector_find_benchmark>cl /O2 /std:c++latest /EHsc /D_USE_STD_VECTOR_ALGORITHMS=0 /nologo vector_find_benchmark.cpp
vector_find_benchmark.cpp
vector_find_benchmark.cpp(1): warning C4005: '_USE_STD_VECTOR_ALGORITHMS': macro redefinition
vector_find_benchmark.cpp: note: see previous definition of '_USE_STD_VECTOR_ALGORITHMS'

C:\Project\vector_find_benchmark>cl /O2 /std:c++latest /EHsc /D_USE_STD_VECTOR_ALGORITHMS=0 /nologo vector_find_benchmark.cpp
vector_find_benchmark.cpp

C:\Project\vector_find_benchmark>vector_find_benchmark.exe
Vector alg used: 0
   1.48497s -- Op min Size 1 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions
   1.48125s -- Op min Size 2 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions
   1.47988s -- Op min Size 4 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions
   1.48431s -- Op min Size 8 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions

C:\Project\vector_find_benchmark>cl /O2 /std:c++latest /EHsc /D_USE_STD_VECTOR_ALGORITHMS=1 /nologo vector_find_benchmark.cpp
vector_find_benchmark.cpp

C:\Project\vector_find_benchmark>vector_find_benchmark.exe
Vector alg used: 1
 0.0559598s -- Op min Size 1 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions
 0.0681002s -- Op min Size 2 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions
  0.159074s -- Op min Size 4 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions
  0.597614s -- Op min Size 8 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions
```

</details>
The call to `_STD is_constant_evaluated()` needs to be guarded so that it is only made in C++20 mode.
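As a rough illustration of that kind of guard (not the STL's actual code), the sketch below only calls `std::is_constant_evaluated()` when the standard feature-test macro `__cpp_lib_is_constant_evaluated` is defined. The names `SKETCH_CONSTEXPR20`, `sketch_runtime_min`, and `sketch_min_element` are hypothetical, and the choice of macro is an assumption; the library itself may gate this differently.

```cpp
#include <algorithm>
#include <cassert>
#include <type_traits>

// Assumption: use the standard feature-test macro to detect C++20 support.
#if defined(__cpp_lib_is_constant_evaluated)
#define SKETCH_CONSTEXPR20 constexpr
#else
#define SKETCH_CONSTEXPR20 inline
#endif

// Hypothetical stand-in for a vectorized routine that is not constexpr.
inline const int* sketch_runtime_min(const int* first, const int* last) {
    assert(first <= last); // runtime-only sanity check on the range
    return std::min_element(first, last);
}

// Hypothetical wrapper: scalar loop during constant evaluation,
// runtime routine otherwise.
SKETCH_CONSTEXPR20 const int* sketch_min_element(const int* first, const int* last) {
#if defined(__cpp_lib_is_constant_evaluated)
    if (std::is_constant_evaluated()) {
        const int* best = first;
        if (first != last) {
            for (const int* it = first + 1; it != last; ++it) {
                if (*it < *best) {
                    best = it; // strict comparison keeps the first smallest element
                }
            }
        }
        return best;
    }
#endif
    return sketch_runtime_min(first, last);
}
```

Before C++20 the wrapper is a plain inline function and always takes the runtime path, so the `is_constant_evaluated()` call never appears in the translation unit.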
@AlexGuteniev Thanks, this looks good - another amazing speedup! I pushed changes for the issues I noticed (FYI @barcharcraz in case you want to double-check). Edit: Also renamed "cor" to "correction".
We believe that CUDA 11.6 supports `__builtin_is_constant_evaluated`.
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.
We need to restore the warning workarounds because the internal build is still using a slightly older compiler without the fix. I've opted to make them uncommented perma-workarounds.
Thanks for minimizing the time and maximizing the speed of these algorithms! 🚀 🎉 😻
Co-authored-by: Stephan T. Lavavej <stl@nuwen.net>
📝 Summary
SSE4.1 implementation of `min_element`/`max_element`/`minmax_element` for signed and unsigned integers of sizes 1, 2, 4, 8.
Resolves #2438
The algorithm is more complex than the existing vector algorithms; I'm not sure whether this level of complexity is acceptable.
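Part of that complexity comes from `min_element` having to return the position of the first smallest element, not just its value. The sketch below is not the algorithm from this PR. It is a simpler two-pass illustration for `int16_t` only, using the SSE2 intrinsic `_mm_min_epi16` instead of SSE4.1, with a scalar fallback when SSE2 is not available. The names `SKETCH_HAS_SSE2`, `sketch_min_value`, and `sketch_min_element` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1
#else
#define SKETCH_HAS_SSE2 0
#endif

// Hypothetical helper: smallest value in a non-empty array.
inline std::int16_t sketch_min_value(const std::int16_t* first, std::size_t count) {
    assert(count > 0);
    std::int16_t result = first[0];
    std::size_t i       = 0;
#if SKETCH_HAS_SSE2
    if (count >= 8) {
        // Vertical pass: keep eight running minimums, one per 16-bit lane.
        __m128i acc = _mm_loadu_si128(reinterpret_cast<const __m128i*>(first));
        for (i = 8; i + 8 <= count; i += 8) {
            const __m128i cur = _mm_loadu_si128(reinterpret_cast<const __m128i*>(first + i));
            acc               = _mm_min_epi16(acc, cur);
        }
        // Horizontal reduction: fold the eight lanes down into lane 0.
        acc    = _mm_min_epi16(acc, _mm_srli_si128(acc, 8));
        acc    = _mm_min_epi16(acc, _mm_srli_si128(acc, 4));
        acc    = _mm_min_epi16(acc, _mm_srli_si128(acc, 2));
        result = static_cast<std::int16_t>(_mm_extract_epi16(acc, 0));
    }
#endif
    // Scalar tail (or the whole array when the vector path is not taken).
    for (; i < count; ++i) {
        result = std::min(result, first[i]);
    }
    return result;
}

// Hypothetical helper: second pass locates the first occurrence of the minimum.
inline const std::int16_t* sketch_min_element(const std::int16_t* first, const std::int16_t* last) {
    if (first == last) {
        return last;
    }
    const std::int16_t smallest = sketch_min_value(first, static_cast<std::size_t>(last - first));
    return std::find(first, last, smallest);
}
```

The second pass is what a single-pass vectorized `min_element` avoids, at the cost of tracking element positions alongside the running minimums.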
🧭 Further directions (suggested next PRs)
`ranges::min`, `ranges::max`, `ranges::minmax`: since these don't need iterators, the algorithm will be simpler and faster, consisting only of a vertical min/max and one reduction.
🏁 Perf benchmark
Benchmark
Current benchmark results
Results table
⚖️ Size impact
The change adds more code.
DLLs and PDBs for them are not affected. Static libraries are affected.
The impact is negligible for static libs, but noticeable for import libs.
Table
✔️ Test coverage