`vector_algorithms.cpp`: `minmax` for 64-bit elements: replace ugly x86 workaround with a nice one #4661

AlexGuteniev · 2024-05-07T17:36:03Z

This piece:

Lines 1116 to 1123 in 8dc4faa

    static uint64_t _Get_any_u(const __m128i _Cur) noexcept {
    #ifdef _M_IX86
        return (static_cast<uint64_t>(static_cast<uint32_t>(_mm_extract_epi32(_Cur, 1))) << 32)
             | static_cast<uint64_t>(static_cast<uint32_t>(_mm_cvtsi128_si32(_Cur)));
    #else // ^^^ x86 / x64 vvv
        return static_cast<uint64_t>(_mm_cvtsi128_si64(_Cur));
    #endif // ^^^ x64 ^^^
    }

works around the oddity of not having `_mm_cvtsi128_si64` on 32-bit x86.

It has been problematic:

An internal bug was reported; it was fixed by "Fix vectorized min/max/minmax_element for 64-bit types on x86" #2821
A compiler bug was caught; it blocked "Should we require SSE2?" #3922

I have discovered a nicer workaround!

If we spill the reg into the stack, the spill will optimize away.
On 32-bit with at least /arch:SSE2 it even produces better code than the existing workaround.
Demo: https://godbolt.org/z/ErGWz8GYT

It still does the actual spill on /arch:IA32. But given that this path is executed only once per function call (there are no intermediate reductions for 64-bit elements), and there's a plan to lift to /arch:SSE2, I think that's fine.
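For readers who do not want to open the Godbolt link, here is a minimal sketch of the spill idea described above. It is not the code from this PR: the function name `get_u64_at` is hypothetical, and the sketch assumes an x86 or x64 target where `<emmintrin.h>` (SSE2) is available. The vector is stored to a small local array and one element is read back. With optimizations on, the expectation stated above is that the compiler removes the round trip through memory when SSE2 is enabled.

```cpp
#include <cstddef>
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics; assumes an x86/x64 target

// Hypothetical helper for illustration only (not the STL's internal name).
// Spills the 128-bit register to a stack array, then reads one 64-bit lane.
// The same source compiles on 32-bit and 64-bit x86, so no #ifdef is needed.
inline std::uint64_t get_u64_at(const __m128i cur, const std::size_t pos) noexcept {
    std::uint64_t lanes[2];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), cur); // the "spill"
    return lanes[pos]; // pos must be 0 (low lane) or 1 (high lane)
}
```

Because the lane index is an ordinary array subscript, the same helper covers both "any element" (index 0) and a specific position.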

StephanTLavavej · 2024-05-12T23:21:59Z

Thanks, this is great! 😻 I pushed further changes to centralize the logic, please meow if you have concerns.

AlexGuteniev · 2024-05-13T03:53:00Z

This centralization is already done in #4659; doing it again here would just create more conflicts.

AlexGuteniev · 2024-05-13T04:20:58Z

Oh, you also did the implementation of _Get_any via _Get_v_pos. I didn't think about that, but it seems it will be no worse, as 0 is expected to constant propagate.
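As a rough illustration of that point, here is a sketch with hypothetical names (`sketch_traits_8`, `get_v_pos`, `get_any`) that do not reproduce the actual STL signatures. It assumes SSE2 via `<emmintrin.h>`. The "any element" getter simply forwards to the positional getter with a literal 0, so after inlining the index is a compile-time constant and the compiler can be expected to fold it.

```cpp
#include <cstddef>
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics; assumes an x86/x64 target

// Hypothetical traits type for illustration; not the STL's _Minmax_traits_8.
struct sketch_traits_8 {
    // Positional getter: store the vector to a local array and index it.
    static std::int64_t get_v_pos(const __m128i cur, const std::size_t pos) noexcept {
        std::int64_t lanes[2];
        _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), cur);
        return lanes[pos];
    }

    // "Any element" getter expressed through the positional one.
    // The literal 0 is what is expected to constant propagate.
    static std::int64_t get_any(const __m128i cur) noexcept {
        return get_v_pos(cur, 0);
    }
};
```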

StephanTLavavej · 2024-05-13T04:24:25Z

I'm drowning in PRs right now, so I think I'm going to have to clear out the backlog in multiple batches (in between investigating a non-STL bug that I can't weasel out of investigating forever). I'd like to land this PR first, then resolve conflicts in #4659.

StephanTLavavej · 2024-05-17T19:27:16Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej · 2024-05-20T23:53:54Z

Thanks for noticing how to make this code way more elegant! 🪄 😻 💚

Do stack spill; compiler optimizes it away

455a184

AlexGuteniev requested a review from a team as a code owner May 7, 2024 17:36

AlexGuteniev added 2 commits May 7, 2024 20:43

Another occurrence!

00dde4c

consume less stack (virtually)

b702589

StephanTLavavej added the enhancement Something can be improved label May 7, 2024

StephanTLavavej self-assigned this May 7, 2024

StephanTLavavej added 4 commits May 12, 2024 15:46

Comment grammar.

a2798f6

_Minmax_traits_8: Implement _Get_any with _Get_v_pos and `static_cast`.

5d85e7e

_Minmax_traits_d::_Get_v_pos is identical to `_Minmax_traits_8::_Get_v_pos`.

39babc0

_Minmax_traits_d::_Get_any_u can punch through to `_Minmax_traits_8::_Get_v_pos`.

5563c6e

StephanTLavavej reviewed May 12, 2024

StephanTLavavej approved these changes May 12, 2024

StephanTLavavej removed their assignment May 12, 2024

StephanTLavavej self-assigned this May 17, 2024

StephanTLavavej merged commit 910275c into microsoft:main May 20, 2024
39 checks passed

AlexGuteniev deleted the fake_spill branch May 21, 2024 04:30
