Vectorize `remove_copy` for 4 and 8 byte elements #5062
AlexGuteniev wants to merge 15 commits into microsoft:main

Conversation
Benchmark results on a Ryzen 7840HS notebook:
Thanks. Apparently this is not the way to go 😿
Thanks @muellerj2 - I also confirm that this is a pessimization on my desktop 5950X (Zen 3):
@AlexGuteniev Do you want to rework or abandon this strategy?
I see no rework possible. Besides abandoning, we have the following options:

I expect that none of these are acceptable, but I'm leaving the final decision to you.
We talked about this at the weekly maintainer meeting, and although we always appreciate the vast amount of effort you've put into vectorizing the STL, and we're always sad to reject a PR, vendor-specific detection logic indeed seems to us to be a step too far. The STL has never exhibited vendor-specific behavior in the past, and signing up to monitor performance indefinitely and retuning the logic (in addition to complicating test coverage) doesn't appear to be worth the potential benefits here.
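For context: runtime vendor detection of the kind being ruled out here would look roughly like the following on MSVC. This is a generic sketch, not code from this PR.

```cpp
#include <intrin.h>
#include <cstring>

// CPUID leaf 0 returns the 12-byte vendor string in EBX, EDX, ECX order,
// which MSVC's __cpuid places in cpuInfo[1], cpuInfo[3], cpuInfo[2].
bool is_amd_cpu() {
    int cpuInfo[4];
    __cpuid(cpuInfo, 0);
    char vendor[13] = {};
    std::memcpy(vendor, &cpuInfo[1], 4);     // EBX
    std::memcpy(vendor + 4, &cpuInfo[3], 4); // EDX
    std::memcpy(vendor + 8, &cpuInfo[2], 4); // ECX
    return std::strcmp(vendor, "AuthenticAMD") == 0;
}
```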
Follow-up to #4987
For now, this handles only 4 and 8 byte elements and requires AVX2, so that AVX2 masked stores can be used.

This may be doable for 1 and 2 byte elements too, but it would require a different approach for storing the partial vector. Or it may not be doable for 1 and 2 byte elements at all, if every approach turns out to be slower than scalar. Either way, I'm not attempting it right now, to avoid making this PR too big.

AVX2 masked stores are slower than the usual stores, so this approach is not used uniformly; see the sketch below.
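For illustration, here's a minimal sketch of the masked-store compaction technique for 4-byte elements, assuming AVX2. This is not the PR's actual code: the function name is made up, and the scalar loop that builds the shuffle indices stands in for the precomputed permutation table a real implementation would use.

```cpp
#include <immintrin.h>
#include <cstdint>

// Sketch: copy [first, last) to dest, skipping elements equal to value.
// Per 8-lane block: compare against value, compact the kept lanes to the
// front with a cross-lane permute, then write only those lanes with a
// masked store so nothing is written past the end of the destination.
uint32_t* remove_copy_avx2_u32(
    const uint32_t* first, const uint32_t* last, uint32_t* dest, const uint32_t value) {
    const __m256i match    = _mm256_set1_epi32(static_cast<int>(value));
    const __m256i lane_ids = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    for (; last - first >= 8; first += 8) {
        const __m256i src = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(first));
        const unsigned removed = static_cast<unsigned>(
            _mm256_movemask_ps(_mm256_castsi256_ps(_mm256_cmpeq_epi32(src, match))));
        // Build the compaction indices; real code would look these up in a table.
        alignas(32) int idx[8] = {};
        int kept = 0;
        for (int i = 0; i < 8; ++i) {
            if ((removed & (1u << i)) == 0) {
                idx[kept++] = i;
            }
        }
        const __m256i shuffle = _mm256_load_si256(reinterpret_cast<const __m256i*>(idx));
        const __m256i packed  = _mm256_permutevar8x32_epi32(src, shuffle);
        // Mask lane j is all-ones iff j < kept; vpmaskmovd stores only those lanes.
        const __m256i store_mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(kept), lane_ids);
        _mm256_maskstore_epi32(reinterpret_cast<int*>(dest), store_mask, packed);
        dest += kept;
    }
    for (; first != last; ++first) { // scalar tail for the final partial vector
        if (*first != value) {
            *dest++ = *first;
        }
    }
    return dest;
}
```

The masked store (`vpmaskmovd`, one of the `vmaskmov*` family) is what keeps every write inside an exactly-sized destination buffer, and it is also the instruction whose timings are worried about below.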
⏱️ Benchmark results
- Expectedly, `remove_copy` vectorized is better than non-vectorized.
- Expectedly, `remove_copy` vectorized does not reach the `remove` vectorized performance.
- As usual, there are some minor variations in the unchanged `remove` vectorized results.
- I'm worried about the `vmaskmov*` timings. They seem to be bad enough on AMD to turn this into a pessimization.
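For contrast, a sketch of why in-place `remove` can presumably avoid the masked store (and thus stays faster): reusing the names from the sketch above, its store step could simply be a full store.

```cpp
// In-place remove: dest aliases the source range and never passes first, so
// a full 32-byte store stays inside the block that was just loaded. Lanes
// past `kept` scribble over elements that are either rewritten by a later
// store or end up in the unspecified region past the returned new end.
_mm256_storeu_si256(reinterpret_cast<__m256i*>(dest), packed);
dest += kept;
```

`remove_copy` can't do this, because its destination may be exactly as long as the number of kept elements, so the extra lanes would write out of bounds; hence the masked stores, and hence the sensitivity to `vmaskmov*` performance.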