Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use AVX/AVX2 masks in minmax_element and minmax vectorization #4917

Merged
merged 10 commits into from
Oct 12, 2024

Conversation

AlexGuteniev
Copy link
Contributor

@AlexGuteniev AlexGuteniev commented Aug 26, 2024

🧭 Overview

Use AVX2 mask to read tails for minmax/minmax element, then use the same masks to populate the tail with previous data, and to exclude tail indices for _element algorithm.

⏱️ Benchmark results

8 and 16 bit Both_val cases are expected to improve in a significant way too, but it is currently hidden by #4913

Benchmark main this
bm<uint8_t, Op::Min>/8021 173 ns 168 ns
bm<uint8_t, Op::Min>/63 21.3 ns 9.64 ns
bm<uint8_t, Op::Max>/8021 173 ns 163 ns
bm<uint8_t, Op::Max>/63 22.1 ns 9.74 ns
bm<uint8_t, Op::Both>/8021 287 ns 277 ns
bm<uint8_t, Op::Both>/63 40.2 ns 20.4 ns
bm<uint8_t, Op::Min_val>/8021 73.5 ns 69.9 ns
bm<uint8_t, Op::Min_val>/63 14.9 ns 4.40 ns
bm<uint8_t, Op::Max_val>/8021 75.7 ns 67.1 ns
bm<uint8_t, Op::Max_val>/63 13.9 ns 4.29 ns
bm<uint8_t, Op::Both_val>/8021 3250 ns 3250 ns
bm<uint8_t, Op::Both_val>/63 29.3 ns 29.2 ns
bm<uint16_t, Op::Min>/8021 314 ns 318 ns
bm<uint16_t, Op::Min>/31 13.2 ns 8.51 ns
bm<uint16_t, Op::Max>/8021 314 ns 316 ns
bm<uint16_t, Op::Max>/31 13.2 ns 8.51 ns
bm<uint16_t, Op::Both>/8021 526 ns 538 ns
bm<uint16_t, Op::Both>/31 27.1 ns 18.4 ns
bm<uint16_t, Op::Min_val>/8021 131 ns 128 ns
bm<uint16_t, Op::Min_val>/31 5.53 ns 3.58 ns
bm<uint16_t, Op::Max_val>/8021 136 ns 127 ns
bm<uint16_t, Op::Max_val>/31 5.33 ns 3.59 ns
bm<uint16_t, Op::Both_val>/8021 4541 ns 4550 ns
bm<uint16_t, Op::Both_val>/31 18.1 ns 17.9 ns
bm<uint32_t, Op::Min>/8021 627 ns 605 ns
bm<uint32_t, Op::Min>/15 8.79 ns 7.11 ns
bm<uint32_t, Op::Max>/8021 622 ns 613 ns
bm<uint32_t, Op::Max>/15 8.81 ns 7.18 ns
bm<uint32_t, Op::Both>/8021 1045 ns 1052 ns
bm<uint32_t, Op::Both>/15 19.5 ns 17.0 ns
bm<uint32_t, Op::Min_val>/8021 258 ns 246 ns
bm<uint32_t, Op::Min_val>/15 6.01 ns 3.14 ns
bm<uint32_t, Op::Max_val>/8021 258 ns 253 ns
bm<uint32_t, Op::Max_val>/15 3.63 ns 3.13 ns
bm<uint32_t, Op::Both_val>/8021 364 ns 328 ns
bm<uint32_t, Op::Both_val>/15 8.17 ns 7.26 ns
bm<uint64_t, Op::Min>/8021 3480 ns 3565 ns
bm<uint64_t, Op::Min>/7 8.92 ns 9.99 ns
bm<uint64_t, Op::Max>/8021 3552 ns 3552 ns
bm<uint64_t, Op::Max>/7 8.72 ns 9.14 ns
bm<uint64_t, Op::Both>/8021 4079 ns 4089 ns
bm<uint64_t, Op::Both>/7 18.7 ns 17.9 ns
bm<uint64_t, Op::Min_val>/8021 2861 ns 2868 ns
bm<uint64_t, Op::Min_val>/7 4.53 ns 4.78 ns
bm<uint64_t, Op::Max_val>/8021 2849 ns 2871 ns
bm<uint64_t, Op::Max_val>/7 4.54 ns 4.77 ns
bm<uint64_t, Op::Both_val>/8021 2898 ns 2932 ns
bm<uint64_t, Op::Both_val>/7 9.56 ns 9.29 ns
bm<int8_t, Op::Min>/8021 166 ns 166 ns
bm<int8_t, Op::Min>/63 20.7 ns 12.6 ns
bm<int8_t, Op::Max>/8021 171 ns 165 ns
bm<int8_t, Op::Max>/63 21.4 ns 12.6 ns
bm<int8_t, Op::Both>/8021 286 ns 273 ns
bm<int8_t, Op::Both>/63 33.1 ns 19.4 ns
bm<int8_t, Op::Min_val>/8021 77.9 ns 64.2 ns
bm<int8_t, Op::Min_val>/63 14.4 ns 4.74 ns
bm<int8_t, Op::Max_val>/8021 75.0 ns 72.4 ns
bm<int8_t, Op::Max_val>/63 16.7 ns 4.54 ns
bm<int8_t, Op::Both_val>/8021 3233 ns 3226 ns
bm<int8_t, Op::Both_val>/63 28.8 ns 29.4 ns
bm<int16_t, Op::Min>/8021 315 ns 316 ns
bm<int16_t, Op::Min>/31 14.0 ns 11.3 ns
bm<int16_t, Op::Max>/8021 314 ns 321 ns
bm<int16_t, Op::Max>/31 13.9 ns 11.3 ns
bm<int16_t, Op::Both>/8021 527 ns 532 ns
bm<int16_t, Op::Both>/31 23.2 ns 18.0 ns
bm<int16_t, Op::Min_val>/8021 135 ns 130 ns
bm<int16_t, Op::Min_val>/31 11.2 ns 4.06 ns
bm<int16_t, Op::Max_val>/8021 134 ns 131 ns
bm<int16_t, Op::Max_val>/31 11.4 ns 4.08 ns
bm<int16_t, Op::Both_val>/8021 4180 ns 4249 ns
bm<int16_t, Op::Both_val>/31 18.1 ns 18.2 ns
bm<int32_t, Op::Min>/8021 619 ns 607 ns
bm<int32_t, Op::Min>/15 9.40 ns 10.1 ns
bm<int32_t, Op::Max>/8021 622 ns 608 ns
bm<int32_t, Op::Max>/15 9.81 ns 10.1 ns
bm<int32_t, Op::Both>/8021 1059 ns 1037 ns
bm<int32_t, Op::Both>/15 19.0 ns 16.6 ns
bm<int32_t, Op::Min_val>/8021 255 ns 251 ns
bm<int32_t, Op::Min_val>/15 4.56 ns 3.57 ns
bm<int32_t, Op::Max_val>/8021 251 ns 244 ns
bm<int32_t, Op::Max_val>/15 4.57 ns 3.59 ns
bm<int32_t, Op::Both_val>/8021 362 ns 336 ns
bm<int32_t, Op::Both_val>/15 9.89 ns 7.69 ns
bm<int64_t, Op::Min>/8021 3473 ns 3502 ns
bm<int64_t, Op::Min>/7 13.3 ns 14.9 ns
bm<int64_t, Op::Max>/8021 3542 ns 3464 ns
bm<int64_t, Op::Max>/7 13.1 ns 14.9 ns
bm<int64_t, Op::Both>/8021 4084 ns 4012 ns
bm<int64_t, Op::Both>/7 18.6 ns 17.9 ns
bm<int64_t, Op::Min_val>/8021 2879 ns 2851 ns
bm<int64_t, Op::Min_val>/7 3.77 ns 3.87 ns
bm<int64_t, Op::Max_val>/8021 2846 ns 2870 ns
bm<int64_t, Op::Max_val>/7 3.62 ns 3.88 ns
bm<int64_t, Op::Both_val>/8021 3131 ns 3183 ns
bm<int64_t, Op::Both_val>/7 8.63 ns 8.91 ns
bm<float, Op::Min>/8021 1179 ns 1173 ns
bm<float, Op::Min>/15 9.28 ns 7.06 ns
bm<float, Op::Max>/8021 1182 ns 1173 ns
bm<float, Op::Max>/15 9.94 ns 7.03 ns
bm<float, Op::Both>/8021 1338 ns 1345 ns
bm<float, Op::Both>/15 15.7 ns 16.4 ns
bm<float, Op::Min_val>/8021 1176 ns 1174 ns
bm<float, Op::Min_val>/15 8.84 ns 7.17 ns
bm<float, Op::Max_val>/8021 1182 ns 1170 ns
bm<float, Op::Max_val>/15 9.83 ns 7.19 ns
bm<float, Op::Both_val>/8021 1341 ns 1333 ns
bm<float, Op::Both_val>/15 13.4 ns 13.3 ns
bm<double, Op::Min>/8021 2325 ns 2354 ns
bm<double, Op::Min>/7 8.79 ns 7.39 ns
bm<double, Op::Max>/8021 2330 ns 2390 ns
bm<double, Op::Max>/7 9.99 ns 7.85 ns
bm<double, Op::Both>/8021 2695 ns 2724 ns
bm<double, Op::Both>/7 15.9 ns 16.4 ns
bm<double, Op::Min_val>/8021 2321 ns 2323 ns
bm<double, Op::Min_val>/7 7.66 ns 7.33 ns
bm<double, Op::Max_val>/8021 2347 ns 2367 ns
bm<double, Op::Max_val>/7 9.79 ns 7.43 ns
bm<double, Op::Both_val>/8021 2688 ns 2725 ns
bm<double, Op::Both_val>/7 13.0 ns 13.9 ns

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner August 26, 2024 18:04
@CaseyCarter CaseyCarter added the performance Must go faster label Aug 26, 2024
@StephanTLavavej StephanTLavavej self-assigned this Aug 26, 2024
@StephanTLavavej StephanTLavavej changed the title Use AVX/AVX2 masks in minmax_element and minmax vectoization Use AVX/AVX2 masks in minmax_element and minmax vectorization Aug 27, 2024
@StephanTLavavej

This comment was marked as resolved.

stl/src/vector_algorithms.cpp Outdated Show resolved Hide resolved
stl/src/vector_algorithms.cpp Outdated Show resolved Hide resolved
@StephanTLavavej

This comment was marked as resolved.

@StephanTLavavej StephanTLavavej removed their assignment Oct 7, 2024
@AlexGuteniev

This comment was marked as resolved.

@StephanTLavavej
Copy link
Member

Final (voluminous) results on my 5950X, relative to main containing #4913. As always, "before" has the updated benchmark, but with git restore --source=main stl to revert the product code.

Click to expand table:
Benchmark Before After Speedup
bm<uint8_t, Op::Min>/8021 129 ns 122 ns 1.06
bm<uint8_t, Op::Min>/63 31.2 ns 12.5 ns 2.50
bm<uint8_t, Op::Max>/8021 130 ns 124 ns 1.05
bm<uint8_t, Op::Max>/63 31.2 ns 12.8 ns 2.44
bm<uint8_t, Op::Both>/8021 161 ns 150 ns 1.07
bm<uint8_t, Op::Both>/63 41.3 ns 20.5 ns 2.01
bm<uint8_t, Op::Min_val>/8021 69.7 ns 63.4 ns 1.10
bm<uint8_t, Op::Min_val>/63 19.3 ns 5.81 ns 3.32
bm<uint8_t, Op::Max_val>/8021 70.7 ns 63.6 ns 1.11
bm<uint8_t, Op::Max_val>/63 19.0 ns 6.61 ns 2.87
bm<uint8_t, Op::Both_val>/8021 76.7 ns 66.1 ns 1.16
bm<uint8_t, Op::Both_val>/63 26.6 ns 6.82 ns 3.90
bm<uint16_t, Op::Min>/8021 227 ns 226 ns 1.00
bm<uint16_t, Op::Min>/31 16.5 ns 10.5 ns 1.57
bm<uint16_t, Op::Max>/8021 224 ns 229 ns 0.98
bm<uint16_t, Op::Max>/31 15.8 ns 10.5 ns 1.50
bm<uint16_t, Op::Both>/8021 250 ns 267 ns 0.94
bm<uint16_t, Op::Both>/31 25.6 ns 18.8 ns 1.36
bm<uint16_t, Op::Min_val>/8021 116 ns 116 ns 1.00
bm<uint16_t, Op::Min_val>/31 7.77 ns 5.33 ns 1.46
bm<uint16_t, Op::Max_val>/8021 116 ns 114 ns 1.02
bm<uint16_t, Op::Max_val>/31 8.34 ns 5.54 ns 1.51
bm<uint16_t, Op::Both_val>/8021 117 ns 117 ns 1.00
bm<uint16_t, Op::Both_val>/31 9.27 ns 5.76 ns 1.61
bm<uint32_t, Op::Min>/8021 446 ns 432 ns 1.03
bm<uint32_t, Op::Min>/15 10.0 ns 8.80 ns 1.14
bm<uint32_t, Op::Max>/8021 446 ns 437 ns 1.02
bm<uint32_t, Op::Max>/15 10.1 ns 9.05 ns 1.12
bm<uint32_t, Op::Both>/8021 506 ns 511 ns 0.99
bm<uint32_t, Op::Both>/15 18.8 ns 17.1 ns 1.10
bm<uint32_t, Op::Min_val>/8021 223 ns 223 ns 1.00
bm<uint32_t, Op::Min_val>/15 6.39 ns 5.54 ns 1.15
bm<uint32_t, Op::Max_val>/8021 223 ns 223 ns 1.00
bm<uint32_t, Op::Max_val>/15 6.33 ns 5.12 ns 1.24
bm<uint32_t, Op::Both_val>/8021 439 ns 223 ns 1.97
bm<uint32_t, Op::Both_val>/15 6.64 ns 5.96 ns 1.11
bm<uint64_t, Op::Min>/8021 1142 ns 881 ns 1.30
bm<uint64_t, Op::Min>/7 10.0 ns 11.1 ns 0.90
bm<uint64_t, Op::Max>/8021 903 ns 879 ns 1.03
bm<uint64_t, Op::Max>/7 11.1 ns 10.6 ns 1.05
bm<uint64_t, Op::Both>/8021 1249 ns 1272 ns 0.98
bm<uint64_t, Op::Both>/7 20.0 ns 20.2 ns 0.99
bm<uint64_t, Op::Min_val>/8021 862 ns 857 ns 1.01
bm<uint64_t, Op::Min_val>/7 6.24 ns 6.00 ns 1.04
bm<uint64_t, Op::Max_val>/8021 864 ns 852 ns 1.01
bm<uint64_t, Op::Max_val>/7 6.05 ns 5.78 ns 1.05
bm<uint64_t, Op::Both_val>/8021 884 ns 873 ns 1.01
bm<uint64_t, Op::Both_val>/7 9.07 ns 8.78 ns 1.03
bm<int8_t, Op::Min>/8021 129 ns 123 ns 1.05
bm<int8_t, Op::Min>/63 30.5 ns 12.6 ns 2.42
bm<int8_t, Op::Max>/8021 129 ns 123 ns 1.05
bm<int8_t, Op::Max>/63 30.3 ns 12.6 ns 2.40
bm<int8_t, Op::Both>/8021 158 ns 148 ns 1.07
bm<int8_t, Op::Both>/63 41.3 ns 20.4 ns 2.02
bm<int8_t, Op::Min_val>/8021 71.7 ns 64.2 ns 1.12
bm<int8_t, Op::Min_val>/63 18.9 ns 6.61 ns 2.86
bm<int8_t, Op::Max_val>/8021 71.5 ns 64.1 ns 1.12
bm<int8_t, Op::Max_val>/63 19.1 ns 5.77 ns 3.31
bm<int8_t, Op::Both_val>/8021 77.5 ns 65.5 ns 1.18
bm<int8_t, Op::Both_val>/63 25.9 ns 7.10 ns 3.65
bm<int16_t, Op::Min>/8021 230 ns 230 ns 1.00
bm<int16_t, Op::Min>/31 15.6 ns 11.5 ns 1.36
bm<int16_t, Op::Max>/8021 225 ns 228 ns 0.99
bm<int16_t, Op::Max>/31 16.8 ns 11.3 ns 1.49
bm<int16_t, Op::Both>/8021 249 ns 268 ns 0.93
bm<int16_t, Op::Both>/31 25.9 ns 18.6 ns 1.39
bm<int16_t, Op::Min_val>/8021 116 ns 115 ns 1.01
bm<int16_t, Op::Min_val>/31 10.0 ns 5.14 ns 1.95
bm<int16_t, Op::Max_val>/8021 117 ns 220 ns 0.53
bm<int16_t, Op::Max_val>/31 8.62 ns 5.34 ns 1.61
bm<int16_t, Op::Both_val>/8021 119 ns 118 ns 1.01
bm<int16_t, Op::Both_val>/31 13.2 ns 5.99 ns 2.20
bm<int32_t, Op::Min>/8021 440 ns 438 ns 1.00
bm<int32_t, Op::Min>/15 9.93 ns 10.4 ns 0.95
bm<int32_t, Op::Max>/8021 441 ns 438 ns 1.01
bm<int32_t, Op::Max>/15 9.95 ns 10.4 ns 0.96
bm<int32_t, Op::Both>/8021 507 ns 514 ns 0.99
bm<int32_t, Op::Both>/15 18.4 ns 16.7 ns 1.10
bm<int32_t, Op::Min_val>/8021 224 ns 222 ns 1.01
bm<int32_t, Op::Min_val>/15 6.45 ns 5.53 ns 1.17
bm<int32_t, Op::Max_val>/8021 225 ns 222 ns 1.01
bm<int32_t, Op::Max_val>/15 7.92 ns 5.98 ns 1.32
bm<int32_t, Op::Both_val>/8021 437 ns 434 ns 1.01
bm<int32_t, Op::Both_val>/15 8.14 ns 5.54 ns 1.47
bm<int64_t, Op::Min>/8021 1135 ns 1096 ns 1.04
bm<int64_t, Op::Min>/7 13.2 ns 13.8 ns 0.96
bm<int64_t, Op::Max>/8021 1094 ns 1114 ns 0.98
bm<int64_t, Op::Max>/7 13.6 ns 14.1 ns 0.96
bm<int64_t, Op::Both>/8021 1218 ns 1272 ns 0.96
bm<int64_t, Op::Both>/7 19.6 ns 19.8 ns 0.99
bm<int64_t, Op::Min_val>/8021 851 ns 845 ns 1.01
bm<int64_t, Op::Min_val>/7 5.57 ns 5.32 ns 1.05
bm<int64_t, Op::Max_val>/8021 860 ns 846 ns 1.02
bm<int64_t, Op::Max_val>/7 5.59 ns 5.33 ns 1.05
bm<int64_t, Op::Both_val>/8021 873 ns 868 ns 1.01
bm<int64_t, Op::Both_val>/7 7.75 ns 7.91 ns 0.98
bm<float, Op::Min>/8021 443 ns 436 ns 1.02
bm<float, Op::Min>/15 9.72 ns 8.36 ns 1.16
bm<float, Op::Max>/8021 443 ns 436 ns 1.02
bm<float, Op::Max>/15 9.43 ns 8.33 ns 1.13
bm<float, Op::Both>/8021 521 ns 518 ns 1.01
bm<float, Op::Both>/15 16.6 ns 17.5 ns 0.95
bm<float, Op::Min_val>/8021 442 ns 437 ns 1.01
bm<float, Op::Min_val>/15 9.67 ns 8.35 ns 1.16
bm<float, Op::Max_val>/8021 439 ns 436 ns 1.01
bm<float, Op::Max_val>/15 9.23 ns 8.41 ns 1.10
bm<float, Op::Both_val>/8021 514 ns 524 ns 0.98
bm<float, Op::Both_val>/15 15.5 ns 16.4 ns 0.95
bm<double, Op::Min>/8021 874 ns 660 ns 1.32
bm<double, Op::Min>/7 8.43 ns 8.25 ns 1.02
bm<double, Op::Max>/8021 653 ns 654 ns 1.00
bm<double, Op::Max>/7 7.95 ns 8.20 ns 0.97
bm<double, Op::Both>/8021 1027 ns 1023 ns 1.00
bm<double, Op::Both>/7 16.4 ns 16.6 ns 0.99
bm<double, Op::Min_val>/8021 871 ns 654 ns 1.33
bm<double, Op::Min_val>/7 8.40 ns 8.55 ns 0.98
bm<double, Op::Max_val>/8021 653 ns 654 ns 1.00
bm<double, Op::Max_val>/7 8.03 ns 8.45 ns 0.95
bm<double, Op::Both_val>/8021 1024 ns 1016 ns 1.01
bm<double, Op::Both_val>/7 15.4 ns 15.7 ns 0.98

The only significant regression is for bm<int16_t, Op::Max_val>/8021, which was 117 ns before and 220 ns after, 0.53x speedup. I don't think this should block the PR, but I wanted to note it.

@StephanTLavavej StephanTLavavej self-assigned this Oct 11, 2024
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit caba83c into microsoft:main Oct 12, 2024
39 checks passed
@StephanTLavavej
Copy link
Member

It was the minimum of times, it was the maximum of times...

📉 📈 😻

@AlexGuteniev AlexGuteniev deleted the min_max_mask branch October 12, 2024 04:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants