Use AVX/AVX2 masks in `minmax_element` and `minmax` vectorization #4917

AlexGuteniev · 2024-08-26T18:04:35Z

🧭 Overview

Use AVX2 masked loads to read the tails in `minmax` and `minmax_element`, then reuse the same masks to fill the masked-out lanes of the tail vector with previous data and, for the `_element` algorithms, to exclude the tail indices.
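To illustrate the idea, here is a minimal sketch of a masked tail load for a minimum-value search over `int32_t`. It is not the code from `stl/src/vector_algorithms.cpp`: the names `min_value`, `min_value_avx2`, `min_value_scalar`, `sketch_tail_masks`, and the `SKETCH_*` macros are hypothetical, the sliding-window mask table is one possible way to build the mask, and filling masked-out lanes from the running minimum is an assumed stand-in for "previous data". Index tracking for the `_element` variants is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define SKETCH_X86 1
#endif

#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_AVX2 __attribute__((target("avx2"))) // enable AVX2 for one function only
#else
#define SKETCH_AVX2 // other compilers: intrinsics are assumed usable without an attribute
#endif

// Hypothetical scalar fallback, used when AVX2 is not available.
inline std::int32_t min_value_scalar(const std::int32_t* first, std::size_t count) {
    assert(count > 0);
    return *std::min_element(first, first + count);
}

#ifdef SKETCH_X86
// Eight all-ones entries followed by eight zeros. A window of 8 entries
// starting at offset (8 - n) has its first n lanes set and the rest clear.
alignas(32) static const std::uint32_t sketch_tail_masks[16] = {
    ~0u, ~0u, ~0u, ~0u, ~0u, ~0u, ~0u, ~0u, 0, 0, 0, 0, 0, 0, 0, 0};

SKETCH_AVX2 inline std::int32_t min_value_avx2(const std::int32_t* first, std::size_t count) {
    assert(count > 0);
    // Every lane starts as a real element, so it can never win incorrectly.
    __m256i cur_min = _mm256_set1_epi32(first[0]);

    std::size_t i = 0;
    for (; i + 8 <= count; i += 8) { // full 8-element vectors
        const __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(first + i));
        cur_min = _mm256_min_epi32(cur_min, v);
    }

    const std::size_t tail = count - i; // 0..7 leftover elements
    if (tail != 0) {
        const __m256i mask =
            _mm256_loadu_si256(reinterpret_cast<const __m256i*>(sketch_tail_masks + (8 - tail)));
        // Masked-out lanes are not read from memory and come back as zero.
        const __m256i loaded = _mm256_maskload_epi32(first + i, mask);
        // Replace those zero lanes with already-seen data so they cannot
        // change the result, then fold the tail in like any other vector.
        const __m256i filled = _mm256_blendv_epi8(cur_min, loaded, mask);
        cur_min = _mm256_min_epi32(cur_min, filled);
    }

    // Simple horizontal reduction through memory, for clarity rather than speed.
    alignas(32) std::int32_t lanes[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), cur_min);
    return *std::min_element(lanes, lanes + 8);
}
#endif

// Hypothetical dispatcher: use the AVX2 path only when it is known to be safe.
inline std::int32_t min_value(const std::int32_t* first, std::size_t count) {
#ifdef SKETCH_X86
#if defined(__GNUC__) || defined(__clang__)
    if (__builtin_cpu_supports("avx2")) {
        return min_value_avx2(first, count);
    }
#elif defined(__AVX2__)
    return min_value_avx2(first, count);
#endif
#endif
    return min_value_scalar(first, count);
}
```

Calling `min_value(data, 63)` on an array of 63 elements would process seven full vectors and then a 7-element tail with a single masked load, rather than finishing the tail with a scalar loop.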

⏱️ Benchmark results

The 8-bit and 16-bit `Both_val` cases are also expected to improve significantly, but the improvement is currently hidden by #4913.

Benchmark	main	this
bm<uint8_t, Op::Min>/8021	173 ns	168 ns
bm<uint8_t, Op::Min>/63	21.3 ns	9.64 ns
bm<uint8_t, Op::Max>/8021	173 ns	163 ns
bm<uint8_t, Op::Max>/63	22.1 ns	9.74 ns
bm<uint8_t, Op::Both>/8021	287 ns	277 ns
bm<uint8_t, Op::Both>/63	40.2 ns	20.4 ns
bm<uint8_t, Op::Min_val>/8021	73.5 ns	69.9 ns
bm<uint8_t, Op::Min_val>/63	14.9 ns	4.40 ns
bm<uint8_t, Op::Max_val>/8021	75.7 ns	67.1 ns
bm<uint8_t, Op::Max_val>/63	13.9 ns	4.29 ns
bm<uint8_t, Op::Both_val>/8021	3250 ns	3250 ns
bm<uint8_t, Op::Both_val>/63	29.3 ns	29.2 ns
bm<uint16_t, Op::Min>/8021	314 ns	318 ns
bm<uint16_t, Op::Min>/31	13.2 ns	8.51 ns
bm<uint16_t, Op::Max>/8021	314 ns	316 ns
bm<uint16_t, Op::Max>/31	13.2 ns	8.51 ns
bm<uint16_t, Op::Both>/8021	526 ns	538 ns
bm<uint16_t, Op::Both>/31	27.1 ns	18.4 ns
bm<uint16_t, Op::Min_val>/8021	131 ns	128 ns
bm<uint16_t, Op::Min_val>/31	5.53 ns	3.58 ns
bm<uint16_t, Op::Max_val>/8021	136 ns	127 ns
bm<uint16_t, Op::Max_val>/31	5.33 ns	3.59 ns
bm<uint16_t, Op::Both_val>/8021	4541 ns	4550 ns
bm<uint16_t, Op::Both_val>/31	18.1 ns	17.9 ns
bm<uint32_t, Op::Min>/8021	627 ns	605 ns
bm<uint32_t, Op::Min>/15	8.79 ns	7.11 ns
bm<uint32_t, Op::Max>/8021	622 ns	613 ns
bm<uint32_t, Op::Max>/15	8.81 ns	7.18 ns
bm<uint32_t, Op::Both>/8021	1045 ns	1052 ns
bm<uint32_t, Op::Both>/15	19.5 ns	17.0 ns
bm<uint32_t, Op::Min_val>/8021	258 ns	246 ns
bm<uint32_t, Op::Min_val>/15	6.01 ns	3.14 ns
bm<uint32_t, Op::Max_val>/8021	258 ns	253 ns
bm<uint32_t, Op::Max_val>/15	3.63 ns	3.13 ns
bm<uint32_t, Op::Both_val>/8021	364 ns	328 ns
bm<uint32_t, Op::Both_val>/15	8.17 ns	7.26 ns
bm<uint64_t, Op::Min>/8021	3480 ns	3565 ns
bm<uint64_t, Op::Min>/7	8.92 ns	9.99 ns
bm<uint64_t, Op::Max>/8021	3552 ns	3552 ns
bm<uint64_t, Op::Max>/7	8.72 ns	9.14 ns
bm<uint64_t, Op::Both>/8021	4079 ns	4089 ns
bm<uint64_t, Op::Both>/7	18.7 ns	17.9 ns
bm<uint64_t, Op::Min_val>/8021	2861 ns	2868 ns
bm<uint64_t, Op::Min_val>/7	4.53 ns	4.78 ns
bm<uint64_t, Op::Max_val>/8021	2849 ns	2871 ns
bm<uint64_t, Op::Max_val>/7	4.54 ns	4.77 ns
bm<uint64_t, Op::Both_val>/8021	2898 ns	2932 ns
bm<uint64_t, Op::Both_val>/7	9.56 ns	9.29 ns
bm<int8_t, Op::Min>/8021	166 ns	166 ns
bm<int8_t, Op::Min>/63	20.7 ns	12.6 ns
bm<int8_t, Op::Max>/8021	171 ns	165 ns
bm<int8_t, Op::Max>/63	21.4 ns	12.6 ns
bm<int8_t, Op::Both>/8021	286 ns	273 ns
bm<int8_t, Op::Both>/63	33.1 ns	19.4 ns
bm<int8_t, Op::Min_val>/8021	77.9 ns	64.2 ns
bm<int8_t, Op::Min_val>/63	14.4 ns	4.74 ns
bm<int8_t, Op::Max_val>/8021	75.0 ns	72.4 ns
bm<int8_t, Op::Max_val>/63	16.7 ns	4.54 ns
bm<int8_t, Op::Both_val>/8021	3233 ns	3226 ns
bm<int8_t, Op::Both_val>/63	28.8 ns	29.4 ns
bm<int16_t, Op::Min>/8021	315 ns	316 ns
bm<int16_t, Op::Min>/31	14.0 ns	11.3 ns
bm<int16_t, Op::Max>/8021	314 ns	321 ns
bm<int16_t, Op::Max>/31	13.9 ns	11.3 ns
bm<int16_t, Op::Both>/8021	527 ns	532 ns
bm<int16_t, Op::Both>/31	23.2 ns	18.0 ns
bm<int16_t, Op::Min_val>/8021	135 ns	130 ns
bm<int16_t, Op::Min_val>/31	11.2 ns	4.06 ns
bm<int16_t, Op::Max_val>/8021	134 ns	131 ns
bm<int16_t, Op::Max_val>/31	11.4 ns	4.08 ns
bm<int16_t, Op::Both_val>/8021	4180 ns	4249 ns
bm<int16_t, Op::Both_val>/31	18.1 ns	18.2 ns
bm<int32_t, Op::Min>/8021	619 ns	607 ns
bm<int32_t, Op::Min>/15	9.40 ns	10.1 ns
bm<int32_t, Op::Max>/8021	622 ns	608 ns
bm<int32_t, Op::Max>/15	9.81 ns	10.1 ns
bm<int32_t, Op::Both>/8021	1059 ns	1037 ns
bm<int32_t, Op::Both>/15	19.0 ns	16.6 ns
bm<int32_t, Op::Min_val>/8021	255 ns	251 ns
bm<int32_t, Op::Min_val>/15	4.56 ns	3.57 ns
bm<int32_t, Op::Max_val>/8021	251 ns	244 ns
bm<int32_t, Op::Max_val>/15	4.57 ns	3.59 ns
bm<int32_t, Op::Both_val>/8021	362 ns	336 ns
bm<int32_t, Op::Both_val>/15	9.89 ns	7.69 ns
bm<int64_t, Op::Min>/8021	3473 ns	3502 ns
bm<int64_t, Op::Min>/7	13.3 ns	14.9 ns
bm<int64_t, Op::Max>/8021	3542 ns	3464 ns
bm<int64_t, Op::Max>/7	13.1 ns	14.9 ns
bm<int64_t, Op::Both>/8021	4084 ns	4012 ns
bm<int64_t, Op::Both>/7	18.6 ns	17.9 ns
bm<int64_t, Op::Min_val>/8021	2879 ns	2851 ns
bm<int64_t, Op::Min_val>/7	3.77 ns	3.87 ns
bm<int64_t, Op::Max_val>/8021	2846 ns	2870 ns
bm<int64_t, Op::Max_val>/7	3.62 ns	3.88 ns
bm<int64_t, Op::Both_val>/8021	3131 ns	3183 ns
bm<int64_t, Op::Both_val>/7	8.63 ns	8.91 ns
bm<float, Op::Min>/8021	1179 ns	1173 ns
bm<float, Op::Min>/15	9.28 ns	7.06 ns
bm<float, Op::Max>/8021	1182 ns	1173 ns
bm<float, Op::Max>/15	9.94 ns	7.03 ns
bm<float, Op::Both>/8021	1338 ns	1345 ns
bm<float, Op::Both>/15	15.7 ns	16.4 ns
bm<float, Op::Min_val>/8021	1176 ns	1174 ns
bm<float, Op::Min_val>/15	8.84 ns	7.17 ns
bm<float, Op::Max_val>/8021	1182 ns	1170 ns
bm<float, Op::Max_val>/15	9.83 ns	7.19 ns
bm<float, Op::Both_val>/8021	1341 ns	1333 ns
bm<float, Op::Both_val>/15	13.4 ns	13.3 ns
bm<double, Op::Min>/8021	2325 ns	2354 ns
bm<double, Op::Min>/7	8.79 ns	7.39 ns
bm<double, Op::Max>/8021	2330 ns	2390 ns
bm<double, Op::Max>/7	9.99 ns	7.85 ns
bm<double, Op::Both>/8021	2695 ns	2724 ns
bm<double, Op::Both>/7	15.9 ns	16.4 ns
bm<double, Op::Min_val>/8021	2321 ns	2323 ns
bm<double, Op::Min_val>/7	7.66 ns	7.33 ns
bm<double, Op::Max_val>/8021	2347 ns	2367 ns
bm<double, Op::Max_val>/7	9.79 ns	7.43 ns
bm<double, Op::Both_val>/8021	2688 ns	2725 ns
bm<double, Op::Both_val>/7	13.0 ns	13.9 ns

stl/src/vector_algorithms.cpp

StephanTLavavej · 2024-10-08T13:54:15Z

Final (voluminous) results on my 5950X, relative to main containing #4913. As always, "before" uses the updated benchmark, with `git restore --source=main stl` to revert the product code.

Click to expand table:

Benchmark	Before	After	Speedup
`bm<uint8_t, Op::Min>/8021`	129 ns	122 ns	1.06
`bm<uint8_t, Op::Min>/63`	31.2 ns	12.5 ns	2.50
`bm<uint8_t, Op::Max>/8021`	130 ns	124 ns	1.05
`bm<uint8_t, Op::Max>/63`	31.2 ns	12.8 ns	2.44
`bm<uint8_t, Op::Both>/8021`	161 ns	150 ns	1.07
`bm<uint8_t, Op::Both>/63`	41.3 ns	20.5 ns	2.01
`bm<uint8_t, Op::Min_val>/8021`	69.7 ns	63.4 ns	1.10
`bm<uint8_t, Op::Min_val>/63`	19.3 ns	5.81 ns	3.32
`bm<uint8_t, Op::Max_val>/8021`	70.7 ns	63.6 ns	1.11
`bm<uint8_t, Op::Max_val>/63`	19.0 ns	6.61 ns	2.87
`bm<uint8_t, Op::Both_val>/8021`	76.7 ns	66.1 ns	1.16
`bm<uint8_t, Op::Both_val>/63`	26.6 ns	6.82 ns	3.90
`bm<uint16_t, Op::Min>/8021`	227 ns	226 ns	1.00
`bm<uint16_t, Op::Min>/31`	16.5 ns	10.5 ns	1.57
`bm<uint16_t, Op::Max>/8021`	224 ns	229 ns	0.98
`bm<uint16_t, Op::Max>/31`	15.8 ns	10.5 ns	1.50
`bm<uint16_t, Op::Both>/8021`	250 ns	267 ns	0.94
`bm<uint16_t, Op::Both>/31`	25.6 ns	18.8 ns	1.36
`bm<uint16_t, Op::Min_val>/8021`	116 ns	116 ns	1.00
`bm<uint16_t, Op::Min_val>/31`	7.77 ns	5.33 ns	1.46
`bm<uint16_t, Op::Max_val>/8021`	116 ns	114 ns	1.02
`bm<uint16_t, Op::Max_val>/31`	8.34 ns	5.54 ns	1.51
`bm<uint16_t, Op::Both_val>/8021`	117 ns	117 ns	1.00
`bm<uint16_t, Op::Both_val>/31`	9.27 ns	5.76 ns	1.61
`bm<uint32_t, Op::Min>/8021`	446 ns	432 ns	1.03
`bm<uint32_t, Op::Min>/15`	10.0 ns	8.80 ns	1.14
`bm<uint32_t, Op::Max>/8021`	446 ns	437 ns	1.02
`bm<uint32_t, Op::Max>/15`	10.1 ns	9.05 ns	1.12
`bm<uint32_t, Op::Both>/8021`	506 ns	511 ns	0.99
`bm<uint32_t, Op::Both>/15`	18.8 ns	17.1 ns	1.10
`bm<uint32_t, Op::Min_val>/8021`	223 ns	223 ns	1.00
`bm<uint32_t, Op::Min_val>/15`	6.39 ns	5.54 ns	1.15
`bm<uint32_t, Op::Max_val>/8021`	223 ns	223 ns	1.00
`bm<uint32_t, Op::Max_val>/15`	6.33 ns	5.12 ns	1.24
`bm<uint32_t, Op::Both_val>/8021`	439 ns	223 ns	1.97
`bm<uint32_t, Op::Both_val>/15`	6.64 ns	5.96 ns	1.11
`bm<uint64_t, Op::Min>/8021`	1142 ns	881 ns	1.30
`bm<uint64_t, Op::Min>/7`	10.0 ns	11.1 ns	0.90
`bm<uint64_t, Op::Max>/8021`	903 ns	879 ns	1.03
`bm<uint64_t, Op::Max>/7`	11.1 ns	10.6 ns	1.05
`bm<uint64_t, Op::Both>/8021`	1249 ns	1272 ns	0.98
`bm<uint64_t, Op::Both>/7`	20.0 ns	20.2 ns	0.99
`bm<uint64_t, Op::Min_val>/8021`	862 ns	857 ns	1.01
`bm<uint64_t, Op::Min_val>/7`	6.24 ns	6.00 ns	1.04
`bm<uint64_t, Op::Max_val>/8021`	864 ns	852 ns	1.01
`bm<uint64_t, Op::Max_val>/7`	6.05 ns	5.78 ns	1.05
`bm<uint64_t, Op::Both_val>/8021`	884 ns	873 ns	1.01
`bm<uint64_t, Op::Both_val>/7`	9.07 ns	8.78 ns	1.03
`bm<int8_t, Op::Min>/8021`	129 ns	123 ns	1.05
`bm<int8_t, Op::Min>/63`	30.5 ns	12.6 ns	2.42
`bm<int8_t, Op::Max>/8021`	129 ns	123 ns	1.05
`bm<int8_t, Op::Max>/63`	30.3 ns	12.6 ns	2.40
`bm<int8_t, Op::Both>/8021`	158 ns	148 ns	1.07
`bm<int8_t, Op::Both>/63`	41.3 ns	20.4 ns	2.02
`bm<int8_t, Op::Min_val>/8021`	71.7 ns	64.2 ns	1.12
`bm<int8_t, Op::Min_val>/63`	18.9 ns	6.61 ns	2.86
`bm<int8_t, Op::Max_val>/8021`	71.5 ns	64.1 ns	1.12
`bm<int8_t, Op::Max_val>/63`	19.1 ns	5.77 ns	3.31
`bm<int8_t, Op::Both_val>/8021`	77.5 ns	65.5 ns	1.18
`bm<int8_t, Op::Both_val>/63`	25.9 ns	7.10 ns	3.65
`bm<int16_t, Op::Min>/8021`	230 ns	230 ns	1.00
`bm<int16_t, Op::Min>/31`	15.6 ns	11.5 ns	1.36
`bm<int16_t, Op::Max>/8021`	225 ns	228 ns	0.99
`bm<int16_t, Op::Max>/31`	16.8 ns	11.3 ns	1.49
`bm<int16_t, Op::Both>/8021`	249 ns	268 ns	0.93
`bm<int16_t, Op::Both>/31`	25.9 ns	18.6 ns	1.39
`bm<int16_t, Op::Min_val>/8021`	116 ns	115 ns	1.01
`bm<int16_t, Op::Min_val>/31`	10.0 ns	5.14 ns	1.95
`bm<int16_t, Op::Max_val>/8021`	117 ns	220 ns	0.53
`bm<int16_t, Op::Max_val>/31`	8.62 ns	5.34 ns	1.61
`bm<int16_t, Op::Both_val>/8021`	119 ns	118 ns	1.01
`bm<int16_t, Op::Both_val>/31`	13.2 ns	5.99 ns	2.20
`bm<int32_t, Op::Min>/8021`	440 ns	438 ns	1.00
`bm<int32_t, Op::Min>/15`	9.93 ns	10.4 ns	0.95
`bm<int32_t, Op::Max>/8021`	441 ns	438 ns	1.01
`bm<int32_t, Op::Max>/15`	9.95 ns	10.4 ns	0.96
`bm<int32_t, Op::Both>/8021`	507 ns	514 ns	0.99
`bm<int32_t, Op::Both>/15`	18.4 ns	16.7 ns	1.10
`bm<int32_t, Op::Min_val>/8021`	224 ns	222 ns	1.01
`bm<int32_t, Op::Min_val>/15`	6.45 ns	5.53 ns	1.17
`bm<int32_t, Op::Max_val>/8021`	225 ns	222 ns	1.01
`bm<int32_t, Op::Max_val>/15`	7.92 ns	5.98 ns	1.32
`bm<int32_t, Op::Both_val>/8021`	437 ns	434 ns	1.01
`bm<int32_t, Op::Both_val>/15`	8.14 ns	5.54 ns	1.47
`bm<int64_t, Op::Min>/8021`	1135 ns	1096 ns	1.04
`bm<int64_t, Op::Min>/7`	13.2 ns	13.8 ns	0.96
`bm<int64_t, Op::Max>/8021`	1094 ns	1114 ns	0.98
`bm<int64_t, Op::Max>/7`	13.6 ns	14.1 ns	0.96
`bm<int64_t, Op::Both>/8021`	1218 ns	1272 ns	0.96
`bm<int64_t, Op::Both>/7`	19.6 ns	19.8 ns	0.99
`bm<int64_t, Op::Min_val>/8021`	851 ns	845 ns	1.01
`bm<int64_t, Op::Min_val>/7`	5.57 ns	5.32 ns	1.05
`bm<int64_t, Op::Max_val>/8021`	860 ns	846 ns	1.02
`bm<int64_t, Op::Max_val>/7`	5.59 ns	5.33 ns	1.05
`bm<int64_t, Op::Both_val>/8021`	873 ns	868 ns	1.01
`bm<int64_t, Op::Both_val>/7`	7.75 ns	7.91 ns	0.98
`bm<float, Op::Min>/8021`	443 ns	436 ns	1.02
`bm<float, Op::Min>/15`	9.72 ns	8.36 ns	1.16
`bm<float, Op::Max>/8021`	443 ns	436 ns	1.02
`bm<float, Op::Max>/15`	9.43 ns	8.33 ns	1.13
`bm<float, Op::Both>/8021`	521 ns	518 ns	1.01
`bm<float, Op::Both>/15`	16.6 ns	17.5 ns	0.95
`bm<float, Op::Min_val>/8021`	442 ns	437 ns	1.01
`bm<float, Op::Min_val>/15`	9.67 ns	8.35 ns	1.16
`bm<float, Op::Max_val>/8021`	439 ns	436 ns	1.01
`bm<float, Op::Max_val>/15`	9.23 ns	8.41 ns	1.10
`bm<float, Op::Both_val>/8021`	514 ns	524 ns	0.98
`bm<float, Op::Both_val>/15`	15.5 ns	16.4 ns	0.95
`bm<double, Op::Min>/8021`	874 ns	660 ns	1.32
`bm<double, Op::Min>/7`	8.43 ns	8.25 ns	1.02
`bm<double, Op::Max>/8021`	653 ns	654 ns	1.00
`bm<double, Op::Max>/7`	7.95 ns	8.20 ns	0.97
`bm<double, Op::Both>/8021`	1027 ns	1023 ns	1.00
`bm<double, Op::Both>/7`	16.4 ns	16.6 ns	0.99
`bm<double, Op::Min_val>/8021`	871 ns	654 ns	1.33
`bm<double, Op::Min_val>/7`	8.40 ns	8.55 ns	0.98
`bm<double, Op::Max_val>/8021`	653 ns	654 ns	1.00
`bm<double, Op::Max_val>/7`	8.03 ns	8.45 ns	0.95
`bm<double, Op::Both_val>/8021`	1024 ns	1016 ns	1.01
`bm<double, Op::Both_val>/7`	15.4 ns	15.7 ns	0.98

The only significant regression is for `bm<int16_t, Op::Max_val>/8021`, which was 117 ns before and 220 ns after, a 0.53x speedup. I don't think this should block the PR, but I wanted to note it.
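For reading the table, the Speedup column appears to be the Before time divided by the After time; this is inferred from the rows rather than stated in the thread. A small sketch, with the hypothetical helper name `speedup`:

```cpp
#include <cassert>
#include <cmath>

// Ratio of the old time to the new time: values above 1 mean the change
// made the benchmark faster, values below 1 mean it became slower.
inline double speedup(double before_ns, double after_ns) {
    assert(after_ns > 0.0);
    return before_ns / after_ns;
}
```

Under that reading, 117 ns before and 220 ns after gives roughly 0.53, and 31.2 ns before and 12.5 ns after gives roughly 2.50.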

StephanTLavavej · 2024-10-11T10:17:13Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej · 2024-10-12T03:52:28Z

It was the minimum of times, it was the maximum of times...

📉 📈 😻

AlexGuteniev added 4 commits August 26, 2024 11:03

benchmark

3a0de8c

values

6ae3ab8

indices

827b74d

Reduce sizes variety

e76261f

AlexGuteniev requested a review from a team as a code owner August 26, 2024 18:04

CaseyCarter added the performance Must go faster label Aug 26, 2024

StephanTLavavej self-assigned this Aug 26, 2024

StephanTLavavej changed the title ~~Use AVX/AVX2 masks in minmax_element and minmax vectoization~~ Use AVX/AVX2 masks in minmax_element and minmax vectorization Aug 27, 2024

AlexGuteniev and others added 3 commits August 27, 2024 14:10

reduce copypasta

7f2d6d6

fix floating mask

cb67ee5

Merge branch 'main' into min_max_mask

41ff083

StephanTLavavej added 3 commits October 7, 2024 13:40

Merge branch 'main' into min_max_mask

a740de3

Use [[maybe_unused]].

735803d

Advance when _Tail_byte_size is non-zero.

b709699

StephanTLavavej reviewed Oct 7, 2024

StephanTLavavej approved these changes Oct 7, 2024

StephanTLavavej removed their assignment Oct 7, 2024

StephanTLavavej mentioned this pull request Oct 7, 2024

Maintainer priorities #4700

Open

StephanTLavavej self-assigned this Oct 11, 2024

StephanTLavavej merged commit caba83c into microsoft:main Oct 12, 2024
39 checks passed

AlexGuteniev deleted the min_max_mask branch October 12, 2024 04:26
