Vectorize {Last}IndexOfAny{Except} for ASCII needles #76740

MihaZupan · 2022-10-07T09:27:46Z

Contributes to #68328 (and contributes the workhorse implementation for an eventual dedicated API)
cc: @gfoidl

This PR adds a vectorized path for IndexOfAny-like methods if the input is at least 8 characters long and the needle is only ASCII.
It is used if either Ssse3 or AdvSimd.Arm64 is supported.

I also unified all (Last)IndexOfAny(Except) methods to use the same approach:

if (HaystackIsShort) return SimpleSearch();
if (NeedleIsAscii) return Vectorized();
return ProbabilisticMap();

leading to perf improvements for both short and long inputs.
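
The dispatch order above can be sketched roughly as follows. This is only an illustration, not the PR's code: `SearchPath`, `DispatchSketch`, `Choose`, and `IsAscii` are hypothetical names, the threshold of 8 comes from the description, and the hardware check (Ssse3 / AdvSimd.Arm64) is left out.

```csharp
using System;

// Hypothetical names, used only to illustrate the order of the checks.
public enum SearchPath { SimpleSearch, Vectorized, ProbabilisticMap }

public static class DispatchSketch
{
    // True if every needle character is in the ASCII range [0, 127].
    private static bool IsAscii(ReadOnlySpan<char> needle)
    {
        foreach (char c in needle)
        {
            if (c > 0x7F)
                return false;
        }
        return true;
    }

    // Reports which path the three-step dispatch would pick.
    // Assumption: "short" means fewer than 8 characters.
    public static SearchPath Choose(ReadOnlySpan<char> haystack, ReadOnlySpan<char> needle)
    {
        if (haystack.Length < 8)
            return SearchPath.SimpleSearch;

        if (IsAscii(needle))
            return SearchPath.Vectorized;

        return SearchPath.ProbabilisticMap;
    }
}
```

For example, `DispatchSketch.Choose("abcdefgh", "ABCDEF")` would pick the vectorized path, while a needle containing 'ü' would fall through to the probabilistic map.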

For inputs with long runs of no matches, the vectorized path has 10 to 140+ times the throughput on realistic inputs.
As the -Except methods were previously O(n * m), you can of course see arbitrarily large improvements.
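
To illustrate where the O(n * m) cost comes from, here is a scalar sketch that contrasts a naive IndexOfAnyExcept loop with a table-based one. It is an assumption-laden stand-in: the PR's vectorized path uses a SIMD bitmap, whereas this sketch uses a plain 128-entry table, and `IndexOfAnyExceptSketch`, `Naive`, and `TryWithLookup` are hypothetical names.

```csharp
using System;

public static class IndexOfAnyExceptSketch
{
    // O(n * m): every haystack character triggers a linear scan of the needle.
    public static int Naive(ReadOnlySpan<char> haystack, ReadOnlySpan<char> needle)
    {
        for (int i = 0; i < haystack.Length; i++)
        {
            if (needle.IndexOf(haystack[i]) < 0)
                return i; // first character that is NOT in the needle
        }
        return -1;
    }

    // O(n + m): build a 128-entry table once, then do one lookup per character.
    // Returns false (and leaves index at -1) if the needle is not ASCII-only.
    public static bool TryWithLookup(ReadOnlySpan<char> haystack, ReadOnlySpan<char> needle, out int index)
    {
        index = -1;

        Span<bool> lookup = stackalloc bool[128];
        lookup.Clear();

        foreach (char c in needle)
        {
            if (c >= 128)
                return false; // non-ASCII needle: this sketch does not handle it
            lookup[c] = true;
        }

        for (int i = 0; i < haystack.Length; i++)
        {
            char c = haystack[i];
            if (c >= 128 || !lookup[c])
            {
                index = i;
                return true;
            }
        }

        return true; // every haystack character was in the needle; index stays -1
    }
}
```

With the table, the per-character cost no longer depends on the needle length, which is why the gap to the old loop grows with both the haystack and the needle.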

X86 numbers

These numbers were collected on a Windows Azure VM running on an Intel Xeon 8370C processor.

Method	Toolchain	Length	Needle	Mean	Error	Ratio
IndexOfAny	main	1	ABCDEF	9.787 ns	0.0087 ns	1.00
IndexOfAny	pr	1	ABCDEF	5.852 ns	0.0053 ns	0.60

IndexOfAnyExcept	main	1	ABCDEF	5.313 ns	0.0077 ns	1.00
IndexOfAnyExcept	pr	1	ABCDEF	6.676 ns	0.0104 ns	1.26

LastIndexOfAny	main	1	ABCDEF	11.376 ns	0.0065 ns	1.00
LastIndexOfAny	pr	1	ABCDEF	5.783 ns	0.0037 ns	0.51

LastIndexOfAnyExcept	main	1	ABCDEF	5.946 ns	0.0123 ns	1.00
LastIndexOfAnyExcept	pr	1	ABCDEF	6.030 ns	0.0324 ns	1.01

IndexOfAny	main	1	AlphaNumeric	63.653 ns	0.0265 ns	1.00
IndexOfAny	pr	1	AlphaNumeric	5.855 ns	0.0066 ns	0.09

IndexOfAnyExcept	main	1	AlphaNumeric	6.371 ns	0.0128 ns	1.00
IndexOfAnyExcept	pr	1	AlphaNumeric	7.619 ns	0.0104 ns	1.20

LastIndexOfAny	main	1	AlphaNumeric	74.164 ns	0.0712 ns	1.00
LastIndexOfAny	pr	1	AlphaNumeric	5.558 ns	0.0011 ns	0.07

LastIndexOfAnyExcept	main	1	AlphaNumeric	8.766 ns	0.2798 ns	1.00
LastIndexOfAnyExcept	pr	1	AlphaNumeric	6.859 ns	0.0018 ns	0.88


IndexOfAny	main	8	ABCDEF	14.12 ns	0.016 ns	1.00
IndexOfAny	pr	8	ABCDEF	12.25 ns	0.006 ns	0.87

IndexOfAnyExcept	main	8	ABCDEF	24.38 ns	0.019 ns	1.00
IndexOfAnyExcept	pr	8	ABCDEF	14.86 ns	0.005 ns	0.61

LastIndexOfAny	main	8	ABCDEF	16.51 ns	0.008 ns	1.00
LastIndexOfAny	pr	8	ABCDEF	12.23 ns	0.006 ns	0.74

LastIndexOfAnyExcept	main	8	ABCDEF	23.42 ns	0.022 ns	1.00
LastIndexOfAnyExcept	pr	8	ABCDEF	14.93 ns	0.013 ns	0.64

IndexOfAny	main	8	AlphaNumeric	67.53 ns	0.075 ns	1.00
IndexOfAny	pr	8	AlphaNumeric	34.92 ns	0.045 ns	0.52

IndexOfAnyExcept	main	8	AlphaNumeric	27.99 ns	0.025 ns	1.00
IndexOfAnyExcept	pr	8	AlphaNumeric	31.09 ns	0.036 ns	1.11

LastIndexOfAny	main	8	AlphaNumeric	79.39 ns	0.065 ns	1.00
LastIndexOfAny	pr	8	AlphaNumeric	34.32 ns	0.028 ns	0.43

LastIndexOfAnyExcept	main	8	AlphaNumeric	27.67 ns	0.097 ns	1.00
LastIndexOfAnyExcept	pr	8	AlphaNumeric	25.80 ns	0.014 ns	0.93


IndexOfAny	main	16	ABCDEF	20.15 ns	0.060 ns	1.00
IndexOfAny	pr	16	ABCDEF	12.25 ns	0.007 ns	0.61

IndexOfAnyExcept	main	16	ABCDEF	43.59 ns	0.043 ns	1.00
IndexOfAnyExcept	pr	16	ABCDEF	14.96 ns	0.011 ns	0.34

LastIndexOfAny	main	16	ABCDEF	21.76 ns	0.012 ns	1.00
LastIndexOfAny	pr	16	ABCDEF	12.23 ns	0.006 ns	0.56

LastIndexOfAnyExcept	main	16	ABCDEF	41.38 ns	0.033 ns	1.00
LastIndexOfAnyExcept	pr	16	ABCDEF	14.92 ns	0.010 ns	0.36

IndexOfAny	main	16	AlphaNumeric	73.53 ns	0.060 ns	1.00
IndexOfAny	pr	16	AlphaNumeric	67.08 ns	0.084 ns	0.91

IndexOfAnyExcept	main	16	AlphaNumeric	49.79 ns	0.032 ns	1.00
IndexOfAnyExcept	pr	16	AlphaNumeric	54.95 ns	0.051 ns	1.10

LastIndexOfAny	main	16	AlphaNumeric	84.76 ns	0.061 ns	1.00
LastIndexOfAny	pr	16	AlphaNumeric	66.56 ns	0.060 ns	0.79

LastIndexOfAnyExcept	main	16	AlphaNumeric	50.64 ns	0.162 ns	1.00
LastIndexOfAnyExcept	pr	16	AlphaNumeric	45.80 ns	0.062 ns	0.91


IndexOfAny	main	32	ABCDEF	29.04 ns	0.034 ns	1.00
IndexOfAny	pr	32	ABCDEF	13.37 ns	0.009 ns	0.46

IndexOfAnyExcept	main	32	ABCDEF	83.02 ns	0.093 ns	1.00
IndexOfAnyExcept	pr	32	ABCDEF	15.98 ns	0.007 ns	0.19

LastIndexOfAny	main	32	ABCDEF	31.55 ns	0.022 ns	1.00
LastIndexOfAny	pr	32	ABCDEF	13.92 ns	0.050 ns	0.44

LastIndexOfAnyExcept	main	32	ABCDEF	82.09 ns	0.090 ns	1.00
LastIndexOfAnyExcept	pr	32	ABCDEF	15.77 ns	0.007 ns	0.19

IndexOfAny	main	32	AlphaNumeric	83.50 ns	0.061 ns	1.00
IndexOfAny	pr	32	AlphaNumeric	65.77 ns	0.252 ns	0.79

IndexOfAnyExcept	main	32	AlphaNumeric	106.94 ns	0.054 ns	1.00
IndexOfAnyExcept	pr	32	AlphaNumeric	66.03 ns	0.099 ns	0.62

LastIndexOfAny	main	32	AlphaNumeric	94.24 ns	0.052 ns	1.00
LastIndexOfAny	pr	32	AlphaNumeric	64.81 ns	0.073 ns	0.69

LastIndexOfAnyExcept	main	32	AlphaNumeric	107.73 ns	0.249 ns	1.00
LastIndexOfAnyExcept	pr	32	AlphaNumeric	65.58 ns	0.034 ns	0.61


IndexOfAny	main	64	ABCDEF	51.95 ns	0.321 ns	1.00
IndexOfAny	pr	64	ABCDEF	15.31 ns	0.009 ns	0.30

IndexOfAnyExcept	main	64	ABCDEF	165.25 ns	0.166 ns	1.00
IndexOfAnyExcept	pr	64	ABCDEF	17.54 ns	0.026 ns	0.11

LastIndexOfAny	main	64	ABCDEF	52.16 ns	0.039 ns	1.00
LastIndexOfAny	pr	64	ABCDEF	15.83 ns	0.020 ns	0.30

LastIndexOfAnyExcept	main	64	ABCDEF	164.63 ns	0.167 ns	1.00
LastIndexOfAnyExcept	pr	64	ABCDEF	17.01 ns	0.008 ns	0.10

IndexOfAny	main	64	AlphaNumeric	105.16 ns	0.079 ns	1.00
IndexOfAny	pr	64	AlphaNumeric	67.26 ns	0.093 ns	0.64

IndexOfAnyExcept	main	64	AlphaNumeric	208.43 ns	0.070 ns	1.00
IndexOfAnyExcept	pr	64	AlphaNumeric	68.18 ns	0.107 ns	0.33

LastIndexOfAny	main	64	AlphaNumeric	115.14 ns	0.057 ns	1.00
LastIndexOfAny	pr	64	AlphaNumeric	66.35 ns	0.040 ns	0.58

LastIndexOfAnyExcept	main	64	AlphaNumeric	212.70 ns	0.583 ns	1.00
LastIndexOfAnyExcept	pr	64	AlphaNumeric	67.31 ns	0.059 ns	0.32


IndexOfAny	main	128	ABCDEF	90.31 ns	0.083 ns	1.00
IndexOfAny	pr	128	ABCDEF	18.74 ns	0.008 ns	0.21

IndexOfAnyExcept	main	128	ABCDEF	336.07 ns	0.234 ns	1.00
IndexOfAnyExcept	pr	128	ABCDEF	20.78 ns	0.015 ns	0.06

LastIndexOfAny	main	128	ABCDEF	100.96 ns	0.050 ns	1.00
LastIndexOfAny	pr	128	ABCDEF	18.80 ns	0.024 ns	0.19

LastIndexOfAnyExcept	main	128	ABCDEF	337.61 ns	0.233 ns	1.00
LastIndexOfAnyExcept	pr	128	ABCDEF	21.00 ns	0.040 ns	0.06

IndexOfAny	main	128	AlphaNumeric	148.35 ns	0.231 ns	1.00
IndexOfAny	pr	128	AlphaNumeric	70.20 ns	0.121 ns	0.47

IndexOfAnyExcept	main	128	AlphaNumeric	425.60 ns	0.178 ns	1.00
IndexOfAnyExcept	pr	128	AlphaNumeric	71.25 ns	0.041 ns	0.17

LastIndexOfAny	main	128	AlphaNumeric	165.64 ns	0.085 ns	1.00
LastIndexOfAny	pr	128	AlphaNumeric	70.08 ns	0.118 ns	0.42

LastIndexOfAnyExcept	main	128	AlphaNumeric	438.12 ns	1.407 ns	1.00
LastIndexOfAnyExcept	pr	128	AlphaNumeric	72.16 ns	0.126 ns	0.16

IndexOfAny	main	256	ABCDEF	182.60 ns	0.382 ns	1.00
IndexOfAny	pr	256	ABCDEF	29.07 ns	0.026 ns	0.16

IndexOfAnyExcept	main	256	ABCDEF	703.61 ns	0.468 ns	1.00
IndexOfAnyExcept	pr	256	ABCDEF	32.34 ns	0.019 ns	0.05

LastIndexOfAny	main	256	ABCDEF	196.47 ns	1.878 ns	1.00
LastIndexOfAny	pr	256	ABCDEF	29.52 ns	0.063 ns	0.15

LastIndexOfAnyExcept	main	256	ABCDEF	704.98 ns	0.586 ns	1.00
LastIndexOfAnyExcept	pr	256	ABCDEF	31.83 ns	0.019 ns	0.05

IndexOfAny	main	256	AlphaNumeric	236.97 ns	0.128 ns	1.00
IndexOfAny	pr	256	AlphaNumeric	81.93 ns	0.287 ns	0.35

IndexOfAnyExcept	main	256	AlphaNumeric	878.10 ns	0.445 ns	1.00
IndexOfAnyExcept	pr	256	AlphaNumeric	82.96 ns	0.094 ns	0.09

LastIndexOfAny	main	256	AlphaNumeric	247.84 ns	0.110 ns	1.00
LastIndexOfAny	pr	256	AlphaNumeric	80.38 ns	0.055 ns	0.32

LastIndexOfAnyExcept	main	256	AlphaNumeric	871.79 ns	0.353 ns	1.00
LastIndexOfAnyExcept	pr	256	AlphaNumeric	83.21 ns	0.096 ns	0.10


IndexOfAny	main	10000	ABCDEF	6,409.6 ns	4.42 ns	1.00
IndexOfAny	pr	10000	ABCDEF	651.2 ns	0.36 ns	0.10

IndexOfAnyExcept	main	10000	ABCDEF	71,960.2 ns	31.30 ns	1.00
IndexOfAnyExcept	pr	10000	ABCDEF	741.1 ns	0.36 ns	0.01

LastIndexOfAny	main	10000	ABCDEF	6,396.7 ns	2.27 ns	1.00
LastIndexOfAny	pr	10000	ABCDEF	643.2 ns	0.36 ns	0.10

LastIndexOfAnyExcept	main	10000	ABCDEF	81,927.2 ns	23.11 ns	1.000
LastIndexOfAnyExcept	pr	10000	ABCDEF	728.1 ns	0.26 ns	0.009

IndexOfAny	main	10000	AlphaNumeric	6,450.1 ns	2.54 ns	1.00
IndexOfAny	pr	10000	AlphaNumeric	704.2 ns	0.41 ns	0.11

IndexOfAnyExcept	main	10000	AlphaNumeric	114,854.9 ns	91.93 ns	1.000
IndexOfAnyExcept	pr	10000	AlphaNumeric	791.0 ns	0.36 ns	0.007

LastIndexOfAny	main	10000	AlphaNumeric	6,613.7 ns	21.32 ns	1.00
LastIndexOfAny	pr	10000	AlphaNumeric	690.0 ns	0.08 ns	0.10

LastIndexOfAnyExcept	main	10000	AlphaNumeric	110,475.7 ns	74.60 ns	1.000
LastIndexOfAnyExcept	pr	10000	AlphaNumeric	774.9 ns	0.04 ns	0.007

On X86 we have to do a bit more work if the needle contains a zero:

Method	Length	Needle	Mean	Error
IndexOfAny	10000	ABCDEF	536.8 ns	0.45 ns
IndexOfAny	10000	NeedleWithZero	820.7 ns	0.74 ns

Approximate ARM64 numbers:

Method	Length	Needle	Mean	Error
IndexOfAny	10000	ABCDEF	1.783 µs	0.0051 µs
CurrentIndexOfAny	10000	ABCDEF	10.355 µs	0.0010 µs

ghost · 2022-10-07T09:27:58Z


Issue Details

Fixes #68328 (and contributes the workhorse implementation for an eventual dedicated API)
cc: @gfoidl

This PR adds a vectorized path for IndexOfAny-like methods if the input is at least 8 characters long and the needle is only ASCII.
It is used if Ssse3 or AdvSimd.Arm64 are supported.

Similarly to the IndexOfAnyProbabilistic code paths, the cost of computing the bitmap used in the vectorized path is quite high compared to the searching cost (+1 for a dedicated API from #68328).
The (Last)IndexOfAnyExcept methods that were previously using an O(n * m) loop regress with this change if the needle is long, while the haystack is small or matches occur close to the start. We could do a few length checks to mitigate that, but this is also already the case with existing (Last)IndexOfAny methods.

For inputs with long runs of no matches, the vectorized path has 10-30x the throughput.

X86 numbers

These numbers were collected on a Windows Azure VM running on an Intel Xeon 8370C processor.

Method	Toolchain	Length	Needle	Mean	Error	Ratio
IndexOfAny	main	1	ABCDEF	10.018 ns	0.0316 ns	1.00
IndexOfAny	pr	1	ABCDEF	11.134 ns	0.0114 ns	1.11

IndexOfAnyExcept	main	1	ABCDEF	4.768 ns	0.0057 ns	1.00
IndexOfAnyExcept	pr	1	ABCDEF	5.380 ns	0.0053 ns	1.13

LastIndexOfAny	main	1	ABCDEF	11.007 ns	0.0322 ns	1.00
LastIndexOfAny	pr	1	ABCDEF	11.288 ns	0.0097 ns	1.03

LastIndexOfAnyExcept	main	1	ABCDEF	4.911 ns	0.0045 ns	1.00
LastIndexOfAnyExcept	pr	1	ABCDEF	5.260 ns	0.0076 ns	1.07

IndexOfAny	main	1	AlphaNumeric	63.540 ns	0.0710 ns	1.00
IndexOfAny	pr	1	AlphaNumeric	86.789 ns	0.2262 ns	1.37

IndexOfAnyExcept	main	1	AlphaNumeric	5.407 ns	0.0617 ns	1.00
IndexOfAnyExcept	pr	1	AlphaNumeric	5.578 ns	0.0013 ns	1.05

LastIndexOfAny	main	1	AlphaNumeric	67.869 ns	0.3421 ns	1.00
LastIndexOfAny	pr	1	AlphaNumeric	89.007 ns	0.1024 ns	1.32

LastIndexOfAnyExcept	main	1	AlphaNumeric	5.485 ns	0.0064 ns	1.00
LastIndexOfAnyExcept	pr	1	AlphaNumeric	5.898 ns	0.0021 ns	1.08


IndexOfAny	main	8	ABCDEF	14.55 ns	0.063 ns	1.00
IndexOfAny	pr	8	ABCDEF	12.86 ns	0.009 ns	0.89

IndexOfAnyExcept	main	8	ABCDEF	18.43 ns	0.018 ns	1.00
IndexOfAnyExcept	pr	8	ABCDEF	14.61 ns	0.012 ns	0.79

LastIndexOfAny	main	8	ABCDEF	16.69 ns	0.023 ns	1.00
LastIndexOfAny	pr	8	ABCDEF	12.89 ns	0.016 ns	0.77

LastIndexOfAnyExcept	main	8	ABCDEF	19.02 ns	0.062 ns	1.00
LastIndexOfAnyExcept	pr	8	ABCDEF	14.73 ns	0.011 ns	0.78

IndexOfAny	main	8	AlphaNumeric	67.92 ns	0.043 ns	1.00
IndexOfAny	pr	8	AlphaNumeric	67.60 ns	0.075 ns	1.00

IndexOfAnyExcept	main	8	AlphaNumeric	20.91 ns	0.022 ns	1.00
IndexOfAnyExcept	pr	8	AlphaNumeric	65.43 ns	0.048 ns	3.13

LastIndexOfAny	main	8	AlphaNumeric	79.32 ns	0.082 ns	1.00
LastIndexOfAny	pr	8	AlphaNumeric	70.53 ns	0.137 ns	0.89

LastIndexOfAnyExcept	main	8	AlphaNumeric	21.74 ns	0.015 ns	1.00
LastIndexOfAnyExcept	pr	8	AlphaNumeric	71.20 ns	0.023 ns	3.28


IndexOfAny	main	16	ABCDEF	19.80 ns	0.022 ns	1.00
IndexOfAny	pr	16	ABCDEF	12.79 ns	0.007 ns	0.65

IndexOfAnyExcept	main	16	ABCDEF	34.45 ns	0.031 ns	1.00
IndexOfAnyExcept	pr	16	ABCDEF	14.72 ns	0.021 ns	0.43

LastIndexOfAny	main	16	ABCDEF	21.83 ns	0.017 ns	1.00
LastIndexOfAny	pr	16	ABCDEF	12.99 ns	0.015 ns	0.60

LastIndexOfAnyExcept	main	16	ABCDEF	35.17 ns	0.100 ns	1.00
LastIndexOfAnyExcept	pr	16	ABCDEF	15.01 ns	0.032 ns	0.43

IndexOfAny	main	16	AlphaNumeric	73.40 ns	0.056 ns	1.00
IndexOfAny	pr	16	AlphaNumeric	67.11 ns	0.058 ns	0.91

IndexOfAnyExcept	main	16	AlphaNumeric	39.20 ns	0.041 ns	1.00
IndexOfAnyExcept	pr	16	AlphaNumeric	65.49 ns	0.054 ns	1.67

LastIndexOfAny	main	16	AlphaNumeric	85.42 ns	0.074 ns	1.00
LastIndexOfAny	pr	16	AlphaNumeric	71.06 ns	0.239 ns	0.83

LastIndexOfAnyExcept	main	16	AlphaNumeric	39.99 ns	0.054 ns	1.00
LastIndexOfAnyExcept	pr	16	AlphaNumeric	70.94 ns	0.042 ns	1.77


IndexOfAny	main	32	ABCDEF	29.25 ns	0.040 ns	1.00
IndexOfAny	pr	32	ABCDEF	14.21 ns	0.020 ns	0.49

IndexOfAnyExcept	main	32	ABCDEF	66.76 ns	0.054 ns	1.00
IndexOfAnyExcept	pr	32	ABCDEF	16.14 ns	0.035 ns	0.24

LastIndexOfAny	main	32	ABCDEF	31.49 ns	0.049 ns	1.00
LastIndexOfAny	pr	32	ABCDEF	14.28 ns	0.019 ns	0.45

LastIndexOfAnyExcept	main	32	ABCDEF	68.10 ns	0.255 ns	1.00
LastIndexOfAnyExcept	pr	32	ABCDEF	16.29 ns	0.013 ns	0.24

IndexOfAny	main	32	AlphaNumeric	83.77 ns	0.061 ns	1.00
IndexOfAny	pr	32	AlphaNumeric	67.79 ns	0.035 ns	0.81

IndexOfAnyExcept	main	32	AlphaNumeric	76.04 ns	0.062 ns	1.00
IndexOfAnyExcept	pr	32	AlphaNumeric	67.08 ns	0.063 ns	0.88

LastIndexOfAny	main	32	AlphaNumeric	95.07 ns	0.082 ns	1.00
LastIndexOfAny	pr	32	AlphaNumeric	68.43 ns	0.065 ns	0.72

LastIndexOfAnyExcept	main	32	AlphaNumeric	77.10 ns	0.089 ns	1.00
LastIndexOfAnyExcept	pr	32	AlphaNumeric	75.36 ns	0.223 ns	0.98


IndexOfAny	main	64	ABCDEF	49.95 ns	0.065 ns	1.00
IndexOfAny	pr	64	ABCDEF	15.56 ns	0.008 ns	0.31

IndexOfAnyExcept	main	64	ABCDEF	141.76 ns	0.164 ns	1.00
IndexOfAnyExcept	pr	64	ABCDEF	17.25 ns	0.012 ns	0.12

LastIndexOfAny	main	64	ABCDEF	52.32 ns	0.054 ns	1.00
LastIndexOfAny	pr	64	ABCDEF	16.22 ns	0.015 ns	0.31

LastIndexOfAnyExcept	main	64	ABCDEF	140.36 ns	0.092 ns	1.00
LastIndexOfAnyExcept	pr	64	ABCDEF	18.84 ns	0.107 ns	0.13

IndexOfAny	main	64	AlphaNumeric	104.80 ns	0.069 ns	1.00
IndexOfAny	pr	64	AlphaNumeric	70.33 ns	0.062 ns	0.67

IndexOfAnyExcept	main	64	AlphaNumeric	159.90 ns	0.214 ns	1.00
IndexOfAnyExcept	pr	64	AlphaNumeric	100.41 ns	4.322 ns	0.63

LastIndexOfAny	main	64	AlphaNumeric	115.49 ns	0.065 ns	1.00
LastIndexOfAny	pr	64	AlphaNumeric	70.45 ns	0.026 ns	0.61

LastIndexOfAnyExcept	main	64	AlphaNumeric	159.35 ns	0.152 ns	1.00
LastIndexOfAnyExcept	pr	64	AlphaNumeric	74.42 ns	0.015 ns	0.47


IndexOfAny	main	128	ABCDEF	90.87 ns	0.078 ns	1.00
IndexOfAny	pr	128	ABCDEF	19.31 ns	0.019 ns	0.21

IndexOfAnyExcept	main	128	ABCDEF	271.81 ns	0.272 ns	1.00
IndexOfAnyExcept	pr	128	ABCDEF	20.93 ns	0.011 ns	0.08

LastIndexOfAny	main	128	ABCDEF	101.48 ns	0.053 ns	1.00
LastIndexOfAny	pr	128	ABCDEF	19.41 ns	0.012 ns	0.19

LastIndexOfAnyExcept	main	128	ABCDEF	270.32 ns	0.180 ns	1.00
LastIndexOfAnyExcept	pr	128	ABCDEF	21.64 ns	0.049 ns	0.08

IndexOfAny	main	128	AlphaNumeric	146.21 ns	0.100 ns	1.00
IndexOfAny	pr	128	AlphaNumeric	73.25 ns	0.049 ns	0.50

IndexOfAnyExcept	main	128	AlphaNumeric	309.32 ns	0.400 ns	1.00
IndexOfAnyExcept	pr	128	AlphaNumeric	72.91 ns	0.052 ns	0.24

LastIndexOfAny	main	128	AlphaNumeric	170.34 ns	0.509 ns	1.00
LastIndexOfAny	pr	128	AlphaNumeric	74.98 ns	0.194 ns	0.44

LastIndexOfAnyExcept	main	128	AlphaNumeric	308.21 ns	0.256 ns	1.00
LastIndexOfAnyExcept	pr	128	AlphaNumeric	78.56 ns	0.019 ns	0.25


IndexOfAny	main	256	ABCDEF	181.28 ns	0.119 ns	1.00
IndexOfAny	pr	256	ABCDEF	29.36 ns	0.030 ns	0.16

IndexOfAnyExcept	main	256	ABCDEF	530.41 ns	0.329 ns	1.00
IndexOfAnyExcept	pr	256	ABCDEF	32.47 ns	0.074 ns	0.06

LastIndexOfAny	main	256	ABCDEF	198.31 ns	1.919 ns	1.00
LastIndexOfAny	pr	256	ABCDEF	29.15 ns	0.023 ns	0.15

LastIndexOfAnyExcept	main	256	ABCDEF	535.68 ns	1.678 ns	1.00
LastIndexOfAnyExcept	pr	256	ABCDEF	32.78 ns	0.079 ns	0.06

IndexOfAny	main	256	AlphaNumeric	236.67 ns	0.174 ns	1.00
IndexOfAny	pr	256	AlphaNumeric	83.34 ns	0.048 ns	0.35

IndexOfAnyExcept	main	256	AlphaNumeric	604.52 ns	0.440 ns	1.00
IndexOfAnyExcept	pr	256	AlphaNumeric	84.98 ns	0.088 ns	0.14

LastIndexOfAny	main	256	AlphaNumeric	248.39 ns	0.126 ns	1.00
LastIndexOfAny	pr	256	AlphaNumeric	84.54 ns	0.055 ns	0.34

LastIndexOfAnyExcept	main	256	AlphaNumeric	606.97 ns	0.625 ns	1.00
LastIndexOfAnyExcept	pr	256	AlphaNumeric	90.93 ns	0.164 ns	0.15


IndexOfAny	main	10000	ABCDEF	8,030.2 ns	70.01 ns	1.00
IndexOfAny	pr	10000	ABCDEF	653.0 ns	0.38 ns	0.08

IndexOfAnyExcept	main	10000	ABCDEF	20,356.5 ns	17.62 ns	1.00
IndexOfAnyExcept	pr	10000	ABCDEF	726.7 ns	0.37 ns	0.04

LastIndexOfAny	main	10000	ABCDEF	6,463.5 ns	14.40 ns	1.00
LastIndexOfAny	pr	10000	ABCDEF	645.3 ns	0.33 ns	0.10

LastIndexOfAnyExcept	main	10000	ABCDEF	20,271.5 ns	11.62 ns	1.00
LastIndexOfAnyExcept	pr	10000	ABCDEF	730.8 ns	0.34 ns	0.04

IndexOfAny	main	10000	AlphaNumeric	7,106.4 ns	23.74 ns	1.00
IndexOfAny	pr	10000	AlphaNumeric	712.5 ns	1.73 ns	0.10

IndexOfAnyExcept	main	10000	AlphaNumeric	23,274.9 ns	22.32 ns	1.00
IndexOfAnyExcept	pr	10000	AlphaNumeric	776.4 ns	0.30 ns	0.03

LastIndexOfAny	main	10000	AlphaNumeric	6,495.1 ns	5.23 ns	1.00
LastIndexOfAny	pr	10000	AlphaNumeric	694.5 ns	0.03 ns	0.11

LastIndexOfAnyExcept	main	10000	AlphaNumeric	23,190.5 ns	12.96 ns	1.00
LastIndexOfAnyExcept	pr	10000	AlphaNumeric	783.1 ns	0.11 ns	0.03

On X86 we have to do a bit more work if the needle contains a zero:

Method	Length	Needle	Mean	Error	StdDev
IndexOfAny	10000	ABCDEF	536.8 ns	0.45 ns	2.22 ns
IndexOfAny	10000	NeedleWithZero	820.7 ns	0.74 ns	3.69 ns

Approximate ARM64 numbers:

Method	Length	Needle	Mean	Error
IndexOfAny	10000	ABCDEF	1.783 µs	0.0051 µs
CurrentIndexOfAny	10000	ABCDEF	10.355 µs	0.0010 µs

Author:	MihaZupan
Assignees:	-
Labels:	`area-System.Memory`
Milestone:	8.0.0

stephentoub · 2022-10-07T12:47:21Z

Excellent, thanks for working on this.

The (Last)IndexOfAnyExcept methods that were previously using an O(n * m) loop regress with this change if the needle is long, while the haystack is small

Have we investigated doing something different for short haystacks? e.g. something super simple like:

if (haystack.Length < Vector128<short>.Count)
{
    for (int i = 0; i < haystack.Length; i++)
        if (needle.Contains(haystack[i]))
            return i;
    return -1;
}
...

?

MihaZupan · 2022-10-07T15:45:34Z

That is what we do for the -Except overloads with this PR.

haystack.Length >= 8 ? Vectorized() : SimpleLoop()

whereas (Last)IndexOfAny does

haystack.Length >= 8 ? Vectorized() : ProbabilisticMap()

In places where we were already using the ProbabilisticMap, the init overhead seems to be very similar.

It would likely be beneficial to tweak the exact cutoff and add the SimpleLoop variant where we currently only use the ProbabilisticMap.
E.g. (made up constants)

int IndexOfAny()
{
    if (haystack.Length < Max(8, needle.Length / 2))
        return SimpleLoop();

    if (IsAscii(needle))
        return Vectorized();

    return ProbabilisticMap();
}

int IndexOfAnyExcept()
{
    if (haystack.Length >= Max(8, needle.Length / 2) && IsAscii(needle))
        return Vectorized();

    return SimpleLoop();
}

I can look into what sort of numbers we'd see with something like that, though I would hope that in general, a dedicated API to hide the init cost completely would be the preferred approach.

stephentoub · 2022-10-07T15:51:07Z

It would likely be beneficial to tweak the exact cutoff and add the SimpleLoop variant where we currently only use the ProbabilisticMap.

Right, this is the main thing I was asking about. The probabilistic map path today needs to loop through each char individually, and do additional work for each. My gut would be that, other than for obscenely long needles, you could have a reasonably-sized haystack and still win with the simple loop doing a vectorized contains on the needles.

stephentoub · 2022-10-07T15:53:21Z

though I would hope that in general, a dedicated API to hide the init cost completely would be the preferred approach

Certainly for cases where you're going to be invoking something repeatedly and have the foresight to create and cache the preprocessed vector information.

gfoidl

👍🏻
Left some notes.

I expect these APIs should work for byte-inputs too, so with another static abstract interface (probably my favorite C# 11 feature) this can be done without duplication. I'm re-writing the base64 code in my repo using Char- and Byte-Operations for this* -- i.e. reading two char/short-vectors and combining them into one byte vector for further processing. Something similar could be done here too.

* it's still WIP, so pack signed vs. unsigned isn't done there

src/libraries/System.Private.CoreLib/src/System/IndexOfAnyAsciiSearcher.cs

MihaZupan · 2022-10-09T06:05:29Z

I added a fast path for short haystacks to the probabilistic code path, leading to nice improvements there (I updated the numbers in the top post). E.g.

Method	Toolchain	Length	Needle	Mean	Error	Ratio
LastIndexOfAny	main	1	AlphaNumeric	74.164 ns	0.0712 ns	1.00
LastIndexOfAny	pr	1	AlphaNumeric	5.558 ns	0.0011 ns	0.07

My benchmark for -Except methods was also measuring the best-case performance for the previous simple loop. I updated it to measure the average case (match is found randomly in the needle instead of being the first element). The corresponding potential throughput change is therefore also much higher (145x instead of 30x).

~~The only real regression is the (Last)IndexOfAnyExcept for long needles and short haystacks as mentioned before. I'll keep playing around with cutoffs to try and improve that case too.~~

MihaZupan · 2022-10-09T14:17:30Z

The cutoff for the simple loop is now:

searchSpaceLength < Vector128<short>.Count || (searchSpaceLength < 20 && searchSpaceLength < (valuesLength >> 1))

Updated the benchmarks above, this is now a win pretty much across the board.
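
As a small worked example of that cutoff, the expression can be wrapped in a helper and evaluated for a few lengths. `CutoffSketch` and `ShouldUseSimpleLoop` are hypothetical names; only the expression itself comes from the comment above.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class CutoffSketch
{
    // Same condition as quoted above. Vector128<short>.Count is 8.
    public static bool ShouldUseSimpleLoop(int searchSpaceLength, int valuesLength) =>
        searchSpaceLength < Vector128<short>.Count ||
        (searchSpaceLength < 20 && searchSpaceLength < (valuesLength >> 1));
}
```

With a 6-character needle, a haystack of length 7 takes the simple loop and a haystack of length 8 does not. With a 62-character needle (which is what AlphaNumeric would be if it covers a-z, A-Z, and 0-9; that size is an assumption), a haystack of length 16 still takes the simple loop because 16 < 20 and 16 < 31, while a haystack of length 32 does not.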

MihaZupan · 2022-10-10T14:21:50Z

/azp run runtime

azure-pipelines · 2022-10-10T14:22:31Z

Azure Pipelines successfully started running 1 pipeline(s).

src/libraries/System.Private.CoreLib/src/System/IndexOfAnyAsciiSearcher.cs

src/libraries/System.Private.CoreLib/src/System/MemoryExtensions.cs

src/libraries/System.Private.CoreLib/src/System/IndexOfAnyAsciiSearcher.cs

danmoseley · 2022-10-10T20:37:45Z

Curious to see what this does to regex redux

https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/regexredux-csharpcore-5.html

stephentoub · 2022-10-10T20:39:34Z

Curious to see what this does to regex redux

It won't do anything. We don't use this API from RegexCompiler / source generator (or, rather, we rely on the API immediately delegating to the 4/5-char overloads), nor does regex redux have any sets that would trigger it. We will update RegexCompiler / source generator to use whatever new API we create that plugs into the same implementation this is adding but that lets us precompute the vector/bit set rather than having to do it on each call.
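
The idea of precomputing the set once and reusing it across calls could look roughly like this. The sketch is purely illustrative: `CachedAsciiSet` is a hypothetical type, not the API under discussion, and it uses a scalar table rather than the vector/bit set this PR builds.

```csharp
using System;

// Hypothetical type: pays the needle preprocessing cost once, in the constructor.
public sealed class CachedAsciiSet
{
    private readonly bool[] _lookup = new bool[128];

    public CachedAsciiSet(ReadOnlySpan<char> values)
    {
        foreach (char c in values)
        {
            if (c >= 128)
                throw new ArgumentException("Only ASCII values are supported in this sketch.", nameof(values));
            _lookup[c] = true;
        }
    }

    // Each call only scans the haystack; no per-call work on the needle.
    public int IndexOfAny(ReadOnlySpan<char> haystack)
    {
        bool[] lookup = _lookup;
        for (int i = 0; i < haystack.Length; i++)
        {
            char c = haystack[i];
            if (c < 128 && lookup[c])
                return i;
        }
        return -1;
    }
}
```

A caller would create the set once, for example in a static readonly field, and then call `IndexOfAny` on it for every input.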

MihaZupan · 2022-10-11T22:32:34Z

/azp run runtime

azure-pipelines · 2022-10-11T22:33:09Z

Azure Pipelines successfully started running 1 pipeline(s).

src/libraries/System.Private.CoreLib/src/System/ProbabilisticMap.cs

stephentoub · 2022-11-02T21:06:58Z

src/libraries/System.Memory/tests/Span/IndexOfAny.char.cs

+
+            Span<char> needleSpace = stackalloc char[8];
+            Span<char> haystackSpace = stackalloc char[40];
+            var rng = new Random();


To make this reproducible, we should include a seed here.

If I remember correctly, I removed it because the RNG part represents a significant portion of the test execution time, with the non-explicit-seed ctor using the faster impl AFAIK.

For reproducibility, I made sure to emit the exact inputs that failed as part of the error message. Does that address your concerns here?

Does that address your concerns here?

Only partially. There are two concerns:

1. That a test could fail and you not know why or be able to trigger it again.

2. That two runs of the same test suite might be non-deterministic.

Outputting the exact inputs addresses (1) but not (2). And (2) in general contributes to test flakiness.

If this test is about randomly stressing the implementation, then I don't believe it belongs in the functional test suite. If it's about using pseudo-randomness to try out thousands of inputs without having to manually code them all and be creative enough to ensure a reasonable spread of inputs, then it should be done in a way that runs the same tests every time, which means using a seed.

I changed it to waste less time in Random and use a constant seed.

It was extremely useful when working on the initial implementation to get rid of all the edge-case bugs. Some were so odd that I don't believe we would have come up with test cases for otherwise.

I increased the number of iterations so it now takes about a second of CPU time to run through them, so I moved these to outerloop.
Outerloop already takes ~4 minutes on my machine so I'm gonna assume that's acceptable.
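
A seeded randomized test along the lines discussed in this thread might be sketched as follows. This is not the test from the PR: `SeededFuzzSketch`, `Run`, and `ReferenceIndexOfAny` are hypothetical names, and the character range is an arbitrary choice that mixes ASCII and non-ASCII values. It combines both points from the review: a constant seed so that every run exercises the same inputs, and an error message that includes the exact failing inputs.

```csharp
using System;

public static class SeededFuzzSketch
{
    // Straightforward reference implementation to compare against.
    private static int ReferenceIndexOfAny(ReadOnlySpan<char> haystack, ReadOnlySpan<char> needle)
    {
        for (int i = 0; i < haystack.Length; i++)
        {
            foreach (char n in needle)
            {
                if (haystack[i] == n)
                    return i;
            }
        }
        return -1;
    }

    // Runs the given number of iterations and returns how many were checked.
    public static int Run(int seed, int iterations)
    {
        var rng = new Random(seed); // constant seed: same inputs on every run
        Span<char> needleSpace = stackalloc char[8];
        Span<char> haystackSpace = stackalloc char[40];
        int checkedCount = 0;

        for (int iter = 0; iter < iterations; iter++)
        {
            Span<char> needle = needleSpace.Slice(0, rng.Next(needleSpace.Length + 1));
            Span<char> haystack = haystackSpace.Slice(0, rng.Next(haystackSpace.Length + 1));

            // Arbitrary range [32, 192) so both ASCII and non-ASCII values occur.
            for (int i = 0; i < needle.Length; i++) needle[i] = (char)rng.Next(32, 192);
            for (int i = 0; i < haystack.Length; i++) haystack[i] = (char)rng.Next(32, 192);

            ReadOnlySpan<char> h = haystack;
            ReadOnlySpan<char> n = needle;

            int expected = ReferenceIndexOfAny(h, n);
            int actual = h.IndexOfAny(n);

            if (expected != actual)
            {
                // Emit the exact inputs so a failure can be reproduced directly.
                throw new InvalidOperationException(
                    $"Expected {expected}, got {actual}. Haystack='{new string(h)}', Needle='{new string(n)}', Seed={seed}");
            }

            checkedCount++;
        }

        return checkedCount;
    }
}
```

Calling `SeededFuzzSketch.Run(12345, 1000)` with the same seed exercises the same 1000 input pairs on every run.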

src/libraries/System.Memory/tests/Span/IndexOfAny.char.cs

src/libraries/System.Private.CoreLib/src/System/IndexOfAnyAsciiSearcher.cs

stephentoub · 2022-11-02T21:32:17Z

src/libraries/System.Private.CoreLib/src/System/MemoryExtensions.cs

@@ -1791,8 +1779,9 @@ private static unsafe int IndexOfAnyProbabilistic(ref char searchSpace, int sear
        /// </summary>
        /// <param name="span">The span to search.</param>
        /// <param name="values">The set of values to search for.</param>
-        public static int LastIndexOfAny<T>(this Span<T> span, ReadOnlySpan<T> values) where T : IEquatable<T>?
-            => LastIndexOfAny((ReadOnlySpan<T>)span, values);
+        [MethodImpl(MethodImplOptions.AggressiveInlining)]


This was necessary? Or just added for consistency with something?

This was removed by mistake in #75754 (comment)

I added it back to at least make it consistent with the rest of {Last}IndexOfAny overloads.

stephentoub · 2022-11-02T21:39:37Z

Updated the benchmarks above, this is now a win pretty much across the board.

Meaning with this PR, for long or short inputs, vectorized or not, ASCII or not, everything you've tried is as good or better than it was before? If so, yay!

MihaZupan · 2022-11-02T21:47:02Z

Meaning with this PR, for long or short inputs, vectorized or not, ASCII or not, everything you've tried is as good or better than it was before? If so, yay!

Yes*.

The main scenario I can think of that would regress is if the needle contains non-ASCII, but starts with ASCII (e.g. alphabet + 'ü') as you now have to pay for an extra scan to build the ASCII bitmap that you don't actually need. The throughput would be the same as before, but the init cost would be higher.

For short inputs that still fall back to a simple for loop (short input + large needle), there is a slightly higher per-call overhead to do the extra calls/checks. E.g.

Method	Toolchain	Length	Needle	Mean	Error	Ratio
IndexOfAnyExcept	main	8	AlphaNumeric	27.99 ns	0.025 ns	1.00
IndexOfAnyExcept	pr	8	AlphaNumeric	31.09 ns	0.036 ns	1.11

There can also be slight differences due to a change of "naive for loop" Contains vs span.Contains for the probabilistic path, where we may be slightly slower for short needles, but faster for longer ones.

MihaZupan · 2022-11-03T21:53:08Z

/azp run runtime-libraries-coreclr outerloop

azure-pipelines · 2022-11-03T21:53:18Z

Azure Pipelines successfully started running 1 pipeline(s).

MihaZupan · 2022-11-04T19:15:57Z

Test failures look like #76755, #77726

MihaZupan added the area-System.Memory label Oct 7, 2022

MihaZupan added this to the 8.0.0 milestone Oct 7, 2022

MihaZupan requested a review from stephentoub October 7, 2022 09:27

ghost assigned MihaZupan Oct 7, 2022

gfoidl reviewed Oct 7, 2022


This was referenced Oct 8, 2022

503's dotnet/arcade#10943

Closed

Tracking issue for CI build timeouts #76454

Closed

build-analysis bot mentioned this pull request Oct 10, 2022

Azure DevOps Maintenance in dnceng and dnceng-public organizations dotnet/arcade#11188

Closed