Vectorize {Last}IndexOfAny{Except} for ASCII needles #76740

Merged: 3 commits, Nov 4, 2022

@MihaZupan (Member) commented Oct 7, 2022

Contributes to #68328 (and contributes the workhorse implementation for an eventual dedicated API)
cc: @gfoidl

This PR adds a vectorized path for IndexOfAny-like methods if the input is at least 8 characters long and the needle is only ASCII.
It is used if Ssse3 or AdvSimd.Arm64 is supported.

I also unified all (Last)IndexOfAny(Except) methods to use the same approach of

if (HaystackIsShort) return SimpleSearch();
if (NeedleIsAscii) return Vectorized();
return ProbabilisticMap();

leading to perf improvements for both short and long inputs.
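
As a rough illustration of the three-way dispatch described above, the path selection could be sketched as follows. The names `SearchPath`, `DispatchSketch`, `ChoosePath` and `MinVectorizedLength` are hypothetical and not taken from the PR; the threshold of 8 comes from the description, and the real cutoff was tuned further later in this conversation.

```csharp
using System;

// Hypothetical labels for the three code paths named in the description.
enum SearchPath { Simple, Vectorized, ProbabilisticMap }

static class DispatchSketch
{
    // Assumed threshold: the description says the vectorized path
    // needs an input of at least 8 characters.
    const int MinVectorizedLength = 8;

    public static SearchPath ChoosePath(int haystackLength, ReadOnlySpan<char> needle)
    {
        // Short haystacks: setup cost dominates, so use the simple loop.
        if (haystackLength < MinVectorizedLength)
            return SearchPath.Simple;

        // ASCII-only needles qualify for the vectorized path;
        // everything else falls back to the probabilistic map.
        return IsAscii(needle) ? SearchPath.Vectorized : SearchPath.ProbabilisticMap;
    }

    static bool IsAscii(ReadOnlySpan<char> values)
    {
        foreach (char c in values)
        {
            if (c > 127)
                return false;
        }
        return true;
    }
}
```

The selector only reports which path would run; the actual searches are left out to keep the sketch short.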

For inputs with long runs of no matches, the vectorized path has 10 to 140+ times the throughput on realistic inputs.
As the -Except methods were previously O(n * m), you can of course see arbitrarily large improvements.

X86 numbers

These numbers were collected on a Windows Azure VM running on an Intel Xeon 8370C processor.

| Method | Toolchain | Length | Needle | Mean | Error | Ratio |
|---|---|---|---|---|---|---|
| IndexOfAny | main | 1 | ABCDEF | 9.787 ns | 0.0087 ns | 1.00 |
| IndexOfAny | pr | 1 | ABCDEF | 5.852 ns | 0.0053 ns | 0.60 |
| IndexOfAnyExcept | main | 1 | ABCDEF | 5.313 ns | 0.0077 ns | 1.00 |
| IndexOfAnyExcept | pr | 1 | ABCDEF | 6.676 ns | 0.0104 ns | 1.26 |
| LastIndexOfAny | main | 1 | ABCDEF | 11.376 ns | 0.0065 ns | 1.00 |
| LastIndexOfAny | pr | 1 | ABCDEF | 5.783 ns | 0.0037 ns | 0.51 |
| LastIndexOfAnyExcept | main | 1 | ABCDEF | 5.946 ns | 0.0123 ns | 1.00 |
| LastIndexOfAnyExcept | pr | 1 | ABCDEF | 6.030 ns | 0.0324 ns | 1.01 |
| IndexOfAny | main | 1 | AlphaNumeric | 63.653 ns | 0.0265 ns | 1.00 |
| IndexOfAny | pr | 1 | AlphaNumeric | 5.855 ns | 0.0066 ns | 0.09 |
| IndexOfAnyExcept | main | 1 | AlphaNumeric | 6.371 ns | 0.0128 ns | 1.00 |
| IndexOfAnyExcept | pr | 1 | AlphaNumeric | 7.619 ns | 0.0104 ns | 1.20 |
| LastIndexOfAny | main | 1 | AlphaNumeric | 74.164 ns | 0.0712 ns | 1.00 |
| LastIndexOfAny | pr | 1 | AlphaNumeric | 5.558 ns | 0.0011 ns | 0.07 |
| LastIndexOfAnyExcept | main | 1 | AlphaNumeric | 8.766 ns | 0.2798 ns | 1.00 |
| LastIndexOfAnyExcept | pr | 1 | AlphaNumeric | 6.859 ns | 0.0018 ns | 0.88 |
| IndexOfAny | main | 8 | ABCDEF | 14.12 ns | 0.016 ns | 1.00 |
| IndexOfAny | pr | 8 | ABCDEF | 12.25 ns | 0.006 ns | 0.87 |
| IndexOfAnyExcept | main | 8 | ABCDEF | 24.38 ns | 0.019 ns | 1.00 |
| IndexOfAnyExcept | pr | 8 | ABCDEF | 14.86 ns | 0.005 ns | 0.61 |
| LastIndexOfAny | main | 8 | ABCDEF | 16.51 ns | 0.008 ns | 1.00 |
| LastIndexOfAny | pr | 8 | ABCDEF | 12.23 ns | 0.006 ns | 0.74 |
| LastIndexOfAnyExcept | main | 8 | ABCDEF | 23.42 ns | 0.022 ns | 1.00 |
| LastIndexOfAnyExcept | pr | 8 | ABCDEF | 14.93 ns | 0.013 ns | 0.64 |
| IndexOfAny | main | 8 | AlphaNumeric | 67.53 ns | 0.075 ns | 1.00 |
| IndexOfAny | pr | 8 | AlphaNumeric | 34.92 ns | 0.045 ns | 0.52 |
| IndexOfAnyExcept | main | 8 | AlphaNumeric | 27.99 ns | 0.025 ns | 1.00 |
| IndexOfAnyExcept | pr | 8 | AlphaNumeric | 31.09 ns | 0.036 ns | 1.11 |
| LastIndexOfAny | main | 8 | AlphaNumeric | 79.39 ns | 0.065 ns | 1.00 |
| LastIndexOfAny | pr | 8 | AlphaNumeric | 34.32 ns | 0.028 ns | 0.43 |
| LastIndexOfAnyExcept | main | 8 | AlphaNumeric | 27.67 ns | 0.097 ns | 1.00 |
| LastIndexOfAnyExcept | pr | 8 | AlphaNumeric | 25.80 ns | 0.014 ns | 0.93 |
| IndexOfAny | main | 16 | ABCDEF | 20.15 ns | 0.060 ns | 1.00 |
| IndexOfAny | pr | 16 | ABCDEF | 12.25 ns | 0.007 ns | 0.61 |
| IndexOfAnyExcept | main | 16 | ABCDEF | 43.59 ns | 0.043 ns | 1.00 |
| IndexOfAnyExcept | pr | 16 | ABCDEF | 14.96 ns | 0.011 ns | 0.34 |
| LastIndexOfAny | main | 16 | ABCDEF | 21.76 ns | 0.012 ns | 1.00 |
| LastIndexOfAny | pr | 16 | ABCDEF | 12.23 ns | 0.006 ns | 0.56 |
| LastIndexOfAnyExcept | main | 16 | ABCDEF | 41.38 ns | 0.033 ns | 1.00 |
| LastIndexOfAnyExcept | pr | 16 | ABCDEF | 14.92 ns | 0.010 ns | 0.36 |
| IndexOfAny | main | 16 | AlphaNumeric | 73.53 ns | 0.060 ns | 1.00 |
| IndexOfAny | pr | 16 | AlphaNumeric | 67.08 ns | 0.084 ns | 0.91 |
| IndexOfAnyExcept | main | 16 | AlphaNumeric | 49.79 ns | 0.032 ns | 1.00 |
| IndexOfAnyExcept | pr | 16 | AlphaNumeric | 54.95 ns | 0.051 ns | 1.10 |
| LastIndexOfAny | main | 16 | AlphaNumeric | 84.76 ns | 0.061 ns | 1.00 |
| LastIndexOfAny | pr | 16 | AlphaNumeric | 66.56 ns | 0.060 ns | 0.79 |
| LastIndexOfAnyExcept | main | 16 | AlphaNumeric | 50.64 ns | 0.162 ns | 1.00 |
| LastIndexOfAnyExcept | pr | 16 | AlphaNumeric | 45.80 ns | 0.062 ns | 0.91 |
| IndexOfAny | main | 32 | ABCDEF | 29.04 ns | 0.034 ns | 1.00 |
| IndexOfAny | pr | 32 | ABCDEF | 13.37 ns | 0.009 ns | 0.46 |
| IndexOfAnyExcept | main | 32 | ABCDEF | 83.02 ns | 0.093 ns | 1.00 |
| IndexOfAnyExcept | pr | 32 | ABCDEF | 15.98 ns | 0.007 ns | 0.19 |
| LastIndexOfAny | main | 32 | ABCDEF | 31.55 ns | 0.022 ns | 1.00 |
| LastIndexOfAny | pr | 32 | ABCDEF | 13.92 ns | 0.050 ns | 0.44 |
| LastIndexOfAnyExcept | main | 32 | ABCDEF | 82.09 ns | 0.090 ns | 1.00 |
| LastIndexOfAnyExcept | pr | 32 | ABCDEF | 15.77 ns | 0.007 ns | 0.19 |
| IndexOfAny | main | 32 | AlphaNumeric | 83.50 ns | 0.061 ns | 1.00 |
| IndexOfAny | pr | 32 | AlphaNumeric | 65.77 ns | 0.252 ns | 0.79 |
| IndexOfAnyExcept | main | 32 | AlphaNumeric | 106.94 ns | 0.054 ns | 1.00 |
| IndexOfAnyExcept | pr | 32 | AlphaNumeric | 66.03 ns | 0.099 ns | 0.62 |
| LastIndexOfAny | main | 32 | AlphaNumeric | 94.24 ns | 0.052 ns | 1.00 |
| LastIndexOfAny | pr | 32 | AlphaNumeric | 64.81 ns | 0.073 ns | 0.69 |
| LastIndexOfAnyExcept | main | 32 | AlphaNumeric | 107.73 ns | 0.249 ns | 1.00 |
| LastIndexOfAnyExcept | pr | 32 | AlphaNumeric | 65.58 ns | 0.034 ns | 0.61 |
| IndexOfAny | main | 64 | ABCDEF | 51.95 ns | 0.321 ns | 1.00 |
| IndexOfAny | pr | 64 | ABCDEF | 15.31 ns | 0.009 ns | 0.30 |
| IndexOfAnyExcept | main | 64 | ABCDEF | 165.25 ns | 0.166 ns | 1.00 |
| IndexOfAnyExcept | pr | 64 | ABCDEF | 17.54 ns | 0.026 ns | 0.11 |
| LastIndexOfAny | main | 64 | ABCDEF | 52.16 ns | 0.039 ns | 1.00 |
| LastIndexOfAny | pr | 64 | ABCDEF | 15.83 ns | 0.020 ns | 0.30 |
| LastIndexOfAnyExcept | main | 64 | ABCDEF | 164.63 ns | 0.167 ns | 1.00 |
| LastIndexOfAnyExcept | pr | 64 | ABCDEF | 17.01 ns | 0.008 ns | 0.10 |
| IndexOfAny | main | 64 | AlphaNumeric | 105.16 ns | 0.079 ns | 1.00 |
| IndexOfAny | pr | 64 | AlphaNumeric | 67.26 ns | 0.093 ns | 0.64 |
| IndexOfAnyExcept | main | 64 | AlphaNumeric | 208.43 ns | 0.070 ns | 1.00 |
| IndexOfAnyExcept | pr | 64 | AlphaNumeric | 68.18 ns | 0.107 ns | 0.33 |
| LastIndexOfAny | main | 64 | AlphaNumeric | 115.14 ns | 0.057 ns | 1.00 |
| LastIndexOfAny | pr | 64 | AlphaNumeric | 66.35 ns | 0.040 ns | 0.58 |
| LastIndexOfAnyExcept | main | 64 | AlphaNumeric | 212.70 ns | 0.583 ns | 1.00 |
| LastIndexOfAnyExcept | pr | 64 | AlphaNumeric | 67.31 ns | 0.059 ns | 0.32 |
| IndexOfAny | main | 128 | ABCDEF | 90.31 ns | 0.083 ns | 1.00 |
| IndexOfAny | pr | 128 | ABCDEF | 18.74 ns | 0.008 ns | 0.21 |
| IndexOfAnyExcept | main | 128 | ABCDEF | 336.07 ns | 0.234 ns | 1.00 |
| IndexOfAnyExcept | pr | 128 | ABCDEF | 20.78 ns | 0.015 ns | 0.06 |
| LastIndexOfAny | main | 128 | ABCDEF | 100.96 ns | 0.050 ns | 1.00 |
| LastIndexOfAny | pr | 128 | ABCDEF | 18.80 ns | 0.024 ns | 0.19 |
| LastIndexOfAnyExcept | main | 128 | ABCDEF | 337.61 ns | 0.233 ns | 1.00 |
| LastIndexOfAnyExcept | pr | 128 | ABCDEF | 21.00 ns | 0.040 ns | 0.06 |
| IndexOfAny | main | 128 | AlphaNumeric | 148.35 ns | 0.231 ns | 1.00 |
| IndexOfAny | pr | 128 | AlphaNumeric | 70.20 ns | 0.121 ns | 0.47 |
| IndexOfAnyExcept | main | 128 | AlphaNumeric | 425.60 ns | 0.178 ns | 1.00 |
| IndexOfAnyExcept | pr | 128 | AlphaNumeric | 71.25 ns | 0.041 ns | 0.17 |
| LastIndexOfAny | main | 128 | AlphaNumeric | 165.64 ns | 0.085 ns | 1.00 |
| LastIndexOfAny | pr | 128 | AlphaNumeric | 70.08 ns | 0.118 ns | 0.42 |
| LastIndexOfAnyExcept | main | 128 | AlphaNumeric | 438.12 ns | 1.407 ns | 1.00 |
| LastIndexOfAnyExcept | pr | 128 | AlphaNumeric | 72.16 ns | 0.126 ns | 0.16 |
| IndexOfAny | main | 256 | ABCDEF | 182.60 ns | 0.382 ns | 1.00 |
| IndexOfAny | pr | 256 | ABCDEF | 29.07 ns | 0.026 ns | 0.16 |
| IndexOfAnyExcept | main | 256 | ABCDEF | 703.61 ns | 0.468 ns | 1.00 |
| IndexOfAnyExcept | pr | 256 | ABCDEF | 32.34 ns | 0.019 ns | 0.05 |
| LastIndexOfAny | main | 256 | ABCDEF | 196.47 ns | 1.878 ns | 1.00 |
| LastIndexOfAny | pr | 256 | ABCDEF | 29.52 ns | 0.063 ns | 0.15 |
| LastIndexOfAnyExcept | main | 256 | ABCDEF | 704.98 ns | 0.586 ns | 1.00 |
| LastIndexOfAnyExcept | pr | 256 | ABCDEF | 31.83 ns | 0.019 ns | 0.05 |
| IndexOfAny | main | 256 | AlphaNumeric | 236.97 ns | 0.128 ns | 1.00 |
| IndexOfAny | pr | 256 | AlphaNumeric | 81.93 ns | 0.287 ns | 0.35 |
| IndexOfAnyExcept | main | 256 | AlphaNumeric | 878.10 ns | 0.445 ns | 1.00 |
| IndexOfAnyExcept | pr | 256 | AlphaNumeric | 82.96 ns | 0.094 ns | 0.09 |
| LastIndexOfAny | main | 256 | AlphaNumeric | 247.84 ns | 0.110 ns | 1.00 |
| LastIndexOfAny | pr | 256 | AlphaNumeric | 80.38 ns | 0.055 ns | 0.32 |
| LastIndexOfAnyExcept | main | 256 | AlphaNumeric | 871.79 ns | 0.353 ns | 1.00 |
| LastIndexOfAnyExcept | pr | 256 | AlphaNumeric | 83.21 ns | 0.096 ns | 0.10 |
| IndexOfAny | main | 10000 | ABCDEF | 6,409.6 ns | 4.42 ns | 1.00 |
| IndexOfAny | pr | 10000 | ABCDEF | 651.2 ns | 0.36 ns | 0.10 |
| IndexOfAnyExcept | main | 10000 | ABCDEF | 71,960.2 ns | 31.30 ns | 1.00 |
| IndexOfAnyExcept | pr | 10000 | ABCDEF | 741.1 ns | 0.36 ns | 0.01 |
| LastIndexOfAny | main | 10000 | ABCDEF | 6,396.7 ns | 2.27 ns | 1.00 |
| LastIndexOfAny | pr | 10000 | ABCDEF | 643.2 ns | 0.36 ns | 0.10 |
| LastIndexOfAnyExcept | main | 10000 | ABCDEF | 81,927.2 ns | 23.11 ns | 1.000 |
| LastIndexOfAnyExcept | pr | 10000 | ABCDEF | 728.1 ns | 0.26 ns | 0.009 |
| IndexOfAny | main | 10000 | AlphaNumeric | 6,450.1 ns | 2.54 ns | 1.00 |
| IndexOfAny | pr | 10000 | AlphaNumeric | 704.2 ns | 0.41 ns | 0.11 |
| IndexOfAnyExcept | main | 10000 | AlphaNumeric | 114,854.9 ns | 91.93 ns | 1.000 |
| IndexOfAnyExcept | pr | 10000 | AlphaNumeric | 791.0 ns | 0.36 ns | 0.007 |
| LastIndexOfAny | main | 10000 | AlphaNumeric | 6,613.7 ns | 21.32 ns | 1.00 |
| LastIndexOfAny | pr | 10000 | AlphaNumeric | 690.0 ns | 0.08 ns | 0.10 |
| LastIndexOfAnyExcept | main | 10000 | AlphaNumeric | 110,475.7 ns | 74.60 ns | 1.000 |
| LastIndexOfAnyExcept | pr | 10000 | AlphaNumeric | 774.9 ns | 0.04 ns | 0.007 |

On X86 we have to do a bit more work if the needle contains a zero:

| Method | Length | Needle | Mean | Error |
|---|---|---|---|---|
| IndexOfAny | 10000 | ABCDEF | 536.8 ns | 0.45 ns |
| IndexOfAny | 10000 | NeedleWithZero | 820.7 ns | 0.74 ns |

Approximate ARM64 numbers:

| Method | Length | Needle | Mean | Error |
|---|---|---|---|---|
| IndexOfAny | 10000 | ABCDEF | 1.783 µs | 0.0051 µs |
| CurrentIndexOfAny | 10000 | ABCDEF | 10.355 µs | 0.0010 µs |

@MihaZupan MihaZupan added this to the 8.0.0 milestone Oct 7, 2022
@MihaZupan MihaZupan requested a review from stephentoub October 7, 2022 09:27
@ghost commented Oct 7, 2022

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

Fixes #68328 (and contributes the workhorse implementation for an eventual dedicated API)
cc: @gfoidl

This PR adds a vectorized path for IndexOfAny-like methods if the input is at least 8 characters long and the needle is only ASCII.
It is used if Ssse3 or AdvSimd.Arm64 is supported.

Similarly to the IndexOfAnyProbabilistic code paths, the cost of computing the bitmap used in the vectorized path is quite high compared to the searching cost (+1 for a dedicated API from #68328).
The (Last)IndexOfAnyExcept methods that were previously using an O(n * m) loop regress with this change if the needle is long, while the haystack is small or matches occur close to the start. We could do a few length checks to mitigate that, but this is also already the case with existing (Last)IndexOfAny methods.

For inputs with long runs of no matches, the vectorized path has 10-30x the throughput.

X86 numbers

These numbers were collected on a Windows Azure VM running on an Intel Xeon 8370C processor.

| Method | Toolchain | Length | Needle | Mean | Error | Ratio |
|---|---|---|---|---|---|---|
| IndexOfAny | main | 1 | ABCDEF | 10.018 ns | 0.0316 ns | 1.00 |
| IndexOfAny | pr | 1 | ABCDEF | 11.134 ns | 0.0114 ns | 1.11 |
| IndexOfAnyExcept | main | 1 | ABCDEF | 4.768 ns | 0.0057 ns | 1.00 |
| IndexOfAnyExcept | pr | 1 | ABCDEF | 5.380 ns | 0.0053 ns | 1.13 |
| LastIndexOfAny | main | 1 | ABCDEF | 11.007 ns | 0.0322 ns | 1.00 |
| LastIndexOfAny | pr | 1 | ABCDEF | 11.288 ns | 0.0097 ns | 1.03 |
| LastIndexOfAnyExcept | main | 1 | ABCDEF | 4.911 ns | 0.0045 ns | 1.00 |
| LastIndexOfAnyExcept | pr | 1 | ABCDEF | 5.260 ns | 0.0076 ns | 1.07 |
| IndexOfAny | main | 1 | AlphaNumeric | 63.540 ns | 0.0710 ns | 1.00 |
| IndexOfAny | pr | 1 | AlphaNumeric | 86.789 ns | 0.2262 ns | 1.37 |
| IndexOfAnyExcept | main | 1 | AlphaNumeric | 5.407 ns | 0.0617 ns | 1.00 |
| IndexOfAnyExcept | pr | 1 | AlphaNumeric | 5.578 ns | 0.0013 ns | 1.05 |
| LastIndexOfAny | main | 1 | AlphaNumeric | 67.869 ns | 0.3421 ns | 1.00 |
| LastIndexOfAny | pr | 1 | AlphaNumeric | 89.007 ns | 0.1024 ns | 1.32 |
| LastIndexOfAnyExcept | main | 1 | AlphaNumeric | 5.485 ns | 0.0064 ns | 1.00 |
| LastIndexOfAnyExcept | pr | 1 | AlphaNumeric | 5.898 ns | 0.0021 ns | 1.08 |
| IndexOfAny | main | 8 | ABCDEF | 14.55 ns | 0.063 ns | 1.00 |
| IndexOfAny | pr | 8 | ABCDEF | 12.86 ns | 0.009 ns | 0.89 |
| IndexOfAnyExcept | main | 8 | ABCDEF | 18.43 ns | 0.018 ns | 1.00 |
| IndexOfAnyExcept | pr | 8 | ABCDEF | 14.61 ns | 0.012 ns | 0.79 |
| LastIndexOfAny | main | 8 | ABCDEF | 16.69 ns | 0.023 ns | 1.00 |
| LastIndexOfAny | pr | 8 | ABCDEF | 12.89 ns | 0.016 ns | 0.77 |
| LastIndexOfAnyExcept | main | 8 | ABCDEF | 19.02 ns | 0.062 ns | 1.00 |
| LastIndexOfAnyExcept | pr | 8 | ABCDEF | 14.73 ns | 0.011 ns | 0.78 |
| IndexOfAny | main | 8 | AlphaNumeric | 67.92 ns | 0.043 ns | 1.00 |
| IndexOfAny | pr | 8 | AlphaNumeric | 67.60 ns | 0.075 ns | 1.00 |
| IndexOfAnyExcept | main | 8 | AlphaNumeric | 20.91 ns | 0.022 ns | 1.00 |
| IndexOfAnyExcept | pr | 8 | AlphaNumeric | 65.43 ns | 0.048 ns | 3.13 |
| LastIndexOfAny | main | 8 | AlphaNumeric | 79.32 ns | 0.082 ns | 1.00 |
| LastIndexOfAny | pr | 8 | AlphaNumeric | 70.53 ns | 0.137 ns | 0.89 |
| LastIndexOfAnyExcept | main | 8 | AlphaNumeric | 21.74 ns | 0.015 ns | 1.00 |
| LastIndexOfAnyExcept | pr | 8 | AlphaNumeric | 71.20 ns | 0.023 ns | 3.28 |
| IndexOfAny | main | 16 | ABCDEF | 19.80 ns | 0.022 ns | 1.00 |
| IndexOfAny | pr | 16 | ABCDEF | 12.79 ns | 0.007 ns | 0.65 |
| IndexOfAnyExcept | main | 16 | ABCDEF | 34.45 ns | 0.031 ns | 1.00 |
| IndexOfAnyExcept | pr | 16 | ABCDEF | 14.72 ns | 0.021 ns | 0.43 |
| LastIndexOfAny | main | 16 | ABCDEF | 21.83 ns | 0.017 ns | 1.00 |
| LastIndexOfAny | pr | 16 | ABCDEF | 12.99 ns | 0.015 ns | 0.60 |
| LastIndexOfAnyExcept | main | 16 | ABCDEF | 35.17 ns | 0.100 ns | 1.00 |
| LastIndexOfAnyExcept | pr | 16 | ABCDEF | 15.01 ns | 0.032 ns | 0.43 |
| IndexOfAny | main | 16 | AlphaNumeric | 73.40 ns | 0.056 ns | 1.00 |
| IndexOfAny | pr | 16 | AlphaNumeric | 67.11 ns | 0.058 ns | 0.91 |
| IndexOfAnyExcept | main | 16 | AlphaNumeric | 39.20 ns | 0.041 ns | 1.00 |
| IndexOfAnyExcept | pr | 16 | AlphaNumeric | 65.49 ns | 0.054 ns | 1.67 |
| LastIndexOfAny | main | 16 | AlphaNumeric | 85.42 ns | 0.074 ns | 1.00 |
| LastIndexOfAny | pr | 16 | AlphaNumeric | 71.06 ns | 0.239 ns | 0.83 |
| LastIndexOfAnyExcept | main | 16 | AlphaNumeric | 39.99 ns | 0.054 ns | 1.00 |
| LastIndexOfAnyExcept | pr | 16 | AlphaNumeric | 70.94 ns | 0.042 ns | 1.77 |
| IndexOfAny | main | 32 | ABCDEF | 29.25 ns | 0.040 ns | 1.00 |
| IndexOfAny | pr | 32 | ABCDEF | 14.21 ns | 0.020 ns | 0.49 |
| IndexOfAnyExcept | main | 32 | ABCDEF | 66.76 ns | 0.054 ns | 1.00 |
| IndexOfAnyExcept | pr | 32 | ABCDEF | 16.14 ns | 0.035 ns | 0.24 |
| LastIndexOfAny | main | 32 | ABCDEF | 31.49 ns | 0.049 ns | 1.00 |
| LastIndexOfAny | pr | 32 | ABCDEF | 14.28 ns | 0.019 ns | 0.45 |
| LastIndexOfAnyExcept | main | 32 | ABCDEF | 68.10 ns | 0.255 ns | 1.00 |
| LastIndexOfAnyExcept | pr | 32 | ABCDEF | 16.29 ns | 0.013 ns | 0.24 |
| IndexOfAny | main | 32 | AlphaNumeric | 83.77 ns | 0.061 ns | 1.00 |
| IndexOfAny | pr | 32 | AlphaNumeric | 67.79 ns | 0.035 ns | 0.81 |
| IndexOfAnyExcept | main | 32 | AlphaNumeric | 76.04 ns | 0.062 ns | 1.00 |
| IndexOfAnyExcept | pr | 32 | AlphaNumeric | 67.08 ns | 0.063 ns | 0.88 |
| LastIndexOfAny | main | 32 | AlphaNumeric | 95.07 ns | 0.082 ns | 1.00 |
| LastIndexOfAny | pr | 32 | AlphaNumeric | 68.43 ns | 0.065 ns | 0.72 |
| LastIndexOfAnyExcept | main | 32 | AlphaNumeric | 77.10 ns | 0.089 ns | 1.00 |
| LastIndexOfAnyExcept | pr | 32 | AlphaNumeric | 75.36 ns | 0.223 ns | 0.98 |
| IndexOfAny | main | 64 | ABCDEF | 49.95 ns | 0.065 ns | 1.00 |
| IndexOfAny | pr | 64 | ABCDEF | 15.56 ns | 0.008 ns | 0.31 |
| IndexOfAnyExcept | main | 64 | ABCDEF | 141.76 ns | 0.164 ns | 1.00 |
| IndexOfAnyExcept | pr | 64 | ABCDEF | 17.25 ns | 0.012 ns | 0.12 |
| LastIndexOfAny | main | 64 | ABCDEF | 52.32 ns | 0.054 ns | 1.00 |
| LastIndexOfAny | pr | 64 | ABCDEF | 16.22 ns | 0.015 ns | 0.31 |
| LastIndexOfAnyExcept | main | 64 | ABCDEF | 140.36 ns | 0.092 ns | 1.00 |
| LastIndexOfAnyExcept | pr | 64 | ABCDEF | 18.84 ns | 0.107 ns | 0.13 |
| IndexOfAny | main | 64 | AlphaNumeric | 104.80 ns | 0.069 ns | 1.00 |
| IndexOfAny | pr | 64 | AlphaNumeric | 70.33 ns | 0.062 ns | 0.67 |
| IndexOfAnyExcept | main | 64 | AlphaNumeric | 159.90 ns | 0.214 ns | 1.00 |
| IndexOfAnyExcept | pr | 64 | AlphaNumeric | 100.41 ns | 4.322 ns | 0.63 |
| LastIndexOfAny | main | 64 | AlphaNumeric | 115.49 ns | 0.065 ns | 1.00 |
| LastIndexOfAny | pr | 64 | AlphaNumeric | 70.45 ns | 0.026 ns | 0.61 |
| LastIndexOfAnyExcept | main | 64 | AlphaNumeric | 159.35 ns | 0.152 ns | 1.00 |
| LastIndexOfAnyExcept | pr | 64 | AlphaNumeric | 74.42 ns | 0.015 ns | 0.47 |
| IndexOfAny | main | 128 | ABCDEF | 90.87 ns | 0.078 ns | 1.00 |
| IndexOfAny | pr | 128 | ABCDEF | 19.31 ns | 0.019 ns | 0.21 |
| IndexOfAnyExcept | main | 128 | ABCDEF | 271.81 ns | 0.272 ns | 1.00 |
| IndexOfAnyExcept | pr | 128 | ABCDEF | 20.93 ns | 0.011 ns | 0.08 |
| LastIndexOfAny | main | 128 | ABCDEF | 101.48 ns | 0.053 ns | 1.00 |
| LastIndexOfAny | pr | 128 | ABCDEF | 19.41 ns | 0.012 ns | 0.19 |
| LastIndexOfAnyExcept | main | 128 | ABCDEF | 270.32 ns | 0.180 ns | 1.00 |
| LastIndexOfAnyExcept | pr | 128 | ABCDEF | 21.64 ns | 0.049 ns | 0.08 |
| IndexOfAny | main | 128 | AlphaNumeric | 146.21 ns | 0.100 ns | 1.00 |
| IndexOfAny | pr | 128 | AlphaNumeric | 73.25 ns | 0.049 ns | 0.50 |
| IndexOfAnyExcept | main | 128 | AlphaNumeric | 309.32 ns | 0.400 ns | 1.00 |
| IndexOfAnyExcept | pr | 128 | AlphaNumeric | 72.91 ns | 0.052 ns | 0.24 |
| LastIndexOfAny | main | 128 | AlphaNumeric | 170.34 ns | 0.509 ns | 1.00 |
| LastIndexOfAny | pr | 128 | AlphaNumeric | 74.98 ns | 0.194 ns | 0.44 |
| LastIndexOfAnyExcept | main | 128 | AlphaNumeric | 308.21 ns | 0.256 ns | 1.00 |
| LastIndexOfAnyExcept | pr | 128 | AlphaNumeric | 78.56 ns | 0.019 ns | 0.25 |
| IndexOfAny | main | 256 | ABCDEF | 181.28 ns | 0.119 ns | 1.00 |
| IndexOfAny | pr | 256 | ABCDEF | 29.36 ns | 0.030 ns | 0.16 |
| IndexOfAnyExcept | main | 256 | ABCDEF | 530.41 ns | 0.329 ns | 1.00 |
| IndexOfAnyExcept | pr | 256 | ABCDEF | 32.47 ns | 0.074 ns | 0.06 |
| LastIndexOfAny | main | 256 | ABCDEF | 198.31 ns | 1.919 ns | 1.00 |
| LastIndexOfAny | pr | 256 | ABCDEF | 29.15 ns | 0.023 ns | 0.15 |
| LastIndexOfAnyExcept | main | 256 | ABCDEF | 535.68 ns | 1.678 ns | 1.00 |
| LastIndexOfAnyExcept | pr | 256 | ABCDEF | 32.78 ns | 0.079 ns | 0.06 |
| IndexOfAny | main | 256 | AlphaNumeric | 236.67 ns | 0.174 ns | 1.00 |
| IndexOfAny | pr | 256 | AlphaNumeric | 83.34 ns | 0.048 ns | 0.35 |
| IndexOfAnyExcept | main | 256 | AlphaNumeric | 604.52 ns | 0.440 ns | 1.00 |
| IndexOfAnyExcept | pr | 256 | AlphaNumeric | 84.98 ns | 0.088 ns | 0.14 |
| LastIndexOfAny | main | 256 | AlphaNumeric | 248.39 ns | 0.126 ns | 1.00 |
| LastIndexOfAny | pr | 256 | AlphaNumeric | 84.54 ns | 0.055 ns | 0.34 |
| LastIndexOfAnyExcept | main | 256 | AlphaNumeric | 606.97 ns | 0.625 ns | 1.00 |
| LastIndexOfAnyExcept | pr | 256 | AlphaNumeric | 90.93 ns | 0.164 ns | 0.15 |
| IndexOfAny | main | 10000 | ABCDEF | 8,030.2 ns | 70.01 ns | 1.00 |
| IndexOfAny | pr | 10000 | ABCDEF | 653.0 ns | 0.38 ns | 0.08 |
| IndexOfAnyExcept | main | 10000 | ABCDEF | 20,356.5 ns | 17.62 ns | 1.00 |
| IndexOfAnyExcept | pr | 10000 | ABCDEF | 726.7 ns | 0.37 ns | 0.04 |
| LastIndexOfAny | main | 10000 | ABCDEF | 6,463.5 ns | 14.40 ns | 1.00 |
| LastIndexOfAny | pr | 10000 | ABCDEF | 645.3 ns | 0.33 ns | 0.10 |
| LastIndexOfAnyExcept | main | 10000 | ABCDEF | 20,271.5 ns | 11.62 ns | 1.00 |
| LastIndexOfAnyExcept | pr | 10000 | ABCDEF | 730.8 ns | 0.34 ns | 0.04 |
| IndexOfAny | main | 10000 | AlphaNumeric | 7,106.4 ns | 23.74 ns | 1.00 |
| IndexOfAny | pr | 10000 | AlphaNumeric | 712.5 ns | 1.73 ns | 0.10 |
| IndexOfAnyExcept | main | 10000 | AlphaNumeric | 23,274.9 ns | 22.32 ns | 1.00 |
| IndexOfAnyExcept | pr | 10000 | AlphaNumeric | 776.4 ns | 0.30 ns | 0.03 |
| LastIndexOfAny | main | 10000 | AlphaNumeric | 6,495.1 ns | 5.23 ns | 1.00 |
| LastIndexOfAny | pr | 10000 | AlphaNumeric | 694.5 ns | 0.03 ns | 0.11 |
| LastIndexOfAnyExcept | main | 10000 | AlphaNumeric | 23,190.5 ns | 12.96 ns | 1.00 |
| LastIndexOfAnyExcept | pr | 10000 | AlphaNumeric | 783.1 ns | 0.11 ns | 0.03 |

On X86 we have to do a bit more work if the needle contains a zero:

| Method | Length | Needle | Mean | Error | StdDev |
|---|---|---|---|---|---|
| IndexOfAny | 10000 | ABCDEF | 536.8 ns | 0.45 ns | 2.22 ns |
| IndexOfAny | 10000 | NeedleWithZero | 820.7 ns | 0.74 ns | 3.69 ns |

Approximate ARM64 numbers:

| Method | Length | Needle | Mean | Error |
|---|---|---|---|---|
| IndexOfAny | 10000 | ABCDEF | 1.783 µs | 0.0051 µs |
| CurrentIndexOfAny | 10000 | ABCDEF | 10.355 µs | 0.0010 µs |

Author: MihaZupan
Assignees: -
Labels: area-System.Memory
Milestone: 8.0.0

@ghost ghost assigned MihaZupan Oct 7, 2022
@stephentoub (Member) commented Oct 7, 2022

Excellent, thanks for working on this.

> The (Last)IndexOfAnyExcept methods that were previously using an O(n * m) loop regress with this change if the needle is long, while the haystack is small

Have we investigated doing something different for short haystacks? e.g. something super simple like:

if (haystack.Length < Vector128<short>.Count)
{
    for (int i = 0; i < haystack.Length; i++)
        if (needle.Contains(haystack[i]))
            return i;
    return -1;
}
...

?
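
A runnable version of that short-haystack idea might look like the sketch below. It is illustrative only: `ShortHaystackSketch` and its method names are made up here, and the `-Except` variant simply inverts the membership test.

```csharp
using System;
using System.Runtime.Intrinsics;

static class ShortHaystackSketch
{
    // True when the haystack has fewer elements than one 128-bit vector
    // of 16-bit values holds (8), the condition used in the comment above.
    public static bool IsShort(ReadOnlySpan<char> haystack) =>
        haystack.Length < Vector128<short>.Count;

    // First index whose character is in the needle, or -1.
    public static int IndexOfAnySimple(ReadOnlySpan<char> haystack, ReadOnlySpan<char> needle)
    {
        for (int i = 0; i < haystack.Length; i++)
        {
            if (needle.Contains(haystack[i]))
                return i;
        }
        return -1;
    }

    // First index whose character is NOT in the needle, or -1.
    public static int IndexOfAnyExceptSimple(ReadOnlySpan<char> haystack, ReadOnlySpan<char> needle)
    {
        for (int i = 0; i < haystack.Length; i++)
        {
            if (!needle.Contains(haystack[i]))
                return i;
        }
        return -1;
    }
}
```

Each iteration scans the needle once, so the cost is O(n * m), which is acceptable only because n is capped at a handful of characters on this path.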

@MihaZupan (Member Author)

That is what we do for the -Except overloads with this PR.

haystack.Length >= 8 ? Vectorized() : SimpleLoop()

whereas (Last)IndexOfAny does

haystack.Length >= 8 ? Vectorized() : ProbabilisticMap()

In places where we were already using the ProbabilisticMap, the init overhead seems to be very similar.

It would likely be beneficial to tweak the exact cutoff and add the SimpleLoop variant where we currently only use the ProbabilisticMap.
E.g. (made up constants)

int IndexOfAny()
{
    if (haystack.Length < Max(8, needle.Length / 2))
        return SimpleLoop();

    if (IsAscii(needle))
        return Vectorized();

    return ProbabilisticMap();
}

int IndexOfAnyExcept()
{
    if (haystack.Length >= Max(8, needle.Length / 2) && IsAscii(needle))
        return Vectorized();

    return SimpleLoop();
}

I can look into what sort of numbers we'd see with something like that, though I would hope that in general, a dedicated API to hide the init cost completely would be the preferred approach.

@stephentoub (Member)

> It would likely be beneficial to tweak the exact cutoff and add the SimpleLoop variant where we currently only use the ProbabilisticMap.

Right, this is the main thing I was asking about. The probabilistic map path today needs to loop through each char individually, and do additional work for each. My gut would be that, other than for obscenely long needles, you could have a reasonably-sized haystack and still win with the simple loop doing a vectorized contains on the needles.

@stephentoub (Member)

> though I would hope that in general, a dedicated API to hide the init cost completely would be the preferred approach

Certainly for cases where you're going to be invoking something repeatedly and have the foresight to create and cache the preprocessed vector information.

@gfoidl (Member) left a comment

👍🏻
Left some notes.

I expect these APIs should work for byte-inputs too, so with another static abstract interface (probably my favorite C# 11 feature) this can be done without duplication. I'm re-writing the base64 code in my repo using Char- and Byte-Operations for this* -- i.e. reading two char/short-vectors and combining them into one byte vector for further processing. Something similar could be done here too.

* it's still WIP, so pack signed vs. unsigned isn't done there
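
To make the static abstract interface idea concrete, here is a scalar sketch of how one search routine could serve both `char` and `byte` inputs. Everything in it (`IAsciiElement<T>`, `CharElement`, `ByteElement`, `GenericSearchSketch`) is hypothetical and is not code from this PR or from the base64 repository mentioned above; it requires C# 11 or later.

```csharp
using System;

// Hypothetical abstraction: maps an element to its ASCII code,
// or reports that the element is not ASCII.
interface IAsciiElement<T> where T : unmanaged
{
    static abstract bool TryGetAscii(T value, out byte ascii);
}

readonly struct CharElement : IAsciiElement<char>
{
    public static bool TryGetAscii(char value, out byte ascii)
    {
        ascii = (byte)value;   // only meaningful when the method returns true
        return value < 128;
    }
}

readonly struct ByteElement : IAsciiElement<byte>
{
    public static bool TryGetAscii(byte value, out byte ascii)
    {
        ascii = value;
        return value < 128;
    }
}

static class GenericSearchSketch
{
    // One implementation shared by both element types; the needle is
    // given as ASCII codes. A scalar loop stands in for the vector code.
    public static int IndexOfAnyAscii<T, TElement>(ReadOnlySpan<T> haystack, ReadOnlySpan<byte> asciiNeedle)
        where T : unmanaged
        where TElement : IAsciiElement<T>
    {
        for (int i = 0; i < haystack.Length; i++)
        {
            if (TElement.TryGetAscii(haystack[i], out byte b) && asciiNeedle.Contains(b))
                return i;
        }
        return -1;
    }
}
```

Because the element operations are resolved through a struct type parameter, the runtime can specialize the method per element type, so the shared code need not pay for virtual dispatch.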

@MihaZupan (Member Author) commented Oct 9, 2022

I added a fast path for short haystacks to the probabilistic code path, leading to nice improvements there (I updated the numbers in the top post). E.g.

| Method | Toolchain | Length | Needle | Mean | Error | Ratio |
|---|---|---|---|---|---|---|
| LastIndexOfAny | main | 1 | AlphaNumeric | 74.164 ns | 0.0712 ns | 1.00 |
| LastIndexOfAny | pr | 1 | AlphaNumeric | 5.558 ns | 0.0011 ns | 0.07 |

My benchmark for -Except methods was also measuring the best-case performance for the previous simple loop. I updated it to measure the average case (match is found randomly in the needle instead of being the first element). The corresponding potential throughput change is therefore also much higher (145x instead of 30x).

The only real regression is the (Last)IndexOfAnyExcept for long needles and short haystacks as mentioned before. I'll keep playing around with cutoffs to try and improve that case too.

@MihaZupan (Member Author)

The cutoff for the simple loop is now:

searchSpaceLength < Vector128<short>.Count || (searchSpaceLength < 20 && searchSpaceLength < (valuesLength >> 1))

Updated the benchmarks above, this is now a win pretty much across the board.
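
Wrapped in a helper, the cutoff expression above reads as follows. The `CutoffSketch` class and `UseSimpleLoop` name are invented for this illustration; only the condition itself comes from the comment.

```csharp
using System;
using System.Runtime.Intrinsics;

static class CutoffSketch
{
    // Returns true when the simple loop should be used instead of the
    // vectorized or probabilistic paths:
    //  - always for haystacks shorter than one Vector128<short> (8 elements);
    //  - also for haystacks under 20 elements that are shorter than
    //    half the needle length.
    public static bool UseSimpleLoop(int searchSpaceLength, int valuesLength) =>
        searchSpaceLength < Vector128<short>.Count ||
        (searchSpaceLength < 20 && searchSpaceLength < (valuesLength >> 1));
}
```

For example, assuming the AlphaNumeric needle has 62 characters, haystacks of length 8 and 16 still take the simple loop (both are below 20 and below 31), while a haystack of length 32 does not. With the 6-character ABCDEF needle, any haystack of 8 or more characters leaves the simple loop.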

@MihaZupan (Member Author)

/azp run runtime

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@danmoseley (Member)

@stephentoub (Member) commented Oct 10, 2022

> Curious to see what this does to regex redux

It won't do anything. We don't use this API from RegexCompiler / source generator (or, rather, we rely on the API immediately delegating to the 4/5-char overloads), nor does regex redux have any sets that would trigger it. We will update RegexCompiler / source generator to use whatever new API we create that plugs into the same implementation this is adding but that lets us precompute the vector/bit set rather than having to do it on each call.

@MihaZupan (Member Author)

/azp run runtime

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).


Span<char> needleSpace = stackalloc char[8];
Span<char> haystackSpace = stackalloc char[40];
var rng = new Random();
Member

To make this reproducible, we should include a seed here.

@MihaZupan (Member Author) Nov 2, 2022

If I remember correctly, I removed it because the RNG part represents a significant portion of the test execution time, with the non-explicit-seed ctor using the faster impl AFAIK.

For reproducibility, I made sure to emit the exact inputs that failed as part of the error message. Does that address your concerns here?

@stephentoub (Member) Nov 3, 2022

> Does that address your concerns here?

Only partially. There are two concerns:

  1. That a test could fail and you not know why or be able to trigger it again.
  2. That two runs of the same test suite might be non-deterministic.

Outputting the exact inputs addresses (1) but not (2). And (2) in general contributes to test flakiness.

If this test is about randomly stressing the implementation, then I don't believe it belongs in the functional test suite. If it's about using pseudo-randomness to try out thousands of inputs without having to manually code them all and be creative enough to ensure a reasonable spread of inputs, then it should be done in a way that runs the same tests every time, which means using a seed.
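
A deterministic variant of such a test could be sketched like this. It is not the test from the PR: the class and method names, the 16-letter alphabet and the naive oracle are assumptions made for illustration. The point is that a fixed seed makes every run exercise the same inputs, and printing the inputs on failure keeps individual failures reproducible.

```csharp
using System;

static class SeededFuzzSketch
{
    // Naive O(n * m) reference used as the oracle.
    static int ReferenceIndexOfAny(ReadOnlySpan<char> haystack, ReadOnlySpan<char> needle)
    {
        for (int i = 0; i < haystack.Length; i++)
        {
            foreach (char c in needle)
            {
                if (haystack[i] == c)
                    return i;
            }
        }
        return -1;
    }

    // Small alphabet (assumed) so that matches and misses both occur often.
    static void FillRandom(Random rng, Span<char> buffer)
    {
        for (int i = 0; i < buffer.Length; i++)
            buffer[i] = (char)('a' + rng.Next(16));
    }

    // Returns the number of inputs where IndexOfAny disagrees with the oracle.
    public static int CountMismatches(int seed, int iterations)
    {
        var rng = new Random(seed);   // explicit seed: same inputs on every run
        Span<char> needleSpace = stackalloc char[8];
        Span<char> haystackSpace = stackalloc char[40];
        int mismatches = 0;

        for (int iter = 0; iter < iterations; iter++)
        {
            Span<char> needle = needleSpace.Slice(0, rng.Next(1, needleSpace.Length + 1));
            Span<char> haystack = haystackSpace.Slice(0, rng.Next(haystackSpace.Length + 1));
            FillRandom(rng, needle);
            FillRandom(rng, haystack);

            ReadOnlySpan<char> h = haystack;
            ReadOnlySpan<char> n = needle;
            int expected = ReferenceIndexOfAny(h, n);
            int actual = h.IndexOfAny(n);

            if (expected != actual)
            {
                mismatches++;
                // Emit the exact failing inputs so the case can be replayed.
                Console.WriteLine($"haystack='{h.ToString()}' needle='{n.ToString()}' expected={expected} actual={actual}");
            }
        }
        return mismatches;
    }
}
```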

Member Author

I changed it to waste less time in Random and use a constant seed.

It was extremely useful when working on the initial implementation to get rid of all the edge-case bugs. Some were so odd that I don't believe we would have come up with test cases for otherwise.

I increased the number of iterations so it now takes about a second of CPU time to run through them, so I moved these to outerloop.
Outerloop already takes ~4 minutes on my machine so I'm gonna assume that's acceptable.

@@ -1791,8 +1779,9 @@ private static unsafe int IndexOfAnyProbabilistic(ref char searchSpace, int sear
/// </summary>
/// <param name="span">The span to search.</param>
/// <param name="values">The set of values to search for.</param>
public static int LastIndexOfAny<T>(this Span<T> span, ReadOnlySpan<T> values) where T : IEquatable<T>?
=> LastIndexOfAny((ReadOnlySpan<T>)span, values);
[MethodImpl(MethodImplOptions.AggressiveInlining)]
Member

This was necessary? Or just added for consistency with something?

Member Author

This was removed by mistake in #75754 (comment)

I added it back to at least make it consistent with the rest of {Last}IndexOfAny overloads.

@stephentoub (Member)

> Updated the benchmarks above, this is now a win pretty much across the board.

Meaning with this PR, for long or short inputs, vectorized or not, ASCII or not, everything you've tried is as good or better than it was before? If so, yay!

@MihaZupan (Member Author) commented Nov 2, 2022

> Meaning with this PR, for long or short inputs, vectorized or not, ASCII or not, everything you've tried is as good or better than it was before? If so, yay!

Yes*.

The main scenario I can think of that would regress is if the needle contains non-ASCII, but starts with ASCII (e.g. alphabet + 'ü') as you now have to pay for an extra scan to build the ASCII bitmap that you don't actually need. The throughput would be the same as before, but the init cost would be higher.
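
The wasted work in that scenario can be seen in a simplified bitmap builder. This is only a sketch under stated assumptions: the two-`ulong` layout and the names `AsciiBitmapSketch`, `TryBuildBitmap` and `Contains` are invented here, and the actual bitmap used by the vectorized path may be laid out differently.

```csharp
using System;

static class AsciiBitmapSketch
{
    // Builds a 128-bit membership bitmap (two ulongs) for an ASCII needle.
    // Returns false at the first non-ASCII character; any bits set before
    // that point are wasted work, which is the extra init cost described
    // above for needles such as the alphabet followed by 'ü'.
    public static bool TryBuildBitmap(ReadOnlySpan<char> needle, out ulong low, out ulong high)
    {
        low = 0;
        high = 0;

        foreach (char c in needle)
        {
            if (c > 127)
                return false;   // caller must fall back to another path

            if (c < 64)
                low |= 1UL << c;
            else
                high |= 1UL << (c - 64);
        }
        return true;
    }

    // Membership test against a bitmap produced by TryBuildBitmap.
    public static bool Contains(ulong low, ulong high, char c)
    {
        if (c < 64)
            return (low & (1UL << c)) != 0;

        return c < 128 && (high & (1UL << (c - 64))) != 0;
    }
}
```

With a needle like "abc...zü", the loop processes all 26 letters before it reaches 'ü' and gives up, so the fallback path starts only after a full pass over the ASCII prefix.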


For short inputs that still fall back to a simple for loop (short input + large needle), there is a slightly higher per-call overhead to do the extra calls/checks. E.g.

| Method | Toolchain | Length | Needle | Mean | Error | Ratio |
|---|---|---|---|---|---|---|
| IndexOfAnyExcept | main | 8 | AlphaNumeric | 27.99 ns | 0.025 ns | 1.00 |
| IndexOfAnyExcept | pr | 8 | AlphaNumeric | 31.09 ns | 0.036 ns | 1.11 |

There can also be slight differences due to a change of "naive for loop" Contains vs span.Contains for the probabilistic path, where we may be slightly slower for short needles, but faster for longer ones.

@MihaZupan (Member Author)

/azp run runtime-libraries-coreclr outerloop

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@MihaZupan (Member Author)

Test failures look like #76755, #77726

@MihaZupan MihaZupan changed the title from "Vectorize (Last)IndexOfAny(Except) for ASCII needles" to "Vectorize {Last}IndexOfAny{Except} for ASCII needles" Nov 4, 2022
@MihaZupan MihaZupan merged commit 6dfd63c into dotnet:main Nov 4, 2022
@MihaZupan MihaZupan mentioned this pull request Nov 20, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Dec 5, 2022