Add Avx512 support to IndexOfAnyAsciiSearcher #103710

Merged 1 commit on Jun 22, 2024

MihaZupan
Member

Closes #93222

Pretty much a copy-paste of the existing Vector128/Vector256 paths (#93222 (comment)).
I had to spam a couple more AggressiveInlinings to get all the small helpers to inline even in microbenchmarks where they're never called. This is the source of slight improvements for early matches in a couple benchmarks below.

Numbers-wise it's a ~0.5 - 1 ns regression for early matches, and a throughput speedup of roughly 1.5x for longer inputs.

Early matches

| Method | Toolchain | Length | MatchAtStart | Mean | Error | Ratio |
|---|---|---:|---|---:|---:|---:|
| IndexOfAny_Char | main | 32 | True | 4.885 ns | 0.0129 ns | 1.00 |
| IndexOfAny_Char | pr | 32 | True | 3.343 ns | 0.0007 ns | 0.68 |
| IndexOfAnyExcept_Char | main | 32 | True | 3.056 ns | 0.0008 ns | 1.00 |
| IndexOfAnyExcept_Char | pr | 32 | True | 3.152 ns | 0.0095 ns | 1.03 |
| LastIndexOfAny_Char | main | 32 | True | 4.166 ns | 0.0522 ns | 1.00 |
| LastIndexOfAny_Char | pr | 32 | True | 3.564 ns | 0.0748 ns | 0.86 |
| LastIndexOfAnyExcept_Char | main | 32 | True | 3.550 ns | 0.0008 ns | 1.00 |
| LastIndexOfAnyExcept_Char | pr | 32 | True | 3.561 ns | 0.0007 ns | 1.00 |
| IndexOfAny_Char | main | 33 | True | 3.463 ns | 0.0023 ns | 1.00 |
| IndexOfAny_Char | pr | 33 | True | 3.598 ns | 0.0037 ns | 1.04 |
| IndexOfAnyExcept_Char | main | 33 | True | 2.834 ns | 0.0106 ns | 1.00 |
| IndexOfAnyExcept_Char | pr | 33 | True | 3.660 ns | 0.0064 ns | 1.29 |
| LastIndexOfAny_Char | main | 33 | True | 4.158 ns | 0.0082 ns | 1.00 |
| LastIndexOfAny_Char | pr | 33 | True | 3.951 ns | 0.0004 ns | 0.95 |
| LastIndexOfAnyExcept_Char | main | 33 | True | 3.162 ns | 0.0053 ns | 1.00 |
| LastIndexOfAnyExcept_Char | pr | 33 | True | 4.114 ns | 0.0006 ns | 1.30 |
| IndexOfAny_Char | main | 65 | True | 3.464 ns | 0.0028 ns | 1.00 |
| IndexOfAny_Char | pr | 65 | True | 3.504 ns | 0.0768 ns | 1.01 |
| IndexOfAnyExcept_Char | main | 65 | True | 2.833 ns | 0.0137 ns | 1.00 |
| IndexOfAnyExcept_Char | pr | 65 | True | 3.581 ns | 0.1278 ns | 1.26 |
| LastIndexOfAny_Char | main | 65 | True | 4.128 ns | 0.0257 ns | 1.00 |
| LastIndexOfAny_Char | pr | 65 | True | 3.402 ns | 0.0005 ns | 0.82 |
| LastIndexOfAnyExcept_Char | main | 65 | True | 3.030 ns | 0.0152 ns | 1.00 |
| LastIndexOfAnyExcept_Char | pr | 65 | True | 3.457 ns | 0.0019 ns | 1.14 |
| IndexOfAny_Char | main | 10000 | True | 3.466 ns | 0.0035 ns | 1.00 |
| IndexOfAny_Char | pr | 10000 | True | 3.397 ns | 0.0004 ns | 0.98 |
| IndexOfAnyExcept_Char | main | 10000 | True | 2.822 ns | 0.0111 ns | 1.00 |
| IndexOfAnyExcept_Char | pr | 10000 | True | 3.402 ns | 0.0012 ns | 1.20 |
| LastIndexOfAny_Char | main | 10000 | True | 3.846 ns | 0.0118 ns | 1.00 |
| LastIndexOfAny_Char | pr | 10000 | True | 3.806 ns | 0.0153 ns | 0.99 |
| LastIndexOfAnyExcept_Char | main | 10000 | True | 3.030 ns | 0.0044 ns | 1.00 |
| LastIndexOfAnyExcept_Char | pr | 10000 | True | 3.467 ns | 0.0014 ns | 1.14 |

Throughput

| Method | Toolchain | Length | MatchAtStart | Mean | Error | Ratio |
|---|---|---:|---|---:|---:|---:|
| IndexOfAny_Char | main | 32 | False | 3.109 ns | 0.0027 ns | 1.00 |
| IndexOfAny_Char | pr | 32 | False | 2.625 ns | 0.0006 ns | 0.84 |
| IndexOfAnyExcept_Char | main | 32 | False | 2.460 ns | 0.0025 ns | 1.00 |
| IndexOfAnyExcept_Char | pr | 32 | False | 3.084 ns | 0.0783 ns | 1.25 |
| LastIndexOfAny_Char | main | 32 | False | 3.063 ns | 0.0087 ns | 1.00 |
| LastIndexOfAny_Char | pr | 32 | False | 2.865 ns | 0.1573 ns | 0.94 |
| LastIndexOfAnyExcept_Char | main | 32 | False | 3.646 ns | 0.4223 ns | 1.00 |
| LastIndexOfAnyExcept_Char | pr | 32 | False | 2.426 ns | 0.0036 ns | 0.68 |
| IndexOfAny_Char | main | 33 | False | 4.343 ns | 0.0232 ns | 1.00 |
| IndexOfAny_Char | pr | 33 | False | 3.146 ns | 0.0026 ns | 0.72 |
| IndexOfAnyExcept_Char | main | 33 | False | 4.320 ns | 0.2929 ns | 1.00 |
| IndexOfAnyExcept_Char | pr | 33 | False | 3.038 ns | 0.0004 ns | 0.71 |
| LastIndexOfAny_Char | main | 33 | False | 4.913 ns | 0.0303 ns | 1.00 |
| LastIndexOfAny_Char | pr | 33 | False | 3.017 ns | 0.0008 ns | 0.61 |
| LastIndexOfAnyExcept_Char | main | 33 | False | 3.896 ns | 0.0007 ns | 1.00 |
| LastIndexOfAnyExcept_Char | pr | 33 | False | 3.440 ns | 0.0003 ns | 0.88 |
| IndexOfAny_Char | main | 65 | False | 5.708 ns | 0.0108 ns | 1.00 |
| IndexOfAny_Char | pr | 65 | False | 4.236 ns | 0.0008 ns | 0.74 |
| IndexOfAnyExcept_Char | main | 65 | False | 5.004 ns | 0.0600 ns | 1.00 |
| IndexOfAnyExcept_Char | pr | 65 | False | 4.259 ns | 0.0068 ns | 0.85 |
| LastIndexOfAny_Char | main | 65 | False | 6.250 ns | 0.1711 ns | 1.00 |
| LastIndexOfAny_Char | pr | 65 | False | 4.242 ns | 0.0043 ns | 0.68 |
| LastIndexOfAnyExcept_Char | main | 65 | False | 6.590 ns | 0.7326 ns | 1.00 |
| LastIndexOfAnyExcept_Char | pr | 65 | False | 4.174 ns | 0.0598 ns | 0.65 |
| IndexOfAny_Char | main | 10000 | False | 331.894 ns | 0.2110 ns | 1.00 |
| IndexOfAny_Char | pr | 10000 | False | 208.743 ns | 0.0179 ns | 0.63 |
| IndexOfAnyExcept_Char | main | 10000 | False | 375.454 ns | 0.0229 ns | 1.00 |
| IndexOfAnyExcept_Char | pr | 10000 | False | 260.705 ns | 0.0650 ns | 0.69 |
| LastIndexOfAny_Char | main | 10000 | False | 333.720 ns | 0.0841 ns | 1.00 |
| LastIndexOfAny_Char | pr | 10000 | False | 209.806 ns | 0.0891 ns | 0.63 |
| LastIndexOfAnyExcept_Char | main | 10000 | False | 378.696 ns | 0.3796 ns | 1.00 |
| LastIndexOfAnyExcept_Char | pr | 10000 | False | 262.419 ns | 0.4015 ns | 0.69 |
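
For orientation, the four benchmarked method names map onto the public span search APIs. The snippet below is only a minimal sketch of how those APIs are called with a small ASCII needle set; the input string, the needle set, and the variable names are made up for illustration, and it is an assumption (not something this PR states) that this particular set is routed through `IndexOfAnyAsciiSearcher`.

```csharp
using System;
using System.Buffers;

// Hypothetical ASCII-only needle set, created once and reused.
SearchValues<char> separators = SearchValues.Create(",;");
ReadOnlySpan<char> text = "alpha,beta;gamma";

int first = text.IndexOfAny(separators);               // first ',' or ';'
int last = text.LastIndexOfAny(separators);            // last ',' or ';'
int firstOther = text.IndexOfAnyExcept(separators);    // first char outside the set
int lastOther = text.LastIndexOfAnyExcept(separators); // last char outside the set

Console.WriteLine($"{first} {last} {firstOther} {lastOther}"); // prints "5 10 0 15"
```

With `MatchAtStart = True` the match sits at the start of the input, so the call cost is dominated by setup rather than by the vectorized loop, which is where the early-match regression shows up.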

I'll rerun the Regex benchmarks.

@MihaZupan
Member Author

@MihuBot benchmark Regex

MihuBot commented Jun 20, 2024

System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock
BenchmarkDotNet v0.13.13-nightly.20240311.145, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 8 logical and 4 physical cores
MediumRun : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
Job=MediumRun  OutlierMode=DontRemove  IterationCount=15
LaunchCount=2  MemoryRandomization=True  WarmupCount=10
| Method | Toolchain | Pattern | Mean | Error | Ratio | Allocated | Alloc Ratio |
|---|---|---|---:|---:|---:|---:|---:|
| Count | Main | `.*` | 691,131.27 ns | 842.077 ns | 1.00 | 2 B | 1.00 |
| Count | PR | `.*` | 701,291.96 ns | 4,208.201 ns | 1.01 | 2 B | 1.00 |
| Count | Main | `(?i)Holmes` | 71,847.44 ns | 471.748 ns | 1.00 | - | NA |
| Count | PR | `(?i)Holmes` | 72,181.78 ns | 346.591 ns | 1.00 | - | NA |
| Count | Main | `(?i)Sher[a-z]+\|Hol[a-z]+` | 102,412.23 ns | 2,482.554 ns | 1.00 | - | NA |
| Count | PR | `(?i)Sher[a-z]+\|Hol[a-z]+` | 100,226.97 ns | 4,693.052 ns | 0.98 | - | NA |
| Count | Main | `(?i)Sherlock` | 63,157.23 ns | 286.797 ns | 1.00 | - | NA |
| Count | PR | `(?i)Sherlock` | 63,592.22 ns | 294.601 ns | 1.01 | 3 B | NA |
| Count | Main | `(?i)Sherlock Holmes` | 63,163.87 ns | 353.487 ns | 1.00 | - | NA |
| Count | PR | `(?i)Sherlock Holmes` | 63,130.32 ns | 312.279 ns | 1.00 | - | NA |
| Count | Main | `(?i)Sherlock\|Holmes\|Watson` | 96,989.85 ns | 3,823.367 ns | 1.00 | - | NA |
| Count | PR | `(?i)Sherlock\|Holmes\|Watson` | 96,919.46 ns | 4,402.650 ns | 1.00 | - | NA |
| Count | Main | `(?i)Sherlock\|(...)er\|John\|Baker` [49] | 292,458.32 ns | 4,255.586 ns | 1.00 | 2 B | 1.00 |
| Count | PR | `(?i)Sherlock\|(...)er\|John\|Baker` [49] | 291,038.70 ns | 5,906.277 ns | 1.00 | 2 B | 1.00 |
| Count | Main | `(?i)the` | 380,816.71 ns | 448.532 ns | 1.00 | 1 B | 1.00 |
| Count | PR | `(?i)the` | 378,387.61 ns | 597.080 ns | 0.99 | 1 B | 1.00 |
| Count | Main | `(?m)^Sherlock(...)rlock Holmes$` [37] | 80,760.72 ns | 4,306.860 ns | 1.01 | - | NA |
| Count | PR | `(?m)^Sherlock(...)rlock Holmes$` [37] | 82,582.68 ns | 3,929.177 ns | 1.03 | - | NA |
| Count | Main | `(?s).*` | 51.49 ns | 0.230 ns | 1.00 | - | NA |
| Count | PR | `(?s).*` | 51.44 ns | 0.201 ns | 1.00 | - | NA |
| Count | Main | `[^\\n]*` | 696,503.75 ns | 1,189.220 ns | 1.00 | 2 B | 1.00 |
| Count | PR | `[^\\n]*` | 701,777.91 ns | 4,947.894 ns | 1.01 | 2 B | 1.00 |
| Count | Main | `[a-q][^u-z]{13}x` | 34,252.94 ns | 271.978 ns | 1.00 | - | NA |
| Count | PR | `[a-q][^u-z]{13}x` | 32,601.06 ns | 1,062.765 ns | 0.95 | - | NA |
| Count | Main | `[a-zA-Z]+ing` | 4,743,957.65 ns | 35,002.412 ns | 1.00 | 21 B | 1.00 |
| Count | PR | `[a-zA-Z]+ing` | 4,892,478.29 ns | 15,606.673 ns | 1.03 | 21 B | 1.00 |
| Count | Main | `\b\w+n\b` | 9,984,392.53 ns | 16,833.207 ns | 1.00 | 44 B | 1.00 |
| Count | PR | `\b\w+n\b` | 10,828,533.01 ns | 36,736.622 ns | 1.08 | 39 B | 0.89 |
| Count | Main | `\p{L}` | 10,603,931.90 ns | 48,819.499 ns | 1.00 | 35 B | 1.00 |
| Count | PR | `\p{L}` | 11,179,680.36 ns | 9,486.172 ns | 1.05 | 31 B | 0.89 |
| Count | Main | `\p{Ll}` | 10,195,145.76 ns | 27,272.582 ns | 1.00 | 35 B | 1.00 |
| Count | PR | `\p{Ll}` | 11,436,304.68 ns | 77,659.485 ns | 1.12 | 35 B | 1.00 |
| Count | Main | `\p{Lu}` | 512,487.85 ns | 3,951.166 ns | 1.00 | 2 B | 1.00 |
| Count | PR | `\p{Lu}` | 465,932.29 ns | 13,751.745 ns | 0.91 | 1 B | 0.50 |
| Count | Main | `\s[a-zA-Z]{0,12}ing\s` | 5,401,364.31 ns | 12,575.532 ns | 1.00 | 24 B | 1.00 |
| Count | PR | `\s[a-zA-Z]{0,12}ing\s` | 5,785,507.55 ns | 101,020.912 ns | 1.07 | 24 B | 1.00 |
| Count | Main | `\w+` | 5,177,572.11 ns | 28,630.119 ns | 1.00 | 18 B | 1.00 |
| Count | PR | `\w+` | 5,505,569.07 ns | 66,941.247 ns | 1.06 | 18 B | 1.00 |
| Count | Main | `\w+\s+Holmes` | 4,021,934.73 ns | 23,376.720 ns | 1.00 | 22 B | 1.00 |
| Count | PR | `\w+\s+Holmes` | 4,287,127.11 ns | 34,628.196 ns | 1.07 | 22 B | 1.00 |
| Count | Main | `\w+\s+Holmes\s+\w+` | 4,157,736.06 ns | 3,369.514 ns | 1.00 | 23 B | 1.00 |
| Count | PR | `\w+\s+Holmes\s+\w+` | 4,448,197.47 ns | 8,704.825 ns | 1.07 | 23 B | 1.00 |
| Count | Main | `aei` | 60,226.44 ns | 177.769 ns | 1.00 | - | NA |
| Count | PR | `aei` | 59,984.97 ns | 231.105 ns | 1.00 | - | NA |
| Count | Main | `aqj` | 60,359.02 ns | 219.068 ns | 1.00 | - | NA |
| Count | PR | `aqj` | 60,216.79 ns | 250.061 ns | 1.00 | - | NA |
| Count | Main | `Holmes` | 70,247.90 ns | 381.235 ns | 1.00 | - | NA |
| Count | PR | `Holmes` | 70,191.66 ns | 312.375 ns | 1.00 | - | NA |
| Count | Main | `Holmes.{0,25}(...).{0,25}Holmes` [39] | 59,812.45 ns | 223.860 ns | 1.00 | - | NA |
| Count | PR | `Holmes.{0,25}(...).{0,25}Holmes` [39] | 61,210.71 ns | 1,547.310 ns | 1.02 | - | NA |
| Count | Main | `Sher[a-z]+\|Hol[a-z]+` | 66,397.69 ns | 761.073 ns | 1.00 | - | NA |
| Count | PR | `Sher[a-z]+\|Hol[a-z]+` | 64,849.71 ns | 1,154.795 ns | 0.98 | 3 B | NA |
| Count | Main | `Sherlock` | 85,985.04 ns | 1,966.243 ns | 1.00 | - | NA |
| Count | PR | `Sherlock` | 83,371.97 ns | 3,793.039 ns | 0.97 | - | NA |
| Count | Main | `Sherlock Holmes` | 82,774.02 ns | 3,862.238 ns | 1.01 | - | NA |
| Count | PR | `Sherlock Holmes` | 85,172.83 ns | 2,128.470 ns | 1.03 | - | NA |
| Count | Main | `Sherlock\s+Holmes` | 83,442.70 ns | 3,644.738 ns | 1.00 | - | NA |
| Count | PR | `Sherlock\s+Holmes` | 81,539.97 ns | 4,064.719 ns | 0.98 | - | NA |
| Count | Main | `Sherlock\|Holmes` | 57,513.25 ns | 1,431.014 ns | 1.00 | - | NA |
| Count | PR | `Sherlock\|Holmes` | 58,628.39 ns | 909.981 ns | 1.02 | - | NA |
| Count | Main | `Sherlock\|Holmes\|Watson` | 81,797.43 ns | 3,871.627 ns | 1.00 | - | NA |
| Count | PR | `Sherlock\|Holmes\|Watson` | 77,711.13 ns | 263.733 ns | 0.95 | - | NA |
| Count | Main | `Sherlock\|Holm(...)er\|John\|Baker` [45] | 204,100.20 ns | 4,644.087 ns | 1.00 | 1 B | 1.00 |
| Count | PR | `Sherlock\|Holm(...)er\|John\|Baker` [45] | 209,457.96 ns | 1,109.672 ns | 1.03 | 1 B | 1.00 |
| Count | Main | `Sherlock\|Street` | 34,583.11 ns | 1,498.786 ns | 1.00 | - | NA |
| Count | PR | `Sherlock\|Street` | 37,804.48 ns | 905.387 ns | 1.10 | - | NA |
| Count | Main | `the` | 308,111.60 ns | 2,660.420 ns | 1.00 | 1 B | 1.00 |
| Count | PR | `the` | 304,342.71 ns | 867.519 ns | 0.99 | 1 B | 1.00 |
| Count | Main | `The` | 74,898.95 ns | 553.273 ns | 1.00 | - | NA |
| Count | PR | `The` | 74,592.77 ns | 544.687 ns | 1.00 | - | NA |
| Count | Main | `the\s+\w+` | 448,325.25 ns | 1,314.530 ns | 1.00 | 1 B | 1.00 |
| Count | PR | `the\s+\w+` | 457,653.52 ns | 8,459.813 ns | 1.02 | 1 B | 1.00 |
| Count | Main | `zqj` | 59,408.54 ns | 583.905 ns | 1.00 | - | NA |
| Count | PR | `zqj` | 59,892.80 ns | 549.900 ns | 1.01 | - | NA |

A few other runs

MihaZupan requested a review from stephentoub on June 20, 2024 15:13
public BitVector256 Lookup = lookup;

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public readonly Vector128<byte> Bitmap128() => Bitmap512._lower._lower;
Member

Nit: why a method rather than a property?

Member Author

It's a bit more code to define the get since we need the MethodImpl on it. No other reason.
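
To make the trade-off concrete, here is a minimal sketch of the two spellings. `BitmapHolder`, `Wide`, `Lower`, and `LowerProperty` are hypothetical names invented for this example, not the PR's code, and the sketch uses `Vector256` with `GetLower()` rather than the internal `_lower` fields. `MethodImplAttribute` can be applied to a method or to an accessor but not to a property declaration itself, so the property form needs an explicit `get` body to carry the attribute.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

// Both spellings return the same value; only the declaration shape differs.
var holder = new BitmapHolder(Vector128.Create((byte)7));
Console.WriteLine(holder.Lower() == holder.LowerProperty); // prints "True"

// Hypothetical stand-in for a struct that stores a widened copy of a bitmap.
public struct BitmapHolder(Vector128<byte> bitmap)
{
    public Vector256<byte> Wide = Vector256.Create(bitmap, bitmap);

    // Method form: the attribute sits directly on an expression-bodied member.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public readonly Vector128<byte> Lower() => Wide.GetLower();

    // Property form: the attribute has to go on the accessor,
    // which forces the longer block syntax.
    public readonly Vector128<byte> LowerProperty
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get => Wide.GetLower();
    }
}
```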

@@ -17,18 +17,36 @@ internal static class IndexOfAnyAsciiSearcher
{
    public struct AsciiState(Vector128<byte> bitmap, BitVector256 lookup)
    {
        public Vector256<byte> Bitmap = Vector256.Create(bitmap);
        public Vector512<byte> Bitmap512 = Vector512.Create(bitmap);
Member

These can't / shouldn't be readonly?

Member Author

The TryIndexOfAny helpers (called by the non-SearchValues IndexOfAny overloads) write into this field directly when computing the bitmap, which avoids needing an extra local.
As far as the vectorized code is concerned, these could be readonly.
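
A rough sketch of that pattern, under stated assumptions: `DemoState`, `BitmapDemo`, and `TryFill` are hypothetical names, and the nibble layout (low nibble selects the byte, high nibble selects the bit) is one plausible encoding chosen for illustration rather than a confirmed copy of the runtime's helper. The point is only that a helper which fills the bitmap through a `ref` to the field cannot do so if the field is `readonly`.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

DemoState state = default;
bool ok = BitmapDemo.TryFill("az", ref state);
// 'a' = 0x61 sets bit 6 of byte 1; 'z' = 0x7A sets bit 7 of byte 10.
Console.WriteLine($"{ok} 0x{state.Bitmap.GetElement(1):X2} 0x{state.Bitmap.GetElement(10):X2}"); // prints "True 0x40 0x80"

// Hypothetical state struct; the field is deliberately not readonly.
public struct DemoState
{
    public Vector128<byte> Bitmap;
}

public static class BitmapDemo
{
    // Writes the bitmap straight into state.Bitmap instead of building it
    // in a local and copying it over. Returns false on non-ASCII input,
    // in which case the bitmap may be partially filled.
    public static bool TryFill(ReadOnlySpan<char> values, ref DemoState state)
    {
        // View the 16 bytes of the vector field as a writable span.
        Span<byte> bytes = MemoryMarshal.CreateSpan(
            ref Unsafe.As<Vector128<byte>, byte>(ref state.Bitmap),
            Vector128<byte>.Count);

        foreach (char c in values)
        {
            if (c > 127)
            {
                return false;
            }

            // Assumed layout: byte index = low nibble, bit index = high nibble.
            bytes[c & 0xF] |= (byte)(1 << (c >> 4));
        }

        return true;
    }
}
```

Marking `Bitmap` as `readonly` would make `ref state.Bitmap` a compile error outside the constructor, which matches the explanation above for why the fields stay mutable.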

MihaZupan merged commit ce1ae77 into dotnet:main on Jun 22, 2024
143 of 146 checks passed
rzikm pushed a commit to rzikm/dotnet-runtime that referenced this pull request Jun 24, 2024
MihaZupan added a commit to MihaZupan/runtime that referenced this pull request Jul 10, 2024
MihaZupan added a commit that referenced this pull request Jul 10, 2024
…4688)

* Revert "Add Avx512 support to IndexOfAnyAsciiSearcher (#103710)"

This reverts commit ce1ae77.

* Keep the test improvements

* Keep the inlining changes

* Keep the shift improvements
matouskozak added a commit to matouskozak/runtime that referenced this pull request Jul 11, 2024
github-actions bot locked and limited conversation to collaborators on Jul 23, 2024
Successfully merging this pull request may close these issues.

Light up IndexOfAnyAsciiSearcher for AVX512
3 participants