Port IndexOfAny(ref char, char[1-5], int) to Vector128/256 #73469

adamsitnik · 2022-08-05T15:47:37Z

x64

AVX2

The improvements for IndexOfAnyFourValues and IndexOfAnyFiveValues come from performing just one ExtractMostSignificantBits operation instead of n MoveMask operations.
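To illustrate the idea, here is a minimal sketch (not the PR's actual code) contrasting the two strategies with the cross-platform `Vector128` API. The class and method names (`MaskSketch`, `MaskPerComparison`, `MaskOnce`) are hypothetical and exist only for this example; it assumes .NET 7 or later.

```csharp
using System;
using System.Runtime.Intrinsics;

static class MaskSketch
{
    // One mask extraction per searched value; the integer masks are OR-ed together.
    public static uint MaskPerComparison(Vector128<ushort> search, ReadOnlySpan<char> values)
    {
        uint mask = 0;
        foreach (char value in values)
        {
            mask |= Vector128.Equals(search, Vector128.Create((ushort)value))
                             .AsByte()
                             .ExtractMostSignificantBits();
        }
        return mask;
    }

    // The comparison vectors are OR-ed first, so the mask is extracted only once.
    public static uint MaskOnce(Vector128<ushort> search, ReadOnlySpan<char> values)
    {
        Vector128<ushort> combined = Vector128<ushort>.Zero;
        foreach (char value in values)
        {
            combined |= Vector128.Equals(search, Vector128.Create((ushort)value));
        }
        return combined.AsByte().ExtractMostSignificantBits();
    }
}
```

Both methods return the same mask; the second one simply replaces n extractions with n - 1 vector ORs and a single extraction, which is the saving described above.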

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.7.22377.5
  [Host]     : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT AVX2
  Job-VHNRBA : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-DJIUDN : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2

Method	Toolchain	Size	Mean	Ratio
IndexOfAnyTwoValues	\PR\corerun.exe	512	11.49 ns	1.00
IndexOfAnyTwoValues	\baseline\corerun.exe	512	11.48 ns	1.00

IndexOfAnyThreeValues	\PR\corerun.exe	512	11.70 ns	0.92
IndexOfAnyThreeValues	\baseline\corerun.exe	512	12.75 ns	1.00

IndexOfAnyFourValues	\PR\corerun.exe	512	16.42 ns	0.70
IndexOfAnyFourValues	\baseline\corerun.exe	512	23.39 ns	1.00

IndexOfAnyFiveValues	\PR\corerun.exe	512	18.89 ns	0.63
IndexOfAnyFiveValues	\baseline\corerun.exe	512	29.86 ns	1.00

AVX

The improvements for IndexOfAnyFourValues and IndexOfAnyFiveValues come from performing just one ExtractMostSignificantBits operation instead of n MoveMask operations. The 5% regression for IndexOfAnyTwoValues is just alignment (I can't repro it on other PCs).

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 10 (10.0.18363.2212/1909/November2019Update/19H2)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-rc.1.22405.1
  [Host]     : .NET 7.0.0 (7.0.22.40308), X64 RyuJIT AVX2
  Job-KFFJJI : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX
  Job-AQUTGK : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX
  
EnvironmentVariables=COMPlus_EnableAVX2=0

Method	Toolchain	Size	Mean	Ratio
IndexOfAnyTwoValues	\PR\corerun.exe	512	26.03 ns	1.05
IndexOfAnyTwoValues	\baseline\corerun.exe	512	24.92 ns	1.00

IndexOfAnyThreeValues	\PR\corerun.exe	512	31.29 ns	0.99
IndexOfAnyThreeValues	\baseline\corerun.exe	512	31.70 ns	1.00

IndexOfAnyFourValues	\PR\corerun.exe	512	39.18 ns	0.76
IndexOfAnyFourValues	\baseline\corerun.exe	512	51.67 ns	1.00

IndexOfAnyFiveValues	\PR\corerun.exe	512	47.45 ns	0.76
IndexOfAnyFiveValues	\baseline\corerun.exe	512	62.57 ns	1.00

arm64

We observe very nice gains (25-33%) as these particular methods were previously vectorized using Vector<T> APIs.

BenchmarkDotNet=v0.13.1.1828-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rc.1.22405.1
  [Host]     : .NET 7.0.0 (7.0.22.40308), Arm64 RyuJIT AdvSIMD
  Job-OMLHKF : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-ZTKIBV : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD

Method	Toolchain	Size	Ratio
IndexOfAnyTwoValues	/PR/corerun	512	0.66
IndexOfAnyTwoValues	/main/corerun	512	1.00

IndexOfAnyThreeValues	/PR/corerun	512	0.69
IndexOfAnyThreeValues	/main/corerun	512	1.00

IndexOfAnyFourValues	/PR/corerun	512	0.75
IndexOfAnyFourValues	/main/corerun	512	1.00

IndexOfAnyFiveValues	/PR/corerun	512	0.75
IndexOfAnyFiveValues	/main/corerun	512	1.00

Contributes to #64451

ghost · 2022-08-05T15:47:43Z

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Author:	adamsitnik
Assignees:	-
Labels:	`area-System.Memory`, `tenet-performance`
Milestone:	7.0.0

stephentoub · 2022-08-05T15:49:30Z

I expect these improvements will measurably move the needle for several regex workloads. Very nice.

adamsitnik · 2022-08-10T13:11:47Z

@lambdageek The WASM AOT CI leg has failed with a new error:

[12:29:19] info: RuntimeError: function signature mismatch
[12:29:19] info:     at corlib_string_IndexOfAny_char__ (<anonymous>:wasm-function[24898]:0x5a015a

Is it a different symptom of the linker bug (#73474) or a new issue?

tannergooding · 2022-08-10T16:01:05Z

src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs

@@ -954,7 +944,7 @@ public static unsafe int IndexOfAny(ref char searchStart, char value0, char valu

 VectorCompare:
 // We include the Supported check again here even though path will not be taken, so the asm isn't generated if not supported.
- if (!Sse2.IsSupported && Vector.IsHardwareAccelerated)
+ if (!Vector128.IsHardwareAccelerated && Vector.IsHardwareAccelerated)


Do we have any platforms which accelerate Vector<T> but not Vector128<T>?

This certainly doesn't exist for RyuJIT and I don't think that's true for Mono anymore either (CC. @fanyang-mono?)

If that's the case, we can probably just remove this path entirely.

adamsitnik · 2022-08-11

> If that's the case, we can probably just remove this path entirely.

Awesome, then once all my SpanHelpers PRs get merged I am going to send a new one that removes all Vector<T> code paths for the methods that are supporting Vector128 now.

fanyang-mono · 2022-08-11

We do accelerate both Vector<T> and Vector128<T> on both Arm64 and Amd64. On Arm64, the intrinsics support for their methods is complete now. However, it is not on Amd64. This issue tracks the progress (#66392).

stephentoub · 2022-08-11T18:13:37Z

src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs

+ matches = (Vector256.Equals(values0, search) | Vector256.Equals(values1, search))
+ .AsByte().ExtractMostSignificantBits();
+
+ // Note that ExtractMostSignificantBits has converted the equal vector elements into a set of bit flags,
 // So the bit position in 'matches' corresponds to the element offset.
 if (matches == 0)


For my edification, does using .AsByte().ExtractMostSignificantBits() and then doing == 0 produce more efficient code than using == Vector256<ushort>.Zero? We do the latter on the Vector128 code path, and I'm wondering why this one differs.

adamsitnik · 2022-08-13

> We do the latter on the Vector128 code path, and I'm wondering why this one differs.

ExtractMostSignificantBits() is cheap on x64 but expensive for arm64. Vector128 includes arm64, so we delay this operation until it's needed for Vector128. We don't need to do this for Vector256, which currently is supported only by x64.

I verified that with an experiment in another PR (Tanner asked for that as well).
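As a rough illustration of the "delay the extraction" approach for the Vector128 path, the sketch below checks the comparison result against `Vector128<ushort>.Zero` first and calls `ExtractMostSignificantBits()` only once a match is known to exist. It is an assumption-laden example rather than the code from this PR: `DelayedMaskSketch` and `IndexOfAnyTwo` are hypothetical names, and it uses safe span-based loads instead of the `ref`-based loads in SpanHelpers.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class DelayedMaskSketch
{
    public static int IndexOfAnyTwo(ReadOnlySpan<char> text, char value0, char value1)
    {
        ReadOnlySpan<ushort> data = MemoryMarshal.Cast<char, ushort>(text);
        Vector128<ushort> values0 = Vector128.Create((ushort)value0);
        Vector128<ushort> values1 = Vector128.Create((ushort)value1);

        int i = 0;
        if (Vector128.IsHardwareAccelerated)
        {
            for (; i <= data.Length - Vector128<ushort>.Count; i += Vector128<ushort>.Count)
            {
                Vector128<ushort> search = Vector128.Create(data.Slice(i, Vector128<ushort>.Count));
                Vector128<ushort> equals = Vector128.Equals(values0, search) | Vector128.Equals(values1, search);

                // Cheap "any match?" test; no mask is extracted for blocks without a match.
                if (equals == Vector128<ushort>.Zero)
                {
                    continue;
                }

                // A match exists: extract the byte mask once. Each char contributes
                // two bits, so the bit offset is divided by sizeof(char).
                uint matches = equals.AsByte().ExtractMostSignificantBits();
                return i + BitOperations.TrailingZeroCount(matches) / sizeof(char);
            }
        }

        // Scalar loop for the remaining elements (or for everything without SIMD).
        for (; i < text.Length; i++)
        {
            if (text[i] == value0 || text[i] == value1)
            {
                return i;
            }
        }
        return -1;
    }
}
```

On the Vector256 path, which per the explanation above is currently x64-only, the equivalent loop could extract the mask unconditionally and compare it with 0, as the diff under review does.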

stephentoub · 2022-08-13

Thanks. And for the first part of the question:

> does using .AsByte().ExtractMostSignificantBits() and then doing == 0 produce more efficient code than using == Vector256<ushort>.Zero?

Should I take it from your answer that the answer is "yes"?

lambdageek · 2022-08-11T23:01:22Z

> @lambdageek The WASM AOT CI leg has failed with a new error:
> [12:29:19] info: RuntimeError: function signature mismatch
> [12:29:19] info:     at corlib_string_IndexOfAny_char__ (<anonymous>:wasm-function[24898]:0x5a015a
> Is it a different symptom of the linker bug (#73474) or a new issue?

It looks like a different issue - it's this call to String.IndexOfAny(char[]):

runtime/src/mono/wasm/host/Options.cs

Line 573 in 37235c4

int end = name.IndexOfAny(NameTerminator);

and it looks like it ends up calling something with an unexpected native signature (perhaps the name mangling in the AOT compiler collided?) (for wasm this would most likely mean wrong number of arguments)

I'll try to repro locally

/cc @vargaz

adamsitnik · 2022-08-12T07:26:10Z

@lambdageek @vargaz The WASM AOT CI leg has failed with a new error:

[12:29:19] info: RuntimeError: function signature mismatch
[12:29:19] info:     at corlib_string_IndexOfAny_char__ (<anonymous>:wasm-function[24898]:0x5a015a

Is it a different symptom of the linker bug (#73474) or a new issue?

lambdageek · 2022-08-12T15:41:19Z

> @lambdageek @vargaz The WASM AOT CI leg has failed with a new error:
> [12:29:19] info: RuntimeError: function signature mismatch
> [12:29:19] info:     at corlib_string_IndexOfAny_char__ (<anonymous>:wasm-function[24898]:0x5a015a
> Is it a different symptom of the linker bug (#73474) or a new issue?

@adamsitnik I don't think it's the linker bug. It's something else.

Doesn't reproduce locally for me yet - the System.Runtime testsuite passes on wasm+AOT. I'm going to keep trying - maybe I'm merging with an older main.

lambdageek · 2022-08-12T19:23:25Z

>> Is it a different symptom of the linker bug (#73474) or a new issue?

> @adamsitnik I don't think it's the linker bug. It's something else.

> Doesn't reproduce locally for me yet - the System.Runtime testsuite passes on wasm+AOT. I'm going to keep trying - maybe I'm merging with an older main.

Ok, it repros locally. Interestingly the version of v8 that I have on my machine gives a slightly better error message:

RuntimeError: null function or function signature mismatch
  info:     at corlib_string_IndexOfAny_char__ (wasm://wasm/1e51fc5a:wasm-function[102581]:0x169999c)

which makes me suspect that it's the same linker issue and the AOT compiler is skipping the method. investigating...

lambdageek · 2022-08-12T19:35:56Z

>>> Is it a different symptom of the linker bug (#73474) or a new issue?

>> @adamsitnik I don't think it's the linker bug. It's something else.
>> Doesn't reproduce locally for me yet - the System.Runtime testsuite passes on wasm+AOT. I'm going to keep trying - maybe I'm merging with an older main.

> Ok, it repros locally. Interestingly the version of v8 that I have on my machine gives a slightly better error message:
> RuntimeError: null function or function signature mismatch
>   info:     at corlib_string_IndexOfAny_char__ (wasm://wasm/1e51fc5a:wasm-function[102581]:0x169999c)
> which makes me suspect that it's the same linker issue and the AOT compiler is skipping the method. investigating...

Yeah, it's the same IL pattern, @adamsitnik. If I change the last

        VectorCompare:
            // We include the Supported check again here even though path will not be taken, so the asm isn't generated if not supported.
            if (!Vector128.IsHardwareAccelerated && Vector.IsHardwareAccelerated)
            {

to

        VectorCompare:
            // We include the Supported check again here even though path will not be taken, so the asm isn't generated if not supported.
            Debug.Assert (!Vector128.IsHardwareAccelerated && Vector.IsHardwareAccelerated);
            {

and remove the unreachable code at the end, the tests pass.

So we're waiting on dotnet/linker#2966 to flow to the runtime

stephentoub · 2022-08-12T19:38:59Z

> So we're waiting on dotnet/linker#2966 to flow to the runtime

When do we expect that to happen automatically? If not today, can we manually update to a newer version?

lambdageek · 2022-08-12T20:22:28Z

Update #73865

>> So we're waiting on dotnet/linker#2966 to flow to the runtime

> When do we expect that to happen automatically? If not today, can we manually update to a newer version?

I just manually triggered it. (assuming my darc-fu is sound).

lambdageek · 2022-08-13T01:12:21Z

The System.Runtime tests with wasm AOT are passing for me locally after merging with e94f7ce

adamsitnik · 2022-08-13T11:50:52Z

@lambdageek big thanks for all your help!

I am going to merge this PR as soon as the CI gets green. I'll then try to get #73876 merged and hopefully solve all the merge conflicts for #73768 and maybe get it merged for 7 before we snap.

kunalspathak · 2022-08-16T16:44:04Z

Improvements

[Perf] Windows 10.0.18362/x64 : Improvement on 8/13/2022 7:54:18 PM perf-autofiling-issues#7344
[Perf] Windows 10.0.18362/x86 : Improvement on 8/13/2022 7:54:18 PM perf-autofiling-issues#7330
[Perf] Windows 10.0.18362/x86 : Improvement on 8/11/2022 5:43:50 AM perf-autofiling-issues#7324

adamsitnik added 7 commits August 5, 2022 15:48

port SpanHelpers.IndexOfAny(ref char, char, char, int) to Vector128/256

697e3b6

port SpanHelpers.IndexOfAny(ref char, char, char, char, int)

33e9022

SpanHelpers.IndexOfAny(ref char, char, char, char, char, int)

3191968

SpanHelpers.IndexOfAny(ref char, char, char, char, char, char, int)

43e6a5e

polishing

91581df

perform more than 3 Or at once, observe a nice perf boost

5c67de1

perform more than 3 Or at once, observe perf boost on ARM64

fba7d16

adamsitnik added the area-System.Memory and tenet-performance labels Aug 5, 2022

adamsitnik added this to the 7.0.0 milestone Aug 5, 2022

adamsitnik requested review from EgorBo, stephentoub and tannergooding August 5, 2022 15:47

ghost assigned adamsitnik Aug 5, 2022

adamsitnik mentioned this pull request Aug 5, 2022

Switch from direct intrinsics usage to Vector/Vector64/Vector128/Vector256 #64451

Open

75 tasks

This was referenced Aug 5, 2022

Infra improvements for Helix #68176

Closed

GC/API/GC/GetGCMemoryInfo/GetGCMemoryInfo.sh test failing intermittently on CoreCLR Linux ARM32 #73247

Closed

Merge branch 'dotnet:main' into indexOfAnyChar

7bd44b5

tannergooding reviewed Aug 10, 2022

tannergooding approved these changes Aug 10, 2022

stephentoub reviewed Aug 11, 2022

stephentoub approved these changes Aug 11, 2022

stephentoub mentioned this pull request Aug 11, 2022

Add intrinsic for SpanHelpers.Char.IndexOfAny on AArch64 #73788

Closed

Merge remote-tracking branch 'upstream/main' into indexOfAnyChar

9cf1956

adamsitnik merged commit 27b18b1 into dotnet:main Aug 13, 2022

ghost locked as resolved and limited conversation to collaborators Sep 15, 2022

adamsitnik commented Aug 5, 2022

ghost commented Aug 5, 2022

stephentoub commented Aug 5, 2022

adamsitnik commented Aug 10, 2022

tannergooding Aug 10, 2022

adamsitnik Aug 11, 2022

fanyang-mono Aug 11, 2022

stephentoub Aug 11, 2022 (edited)

adamsitnik Aug 13, 2022

stephentoub Aug 13, 2022 (edited)

lambdageek commented Aug 11, 2022

adamsitnik commented Aug 12, 2022

lambdageek commented Aug 12, 2022

lambdageek commented Aug 12, 2022

lambdageek commented Aug 12, 2022

stephentoub commented Aug 12, 2022

lambdageek commented Aug 12, 2022 (edited)

lambdageek commented Aug 13, 2022

adamsitnik commented Aug 13, 2022

kunalspathak commented Aug 16, 2022 (edited)
