Port IndexOfAny(ref char, char[1-5], int) to Vector128/256 #73469

Merged
merged 9 commits into from
Aug 13, 2022

adamsitnik
Member

x64

AVX2

The improvements for IndexOfAnyFourValues and IndexOfAnyFiveValues come from performing just one ExtractMostSignificantBits operation instead of n MoveMask operations.

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.7.22377.5
  [Host]     : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT AVX2
  Job-VHNRBA : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-DJIUDN : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
| Method | Toolchain | Size | Mean | Ratio |
|---|---|---|---|---|
| IndexOfAnyTwoValues | \PR\corerun.exe | 512 | 11.49 ns | 1.00 |
| IndexOfAnyTwoValues | \baseline\corerun.exe | 512 | 11.48 ns | 1.00 |
| IndexOfAnyThreeValues | \PR\corerun.exe | 512 | 11.70 ns | 0.92 |
| IndexOfAnyThreeValues | \baseline\corerun.exe | 512 | 12.75 ns | 1.00 |
| IndexOfAnyFourValues | \PR\corerun.exe | 512 | 16.42 ns | 0.70 |
| IndexOfAnyFourValues | \baseline\corerun.exe | 512 | 23.39 ns | 1.00 |
| IndexOfAnyFiveValues | \PR\corerun.exe | 512 | 18.89 ns | 0.63 |
| IndexOfAnyFiveValues | \baseline\corerun.exe | 512 | 29.86 ns | 1.00 |
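
To illustrate the point about a single ExtractMostSignificantBits call, here is a minimal sketch. It is not the code from this PR. It ORs the four comparison results together in the vector domain and extracts the bitmask once. The class and method names are hypothetical, and for brevity the helper inspects only the first `Vector256<ushort>.Count` (16) chars instead of looping over the whole buffer.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Hypothetical illustration type; not part of SpanHelpers.
static class SingleMaskSketch
{
    // Returns the index of the first of four values within the first 16 chars, or -1.
    public static int IndexOfAnyOfFourInFirstBlock(ReadOnlySpan<char> text, char v0, char v1, char v2, char v3)
    {
        if (text.Length < Vector256<ushort>.Count)
            throw new ArgumentException("Need at least 16 chars.", nameof(text));

        // Reinterpret the chars as ushorts so they can be loaded into a vector.
        ref ushort start = ref MemoryMarshal.GetReference(MemoryMarshal.Cast<char, ushort>(text));
        Vector256<ushort> search = Vector256.LoadUnsafe(ref start);

        // Combine all four comparisons while still in the vector domain.
        Vector256<ushort> equal =
            Vector256.Equals(search, Vector256.Create((ushort)v0)) |
            Vector256.Equals(search, Vector256.Create((ushort)v1)) |
            Vector256.Equals(search, Vector256.Create((ushort)v2)) |
            Vector256.Equals(search, Vector256.Create((ushort)v3));

        // One mask extraction for the combined result, rather than one per value.
        uint matches = equal.AsByte().ExtractMostSignificantBits();
        if (matches == 0)
            return -1;

        // The mask is per byte, so each matching char contributes two bits.
        return BitOperations.TrailingZeroCount(matches) / sizeof(char);
    }
}
```

Because the mask is taken over bytes, every matching char sets two adjacent bits, which is why the bit position is divided by `sizeof(char)`.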

AVX

The improvements for IndexOfAnyFourValues and IndexOfAnyFiveValues come from performing just one ExtractMostSignificantBits operation instead of n MoveMask operations. The 5% regression for IndexOfAnyTwoValues is an alignment artifact (I can't repro it on other PCs).

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 10 (10.0.18363.2212/1909/November2019Update/19H2)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-rc.1.22405.1
  [Host]     : .NET 7.0.0 (7.0.22.40308), X64 RyuJIT AVX2
  Job-KFFJJI : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX
  Job-AQUTGK : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX
  
EnvironmentVariables=COMPlus_EnableAVX2=0
| Method | Toolchain | Size | Mean | Ratio |
|---|---|---|---|---|
| IndexOfAnyTwoValues | \PR\corerun.exe | 512 | 26.03 ns | 1.05 |
| IndexOfAnyTwoValues | \baseline\corerun.exe | 512 | 24.92 ns | 1.00 |
| IndexOfAnyThreeValues | \PR\corerun.exe | 512 | 31.29 ns | 0.99 |
| IndexOfAnyThreeValues | \baseline\corerun.exe | 512 | 31.70 ns | 1.00 |
| IndexOfAnyFourValues | \PR\corerun.exe | 512 | 39.18 ns | 0.76 |
| IndexOfAnyFourValues | \baseline\corerun.exe | 512 | 51.67 ns | 1.00 |
| IndexOfAnyFiveValues | \PR\corerun.exe | 512 | 47.45 ns | 0.76 |
| IndexOfAnyFiveValues | \baseline\corerun.exe | 512 | 62.57 ns | 1.00 |

arm64

We observe very nice gains (25-33%) as these particular methods were previously vectorized using Vector<T> APIs.

BenchmarkDotNet=v0.13.1.1828-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rc.1.22405.1
  [Host]     : .NET 7.0.0 (7.0.22.40308), Arm64 RyuJIT AdvSIMD
  Job-OMLHKF : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-ZTKIBV : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
| Method | Toolchain | Size | Ratio |
|---|---|---|---|
| IndexOfAnyTwoValues | /PR/corerun | 512 | 0.66 |
| IndexOfAnyTwoValues | /main/corerun | 512 | 1.00 |
| IndexOfAnyThreeValues | /PR/corerun | 512 | 0.69 |
| IndexOfAnyThreeValues | /main/corerun | 512 | 1.00 |
| IndexOfAnyFourValues | /PR/corerun | 512 | 0.75 |
| IndexOfAnyFourValues | /main/corerun | 512 | 1.00 |
| IndexOfAnyFiveValues | /PR/corerun | 512 | 0.75 |
| IndexOfAnyFiveValues | /main/corerun | 512 | 1.00 |

Contributes to #64451

@adamsitnik adamsitnik added area-System.Memory tenet-performance Performance related issue labels Aug 5, 2022
@adamsitnik adamsitnik added this to the 7.0.0 milestone Aug 5, 2022
@ghost

ghost commented Aug 5, 2022

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: adamsitnik
Assignees: -
Labels: area-System.Memory, tenet-performance

Milestone: 7.0.0

@stephentoub
Member

I expect these improvements will measurably move the needle for several regex workloads. Very nice.

@adamsitnik
Member Author

@lambdageek The WASM AOT CI leg has failed with a new error:

[12:29:19] info: RuntimeError: function signature mismatch
[12:29:19] info:     at corlib_string_IndexOfAny_char__ (<anonymous>:wasm-function[24898]:0x5a015a

Is it a different symptom of the linker bug (#73474) or a new issue?

@@ -954,7 +944,7 @@ public static unsafe int IndexOfAny(ref char searchStart, char value0, char valu

VectorCompare:
// We include the Supported check again here even though path will not be taken, so the asm isn't generated if not supported.
- if (!Sse2.IsSupported && Vector.IsHardwareAccelerated)
+ if (!Vector128.IsHardwareAccelerated && Vector.IsHardwareAccelerated)
Member

Do we have any platforms which accelerate Vector<T> but not Vector128<T>?

This certainly doesn't exist for RyuJIT and I don't think that's true for Mono anymore either (CC. @fanyang-mono?)

If that's the case, we can probably just remove this path entirely.

Member Author

> If that's the case, we can probably just remove this path entirely.

Awesome, then once all my SpanHelpers PRs get merged I am going to send a new one that removes all Vector<T> code paths for the methods that now support Vector128.

Member

We do accelerate both Vector<T> and Vector128<T> on both Arm64 and Amd64. On Arm64, the intrinsics support for their methods is complete now. However, it is not on Amd64. This issue tracks the progress (#66392).

matches = (Vector256.Equals(values0, search) | Vector256.Equals(values1, search))
.AsByte().ExtractMostSignificantBits();

// Note that ExtractMostSignificantBits has converted the equal vector elements into a set of bit flags,
// So the bit position in 'matches' corresponds to the element offset.
if (matches == 0)
Member

@stephentoub stephentoub Aug 11, 2022

For my edification, does using .AsByte().ExtractMostSignificantBits() and then doing == 0 produce more efficient code than using == Vector256<ushort>.Zero? We do the latter on the Vector128 code path, and I'm wondering why this one differs.

Member Author

> We do the latter on the Vector128 code path, and I'm wondering why this one differs.

ExtractMostSignificantBits() is cheap on x64 but expensive on arm64. Vector128 includes arm64, so on the Vector128 path we delay this operation until it's needed. We don't need to do this for Vector256, which is currently supported only on x64.

I've verified that with an experiment in another PR (Tanner asked for that as well).
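
The trade-off described above can be sketched as follows. This is an illustration under assumptions, not the PR's code: the hypothetical helper checks only the first `Vector128<ushort>.Count` (8) chars, tests for "no match" with a vector comparison against `Zero`, and calls ExtractMostSignificantBits only once a match is known to exist.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Hypothetical illustration type; not part of SpanHelpers.
static class DeferredMaskSketch
{
    // Returns the index of the first of two values within the first 8 chars, or -1.
    public static int IndexOfEitherInFirstBlock(ReadOnlySpan<char> text, char v0, char v1)
    {
        if (text.Length < Vector128<ushort>.Count)
            throw new ArgumentException("Need at least 8 chars.", nameof(text));

        ref ushort start = ref MemoryMarshal.GetReference(MemoryMarshal.Cast<char, ushort>(text));
        Vector128<ushort> search = Vector128.LoadUnsafe(ref start);

        Vector128<ushort> equal =
            Vector128.Equals(search, Vector128.Create((ushort)v0)) |
            Vector128.Equals(search, Vector128.Create((ushort)v1));

        // Common case: nothing matched, so no bitmask is extracted at all.
        if (equal == Vector128<ushort>.Zero)
            return -1;

        // Only now pay for the mask extraction, which the thread says is costly on arm64.
        uint matches = equal.AsByte().ExtractMostSignificantBits();
        return BitOperations.TrailingZeroCount(matches) / sizeof(char);
    }
}
```

In a real loop the early-out branch would advance to the next block instead of returning, so the mask extraction runs at most once per search.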

Member

@stephentoub stephentoub Aug 13, 2022

Thanks. And for the first part of the question:

> does using .AsByte().ExtractMostSignificantBits() and then doing == 0 produce more efficient code than using == Vector256<ushort>.Zero?

Should I take it from your answer that the answer is "yes"?

@lambdageek
Member

> @lambdageek The WASM AOT CI leg has failed with a new error:
>
> [12:29:19] info: RuntimeError: function signature mismatch
> [12:29:19] info:     at corlib_string_IndexOfAny_char__ (<anonymous>:wasm-function[24898]:0x5a015a
>
> Is it a different symptom of the linker bug (#73474) or a new issue?

It looks like a different issue - it's this call to String.IndexOfAny(char[]):

int end = name.IndexOfAny(NameTerminator);

and it looks like it ends up calling something with an unexpected native signature (perhaps the name mangling in the AOT compiler collided?) (for wasm this would most likely mean wrong number of arguments)

I'll try to repro locally

/cc @vargaz

@adamsitnik
Member Author

@lambdageek @vargaz The WASM AOT CI leg has failed with a new error:

[12:29:19] info: RuntimeError: function signature mismatch
[12:29:19] info:     at corlib_string_IndexOfAny_char__ (<anonymous>:wasm-function[24898]:0x5a015a

Is it a different symptom of the linker bug (#73474) or a new issue?

@lambdageek
Member

> @lambdageek @vargaz The WASM AOT CI leg has failed with a new error:
>
> [12:29:19] info: RuntimeError: function signature mismatch
> [12:29:19] info:     at corlib_string_IndexOfAny_char__ (<anonymous>:wasm-function[24898]:0x5a015a
>
> Is it a different symptom of the linker bug (#73474) or a new issue?

@adamsitnik I don't think it's the linker bug. It's something else.

Doesn't reproduce locally for me yet - the System.Runtime test suite passes on wasm+AOT. I'm going to keep trying - maybe I'm merging with an older main.

@lambdageek
Member

> > Is it a different symptom of the linker bug (#73474) or a new issue?
>
> @adamsitnik I don't think it's the linker bug. It's something else.
>
> Doesn't reproduce locally for me yet - the System.Runtime test suite passes on wasm+AOT. I'm going to keep trying - maybe I'm merging with an older main.

Ok, it repros locally. Interestingly the version of v8 that I have on my machine gives a slightly better error message:

RuntimeError: null function or function signature mismatch
  info:     at corlib_string_IndexOfAny_char__ (wasm://wasm/1e51fc5a:wasm-function[102581]:0x169999c)

which makes me suspect that it's the same linker issue and the AOT compiler is skipping the method. investigating...

@lambdageek
Member

> > > Is it a different symptom of the linker bug (#73474) or a new issue?
>
> > @adamsitnik I don't think it's the linker bug. It's something else.
> > Doesn't reproduce locally for me yet - the System.Runtime test suite passes on wasm+AOT. I'm going to keep trying - maybe I'm merging with an older main.
>
> Ok, it repros locally. Interestingly the version of v8 that I have on my machine gives a slightly better error message:
>
> RuntimeError: null function or function signature mismatch
>   info:     at corlib_string_IndexOfAny_char__ (wasm://wasm/1e51fc5a:wasm-function[102581]:0x169999c)
>
> which makes me suspect that it's the same linker issue and the AOT compiler is skipping the method. investigating...

Yeah, it's the same IL pattern, @adamsitnik. If I change the last

        VectorCompare:
            // We include the Supported check again here even though path will not be taken, so the asm isn't generated if not supported.
            if (!Vector128.IsHardwareAccelerated && Vector.IsHardwareAccelerated)
            {

to

        VectorCompare:
            // We include the Supported check again here even though path will not be taken, so the asm isn't generated if not supported.
            Debug.Assert (!Vector128.IsHardwareAccelerated && Vector.IsHardwareAccelerated);
            {

and remove the unreachable code at the end, the tests pass.

So we're waiting on dotnet/linker#2966 to flow to the runtime

@stephentoub
Member

> So we're waiting on dotnet/linker#2966 to flow to the runtime

When do we expect that to happen automatically? If not today, can we manually update to a newer version?

@lambdageek
Member

lambdageek commented Aug 12, 2022

Update #73865

> > So we're waiting on dotnet/linker#2966 to flow to the runtime
>
> When do we expect that to happen automatically? If not today, can we manually update to a newer version?

I just manually triggered it. (assuming my darc-fu is sound).

@lambdageek
Member

The System.Runtime tests with wasm AOT are passing for me locally after merging with e94f7ce

@adamsitnik
Member Author

@lambdageek big thanks for all your help!

I am going to merge this PR as soon as the CI gets green. I'll then try to get #73876 merged and hopefully solve all the merge conflicts for #73768 and maybe get it merged for 7 before we snap.

@adamsitnik adamsitnik merged commit 27b18b1 into dotnet:main Aug 13, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Sep 15, 2022
6 participants