-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Port IndexOfAny(ref char, char[1-5], int) to Vector128/256 #73469
Conversation
Tagging subscribers to this area: @dotnet/area-system-memory Issue Detailsx64AVX2The improvements for BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.7.22377.5
[Host] : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT AVX2
Job-VHNRBA : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-DJIUDN : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
AVXThe improvements for BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 10 (10.0.18363.2212/1909/November2019Update/19H2)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-rc.1.22405.1
[Host] : .NET 7.0.0 (7.0.22.40308), X64 RyuJIT AVX2
Job-KFFJJI : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX
Job-AQUTGK : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX
EnvironmentVariables=COMPlus_EnableAVX2=0
arm64We observe very nice gains (25-33%) as these particular methods were previously vectorized using BenchmarkDotNet=v0.13.1.1828-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rc.1.22405.1
[Host] : .NET 7.0.0 (7.0.22.40308), Arm64 RyuJIT AdvSIMD
Job-OMLHKF : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Job-ZTKIBV : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Contributes to #64451
|
I expect these improvements will measurably move the needle for several regex workloads. Very nice. |
@lambdageek The WASM AOT CI leg has failed with a new error:
Is it a different symptom of the linker bug(#73474) or a new issue? |
@@ -954,7 +944,7 @@ public static unsafe int IndexOfAny(ref char searchStart, char value0, char valu | |||
|
|||
VectorCompare: | |||
// We include the Supported check again here even though path will not be taken, so the asm isn't generated if not supported. | |||
if (!Sse2.IsSupported && Vector.IsHardwareAccelerated) | |||
if (!Vector128.IsHardwareAccelerated && Vector.IsHardwareAccelerated) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have any platforms which accelerate Vector<T>
but not Vector128<T>
?
This certainly doesn't exist for RyuJIT and I don't think that's true for Mono anymore either (CC. @fanyang-mono?)
If that's the case, we can probably just remove this path entirely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that's the case, we can probably just remove this path entirely.
Awesome, then once all my SpanHelpers PRs get merged I am going to send a new one that removes all Vector<T>
code paths for the methods that are supporting Vector128
now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do accelerate both Vector<T>
and Vector128<T>
on both Arm64 and Amd64. On Arm64, the intrinsics support for their methods is complete now. However, it is not on Amd64. This issue tracks the progress (#66392).
matches = (Vector256.Equals(values0, search) | Vector256.Equals(values1, search)) | ||
.AsByte().ExtractMostSignificantBits(); | ||
|
||
// Note that ExtractMostSignificantBits has converted the equal vector elements into a set of bit flags, | ||
// So the bit position in 'matches' corresponds to the element offset. | ||
if (matches == 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my edification, does using .AsByte().ExtractMostSignificantBits()
and the doing == 0
produce more efficient code than using == Vector256<ushort>.Zero
? We do the latter on the Vector128
code path, and I'm wondering why this one differs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do the latter on the Vector128 code path, and I'm wondering why this one differs.
ExtractMostSignificantBits()
is cheap on x64 but expensive for arm64. Vector128
includes arm64, so we delay this operation until it's needed for Vector128
. We don't need to do this for Vector256
, which currently is supported only by x64.
I've verified that with some experiment in other PR (Tanner asked for that as well)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. And for the first part of the question:
does using .AsByte().ExtractMostSignificantBits() and then doing == 0 produce more efficient code than using == Vector256.Zero?
Should I take it from your answer that the answer is "yes"?
It looks like a different issue - it's this call to runtime/src/mono/wasm/host/Options.cs Line 573 in 37235c4
and it looks like it ends up calling something with an unexpected native signature (perhaps the name mangling in the AOT compiler collided?) (for wasm this would most likely mean wrong number of arguments) I'll try to repro locally /cc @vargaz |
@lambdageek @vargaz The WASM AOT CI leg has failed with a new error:
Is it a different symptom of the linker bug (#73474) or a new issue? |
@adamsitnik I don't think it's the linker bug. It's something else. Doesn't reproduce locally for me yet - the System.Runtime testsuite passes on wasm+AOT. I'm going to keep trying - maybe i'm merging with an older |
Ok, it repros locally. Interestingly the version of v8 that I have on my machine gives a slightly better error message:
which makes me suspect that it's the same linker issue and the AOT compiler is skipping the method. investigating... |
Yea. it's the same IL pattern @adamsitnik. if I change the last VectorCompare:
// We include the Supported check again here even though path will not be taken, so the asm isn't generated if not supported.
if (!Vector128.IsHardwareAccelerated && Vector.IsHardwareAccelerated)
{ to VectorCompare:
// We include the Supported check again here even though path will not be taken, so the asm isn't generated if not supported.
Debug.Assert (!Vector128.IsHardwareAccelerated && Vector.IsHardwareAccelerated);
{ and remove the unreachable code at the end, the tests pass. So we're waiting on dotnet/linker#2966 to flow to the runtime |
When do we expect that to happen automatically? If not today, can we manually update to a newer version? |
Update #73865
I just manually triggered it. (assuming my darc-fu is sound). |
The System.Runtime tests with wasm AOT are passing for me locally after merging with e94f7ce |
@lambdageek big thanks for all your help! I am going to merge this PR as soon as the CI gets green. I'll then try to get #73876 merged and hopefully solve all the merge conflicts for #73768 and maybe get it merged for 7 before we snap. |
Improvements |
x64
AVX2
The improvements for
IndexOfAnyFourValues
andIndexOfAnyFiveValues
comes from performing just oneExtractMostSignificantBits
operation insteadn
MoveMask
AVX
The improvements for
IndexOfAnyFourValues
andIndexOfAnyFiveValues
comes from performing just oneExtractMostSignificantBits
operation insteadn
MoveMask
. The 5% regression forIndexOfAnyTwoValues
is just alignment (I can't repro it on other PCs).arm64
We observe very nice gains (25-33%) as these particular methods were previously vectorized using
Vector<T>
APIs.Contributes to #64451