Move more of the xplat hwintrinsic API implementation into managed code #103150

Closed
wants to merge 12 commits

Conversation

tannergooding (Member)

This moves the entirety of Vector<T> and a number of core operations from Vector64/128/256/512 into managed code.

jkotas (Member) commented Jun 7, 2024

Is this the direction that we want to take with primitive Vector* operations? It does not feel right looking at the amount of IL that these operations expand into with this change.

}
else if (sizeof(T) == 8)
{
    if (Avx512F.VL.IsSupported)
Member

I suspect that this is not going to work well for NAOT. The IsSupported properties are not constants for NAOT, they are dynamic feature checks. It works ok today since these properties are used relatively rarely and protect large chunks of code. It is not going to work well when they are used very frequently.

Member Author

They're only dynamic feature checks for ISAs at the same encoding level as the target. That is:

- If you target the baseline (x86-64-v1), then SSE3-SSE4.2 become dynamic checks, while Avx+ is statically false (due to requiring VEX encoding).
- If you target x86-64-v2, there should be no dynamic checks.
- If you target x86-64-v3, there should be no dynamic checks (Avx512F+ requires EVEX encoding).
- If you target x86-64-v4, you would get dynamic checks for Avx512Vbmi.

You could also target individual ISAs, but that's not typically recommended. There are then also some one-off ISAs not in the normal hierarchy, such as Aes and Pclmulqdq, but those are exceptionally rare. The same would be true of ISAs that are not part of a target baseline but which fit within a supported encoding, such as AvxVnni.

So given the current code here, I expect that NAOT will do just fine. There may be some concern for other areas, but that's likely something we want a feature or attribute to help handle anyway, given that the same concern applies to user code that writes small helper methods doing per-architecture dispatch.

For S.P.Corelib in particular, it's also worth noting that the trimmer is going to remove the if (AdvSimd.IsSupported) and similar paths when targeting xarch, so in many cases we're going to get down to just the 4 constant size checks.
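
To make that concrete, here is a minimal sketch (a hypothetical helper, not the actual corelib code) of the dispatch pattern under discussion and how its branches fold depending on the target:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

internal static class VectorHelpers
{
    // Hypothetical helper, not the corelib implementation. Under NAOT:
    //  * targeting x86-64-v1: Sse2.IsSupported folds to true, Sse41.IsSupported
    //    becomes a dynamic check, and the AdvSimd branch is trimmed away.
    //  * targeting x86-64-v2 (or higher): Sse41.IsSupported folds to true and
    //    the whole method collapses to a single instruction.
    //  * targeting arm64: the x86 branches are trimmed and AdvSimd folds to true.
    public static Vector128<int> Min(Vector128<int> left, Vector128<int> right)
    {
        if (Sse41.IsSupported)
        {
            return Sse41.Min(left, right);
        }
        else if (Sse2.IsSupported)
        {
            // No pminsd before SSE4.1: select per element via compare + and/andnot/or.
            Vector128<int> mask = Sse2.CompareGreaterThan(right, left);
            return Sse2.Or(Sse2.And(mask, left), Sse2.AndNot(mask, right));
        }
        else if (AdvSimd.IsSupported)
        {
            return AdvSimd.Min(left, right);
        }

        return Vector128.Min(left, right); // software fallback
    }
}
```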

Member

I agree that we are lucky that this works given other constraints for typical configurations. It does not work well in general.

Member Author

It does not work well in general.

Agreed. I think this is an area we're going to need to think about more in depth for NAOT when it comes time.

In many cases, the dynamic checks aren't worthwhile unless you have some form of code cloning or are executing more complex logic. Meanwhile, the pattern we tend to encourage for users is to write small helpers that isolate the functionality instead, as a way of improving readability and maintainability.

Perhaps that could be some bool IsSupportedDynamically { get; } (or IsSupportedOpportunistically?) that allows the developer to be explicit that they would like a dynamic check instead of an exact check.
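
As a rough illustration of that idea (the member names and shape below are purely hypothetical, not a proposed API):

```csharp
// Purely hypothetical shape for illustration; no such members exist today.
public abstract class SomeIsa
{
    // Today's behavior: treated as a compile-time constant under NAOT,
    // based on the compilation target baseline.
    public static bool IsSupported => false;

    // The idea above: always emitted as a runtime feature check, even when the
    // ISA is above the target baseline, so the developer opts into the dynamic
    // check (and its cost) explicitly.
    public static bool IsSupportedDynamically => false;
}
```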

tannergooding (Member Author)

Is this the direction that we want to take with primitive Vector* operations? It does not feel right looking at the amount of IL that these operations expand into with this change.

There have been requests for something along these lines from both the JIT and Mono teams for a while now.

The general consideration is that we currently have to duplicate the xplat acceleration across at least 4 different backends (RyuJIT, MonoLLVM, MonoJIT, and MonoInterpreter). This is error prone and it's fairly trivial for support and/or optimizations to not be done consistently.

Supporting this in the JIT tends to be quite a bit more complex and requires very special consideration of where/how the IR is created. We end up having to write significantly more code in the runtimes, since it requires manually creating the IR and documenting the rough C# equivalent it corresponds to. There have been multiple occasions where some subtle ordering rule was missed and nodes were accidentally reordered, leading to downstream bugs.

Having it in the JIT also ends up confusing downstream developers, since to most C# developers it is a bit of a black box: they cannot see what code is getting generated, cannot easily contribute fixes or improvements, etc.

All in all, having the runtime support be centered around the platform-specific intrinsics that must be handled independently, and the implementation of the xplat surface area be in managed code, helps things from multiple different angles. The biggest downside is that it requires more inlining to happen to get things right, and that's one of the reasons I had pushed back against us moving it into managed for so long. However, with where RyuJIT is today and with the recent improvements to handle ConstantExpected scenarios across inlining boundaries, we're at a point where it is feasible, hence why I've put up the first couple of PRs to experiment in this space.
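
As an example of the ConstantExpected point, here is a rough sketch (simplified, not the actual corelib code) of a managed wrapper over an imm8-taking instruction:

```csharp
using System.Diagnostics.CodeAnalysis;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static class ShuffleHelpers
{
    // Simplified sketch, not the corelib implementation. [ConstantExpected]
    // lets the constant-ness of 'control' flow across the inlining boundary,
    // so the Sse path can still encode it as an imm8.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector128<float> Shuffle(Vector128<float> value, [ConstantExpected] byte control)
    {
        if (Sse.IsSupported)
        {
            return Sse.Shuffle(value, value, control);
        }

        // Software fallback: expand the imm8 control into per-element indices.
        Vector128<int> indices = Vector128.Create(
            control & 3, (control >> 2) & 3, (control >> 4) & 3, (control >> 6) & 3);
        return Vector128.Shuffle(value, indices);
    }
}
```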

CC @EgorBo, @jakobbotsch, who may have additional thoughts on this from the RyuJIT side.

jkotas (Member) commented Jun 7, 2024

The general consideration is that we currently have to duplicate the xplat acceleration across at least 4 different backends (RyuJIT, MonoLLVM, MonoJIT, and MonoInterpreter). This is error prone and it's fairly trivial for support and/or optimizations to not be done consistently.

We have a problem with duplicating every other feature and optimization over different backends; this duplication is one of many. The real fix to this one (and other duplications) would be to switch over to RyuJIT for Mono targets.

tannergooding (Member Author)

The real fix to this one (and other duplications) would be to switch over to RyuJIT for Mono targets.

That's fair, but potentially many years out.

I think it is ultimately a tradeoff between the work the JIT has to do for common xplat API usage (in terms of inlining and other optimizations) and the ease with which the code can be versioned, maintained, and known to be correct (particularly when comparing the accelerated vs. the non-accelerated path, which shows up for direct vs. indirect invocation).

I'm fine with keeping this in the JIT if we think that is overall better, even with the other considerations. If we decided on that, I'd also opt for reverting #102301 and maintaining the status quo for all APIs (although there are a couple of managed-side fixes made to the fallback path that I'd need to extract out and keep).

jkotas (Member) commented Jun 7, 2024

I'd also opt for reverting #102301 and maintaining the status quo for all APIs

#102301 looks fine to me. The method implementations that #102301 replaced the intrinsic expansions with are typically trivial forwarders and they are less essential APIs. It would be useful to delete the Mono intrinsic expansions that match #102301 to prove that this is a viable path to reduce duplication between backends.

tannergooding (Member Author)

#102301 looks fine to me. The method implementations that #102301 replaced the intrinsic expansions with are typically trivial forwarders and they are less essential APIs. It would be useful to delete the Mono intrinsic expansions that match #102301 to prove that this is a viable path to reduce duplication between backends.

So then would you say that removing the JIT support for Vector2/3/4, Quaternion, and Plane is fine, but you'd keep the JIT implementation for Vector64/128/256/512<T>? What about Vector<T>?

Or are you saying you'd only remove the JIT support for places that are trivially forwarded? Those ones don't really save much in the JIT, because they share an execution path. That is, if we have to intrinsify op_Addition in the JIT, then intrinsifying Add makes sense as well because it's the same code path, so we avoid the inlining and improve throughput by just making the same switch case handle both.
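
For reference, the "trivial forwarder" shape being discussed looks roughly like this (a simplified sketch, not the exact corelib source):

```csharp
using System.Runtime.Intrinsics;

internal static class Vector128Forwarders
{
    // Simplified sketch of the forwarder shape being discussed; the real
    // methods carry additional attributes and documentation.
    public static Vector128<T> Add<T>(Vector128<T> left, Vector128<T> right)
        where T : struct
    {
        // Forwards to op_Addition, so both names surface the same intrinsic
        // code path in the JIT (or the forwarder simply inlines away).
        return left + right;
    }
}
```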

jkotas (Member) commented Jun 7, 2024

you'd keep the JIT implementation for Vector64/128/256/512<T>? What about Vector<T>?

I would keep all Vector64/128/256/512<T> and Vector<T> in the JIT. I see these as the most frequently used basic building blocks, and operations on them need to be cheap; they should not be eating various JIT budgets unnecessarily.

tannergooding (Member Author) commented Jun 7, 2024

I would keep all Vector64/128/256/512<T> and Vector<T> in the JIT.

Will close this and bring back a little bit of the handling from #102301 then; at least the As and get_Zero methods, since those are very common and no-ops. I could also bring back the named alternatives like BitwiseAnd, but those are less common than the operators and so less problematic, since they aren't as much "building blocks".

I'll focus on just getting rid of the remaining JIT logic for Vector2/3 instead, since those are not core helpers and can just be AsVector128(), op, AsVector2/3(), like Vector4/Quaternion/Plane were simplified down to.
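
For example (a sketch of that pattern; the actual changes may differ in detail):

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;

internal static class Vector3Helpers
{
    // Sketch of the "AsVector128(), op, AsVector3()" pattern described above:
    // the Vector3 operation is written in terms of the Vector128 building
    // blocks instead of having its own dedicated JIT expansion.
    public static Vector3 Add(Vector3 left, Vector3 right)
    {
        // AsVector128 widens to Vector128<float> with the upper element zeroed,
        // which is harmless for an element-wise add.
        return (left.AsVector128() + right.AsVector128()).AsVector3();
    }
}
```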

That will allow us to still remove some logic while keeping the basic building blocks efficiently handled in the JIT.

Thanks for the input!

jkotas (Member) commented Jun 7, 2024

delete the Mono intrinsic expansions that match #102301

Is anybody planning to look into that? (just curious)

tannergooding (Member Author)

Is anybody planning to look into that? (just curious)

It's on the backlog of things I'd like to look at. Testing and validating Mono tends to be a bit more work, so I had planned on doing the RyuJIT work first, leaving Mono intact, and then taking a look at Mono independently.

github-actions bot locked and limited conversation to collaborators on Jul 8, 2024