Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial support for zmm in .NET #80960

Merged
merged 15 commits into from
Mar 6, 2023

Conversation

DeepakRajendrakumaran
Copy link
Contributor

@DeepakRajendrakumaran DeepakRajendrakumaran commented Jan 21, 2023

This commit includes the following:

  1. Changes required to support zmm registers
  2. changes required to accelerate a subset of Vector512 methods - Create() and Zero using avx512
  3. Changes required to save/restore zmm registers when needed.

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Jan 21, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jan 21, 2023
@ghost
Copy link

ghost commented Jan 21, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

Issue Details

This commit includes the

  1. Changes required to support zmm registers
  2. changes required to accelerate a subset of Vector512 methods - Create() and Zero using avx512
  3. Changes required to save/restore zmm registers when needed.
Author: DeepakRajendrakumaran
Assignees: -
Labels:

area-CodeGen-coreclr, community-contribution

Milestone: -

@DeepakRajendrakumaran
Copy link
Contributor Author

DeepakRajendrakumaran commented Jan 21, 2023

@anthonycanino @tannergooding @BruceForstall

One open I have is regarding calculating 'cost'. See change here : 47a7424. I'm not sure how the heuristic works and if '+2' for the movup* makes sense

@teo-tsirpanis teo-tsirpanis added the avx512 Related to the AVX-512 architecture label Jan 21, 2023
@BruceForstall
Copy link
Member

One open I have is regarding calculating 'cost'. See change here : 47a7424. I'm not sure how the heuristic works and if '+2' for the movup* makes sense

I don't know about the cost details of the function you reference (maybe @AndyAyersMS could advise). However, don't SIMD64 and SIMD32 require similar kinds of codegen w.r.t. saving/restoring registers around a call?

I don't know if there are cases where SIMD64 should be given higher (more expensive) weights that SIMD32 just because there are twice as many bits in play (so presumably load/store is more expensive?).

Copy link
Member

@BruceForstall BruceForstall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like good progress.

src/coreclr/inc/corjit.h Outdated Show resolved Hide resolved
src/coreclr/jit/compiler.h Show resolved Hide resolved
src/coreclr/jit/emit.h Show resolved Hide resolved
src/coreclr/jit/gentree.cpp Outdated Show resolved Hide resolved
@@ -62,6 +62,7 @@ DEF_TP(SIMD8 ,"simd8" , TYP_SIMD8, TI_STRUCT, 8, 8, 8, 2, 8, VTF_S)
DEF_TP(SIMD12 ,"simd12" , TYP_SIMD12, TI_STRUCT,12,16, 16, 4,16, VTF_S)
DEF_TP(SIMD16 ,"simd16" , TYP_SIMD16, TI_STRUCT,16,16, 16, 4,16, VTF_S)
DEF_TP(SIMD32 ,"simd32" , TYP_SIMD32, TI_STRUCT,32,32, 32, 8,16, VTF_S)
DEF_TP(SIMD64 ,"simd64" , TYP_SIMD64, TI_STRUCT,64,64, 64, 16,16, VTF_S)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tannergooding It looks like TYP_SIMD32 is enabled for Arm64, and so are a lot of code paths related to it, e.g., in GenTreeVecCon. How much of the Arm64 compiler actually uses it, given that the maximum vector size is 16 bytes? I'm worried that TYP_SIMD64 will impose an even larger burden on Arm64 for no benefit.

Maybe there's not really much of a burden, except GenTreeVecCon is now a large size node, plus a bunch of unused code paths?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also a little surprising to me that SIMD32 alignment is 16, not 32, and SIMD64 is also 16, not 64, but the comment above indicates there's a reason for that (maybe it should point at code that changes the alignment dynamically).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don’t actually use it to my knowledge and any handling should be superfluous

it likely just was never ifdefd out when Arm64 was brought online

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not fully understand the question here

  1. Do you mean the types SIMD32 and SIMD64 should not be defined for Arm64 at all? The following made sense in my head since both arm and x86 use the same calls to create GenTreeVecCon etc and if GenTreeVecCon is using simd64Val now since it's the largest, we need to do the same here :
    https://github.com/dotnet/runtime/pull/80960/files#diff-547b75c8f8dd006693805c30d7a0e2ddcdd6b10691383dfbb16e3ed0c50f236aL1431-R1431 . When you say it doesn't need to be handled, do you mean the changes inside Lowering::LowerHWIntrinsicCreate above are not needed?
  2. Not sure about the alignment being 16 there. That was another of my opens

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just saying that TYP_SIMD32/64 paths aren't ever used on arm64 and so we could ifdef them out and only have them on xarch.

I don't remember all the things that the alignment: 16 impacts. I believe it primarily impacts stack alignment and I don't believe we have the necessary support to change it to 32/64 today.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @DeepakRajendrakumaran doesn't need to worry about the impact of TYP_SIMD64 on Arm64 too much, but we'll see what the throughput (TP) measurements say about the actual impact. Perhaps we do need to go back and make sure Arm64 isn't unduly impacted in TP or JIT code size by unused functionality.

One example: a lot of this PR is adjusting simd32_t to simd64_t in many places. Maybe there should be a "union" type of simd_t (or simd_data_t?) that is a union of simd64_t, simd32_t, simd16_t, etc. Except on Arm64 it wouldn't include simd32_t or simd64_t, since they aren't needed/used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One example: a lot of this PR is adjusting simd32_t to simd64_t in many places. Maybe there should be a "union" type of simd_t (or simd_data_t?) that is a union of simd64_t, simd32_t, simd16_t, etc. Except on Arm64 it wouldn't include simd32_t or simd64_t, since they aren't needed/used.

Each type is functionally that already.

That is, simd16_t is a union of each primitive type + 2x simd8_t
simd32_t is likewise of each primitive type + 4x simd8_t and 2x simd16_t

So by using simd64_t you have efficient access to just the simd8/16/32 data if necessary. The downside being that it declares 48 bytes of unused data on Arm64, and 32 bytes of unused data on hardware without AVX-512 support.

src/coreclr/jit/hwintrinsicxarch.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/lsra.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/valuenum.h Outdated Show resolved Hide resolved
@BruceForstall BruceForstall mentioned this pull request Jan 23, 2023
56 tasks
@DeepakRajendrakumaran
Copy link
Contributor Author

While we wait for @AndyAyersMS to chime in, another question is re JITEEVersionIdentifier . I assume this needs to be updated as well? No 'new' methods are added but some changes were made

@BruceForstall
Copy link
Member

another question is re JITEEVersionIdentifier

Yes, the change merits a JITEEVersion change.

However, I would not include that in the PR, if possible, until after all other testing has occurred. That's because it will make it impossible to run superpmi-diffs and superpmi-replay pipelines and see their results.

@BruceForstall
Copy link
Member

Given the asserts I see in superpmi pipelines:

[02:38:39] ISSUE: <ASSERT> #1 D:\a\_work\1\s\src\coreclr\jit\compiler.cpp (2294) - Assertion failed 'instructionSetFlags.Equals(EnsureInstructionSetFlagsAreValid(instructionSetFlags))' in 'System.Runtime.CompilerServices.CastHelpers:StelemRef(System.Array,long,System.Object)' during 'Pre-import' (IL size 88; hash 0xdcdda805; Optimization-Level-Not-Yet-Set)

Maybe the existing SuperPMI collections (using the old JIT-EE interface, e.g., InstructionSet* definitions) isn't going to work.

I wonder if there's some way you can "hack" the change so it basically works with the "old" JIT-EE interface, just so we can leverage the superpmi pipelines (with existing superpmi collections) to measure throughput impact, especially on Arm64 or x64 without AVX-512 enabled, as well as asmdiffs on those platforms as well.

@DeepakRajendrakumaran
Copy link
Contributor Author

Given the asserts I see in superpmi pipelines:

[02:38:39] ISSUE: <ASSERT> #1 D:\a\_work\1\s\src\coreclr\jit\compiler.cpp (2294) - Assertion failed 'instructionSetFlags.Equals(EnsureInstructionSetFlagsAreValid(instructionSetFlags))' in 'System.Runtime.CompilerServices.CastHelpers:StelemRef(System.Array,long,System.Object)' during 'Pre-import' (IL size 88; hash 0xdcdda805; Optimization-Level-Not-Yet-Set)

Maybe the existing SuperPMI collections (using the old JIT-EE interface, e.g., InstructionSet* definitions) isn't going to work.

I wonder if there's some way you can "hack" the change so it basically works with the "old" JIT-EE interface, just so we can leverage the superpmi pipelines (with existing superpmi collections) to measure throughput impact, especially on Arm64 or x64 without AVX-512 enabled, as well as asmdiffs on those platforms as well.

Appreciate the suggestion. I'm struggling with some of these failures. Will try this out :)

@BruceForstall
Copy link
Member

Another alternative is that I could create a new SuperPMI collection using just the minimum required changes to the JIT-EE interface, including the JIT-EE GUID change. Then, you could include in this PR the very same GUID change, and it would pick up the new collection. This is a sometime time-consuming option, but let me know if you need to go this direction.

(This might be a little tricky as the replay job currently explicitly prevents being run if the JIT-EE GUID changes, bit it looks like the diffs job doesn't, for some reason)

@tannergooding
Copy link
Member

and it would pick up the new collection

It would be great if we had an opt in job (azp run superpmi-diffs-full or something) which did spend the time to create a new collection and run diffs. This comes up just often enough that having such a job would save us time and resources for these kinds of changes.

@BruceForstall
Copy link
Member

It would be great if we had an opt in job (azp run superpmi-diffs-full or something) which did spend the time to create a new collection and run diffs. This comes up just often enough that having such a job would save us time and resources for these kinds of changes.

Due to security reasons, we can't do superpmi collections on PRs and update our Azure Storage location where these are usually stored. We could possibly put the generated MCH files in Azure Storage PR artifact storage, and pull them from there, but that requires a bigger change in how our scripting works. Probably do-able, but even more work.

@tannergooding
Copy link
Member

I was thinking just a local collection that is thrown away after the PR. I don't think it happens frequently enough that needing storage would be a concern.

@DeepakRajendrakumaran
Copy link
Contributor Author

@tannergooding All requested changes should be there now.

@tannergooding
Copy link
Member

tannergooding commented Feb 24, 2023

Throughput numbers are a lot better now...

However, given the numbers also are impacting Arm64, I'd guess this is due to changes in the common code paths, maybe the TYP_SIMD64 and simd64_t handling logic.

@BruceForstall, what are your thoughts on taking this "as-is" and me getting up a PR that moves TYP_SIMD32/64 to be xarch only (as discussed here: #80960 (comment)) as a follow up?

I think the alternative is doing some more profiling "now" (maybe with Intel VTune or AMD uProf over crossgen2 of corelib) and trying to identify where the new hot spots are.

At a glance, we might want to make varTypeIsSIMD be return (TypeGet(vt) >= TYP_SIMD8) && (TypeGet(vt) <= TYP_SIMD64) (or similar) for example. We could alternatively have it do a varTypeClassification check like is done for Integral types.

Windows x64

MinOpts (+0.07% to +0.34%)
FullOpts (+0.13% to +0.19%)

Windows arm64

MinOpts (+0.11% to +0.25%)
FullOpts (+0.12% to +0.14%)

@tannergooding
Copy link
Member

Put up #82616 to test the throughput impact of varTypeIsSIMD and having it use a varTypeClassification rather than a switch.

@tannergooding
Copy link
Member

Put up #82624 to test the throughput impact of handling TYP_SIMD32 on Arm64

@tannergooding
Copy link
Member

tannergooding commented Feb 24, 2023

Restricting TYP_SIMD32 to Arm64 only represent an approx -0.09% to -0.19% improvement to Arm64

Fixing varTypeIsSimd is a -0.31% to -0.92% improvement to all architectures.

For Arm64 we'd likely see some additional gains for GT_CNS_VEC if we made it simd16_t only, but I opted to not do that yet given it would be much simpler with the template methods this PR added.

We'll still "regress" some things on xarch compared to not having TYP_SIMD64 at all, but it will overall still be a TP improvement compared to .NET 7

@DeepakRajendrakumaran
Copy link
Contributor Author

DeepakRajendrakumaran commented Feb 24, 2023

Restricting TYP_SIMD32 to Arm64 only represent an approx -0.09% to -0.19% improvement to Arm64

Fixing varTypeIsSimd is a -0.31% to -0.92% improvement to all architectures.

For Arm64 we'd likely see some additional gains for GT_CNS_VEC if we made it simd16_t only, but I opted to not do that yet given it would be much simpler with the template methods this PR added.

We'll still "regress" some things on xarch compared to not having TYP_SIMD64 at all, but it will overall still be a TP improvement compared to .NET 7

So, I assume you want both changes to be merged? I imagine this will need to be modified again as I expect some merge conflicts if the other 2 PRs go in first

@tannergooding
Copy link
Member

So, I assume you want both changes to be merged? I imagine this will need to be modified again as I expect some merge conflicts if the other 2 PRs go in first

Right. I think its worth doing so, at least so we can confirm no negative impact to Arm64. I can take care of resolving the conflicts, they should be relatively straightforward.

@DeepakRajendrakumaran
Copy link
Contributor Author

So, I assume you want both changes to be merged? I imagine this will need to be modified again as I expect some merge conflicts if the other 2 PRs go in first

Right. I think its worth doing so, at least so we can confirm no negative impact to Arm64. I can take care of resolving the conflicts, they should be relatively straightforward.

Sounds good. Would be nice to finally have this merged :)

@tannergooding
Copy link
Member

tannergooding commented Feb 24, 2023

With just the varTypeIsSIMD fix, we now have the below and so are already looking a lot better.

It would be interesting to see where the TP improvement is coming from, likely one of the suggested refactorings (maybe around using genLog2 or similar).

It does look like there are some x64 diffs around Vector512 methods. Looks like we're getting a call emitted where previously we werent...

Windows x64

MinOpts (-0.16% to -0.01%)
FullOpts (-0.01% to +0.02%)

Windows Arm64

MinOpts (-0.05% to -0.00%)
FullOpts (+0.00% to +0.01%)

@tannergooding
Copy link
Member

tannergooding commented Feb 25, 2023

Diffs

TP issues have been resolved. There is still a small up to +0.03% TP regression in full opts and up to a +0.01% TP regression in min opts. But inversely there is an up to -0.16% TP improvement to min opts as well.

The small diffs that have cropped up in Vector512.WithElement are a side effect of us recognizing Vector512 as a valid ISA class now and are exposing a pre-existing "bug" in that get_Count isn't getting recognized as an intrinsic when the respective Vector64/128/256/512 is "valid" but "unsupported" (the runtime behavior is still correct, its just an inlined call and not a constant node as intended).

I plan on giving this one more review pass myself, but it should generally be good and mergable after it gets sign-off from someone in @dotnet/jit-contrib

Copy link
Member

@tannergooding tannergooding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes overall LGTM.

I had left a few comments about some semi-related potential post PR cleanup that could happen. I'll work through those and get them completed.

@DeepakRajendrakumaran
Copy link
Contributor Author

Changes overall LGTM.

I had left a few comments about some semi-related potential post PR cleanup that could happen. I'll work through those and get them completed.

Are we waiting on an additional review before merging this?

@tannergooding
Copy link
Member

Are we waiting on an additional review before merging this?

Yes, it needs sign-off from someone that works on the JIT team. CC @dotnet/jit-contrib, @BruceForstall

@DeepakRajendrakumaran
Copy link
Contributor Author

Bumping this since #82731 is dependent on this

CC https://github.com/orgs/dotnet/teams/jit-contrib, @BruceForstall

@kunalspathak
Copy link
Member

I lost track of the discussion on this PR, but have we already tracked why there is TP impact for Arm64?

image

@tannergooding
Copy link
Member

There were various optimization suggestions that apply to common code paths which give the improvements (namely the cases where we’re using genLog2 rather than memory lookups that span multiple cache lines)

There also remain a couple common code paths doing handling of EA32/64BYTE which aren’t ifdefd yet and could be and which can minorly negatively impact TP

@BruceForstall BruceForstall merged commit 6123cb0 into dotnet:main Mar 6, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Apr 5, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI avx512 Related to the AVX-512 architecture community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants