Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add VPCLMULQDQ intrinsics #109137

Open
wants to merge 17 commits into
base: main
Choose a base branch
from
Open

Conversation

saucecontrol
Copy link
Member

@saucecontrol saucecontrol commented Oct 23, 2024

Fixes #95772

This is one of several similar new ISAs, where an existing ISA (PCLMULQDQ) was extended to 256-bit with one cpuid flag (VPCLMULQDQ) and then to 512-bit when combined with AVX-512 (VPCLMULQDQ+AVX512F) support.

Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Oct 23, 2024
@saucecontrol
Copy link
Member Author

The existing modeling for ISA support presents a challenge in that JIT wants to see a 1:1 mapping between ISA and implementing class, but the actual ISAs are represented by a combination of flags.

For my first attempt, I have virtualized this in JIT similarly to the way that e.g. Vector256<T> is a 'fake' ISA built from some combination of other ISA support, however this is not working properly with R2R/AOT where it wants all ISAs to be selectable with a switch.

Looking at some of the existing implementations, I see that the fake _VL ISAs are leaking into the ilc --instruction-set list, which is less than ideal. It also seems that the Avx10v1_V512 handling is broken for R2R since the ThunkGenerator code only handles nested class names named X64 or VL.

I'd appreciate some guidance if there's a better way to handle this scenario since there will be more like this.

@saucecontrol
Copy link
Member Author

saucecontrol commented Oct 23, 2024

@MichalStrehovsky could I trouble you to look at this? JIT side is working, but I've left it as draft for now because the NAOT leg is failing due to an assert on the (intentionally) unexposed fake ISA.
At this point it looks like I have two options:

  1. Leak the fake ISA into R2R/AOT configs, as is done for the others
  2. Change the way ISAs are mapped to implementing intrinsic classes in R2R/AOT

I'm not sure how much work 2) is in the end, or whether that's something the runtime team cares about

The currently broken code I mentioned is

case "Avx10v1_V512":
if (nestedTypeName == "X64")
{ return InstructionSet.X64_AVX10v1_V512_X64; }
else
{ return InstructionSet.X64_AVX10v1_V512; }

because that generated method is matching on managed type name, and there's no Avx10v1_V512 type. The generator currently has special handling for VL and X64 nested classes, so I could add V256 and V512 handling to that, but it also only handles one layer of nesting right now, so it can't handle Avx10v1.V512.X64 without bigger changes.

I guess there's also option 3) Do it the ugly way for now and hope somebody cleans it up later...

@MichalStrehovsky
Copy link
Member

@MichalStrehovsky could I trouble you to look at this?

I don't have much guidance to offer, sorry. Things got a lot more complicated since I last touched any of this (when all we had was AVX2) and I haven't exactly been keeping track of it. E.g. I don't know why _VL instructions are fake and whether we intentionally want/or do not want to support them as --instruction-set, and what to do about this one. I think there was some discussion about getting rid of _VL in the past but I can't quickly find the PR.

Ideally RyuJIT implementation details shouldn't leak out into the managed parts of the compiler or R2R file format, so if RyuJIT needs something fake to operate, it ideally shouldn't burden other components (because then the owners of said component who don't know about RyuJIT implementation details and don't know much about hardware intrinsics in general either have no clue about what's going on). But maybe it's necessary, I don't know. We pulled these RyuJIT implementation details from cpufeatures.h in the past, maybe they can be pulled from more places.

@tannergooding and @davidwrighton might have more of an opinion.

@saucecontrol
Copy link
Member Author

saucecontrol commented Oct 24, 2024

I don't have much guidance to offer, sorry.

Fair enough. Thanks for the reply anyway.

For background, the reason the _VL ISAs are 'fake' is that they represent the intersection of other real cpuid bits. e.g. AV512BW_VL actually means AVX512BW+AVX512VL. In reality this distinction is unimportant since there isn't any actual hardware that implements one of those ISAs without the other, meaning no one would ever have a legit reason for wanting to test support of the intersection independent from the individual cpuid bits. It does represent a usability issue, however, since the fake ISA leaks into the --instruction-set options, and their being fake means people can't easily look up what they mean. This is mitigated by the fact we also have the x86-64-v4 option which includes all the _VL fake ISAs and is a well-known shortcut.

These newer intersection-style ISAs are more problematic because 1) they don't fall under a well-known x86-64 version set and 2) they actually do exist in hardware independent of each other. For example, Skylake-X implements PCLMULQDQ+AVX512F but not VPCLMULQDQ, Alder Lake implements PCLMULQDQ+VPCLMULQDQ but not AVX512F, and Zen 4 and 5 implement the full set. I'm hoping we can arrive at a better solution for them that also happens to clean up the handling of the ISAs that are already implemented.

@MichalStrehovsky
Copy link
Member

Thanks for the explanation! I agree that given all this, we should ideally not expose _VL as something people can specify on the command line.

@tannergooding
Copy link
Member

I think there was some discussion about getting rid of _VL in the past but I can't quickly find the PR.

It's spread out in a few places, I think the most recent was in: #103241 (comment)

There's a bit of a balance overall between modeling what the CPU exposes (irrespective of the implementation) and modeling something reasonable for users to consume and handle. For AVX512, I'd say our current setup is "wrong" and we should instead either:
A) Remove the various AVX512*_VL combined flags in favor of just having AVX512VL
B) Remote AVX512F, AVX512BW, AVX512CD, AVX512DQ and the various AVX512*_VL flags; only exposing AVX512

With A, we have something where each --instruction-set matches a corresponding CPUID bit. But the weird nuance is that we then have to figure out how that maps to the implementation since we require F+BW+CD+DQ+VL for any AVX512 functionality to work.

With B, we instead more closely model what the runtime requires and what is effectively the minimum baseline for things to "work".


With VPCLMULQDQ here, I don't think we have a choice other than having both PCLMULQDQ and VPCLMULQDQ as --instruction-set options due to how the CPUID bits and the implementation needs to exist. The rest of the combinations (PCLMULQDQ+AVX, VPCLMULQDQ+AVX, VPCLMULQDQ+AVX512F, and VPCLMULQDQ+AVX512VL) don't need their own flags and are instead represented by the other opt ins the user has given.

So, I think what we want is we need to do is always have virtual instruction sets for any managed exposed ISA class (such as Avx10v1.V512 or Pclmulqdq). These may not have a corresponding --instruction-set. We then ensure that some of the instruction sets do map to --instruction-set bits where it is sensible to do.

I think what you currently have in the PR roughly models that. We have Pclmulqdq (the class) which maps to pclmul (the instruction-set); we have Pclmulqdq_V256 (the class) which maps to vpclmulqdq (the instruction set, which maybe should be called vpclmul for parity); and we have Pclmulqdq_V512 (the class) which doesn't map to any instruction set (as the instruction set is implied by vpclmul+avx512).

@saucecontrol
Copy link
Member Author

saucecontrol commented Oct 25, 2024

Thanks, Tanner. I think if the decision is to change up the handling of the virtual ISAs in general, that's probably better done in a separate PR, which leads me to believe maybe the best path here is to go ahead and follow the existing pattern for now, and clean them all up later.

I've made the required changes to ThunkGenerator to fix the Avx10v1.V512 mappings as well as the V256 and V512 from this new API. That change ended up being bigger than I expected, because it led me to realize that all the _VL_X64 ISAs were in fact not used anywhere. I've split that change out into #109210 and will rebase this one once that's merged.

@saucecontrol saucecontrol marked this pull request as ready for review November 5, 2024 08:05
@tannergooding
Copy link
Member

This mostly all looks right. I believe you're missing some handling in:

  • src/coreclr/tools/aot/ILCompiler.Compiler/Compiler/HardwareIntrinsicHelpers.Aot.cs - a comment should be updated as its no longer just Avx10v1_V512 being handled (should be more general and not list every individual scenario)
  • src/coreclr/tools/Common/InstructionSetHelpers.cs - code should be updated to enable handling the new ISA optimistically
  • src/coreclr/tools/Common/Compiler/InstructionSetSupport.cs - may need handling to account for the v256/v512 nested classes
  • src/mono/mono/mini/simd-intrinsics.c - may need to be updated to handle the nested v256/v512 class since Pclmulqdq is supported
  • src/mono/System.Private.CoreLib/src/ILLink/ILLink.Substitutions.Intrinsics.x86.xml - needs to be updated to track that mono doesn't support the v256/v512 nested classes
  • src/tests/nativeaot/SmokeTests/HardwareIntrinsics/Program.cs - needs to be updated to test the IsSupported property on the nested V256/V512 classes
  • src/tests/JIT/HardwareIntrinsics/X86/General/IsSupported.cs - same
  • src/tests/JIT/HardwareIntrinsics/X86/X86Base/CpuId.cs - same

@saucecontrol
Copy link
Member Author

saucecontrol commented Nov 7, 2024

src/coreclr/tools/aot/ILCompiler.Compiler/Compiler/HardwareIntrinsicHelpers.Aot.cs - a comment should be updated as its no longer just Avx10v1_V512 being handled (should be more general and not list every individual scenario)

This change is in https://github.com/dotnet/runtime/pull/109537/files#diff-65f20fbb1fbd6c815168e9d3b2b358c4fd02aea226f2152099427a594310b876 if that's ok.

src/coreclr/tools/Common/Compiler/InstructionSetSupport.cs - may need handling to account for the v256/v512 nested classes

Ugh. This code hacks up the type name used for lookup in the Thunkgenerator-generated code, so the fix I added there conflicts with this. Let me PR this separately since that was a break.

Never mind, different map. I'll figure out what's up with this one.

src/mono/mono/mini/simd-intrinsics.c - may need to be updated to handle the nested v256/v512 class since Pclmulqdq is supported

I think maybe the proper way to deal with mono will be to put the nested classes in a separate .cs file and include the .PlatformNotSupported version of that file for mono. I didn't notice the base Pclmulqdq was actually supported there. Does that sound right to you?

{
if (potentialType.Name == "X64")
potentialType = (MetadataType)potentialType.ContainingType;
if (potentialType.Name == "VL")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This handling of VL classes assumed the IsSupported value would be the same as the containing class. Although that assumption is safe, it was unnecessary since they have their own mappings in the dictionary and can be handled by the generic nesting logic.

The special handling of the 64-bit classes had to stay.

@saucecontrol
Copy link
Member Author

OK, this is ready for another review pass. All feedback addressed and updated tests passing.

Copy link
Member

@tannergooding tannergooding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CC. @dotnet/jit-contrib for secondary review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-System.Runtime.Intrinsics community-contribution Indicates that the PR has been added by a community member new-api-needs-documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[API Proposal]: VPCLMULQDQ Intrinsics
3 participants