Add VPCLMULQDQ intrinsics #109137

saucecontrol · 2024-10-23T02:27:51Z

This is one of several similar new ISAs, where an existing ISA (PCLMULQDQ) was extended to 256-bit with one cpuid flag (VPCLMULQDQ) and then to 512-bit when combined with AVX-512 (VPCLMULQDQ+AVX512F) support.

dotnet-issue-labeler · 2024-10-23T02:27:57Z

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

saucecontrol · 2024-10-23T02:34:42Z

The existing modeling for ISA support presents a challenge in that JIT wants to see a 1:1 mapping between ISA and implementing class, but the actual ISAs are represented by a combination of flags.

For my first attempt, I have virtualized this in JIT similarly to the way that e.g. Vector256<T> is a 'fake' ISA built from some combination of other ISA support. However, this does not work properly with R2R/AOT, which wants all ISAs to be selectable with a switch.

Looking at some of the existing implementations, I see that the fake _VL ISAs are leaking into the ilc --instruction-set list, which is less than ideal. It also seems that the Avx10v1_V512 handling is broken for R2R since the ThunkGenerator code only handles nested class names named X64 or VL.

I'd appreciate some guidance if there's a better way to handle this scenario since there will be more like this.

saucecontrol · 2024-10-23T23:24:04Z

@MichalStrehovsky could I trouble you to look at this? JIT side is working, but I've left it as draft for now because the NAOT leg is failing due to an assert on the (intentionally) unexposed fake ISA.
At this point it looks like I have two options:

1) Leak the fake ISA into R2R/AOT configs, as is done for the others
2) Change the way ISAs are mapped to implementing intrinsic classes in R2R/AOT

I'm not sure how much work 2) is in the end, or whether that's something the runtime team cares about.

The currently broken code I mentioned is

runtime/src/coreclr/tools/Common/JitInterface/CorInfoInstructionSet.cs, lines 1896 to 1900 in e70aaa8:

    case "Avx10v1_V512":
        if (nestedTypeName == "X64")
        { return InstructionSet.X64_AVX10v1_V512_X64; }
        else
        { return InstructionSet.X64_AVX10v1_V512; }

because that generated method is matching on managed type name, and there's no Avx10v1_V512 type. The generator currently has special handling for VL and X64 nested classes, so I could add V256 and V512 handling to that, but it also only handles one layer of nesting right now, so it can't handle Avx10v1.V512.X64 without bigger changes.
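To illustrate the nesting problem, here is a minimal sketch of a lookup key that does not care how many nested classes there are. The helper name `flattenTypeName` and the '_' joining rule are assumptions made for this example; this is not how ThunkGenerator is implemented.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not part of ThunkGenerator: flattens a managed type
// path such as {"Avx10v1", "V512", "X64"} into a single lookup key by joining
// the parts with '_'. Any nesting depth is accepted.
std::string flattenTypeName(const std::vector<std::string>& parts) {
    std::string key;
    for (const std::string& part : parts) {
        if (!key.empty()) {
            key += '_';  // separator between the outer class and each nested class
        }
        key += part;
    }
    return key;
}
```

With a key built this way, Avx10v1.V512.X64 and Avx10v1.V512 would resolve to distinct entries without special-casing individual nested class names.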

I guess there's also option 3) Do it the ugly way for now and hope somebody cleans it up later...

MichalStrehovsky · 2024-10-24T06:09:32Z

> @MichalStrehovsky could I trouble you to look at this?

I don't have much guidance to offer, sorry. Things got a lot more complicated since I last touched any of this (when all we had was AVX2) and I haven't exactly been keeping track of it. E.g. I don't know why _VL instructions are fake and whether we intentionally want/or do not want to support them as --instruction-set, and what to do about this one. I think there was some discussion about getting rid of _VL in the past but I can't quickly find the PR.

Ideally RyuJIT implementation details shouldn't leak out into the managed parts of the compiler or R2R file format, so if RyuJIT needs something fake to operate, it ideally shouldn't burden other components (because then the owners of said component who don't know about RyuJIT implementation details and don't know much about hardware intrinsics in general either have no clue about what's going on). But maybe it's necessary, I don't know. We pulled these RyuJIT implementation details from cpufeatures.h in the past, maybe they can be pulled from more places.

@tannergooding and @davidwrighton might have more of an opinion.

saucecontrol · 2024-10-24T06:52:25Z

> I don't have much guidance to offer, sorry.

Fair enough. Thanks for the reply anyway.

For background, the reason the _VL ISAs are 'fake' is that they represent the intersection of other real cpuid bits. e.g. AVX512BW_VL actually means AVX512BW+AVX512VL. In reality this distinction is unimportant since there isn't any actual hardware that implements one of those ISAs without the other, meaning no one would ever have a legit reason for wanting to test support of the intersection independent from the individual cpuid bits. It does represent a usability issue, however, since the fake ISA leaks into the --instruction-set options, and their being fake means people can't easily look up what they mean. This is mitigated by the fact that we also have the x86-64-v4 option which includes all the _VL fake ISAs and is a well-known shortcut.

These newer intersection-style ISAs are more problematic because 1) they don't fall under a well-known x86-64 version set and 2) they actually do exist in hardware independent of each other. For example, Skylake-X implements PCLMULQDQ+AVX512F but not VPCLMULQDQ, Alder Lake implements PCLMULQDQ+VPCLMULQDQ but not AVX512F, and Zen 4 and 5 implement the full set. I'm hoping we can arrive at a better solution for them that also happens to clean up the handling of the ISAs that are already implemented.
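The three hardware examples can be written down as flag combinations. The following is only a sketch: the enum, the constants, and `widestClmulBits` are hypothetical names for this example, and the AVX/AVX512VL dependencies are left out for brevity.

```cpp
#include <cstdint>

// Hypothetical feature bits. The names mirror the cpuid flags discussed
// above, but this enum is illustrative and not taken from the runtime.
enum CpuFeature : std::uint32_t {
    kPclmulqdq  = 1u << 0,
    kVpclmulqdq = 1u << 1,
    kAvx512F    = 1u << 2,
};

constexpr bool has(std::uint32_t flags, CpuFeature feature) {
    return (flags & feature) != 0;
}

// Widest carry-less multiply (in bits) usable with the given flags, 0 if none.
// 512-bit needs VPCLMULQDQ and AVX512F together; neither flag alone suffices.
constexpr int widestClmulBits(std::uint32_t flags) {
    return (has(flags, kVpclmulqdq) && has(flags, kAvx512F)) ? 512
         : has(flags, kVpclmulqdq) ? 256
         : has(flags, kPclmulqdq) ? 128
         : 0;
}

// Flag sets for the hardware examples named in the comment above.
constexpr std::uint32_t kSkylakeX  = kPclmulqdq | kAvx512F;
constexpr std::uint32_t kAlderLake = kPclmulqdq | kVpclmulqdq;
constexpr std::uint32_t kZen4      = kPclmulqdq | kVpclmulqdq | kAvx512F;
```

Under these assumptions the three machines land on three different widths, which is why the combination cannot be treated as a single always-together bit the way the _VL sets are.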

MichalStrehovsky · 2024-10-24T08:22:27Z

Thanks for the explanation! I agree that given all this, we should ideally not expose _VL as something people can specify on the command line.

tannergooding · 2024-10-24T15:36:22Z

> I think there was some discussion about getting rid of _VL in the past but I can't quickly find the PR.

It's spread out in a few places, I think the most recent was in: #103241 (comment)

There's a bit of a balance overall between modeling what the CPU exposes (irrespective of the implementation) and modeling something reasonable for users to consume and handle. For AVX512, I'd say our current setup is "wrong" and we should instead either:
A) Remove the various AVX512*_VL combined flags in favor of just having AVX512VL
B) Remove AVX512F, AVX512BW, AVX512CD, AVX512DQ and the various AVX512*_VL flags; only exposing AVX512

With A, we have something where each --instruction-set matches a corresponding CPUID bit. But the weird nuance is that we then have to figure out how that maps to the implementation since we require F+BW+CD+DQ+VL for any AVX512 functionality to work.

With B, we instead more closely model what the runtime requires and what is effectively the minimum baseline for things to "work".

With VPCLMULQDQ here, I don't think we have a choice other than having both PCLMULQDQ and VPCLMULQDQ as --instruction-set options due to how the CPUID bits and the implementation needs to exist. The rest of the combinations (PCLMULQDQ+AVX, VPCLMULQDQ+AVX, VPCLMULQDQ+AVX512F, and VPCLMULQDQ+AVX512VL) don't need their own flags and are instead represented by the other opt ins the user has given.

So, I think what we want is to always have virtual instruction sets for any managed exposed ISA class (such as Avx10v1.V512 or Pclmulqdq). These may not have a corresponding --instruction-set. We then ensure that some of the instruction sets do map to --instruction-set bits where it is sensible to do so.

I think what you currently have in the PR roughly models that. We have Pclmulqdq (the class) which maps to pclmul (the instruction-set); we have Pclmulqdq_V256 (the class) which maps to vpclmulqdq (the instruction set, which maybe should be called vpclmul for parity); and we have Pclmulqdq_V512 (the class) which doesn't map to any instruction set (as the instruction set is implied by vpclmul+avx512).
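A rough way to picture that mapping is a table where each class lists the switches that enable it, and only some classes own a switch. This is a sketch under assumptions: the struct, the function names, and the `avx512f` switch spelling are hypothetical and are not the PR's actual data model.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical model of a virtual instruction set.
struct VirtualIsa {
    std::string className;               // managed class, as named in the comment above
    std::string switchName;              // empty: no --instruction-set switch of its own
    std::vector<std::string> impliedBy;  // switches that together enable the class
};

const std::vector<VirtualIsa>& virtualIsas() {
    static const std::vector<VirtualIsa> table = {
        {"Pclmulqdq",      "pclmul",     {"pclmul"}},
        {"Pclmulqdq_V256", "vpclmulqdq", {"vpclmulqdq"}},
        // No switch of its own: implied by vpclmulqdq plus AVX-512
        // ("avx512f" is an assumed spelling for this sketch).
        {"Pclmulqdq_V512", "",           {"vpclmulqdq", "avx512f"}},
    };
    return table;
}

// True when every switch the class depends on has been enabled.
bool isClassEnabled(const std::string& className,
                    const std::set<std::string>& enabledSwitches) {
    for (const VirtualIsa& isa : virtualIsas()) {
        if (isa.className != className) {
            continue;
        }
        for (const std::string& required : isa.impliedBy) {
            if (enabledSwitches.count(required) == 0) {
                return false;
            }
        }
        return true;
    }
    return false;  // unknown class
}
```

The point of the shape is that Pclmulqdq_V512 never shows up in a user-facing switch list, yet it is still resolvable from the switches the user did pass.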

saucecontrol · 2024-10-25T04:06:04Z

Thanks, Tanner. I think if the decision is to change up the handling of the virtual ISAs in general, that's probably better done in a separate PR, which leads me to believe maybe the best path here is to go ahead and follow the existing pattern for now, and clean them all up later.

I've made the required changes to ThunkGenerator to fix the Avx10v1.V512 mappings as well as the V256 and V512 from this new API. That change ended up being bigger than I expected, because it led me to realize that all the _VL_X64 ISAs were in fact not used anywhere. I've split that change out into #109210 and will rebase this one once that's merged.

saucecontrol · 2024-11-11T15:49:33Z

OK, this is ready for another review pass. All feedback addressed and updated tests passing.

tannergooding

CC. @dotnet/jit-contrib for secondary review

tannergooding · 2024-11-19T17:39:41Z

ping @dotnet/jit-contrib for secondary review

AFAIK methods on the nested X64/Arm64 classes shouldn't be considered intrinsics on 32bit platforms since they are as relevant as e.g. WASM intrinsics. This should fix widespread runtime-nativeaot-outerloop failure on x86. I think this regressed in dotnet#109137.
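The 32-bit rule in the note above can be sketched as a simple filter. The enum and function names are hypothetical, and this is not the code from the follow-up fix.

```cpp
#include <string>

// Hypothetical target list for this sketch.
enum class TargetArch { X86, X64, Arm32, Arm64 };

bool is32Bit(TargetArch arch) {
    return arch == TargetArch::X86 || arch == TargetArch::Arm32;
}

// Methods on the nested X64/Arm64 classes are not treated as intrinsics when
// the target is 32-bit. Other nested classes are not filtered by this sketch.
bool nestedClassMayBeIntrinsic(const std::string& nestedClassName, TargetArch arch) {
    const bool is64BitOnlyClass =
        nestedClassName == "X64" || nestedClassName == "Arm64";
    return !(is64BitOnlyClass && is32Bit(arch));
}
```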


* add vpclmulqdq intrinsics
* add missing break
* add alternate instruction def for evex encoding
* rename instruction
* whitespace
* re-run thunk generator
* fix AOT instruction sets
* address feedback
* apply formatting patch
* address feedback round 2
* add missing brace
* fix smoketest expected results
* fix suffix order
* handle implied V512 support in AOT
* remove more unnecessary X64 ISA variants

Co-authored-by: Tanner Gooding <tagoo@outlook.com>

add vpclmulqdq intrinsics

df00b72

dotnet-issue-labeler bot added area-System.Runtime.Intrinsics new-api-needs-documentation labels Oct 23, 2024

dotnet-policy-service bot added the community-contribution label Oct 23, 2024

add missing break

092a0ae

add alternate instruction def for evex encoding

c4aaa76

saucecontrol added 4 commits October 26, 2024 15:17

rename instruction

c309baa

whitespace

3275606

Merge remote-tracking branch 'upstream/main' into vpclmulqdq

91cc3f1

re-run thunk generator

5ada8e1

fix AOT instruction sets

05bea1c

saucecontrol added 2 commits November 10, 2024 15:19

handle implied V512 support in AOT

ff959ad

remove more unnecessary X64 ISA variants

2b466f7

tannergooding approved these changes Nov 11, 2024

Merge branch 'main' into vpclmulqdq

a862705

saucecontrol added 2 commits November 14, 2024 15:11

Merge remote-tracking branch 'upstream/main' into vpclmulqdq

bd74c21

Merge remote-tracking branch 'upstream/main' into vpclmulqdq

102bbef

Merge branch 'main' into vpclmulqdq

8d5a1e2

BruceForstall approved these changes Nov 20, 2024

tannergooding merged commit 08a36ca into dotnet:main Nov 20, 2024
167 of 169 checks passed

saucecontrol deleted the vpclmulqdq branch November 20, 2024 20:56

MichalStrehovsky mentioned this pull request Nov 28, 2024

Fix GetHardwareIntrinsicId on 32bit platforms #110238

Merged

github-actions bot locked and limited conversation to collaborators Dec 22, 2024
