Arm64: SVE/SVE2 encodings #94285

kunalspathak · 2023-11-02T07:08:45Z

Summary

Add SVE/SVE2 encodings for all the possible instructions.

Stats:

606 new instructions
1123 new encodings of newly added instructions
378 new instruction formats
103 different groupings of newly added instruction formats.

Design

I have added 2 new files emitfmtsarm64_sve.h and instrsarm64_sve.h for SVE encodings. They are separate from existing emitfmtsarm64.h and instrsarm64.h because had we added SVE encodings in existing files that contains general/NEON instructions, it would impact JIT throughput heavily specially around the code to retrieve hex code for non-sve instructions. The instruction table preparation happens using existing machinery in a separate method emitSveInsCode(). The codegen, depending on the intrinsic, will call emitSveInsCode() and thus without affecting the code path of non-sve instructions.

I had to readjust the offsets of instrDesc to include the new size of _idIns and _idInsFmt. Here are the offsets before/after:

Before:


clrjit_universal_arm64_x64!emitter::instrDesc
   +0x000 _idIns           : Pos 0, 10 Bits
   +0x000 _idInsFmt        : Pos 10, 8 Bits
   +0x000 _idGCref         : Pos 18, 2 Bits
   +0x000 _idReg1          : Pos 20, 6 Bits
   +0x000 _idReg2          : Pos 26, 6 Bits
   +0x004 _idSmallDsc      : Pos 0, 1 Bit
   +0x004 _idLargeCns      : Pos 1, 1 Bit
   +0x004 _idLargeDsp      : Pos 2, 1 Bit
   +0x004 _idLargeCall     : Pos 3, 1 Bit
   +0x004 _idBound         : Pos 4, 1 Bit
   +0x004 _idCallAddr      : Pos 5, 1 Bit
   +0x004 _idNoGC          : Pos 6, 1 Bit
   +0x004 _idOpSize        : Pos 7, 3 Bits
   +0x004 _idInsOpt        : Pos 10, 6 Bits
   +0x004 _idLclVar        : Pos 16, 1 Bit
   +0x004 _idLclVarPair    : Pos 17, 1 Bit
   +0x004 _idCnsReloc      : Pos 18, 1 Bit
   +0x004 _idDspReloc      : Pos 19, 1 Bit
   +0x004 _idSmallCns      : Pos 20, 12 Bits
   +0x008 _idAddrUnion     : emitter::instrDesc::idAddrUnion

After:

clrjit_universal_arm64_x64!emitter::instrDesc
   +0x000 _idIns           : Pos 0, 11 Bits
   +0x000 _idInsFmt        : Pos 11, 10 Bits
   +0x000 _idOpSize        : Pos 21, 3 Bits
   +0x000 _idInsOpt        : Pos 24, 6 Bits
   +0x000 _idGCref         : Pos 30, 2 Bits
   +0x004 _idReg1          : Pos 0, 6 Bits
   +0x004 _idReg2          : Pos 6, 6 Bits
   +0x004 _idSmallDsc      : Pos 12, 1 Bit
   +0x004 _idLargeCns      : Pos 13, 1 Bit
   +0x004 _idLargeDsp      : Pos 14, 1 Bit
   +0x004 _idLargeCall     : Pos 15, 1 Bit
   +0x004 _idBound         : Pos 16, 1 Bit
   +0x004 _idCallAddr      : Pos 17, 1 Bit
   +0x004 _idNoGC          : Pos 18, 1 Bit
   +0x004 _idLclVar        : Pos 19, 1 Bit
   +0x004 _idLclVarPair    : Pos 20, 1 Bit
   +0x004 _idCnsReloc      : Pos 21, 1 Bit
   +0x004 _idDspReloc      : Pos 22, 1 Bit
   +0x004 _idSmallCns      : Pos 23, 9 Bits
   +0x008 _idAddrUnion     : emitter::instrDesc::idAddrUnion

Nomenclature

All the instruction's id has a prefix of sve_ to differentiate them from regular instructions, e.g. INS_sve_mov vs. INS_mov.
Likewise, instruction formats have prefix of SVE_ to differentiate from existing instruction formats. e.g. IF_SVE_AB_3A vs. IF_DI_3A.
Currently, various instruction formats are grouped based on the common logic various instructions having them share during encoding and in various other methods. I do not have idea at this point how many of newly added instruction formats will share logic between various instructions, hence I have created as many unique as I could, and in future we could eliminate/combine many of them.
The naming convention for instruction formats is as follows: SVE_WW_XY, where WW is unique code like AA, AB, etc., X represents number of registers the instruction operates on, Y represents unique code A, B depending on if same instruction has different formats.

Arm64 Encoding generation tool

As mentioned in #93095, it is impossible to hand write each encoding by looking at the manual. Hence, I wrote a small tool that parses the manual available in xml format on Arm site and generate C++ code that is included in emitfmtsarm64_sve.h and instrsarm64_sve.h. The plan is to open source the tool and check it in https://github.com/dotnet/jitutils/.

Contributes to #93095

ghost · 2023-11-02T07:08:55Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Summary

Add SVE/SVE2 encodings for all the possible instructions.

Stats:

606 new instructions
1123 new encodings of newly added instructions
378 new instruction formats
103 different groupings of newly added instruction formats.

Design

I have added 2 new files emitfmtsarm64_sve.h and instrsarm64_sve.h for SVE encodings. They are separate from existing emitfmtsarm64.h and instrsarm64.h because had we added SVE encodings in existing files that contains general/NEON instructions, it would impact JIT throughput heavily specially around the code to retrieve hex code for non-sve instructions. The instruction table preparation happens using existing machinery in a separate method emitSveInsCode(). The codegen, depending on the intrinsic, will call emitSveInsCode() and thus without affecting the code path of non-sve instructions.

Nomenclature

All the instruction's id has a prefix of sve_ to differentiate them from regular instructions, e.g. INS_sve_mov vs. INS_mov.
Likewise, instruction formats have prefix of SVE_ to differentiate from existing instruction formats. e.g. IF_SVE_AB_3A vs. IF_DI_3A.
Currently, various instruction formats are grouped based on the common logic various instructions having them share during encoding and in various other methods. I do not have idea at this point how many of newly added instruction formats will share logic between various instructions, hence I have created as many unique as I could, and in future we could eliminate/combine many of them.
The naming convention for instruction formats is as follows: SVE_WW_XY, where WW is unique code like AA, AB, etc., X represents number of registers the instruction operates on, Y represents unique code A, B depending on if same instruction has different formats.

Arm64 Encoding generation tool

As mentioned in #93095, it is impossible to hand write each encoding by looking at the manual. Hence, I wrote a small tool that parses the manual available in xml format on Arm site and generate C++ code that is included in emitfmtsarm64_sve.h and instrsarm64_sve.h. The plan is to open source the tool and check it in https://github.com/dotnet/jitutils/.

Contributes to #93095

Author:	kunalspathak
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	-

kunalspathak · 2023-11-02T14:14:13Z

Need to figure out the cause of TP regression. Most likely it could be the difference in code generated for existing emitInsCode() because of increase in number of entries in enum (?). Understandable to see more regression on MinOpts than FullOpts because more instructions are emitted for minopts.

kunalspathak · 2023-11-02T14:14:44Z

@dotnet/jit-contrib
cc: @TamarChristinaArm , @a74nh , @TIHan @amanasifkhalid

kunalspathak · 2023-11-02T23:13:11Z

Need to figure out the cause of TP regression. Most likely it could be the difference in code generated for existing emitInsCode() because of increase in number of entries in enum (?). Understandable to see more regression on MinOpts than FullOpts because more instructions are emitted for minopts.

On @BruceForstall 's suggestion, I just made a PR #94310 that changes the offsets of instrDesc, the way they are changed in this PR and see the TP impact and it looks exactly the same as one we see here. So clearly it is coming because of reduction of smallCns bit size.

TP diffs:

a74nh · 2023-11-03T10:06:58Z

On @BruceForstall 's suggestion, I just made a PR #94310 that changes the offsets of instrDesc, the way they are changed in this PR and see the TP impact and it looks exactly the same as one we see here. So clearly it is coming because of reduction of smallCns bit size.

Am I right in thinking the overall size of a instrDesc has gone from 32bits to 64bits?

In #94310, instead of increasing the variables, what happens if you instead add padding? Eg:

instruction _idIns : 10;
instruction _pad_idIns : 1;

etc for all the other fields.
That would confirm if the extra cost is due to the overall size of instrDesc or if it is the increased field sizes.

Thinking ahead, SME is going to add a similar slowdown.

What about if SVE instruction enum started again from 0. The same with all the instruction formats. Treat it as a separate architecture.
You could have a single bit in instrDesc to indicate if it is Arm64 or Arm64Sve (and, then maybe increase to 2bits when doing SME). Or override instrDesc, creating an instrDescSve, and then emitNewInstr() can take in an enum which decides which instrDesc to allocate.
You'd have to have different emitIns_R_R_R() etc functions for sve. But, that's ok because it's keeping the switch statements shorter.

kunalspathak · 2023-11-03T14:28:42Z

Am I right in thinking the overall size of a instrDesc has gone from 32bits to 64bits?

It was and is 16 bytes. instrDesc has 2 chunks of 8 bytes each. The first 8 bytes contains various bit fields and the last 8 bytes represents the idAddrUnion . I increased the size of idIns and idInsFmt that falls in the first chunk, but still keeping the overall size of that chunk 8 bytes. To offset that, we have to reduce the size of bits needed for idSmallCns. As seen in https://github.com/dotnet/runtime/pull/94285/files#diff-125fb0b9396b0e85ee82c6d592cdf1750778a7cff38ad382f7c2966ef588dce1R891, what that means is, earlier, we could create instrDesc if a constant in the instruction is 12-bits wide. Now, that space has shrunk to 9-bits. A constant beyond that size, we will have to use instrDescCns which is 24 bytes big. That causes TP regression because there are extra code paths involved to process instrDescCns. It does increase the JIT memory size for constants in the range of 10-bits thru 12-bits. Alternatives are costly, see below.

In #94310, instead of increasing the variables, what happens if you instead add padding? Eg:
instruction _idIns : 10;
instruction _pad_idIns : 1;
etc for all the other fields. That would confirm if the extra cost is due to the overall size of instrDesc or if it is the increased field sizes.

Thinking ahead, SME is going to add a similar slowdown.

What about if SVE instruction enum started again from 0. The same with all the instruction formats. Treat it as a separate architecture. You could have a single bit in instrDesc to indicate if it is Arm64 or Arm64Sve (and, then maybe increase to 2bits when doing SME). Or override instrDesc, creating an instrDescSve, and then emitNewInstr() can take in an enum which decides which instrDesc to allocate.

emitNewInstr() is a hot method and adding a single check will impact TP heavily. @TIHan tried doing that a while back and found out it is costly. If not in emitNewInstrDesc, I was thinking of having a union or something of idIns and idFmt, but even for that we need a bit to tell if it is SVE or not and then code somewhere and possibly in a hot-method if the idIns represents sve or not. With the existing model, it is ok to take the minimal TP cost, but overall it doesn't impact any non-sve code. I am hoping the same once we start adding more cases in common methods like emitOutputInstr(), etc.

You'd have to have different emitIns_R_R_R() etc functions for sve. But, that's ok because it's keeping the switch statements shorter.

That will be lot of methods though.

BruceForstall

Looks really good to me. You can follow up with my comments/questions in a follow-up if desired.

src/coreclr/jit/emitarm64.cpp

src/coreclr/jit/emitfmtsarm64_sve.h

src/coreclr/jit/CMakeLists.txt

src/coreclr/jit/emit.h

src/coreclr/jit/instrsarm64_sve.h

src/coreclr/jit/emitarm64.cpp

src/coreclr/jit/emitfmtsarm64_sve.h

* Add encodings for SVE * Adjust the offsets * jit format * fix the build for riscV64 * Address review feedback from Bruce * Rename files to remove _ * forgot to rename file names in other places Change-Id: I54932d16104a2582d9afc3fca6844183aa8c3536

kunalspathak added 3 commits November 1, 2023 15:53

Add encodings for SVE

4ddc65e

Adjust the offsets

7fa39fb

jit format

2d4aeb8

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Nov 2, 2023

ghost assigned kunalspathak Nov 2, 2023

kunalspathak mentioned this pull request Nov 2, 2023

Arm64: Add SVE/SVE2 support in .NET 9 #93095

Closed

31 tasks

This was referenced Nov 2, 2023

Timeout in System.Net.Quic.Functional.Tests #86019

Closed

CI error: System.Net.Quic.QuicException: The connection timed out from inactivity #91757

Closed

fix the build for riscV64

f28a546

kunalspathak marked this pull request as ready for review November 2, 2023 14:14

kunalspathak requested a review from BruceForstall November 2, 2023 14:15

build-analysis bot mentioned this pull request Nov 2, 2023

Tracking issue for CI build timeouts #76454

Closed

BruceForstall approved these changes Nov 6, 2023

View reviewed changes