Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arm64: SVE/SVE2 encodings #94285

Merged
merged 7 commits into from
Nov 8, 2023
Merged

Conversation

kunalspathak
Copy link
Member

@kunalspathak kunalspathak commented Nov 2, 2023

Summary

Add SVE/SVE2 encodings for all the possible instructions.

Stats:

  • 606 new instructions
  • 1123 new encodings of newly added instructions
  • 378 new instruction formats
  • 103 different groupings of newly added instruction formats.

Design

I have added 2 new files emitfmtsarm64_sve.h and instrsarm64_sve.h for SVE encodings. They are separate from existing emitfmtsarm64.h and instrsarm64.h because had we added SVE encodings in existing files that contains general/NEON instructions, it would impact JIT throughput heavily specially around the code to retrieve hex code for non-sve instructions. The instruction table preparation happens using existing machinery in a separate method emitSveInsCode(). The codegen, depending on the intrinsic, will call emitSveInsCode() and thus without affecting the code path of non-sve instructions.

I had to readjust the offsets of instrDesc to include the new size of _idIns and _idInsFmt. Here are the offsets before/after:

Before:


clrjit_universal_arm64_x64!emitter::instrDesc
   +0x000 _idIns           : Pos 0, 10 Bits
   +0x000 _idInsFmt        : Pos 10, 8 Bits
   +0x000 _idGCref         : Pos 18, 2 Bits
   +0x000 _idReg1          : Pos 20, 6 Bits
   +0x000 _idReg2          : Pos 26, 6 Bits
   +0x004 _idSmallDsc      : Pos 0, 1 Bit
   +0x004 _idLargeCns      : Pos 1, 1 Bit
   +0x004 _idLargeDsp      : Pos 2, 1 Bit
   +0x004 _idLargeCall     : Pos 3, 1 Bit
   +0x004 _idBound         : Pos 4, 1 Bit
   +0x004 _idCallAddr      : Pos 5, 1 Bit
   +0x004 _idNoGC          : Pos 6, 1 Bit
   +0x004 _idOpSize        : Pos 7, 3 Bits
   +0x004 _idInsOpt        : Pos 10, 6 Bits
   +0x004 _idLclVar        : Pos 16, 1 Bit
   +0x004 _idLclVarPair    : Pos 17, 1 Bit
   +0x004 _idCnsReloc      : Pos 18, 1 Bit
   +0x004 _idDspReloc      : Pos 19, 1 Bit
   +0x004 _idSmallCns      : Pos 20, 12 Bits
   +0x008 _idAddrUnion     : emitter::instrDesc::idAddrUnion

After:

clrjit_universal_arm64_x64!emitter::instrDesc
   +0x000 _idIns           : Pos 0, 11 Bits
   +0x000 _idInsFmt        : Pos 11, 10 Bits
   +0x000 _idOpSize        : Pos 21, 3 Bits
   +0x000 _idInsOpt        : Pos 24, 6 Bits
   +0x000 _idGCref         : Pos 30, 2 Bits
   +0x004 _idReg1          : Pos 0, 6 Bits
   +0x004 _idReg2          : Pos 6, 6 Bits
   +0x004 _idSmallDsc      : Pos 12, 1 Bit
   +0x004 _idLargeCns      : Pos 13, 1 Bit
   +0x004 _idLargeDsp      : Pos 14, 1 Bit
   +0x004 _idLargeCall     : Pos 15, 1 Bit
   +0x004 _idBound         : Pos 16, 1 Bit
   +0x004 _idCallAddr      : Pos 17, 1 Bit
   +0x004 _idNoGC          : Pos 18, 1 Bit
   +0x004 _idLclVar        : Pos 19, 1 Bit
   +0x004 _idLclVarPair    : Pos 20, 1 Bit
   +0x004 _idCnsReloc      : Pos 21, 1 Bit
   +0x004 _idDspReloc      : Pos 22, 1 Bit
   +0x004 _idSmallCns      : Pos 23, 9 Bits
   +0x008 _idAddrUnion     : emitter::instrDesc::idAddrUnion

Nomenclature

  • All the instruction's id has a prefix of sve_ to differentiate them from regular instructions, e.g. INS_sve_mov vs. INS_mov.
  • Likewise, instruction formats have prefix of SVE_ to differentiate from existing instruction formats. e.g. IF_SVE_AB_3A vs. IF_DI_3A.
  • Currently, various instruction formats are grouped based on the common logic various instructions having them share during encoding and in various other methods. I do not have idea at this point how many of newly added instruction formats will share logic between various instructions, hence I have created as many unique as I could, and in future we could eliminate/combine many of them.
  • The naming convention for instruction formats is as follows: SVE_WW_XY, where WW is unique code like AA, AB, etc., X represents number of registers the instruction operates on, Y represents unique code A, B depending on if same instruction has different formats.

Arm64 Encoding generation tool

As mentioned in #93095, it is impossible to hand write each encoding by looking at the manual. Hence, I wrote a small tool that parses the manual available in xml format on Arm site and generate C++ code that is included in emitfmtsarm64_sve.h and instrsarm64_sve.h. The plan is to open source the tool and check it in https://github.com/dotnet/jitutils/.

Contributes to #93095

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Nov 2, 2023
@ghost ghost assigned kunalspathak Nov 2, 2023
@ghost
Copy link

ghost commented Nov 2, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Summary

Add SVE/SVE2 encodings for all the possible instructions.

Stats:

  • 606 new instructions
  • 1123 new encodings of newly added instructions
  • 378 new instruction formats
  • 103 different groupings of newly added instruction formats.

Design

I have added 2 new files emitfmtsarm64_sve.h and instrsarm64_sve.h for SVE encodings. They are separate from existing emitfmtsarm64.h and instrsarm64.h because had we added SVE encodings in existing files that contains general/NEON instructions, it would impact JIT throughput heavily specially around the code to retrieve hex code for non-sve instructions. The instruction table preparation happens using existing machinery in a separate method emitSveInsCode(). The codegen, depending on the intrinsic, will call emitSveInsCode() and thus without affecting the code path of non-sve instructions.

Nomenclature

  • All the instruction's id has a prefix of sve_ to differentiate them from regular instructions, e.g. INS_sve_mov vs. INS_mov.
  • Likewise, instruction formats have prefix of SVE_ to differentiate from existing instruction formats. e.g. IF_SVE_AB_3A vs. IF_DI_3A.
  • Currently, various instruction formats are grouped based on the common logic various instructions having them share during encoding and in various other methods. I do not have idea at this point how many of newly added instruction formats will share logic between various instructions, hence I have created as many unique as I could, and in future we could eliminate/combine many of them.
  • The naming convention for instruction formats is as follows: SVE_WW_XY, where WW is unique code like AA, AB, etc., X represents number of registers the instruction operates on, Y represents unique code A, B depending on if same instruction has different formats.

Arm64 Encoding generation tool

As mentioned in #93095, it is impossible to hand write each encoding by looking at the manual. Hence, I wrote a small tool that parses the manual available in xml format on Arm site and generate C++ code that is included in emitfmtsarm64_sve.h and instrsarm64_sve.h. The plan is to open source the tool and check it in https://github.com/dotnet/jitutils/.

Contributes to #93095

Author: kunalspathak
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@kunalspathak
Copy link
Member Author

Need to figure out the cause of TP regression. Most likely it could be the difference in code generated for existing emitInsCode() because of increase in number of entries in enum (?). Understandable to see more regression on MinOpts than FullOpts because more instructions are emitted for minopts.

image

@kunalspathak
Copy link
Member Author

@dotnet/jit-contrib
cc: @TamarChristinaArm , @a74nh , @TIHan @amanasifkhalid

@kunalspathak kunalspathak marked this pull request as ready for review November 2, 2023 14:14
@kunalspathak
Copy link
Member Author

Need to figure out the cause of TP regression. Most likely it could be the difference in code generated for existing emitInsCode() because of increase in number of entries in enum (?). Understandable to see more regression on MinOpts than FullOpts because more instructions are emitted for minopts.

image

On @BruceForstall 's suggestion, I just made a PR #94310 that changes the offsets of instrDesc, the way they are changed in this PR and see the TP impact and it looks exactly the same as one we see here. So clearly it is coming because of reduction of smallCns bit size.

TP diffs:

image

@a74nh
Copy link
Contributor

a74nh commented Nov 3, 2023

On @BruceForstall 's suggestion, I just made a PR #94310 that changes the offsets of instrDesc, the way they are changed in this PR and see the TP impact and it looks exactly the same as one we see here. So clearly it is coming because of reduction of smallCns bit size.

Am I right in thinking the overall size of a instrDesc has gone from 32bits to 64bits?

In #94310, instead of increasing the variables, what happens if you instead add padding? Eg:

instruction _idIns : 10;
instruction _pad_idIns : 1;

etc for all the other fields.
That would confirm if the extra cost is due to the overall size of instrDesc or if it is the increased field sizes.


Thinking ahead, SME is going to add a similar slowdown.

What about if SVE instruction enum started again from 0. The same with all the instruction formats. Treat it as a separate architecture.
You could have a single bit in instrDesc to indicate if it is Arm64 or Arm64Sve (and, then maybe increase to 2bits when doing SME). Or override instrDesc, creating an instrDescSve, and then emitNewInstr() can take in an enum which decides which instrDesc to allocate.
You'd have to have different emitIns_R_R_R() etc functions for sve. But, that's ok because it's keeping the switch statements shorter.

@kunalspathak
Copy link
Member Author

Am I right in thinking the overall size of a instrDesc has gone from 32bits to 64bits?

It was and is 16 bytes. instrDesc has 2 chunks of 8 bytes each. The first 8 bytes contains various bit fields and the last 8 bytes represents the idAddrUnion . I increased the size of idIns and idInsFmt that falls in the first chunk, but still keeping the overall size of that chunk 8 bytes. To offset that, we have to reduce the size of bits needed for idSmallCns. As seen in https://github.com/dotnet/runtime/pull/94285/files#diff-125fb0b9396b0e85ee82c6d592cdf1750778a7cff38ad382f7c2966ef588dce1R891, what that means is, earlier, we could create instrDesc if a constant in the instruction is 12-bits wide. Now, that space has shrunk to 9-bits. A constant beyond that size, we will have to use instrDescCns which is 24 bytes big. That causes TP regression because there are extra code paths involved to process instrDescCns. It does increase the JIT memory size for constants in the range of 10-bits thru 12-bits. Alternatives are costly, see below.

In #94310, instead of increasing the variables, what happens if you instead add padding? Eg:

instruction _idIns : 10;
instruction _pad_idIns : 1;

etc for all the other fields. That would confirm if the extra cost is due to the overall size of instrDesc or if it is the increased field sizes.

Thinking ahead, SME is going to add a similar slowdown.

What about if SVE instruction enum started again from 0. The same with all the instruction formats. Treat it as a separate architecture. You could have a single bit in instrDesc to indicate if it is Arm64 or Arm64Sve (and, then maybe increase to 2bits when doing SME). Or override instrDesc, creating an instrDescSve, and then emitNewInstr() can take in an enum which decides which instrDesc to allocate.

emitNewInstr() is a hot method and adding a single check will impact TP heavily. @TIHan tried doing that a while back and found out it is costly. If not in emitNewInstrDesc, I was thinking of having a union or something of idIns and idFmt, but even for that we need a bit to tell if it is SVE or not and then code somewhere and possibly in a hot-method if the idIns represents sve or not. With the existing model, it is ok to take the minimal TP cost, but overall it doesn't impact any non-sve code. I am hoping the same once we start adding more cases in common methods like emitOutputInstr(), etc.

You'd have to have different emitIns_R_R_R() etc functions for sve. But, that's ok because it's keeping the switch statements shorter.

That will be lot of methods though.

Copy link
Member

@BruceForstall BruceForstall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good to me. You can follow up with my comments/questions in a follow-up if desired.

src/coreclr/jit/emitarm64.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/emitfmtsarm64_sve.h Outdated Show resolved Hide resolved
src/coreclr/jit/emitfmtsarm64_sve.h Outdated Show resolved Hide resolved
src/coreclr/jit/emitfmtsarm64_sve.h Outdated Show resolved Hide resolved
src/coreclr/jit/emitfmtsarm64_sve.h Outdated Show resolved Hide resolved
@kunalspathak kunalspathak merged commit ac56247 into dotnet:main Nov 8, 2023
@kunalspathak kunalspathak deleted the sve-encodings branch November 8, 2023 15:08
@kunalspathak kunalspathak added the arm-sve Work related to arm64 SVE/SVE2 support label Nov 9, 2023
a74nh added a commit to a74nh/runtime that referenced this pull request Nov 16, 2023
* Add encodings for SVE

* Adjust the offsets

* jit format

* fix the build for riscV64

* Address review feedback from Bruce

* Rename files to remove _

* forgot to rename file names in other places

Change-Id: I54932d16104a2582d9afc3fca6844183aa8c3536
@github-actions github-actions bot locked and limited conversation to collaborators Dec 10, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI arm-sve Work related to arm64 SVE/SVE2 support
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants