Handle > 64 registers + predicate registers for Arm64 #98258

kunalspathak · 2024-02-10T04:02:31Z

Review guide

Predicate Registers

registerarm64.h adds the 16 new predicate registers, numbered 64 through 79. Their masks range from 0x1 through 0x8000. The following files add the predicate registers. On arm64 we now need 7 bits to represent the register number, so REGNUM_BITS has changed from 6 to 7 bits (targetarm64.h).
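The 6 -> 7 bit change follows from the register counts alone. A minimal sketch of that arithmetic, assuming 32 gpr + 32 float + 16 predicate registers and ignoring any sentinel values; `bitsNeeded` and the `k*Count` constants are hypothetical names, not part of the JIT:

```cpp
#include <cassert>

// Hypothetical helper: smallest number of bits that can encode the values 0..count-1.
constexpr unsigned bitsNeeded(unsigned count)
{
    unsigned bits = 0;
    while ((1u << bits) < count)
    {
        bits++;
    }
    return bits;
}

// Register counts assumed from the description above (sentinel values ignored).
constexpr unsigned kGprCount       = 32;
constexpr unsigned kFloatCount     = 32;
constexpr unsigned kPredicateCount = 16;

static_assert(bitsNeeded(kGprCount + kFloatCount) == 6, "64 registers fit in 6 bits");
static_assert(bitsNeeded(kGprCount + kFloatCount + kPredicateCount) == 7, "80 registers need 7 bits");
```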

AllRegsMask

A new struct is introduced to represent the register mask. If HAS_MORE_THAN_64_REGISTERS is not defined (all non-arm64 platforms), it contains a single 64-bit field. For arm64 (where HAS_MORE_THAN_64_REGISTERS is defined), the struct contains an extra 4-byte field to represent the predicate registers. The definition of this struct is in target.h and is very similar to how I mentioned it here.

typedef struct _regMaskAll
{
private:
#ifdef HAS_MORE_THAN_64_REGISTERS
  union
  {
      RegBitSet32 _registers[3];
      struct
      {
          union
          {
              // Represents combined registers bitset including gpr/float
              RegBitSet64 _combinedRegisters;
              struct
              {
                  RegBitSet32 _gprRegs;
                  RegBitSet32 _floatRegs;
              };
          };
          RegBitSet32 _predicateRegs;
      };
  };
#else
  // Represents combined registers bitset including gpr/float and on some platforms
  // mask or predicate registers
  RegBitSet64 _combinedRegisters;
#endif

A few things are worth explaining about this struct:

As seen, for non-arm64 platforms, where HAS_MORE_THAN_64_REGISTERS is not defined, the struct continues to operate on a single 64-bit field, so the TP of these platforms is not impacted.
For arm64, there are essentially three 4-byte fields here, one each for gpr, float and predicate registers, in that order. They are defined under a union so that AllRegsMask can access the relevant field without any branches or conditions. For example, if a float register d2 has to be added to the mask, I did not want to have something like:

if (regNum < 32) { _gprRegs = ... ;}
else if (regNum < 64) {_floatRegs = ... ;}
else { _predicateRegs = ... ;}

Instead, a mapping is added in all register*.h files to map all gpr registers -> 0, float registers -> 1 and predicate registers -> 2. Having that, I could rewrite the above code as:

_registers[regIndexForRegNum(regNum)] = ...;
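A minimal sketch of such a mapping, assuming the arm64 numbering described earlier (0-31 gpr, 32-63 float, 64-79 predicate). The table type and the `k*` names are hypothetical; in the PR the index comes from the register*.h definitions rather than from a computed table:

```cpp
#include <cassert>
#include <cstdint>

// Assumed arm64 numbering: 0-31 gpr, 32-63 float, 64-79 predicate.
constexpr int kRegCount = 80;

// Hypothetical stand-in for the per-register index that the register*.h files provide.
struct RegIndexTable
{
    uint8_t index[kRegCount];

    constexpr RegIndexTable() : index{}
    {
        for (int reg = 0; reg < kRegCount; reg++)
        {
            index[reg] = (reg < 32) ? 0 : (reg < 64) ? 1 : 2;
        }
    }
};

constexpr RegIndexTable kRegIndexTable{};

// One table load replaces the if / else-if chain at every call site.
inline int regIndexForRegNum(int regNum)
{
    assert(regNum >= 0 && regNum < kRegCount);
    return kRegIndexTable.index[regNum];
}
```

The range checks run once, when the table is built at compile time; each lookup afterwards is a single indexed load.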

Using predicate registers from the mask is uncommon; mostly, the consumers of AllRegsMask are interested in gpr/float registers. Hence a _combinedRegisters field is added to the union to give easy access to them.
This design also provides an easy way to retrieve just gprRegs(), floatRegs() or predicateRegs().

Until now, manipulating the mask (adding/removing registers) was trivial and was done using bit manipulation. The AllRegsMask struct instead offers new methods for such manipulation.

Firstly, various operators are implemented to give seamless manipulation of the underlying fields throughout the code base, without having to make changes at all those places.
Most methods that directly take regNumber as input already know which field of the AllRegsMask struct they operate on. Methods that take a set of registers also need to know the register class of that mask, to determine which field of the AllRegsMask to update. The definitions of these methods are in compiler.hpp.
Two methods worth mentioning are encodeForIndex() and decodeForIndex(). Imagine adding the mask of a register (1 << regNumber) to the relevant field; here is how we could write it:

mask = genRegMask(regNumber);
if (regNumber < 64) { _combinedRegisters |= mask; }
else { _predicateRegs |= mask; }

Alternatively, since both gpr and predicate register masks start at bit 0, they can be added directly to _registers[0] or _registers[2], and we could rewrite it as follows:

index = regIndexForRegNum(regNumber);
mask = genRegMask(regNumber);
if (regNumber < 32 || regNumber > 63) { _registers[index] |= mask; }
else { _combinedRegisters |= mask; /* float register mask */ }

Either way, we need a branch to add the mask to the relevant field. To make this code branch-free, encodeForIndex() right-shifts a float register mask by 32 bits so it fits in 4 bytes, and decodeForIndex() left-shifts it by 32 bits when the float registers are read back. With that, we can just do something like this:

index = regIndexForRegNum(regNumber);
mask = genRegMask(regNumber);
_registers[index] |= encodeForIndex(index, mask);
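A self-contained sketch of this encode/decode idea, under stated assumptions: registers are numbered 0-31 gpr, 32-63 float, 64-79 predicate; gpr masks use bits 0-31, float masks bits 32-63, and predicate masks start again at bit 0. The shift expression `(index & 1) << 5` and the `AllRegsMaskSketch` type are illustrative choices, not necessarily what the PR does:

```cpp
#include <cassert>
#include <cstdint>

using RegBitSet32 = uint32_t;
using RegBitSet64 = uint64_t;

// With the assumed numbering the field index happens to be regNum / 32.
// The PR uses a mapping from the register*.h files instead.
inline int regIndexForRegNum(int regNum)
{
    assert(regNum >= 0 && regNum < 80);
    return regNum >> 5;
}

// Assumed mask convention: gpr bits 0-31, float bits 32-63, predicate bits restart at 0.
inline RegBitSet64 genRegMask(int regNum)
{
    return RegBitSet64(1) << (regNum & 63);
}

// Shift by 32 only for the float field (index 1); index 0 and 2 shift by 0.
inline RegBitSet32 encodeForIndex(int index, RegBitSet64 mask)
{
    return static_cast<RegBitSet32>(mask >> ((index & 1) << 5));
}

inline RegBitSet64 decodeForIndex(int index, RegBitSet32 value)
{
    return static_cast<RegBitSet64>(value) << ((index & 1) << 5);
}

struct AllRegsMaskSketch
{
    RegBitSet32 registers[3] = {0, 0, 0}; // gpr, float, predicate

    // Branch-free add: the index picks the field, the encode step fits the mask into it.
    void addRegNum(int regNum)
    {
        int index = regIndexForRegNum(regNum);
        registers[index] |= encodeForIndex(index, genRegMask(regNum));
    }

    bool isRegNumPresent(int regNum) const
    {
        int index = regIndexForRegNum(regNum);
        return (decodeForIndex(index, registers[index]) & genRegMask(regNum)) != 0;
    }

    // Equivalent of reading the gpr and float fields as one 64-bit value.
    RegBitSet64 combinedRegisters() const
    {
        return decodeForIndex(0, registers[0]) | decodeForIndex(1, registers[1]);
    }
};
```

Adding x2, d2 (register number 34) and p1 (register number 65) sets bit 2 in the gpr field, bit 2 in the float field and bit 1 in the predicate field, with no comparison on the register number.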

In LSRA, until now, we used a single 64-bit primitive to represent a register set. To extract each register corresponding to an ON bit in the set, we iterate through it, return the next ON bit and toggle it. The following new methods are added to handle that aspect:

genFirstRegNumFromMaskAndToggle() : With AllRegsMask, once we run out of gpr/float field, we need to iterate over the _predicateRegs field.
genRegNumFromMask() : This method now also takes the type of the register it is expected to extract and, based on that type, adds 64 to the ON bit extracted from the mask.
genFirstRegNumFromMask() : This too first scans the gpr/float fields to see if anything is set and if not, will look into _predicateRegs field.
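The iterate-and-toggle pattern described for these methods can be sketched as follows. This is a simplified illustration, not the PR's code: `AllRegsSketch`, `lowestSetBit` and `firstRegNumAndToggleSketch` are hypothetical names, and predicate registers are assumed to be numbered from 64:

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for AllRegsMask: gpr/float share one 64-bit field,
// predicate registers live in a separate 32-bit field.
struct AllRegsSketch
{
    uint64_t combinedRegisters = 0;
    uint32_t predicateRegs     = 0;

    bool isEmpty() const
    {
        return (combinedRegisters == 0) && (predicateRegs == 0);
    }
};

// Index of the lowest set bit; a real implementation would use a bit-scan intrinsic.
inline int lowestSetBit(uint64_t value)
{
    assert(value != 0);
    int bit = 0;
    while (((value >> bit) & 1) == 0)
    {
        bit++;
    }
    return bit;
}

// Returns the lowest-numbered register in the mask and clears its bit.
// The predicate field is consulted only once the gpr/float field is exhausted.
inline int firstRegNumAndToggleSketch(AllRegsSketch& mask)
{
    assert(!mask.isEmpty());
    if (mask.combinedRegisters != 0)
    {
        int bit = lowestSetBit(mask.combinedRegisters);
        mask.combinedRegisters ^= (uint64_t(1) << bit);
        return bit;
    }
    int bit = lowestSetBit(mask.predicateRegs);
    mask.predicateRegs ^= (uint32_t(1) << bit);
    return 64 + bit;
}
```

A caller would loop with `while (!mask.isEmpty())`, handling one register per iteration.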

Lsra

lsrabuild.cpp
- General renaming of types; mostly, things are renamed from regMaskTP to regMaskOnlyOne
- The signatures of various methods (e.g. addRefsForPhysRegMask) that take a mask containing registers of different register classes (gpr/float/predicate) now take AllRegsMask. They either pass the mask through to other methods, or iterate over all the registers present in the mask (and toggle them) using the newly added genFirstRegNumFromMaskAndToggle().
- Use AllRegsMask instead to save the killMask
- Certain methods that take regMaskOnlyOne as a parameter now get an extra type parameter that tells which register class the mask represents, so that the right fields of AllRegsMask can be updated.
- BuildDef* methods: Added a few new methods to group some of the logic around building definitions for calls, BuildCallDefs()/BuildCallDefsWithKills(), and one that just builds RefPositions for kills, BuildKills(). Most of them take AllRegsMask as the killMask.
lsra.h
- The RegisterType field is removed from RegRecord and Interval and moved into the parent class Referenceable. This makes it easy to query the type to determine the register class for a given RefPosition. Without this, we would first have to check whether the RefPosition represents a virtual register (Interval) or a physical register (RegRecord).
- General renaming of regMaskTP to regMaskGpr or regMaskFloat.
- A new overload has been added for certain methods like freeRegisters(), verifyFreeRegisters(), updateDeadCandidatesAtBlockStart(), inActivateRegisters(). The original method continues to operate on the 64-bit mask (regMaskTP), and the overloaded method operates on AllRegsMask. The only difference between the two is how the bits are iterated and toggled inside the method.
- The type of certain fields like m_AvailableRegs, placedArgRegs, registersToDump, m_RegistersWithConstants, fixedRegs, regsBusyUntilKill, regsInUseThisLocation, regsInUseNextLocation is changed from regMaskTP to AllRegsMask. The relevant methods that read/write these fields are updated to take the registerType as a parameter. Based on that, they add/remove the given register mask from the AllRegsMask.
lsra.cpp
- Most of the methods that touch fields like m_AvailableRegs, etc. now have to use the methods from AllRegsMask to add/remove/update the register/register mask. For that, we need to pass the registerType to those methods.
- Methods that previously defined 64-bit register mask variables like regsToFree, delayRegsToFree, regsToMakeInactive, delayRegsToMakeInactive, copyRegsToFree, targetRegsToDo, targetRegsReady, targetRegsFromStack, etc. to track the registers during allocation/resolution now use AllRegsMask, and the way they add/remove registers changes accordingly. They use methods from AllRegsMask and sometimes need to pass registerType to identify the register class of the mask being added/removed.
lsraxarch.cpp
lsraarmarch.cpp
lsraarm64.cpp
- General renaming of regMaskTP type.
- Uses the new methods created for building the killMask.
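To illustrate the lsra.h/lsra.cpp notes about passing the register type when updating fields such as m_AvailableRegs, here is a minimal sketch. All names (`RegClass`, `AllRegsSketch`, `LinearScanSketch` and its methods) are hypothetical and the layout is simplified to one 64-bit mask per class:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical register classes; the order mirrors the gpr/float/predicate fields.
enum class RegClass
{
    Gpr       = 0,
    Float     = 1,
    Predicate = 2
};

// Simplified stand-in for AllRegsMask: one single-class mask per register class.
struct AllRegsSketch
{
    uint64_t regs[3] = {0, 0, 0};

    static int indexOf(RegClass cls)
    {
        int index = static_cast<int>(cls);
        assert(index >= 0 && index < 3);
        return index;
    }

    void addRegs(uint64_t mask, RegClass cls) { regs[indexOf(cls)] |= mask; }
    void removeRegs(uint64_t mask, RegClass cls) { regs[indexOf(cls)] &= ~mask; }
    uint64_t regsOf(RegClass cls) const { return regs[indexOf(cls)]; }
};

// Hypothetical slice of allocator state, loosely modelled on m_AvailableRegs.
struct LinearScanSketch
{
    AllRegsSketch availableRegs;

    // The register class tells the mask which field to update.
    void makeRegsAvailable(uint64_t mask, RegClass cls) { availableRegs.addRegs(mask, cls); }
    void setRegsInUse(uint64_t mask, RegClass cls) { availableRegs.removeRegs(mask, cls); }

    bool isRegAvailable(uint64_t singleRegMask, RegClass cls) const
    {
        return (availableRegs.regsOf(cls) & singleRegMask) != 0;
    }
};
```

Because every update names its class, the same bit pattern can describe a gpr set in one call and a predicate set in the next without the two colliding.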

Codegen

codegenarmarch.cpp
codegencommon.cpp
codegenxarch.cpp
- Registers are added to and retrieved from regSet using new methods on regSet, based on the type.
- General renaming of types
- Use AllRegsMask instead to save the killMask
codegenarm64.cpp
- gen(Save|Restore)CalleeSavedRegisterGroup: the type of the register mask is renamed from regMaskTP to regMaskOnlyOne, and it takes the type as an additional parameter so it can pass it along to other methods like genBuildRegPairsStack, etc.
- gen(Save|Restore)CalleeSavedRegistersHelp() now takes AllRegsMask instead of regMaskTP, because it also has to save/restore predicate registers. This method passes the individual register class mask to gen(Save|Restore)CalleeSavedRegisterGroup. Callers of this method basically extract the AllRegsMask from the regSet field to send all the callee-saved registers. I might simplify some of this to undo some of the changes.

Compiler

compiler.cpp
- Most of the RBM_* masks that are defined today in the various target*.h files need a corresponding AllRegsMask_* equivalent. Most of them rely on the availability of float registers, which for AVX512 is not known until we initialize the compiler object with CPU features. Hence these fields are defined after that initialization, so they contain the accurate active register set, especially the float registers, needed for the compilation of the method.
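A rough sketch of why such sets have to be filled in at run time. It assumes an x64-like target where xmm16-xmm31 and the k0-k7 mask registers exist only with AVX-512; the bit layout, the 16-gpr mask and every name below are hypothetical, not the JIT's definitions:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bit layout:
//   float bits 0-15  -> xmm0-xmm15  (always available)
//   float bits 16-31 -> xmm16-xmm31 (AVX-512 only)
//   mask bits 0-7    -> k0-k7       (AVX-512 only)
constexpr uint64_t kLowFloatRegs  = 0x0000FFFFull;
constexpr uint64_t kHighFloatRegs = 0xFFFF0000ull;
constexpr uint32_t kMaskRegs      = 0xFFu;

struct AllRegsSketch
{
    uint64_t gprRegs;
    uint64_t floatRegs;
    uint32_t maskRegs;
};

struct CompilerSketch
{
    bool          m_masksReady = false;
    AllRegsSketch m_allRegs{};

    // Must run after the CPU features are known, which is why the sets
    // cannot be compile-time constants in this sketch.
    void initRegisterSets(bool hasAvx512)
    {
        m_allRegs.gprRegs   = 0xFFFFull; // 16 gprs, hypothetical
        m_allRegs.floatRegs = kLowFloatRegs | (hasAvx512 ? kHighFloatRegs : 0);
        m_allRegs.maskRegs  = hasAvx512 ? kMaskRegs : 0;
        m_masksReady        = true;
    }

    const AllRegsSketch& allRegs() const
    {
        assert(m_masksReady); // reading before initialization would see stale sets
        return m_allRegs;
    }
};
```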

Misc

Most of the other changes mentioned below are minimal and are needed because the regMaskTP is renamed to one of regMaskGpr, regMaskFloat, etc.

The following files just have changes to the type names
- abi.cpp
- abi.h
- block.h
- codegen.h (along with using AllRegsMask in FuncletInfo)
- codegeninterface.h
- codegenarm.cpp
- codegenlinear.cpp
- emit.cpp (along with some new methods to display the AllRegsMask)
- emit.h (along with ID_EXTRA_BITFIELD_BITS increased from 21 to 23 bits because we have two REGNUM_BITS in instrDesc)
- emitarm.cpp
- emitarm.h
- emitarm64.cpp
- emitarm64.h
- emitxarch.cpp
- emitxarch.h
- emitinl.h
- emitpub.h
- gcinfo.cpp
- gcencode.cpp
- instr.cpp
- jitgcinfo.h
- lclvars.cpp
- morph.cpp
- optimizer.cpp
- regalloc.cpp
- registerargconvention.cpp
- registerargconvention.h
- targetamd64.cpp
- targetarm.cpp
- targetarm64.cpp
- targetx86.cpp
- typelist.h
- unwindarmarch.cpp
The following files add a new parameter to categorize the register class
- register.h
- registerarm.h
- registerriscv64.h
- registerloongarch64.h
- emitloongarch64.cpp
- emitriscv64.cpp
Added REG_FP_COUNT, REG_MASK_COUNT and RBM_ALLGPR
- targetamd64.h
- targetarm.h
- targetx86.h
We have a RegSet class that tracks the registers touched and is used during codegen to return unused registers or spill registers. To track all the different types of registers, I converted rsModifiedRegsMask from regMaskTP to AllRegsMask. All the other methods that changed either save a particular register or set of registers (depending on the type) to rsModifiedRegsMask or return the register set for a given type.
- regset.h
- regset.cpp
Some of the methods in GenTree* now need to return AllRegsMask because the ABI might require it (I am certain that this can be just float/gpr, but kept it this way for now):
- gentree.h
- gentree.cpp
Refactoring
- lsraarm.cpp
Handling of mask registers
- unwind.cpp

Old TODO

This is just a prototype to see if the asserts added are hit or not.

TODO:

Fixes: #99658

kunalspathak · 2024-04-10T23:36:48Z

This is ready for review. Please go through it and let me know what you think. It looks like the one-time cost we are looking at is a TP regression of ~10% for MinOpts and ~4% for FullOpts. The regression is currently only on arm64; the non-arm64 platforms are mostly untouched.

I have not yet updated the names of following typedefs and would like some suggestions. Here is my proposal:

regMaskGpr : Should be called GprRegs
regMaskFloat: Should be called FloatRegs
regMaskPredicate: Should be called PredicateRegs
regMaskOnlyOne: Should be called SingleTypeRegs
singleRegMask: Should be called SingleReg
AllRegsMask: Should be called AllRegs
RegBitSet64: Should be called _64Regs
RegBitSet32: Should be called _32Regs

AndyAyersMS · 2024-04-12T17:52:01Z

@kunalspathak you should nominate some people specifically for review.

Also seems like you ought to remove the "NO" labels on the PR.

jakobbotsch · 2024-04-12T19:48:18Z

I have not yet updated the names of following typedefs and would like some suggestions. Here is my proposal:

regMaskGpr : Should be called GprRegs

regMaskFloat: Should be called FloatRegs

regMaskPredicate: Should be called PredicateRegs

regMaskOnlyOne: Should be called SingleTypeRegs

singleRegMask: Should be called SingleReg

AllRegsMask: Should be called AllRegs

RegBitSet64: Should be called _64Regs

RegBitSet32: Should be called _32Regs

My two cents: Set in the name seems nice to indicate whenever something is a set. r in "gpr" already stands for register, so GprRegs is a bit redundant. My proposal:

GpRegSet
FloatRegSet
PredicateRegSet
SingleTypeRegSet
SingleReg (what is the difference between this and regNumber?)
AnyTypeRegSet
RegSet64, RegSet32 (what is the intended use case for these?)

On a related note I think we shouldn't mix naming conventions for some of the new fields, like AllRegsMask_CALLEE_TRASH_NOGC. Perhaps some new prefix can be used (e.g. instead of RBM_ it could be ALLREGS_, or whatever is appropriate for the final name of the set type that we end up with...)

jakobbotsch · 2024-04-12T19:51:13Z

src/coreclr/jit/compiler.cpp

+    AllRegsMask_STOP_FOR_GC_TRASH =
+        AllRegsMask((RBM_INT_CALLEE_TRASH & ~RBM_INTRET), (RBM_FLT_CALLEE_TRASH & ~RBM_FLOATRET), RBM_MSK_CALLEE_TRASH);
+    AllRegsMask_PROFILER_ENTER_TRASH = AllRegsMask_CALLEE_TRASH;
+#endif // UNIX_AMD64_ABI
+
+    AllRegsMask_PROFILER_LEAVE_TRASH    = AllRegsMask_STOP_FOR_GC_TRASH;
+    AllRegsMask_PROFILER_TAILCALL_TRASH = AllRegsMask_PROFILER_LEAVE_TRASH;
+
+    // The registers trashed by the CORINFO_HELP_INIT_PINVOKE_FRAME helper.
+    AllRegsMask_INIT_PINVOKE_FRAME_TRASH     = AllRegsMask_CALLEE_TRASH;
+    AllRegsMask_VALIDATE_INDIRECT_CALL_TRASH = GprRegsMask(RBM_VALIDATE_INDIRECT_CALL_TRASH);
+
+#elif defined(TARGET_ARM)
+
+    AllRegsMask_CALLEE_TRASH_NOGC    = GprRegsMask(RBM_CALLEE_TRASH_NOGC);
+    AllRegsMask_PROFILER_ENTER_TRASH = AllRegsMask_NONE;
+
+    // Registers killed by CORINFO_HELP_ASSIGN_REF and CORINFO_HELP_CHECKED_ASSIGN_REF.
+    AllRegsMask_CALLEE_TRASH_WRITEBARRIER = GprRegsMask(RBM_R0 | RBM_R3 | RBM_LR | RBM_DEFAULT_HELPER_CALL_TARGET);
+
+    // Registers no longer containing GC pointers after CORINFO_HELP_ASSIGN_REF and CORINFO_HELP_CHECKED_ASSIGN_REF.
+    AllRegsMask_CALLEE_GCTRASH_WRITEBARRIER = AllRegsMask_CALLEE_TRASH_WRITEBARRIER;
+
+    // Registers killed by CORINFO_HELP_ASSIGN_BYREF.
+    AllRegsMask_CALLEE_TRASH_WRITEBARRIER_BYREF =
+        GprRegsMask(RBM_WRITE_BARRIER_DST_BYREF | RBM_WRITE_BARRIER_SRC_BYREF | RBM_CALLEE_TRASH_NOGC);
+
+    // Registers no longer containing GC pointers after CORINFO_HELP_ASSIGN_BYREF.
+    // Note that r0 and r1 are still valid byref pointers after this helper call, despite their value being changed.
+    AllRegsMask_CALLEE_GCTRASH_WRITEBARRIER_BYREF = AllRegsMask_CALLEE_TRASH_NOGC;
+    AllRegsMask_PROFILER_RET_SCRATCH              = GprRegsMask(RBM_R2);
+    // While REG_PROFILER_RET_SCRATCH is not trashed by the method, the register allocator must
+    // consider it killed by the return.
+    AllRegsMask_PROFILER_LEAVE_TRASH    = AllRegsMask_PROFILER_RET_SCRATCH;
+    AllRegsMask_PROFILER_TAILCALL_TRASH = AllRegsMask_NONE;
+    // The registers trashed by the CORINFO_HELP_STOP_FOR_GC helper (JIT_RareDisableHelper).
+    // See vm\arm\amshelpers.asm for more details.
+    AllRegsMask_STOP_FOR_GC_TRASH =
+        AllRegsMask((RBM_INT_CALLEE_TRASH & ~(RBM_LNGRET | RBM_R7 | RBM_R8 | RBM_R11)),
+                    (RBM_FLT_CALLEE_TRASH & ~(RBM_DOUBLERET | RBM_F2 | RBM_F3 | RBM_F4 | RBM_F5 | RBM_F6 | RBM_F7)));
+    // The registers trashed by the CORINFO_HELP_INIT_PINVOKE_FRAME helper.
+    AllRegsMask_INIT_PINVOKE_FRAME_TRASH =
+        (AllRegsMask_CALLEE_TRASH | GprRegsMask(RBM_PINVOKE_TCB | RBM_PINVOKE_SCRATCH));
+
+    AllRegsMask_VALIDATE_INDIRECT_CALL_TRASH = GprRegsMask(RBM_INT_CALLEE_TRASH);
+
+#elif defined(TARGET_ARM64)
+
+    AllRegsMask_CALLEE_TRASH_NOGC    = GprRegsMask(RBM_CALLEE_TRASH_NOGC);
+    AllRegsMask_PROFILER_ENTER_TRASH = AllRegsMask((RBM_INT_CALLEE_TRASH & ~(RBM_ARG_REGS | RBM_ARG_RET_BUFF | RBM_FP)),
+                                                   (RBM_FLT_CALLEE_TRASH & ~RBM_FLTARG_REGS), RBM_MSK_CALLEE_TRASH);
+    // Registers killed by CORINFO_HELP_ASSIGN_REF and CORINFO_HELP_CHECKED_ASSIGN_REF.
+    AllRegsMask_CALLEE_TRASH_WRITEBARRIER = GprRegsMask(RBM_R14 | RBM_CALLEE_TRASH_NOGC);
+
+    // Registers no longer containing GC pointers after CORINFO_HELP_ASSIGN_REF and CORINFO_HELP_CHECKED_ASSIGN_REF.
+    AllRegsMask_CALLEE_GCTRASH_WRITEBARRIER = AllRegsMask_CALLEE_TRASH_NOGC;
+
+    // Registers killed by CORINFO_HELP_ASSIGN_BYREF.
+    AllRegsMask_CALLEE_TRASH_WRITEBARRIER_BYREF =
+        GprRegsMask(RBM_WRITE_BARRIER_DST_BYREF | RBM_WRITE_BARRIER_SRC_BYREF | RBM_CALLEE_TRASH_NOGC);
+
+    // Registers no longer containing GC pointers after CORINFO_HELP_ASSIGN_BYREF.
+    // Note that x13 and x14 are still valid byref pointers after this helper call, despite their value being changed.
+    AllRegsMask_CALLEE_GCTRASH_WRITEBARRIER_BYREF = AllRegsMask_CALLEE_TRASH_NOGC;
+
+    AllRegsMask_PROFILER_LEAVE_TRASH    = AllRegsMask_PROFILER_ENTER_TRASH;
+    AllRegsMask_PROFILER_TAILCALL_TRASH = AllRegsMask_PROFILER_ENTER_TRASH;
+
+    // The registers trashed by the CORINFO_HELP_STOP_FOR_GC helper
+    AllRegsMask_STOP_FOR_GC_TRASH = AllRegsMask_CALLEE_TRASH;
+    // The registers trashed by the CORINFO_HELP_INIT_PINVOKE_FRAME helper.
+    AllRegsMask_INIT_PINVOKE_FRAME_TRASH     = AllRegsMask_CALLEE_TRASH;
+    AllRegsMask_VALIDATE_INDIRECT_CALL_TRASH = GprRegsMask(RBM_VALIDATE_INDIRECT_CALL_TRASH);
+#endif
+
+#if defined(TARGET_ARM)
+    // profiler scratch remains gc live
+    AllRegsMask_PROF_FNC_LEAVE = AllRegsMask_PROFILER_LEAVE_TRASH & ~AllRegsMask_PROFILER_RET_SCRATCH;
+#else
+    AllRegsMask_PROF_FNC_LEAVE = AllRegsMask_PROFILER_LEAVE_TRASH;
+#endif // TARGET_ARM
+
+#ifdef TARGET_XARCH
+
+    // Make sure we copy the register info and initialize the
+    // trash regs after the underlying fields are initialized
+
+    const regMaskTP vtCalleeTrashRegs[TYP_COUNT]{
+#define DEF_TP(tn, nm, jitType, sz, sze, asze, st, al, regTyp, regFld, csr, ctr, tf) ctr,
+#include "typelist.h"
+#undef DEF_TP
+    };
+    memcpy(varTypeCalleeTrashRegs, vtCalleeTrashRegs, sizeof(regMaskTP) * TYP_COUNT);
+
+    if (codeGen != nullptr)
+    {
+        codeGen->CopyRegisterInfo();
+    }
+#endif // TARGET_XARCH
+}


You may have said this earlier, but why can these not be static variables with values baked into the .dll?

Because compiler object has ISA information that we use to determine the float/mask registers to include. #98258 (comment)

Do all of these sets need dynamic creation? Isn't it just the ones defined here that do?

runtime/src/coreclr/jit/codegeninterface.h

Lines 62 to 91 in b7c3446

#if defined(TARGET_AMD64)
    regMaskTP rbmAllFloat;
    regMaskTP rbmFltCalleeTrash;

    FORCEINLINE regMaskTP get_RBM_ALLFLOAT() const
    {
        return this->rbmAllFloat;
    }
    FORCEINLINE regMaskTP get_RBM_FLT_CALLEE_TRASH() const
    {
        return this->rbmFltCalleeTrash;
    }
#endif // TARGET_AMD64

#if defined(TARGET_XARCH)
    regMaskTP rbmAllMask;
    regMaskTP rbmMskCalleeTrash;

    // Call this function after the equivalent fields in Compiler have been initialized.
    void CopyRegisterInfo();

    FORCEINLINE regMaskTP get_RBM_ALLMASK() const
    {
        return this->rbmAllMask;
    }
    FORCEINLINE regMaskTP get_RBM_MSK_CALLEE_TRASH() const
    {
        return this->rbmMskCalleeTrash;
    }
#endif // TARGET_XARCH

kunalspathak · 2024-04-12T19:55:52Z

SingleReg (what is the difference between this and regNumber?)

Probably the name should be SingleRegBitSet, which basically says that the mask contains just 1 bit set. It is usually the type returned from genRegMask(regNumber).

RegSet64, RegSet32 (what is the intended use case for these?)

RegSet64 is basically just regMaskTP to indicate that the entity represents 64 registers and likewise for RegSet32. I introduced RegSet32 so I can make the GPR to that type, but I will do it in a follow-up PR.
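The "just 1 bit set" property of a SingleRegBitSet is cheap to check. A small sketch, where `isSingleRegBitSet` and `genRegMaskSketch` are hypothetical helpers (the latter only mimics the 1 << regNumber convention for registers below 64):

```cpp
#include <cassert>
#include <cstdint>

// True when exactly one bit is set: clearing the lowest set bit must leave zero.
inline bool isSingleRegBitSet(uint64_t mask)
{
    return (mask != 0) && ((mask & (mask - 1)) == 0);
}

// Hypothetical stand-in for genRegMask(regNumber), limited to registers 0-63.
inline uint64_t genRegMaskSketch(int regNum)
{
    assert(regNum >= 0 && regNum < 64);
    return uint64_t(1) << regNum;
}
```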

jakobbotsch · 2024-04-15T13:05:12Z

src/coreclr/jit/target.h

+
+// Represents that the mask in this type is from one of the register type - gpr/float/predicate
+// but not more than 1.
+typedef unsigned __int64 regMaskOnlyOne;


Is it possible to have this type come with a tag that can ensure (in DEBUG only) that we don't try to do operations with mismatched register types? Currently I assume that would result in bogus results.
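One possible shape for such a tag, as a sketch only. It assumes a wrapper class instead of the plain typedef and uses NDEBUG as a stand-in for the JIT's DEBUG configuration; `SingleTypeMaskSketch` and `RegClass` are hypothetical names:

```cpp
#include <cassert>
#include <cstdint>

enum class RegClass : uint8_t
{
    Gpr,
    Float,
    Predicate
};

// Hypothetical wrapper around the 64-bit payload of regMaskOnlyOne.
class SingleTypeMaskSketch
{
    uint64_t m_bits;
#ifndef NDEBUG
    RegClass m_class; // checked builds only; other builds carry just the 64-bit payload
#endif

public:
    SingleTypeMaskSketch(uint64_t bits, RegClass cls)
        : m_bits(bits)
#ifndef NDEBUG
        , m_class(cls)
#endif
    {
        (void)cls; // unused when the tag is compiled out
    }

    uint64_t bits() const
    {
        return m_bits;
    }

    // Union of two masks; in checked builds the register classes must match.
    friend SingleTypeMaskSketch operator|(const SingleTypeMaskSketch& a, const SingleTypeMaskSketch& b)
    {
#ifndef NDEBUG
        assert(a.m_class == b.m_class && "mixing register classes in one mask");
        return SingleTypeMaskSketch(a.m_bits | b.m_bits, a.m_class);
#else
        return SingleTypeMaskSketch(a.m_bits | b.m_bits, RegClass::Gpr);
#endif
    }
};
```

With the tag present, combining a gpr mask with a float mask trips the assert instead of silently producing a meaningless bit pattern.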

AndyAyersMS · 2024-04-17T20:22:22Z

TP cost still seems awfully high...

@jakobbotsch had an idea which might give us the freedom to explore how to control costs better, and also unblock dependent work, with (we suspect) very little or no downside.

What if we restrict arm64 for the time being to only be able to allocate 24 FP registers? Then we can fit 32 GPR + 24 FP + 8 Mask into 64 bits, presumably with fairly minimal TP impact. Likely we have no cases where we really need more than 24 FP regs, so there won't be much CQ impact either.

We will still need to solve the > 64 allocatable things problem but, we'll have time to work on it independently.
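The 32 GPR + 24 FP + 8 mask packing could look like the following sketch. The bit positions (gpr first, then float, then predicate) and all names are assumptions made for illustration; the comment does not specify a layout:

```cpp
#include <cassert>
#include <cstdint>

enum class RegClass
{
    Gpr,
    Float,
    Predicate
};

constexpr int kGprCount       = 32;
constexpr int kFloatCount     = 24; // restricted from 32
constexpr int kPredicateCount = 8;  // restricted from 16
static_assert(kGprCount + kFloatCount + kPredicateCount == 64, "everything must fit in one 64-bit mask");

// Hypothetical layout: gpr in bits 0-31, float in bits 32-55, predicate in bits 56-63.
constexpr int kFloatBase     = kGprCount;
constexpr int kPredicateBase = kFloatBase + kFloatCount;

// Single-bit mask for the index-th allocatable register of a class.
inline uint64_t packedRegMask(RegClass cls, int index)
{
    switch (cls)
    {
        case RegClass::Gpr:
            assert(index >= 0 && index < kGprCount);
            return uint64_t(1) << index;
        case RegClass::Float:
            assert(index >= 0 && index < kFloatCount);
            return uint64_t(1) << (kFloatBase + index);
        case RegClass::Predicate:
            assert(index >= 0 && index < kPredicateCount);
            return uint64_t(1) << (kPredicateBase + index);
    }
    assert(false && "unknown register class");
    return 0;
}
```

Under this layout a register set stays a single 64-bit value, at the price of 8 FP and 8 predicate registers that the allocator can never hand out.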

a74nh · 2024-04-18T09:06:47Z

What if we restrict arm64 for the time being to only be able to allocate 24 FP registers? Then we can fit 32 GPR + 24 FP + 8 Mask into 64 bits, presumably with fairly minimal TP impact. Likely we have no cases where we really need more than 24 FP regs, so there won't be much CQ impact either.

I suspect it'll also be rare to require more than 8 mask registers.

There is a large group of instructions (at least 200, see use of isLowPredicateRegister()) where only predicates p0 to p7 are allowed. There are 10 instructions where only predicates p8 to p15 are allowed (see use of isHighPredicateRegister()). We don't have any APIs which can directly access the high predicate instructions and I doubt we'll need to generate them indirectly. If we did need to later, then we could offset the 8 values we get from the allocator by 2, giving predicates p2 to p9?

We might want to reserve two mask registers for all zero and all ones as that is very common usage. Due to the low predicate instructions, these would have to come from the low predicates registers. Maybe this is a good use for predicates p0 and p1. There would still need to be a mechanism to keep track of whether these had been set for the current function, but it can be separate from the standard register mask?

Thinking wider, how many other registers are fixed? LR, FP and SP won't ever be directly allocated? Are there any others always in use (thread local storage etc?). If so these don't need to be in the register mask either, freeing up more space?

tannergooding · 2024-04-18T15:46:22Z

I expect we're going to end up paying more in terms of actual execution cost by trying to play funny tricks with bitpacking (i.e. using only 29 fp registers) than we would be just using the extra space.

I expect there is going to be a short term higher cost to getting the support in here, no matter what route we take, and we're ultimately going to need the same work done for the x64 APX feature. While many functions may not need the full register set, there are many instructions which have special allocation requirements (like 4 sequential registers or having to start at a register divisible by n) where having a smaller set impacts codegen. We also can win a lot of this back longer term. We will have more opportunities for cleanup, refactorings, and simplification to get more out of it.

I'd also like to call out again that it's very hard to gauge actual cost by TP numbers alone. SPMI doesn't factor in the overhead of the VM calls/token resolution, it doesn't really factor in the difference between debug vs release costs, it doesn't factor in that methods that are substantially slower may be infrequently compiled (excluding crossgen), it doesn't factor in that instructions can be pipelined, fused, or that some single instructions (division or multiplication) can have the cost of dozens of other instructions.

We've seen this nuance in the actual perf numbers we track and triage weekly, such as https://pvscmdupload.blob.core.windows.net/reports/allTestHistory%2frefs%2fheads%2fmain_x64_Windows%2010.0.18362_RunKind%3dcrossgen_scenarios%2fCrossgen2%20Throughput%20-%20Single%20-%20System.Private.CoreLib.html, where the single biggest jump in that was the enabling of DPGO and the steady increase since is due to the organic growth of S.P.Corelib otherwise. The increase in time for changes like this (even being "10% throughput impact") are incredibly minimal in comparison to simply adding new APIs to the BCL or broadly enabling new optimizations.

I think we should worry more about pushing this in the right direction and ensuring the code is understandable/maintainable, and then longer term get it optimized around what we land on from that perspective.

kunalspathak · 2024-04-21T17:39:46Z

Can't agree more on what @tannergooding says.

I expect we're going to end up paying more in terms of actual execution cost by trying to play funny tricks with bitpacking (i.e. using only 29 fp registers) than we would be just using the extra space.

Yes, and we might introduce bugs in longer run and would have to add more workarounds to deal with them.

I expect there is going to be a short term higher cost to getting the support in here, no matter what route we take, and we're ultimately going to need the same work done for the x64 APX feature.

Yes, this is precisely my thinking that I expressed somewhere above. The entire code base had an assumption that we will not surpass more than 64 registers and now, we do. So, we need to include that functionality in various places. It is similar to how arm32 has higher TP cost just because of special handling that is needed to handle even-odd pair of registers, that is not present in other platforms.

no matter what route we take

And there were several routes (4~5 prototypes) that were explored in the last couple of months just in pursuit of bringing the TP numbers (reported by superpmi) down, after which I had to settle down on current solution, which is simpler and more importantly maintainable.

While many functions may not need the full register set, there are many instructions which have special allocation requirements (like 4 sequential registers or having to start at a register divisible by n) where having a smaller set impacts codegen.

Back when I added consecutive registers support in #80297, I had to disable some special stress register modes just because that wouldn't satisfy the register requirements for the method, given that they needed consecutive registers at multiple places within the same method.

We also can win a lot of this back longer term. We will have more opportunities for cleanup, refactorings, and simplification to get more out of it.

For sure. I have a work item in mind to reduce the size of float/vector registers field from 8 bytes to 4 bytes. With that, all the register masks will be reduced to 4 bytes, which will reduce the size of common data structures like GenTree, RefPosition, Interval by 4 bytes each.

I'd also like to call out again that it's very hard to gauge actual cost by TP numbers alone. SPMI doesn't factor in the overhead of the VM calls/token resolution, it doesn't really factor in the difference between debug vs release costs, it doesn't factor in that methods that are substantially slower may be infrequently compiled (excluding crossgen), it doesn't factor in that instructions can be pipelined, fused, or that some single instructions (division or multiplication) can have the cost of dozens of other instructions.

The numbers I collected in #99658 (comment) proves that there was no impact seen on crossgen2 throughput. Even my previous TP improvements done in #96386, #85144, #87424 and #85842 that combined improved TP numbers by around 15%, none of that showed up in https://pvscmdupload.blob.core.windows.net/reports/allTestHistory%2frefs%2fheads%2fmain_x64_Windows%2010.0.18362_RunKind%3dcrossgen_scenarios%2fCrossgen2%20Throughput%20-%20Single%20-%20System.Private.CoreLib.html.

We've seen this nuance in the actual perf numbers we track and triage weekly, such as https://pvscmdupload.blob.core.windows.net/reports/allTestHistory%2frefs%2fheads%2fmain_x64_Windows%2010.0.18362_RunKind%3dcrossgen_scenarios%2fCrossgen2%20Throughput%20-%20Single%20-%20System.Private.CoreLib.html, where the single biggest jump in that was the enabling of DPGO and the steady increase since is due to the organic growth of S.P.Corelib otherwise. The increase in time for changes like this (even being "10% throughput impact") are incredibly minimal in comparison to simply adding new APIs to the BCL or broadly enabling new optimizations.

I think we should worry more about pushing this in the right direction and ensuring the code is understandable/maintainable, and then longer term get it optimized around what we land on from that perspective.

Yes, I would like to get more feedback on "understandable/maintainable" and "code readability" part.

What if we restrict arm64 for the time being to only be able to allocate 24 FP registers? Then we can fit 32 GPR + 24 FP + 8 Mask into 64 bits, presumably with fairly minimal TP impact. Likely we have no cases where we really need more than 24 FP regs, so there won't be much CQ impact either.

@AndyAyersMS - I assume you are talking about having these restrictions on methods that need mask registers, but continue to have 32 fp registers otherwise. We will still not know about it until we get past importer, but by then, we populate most of the register masks like callee-save and callee-trash masks, etc. and will need to reset.

jakobbotsch · 2024-04-21T18:33:33Z

The idea would be to just treat some FP registers to not exist universally, so the change would be simple. From my side it was merely a suggestion on how to unblock work if we were unhappy about taking the TP regressions. The wall clock measurements above clearly show MinOpts impact in clrjit.dll (what it translates to on actual startup scenarios is another question).

I had to settle down on current solution, which is simpler and more importantly maintainable.

IMO the current solution seems complex and less maintainable. It adds multiple thousand lines of code and makes it possible to silently get register set operations wrong (like union between two regMaskOnlyOne representing different register types). There's a bunch of different set types you have to decide between when to use and how to convert between.

It's still surprising to me that just bumping regMaskTP to a 12 or 16 byte struct had such large measured throughput impact throughout the JIT. I wonder if there is a simple explanation for some of the cost or if it truly boils down to the more expensive bit set operations throughout the JIT.
One thing we've seen previously is that expanding the size of some types can have disproportionately large impact in number of instructions executed because multiplying by the size of the type can start using different patterns of instructions. These are the kind of costs that we should definitely feel free to ignore.

I have a work item in mind to reduce the size of float/vector registers field from 8 bytes to 4 bytes. With that, all the register masks will be reduced to 4 bytes, which will reduce the size of common data structures like GenTree, RefPosition, Interval by 4 bytes each.

I agree it would be great to have these optimizations.

The numbers I collected in #99658 (comment) show that there was no impact on crossgen2 throughput. Even my previous TP improvements in #96386, #85144, #87424 and #85842, which together improved TP numbers by around 15%, did not show up in https://pvscmdupload.blob.core.windows.net/reports/allTestHistory%2frefs%2fheads%2fmain_x64_Windows%2010.0.18362_RunKind%3dcrossgen_scenarios%2fCrossgen2%20Throughput%20-%20Single%20-%20System.Private.CoreLib.html.

Does this measure a significant number of MinOpts compilations? I assume crossgen2 is compiling everything in FullOpts, and we know that crossgen2 itself interacts poorly with tiered compilation (#83112), so it might just be dominated by other costs than jitting.

jakobbotsch · 2024-04-21T20:13:55Z

I tried your PR at #96196 and see the following for benchmarks.run_pgo. This is arm64 cross compiled from x64 where I use the intrinsic for BitOperations::PopCount (which reduced the SPMI tp impact from +5.6% to +4.07% initially -- I believe we should be able to use an intrinsic for popcount on arm64 host?)

Base: 141063333131, Diff: 146798692822, +4.0658%

1339309854 : +33.69%     : 20.61% : +0.9494% : public: unsigned __int64 __cdecl LinearScan::RegisterSelection::select<0>(class Interval *, class RefPosition *)
790662416  : +12966.72%  : 12.16% : +0.5605% : public: void __cdecl Interval::mergeRegisterPreferences(unsigned __int64)
668963185  : +22.84%     : 10.29% : +0.4742% : public: void __cdecl LinearScan::allocateRegisters<0>(void)
534637331  : +26.96%     : 8.23%  : +0.3790% : private: void __cdecl LinearScan::processBlockStartLocations(struct BasicBlock *)
358948739  : +31.97%     : 5.52%  : +0.2545% : public: void __cdecl LinearScan::allocateRegistersMinimal(void)
318294318  : +25.55%     : 4.90%  : +0.2256% : private: class RefPosition * __cdecl LinearScan::newRefPosition(class Interval *, unsigned int, enum RefType, struct GenTree *, unsigned __int64, unsigned int)
256068471  : +39.17%     : 3.94%  : +0.1815% : protected: enum _regNumber_enum __cdecl CodeGen::genConsumeReg(struct GenTree *)
230314932  : +21.92%     : 3.54%  : +0.1633% : private: void __cdecl LinearScan::associateRefPosWithInterval(class RefPosition *)
153840898  : +34.77%     : 2.37%  : +0.1091% : private: void __cdecl LinearScan::addRefsForPhysRegMask(unsigned __int64, unsigned int, enum RefType, bool)
153099855  : +40.08%     : 2.36%  : +0.1085% : private: void __cdecl LinearScan::freeRegisters(unsigned __int64)
124799070  : +122.19%    : 1.92%  : +0.0885% : public: void __cdecl GCInfo::gcMarkRegPtrVal(enum _regNumber_enum, enum var_types)
104909768  : +10.68%     : 1.61%  : +0.0744% : protected: void __cdecl CodeGen::genCodeForBBlist(void)
85800424   : NA          : 1.32%  : +0.0608% : public: void __cdecl emitter::emitUpdateLiveGCregs(enum GCtype, unsigned __int64, unsigned char *)
72840134   : +8.71%      : 1.12%  : +0.0516% : private: int __cdecl LinearScan::BuildNode(struct GenTree *)

So the vast majority of the TP impact is coming from a small number of functions within LSRA. I wonder if it would be worth trying to restrict the introduction of the segregated register sets to LSRA only, so that the rest of the JIT doesn't need to learn about the differences.
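On the popcount point earlier in this comment, a minimal sketch of the general pattern is shown below: use a compiler builtin where one is known to exist and fall back to a portable loop otherwise. This is not the JIT's BitOperations::PopCount; the function names are hypothetical, and only the GCC/Clang builtin is used here.

```cpp
#include <cstdint>

// Portable fallback: clears the lowest set bit once per iteration (Kernighan's method).
inline unsigned popCountFallback(uint64_t value)
{
    unsigned count = 0;
    while (value != 0)
    {
        value &= value - 1;
        count++;
    }
    return count;
}

inline unsigned popCount(uint64_t value)
{
#if defined(__GNUC__) || defined(__clang__)
    // GCC/Clang builtin; typically lowered to a hardware popcount sequence when the target has one.
    return static_cast<unsigned>(__builtin_popcountll(value));
#else
    return popCountFallback(value);
#endif
}
```

The fallback costs one loop iteration per set bit, so masks with many bits set make the difference between the two paths most visible.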

kunalspathak · 2024-04-26T18:24:23Z

Startup impact

I took some measurements on TE benchmarks, covering startup and first request time. I see barely a ~1.5% regression.
Edit: Added results from Orchard benchmarks that JITs around 34,125 methods

Benchmarks	Avg. First request (Base)	Avg. First request (Diff)	% diff	# of Tier 0 methods
Fortune	324.1	325.9	0.56%	6899
Json-minimal	193	195.8	1.45%	3353
Json	204.5	208.3	1.86%	5339
Orchard	5220	5293	1.40%	34,125

Benchmarks	Avg. Startup time (Base)	Avg. Startup time (Diff)	% diff
Fortune	313	315.3	0.73%
Json-minimal	248.2	249.8	0.64%
Json	347.9	350.4	0.72%
Orchard	502	510	1.59%

10 iterations data

Fortunes

First Request Base	First Request Diff	Startup Base	Startup Diff
332	341	309	312
321	348	315	311
348	328	322	311
323	320	313	322
323	321	310	317
326	320	318	316
319	322	313	310
325	314	308	328
313	326	306	312
311	319	316	314

Json-Minimal

First Request Base	First Request Diff	Startup Base	Startup Diff
189	197	249	249
192	196	244	256
196	199	256	248
191	192	244	248
193	195	248	258
194	196	248	244
193	193	247	250
194	196	256	251
194	198	244	244
194	196	246	250

Json Mvc

First Request Base	First Request Diff	Startup Base	Startup Diff
209	210	354	353
211	212	355	350
210	200	340	347
204	215	346	352
193	205	348	351
199	206	352	348
204	207	342	341
206	205	334	344
201	213	349	361
208	210	359	357

Orchard

First Request Base	First Request Diff	Startup Base	Startup Diff
5,130	5,295	514	523
5,103	5,389	513	515
5,389	5,303	506	501
5,152	5,256	505	510
5,250	5,266	513	519
5,314	5,261	495	502
5,148	5,201	483	505
5,224	5,213	500	506
5,368	5,353	493	504
5,119	5,388	497	513

Crossgen2 throughput impact

Crossgen2 Throughput data: #99658 (comment)

TP impact

Now, let's take a look at the TP impact reported by superpmi. Looking at the TP difference for the benchmarks.run_tiered collection for windows/arm64, I see a 6.47% regression for MinOpts, which contains around 37,089 method contexts.

Overall (+4.25%)

Collection	PDIFF
benchmarks.run_tiered.windows.arm64.checked.mch	+4.25%

MinOpts (+6.47%)

Collection	PDIFF
benchmarks.run_tiered.windows.arm64.checked.mch	+6.47%

FullOpts (+2.57%)

Collection	PDIFF
benchmarks.run_tiered.windows.arm64.checked.mch	+2.57%

?allocateRegistersMinimal@LinearScan@@QEAAXXZ                                                                                       : 340392599  : +45.58%  : 25.24% : +2.9085%
?addRefsForPhysRegMask@LinearScan@@AEAAXAEBU_regMaskAll@@IW4RefType@@_N@Z                                                           : 184181130  : NA       : 13.66% : +1.5737%
?freeRegisters@LinearScan@@AEAAXU_regMaskAll@@@Z                                                                                    : 86201120   : NA       : 6.39%  : +0.7365%
?allocateRegMinimal@LinearScan@@AEAA?AW4_regNumber_enum@@PEAVInterval@@PEAVRefPosition@@@Z                                          : 81623577   : +13.48%  : 6.05%  : +0.6974%
?freeRegister@LinearScan@@AEAAXPEAVRegRecord@@@Z                                                                                    : 70217892   : NA       : 5.21%  : +0.6000%
?gtGetGprRegMask@GenTree@@QEBA_KXZ                                                                                                  : 45216669   : NA       : 3.35%  : +0.3863%
?writeRegisters@LinearScan@@QEAAXPEAVRefPosition@@PEAUGenTree@@@Z                                                                   : 30848120   : NA       : 2.29%  : +0.2636%
?updateAssignedInterval@LinearScan@@AEAAXPEAVRegRecord@@PEAVInterval@@@Z                                                            : 28280466   : +40.08%  : 2.10%  : +0.2416%
?newRefPosition@LinearScan@@AEAAPEAVRefPosition@@PEAVInterval@@IW4RefType@@PEAUGenTree@@_KI@Z                                       : 23259358   : +8.59%   : 1.72%  : +0.1987%
?unassignPhysReg@LinearScan@@AEAAXPEAVRegRecord@@PEAVRefPosition@@@Z                                                                : 17337530   : +21.69%  : 1.29%  : +0.1481%
?assignPhysReg@LinearScan@@AEAAXPEAVRegRecord@@PEAVInterval@@@Z                                                                     : 16785216   : +27.27%  : 1.24%  : +0.1434%
?buildKillPositionsForNode@LinearScan@@AEAA_NPEAUGenTree@@IAEBU_regMaskAll@@@Z                                                      : 11895844   : NA       : 0.88%  : +0.1016%
??0LinearScan@@QEAA@PEAVCompiler@@@Z                                                                                                : 11645946   : +50.81%  : 0.86%  : +0.0995%
?PopCount@BitOperations@@SAI_K@Z                                                                                                    : 10508274   : +134.96% : 0.78%  : +0.0898%
?gcMarkRegPtrVal@GCInfo@@QEAAXW4_regNumber_enum@@W4var_types@@@Z                                                                    : 9900926    : +40.80%  : 0.73%  : +0.0846%
?genConsumeReg@CodeGen@@IEAA?AW4_regNumber_enum@@PEAUGenTree@@@Z                                                                    : 8915253    : +7.52%   : 0.66%  : +0.0762%
?BuildNode@LinearScan@@AEAAHPEAUGenTree@@@Z                                                                                         : 8813598    : +5.66%   : 0.65%  : +0.0753%
?newRefPositionRaw@LinearScan@@AEAAPEAVRefPosition@@IPEAUGenTree@@W4RefType@@@Z                                                     : 7806564    : +1.89%   : 0.58%  : +0.0667%
?buildPhysRegRecords@LinearScan@@AEAAXXZ                                                                                            : 7529067    : +16.06%  : 0.56%  : +0.0643%
?BuildCall@LinearScan@@AEAAHPEAUGenTreeCall@@@Z                                                                                     : 5614378    : +14.60%  : 0.42%  : +0.0480%
?genCodeForTreeNode@CodeGen@@IEAAXPEAUGenTree@@@Z                                                                                   : 5226233    : +3.53%   : 0.39%  : +0.0447%
?ins_Copy@CodeGen@@QEAA?AW4instruction@@W4_regNumber_enum@@W4var_types@@@Z                                                          : 4644888    : NA       : 0.34%  : +0.0397%
?genProduceReg@CodeGen@@IEAAXPEAUGenTree@@@Z                                                                                        : 3660030    : +3.24%   : 0.27%  : +0.0313%
?allocateMemory@ArenaAllocator@@QEAAPEAX_K@Z                                                                                        : 3084940    : +0.89%   : 0.23%  : +0.0264%
?genSetRegToConst@CodeGen@@IEAAXW4_regNumber_enum@@W4var_types@@PEAUGenTree@@@Z                                                     : 3041553    : +30.06%  : 0.23%  : +0.0260%
?associateRefPosWithInterval@LinearScan@@AEAAXPEAVRefPosition@@@Z                                                                   : 2856060    : +1.26%   : 0.21%  : +0.0244%
?instGen_Set_Reg_To_Imm@CodeGen@@QEAAXW4emitAttr@@W4_regNumber_enum@@_JW4insFlags@@@Z                                               : 2522843    : +6.25%   : 0.19%  : +0.0216%
?genRestoreCalleeSavedRegistersHelp@CodeGen@@IEAAXAEBU_regMaskAll@@HH@Z                                                             : 2100312    : NA       : 0.16%  : +0.0179%
?emitOutputInstr@emitter@@IEAA_KPEAUinsGroup@@PEAUinstrDesc@1@PEAPEAE@Z                                                             : 1693293    : +0.44%   : 0.13%  : +0.0145%
?genSaveCalleeSavedRegistersHelp@CodeGen@@IEAAXAEBU_regMaskAll@@HH@Z                                                                : 1689330    : NA       : 0.13%  : +0.0144%
?compCompileHelper@Compiler@@QEAAHPEAUCORINFO_MODULE_STRUCT_@@PEAVICorJitInfo@@PEAUCORINFO_METHOD_INFO@@PEAPEAXPEAIPEAVJitFlags@@@Z : 1594827    : +17.09%  : 0.12%  : +0.0136%
?HasMultiRegRetVal@GenTreeCall@@QEBA_NXZ                                                                                            : -1430094   : -12.99%  : 0.11%  : -0.0122%
?BuildDefsWithKills@LinearScan@@AEAAXPEAUGenTree@@H_K1@Z                                                                            : -1555166   : -100.00% : 0.12%  : -0.0133%
?inst_Mov@CodeGen@@QEAAXW4var_types@@W4_regNumber_enum@@1_NW4emitAttr@@W4insFlags@@@Z                                               : -1927004   : -13.42%  : 0.14%  : -0.0165%
?UpdateLifeVar@?$TreeLifeUpdater@$00@@AEAAXPEAUGenTree@@PEAUGenTreeLclVarCommon@@@Z                                                 : -2246490   : -8.57%   : 0.17%  : -0.0192%
?resetAllRegistersState@LinearScan@@AEAAXXZ                                                                                         : -3366369   : -6.09%   : 0.25%  : -0.0288%
?BuildUse@LinearScan@@AEAAPEAVRefPosition@@PEAUGenTree@@_KH@Z                                                                       : -5405424   : -3.81%   : 0.40%  : -0.0462%
?updateMaxSpill@LinearScan@@QEAAXPEAVRefPosition@@@Z                                                                                : -7422214   : -9.99%   : 0.55%  : -0.0634%
??$resolveRegisters@$0A@@LinearScan@@QEAAXXZ                                                                                        : -10955233  : -4.12%   : 0.81%  : -0.0936%
?buildKillPositionsForNode@LinearScan@@AEAA_NPEAUGenTree@@I_K@Z                                                                     : -11299476  : -100.00% : 0.84%  : -0.0965%
?BuildDefs@LinearScan@@AEAAXPEAUGenTree@@H_K@Z                                                                                      : -12740041  : -100.00% : 0.94%  : -0.1089%
?gtGetRegMask@GenTree@@QEBA_KXZ                                                                                                     : -30758149  : -100.00% : 2.28%  : -0.2628%
?freeRegisters@LinearScan@@AEAAX_K@Z                                                                                                : -94590308  : -100.00% : 7.02%  : -0.8082%
?addRefsForPhysRegMask@LinearScan@@AEAAX_KIW4RefType@@_N@Z                                                                          : -104673921 : -100.00% : 7.76%  : -0.8944%

Most of the regression is coming from allocateRegistersMinimal (which now mostly operates on AllRegsMask instead of regMaskTP) and addRefsForPhysRegMask (which now iterates over all the register bits set in an AllRegsMask to create a RefPosition for each). I will take a deeper look at what can be optimized here.
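For readers unfamiliar with the pattern, the sketch below shows one way to visit every set bit in a mask wider than 64 bits. WideRegMask and the helper names are hypothetical stand-ins, not the PR's AllRegsMask or the actual addRefsForPhysRegMask code.

```cpp
#include <cstdint>

// Hypothetical stand-in for a mask wider than 64 bits.
struct WideRegMask
{
    uint64_t low;  // registers 0..63 (gpr + float)
    uint32_t high; // registers 64..95 (predicate)
};

// Index of the lowest set bit; `bits` must be non-zero.
inline unsigned lowestBitIndex(uint64_t bits)
{
    unsigned index = 0;
    while ((bits & 1) == 0)
    {
        bits >>= 1;
        index++;
    }
    return index;
}

// Calls visit(regNum) once per set bit, in ascending register order.
template <typename Visitor>
inline void forEachRegister(WideRegMask mask, Visitor visit)
{
    // `bits &= bits - 1` clears the lowest set bit after it has been visited.
    for (uint64_t bits = mask.low; bits != 0; bits &= bits - 1)
    {
        visit(lowestBitIndex(bits));
    }
    for (uint64_t bits = mask.high; bits != 0; bits &= bits - 1)
    {
        visit(64 + lowestBitIndex(bits));
    }
}
```

Compared with a single 64-bit mask, the second loop adds at least one extra test per call even when no predicate register is set, which is the kind of overhead that shows up in hot paths.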

kunalspathak · 2024-04-29T06:15:15Z

Added a Review guide in the PR description.

kunalspathak · 2024-06-14T20:36:14Z

Replaced by the following PRs:

Other misc. PRs:

kunalspathak added 30 commits February 6, 2024 16:11

Introduce regMaskGpr, regMaskFloat, regMaskPredicate

dcb4b16

Add IsGpr() and IsFloat()

87b0f97

Renamed to regMaskOnlyOne and regMaskAny

ba62855

convert from #define to typedef for better intellisense support

375bf8e

Make IsGprRegMask() and IsFloatRegMask()

da71584

Update codegen.h and related code

0ba172e

Update codegenarm.cpp and related code

8639bb7

Update codegenarm64.cpp and related code

3c5bdaf

Update codegenarmarch.cpp and related code

bad9e8b

Partial update codegencommon.cpp

8384ff2

Update all gcRef and gcByRef to regMaskGpr

c78dc4b

Update codegencommon.cpp and related code

e05cece

Update codegeninterface.h and related code

82ed9aa

Update codegenxarch.cpp and related code

91462c7

Update compiler.cpp and related code

b9d1ed5

Update compiler.h and related code

83f0f01

Update compiler.hpp and related code

c3797e0

Update emit.cpp and related code

147891c

Update emit.h and related code

2238e41

Update emitxarch.cpp and related code

e5e71d1

Update gentree.cpp and related code

7e15435

Update gentree.h and related code

6bbc2d2

Update lclvars.cpp and related code

3e3d51a

Update morph.cpp and related code

41959f2

Update regalloc.cpp and related code

4b72781

Update registerargconvention.h and related code

7214d7b

Update regset.cpp and related code

c7e913e

Update regset.h and related code

d417ad5

Update target.h and related code

dc54b20

Update targetamd64.h and related code

5b8a1e6

kunalspathak added 2 commits April 10, 2024 16:20

fix linux-x86 build

22077eb

fix linux-x64 Native_GCC build error

893f6be

kunalspathak marked this pull request as ready for review April 10, 2024 23:36

kunalspathak changed the title ~~Predicate registers~~ Handle > 64 registers + predicate registers for Arm64 Apr 10, 2024

kunalspathak removed the NO-MERGE (The PR is not ready for merge yet; see discussion for detailed reasons) and NO-REVIEW (Experimental/testing PR, do NOT review it) labels Apr 12, 2024

kunalspathak requested review from AndyAyersMS, tannergooding and BruceForstall April 12, 2024 18:38

jakobbotsch reviewed Apr 12, 2024


jakobbotsch self-requested a review April 15, 2024 13:01

jakobbotsch reviewed Apr 15, 2024


jakobbotsch mentioned this pull request Apr 27, 2024

JIT: Move internal reserved registers to a side table #101647

Merged

kunalspathak mentioned this pull request May 6, 2024

Handle more than 64 registers - Part 1 #101950

Merged

kunalspathak mentioned this pull request May 20, 2024

LSRA: Refactor some of the methods around creating kill set #102469

Merged

kunalspathak closed this Jun 14, 2024

github-actions bot locked and limited conversation to collaborators Jul 16, 2024


#if defined(TARGET_AMD64)
    regMaskTP rbmAllFloat;
    regMaskTP rbmFltCalleeTrash;

    FORCEINLINE regMaskTP get_RBM_ALLFLOAT() const
    {
        return this->rbmAllFloat;
    }
    FORCEINLINE regMaskTP get_RBM_FLT_CALLEE_TRASH() const
    {
        return this->rbmFltCalleeTrash;
    }
#endif // TARGET_AMD64

#if defined(TARGET_XARCH)
    regMaskTP rbmAllMask;
    regMaskTP rbmMskCalleeTrash;

    // Call this function after the equivalent fields in Compiler have been initialized.
    void CopyRegisterInfo();

    FORCEINLINE regMaskTP get_RBM_ALLMASK() const
    {
        return this->rbmAllMask;
    }
    FORCEINLINE regMaskTP get_RBM_MSK_CALLEE_TRASH() const
    {
        return this->rbmMskCalleeTrash;
    }
#endif // TARGET_XARCH
