
LSRA: Add support to track more than 64 registers #99658

Closed
kunalspathak opened this issue Mar 13, 2024 · 14 comments · Fixed by #103387
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), Priority:2 (Work that is important, but not critical for the release)
Milestone: 9.0.0

kunalspathak commented Mar 13, 2024

Currently, LSRA supports the following number of registers for the various platforms:

| Architecture | # of General purpose | # of Float/Vector | # of Predicate/Mask | Total registers |
| ------------ | -------------------- | ----------------- | ------------------- | --------------- |
| x86          | 8                    | 8                 | 8                   | 24              |
| x64          | 16                   | 32                | 8                   | 56 **           |
| arm          | 16                   | 32                | 0                   | 48              |
| arm64        | 32                   | 32                | 16 *                | 80              |

*16 new predicate registers for SVE
** There are 16 new GPR registers that will be added with APX, which will bring the total for x64 to 72

Until now, the number of registers on all platforms was at most 64, so we used the regMaskTP data structure (a typedef for unsigned __int64) to represent and pass them around. Throughout the RyuJIT codebase, whenever we have to pass, return, or track a pool of registers, we use regMaskTP. Since it is 64 bits long, each bit represents a register: lower bits represent GPR registers, while higher bits represent float/vector registers. However, with #93095 adding SVE support for arm64, we need to add 16 predicate/mask registers, raising the number of registers to track from 64 to 80. They no longer fit in regMaskTP, so we need an alternate representation that can track not only the 16 new predicate registers needed for the SVE/arm64 work, but also the 16 new GPR registers that will be added to support the APX/x64 work.

Here are a few options to solve this problem:

1. Convert regMaskTP to struct

The first option that we tried in #96196 was to just convert the regMaskTP to a struct that looks something like this:

typedef struct __regMaskTP
{
  unsigned __int64 low;
  unsigned __int64 high;
} regMaskTP;

To avoid refactoring all the code paths that use regMaskTP, we overloaded various operators for this struct. We found out that it regressed the throughput by around 10% for MinOpts and 7% for FullOpts as seen here.

2. Convert regMaskTP to intrinsic vector128

The next option, explored in #94589, was to use unsigned __int128 (only supported by clang on Linux), which under the hood uses a 128-bit vector. Our assumption was that the compiler could optimize the access pattern of regMaskTP_128 and we would see a smaller TP impact than with option 1. However, when we cross-compiled this change on linux/x64, we started seeing a lot of seg faults in places where regMaskTP was initialized. The problem, as mentioned here, was that __int128 assumes the regMaskTP field is 16-byte aligned and generates a seg fault whenever that is not the case. So we had to give up on this option.

3. Segregate the gpr/float/predicate registers usage

This is the WIP I am currently working on in #98258. I created regMaskGpr, regMaskFloat and regMaskPredicate, then went through all the places in the code base and used the relevant types. For places where any register pool is accessed, I created a struct AllRegMask that looks like this:

struct AllRegMask
{
  regMaskGpr       gprRegs;
  regMaskFloat     floatRegs;
  regMaskPredicate predicateRegs;
};

In the code, I then pass AllRegMask around, and whenever we have to update (add/remove/track) a register in the mask, I added a check to see whether the register in question is GPR/float/predicate and update the relevant field accordingly. Currently, the TP impact of this is around a 6% regression in MinOpts and 2% in FullOpts. That is much better than option 1, but still not acceptable.

4. Track predicate registers separately

Another option is to track predicate registers separately and pass them around; there are not many places where we need to track them. The GPR/float registers would continue to be represented as the 64-bit regMaskTP, and predicate registers would be tracked separately on platforms that have more than 64 registers (SVE/arm64 and, in the future, APX/x64). The downside is that if the number of GPR+float registers ever goes beyond 64, we will have to fall back to option 3. The other drawback of this approach is that many places, most relevantly GenTree*, RefPosition and Interval, have regMaskTP as a field. Adding another field for predicate registers would consume more memory in these data structures, so a union plus a bit indicating whether the regMaskTP holds gpr+float or predicate might do the trick.

@kunalspathak self-assigned this Mar 13, 2024
@kunalspathak (Member, Author):

@dotnet/arm64-contrib

@BruceForstall (Member):

@dotnet/jit-contrib @dotnet/avx512-contrib

@tannergooding (Member) commented Mar 13, 2024:

3 has always been my preferred approach here.

I expect, especially long term, it will result in the best performance and the most extensibility for new features/architectures. It does, however, require the most up front work.

I would expect that most places don't actually need AllRegMask() as most instructions only access 1 register set. Those that can access multiple tend to be well known (like movd which explicitly moves data between the SIMD and GPR sets), but even then it's not a case that needs more than 1x 64-bit mask per node, as just like with how its handled in emit (and the underlying insDesc) we intrinsically know the register set (GPR or SIMD) based on the gtType of the node.

I would expect the relatively few places that LSRA needs to look at all register sets would be better handled by passing such an AllRegMask struct byref (as AllRegMask&) or better, seeing if a minimal refactoring can be done to iterate through each mask independently since they rarely ever need to be processed in coordination with eachother.

@BruceForstall (Member):

I'm in favor of (4). There is no known or expected platform where we need more than 64 GPR + float/SIMD registers. On our 64-bit platforms (the ones we are focusing on), this means the most important case of GPR/float all fits in one machine register. It's also useful that GPR and float register mask values are non-overlapping, to avoid cases of confusion (and bugs) where "0x1" could be a mask for either r0 or v0, say. Mask/predicate registers I expect will continue to be rare, meaning checking for a mask register set equal to zero would be a fast-path exit.

@JulieLeeMSFT JulieLeeMSFT added this to the 9.0.0 milestone Mar 18, 2024
@kunalspathak kunalspathak removed the untriaged New issue has not been triaged by the area owner label Mar 20, 2024
@kunalspathak (Member, Author) commented Mar 21, 2024:

I just want to give an update on these efforts and the things I have explored so far with regard to option 3.

Branch-free access

I converted the fields of AllRegsMask to an array, as seen below, and then read/write the field at an index based on whether the reg or type represents GPR, float or predicate registers.

struct AllRegsMask
{
   uint32_t registers[3];
};

This reduced the TP regression by around 3%, as seen in #98258 (comment). However, there is still around a 3.5% TP regression on MinOpts for arm64, and my goal is to bring it down to 0, or significantly lower, for arm64. The reason is that arm64 does not (yet) have the predicate registers, so this work should not impact it. There will be some TP regression on x64, but improving that comes only after I improve the TP for arm64, so all my discussion below is tailored around arm64 for the moment.


Convert regMaskFloat from 8-bytes to 4-bytes

One of the ambitions I had was to make all the regMaskTP types, i.e. the new regMaskGpr and regMaskFloat, 4 bytes long instead of 8, because each category has at most 32 registers and so should be representable in just 4 bytes. Getting it down to 4 bytes would improve memory usage significantly, because this data structure is a field in other common data structures like GenTree, RefPosition and Interval.

Today, regMaskTP works like this: the GPRs are numbered 0 (REG_R0) through 31 (REG_R31) and map to mask values 0x1 through 0x80000000 respectively. After that, the float registers are numbered 32 (REG_V0) through 63 (REG_V31) and map to mask values 0x100000000 through 0x8000000000000000. We can shrink regMaskGpr to 4 bytes without any problems, because the GPR masks fit in the 4-byte space. However, the only way float registers can fit in a 4-byte space is to re-map their masks such that REG_V0 (numeric value 32) maps to mask 0x1, REG_V1 to 0x2, and so forth until REG_V31 maps to 0x80000000.

With that in place, code that operates on a mask has no idea whether it is operating on gpr or float registers under the hood. For example, genRegMask(), which has around 281 references, would start returning the value 0x100000 for both REG_R20 (gpr) and REG_V20 (float), and it would be hard to differentiate between them without knowing the actual type being operated on, which means extra checks, which means TP regression. Additionally, checks such as mask & REG_INT_CALLEE_TRASH would return true even if the underlying mask was meant for float registers. We would also have to shift the float register mask << 32 once in a while and back >> 32, which would just add complexity and be hard to future-proof for other developers. So for now, I will hold off on this effort.
I think I can still get regMaskGpr down to 4 bytes, so at the 166 or so places where we operate on just gpr, we will pass around 4 bytes instead of 8. However, regMaskOnlyOne will continue to be 8 bytes.


AllRegMask to work on single entity

So, assuming that the float registers' mask will still be 8 bytes long and present only in the high 32 bits, another prototype I am working on is to represent gpr and float registers as a single entity in AllRegMask instead of 2 separate fields or an array of 2 entries. I can possibly make a union of those, so that I can access it in whichever way is easiest to understand in the code base.

struct AllRegsMask
{
    union 
    {
        struct
        {
            uint64_t combined;
#ifdef HAS_PREDICATE_REG
            uint32_t predicate;
#endif
        };
#ifdef HAS_PREDICATE_REG
        uint32_t registers[3];
#else
        uint32_t registers[2];
#endif
        struct
        {
            uint32_t gpr;
            uint32_t fpr;
#ifdef HAS_PREDICATE_REG
            uint32_t pr;
#endif
        };
    };
};

And of course, I have not yet started passing AllRegsMask by-ref, which should further bring down the TP regression. I could also derive a struct AllRegsMaskWithPredicate that adds the field for predicate registers, while keeping AllRegsMask just for gpr/float. All the places that are guaranteed not to use predicate registers could continue using AllRegsMask.


Regarding option 4

Tracking predicate registers separately is not easy, especially because regMaskTP is currently used in basic data structures like GenTree, Interval, etc. We would still need a way to differentiate whether the underlying mask is predicate or gpr/float. The number of function arguments where predicate registers need to be passed would double, once for gpr/float and then for predicate, with endless problems to iron out. However, I am trying to blend this option into the AllRegsMask described above, so that we track predicates separately only at the places where they are used and nowhere else.

@kunalspathak (Member, Author):

Next steps:

@kunalspathak (Member, Author):

Update:

I implemented the union approach that I mentioned above.

This cuts the TP impact on MinOpts by around 0.5%, which is not a significant improvement. The top reason is that the implementation of AllRegsMask itself contributes to the TP. I tried moving much of the manipulation logic of these methods under #ifdef HAS_PREDICATE_REGS and operating just on the combined field for platforms that do not currently have predicate registers, e.g. arm64. The TP regression from this change drops from 4% to just 0.5%, which clearly proves the case. So here are a few things I am trying out:

HAS_MORE_THAN_64_REGISTERS

Introduce a #define HAS_MORE_THAN_64_REGISTERS, which will be set only for platforms that have more than 64 registers. Currently, as of main, no platform falls into that category; for SVE we want to move arm64 into it, while x64 will move into it once we add APX support. So, in .NET 9, arm64 will be the only platform with HAS_MORE_THAN_64_REGISTERS set. All other platforms will continue to use AllRegsMask but will operate on the existing 64-bit representation of the bit mask, so the TP impact for those will be minimal. For example, when I tried it locally, the win/x64 TP regression dropped from 5% to 0.58% and arm64's TP regression dropped from 3% to 0.18%. Here is a rough sketch of what I am thinking:

#ifdef HAS_MORE_THAN_64_REGISTERS
    union
    {
        RegBitSet32 _registers[REGISTER_TYPE_COUNT];
        struct
        {
            RegBitSet64 _float_gpr;
            RegBitSet32 _predicateRegs;
        };
        struct
        {
            RegBitSet32 _gprRegs;
            RegBitSet32 _floatRegs;
            RegBitSet32 _predicateRegs;
        };
    };
#else
    RegBitSet64 _allRegisters;
#endif

Handling more than 64 registers

For arm64, HAS_MORE_THAN_64_REGISTERS needs to be set and, as previously mentioned, that increases the TP cost to ~4%. I have been thinking of various options to reduce the cost further. As we know, 99.9% of the methods we JIT will not use predicate registers, and for those methods we should just fall back to the existing mechanism of tracking registers in a 64-bit mask, 32 for gpr and 32 for float.

An ideal approach would be a template value parameter, AllRegsMask<hasPredicateRegs>, which would choose the implementation of all its methods depending on whether it handles just gpr/float or gpr/float/predicate. However, whether predicate registers are needed is known only after the importer, so creating AllRegsMask instances (or invoking its other methods) based on that dynamic decision is not possible.

As an alternative, I could use a bool flag inside AllRegsMask that tells whether the method being compiled needs predicate registers. We would use this flag to decide whether to take the optimal path of manipulating just _float_gpr or to also take _predicateRegs into account. The flag check definitely adds a branch on the hot path, but since 99.99% of methods will have it set to false, the branch predictor should do a good job of guessing which branch is taken. Of course, we will see some TP impact from this condition (2-3 extra instructions), but hopefully that gets CSEed by the compiler. Without the condition, even methods with no predicate registers would have to go through the code path that manipulates both _float_gpr and _predicateRegs, so I think adding the field check is better. I have not yet measured the TP impact, but will post it when I have the data. Here is how I could repurpose the _predicateRegs field:

- RegBitSet32 _predicateRegs;
+ RegBitSet32 _predicateRegs: 28
+ bool hasPredicateReg : 8

@tannergooding (Member):

@kunalspathak, has the TP cost been measured to show that the additional retired instructions actually impact the end-to-end performance of the JIT?

More instructions does not necessarily mean slower and we may be doing a lot of work trying to reduce a number that doesn't actually matter in this particular scenario. We've seen it multiple times in the past, so it might be pertinent to actually check things like how the 4% TP regression impacts application startup or time taken to crossgen, etc.

@kunalspathak (Member, Author):

> More instructions does not necessarily mean slower

I am of a similar opinion, and we do not know the actual end-to-end impact unless we measure application startup. Looking back at my work in #96386, where MinOpts performance was improved by around 10% by TP measurement, I don't see startup impacted even a tiny bit.


However, that is the closest proxy we have for how the changes could have an impact, and we need to try our best to get it to a reasonable number and to have a theory of why the numbers look the way they do. I don't think I will chase the last 1% of TP regression, but here we were seeing around 5~8%, and getting that down is my goal, by exploring all the feasible options I possibly can.

@tannergooding (Member) commented Mar 26, 2024:

> However, that is the closest proxy we have to see how the changes could have an impact

Can we not just explicitly run some start-time benchmarks to see if it's actually having an impact? I've done that for similar PRs that showed TP regressions in the past.

> where MinOpts performance was improved by around 10% TP measurement wise, I don't see it impacted even a tiny bit.

Right, this is super typical and why I raised the question. If we look at the actual perf measurements taken over the past couple of years, there is near-zero correlation between the TP change measured by SPMI (which is just the retired-instruction change) and the actual change to start/JIT time. We've had several PRs with massive TP improvements that had no impact, and PRs with almost no TP change that had a large negative impact on start time.

My main concern is then really how much we're relying on what is effectively an arbitrary number to show if a change is good or not and that we may be spending hundreds of hours of work chasing after something that's actually a non-issue. Where finding something that makes the code easier to maintain and allows us to knowingly cut out work is overall more important.

@kunalspathak (Member, Author):

> Can we not just explicitly run some start time benchmarks to see if it's actually impacting? I've done that for similar PRs that showed TP regressions in the past.

Can you share the links? I will piggyback on similar benchmarks.

> Where finding something that makes the code easier to maintain and allows us to knowingly cut out work is overall more important.

+1

@kunalspathak (Member, Author):

> I can repurpose the _predicateRegs field:
>
> - RegBitSet32 _predicateRegs;
> + RegBitSet32 _predicateRegs: 28
> + bool hasPredicateReg : 8

Repurposing runs the risk of the boolean field getting overwritten while setting the predicateRegs mask, so keeping it outside the union is better. That increases the struct size but doesn't have any overall TP impact (keeping it separate shows a 1.48% regression, collapsing it a 1.58% regression). So if I pursue this route, I will keep the field separate.

@tannergooding (Member):

> Can you share the links? I will piggy back on similar benchmarks.

I ended up doing an analysis for the EVEX work #83648 (comment)

There is also the general historical testing we do for dotnet/performance: https://pvscmdupload.blob.core.windows.net/reports/allTestHistory/TestHistoryIndexIndex.html. https://github.com/dotnet/performance/tree/main/docs has more instructions on how to run those same general tests locally, including for a custom dotnet/runtime.

@kunalspathak (Member, Author):

Update:

To share some data about the experiments I conducted based on #99658 (comment): here is the rough template of the individual AllRegsMask methods that I will discuss further.

void AllRegsMask::Method1(regNumber reg)
{
#ifdef HAS_MORE_THAN_64_REGISTERS
    if (_hasPredicateRegister)
    {
        int index = findIndexForReg(reg);
        _registers[index] = ...
    }
    else
#endif
    {
        _allRegisters = ...
    }
}

I collected the crossgen2 throughput numbers by following the steps in https://github.com/dotnet/performance/blob/main/docs/crossgen-scenarios.md to understand the impact of various options on windows/arm64.

tl;dr: I don't see any throughput impact from any of the options I tried, even though the TP numbers claim the changes regress by 6%.

main:

[2024/03/29 21:22:46][INFO] Crossgen2 Throughput - Composite - framework-r2r
[2024/03/29 21:22:46][INFO] Metric               |Average        |Min            |Max
[2024/03/29 21:22:46][INFO] ---------------------|---------------|---------------|---------------
[2024/03/29 21:22:46][INFO] Process Time         |68992.323 ms   |59755.103 ms   |78229.542 ms
[2024/03/29 21:22:46][INFO] Loading Interval     |13941.552 ms   |4528.772 ms    |23354.331 ms
[2024/03/29 21:22:46][INFO] Emitting Interval    |18524.141 ms   |18401.726 ms   |18646.555 ms
[2024/03/29 21:22:46][INFO] Jit Interval         |29016.631 ms   |27923.049 ms   |30110.212 ms
[2024/03/29 21:22:46][INFO] Compilation Interval |58367.727 ms   |57423.558 ms   |59311.895 ms
[2024/03/29 21:22:46][INFO] Time on Thread       |315331.398 ms  |255444.186 ms  |375218.611 ms
[2024/03/29 21:22:46][INFO]

no_condition:

In this configuration, I turned off HAS_MORE_THAN_64_REGISTERS for arm64, so the AllRegsMask methods operate directly on the _allRegisters field.

[2024/03/29 21:41:24][INFO] Crossgen2 Throughput - Composite - framework-r2r
[2024/03/29 21:41:24][INFO] Metric               |Average        |Min            |Max
[2024/03/29 21:41:24][INFO] ---------------------|---------------|---------------|---------------
[2024/03/29 21:41:24][INFO] Process Time         |58759.277 ms   |58294.442 ms   |59224.113 ms
[2024/03/29 21:41:24][INFO] Loading Interval     |4697.966 ms    |4687.554 ms    |4708.379 ms
[2024/03/29 21:41:24][INFO] Emitting Interval    |18715.915 ms   |18539.105 ms   |18892.726 ms
[2024/03/29 21:41:24][INFO] Jit Interval         |28563.225 ms   |27976.587 ms   |29149.863 ms
[2024/03/29 21:41:24][INFO] Compilation Interval |58323.070 ms   |57856.458 ms   |58789.681 ms
[2024/03/29 21:41:24][INFO] Time on Thread       |336415.736 ms  |281625.783 ms  |391205.689 ms
[2024/03/29 21:41:24][INFO]

has_predicate_is_false:

In this configuration, I turned on HAS_MORE_THAN_64_REGISTERS and set _hasPredicateRegister to always be false, so for arm64 we would always operate on _allRegisters. This configuration was used to test the cost of the additional check in the AllRegsMask methods.

[2024/03/29 21:50:32][INFO] Crossgen2 Throughput - Composite - framework-r2r
[2024/03/29 21:50:32][INFO] Metric               |Average        |Min            |Max
[2024/03/29 21:50:32][INFO] ---------------------|---------------|---------------|---------------
[2024/03/29 21:50:32][INFO] Process Time         |59162.564 ms   |58527.948 ms   |59797.180 ms
[2024/03/29 21:50:32][INFO] Loading Interval     |4778.086 ms    |4777.805 ms    |4778.367 ms
[2024/03/29 21:50:32][INFO] Emitting Interval    |18717.171 ms   |18447.803 ms   |18986.538 ms
[2024/03/29 21:50:32][INFO] Jit Interval         |28880.934 ms   |28066.373 ms   |29695.495 ms
[2024/03/29 21:50:32][INFO] Compilation Interval |58736.078 ms   |58102.806 ms   |59369.349 ms
[2024/03/29 21:50:32][INFO] Time on Thread       |267720.847 ms  |257456.723 ms  |277984.972 ms
[2024/03/29 21:50:32][INFO]

has_predicate_is_true:

In this configuration, I turned on HAS_MORE_THAN_64_REGISTERS and set _hasPredicateRegister to always be true, so for arm64 we would always go through the longer code that checks which field to update depending on the type of register being used.

[2024/03/29 21:46:16][INFO] Crossgen2 Throughput - Composite - framework-r2r
[2024/03/29 21:46:16][INFO] Metric               |Average        |Min            |Max
[2024/03/29 21:46:16][INFO] ---------------------|---------------|---------------|---------------
[2024/03/29 21:46:16][INFO] Process Time         |58316.585 ms   |58214.245 ms   |58418.925 ms
[2024/03/29 21:46:16][INFO] Loading Interval     |4944.463 ms    |4908.054 ms    |4980.873 ms
[2024/03/29 21:46:16][INFO] Emitting Interval    |18528.432 ms   |18487.536 ms   |18569.329 ms
[2024/03/29 21:46:16][INFO] Jit Interval         |28079.146 ms   |27905.317 ms   |28252.976 ms
[2024/03/29 21:46:16][INFO] Compilation Interval |57880.929 ms   |57784.402 ms   |57977.457 ms
[2024/03/29 21:46:16][INFO] Time on Thread       |337370.279 ms  |333428.103 ms  |341312.456 ms
[2024/03/29 21:46:16][INFO]

I introduced the flag _hasPredicateRegister, which is set to true only if the method operates on predicate registers and we need to take care of updating them. Since 99% of methods might not even have predicate registers, I thought this flag would help avoid impacting the performance of those methods. However, the flag has to be passed either as a ctor argument of AllRegsMask() or via a separate method call at each instance of AllRegsMask(), based on information that will most likely live on the compiler object. I am afraid that with this model we might run into instances where 1) the compiler object is not available to pass the flag value to AllRegsMask, or 2) we forget to set the flag to true for some instances of AllRegsMask in our code base, leading to behavior inconsistent with other instances that have the flag set to false. Hence, I am not in favor of having such a flag at all, and as seen above, the data proves that the flag doesn't make any difference to real-world throughput. Instead, we will just have #ifdef HAS_MORE_THAN_64_REGISTERS, operating on individual fields when it is set and on _allRegisters otherwise. I will start working on this option and get #98258 into a reviewable state.

@kunalspathak kunalspathak added the Priority:2 Work that is important, but not critical for the release label May 3, 2024