LSRA: Add support to track more than 64 registers #99658
@dotnet/arm64-contrib

@dotnet/jit-contrib @dotnet/avx512-contrib
3 has always been my preferred approach here. I expect, especially long term, it will result in the best performance and the most extensibility for new features/architectures. It does, however, require the most up-front work. I would expect that most places don't actually need the full register set; the relatively few places where LSRA needs to look at all register sets would be better handled by passing such a combined type explicitly.
I'm in favor of (4). There is no platform known or expected where we need more than 64 GPR + float/SIMD registers. On our 64-bit platforms (the ones we are focusing on), this means the most important case of GPR/float all fits in one machine register. It's also useful that GPR and float register mask values are non-overlapping, to avoid cases of confusion (and bugs) where "0x1" could be a mask for either a GPR or a float register.
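The non-overlap invariant can be sketched as follows. This is a hypothetical illustration, not RyuJIT's actual `RBM_*` layout: it assumes a simple 32/32 split of the 64-bit mask, with GPR bits in the low half and float/SIMD bits in the high half, so a raw value like `0x1` can only ever denote a GPR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of option 4's invariant: GPR masks occupy the low
// 32 bits of a 64-bit regMaskTP and float/SIMD masks the high 32 bits,
// so the two value ranges never overlap. The 32/32 split is an assumption
// for illustration, not RyuJIT's actual register counts.
using regMaskTP = uint64_t;

constexpr regMaskTP GprMask(unsigned reg)   { return regMaskTP(1) << reg; }        // reg in [0, 32)
constexpr regMaskTP FloatMask(unsigned reg) { return regMaskTP(1) << (32 + reg); } // reg in [0, 32)

// A non-empty mask is a GPR mask iff it has no bits in the high half,
// and a float mask iff it has no bits in the low half.
constexpr bool IsGprMask(regMaskTP m)   { return m != 0 && (m >> 32) == 0; }
constexpr bool IsFloatMask(regMaskTP m) { return m != 0 && (m & 0xFFFFFFFFu) == 0; }
```

With this layout, a bare `0x1` is unambiguously `GprMask(0)`; it can never be mistaken for a float register's mask.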
I just want to give an update on these efforts and the things I have explored so far with regard to option #3.

Branch-free access

I converted the fields of `struct AllRegMask` to a fixed array:

```cpp
struct AllRegMask
{
    uint32_t registers[3];
};
```

This reduced the TP regression by around 3%, as seen in #98258 (comment). However, there is still around a 3.5% TP regression in MinOpts for arm64, and my goal is to bring it down to 0 or significantly less for arm64. The reason is that arm64 does not (yet) have the predicate registers, so this work should not impact it. There will be some TP regression on x64, but improving that will come only after I improve the TP for arm64. So all my discussion below is tailored to arm64 for the moment.

Convert regMaskFloat from 8 bytes to 4 bytes

One of the ambitions I had was to make all of `AllRegMask` work as a single entity. So, assuming that the float registers' mask will still be 8 bytes long and only be present in the high 32 bits, another prototype I am working on represents the gpr and float registers as a single entity in:

```cpp
struct AllRegsMask
{
    union
    {
        struct
        {
            uint64_t combined;
#ifdef HAS_PREDICATE_REG
            uint32_t predicate;
#endif
        };
#ifdef HAS_PREDICATE_REG
        uint32_t registers[3];
#else
        uint32_t registers[2];
#endif
        struct
        {
            uint32_t gpr;
            uint32_t fpr;
#ifdef HAS_PREDICATE_REG
            uint32_t pr;
#endif
        };
    };
};
```

And of course, I have not yet started working on passing …
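A quick sanity check of the aliasing this union relies on. This is a sketch under two assumptions: a little-endian target (so `gpr` aliases the low 32 bits of `combined`), and the anonymous-struct-in-union extension that mainstream compilers accept; `HAS_PREDICATE_REG` is defined here just to exercise the three-slot shape.

```cpp
#include <cassert>
#include <cstdint>

#define HAS_PREDICATE_REG 1

// Same shape as the prototype above: `combined` aliases the gpr+fpr pair,
// and `registers[]` gives branch-free indexed access to all slots.
// Reading a union member other than the last one written is type punning;
// it is well-defined on the major C++ compilers, which RyuJIT targets.
struct AllRegsMask
{
    union
    {
        struct
        {
            uint64_t combined;
#ifdef HAS_PREDICATE_REG
            uint32_t predicate;
#endif
        };
#ifdef HAS_PREDICATE_REG
        uint32_t registers[3];
#else
        uint32_t registers[2];
#endif
        struct
        {
            uint32_t gpr;
            uint32_t fpr;
#ifdef HAS_PREDICATE_REG
            uint32_t pr;
#endif
        };
    };
};
```

Writing `gpr` and `fpr` separately and then reading `combined` (or indexing `registers[]`) touches the same bytes, which is exactly what lets the gpr+float pair be manipulated as one 64-bit entity.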
Tracking predicate registers is not an easy thing, especially because …

Next steps:
Update:

This cuts down the TP impact on MinOpts by around 0.5%, but doesn't have much significant impact; the top reason being the implementation of `HAS_MORE_THAN_64_REGISTERS`.

Introduce a HAS_MORE_THAN_64_REGISTERS

Introduce a `HAS_MORE_THAN_64_REGISTERS` define and gate the mask storage on it:

```cpp
#ifdef HAS_MORE_THAN_64_REGISTERS
    union
    {
        RegBitSet32 _registers[REGISTER_TYPE_COUNT];
        struct
        {
            RegBitSet64 _float_gpr;
            RegBitSet32 _predicateRegs;
        };
        struct
        {
            RegBitSet32 _gprRegs;
            RegBitSet32 _floatRegs;
            RegBitSet32 _predicateRegs;
        };
    };
#else
    RegBitSet64 _allRegisters;
#endif
```

Handling more than 64 registers

For arm64:

```diff
- RegBitSet32 _predicateRegs;
+ RegBitSet32 _predicateRegs : 28;
+ bool hasPredicateReg : 8;
```
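The bit-stealing idea in that diff can be sketched as follows. This is hypothetical: the diff asks for 28 + 8 bits, which would spill past a 32-bit word, so the sketch packs 28 mask bits plus a 4-bit flag to keep the field a single 32-bit word; arm64 SVE has only 16 predicate registers, so 28 mask bits are more than enough.

```cpp
#include <cassert>
#include <cstdint>

typedef uint32_t RegBitSet32;

// Sketch: steal the top bits of the 32-bit predicate-mask word for a
// "has predicate" flag, so the flag costs no extra storage.
struct PredicateSlot
{
    RegBitSet32 _predicateRegs : 28; // enough for the 16 SVE predicate registers
    RegBitSet32 hasPredicateReg : 4; // flag packed into the same word
};

static_assert(sizeof(PredicateSlot) == sizeof(RegBitSet32),
              "flag packs into the same 32-bit word");
```

Note the hazard raised later in the thread: any code path that writes the full 32-bit word (for example through a `registers[]` union view) silently clobbers the flag bits.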
@kunalspathak, has the TP cost been measured to show that the additional retired instructions are actually impacting the end-to-end performance of the JIT? More instructions does not necessarily mean slower, and we may be doing a lot of work trying to reduce a number that doesn't actually matter in this particular scenario. We've seen it multiple times in the past, so it might be pertinent to actually check things like how the 4% TP regression impacts application startup or time taken to crossgen, etc.
I am of a similar opinion, and we do not know the actual end-to-end impact unless we measure application startup. Looking back at my work in #96386, where MinOpts performance was improved by around 10% measured by TP, I don't see it having even a tiny impact. However, TP is the closest proxy we have for how the changes could have an impact, and we need to try our best to get it to a reasonable number and have a theory of why the numbers show up the way they do. I don't think I will chase the last 1% TP regression, but here we were seeing around 5~8%, and getting that down is my goal by exploring all the feasible options I possibly can.
Can we not just explicitly run some startup-time benchmarks to see if it's actually having an impact? I've done that for similar PRs that showed TP regressions in the past.
Right, this is super typical and why I raised the question. If we look at the actual perf measurements taken over the past couple of years, there is near-zero correlation between the TP change measured by SPMI (which is just retired-instruction change) and the actual change to start/JIT time. We've had several PRs with massive TP improvements that have no impact, and PRs with almost no TP change that have a large negative impact on start time. My main concern is then really how much we're relying on what is effectively an arbitrary number to decide whether a change is good or not, and that we may be spending hundreds of hours of work chasing after something that's actually a non-issue, when finding something that makes the code easier to maintain and allows us to knowingly cut out work is overall more important.
Can you share the links? I will piggyback on similar benchmarks.

+1
Repurposing has the risk of the boolean field getting overwritten while trying to set …
I ended up doing an analysis for the EVEX work: #83648 (comment). There is also the general historical testing we do for dotnet/performance: https://pvscmdupload.blob.core.windows.net/reports/allTestHistory/TestHistoryIndexIndex.html. https://github.com/dotnet/performance/tree/main/docs has more instructions on how to run those same general tests locally, including for custom dotnet/runtime builds.
Update: To share some data about the experiments I conducted based on #99658 (comment). Here is the rough template of the individual `AllRegsMask` methods:

```cpp
void AllRegsMask::Method1(regNumber reg)
{
#ifdef HAS_MORE_THAN_64_REGISTERS
    if (_hasPredicateRegister)
    {
        int index = findIndexForReg(reg);
        _registers[index] = ...
    }
    else
#endif
    {
        _allRegisters = ...
    }
}
```

I collected the crossgen2 throughput numbers by following the steps in https://github.com/dotnet/performance/blob/main/docs/crossgen-scenarios.md to understand the impact of the various options on windows/arm64.

tldr: I don't see any throughput impact for any of the options I tried, even though the TP numbers claim the changes regress by 6%.

main:
no_condition:

In this configuration, I turned off …

has_predicate_is_false:

In this configuration, I turned on …

has_predicate_is_true:

In this configuration, I turned on … I introduced flag …
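The method template above assumes a helper like `findIndexForReg`. A minimal sketch of what it might do follows; the 32/64 register-number boundaries and the slot constants are illustrative assumptions, not RyuJIT's actual `regNumber` enum values.

```cpp
#include <cassert>

// Hypothetical mapping from a register number to the slot in the
// _registers[] union: 0 = GPR, 1 = float/SIMD, 2 = predicate.
enum
{
    SLOT_GPR       = 0,
    SLOT_FLOAT     = 1,
    SLOT_PREDICATE = 2,
};

int findIndexForReg(int reg)
{
    if (reg < 32)
        return SLOT_GPR;       // assumed GPR range [0, 32)
    if (reg < 64)
        return SLOT_FLOAT;     // assumed float/SIMD range [32, 64)
    return SLOT_PREDICATE;     // assumed predicate range [64, 80)
}
```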
Currently LSRA supports the following number of registers for the various platforms:

* 16 new predicate registers for SVE
** There are 16 new GPR registers that will be added with APX, which will bring the total for x64 to 72
Until now, the number of registers for all platforms was at most 64, so we used the `regMaskTP` data structure (`typedef unsigned __int64`) to represent them and pass them around. Throughout the RyuJIT codebase, whenever we have to pass, return, or track a pool of registers, we use `regMaskTP`. Since it is 64 bits long, each bit represents a register: lower bits represent GPR registers, while higher bits represent float/vector registers. However, with #93095 adding SVE support for arm64, we need to add 16 predicate/mask registers, raising the number of registers to be tracked from 64 to 80. They will not fit in `regMaskTP` anymore, and we need an alternate way to represent registers so that we can track not only the 16 predicate registers needed for the SVE/arm64 work, but also the 16 new GPR registers that will be added to support the APX/x64 work.

Here are a few options to solve this problem:
1. Convert regMaskTP to struct

The first option, tried in #96196, was to just convert `regMaskTP` to a `struct`. To avoid refactoring all the code paths that use `regMaskTP`, we overloaded various operators for this struct. We found that it regressed throughput by around 10% for MinOpts and 7% for FullOpts, as seen here.

2. Convert regMaskTP to intrinsic vector128
The next option, explored in #94589, was to use `unsigned __int128` (only supported by clang on linux), which under the hood uses a 128-bit vector register. Our assumption was that the compiler could optimize the access pattern of `regMaskTP_128` and we would see less TP impact than with option 1. However, when we cross-compiled this change on linux/x64, we started seeing a lot of seg faults in places where `regMaskTP` was initialized. The problem, as mentioned here, was that `__int128` assumed the `regMaskTP` field was 16-byte aligned and would seg fault whenever that was not the case. So we had to give up on this option.

3. Segregate the gpr/float/predicate registers usage
This is the WIP I am currently working on in #98258. I created `regMaskGpr`, `regMaskFloat`, and `regMaskPredicate`, then went through all the places in the code base and used the relevant types. For places where any register pool is accessed, I created a struct `AllRegMask`. In the code, I then pass `AllRegMask` around, and whenever we have to update (add/remove/track) a register in the mask, I added a check to see whether the register in question is GPR/float/predicate and update the relevant field accordingly. Currently, the TP impact of this is around a 6% regression in MinOpts and 2% in FullOpts. That is much better than option 1, but still not acceptable.

4. Track predicate registers separately
Another option is to just track predicate registers separately and pass them around; there are not many places where we need to track them. The GPR/float registers would continue to be represented as a 64-bit `regMaskTP`, and predicate registers would be tracked separately on platforms that have more than 64 registers (SVE/arm64 and, in the future, APX/x64). The downside is that, in the future, when the number of GPR+float registers goes beyond 64, we will have to fall back to option 3. The other drawback of this approach is that a lot of places, most relevantly `GenTree*`, `RefPosition`, and `Interval`, have `regMaskTP` as a field. Adding another field for "predicate registers" would consume more memory in these data structures, so probably a `union` and a bit to indicate whether the `regMaskTP` holds gpr+float or predicate registers might do the trick.