Arm64/SVE: Add support to handle predicate registers as callee-trash #104065
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
@dotnet/arm64-contrib
I am reconsidering this option and instead thinking of just using the same mechanism that is used for accessing stack offsets for locals.
src/coreclr/jit/lclvars.cpp (outdated)
@@ -5872,6 +5872,14 @@ void Compiler::lvaFixVirtualFrameOffsets()
    for (TempDsc* temp = codeGen->regSet.tmpListBeg(); temp != nullptr; temp = codeGen->regSet.tmpListNxt(temp))
    {
        temp->tdAdjustTempOffs(delta);
#if defined(TARGET_ARM64)
Is there some guarantee that all the predicate temps end up adjacent on this list? Otherwise it seems like this indexing scheme might not work out.
If you see below, we iterate over all the types and call tmpPreAllocateTemps with the number of slots we need for that type.
runtime/src/coreclr/jit/lsra.cpp
Lines 7814 to 7828 in 2b4a4bc
for (int i = 0; i < TYP_COUNT; i++)
{
    if (var_types(i) != RegSet::tmpNormalizeType(var_types(i)))
    {
        // Only normalized types should have anything in the maxSpill array.
        // We assume here that if type 'i' does not normalize to itself, then
        // nothing else normalizes to 'i', either.
        assert(maxSpill[i] == 0);
    }
    if (maxSpill[i] != 0)
    {
        JITDUMP(" %s: %d\n", varTypeName(var_types(i)), maxSpill[i]);
        compiler->codeGen->regSet.tmpPreAllocateTemps(var_types(i), maxSpill[i]);
    }
}
In tmpPreAllocateTemps(), we iterate through the number of slots we want to allocate and create them:
runtime/src/coreclr/jit/regset.cpp
Lines 694 to 708 in 2b4a4bc
for (unsigned i = 0; i < count; i++)
{
    tmpCount++;
    tmpSize += size;

#ifdef TARGET_ARM
    if (type == TYP_DOUBLE)
    {
        // Adjust tmpSize to accommodate possible alignment padding.
        // Note that at this point the offsets aren't yet finalized, so we don't yet know if it will be required.
        tmpSize += TARGET_POINTER_SIZE;
    }
#endif // TARGET_ARM

    TempDsc* temp = new (m_rsCompiler, CMK_Unknown) TempDsc(-((int)tmpCount), size, type);
The TP impact confirms that I should just use the model I am using for locals.
In this new approach, where do you reserve the stack space and initialize the base pointer reg?
The stack space is reserved similarly to how we do it today. It is just that when we have to spill and reload, I use the reserved register to come up with the stack offset. See some diff examples in the PR description.
We are not doing anything special because, currently, everything is fixed size for us, so we are just using the existing mechanism. When we start making it variable length, we will adjust the code to find the right place in the stack frame where we can put the variable-length registers.
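To make the "fixed size" point concrete, here is a tiny illustrative sketch (the slot size and names are assumptions drawn from the description in this PR, not the actual frame-layout code): with the current fixed 128-bit vector length, each spilled predicate value can be given a 16-byte slot laid out in sequence from a base offset, just like a spilled float register.

#include <cstdio>

// Illustrative only: fixed 16-byte slots for spilled predicate values,
// addressed relative to the start of their spill area.
constexpr int PredicateSlotSize = 16;

constexpr int predicateSlotOffset(int spillAreaOffset, int slotIndex)
{
    return spillAreaOffset + slotIndex * PredicateSlotSize;
}

int main()
{
    // Three predicate spill slots laid out in sequence from a base offset.
    for (int i = 0; i < 3; i++)
    {
        std::printf("predicate slot %d -> frame offset %d\n", i, predicateSlotOffset(-0x40, i));
    }
    return 0;
}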
Actually, there was a typo in my change that prevented me from eliminating the predicate registers from the kill mask if the method does not use floating-point registers. Fixed it.
#define RBM_ALLMASK              (RBM_LOWMASK|RBM_HIGHMASK)

#define RBM_MSK_CALLEE_SAVED     (0)
#define RBM_MSK_CALLEE_TRASH     RBM_ALLMASK
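For readers less familiar with the register-mask terminology, a minimal sketch of what these definitions imply (the mask type, bit positions, and helper below are invented for illustration, not the real dotnet/runtime encodings): with an empty callee-saved set, every predicate register that is live across a call is trashed by the callee and must be saved and restored by the caller.

#include <cstdint>

// Placeholder register-mask plumbing; only the macro-style names mirror this
// PR, and the bit positions and 64-bit width are purely illustrative.
using regMaskTP = uint64_t;
constexpr regMaskTP RBM_ALLMASK          = 0xFFFFull << 48; // p0-p15
constexpr regMaskTP RBM_MSK_CALLEE_SAVED = 0;
constexpr regMaskTP RBM_MSK_CALLEE_TRASH = RBM_ALLMASK;

// With no callee-saved predicate registers, any predicate register that is
// live across a call is trashed by the callee, so the caller must spill and
// reload it around the call.
regMaskTP callerSavedPredicates(regMaskTP liveAcrossCall)
{
    return liveAcrossCall & RBM_MSK_CALLEE_TRASH;
}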
Somewhere I should just zero it out if we are not running on an SVE machine.
Is the TP cost coming from the additional killed registers? I assume that's because we don't have a predicate-register equivalent of compFloatingPointUsed.
I wonder if you could just add a case for predicate registers here:
runtime/src/coreclr/jit/lsrabuild.cpp
Line 3094 in 55f2bc6
compiler->compFloatingPointUsed = true;
And then during allocation, mask out the predicate registers when processing kills if no predicate registers were used.
We would still be creating additional RegRecords though, but maybe this helps a bit.
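For illustration, a rough sketch of that suggestion, assuming a hypothetical compPredicateRegsUsed flag analogous to compFloatingPointUsed (the names and mask values below are placeholders, not the actual dotnet/runtime code): record whether the method ever used a predicate register, and strip the predicate bits from a call's kill mask when it did not, so the extra kills never reach the kill processing.

#include <cstdint>

using regMaskTP = uint64_t;
constexpr regMaskTP RBM_MSK_CALLEE_TRASH = 0xFFFFull << 48; // p0-p15, positions illustrative

struct CompilerFlags
{
    bool compFloatingPointUsed = false; // existing flag referenced above
    bool compPredicateRegsUsed = false; // hypothetical analogue for predicate registers
};

// Remove the predicate-register kill bits from a call's kill mask when the
// method never touched a predicate register.
regMaskTP filterCallKillMask(const CompilerFlags& comp, regMaskTP killMask)
{
    if (!comp.compPredicateRegsUsed)
    {
        killMask &= ~RBM_MSK_CALLEE_TRASH;
    }
    return killMask;
}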
We would still be creating additional RegRecords though, but maybe this helps a bit.
Actually I guess we were creating those RegRecords even before this PR, so I imagine it would help quite a bit for this PR.
I am already doing it in https://github.com/dotnet/runtime/pull/104065/files#diff-ad66a6bcf1fd550d5ad10d995c03218afbbc39463d36e1f2a224f9ca070a2f99R858-R860. Predicate registers exist only in the presence of floating-point usage. Yes, we do process the newly added extra predicate registers in processKills(), and that's what shows up as impacting TP. For non-SVE arm64 machines, we don't have to iterate through them.
Somewhere I should just zero it out if we are not running on an SVE machine.
Although note that when we altjit, we say that SVE capability is enabled, so we will see predicate registers and will process them during kills. The TP information will be misleading for those cases, but I will add this anyway so that on non-SVE arm64 machines, we do not process them.
I think the use of predicate registers is going to be much rarer than the use of float registers, hence adding this extra check would help regardless.
I will add this anyway so that on non-SVE arm64 machines, we do not process them.
I don't see a good reason to try optimizing for non-SVE machines. In the future we would expect most arm64 machines to be SVE enabled, right?
I think we should rather optimize for the common case of "predicate registers not used". It should be possible now that we are only creating one RefTypeKill per call.
I don't see a good reason to try optimizing for non-SVE machines. In the future we would expect most arm64 machines to be SVE enabled, right?
Yes
I think the use of predicate registers is going to be much rarer than the use of float registers, hence adding this extra check would help regardless.
Agree. I will do a separate pass for it. #104157 tracks it.
@dotnet/jit-contrib - would appreciate it if someone can review and let me know if there is any feedback. Want to get this merged sooner, so it can unblock the remaining SVE work.
LGTM. Can look into some of the other TP improvements separately.
}

assert(isVectorRegister(reg1));
fmt = IF_SVE_IE_2A;
Eventually, I wonder if this code (for SVE vectors) should be refactored to call out to an emit_R_R_I function instead of falling into the non-SVE code below.
I agree.
Overview:
I started adding the SVE calling convention support as described in Scalable vector registers and Scalable predicate registers. However, we soon found out that for scenarios where an SVE method (one that takes or returns Vector<T>) calls into a regular method (one that doesn't take or return Vector<T>), we would end up having to preserve extra registers. This is true for any regular method, including helper calls. E.g. https://godbolt.org/z/j9s9E1jWv. Add to that the fact that we have repurposed Vector<T> as the representation of the "sve type" (both for scalable vectors and scalable predicates), as opposed to dedicated types as done for C++.

Since in future releases we want to light up Vector<T> with SVE instructions, the requirement of preserving registers across the call boundary might degrade the performance of scenarios where an SVE method calls a regular method. Hence, we decided to just stick to the NEON calling convention and include p0-p15 in the "callee-trash" register set. In other words, if these predicate registers are live across the call boundary, the caller has to preserve them. Everything else with respect to scalable vector registers stays the same. This works for now because .NET currently supports only a 128-bit register size anyway. In the future, when we make it "truly scalable", we will re-evaluate the option of using the SVE calling conventions.

Design:
Including predicate registers in "callee-trash" means we need space to store these registers on the stack. As mentioned above, since their size is fixed, I am treating them like any other 16-byte float registers. With that, we do not need to use the addvl instruction to find out the length of these registers.

Loads and stores of predicate registers need to be done using the LDR (predicate) and STR (predicate) instructions, which account for the variable vector length. For locals that hold predicate register values, a reserved register xip1 is used to store the stack offset, and it is then used in the ldr and str instructions with #imm being 0. So, we would see something like this:

We might see cases where we need 2 instructions to store the predicate register on the stack, but it is very rare that we will hit this scenario.

However, for temps, since they are stored in sequence, I decided to use a "sequence number" to refer to them in the str instruction. So, for example, if p0, p1 and p2 are temps, we can access them as follows, without having to re-populate xip1 every single time:

Example diffs:
References:
All stress tests pass: https://gist.github.com/kunalspathak/0ff14dd7175cb000622e0b33fde4e42c