Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arm64/SVE: Add support to handle predicate registers as callee-trash #104065

Merged
merged 15 commits into from
Jun 28, 2024

Conversation

kunalspathak
Copy link
Member

Overview:

I started adding the SVE calling convention support as described in Scalable vector registers and Scalable predicate registers. However, we soon find out that for scenarios where a sve method (the one that takes or returns Vector<T>) calls into regular method (the one that doesn't take or return Vector<T>), we would end up having to preserve extra registers. This will be true for any regular method including helper call. E.g. https://godbolt.org/z/j9s9E1jWv. That tied with the fact that we have repurposed Vector<T> as a representation of "sve type" (both for scalable vector and scalable predicate) as opposed to dedicated types, as done for c++. In future releases, we want to light up Vector<T> with sve instructions, the requirement of preserving registers across call boundary might degrade the performance of scenarios where a sve method calls regular method. Hence, we decided to just stick to the NEON calling convention and just include the p0-p15 as part of "callee-trash" register set. In other words, if these predicate registers are live across the call boundary, the caller will have to preserve these registers. Everything else with respect to scalable vector registers stay the same. This would work for now, because currently, .NET is just supporting 128-bit register size anyway. In future, when we make it "truly scalable", we will re-evaluate the option of using "sve calling conventions".

Design:

By including predicate registers in "callee-trash" means we need to have space to store these registers on the stack. As
mentioned above, since the size of them is fixed, I am treating them as any other float 16-byte registers. With that, we do not need to use addvl instruction to find out the length of these registers.

Load and store of predicate registers needs to be done using LDR (predicate) and STR (predicate) instruction that accounts for variable vector length. For locals, that holds predicate register values, a reserved register xip1 is used to store the stack offset, and then it is used in ldr and str instruction with #imm being 0. So, we would see something like this:

add     xip1, fp, #32
str     p0, [xip1]

We might see cases where we need 2 instructions to store the predicate register on stack, but it is very rare that we will have this scenario.

However, for temps, since they are stored in sequence, I decided to use a "sequence number" to refer them using str instruction. So, for example, if p0, p1 and p2 are temps, we can access them like following, without having to re-populate xip1 every single time:

add     xip1, fp, #32
str     p0, [xip1, mul vl]
str     p1, [xip1, #1, mul vl]
str     p2, [xip1, #2, mul vl]

Example diffs:

References:

All stress tests pass: https://gist.github.com/kunalspathak/0ff14dd7175cb000622e0b33fde4e42c

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 27, 2024
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@kunalspathak
Copy link
Member Author

@dotnet/arm64-contrib

@kunalspathak
Copy link
Member Author

However, for temps, since they are stored in sequence, I decided to use a "sequence number"

I am reconsidering this option and instead thinking of just using the same mechanism that is used for access stack offset for locals.

@@ -5872,6 +5872,14 @@ void Compiler::lvaFixVirtualFrameOffsets()
for (TempDsc* temp = codeGen->regSet.tmpListBeg(); temp != nullptr; temp = codeGen->regSet.tmpListNxt(temp))
{
temp->tdAdjustTempOffs(delta);
#if defined(TARGET_ARM64)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there some guarantee that all the predicate temps end up adjacent on this list? Otherwise it seems like this indexing scheme might not work out.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you see below, we iterate over all the type and call tmpPreAllocateTemps with the number of slots we need for that type.

for (int i = 0; i < TYP_COUNT; i++)
{
if (var_types(i) != RegSet::tmpNormalizeType(var_types(i)))
{
// Only normalized types should have anything in the maxSpill array.
// We assume here that if type 'i' does not normalize to itself, then
// nothing else normalizes to 'i', either.
assert(maxSpill[i] == 0);
}
if (maxSpill[i] != 0)
{
JITDUMP(" %s: %d\n", varTypeName(var_types(i)), maxSpill[i]);
compiler->codeGen->regSet.tmpPreAllocateTemps(var_types(i), maxSpill[i]);
}
}

In tmpPreAllocateTemps(), we iterate through the number of slots we want to allocate and create them:

for (unsigned i = 0; i < count; i++)
{
tmpCount++;
tmpSize += size;
#ifdef TARGET_ARM
if (type == TYP_DOUBLE)
{
// Adjust tmpSize to accommodate possible alignment padding.
// Note that at this point the offsets aren't yet finalized, so we don't yet know if it will be required.
tmpSize += TARGET_POINTER_SIZE;
}
#endif // TARGET_ARM
TempDsc* temp = new (m_rsCompiler, CMK_Unknown) TempDsc(-((int)tmpCount), size, type);

@kunalspathak kunalspathak added the arm-sve Work related to arm64 SVE/SVE2 support label Jun 27, 2024
@kunalspathak
Copy link
Member Author

However, for temps, since they are stored in sequence, I decided to use a "sequence number"

I am reconsidering this option and instead thinking of just using the same mechanism that is used for access stack offset for locals.

TP impact confirms that I should just use the model I am using for "local"

image

@AndyAyersMS
Copy link
Member

In this new approach where do you reserve the stack space and initialize the base pointer reg?

@kunalspathak
Copy link
Member Author

In this new approach where do you reserve the stack space and initialize the base pointer reg?

The stack space is reserved similar to how we do it today. It is just when we have to spill and reload, I use reserved register to come up with the stack offset. See some diff examples in the PR description.

@kunalspathak
Copy link
Member Author

The stack space is reserved similar to how we do it today

We are not doing anything special because currently for us, everything is fixed size, so we are just using the existing mechanism. When we start making it variable length, we will adjust the code to find the right place in the stack frame where we can put the variable registers.

@kunalspathak
Copy link
Member Author

TP impact confirms that I should just use the model I am using for "local"

Actually, there was a typo in my change because of which I was not eliminating the predicate registers from the kill mask if the method does not have float point registers. Fixed it.

#define RBM_ALLMASK (RBM_LOWMASK|RBM_HIGHMASK)

#define RBM_MSK_CALLEE_SAVED (0)
#define RBM_MSK_CALLEE_TRASH RBM_ALLMASK
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

somewhere I should just zero it out if we are not running on SVE machine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the TP cost coming from the additional killed registers? I assume that's because we don't have a predicate registers equivalent of compFloatingPointUsed.

I wonder if you could just add a case for predicate registers here:

compiler->compFloatingPointUsed = true;

And then during allocation, mask out the predicate registers when processing kills if no predicate registers were used.

We would still be creating additional RegRecords though, but maybe this helps a bit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would still be creating additional RegRecords though, but maybe this helps a bit.

Actually I guess we were creating those RegRecords even before this PR, so I imagine it would help quite a bit for this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am already doing it in https://github.com/dotnet/runtime/pull/104065/files#diff-ad66a6bcf1fd550d5ad10d995c03218afbbc39463d36e1f2a224f9ca070a2f99R858-R860. Predicate registers exist only in presence of floating point usage. Yes, we do the newly added extra predicate registers in processKills() and that's what show up impacting TP. For non-sve arm64 machine, we don't have to iterate through them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

somewhere I should just zero it out if we are not running on SVE machine.

although note that when we altjit, we say that "sve capability enable", so we will see predicate registers and will process them during kills. The TP information will be misleading for those cases, but I will add this anyway so that on non-sve arm64 machine, we do not process them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the use of predicate registers is going to be much more rare than using float registers, hence adding this extra check would help regardless.

I will add this anyway so that on non-sve arm64 machine, we do not process them.

I don't see a good reason to try optimizing for non-SVE machines. In the future we would expect most arm64 machines to be SVE enabled, right?

I think we should rather optimize for the common case of "predicate registers not used". It should be possible now that we are only creating oneRefTypeKill per call.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a good reason to try optimizing for non-SVE machines. In the future we would expect most arm64 machines to be SVE enabled, right?

Yes

I think the use of predicate registers is going to be much more rare than using float registers, hence adding this extra check would help regardless.

Agree. I will do a separate pass for it. #104157 to track it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a good reason to try optimizing for non-SVE machines

Thinking about this a bit more, I think I will just revert bcfd8a8 and will do it properly in #104157

@kunalspathak
Copy link
Member Author

@dotnet/jit-contrib - appreciate if someone can review and let me know if there is any feedback. Want to get this merge sooner, so it can unblock remaining SVE work.

Copy link
Member

@jakobbotsch jakobbotsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Can look into some of the other TP improvements separately.

}

assert(isVectorRegister(reg1));
fmt = IF_SVE_IE_2A;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually, I wonder if this code (for SVE vectors) should be refactored call out to an emit_R_R_I function instead of falling into the non-sve code below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI arm-sve Work related to arm64 SVE/SVE2 support
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants