Arm64/SVE: Add support to handle predicate registers as callee-trash #104065
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
@dotnet/arm64-contrib
I am reconsidering this option and instead thinking of just using the same mechanism that is used for accessing stack offsets for locals.
src/coreclr/jit/lclvars.cpp (outdated)
@@ -5872,6 +5872,14 @@ void Compiler::lvaFixVirtualFrameOffsets()
    for (TempDsc* temp = codeGen->regSet.tmpListBeg(); temp != nullptr; temp = codeGen->regSet.tmpListNxt(temp))
    {
        temp->tdAdjustTempOffs(delta);
#if defined(TARGET_ARM64)
Is there some guarantee that all the predicate temps end up adjacent on this list? Otherwise it seems like this indexing scheme might not work out.
If you see below, we iterate over all the types and call tmpPreAllocateTemps with the number of slots we need for that type.
runtime/src/coreclr/jit/lsra.cpp
Lines 7814 to 7828 in 2b4a4bc
for (int i = 0; i < TYP_COUNT; i++)
{
    if (var_types(i) != RegSet::tmpNormalizeType(var_types(i)))
    {
        // Only normalized types should have anything in the maxSpill array.
        // We assume here that if type 'i' does not normalize to itself, then
        // nothing else normalizes to 'i', either.
        assert(maxSpill[i] == 0);
    }
    if (maxSpill[i] != 0)
    {
        JITDUMP(" %s: %d\n", varTypeName(var_types(i)), maxSpill[i]);
        compiler->codeGen->regSet.tmpPreAllocateTemps(var_types(i), maxSpill[i]);
    }
}
In tmpPreAllocateTemps(), we iterate through the number of slots we want to allocate and create them:
runtime/src/coreclr/jit/regset.cpp
Lines 694 to 708 in 2b4a4bc
for (unsigned i = 0; i < count; i++)
{
    tmpCount++;
    tmpSize += size;

#ifdef TARGET_ARM
    if (type == TYP_DOUBLE)
    {
        // Adjust tmpSize to accommodate possible alignment padding.
        // Note that at this point the offsets aren't yet finalized, so we don't yet know if it will be required.
        tmpSize += TARGET_POINTER_SIZE;
    }
#endif // TARGET_ARM

    TempDsc* temp = new (m_rsCompiler, CMK_Unknown) TempDsc(-((int)tmpCount), size, type);
The TP impact confirms that I should just use the model I am using for locals.
In this new approach, where do you reserve the stack space and initialize the base pointer reg?
The stack space is reserved similarly to how we do it today. It is just that when we have to spill and reload, I use the reserved register to come up with the stack offset. See some diff examples in the PR description.
We are not doing anything special because, currently, everything is fixed size for us, so we are just using the existing mechanism. When we start making it variable length, we will adjust the code to find the right place in the stack frame where we can put the variable-length registers.
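To make the "fixed size" point concrete, here is a tiny illustrative sketch (the slot size and names are assumptions drawn from the description in this PR, not the actual frame-layout code): with the current fixed 128-bit vector length, each spilled predicate value can be given a 16-byte slot laid out in sequence from a base offset, just like a spilled float register.

#include <cstdio>

// Illustrative only: fixed 16-byte slots for spilled predicate values,
// addressed relative to the start of their spill area.
constexpr int PredicateSlotSize = 16;

constexpr int predicateSlotOffset(int spillAreaOffset, int slotIndex)
{
    return spillAreaOffset + slotIndex * PredicateSlotSize;
}

int main()
{
    // Three predicate spill slots laid out in sequence from a base offset.
    for (int i = 0; i < 3; i++)
    {
        std::printf("predicate slot %d -> frame offset %d\n", i, predicateSlotOffset(-0x40, i));
    }
    return 0;
}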
Actually, there was a typo in my change that prevented me from eliminating the predicate registers from the kill mask if the method does not use floating-point registers. Fixed it.
#define RBM_ALLMASK              (RBM_LOWMASK|RBM_HIGHMASK)

#define RBM_MSK_CALLEE_SAVED     (0)
#define RBM_MSK_CALLEE_TRASH     RBM_ALLMASK
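For readers less familiar with the register-mask terminology, a minimal sketch of what these definitions imply (the mask type, bit positions, and helper below are invented for illustration, not the real dotnet/runtime encodings): with an empty callee-saved set, every predicate register that is live across a call is trashed by the callee and must be saved and restored by the caller.

#include <cstdint>

// Placeholder register-mask plumbing; only the macro-style names mirror this
// PR, and the bit positions and 64-bit width are purely illustrative.
using regMaskTP = uint64_t;
constexpr regMaskTP RBM_ALLMASK          = 0xFFFFull << 48; // p0-p15
constexpr regMaskTP RBM_MSK_CALLEE_SAVED = 0;
constexpr regMaskTP RBM_MSK_CALLEE_TRASH = RBM_ALLMASK;

// With no callee-saved predicate registers, any predicate register that is
// live across a call is trashed by the callee, so the caller must spill and
// reload it around the call.
regMaskTP callerSavedPredicates(regMaskTP liveAcrossCall)
{
    return liveAcrossCall & RBM_MSK_CALLEE_TRASH;
}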
Somewhere I should just zero it out if we are not running on an SVE machine.
Is the TP cost coming from the additional killed registers? I assume that's because we don't have a predicate-register equivalent of compFloatingPointUsed.
I wonder if you could just add a case for predicate registers here:
runtime/src/coreclr/jit/lsrabuild.cpp
Line 3094 in 55f2bc6
compiler->compFloatingPointUsed = true;
And then during allocation, mask out the predicate registers when processing kills if no predicate registers were used.
We would still be creating additional RegRecords though, but maybe this helps a bit.
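For illustration, a rough sketch of that suggestion, assuming a hypothetical compPredicateRegsUsed flag analogous to compFloatingPointUsed (the names and mask values below are placeholders, not the actual dotnet/runtime code): record whether the method ever used a predicate register, and strip the predicate bits from a call's kill mask when it did not, so the extra kills never reach the kill processing.

#include <cstdint>

using regMaskTP = uint64_t;
constexpr regMaskTP RBM_MSK_CALLEE_TRASH = 0xFFFFull << 48; // p0-p15, positions illustrative

struct CompilerFlags
{
    bool compFloatingPointUsed = false; // existing flag referenced above
    bool compPredicateRegsUsed = false; // hypothetical analogue for predicate registers
};

// Remove the predicate-register kill bits from a call's kill mask when the
// method never touched a predicate register.
regMaskTP filterCallKillMask(const CompilerFlags& comp, regMaskTP killMask)
{
    if (!comp.compPredicateRegsUsed)
    {
        killMask &= ~RBM_MSK_CALLEE_TRASH;
    }
    return killMask;
}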
We would still be creating additional RegRecords though, but maybe this helps a bit.
Actually I guess we were creating those RegRecords even before this PR, so I imagine it would help quite a bit for this PR.
I am already doing it in https://github.com/dotnet/runtime/pull/104065/files#diff-ad66a6bcf1fd550d5ad10d995c03218afbbc39463d36e1f2a224f9ca070a2f99R858-R860. Predicate registers exist only in the presence of floating-point usage. Yes, we do process the newly added extra predicate registers in processKills(), and that's what shows up as impacting TP. For non-SVE arm64 machines, we don't have to iterate through them.
Somewhere I should just zero it out if we are not running on an SVE machine.
Although note that when we altjit, we say that SVE capability is enabled, so we will see predicate registers and will process them during kills. The TP information will be misleading for those cases, but I will add this anyway so that on non-SVE arm64 machines, we do not process them.
I think the use of predicate registers is going to be much rarer than the use of float registers, hence adding this extra check would help regardless.
I will add this anyway so that on non-SVE arm64 machines, we do not process them.
I don't see a good reason to try optimizing for non-SVE machines. In the future we would expect most arm64 machines to be SVE enabled, right?
I think we should rather optimize for the common case of "predicate registers not used". It should be possible now that we are only creating one RefTypeKill per call.
I don't see a good reason to try optimizing for non-SVE machines. In the future we would expect most arm64 machines to be SVE enabled, right?
Yes
I think the use of predicate registers is going to be much rarer than the use of float registers, hence adding this extra check would help regardless.
Agree. I will do a separate pass for it. #104157 tracks it.
@dotnet/jit-contrib - would appreciate it if someone can review and let me know if there is any feedback. Want to get this merged sooner, so it can unblock the remaining SVE work.
LGTM. Can look into some of the other TP improvements separately.
}

assert(isVectorRegister(reg1));
fmt = IF_SVE_IE_2A;
Eventually, I wonder if this code (for SVE vectors) should be refactored to call out to an emit_R_R_I function instead of falling into the non-SVE code below.
I agree.
Overview:
I started adding the SVE calling convention support as described in Scalable vector registers and Scalable predicate registers. However, we soon found out that for scenarios where an SVE method (one that takes or returns Vector<T>) calls into a regular method (one that doesn't take or return Vector<T>), we would end up having to preserve extra registers. This is true for any regular method, including helper calls. E.g. https://godbolt.org/z/j9s9E1jWv. Add to that the fact that we have repurposed Vector<T> as the representation of the "sve type" (both for scalable vectors and scalable predicates), as opposed to dedicated types as done for C++.

Since in future releases we want to light up Vector<T> with SVE instructions, the requirement of preserving registers across the call boundary might degrade the performance of scenarios where an SVE method calls a regular method. Hence, we decided to just stick to the NEON calling convention and include p0-p15 in the "callee-trash" register set. In other words, if these predicate registers are live across the call boundary, the caller has to preserve them. Everything else with respect to scalable vector registers stays the same. This works for now because .NET currently supports only a 128-bit register size anyway. In the future, when we make it "truly scalable", we will re-evaluate the option of using the SVE calling conventions.

Design:
Including predicate registers in "callee-trash" means we need space to store these registers on the stack. As mentioned above, since their size is fixed, I am treating them like any other 16-byte float registers. With that, we do not need to use the addvl instruction to find out the length of these registers.

Loads and stores of predicate registers need to be done using the LDR (predicate) and STR (predicate) instructions, which account for the variable vector length. For locals that hold predicate register values, a reserved register xip1 is used to store the stack offset, and it is then used in the ldr and str instructions with #imm being 0. So, we would see something like this:

We might see cases where we need 2 instructions to store the predicate register on the stack, but it is very rare that we will hit this scenario.

However, for temps, since they are stored in sequence, I decided to use a "sequence number" to refer to them in the str instruction. So, for example, if p0, p1 and p2 are temps, we can access them as follows, without having to re-populate xip1 every single time:

Example diffs:
References:
All stress tests pass: https://gist.github.com/kunalspathak/0ff14dd7175cb000622e0b33fde4e42c