JIT ARM64-SVE: Allow LCL_VARs to store as mask #99608
Conversation
@dotnet/arm64-contrib @kunalspathak @tannergooding
src/coreclr/jit/codegenarm64.cpp
```cpp
if (ins == INS_sve_str && !varTypeUsesMaskReg(targetType))
{
    emit->emitIns_S_R(ins, attr, dataReg, varNum, /* offset */ 0, INS_SCALABLE_OPTS_UNPREDICATED);
}
```
Why does this pass down `INS_SCALABLE_OPTS_UNPREDICATED` rather than doing the inverse? That is, I imagine most instructions we encounter will end up unpredicated (or effectively unpredicated by using a TrueMask), so I'd expect we end up with fewer checks overall if we simply said `if (varTypeUsesMaskReg(targetType)) { insOpts |= INS_SCALABLE_OPTS_PREDICATED; }`.

Otherwise, we end up having to special-case every single instruction that has a predicated and an unpredicated form, and additionally check whether it uses a mask register or not.
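As a rough sketch of what I mean (using the `emitIns_S_R` overload and the scalable-opts flags from this PR; not final code):

```cpp
// Default to unpredicated; only mark the instruction predicated when the
// target actually lives in a mask (predicate) register.
insScalableOpts sopt = INS_SCALABLE_OPTS_NONE;
if (varTypeUsesMaskReg(targetType))
{
    sopt = INS_SCALABLE_OPTS_PREDICATED;
}
emit->emitIns_S_R(ins, attr, dataReg, varNum, /* offset */ 0, sopt);
```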
That's because the emit function is inconveniently the wrong way around. I was going to fix the emit function up in this PR, but once register allocation is working we can get rid of the enum and all of these checks.

Given that the register allocation work is going to take some time, maybe I should fix up the emit code in this PR.
Sounds good to me. Just wanted to get clarification, as it did seem backwards.
> That's because the emit function is inconveniently the wrong way around

Switched this around. It no longer matches some of the other emit_R_R_etc functions, but that's OK because it'll all vanish eventually.
src/coreclr/jit/importer.cpp
```cpp
// Masks must be converted to vectors before being stored to memory.
// But, for local stores we can optimise away the conversion
if (op1->OperIsHWIntrinsic() && op1->AsHWIntrinsic()->GetHWIntrinsicId() == NI_Sve_ConvertMaskToVector)
{
    op1 = op1->AsHWIntrinsic()->Op(1);
    lvaTable[lclNum].lvType = TYP_MASK;
    lclTyp = lvaGetActualType(lclNum);
}
```
Why do they need to be converted before being stored to memory?
Shouldn't this be entirely dependent on the target type, just with a specialization for locals since we want to allow efficient consumption in the typical case?
Is this just a suggestion to change the comments, or a code change too?
Just a change to the comment.

I think we only need to clarify that we're optimizing masks stored to locals to allow better consumption in the expected typical case, and that we have the handling in place to ensure that consumers which actually need a vector get the ConvertMaskToVector inserted back in.

Noting, however, that this could be an incorrect optimization in some cases. For example, if it's a user-defined local where actual vectors are also stored, then it would require a ConvertVectorToMask to be inserted, but that would be a lossy conversion and therefore unsafe.

So I'm not entirely certain this is the "right" place to be doing this either. We might need to actually do this in a later phase, where we can observe all uses of a local from the perspective of user-defined code, so that we only do this optimization when all values being stored are masks.
> Noting, however, that this could be an incorrect optimization in some cases. For example, if it's a user-defined local where actual vectors are also stored, then it would require a ConvertVectorToMask to be inserted, but that would be a lossy conversion and therefore unsafe. So I'm not entirely certain this is the "right" place to be doing this either. We might need to actually do this in a later phase where we can observe all uses of a local from the perspective of user-defined code, so that we only do this optimization when all values being stored are masks.

I'm not sure this is the right place either. Originally I wanted to avoid creating the conversion in the first place, but realised it has to be created and then removed after/during creation of the local. That's how I ended up putting it in the importer.

I can't see an obvious later phase this would be done in. Morph? Or maybe during lowering? Anything that already parses local vars would be better?
I believe the correct location for this would be lclmorph, which is the same place we do sequencing, mark address-exposed locals, and even combine field stores to SIMD. However, @EgorBo or @jakobbotsch may have a better idea of where the code should go.

The consideration in general is really just that we need to find non-address-exposed TYP_SIMD locals where all stores are ConvertMaskToVector, so we can replace them with TYP_MASK locals instead. There are of course other optimizations that could be done, including splitting a local into two if some stores are masks and some are vectors, but those will be less common than the first.

This work needs to happen after import, since that's the only code that would be using pre-existing locals. Any locals introduced by CSE or other phases involving a mask will already be TYP_MASK themselves.
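Conceptually, the per-local candidate check would look something like this (a sketch with assumed helper names, not code from this PR):

```cpp
// Sketch: a local qualifies for retyping to TYP_MASK only if it is a
// non-address-exposed SIMD local and every value stored to it is a
// ConvertMaskToVector node.
bool isMaskLocalCandidate(Compiler* comp, unsigned lclNum)
{
    LclVarDsc* varDsc = comp->lvaGetDesc(lclNum);

    // Only non-address-exposed SIMD locals are candidates.
    if (!varTypeIsSIMD(varDsc->TypeGet()) || varDsc->IsAddressExposed())
    {
        return false;
    }

    // Every store must be a ConvertMaskToVector; otherwise retyping would
    // force a lossy ConvertVectorToMask on the other stores, which is unsafe.
    return allStoresAreConvertMaskToVector(comp, lclNum); // hypothetical walk
}
```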
Indeed, this kind of brute-force changing of a local's type is not safe. It will break all sorts of IR invariants we want to be able to rely on if you don't also do a full pass to fix up other uses of the local.

It seems like this kind of optimization should be its own separate pass after local morph, when we know whether or not the local is address exposed. We have linked locals at that point, so finding the occurrences is relatively efficient. You can leave some breadcrumb around to only run the pass when there are opportunities.
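The breadcrumb could be as simple as the following (the flag and phase names here are made up for illustration):

```cpp
// During import, set a flag whenever a mask-producing store to a local is
// seen; the retyping pass then only runs when there is something to do.
// (compMaskStoreSeen and the phase name are hypothetical.)
if (compMaskStoreSeen)
{
    DoPhase(this, PHASE_OPTIMIZE_MASK_LOCALS, &Compiler::optRetypeMaskLocals);
}
```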
> Indeed this kind of brute force changing of a local's type is not safe.

I'll add a new pass then. For now I've removed the importer changes.

Technically, this PR could be merged as is. It won't make any JIT difference by itself, but it's quite a lot of code that is a blocker for other HW intrinsic work I'm doing (the embedded masks).
> We have linked locals at that point

Can you expand a little on what you mean here? I want to make sure I'm parsing the right data in the pass.
Doing it in a separate PR sounds good to me. Presumably it needs some heuristics to figure out whether it's profitable to make the replacements as well.

> Can you expand a little on what you mean here? I want to make sure I'm parsing the right data in the pass.

See Statement::LocalsTreeList. It allows you to quickly check whether a statement contains a local you are interested in.
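A minimal usage sketch (assuming a pass that walks each block's statements):

```cpp
// Use the statement's linked list of local nodes to cheaply test whether a
// statement mentions the candidate local, without walking the whole tree.
for (Statement* const stmt : block->Statements())
{
    for (GenTreeLclVarCommon* const lcl : stmt->LocalsTreeList())
    {
        if (lcl->GetLclNum() == lclNum)
        {
            // Found an occurrence of the candidate local in this statement.
        }
    }
}
```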
src/coreclr/jit/importer.cpp
```cpp
if (op1->OperIsHWIntrinsic() && op1->AsHWIntrinsic()->GetHWIntrinsicId() == NI_Sve_ConvertMaskToVector)
{
    op1 = op1->AsHWIntrinsic()->Op(1);
    lvaTable[lclNum].lvType = TYP_MASK;
```
I see we're doing this retyping here. But I don't see where we're doing any fixups to ensure that something which reads the TYP_MASK local but expects a TYP_SIMD will get the ConvertMaskToVector inserted back.

Imagine, for example, something like:

```csharp
Vector<int> mask = Vector.GreaterThan(x, y);
return mask + Vector.Create(5);
```

Here Vector.GreaterThan will produce a TYP_MASK, but the latter + consumes it as a vector. -- Noting that this is an abnormal pattern, but something we still need to account for and ensure works correctly.
I might expect such logic to exist as a general ConvertMaskToVector helper, as part of impSimdPopStack and/or the general import helpers we have in hwintrinsic.cpp.
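Something along these lines, as a sketch (the helper name is made up; gtNewSimdHWIntrinsicNode is the existing node factory):

```cpp
// Sketch of a general helper: when a consumer expects a vector but the
// produced value is a mask, wrap it in ConvertMaskToVector.
GenTree* Compiler::impMaskToVectorIfNeeded(GenTree*    op,
                                           var_types   type,
                                           CorInfoType simdBaseJitType,
                                           unsigned    simdSize)
{
    if (op->TypeIs(TYP_MASK))
    {
        op = gtNewSimdHWIntrinsicNode(type, op, NI_Sve_ConvertMaskToVector, simdBaseJitType, simdSize);
    }
    return op;
}
```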
TP issues should now be resolved. I've removed all of that code and instead added a new instruction, INS_sve_str_mask. All of this is removable once register allocation is done.
I'm happy again with the PR; everything is fixed up and all tests look good.
Given that this PR is mostly adding support for future scenarios, I wouldn't expect any TP impact from it, but there is some regression. The changes in emitIns_R_S might be the cause. I am wondering about the switching of format when isScalable: is that correct? Basically, we still want to execute code like codeGen->instGen_Set_Reg_To_Imm(EA_PTRSIZE, rsvdReg, imm); while still retaining the scalable format?
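For reference, the path being asked about has roughly this shape (paraphrased, with an assumed offset-range check; not the exact source):

```cpp
// When the stack offset does not fit the instruction's immediate encoding,
// it is materialized into the reserved register before emitting the access.
// The question is whether the scalable-format path still takes this route.
if (!isValidOffset(imm)) // stand-in for the encoding's real range check
{
    regNumber rsvdReg = codeGen->rsGetRsvdReg();
    codeGen->instGen_Set_Reg_To_Imm(EA_PTRSIZE, rsvdReg, imm);
}
```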
src/coreclr/jit/instr.h
```
@@ -66,6 +66,10 @@ enum instruction : uint32_t

    INS_lea, // Not a real instruction. It is used for load the address of stack locals

    // TODO-SVE: Removable once REG_V0 and REG_P0 are distinct
    INS_sve_str_mask, // Not a real instruction. It is used to load masks from the stack
```
This should go in instrsarm64.h. We already have one for "align", for example.
Looking through the PR, everything that could affect the TP is:
Yes, that's right. I've refactored this a little now. There should be no overall functional change, but it should be clearer and hopefully give better throughput.
LGTM
/azp run runtime-coreclr superpmi-diffs
/azp run runtime-coreclr superpmi-replay
Azure Pipelines successfully started running 1 pipeline(s).
Azure Pipelines successfully started running 1 pipeline(s).
The superpmi-replay failures seem to be timeouts on arm32.
Currently all mask variables are converted to a vector before being stored to memory, and converted from vector back to mask after loading from memory.

This patch allows LCL_VARs to have a type of mask, skipping the conversions.

When creating the convert-to-mask node, there is no way of knowing what the parent node is. This means the convert to mask must always be created, and it can then be removed by the LCL_VAR store. This is done in importer.cpp (there may be a better place to do this).

Other changes are required to allow the LCL_VAR to have the type TYP_MASK.

I suspect the changes in codegenarm64.cpp and emitarm64.cpp introduce a TP regression. These are removable once predicates have register allocation.