-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable EVEX feature: embedded broadcast for Vector128/256/512.Add() in limited cases #84821
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak Issue DetailsDescription:Enabled EVEX feature: embedded broadcast and provided optimization in some use cases of Vector256.Add() when vector base type is The enabling work involves the following changes:
Covered cases:Embedded Broadcast is enabled in Vector256.Add() with limited cases: //case 1
Vector256.Add(Vec, Vector256.Create(FloatConst));
//case 2
Vector256<float> VecCns = Vector256.Create(FloatConst);
Vector256.Add(Vec, VecCns);
//case 3
Vector256.Add(Vec, Vector256.Create(LCL_VAR));
//case 4
Vector256<float> VecCns = Vector256.Create(LCL_VAR);
Vector256.Add(Vec, VecCns); Note: Case 2 4 can only be optimized when DOTNET_TieredCompilation = 0. For case 1,2, embedded broadcast reduces the memory reference size by:
For case 3, 4, embedded broadcast reduces the number of instruction and the memory reference size by:
Future works to do:This commit is intended to give a showcase of the optimization provided by embedded broadcast feature. We will increase the coverage of supported datatype(Int, Double, etc) and operator (Bitwise AND, OR, etc) to complete this PR.
|
cc @dotnet/avx512-contrib |
I'm not sure 1 is the right approach, most of the other points seem about like I'd expect. Embedded broadcast is supported for most EVEX instructions, not just Embedded broadcast requires the data to be coming "from memory". So, we're basically just optimizing the amount of memory used for any So, I'd expect us to basically just continue containing I'd then expect us to update genOperandDesc to have handling for contained We'd then flow the |
Changes based on the reviews:
|
@tannergooding Hi, please check the changes on genOperandDesc() and inst_RV_RV_TT(). The way we handle the As for the last part of the previous review, I am a bit unclear about how a flag on the OperandDesc can be passed into the mentioned emit calls, since those calls only take a part of information from OperandDesc, say Field Handle, VarNum, IndirForm, etc. |
I imagine we'll need to pass down the
We will likely need to extend the emitter slightly. On Arm64, there is an optional |
src/coreclr/jit/emit.h
Outdated
@@ -780,6 +780,9 @@ class emitter | |||
unsigned _idCallRegPtr : 1; // IL indirect calls: addr in reg | |||
unsigned _idCallAddr : 1; // IL indirect calls: can make a direct call to iiaAddr | |||
unsigned _idNoGC : 1; // Some helpers don't get recorded in GC tables | |||
#if defined(TARGET_XARCH) | |||
unsigned _idEmbBroadcast : 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll ultimately need space to encode:
EVEX.aaa
: Embedded opmask register specifierEVEX.Z
: Zeroing/MergingEVEX.b
: Broadcast/RC/SAE ContextEVEX.L'L
: Vector length/RC
This is 7 additional bits, so it might be good to make this be the more general EVEX.b
bit that way it is usable for Broadcast
, RC
, and SAE
. Maybe _idEvexBContext
as a name?
--
There are also a couple comments higher up that need to be fixed as well, discussing the number of bits being used in each area, etc.
Thanks for the feedback! we will make the adjustments on the design based on the suggestion. |
9a2af6a
to
b66b77a
Compare
Rebased the branch and resolved the conflicts. |
b66b77a
to
f6ac011
Compare
1. deleted irrelevant comments. Move the contain check up to cover more cases.
2763f8f
to
3b3d0d1
Compare
From the results, |
More samples from different benchmarks. |
The same test failed in the last run passed in this run, look like the fail should be a random fail in CI. |
Hi @kunalspathak @tannergooding, based on the regression study attached above, do we have further actions in this PR? |
I will get back to you today. Sorry, I forgot about this. |
From the results, `TakeEvexPrefix` is the major cause of regression, does this indicate the increase in the size of `instrDesc` could be the issue? Is this for Minopts benchnmarks collection? Because that's the one that is showing 0.14% TP regression. |
Seems this is coming from update to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
few comments. @tannergooding - could you double check the asserts in instr.cpp? I have limited knowledge in that. Other than that looks good.
case NI_AVX_BroadcastScalarToVector128: | ||
case NI_AVX_BroadcastScalarToVector256: | ||
case NI_SSE3_MoveAndDuplicate: | ||
case NI_AVX512F_BroadcastScalarToVector512: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you open the issue for the follow-up work?
Thanks for pointing out, @kunalspathak. From the number, I believe the result should be all contexts collections, I am not sure how I can collect the MinOpts context number alone, will setting some env vars work? |
1. Update comment to keep up with the changes in InstrDesc. 2. Removed un-needed argumnet in the irrelevant method.
The conclusion makes sense to me. |
An issue explicitly tracking us splitting out the behavior into a new |
Thanks for the suggestion. This issue is giving a track of this part of work. |
Hi @tannergooding @kunalspathak, except a few questions I left above, I would suppose the work in this PR can be considered as completed, or are we expecting additional reviews/works in this PR before merging? |
Just need to finish doing my review pass on this. I expect to finish it early tomorrow and will hopefully merge when the completes |
Thanks so much for all the help and suggestions! |
Description:
Enabled EVEX feature: embedded broadcast and provided optimization in some use cases of Vector256.Add() when vector base type is
TYP_FLOAT
.The enabling work involves the following changes:
GTF_SIMD_ADD_EB
, to track if aNI_AVX(2)_Add
node requires embedded broadcast or not.GTF_VECCON_FROMSCALAR
, to track if a constant vector is created from a single scalar._idEmbBroadcast
, to propagate the embedded broadcast flag after lowering, continue to track this feature when interacting with hardware intrinsicaddps
.INS_Flags_EmbeddedBroadcastSupported
: to track if an intrinsic is embedded broadcast compatible.NI_Vector256_Create
node.emitIns_SIMD_R_R_C
,emitIns_R_R_C
,emitIns_SIMD_R_R_S
,emitIns_R_R_S
Covered cases:
Embedded Broadcast is enabled in Vector256.Add() with limited cases:
Note: Case 2 4 can only be optimized when DOTNET_TieredCompilation = 0.
For case 1,2, embedded broadcast reduces the memory reference size by:
For case 3, 4, embedded broadcast reduces the number of instruction and the memory reference size by:
Future works to do:
This commit is intended to give a showcase of the optimization provided by embedded broadcast feature. We will increase the coverage of supported datatype(Int, Double, etc) and operator (Bitwise AND, OR, etc) to complete this PR.