Enable EVEX feature: embedded broadcast for Vector128/256/512.Add() in limited cases #84821

Ruihan-Yin · 2023-04-14T03:10:35Z

Description:

Enables the EVEX embedded broadcast feature and optimizes some uses of Vector256.Add() when the vector base type is TYP_FLOAT.

The enabling work involves the following changes:

1. A new GenTree flag, GTF_SIMD_ADD_EB, to track whether an NI_AVX(2)_Add node requires embedded broadcast.
2. A new GenTree flag, GTF_VECCON_FROMSCALAR, to track whether a constant vector is created from a single scalar.
3. An extra bit on the instruction descriptor, _idEmbBroadcast, to propagate the embedded broadcast flag after lowering and to keep tracking this feature when interacting with the hardware intrinsic addps.
4. A new instruction flag, INS_Flags_EmbeddedBroadcastSupported, to track whether an intrinsic is embedded broadcast compatible.
5. Modified logic for setting containment of the Add node during lowering.
6. Additional checks and special lowering for the NI_Vector256_Create node.
7. Updated argument lists for these emit methods: emitIns_SIMD_R_R_C, emitIns_R_R_C, emitIns_SIMD_R_R_S, emitIns_R_R_S

Covered cases:

Embedded broadcast is enabled for Vector256.Add() in a limited set of cases:

//case 1
Vector256.Add(Vec, Vector256.Create(FloatConst));

//case 2
Vector256<float> VecCns = Vector256.Create(FloatConst); 
Vector256.Add(Vec, VecCns);

//case 3
Vector256.Add(Vec, Vector256.Create(LCL_VAR));

//case 4
Vector256<float> VecCns = Vector256.Create(LCL_VAR); 
Vector256.Add(Vec, VecCns);

Note: Cases 2 and 4 can only be optimized when DOTNET_TieredCompilation=0.

For cases 1 and 2, embedded broadcast reduces the size of the memory operand:

Original assembly: 
vaddps ymm, ymm, m256
Optimized assembly: 
vaddps ymm, ymm, m32 {1to8}

For cases 3 and 4, embedded broadcast reduces both the number of instructions and the size of the memory operand:

Original assembly: 
vbroadcastss ymm, m32
vaddps ymm, ymm, ymm
Optimized assembly: 
vaddps ymm, ymm, m32 {1to8}
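
To illustrate why the two forms above produce the same result, here is a minimal C++ sketch that models the lane-wise semantics in scalar code. It is not JIT code: `Vec8f`, `AddFullVector`, and `AddEmbeddedBroadcast` are hypothetical names introduced only for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Eight 32-bit float lanes, matching the width of one ymm register.
using Vec8f = std::array<float, 8>;

// Models `vaddps ymm, ymm, m256`: the second operand is a full 32-byte vector in memory.
Vec8f AddFullVector(const Vec8f& lhs, const Vec8f& rhsInMemory)
{
    Vec8f result{};
    for (std::size_t i = 0; i < result.size(); i++)
    {
        result[i] = lhs[i] + rhsInMemory[i];
    }
    return result;
}

// Models `vaddps ymm, ymm, m32 {1to8}`: only one 4-byte scalar is read from memory,
// and it is logically replicated to all eight lanes before the add.
Vec8f AddEmbeddedBroadcast(const Vec8f& lhs, const float* scalarInMemory)
{
    assert(scalarInMemory != nullptr);
    Vec8f result{};
    for (std::size_t i = 0; i < result.size(); i++)
    {
        result[i] = lhs[i] + *scalarInMemory;
    }
    return result;
}
```

When every lane of the full-vector constant holds the same value, both functions return identical results, while the broadcast form needs 4 bytes of constant data instead of 32.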

Future work:

This commit is intended to showcase the optimization provided by the embedded broadcast feature. We will extend coverage to more data types (Int, Double, etc.) and operators (bitwise AND, OR, etc.) to complete this PR.

ghost · 2023-04-14T03:10:47Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	Ruihan-Yin
Assignees:	-
Labels:	`area-CodeGen-coreclr`, `community-contribution`
Milestone:	-

BruceForstall · 2023-04-14T05:51:57Z

cc @dotnet/avx512-contrib

tannergooding · 2023-04-14T14:03:07Z

The enabling work involves the following changes:

1. A new GenTree flag, GTF_SIMD_ADD_EB, to track whether an NI_AVX(2)_Add node requires embedded broadcast.
2. A new GenTree flag, GTF_VECCON_FROMSCALAR, to track whether a constant vector is created from a single scalar.
3. An extra bit on the instruction descriptor, _idEmbBroadcast, to propagate the embedded broadcast flag after lowering and to keep tracking this feature when interacting with the hardware intrinsic addps.
4. A new instruction flag, INS_Flags_EmbeddedBroadcastSupported, to track whether an intrinsic is embedded broadcast compatible.
5. Modified logic for setting containment of the Add node during lowering.
6. Additional checks and special lowering for the NI_Vector256_Create node.
7. Updated argument lists for these emit methods: emitIns_SIMD_R_R_C, emitIns_R_R_C, emitIns_SIMD_R_R_S, emitIns_R_R_S

I'm not sure 1 is the right approach; most of the other points seem about what I'd expect.

Embedded broadcast is supported for most EVEX instructions, not just add, so we want to ensure names aren't specific to any individual instruction.

Embedded broadcast requires the data to be coming "from memory". So, we're basically just optimizing the amount of memory used for any GenTreeVecCon nodes by allowing us to emit a 4/8-byte constant rather than a 16/32/64-byte constant and also allowing containment of Broadcast intrinsics...
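
As a rough illustration of the size arithmetic described above, the following C++ sketch computes the `{1toN}` decorator and the constant-data savings for a given vector and element size. `BroadcastFactor`, `BroadcastDecorator`, and `ConstantBytesSaved` are hypothetical helpers, not functions from the JIT.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: number of lanes a single broadcast element is replicated to.
unsigned BroadcastFactor(unsigned vectorBytes, unsigned elementBytes)
{
    // Vector sizes of 16/32/64 bytes; embedded broadcast elements are 4 or 8 bytes.
    assert(vectorBytes == 16 || vectorBytes == 32 || vectorBytes == 64);
    assert(elementBytes == 4 || elementBytes == 8);
    return vectorBytes / elementBytes;
}

// Hypothetical helper: the decorator as it appears in disassembly, e.g. "{1to8}".
std::string BroadcastDecorator(unsigned vectorBytes, unsigned elementBytes)
{
    return "{1to" + std::to_string(BroadcastFactor(vectorBytes, elementBytes)) + "}";
}

// Hypothetical helper: bytes of constant data saved by emitting one element
// instead of the full vector.
unsigned ConstantBytesSaved(unsigned vectorBytes, unsigned elementBytes)
{
    assert(vectorBytes >= elementBytes);
    return vectorBytes - elementBytes;
}
```

For the Vector256 of float case in this PR, that gives `{1to8}` and a saving of 28 bytes per constant.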

So, I'd expect us to basically just continue containing GenTreeVecCon as is, and simply extend lowering to also support containing relevant NI_*_Broadcast nodes based on the EmbeddedBroadcastSupported flag.

I'd then expect us to update genOperandDesc to have handling for contained Broadcast nodes and to optimize the GenTreeVecCon case when all elements are equal. -- Tracking some "this was created from a scalar" can help, but I don't know if it buys much as we'll still need to check for other broadcast patterns "somewhere" in the list. It'd basically just be an optimization if we needed to check that scenario more than once, and I'm not sure if we do.
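
The "all elements are equal" check mentioned above could look roughly like the following C++ sketch. It works on the raw bytes of a constant and is an assumption about one possible shape of such a check; `IsBroadcastConstant` is a hypothetical name, not an existing JIT helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns true when every element of a constant vector has the
// same bit pattern as the first element, so a single element plus broadcast suffices.
bool IsBroadcastConstant(const std::uint8_t* data, std::size_t vectorBytes, std::size_t elementBytes)
{
    assert(data != nullptr);
    assert(elementBytes != 0 && vectorBytes % elementBytes == 0);

    for (std::size_t offset = elementBytes; offset < vectorBytes; offset += elementBytes)
    {
        // Compare bit patterns rather than values, so +0.0f and -0.0f count as different.
        if (std::memcmp(data, data + offset, elementBytes) != 0)
        {
            return false;
        }
    }
    return true;
}
```

Comparing bit patterns keeps the check independent of the base type, although the element size still has to come from somewhere, which is the base type question raised later in this thread.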

We'd then flow the OperandDesc info, which includes a flag on "embedded broadcast", down into emitIns_R_*_C, R_*_S, and R_*_A calls (the last is important for containing NI_*_Broadcast nodes) so the instrDesc can also track the information somehow.

Ruihan-Yin · 2023-04-19T16:48:38Z

Changes based on the reviews:

genOperandDesc() now has logic to check whether an operand is a contained broadcast node. This logic can be complicated, since we are actually looking for the pattern: embedded broadcast compatible intrinsic -> broadcast -> createScalar -> local variable.
So I introduced a flag on the broadcast node, GTF_BROADCAST_EMBEDDED, to avoid duplicated checks that might otherwise happen in the lowering, containment check, and emit phases. This flag can be applied to multiple broadcast intrinsics to indicate whether the broadcast node is part of an embedded broadcast and needs to be contained.
For the GenTreeVecCon node, if we keep it as is in the previous stages and only emit a scalar during codegen, we face the following difficulty: it is hard to get the base type of the vector, and therefore hard to create the corresponding scalar operand descriptor. In the implementation, I simply used the instruction being emitted to get the base type, but I am not sure what the best way to handle this is at this phase.
Converted GTF_SIMD_ADD_EB to an intrinsic flag, HW_Flag_EmbBroadcastCompatible, which only indicates the embedded broadcast compatibility of the intrinsic; whether embedded broadcast is actually enabled is checked during emit.
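
The pattern described in this comment (compatible intrinsic -> broadcast -> createScalar -> local variable) can be sketched on a toy tree as follows. This is only an illustration under simplified assumptions: `NodeKind`, `Node`, and `IsEmbeddedBroadcastCandidate` are hypothetical and do not mirror the real GenTree types.

```cpp
#include <cassert>

// Hypothetical, heavily simplified node kinds for illustration only.
enum class NodeKind
{
    EmbBroadcastCompatibleOp, // e.g. an Add intrinsic flagged as compatible
    Broadcast,
    CreateScalar,
    LocalVar,
    Other
};

struct Node
{
    NodeKind    kind;
    const Node* operand; // the single operand of interest, or nullptr
};

// Returns true when `op`, used as an operand of `parent`, matches
// compatible intrinsic -> broadcast -> createScalar -> local variable.
bool IsEmbeddedBroadcastCandidate(const Node* parent, const Node* op)
{
    assert(parent != nullptr);

    if (parent->kind != NodeKind::EmbBroadcastCompatibleOp)
    {
        return false;
    }
    if (op == nullptr || op->kind != NodeKind::Broadcast)
    {
        return false;
    }

    const Node* scalar = op->operand;
    if (scalar == nullptr || scalar->kind != NodeKind::CreateScalar)
    {
        return false;
    }

    const Node* source = scalar->operand;
    return source != nullptr && source->kind == NodeKind::LocalVar;
}
```

Caching the result of such a walk in a node flag, as the comment proposes, avoids repeating it in each phase that needs the answer.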

Ruihan-Yin · 2023-04-20T18:06:51Z

@tannergooding Hi, please check the changes on genOperandDesc() and inst_RV_RV_TT().

The broadcast node can be handled as expected, but GenTreeVecCon is trickier: given the information available within inst_RV_RV_TT() (the instruction and the constant vector node), the base type of the vector is unknown, so we may only be able to get the base type from the instruction. I am not sure how we can handle this at this point.

As for the last part of the previous review, I am unclear about how a flag on the OperandDesc can be passed into the mentioned emit calls, since those calls only take part of the information from OperandDesc, say the field handle, VarNum, IndirForm, etc.

tannergooding · 2023-04-21T19:08:58Z

given the information available within inst_RV_RV_TT() (the instruction and the constant vector node), the base type of the vector is unknown, so we may only be able to get the base type from the instruction. I am not sure how we can handle this at this point.

I imagine we'll need to pass down the simdBaseType so the right decision can be made when the constant is being emitted.

I am unclear about how a flag on the OperandDesc can be passed into the mentioned emit calls, since those calls only take part of the information from OperandDesc, say the field handle, VarNum, IndirForm, etc.

We will likely need to extend the emitter slightly. On Arm64, there is an optional insOpts parameter passed in (which defaults to INS_OPTS_NONE). I imagine we may need a similar thing to fully support EVEX such that we can track embedded broadcast, embedded rounding control, SAE, or any other features in the future.
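
A minimal sketch of the "optional options parameter" idea, assuming a flags enum that defaults to "none" so existing call sites stay unchanged. All names here (`EvexOptsSketch`, `InstrDescSketch`, `EmitSketch_R_R_C`) are hypothetical and are not the emitter's real API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical option bits; illustrative only, not the JIT's actual insOpts values.
enum EvexOptsSketch : std::uint8_t
{
    EVEX_OPTS_NONE      = 0,
    EVEX_OPTS_BROADCAST = 1 << 0, // request embedded broadcast for the memory operand
};

// Hypothetical, simplified stand-in for an instruction descriptor.
struct InstrDescSketch
{
    int  targetReg;
    int  op1Reg;
    int  constantOffset;
    bool evexBContext; // records that the EVEX.b context was requested
};

// The options argument is defaulted, so callers that do not care about EVEX
// features can keep calling the function with the original three arguments.
InstrDescSketch EmitSketch_R_R_C(int            targetReg,
                                 int            op1Reg,
                                 int            constantOffset,
                                 EvexOptsSketch opts = EVEX_OPTS_NONE)
{
    assert(targetReg >= 0 && op1Reg >= 0);
    return InstrDescSketch{targetReg, op1Reg, constantOffset, (opts & EVEX_OPTS_BROADCAST) != 0};
}
```

The same enum could later grow bits for embedded rounding control or SAE without touching call sites that pass nothing.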

tannergooding · 2023-04-21T19:24:55Z

src/coreclr/jit/emit.h

@@ -780,6 +780,9 @@ class emitter
        unsigned _idCallRegPtr : 1; // IL indirect calls: addr in reg
        unsigned _idCallAddr : 1;   // IL indirect calls: can make a direct call to iiaAddr
        unsigned _idNoGC : 1;       // Some helpers don't get recorded in GC tables
+#if defined(TARGET_XARCH)
+        unsigned _idEmbBroadcast : 1;


We'll ultimately need space to encode:

EVEX.aaa: Embedded opmask register specifier

EVEX.Z: Zeroing/Merging

EVEX.b: Broadcast/RC/SAE Context

EVEX.L'L: Vector length/RC

This is 7 additional bits, so it might be good to make this the more general EVEX.b bit; that way it is usable for Broadcast, RC, and SAE. Maybe _idEvexBContext as a name?

--
There are also a couple comments higher up that need to be fixed as well, discussing the number of bits being used in each area, etc.
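
The bit budget listed in this comment (3 + 1 + 1 + 2 = 7) can be written down as C++ bit-fields. This is a sketch only: `EvexBitsSketch` and `PackEvexBits` are hypothetical, and the packing order used here is arbitrary rather than the hardware EVEX prefix layout or the real instrDesc layout.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical grouping of the EVEX-related state a descriptor would need to carry.
struct EvexBitsSketch
{
    std::uint8_t aaa      : 3; // EVEX.aaa: embedded opmask register specifier
    std::uint8_t z        : 1; // EVEX.z: zeroing/merging
    std::uint8_t bContext : 1; // EVEX.b: broadcast/RC/SAE context
    std::uint8_t vecLen   : 2; // EVEX.L'L: vector length/RC
};

constexpr unsigned kEvexBitCount = 3 + 1 + 1 + 2;
static_assert(kEvexBitCount == 7, "the four fields need seven bits in total");

// Packs the fields into the low seven bits of a byte (illustrative order only).
std::uint8_t PackEvexBits(const EvexBitsSketch& bits)
{
    return static_cast<std::uint8_t>(bits.aaa | (bits.z << 3) | (bits.bContext << 4) | (bits.vecLen << 5));
}
```

With only the single `bContext` bit set, the packed value is 0x10; all four fields at their maximum give 0x7F, confirming that seven bits are enough.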

Ruihan-Yin · 2023-04-21T21:19:14Z

Thanks for the feedback! We will adjust the design based on the suggestions.

Ruihan-Yin · 2023-04-25T20:30:13Z

Rebased the branch and resolved the conflicts.

1. Deleted irrelevant comments. Moved the containment check up to cover more cases.

Ruihan-Yin · 2023-05-24T19:50:24Z

Also, can you run the TP regression locally to find out what is causing the regression?
diff_analysis.txt

From the results, TakeEvexPrefix is the major cause of the regression. Does this indicate that the increase in the size of instrDesc could be the issue?

Ruihan-Yin · 2023-05-24T20:17:50Z

diff_analysis2.txt

More samples from different benchmarks.

Ruihan-Yin · 2023-05-24T20:32:31Z

Can you please confirm if the windows/x86 failures are from your change?

The same test that failed in the last run passed in this run, so the failure looks like a random CI failure.

Ruihan-Yin · 2023-05-31T16:28:54Z

Hi @kunalspathak @tannergooding, based on the regression study attached above, do we have further actions in this PR?

kunalspathak · 2023-05-31T16:32:55Z

Hi @kunalspathak @tannergooding, based on the regression study attached above, do we have further actions in this PR?

I will get back to you today. Sorry, I forgot about this.

kunalspathak · 2023-05-31T18:11:02Z

Also, can you run the TP regression locally to find out what is causing the regression?
diff_analysis.txt

From the results, `TakeEvexPrefix` is the major cause of the regression. Does this indicate that the increase in the size of `instrDesc` could be the issue?

Is this for the MinOpts benchmarks collection? Because that's the one that is showing a 0.14% TP regression.

kunalspathak · 2023-05-31T18:30:29Z

Seems this is coming from the update to HasEmbeddedBroadcast(), which was returning false but now calls id->idIsEvexbContext().

kunalspathak

A few comments. @tannergooding - could you double check the asserts in instr.cpp? I have limited knowledge of that area. Other than that, looks good.

kunalspathak · 2023-05-31T22:20:21Z

src/coreclr/jit/gentree.cpp

+        case NI_AVX_BroadcastScalarToVector128:
+        case NI_AVX_BroadcastScalarToVector256:
+        case NI_SSE3_MoveAndDuplicate:
+        case NI_AVX512F_BroadcastScalarToVector512:


Did you open the issue for the follow-up work?

Ruihan-Yin · 2023-05-31T22:56:58Z

Is this for the MinOpts benchmarks collection? Because that's the one that is showing a 0.14% TP regression.

Thanks for pointing that out, @kunalspathak. From the numbers, I believe the result covers all context collections. I am not sure how to collect the MinOpts numbers alone; will setting some environment variables work?

1. Updated comments to keep up with the changes in instrDesc. 2. Removed an unneeded argument from the irrelevant method.

Ruihan-Yin · 2023-05-31T23:33:13Z

Seems this is coming from the update to HasEmbeddedBroadcast(), which was returning false but now calls id->idIsEvexbContext().

The conclusion makes sense to me.
I suppose this might be inevitable with the existing code design? Do we need any follow-up issue on this?

tannergooding · 2023-06-01T00:16:35Z

Do we need any follow-up issue on this?

An issue explicitly tracking us splitting out the behavior into a new instrDesc (as Bruce suggested above) would be ideal. That may allow us to mitigate the cost and general impact to non EVEX instructions

Ruihan-Yin · 2023-06-01T17:05:27Z

Do we need any follow-up issue on this?

An issue explicitly tracking us splitting out the behavior into a new instrDesc (as Bruce suggested above) would be ideal. That may allow us to mitigate the cost and general impact to non EVEX instructions

Thanks for the suggestion. This issue tracks that part of the work.

Ruihan-Yin · 2023-06-02T02:33:32Z

Hi @tannergooding @kunalspathak, apart from the few questions I left above, I believe the work in this PR can be considered complete. Are we expecting additional reviews or work in this PR before merging?

tannergooding · 2023-06-02T02:46:11Z

Just need to finish doing my review pass on this. I expect to finish it early tomorrow and will hopefully merge when that completes.

Ruihan-Yin · 2023-06-02T15:26:14Z

Thanks so much for all the help and suggestions!

ghost added the community-contribution label (indicates that the PR has been added by a community member) Apr 14, 2023

dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Apr 14, 2023

BruceForstall added the avx512 label (related to the AVX-512 architecture) Apr 14, 2023

build-analysis bot mentioned this pull request Apr 20, 2023

Tracking issue for CI build timeouts #76454

Closed

Ruihan-Yin force-pushed the EmbBroadcastEnabling branch from 9a2af6a to b66b77a Compare April 25, 2023 20:29