Enable EVEX feature: embedded broadcast for Vector128/256/512.Add() in limited cases #84821

Merged: 44 commits, Jun 2, 2023

@Ruihan-Yin Ruihan-Yin commented Apr 14, 2023

Description:

Enables the EVEX embedded broadcast feature and uses it to optimize some use cases of Vector256.Add() when the vector base type is TYP_FLOAT.

The enabling work involves the following changes:

  1. A new GenTree flag, GTF_SIMD_ADD_EB, to track whether an NI_AVX(2)_Add node requires embedded broadcast.
  2. A new GenTree flag, GTF_VECCON_FROMSCALAR, to track whether a constant vector was created from a single scalar.
  3. An extra bit on the instruction descriptor, _idEmbBroadcast, to propagate the embedded broadcast flag after lowering so that the feature can still be tracked when emitting addps for the hardware intrinsic.
  4. A new instruction flag, INS_Flags_EmbeddedBroadcastSupported, to track whether an intrinsic is embedded broadcast compatible.
  5. Changes to the logic that sets containment for the Add node during lowering.
  6. Additional checks and special handling when lowering the NI_Vector256_Create node.
  7. Changes to the argument lists of these emit methods: emitIns_SIMD_R_R_C, emitIns_R_R_C, emitIns_SIMD_R_R_S, emitIns_R_R_S.

Covered cases:

Embedded broadcast is enabled for Vector256.Add() in a limited set of cases:

//case 1
Vector256.Add(Vec, Vector256.Create(FloatConst));

//case 2
Vector256<float> VecCns = Vector256.Create(FloatConst); 
Vector256.Add(Vec, VecCns);

//case 3
Vector256.Add(Vec, Vector256.Create(LCL_VAR));

//case 4
Vector256<float> VecCns = Vector256.Create(LCL_VAR); 
Vector256.Add(Vec, VecCns);

Note: Cases 2 and 4 can only be optimized when DOTNET_TieredCompilation=0.

For cases 1 and 2, embedded broadcast reduces the memory reference size from 256 bits to 32 bits:

Original assembly: 
vaddps ymm, ymm, m256
Optimized assembly: 
vaddps ymm, ymm, m32 {1to8}

For cases 3 and 4, embedded broadcast reduces the number of instructions and the memory reference size:

Original assembly: 
vbroadcastss ymm, m32
vaddps ymm, ymm, ymm
Optimized assembly: 
vaddps ymm, ymm, m32 {1to8}
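
To make the size difference in the listings above concrete, here is a minimal C++ sketch (not JIT code) that computes the N in the `{1toN}` decorator and the bytes of constant data saved when a full-width vector constant is replaced by a single broadcast scalar. The helper names `BroadcastFactor` and `BytesSaved` are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: how many lanes one scalar is replicated into,
// i.e. the N in the {1toN} decorator.
constexpr std::size_t BroadcastFactor(std::size_t vectorBytes, std::size_t elementBytes)
{
    return vectorBytes / elementBytes;
}

// Hypothetical helper: bytes of constant data saved by storing a single
// scalar instead of the full vector.
constexpr std::size_t BytesSaved(std::size_t vectorBytes, std::size_t elementBytes)
{
    return vectorBytes - elementBytes;
}

// Vector256<float>: 32-byte vector, 4-byte element.
static_assert(BroadcastFactor(32, 4) == 8, "m32 {1to8}");
static_assert(BytesSaved(32, 4) == 28, "m256 shrinks to m32");
```

For the Vector256<float> cases in this PR, that is the `m256` to `m32 {1to8}` change shown in the assembly; the same arithmetic would apply to other widths and element sizes if coverage is extended later.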

Future work:

This commit is intended to showcase the optimization that the embedded broadcast feature provides. We will extend coverage to more data types (Int, Double, etc.) and operators (bitwise AND, OR, etc.) to complete this PR.

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Apr 14, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 14, 2023
@ghost

ghost commented Apr 14, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

@BruceForstall BruceForstall added the avx512 Related to the AVX-512 architecture label Apr 14, 2023
@BruceForstall
Member

cc @dotnet/avx512-contrib

@tannergooding
Member

The enabling work involves the following changes:

  1. a new Gentree flag: GTF_SIMD_ADD_EB, to track if a NI_AVX(2)_Add node requires embedded broadcast or not.
  2. a new Gentree flag: GTF_VECCON_FROMSCALAR, to track if a constant vector is created from a single scalar.
  3. an extra bit on instruction descriptor: _idEmbBroadcast, to propagate the embedded broadcast flag after lowering, continue to track this feature when interacting with hardware intrinsic addps.
  4. a new instruction flag: INS_Flags_EmbeddedBroadcastSupported: to track if an intrinsic is embedded broadcast compatible.
  5. modification on the logics when setting containment for Add node when lowering.
  6. more checking logics and special lowering when lowering the NI_Vector256_Create node.
  7. arguments list for these emit methods: emitIns_SIMD_R_R_C, emitIns_R_R_C, emitIns_SIMD_R_R_S, emitIns_R_R_S

I'm not sure 1 is the right approach; most of the other points look about like I'd expect.

Embedded broadcast is supported for most EVEX instructions, not just add, so we want to ensure names aren't specific to any individual instruction.

Embedded broadcast requires the data to be coming "from memory". So, we're basically just optimizing the amount of memory used for any GenTreeVecCon nodes by allowing us to emit a 4/8-byte constant rather than a 16/32/64-byte constant and also allowing containment of Broadcast intrinsics...

So, I'd expect us to basically just continue containing GenTreeVecCon as is, and simply extend lowering to also support containing relevant NI_*_Broadcast nodes based on the EmbeddedBroadcastSupported flag.

I'd then expect us to update genOperandDesc to have handling for contained Broadcast nodes and to optimize the GenTreeVecCon case when all elements are equal. -- Tracking some "this was created from a scalar" can help, but I don't know if it buys much as we'll still need to check for other broadcast patterns "somewhere" in the list. It'd basically just be an optimization if we needed to check that scenario more than once, and I'm not sure if we do.

We'd then flow the OperandDesc info, which includes a flag on "embedded broadcast", down into emitIns_R_*_C, R_*_S, and R_*_A calls (the last is important for containing NI_*_Broadcast nodes) so the instrDesc can also track the information somehow.
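
The "all elements are equal" check on a constant vector mentioned above can be sketched in isolation. This is a hedged, standalone illustration operating on a raw byte buffer, not on GenTreeVecCon; the function name `IsBroadcastConstant` and its parameters are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns true when every elementBytes-sized lane of the
// constant is bitwise identical to the first lane, so a single scalar plus an
// embedded broadcast could stand in for the whole vector.
inline bool IsBroadcastConstant(const std::uint8_t* data, std::size_t vectorBytes, std::size_t elementBytes)
{
    if (elementBytes == 0 || vectorBytes % elementBytes != 0)
    {
        return false;
    }

    for (std::size_t offset = elementBytes; offset < vectorBytes; offset += elementBytes)
    {
        if (std::memcmp(data, data + offset, elementBytes) != 0)
        {
            return false;
        }
    }
    return true;
}
```

The comparison is bitwise on purpose: for floating-point lanes, +0.0 and -0.0 compare equal numerically but are different bit patterns, and only identical bit patterns can be replaced by one broadcast scalar.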

@Ruihan-Yin
Contributor Author

Changes based on the reviews:

  1. genOperandDesc() now has logic to check whether an operand is a contained broadcast node. This logic can be complicated, since we are actually looking for the pattern: embedded broadcast compatible intrinsic -> broadcast -> createScalar -> local variable.
    So I introduced a flag on the broadcast node, GTF_BROADCAST_EMBEDDED, to avoid repeating the check in lowering, containment checks, and emit. This flag can be applied to multiple broadcast intrinsics to indicate that the broadcast node is part of an embedded broadcast and needs to be contained.
  2. For the GenTreeVecCon node, if we keep it as is in the earlier phases and only emit a scalar during codegen, we face the following difficulty: it is hard to get the base type of the vector, and therefore hard to create the corresponding scalar operand descriptor. In the implementation, I simply used the instruction being emitted to get the base type, but I am not sure what the best way to handle this problem is at this phase.
  3. Converted GTF_SIMD_ADD_EB to an intrinsic flag, HW_Flag_EmbBroadcastCompatible, which only indicates whether the intrinsic is embedded broadcast compatible; whether embedded broadcast is actually enabled is checked during emit.

@Ruihan-Yin
Contributor Author

@tannergooding Hi, please check the changes to genOperandDesc() and inst_RV_RV_TT().

The broadcast node can be handled as expected, but GenTreeVecCon is trickier: given the information we have within inst_RV_RV_TT(), say the instruction and the constant vector node, the base type of the vector is unknown, so we may only be able to get the base type from the instruction. I am not sure how we can handle this at this point.

As for the last part of the previous review, I am a bit unclear about how a flag on the OperandDesc can be passed into the mentioned emit calls, since those calls only take part of the information from OperandDesc, say Field Handle, VarNum, IndirForm, etc.

@tannergooding
Member

given the information we have within inst_RV_RV_TT(), say the instruction and the constant vector node, the base type of the vector is unknown, so we may only be able to get the base type from the instruction. I am not sure how we can handle this at this point.

I imagine we'll need to pass down the simdBaseType so the right decision can be made when the constant is being emitted.

I am a bit unclear about how a flag on the OperandDesc can be passed into the mentioned emit calls, since those calls only take part of the information from OperandDesc, say Field Handle, VarNum, IndirForm, etc.

We will likely need to extend the emitter slightly. On Arm64, there is an optional insOpts parameter passed in (which defaults to INS_OPTS_NONE). I imagine we may need a similar thing to fully support EVEX such that we can track embedded broadcast, embedded rounding control, SAE, or any other features in the future.
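
The idea of an optional options argument that defaults to "none" can be sketched as follows. This is only an illustration of the calling pattern; `SketchInsOpts` and `SketchEmitAddps` are hypothetical names invented here, and the function returns display text rather than encoding anything.

```cpp
#include <cassert>
#include <string>

// Hypothetical option set, loosely modeled on the idea of an optional
// insOpts-style argument. Not the emitter's actual type.
enum class SketchInsOpts
{
    None,
    EmbeddedBroadcast,
};

// Hypothetical emit helper: callers that do not care about EVEX features omit
// the last argument and keep the existing behavior.
inline std::string SketchEmitAddps(const std::string& memOperand, SketchInsOpts opts = SketchInsOpts::None)
{
    std::string text = "vaddps ymm, ymm, " + memOperand;
    if (opts == SketchInsOpts::EmbeddedBroadcast)
    {
        text += " {1to8}";
    }
    return text;
}
```

Because the parameter is defaulted, existing call sites would compile unchanged, and only the paths that know about embedded broadcast would need to pass a non-default value.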

@@ -780,6 +780,9 @@ class emitter
unsigned _idCallRegPtr : 1; // IL indirect calls: addr in reg
unsigned _idCallAddr : 1; // IL indirect calls: can make a direct call to iiaAddr
unsigned _idNoGC : 1; // Some helpers don't get recorded in GC tables
#if defined(TARGET_XARCH)
unsigned _idEmbBroadcast : 1;

We'll ultimately need space to encode:

  • EVEX.aaa: Embedded opmask register specifier
  • EVEX.Z: Zeroing/Merging
  • EVEX.b: Broadcast/RC/SAE Context
  • EVEX.L'L: Vector length/RC

This is 7 additional bits, so it might be good to make this be the more general EVEX.b bit that way it is usable for Broadcast, RC, and SAE. Maybe _idEvexBContext as a name?

--
There are also a couple comments higher up that need to be fixed as well, discussing the number of bits being used in each area, etc.
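
The bit budget listed in this comment can be written out as a C++ bit-field layout. This is a sketch under stated assumptions: the struct name `SketchEvexBits` and its field names are hypothetical and are not the fields in the emitter's instrDesc.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout for the EVEX-specific state listed above.
struct SketchEvexBits
{
    std::uint8_t aaa : 3; // EVEX.aaa: embedded opmask register specifier
    std::uint8_t z   : 1; // EVEX.z: zeroing/merging
    std::uint8_t b   : 1; // EVEX.b: broadcast/RC/SAE context
    std::uint8_t ll  : 2; // EVEX.L'L: vector length/RC
};

// 3 + 1 + 1 + 2 = 7 additional bits in total.
constexpr int kSketchEvexBitCount = 3 + 1 + 1 + 2;
static_assert(kSketchEvexBitCount == 7, "seven EVEX bits");
```

Naming the single bit after EVEX.b rather than after broadcast specifically is what lets one field serve broadcast, rounding control, and SAE, which is the point of the suggested rename.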

@Ruihan-Yin
Contributor Author

Thanks for the feedback! We will adjust the design based on the suggestion.

@Ruihan-Yin
Contributor Author

Ruihan-Yin commented Apr 25, 2023

Rebased the branch and resolved the conflicts.

1. deleted irrelevant comments.

Move the contain check up to cover more cases.
@Ruihan-Yin
Contributor Author

Ruihan-Yin commented May 24, 2023

Also, can you run the TP regression locally to find out what is causing the regression?
diff_analysis.txt

From the results, TakeEvexPrefix is the major cause of the regression. Does this indicate that the increase in the size of instrDesc could be the issue?

@Ruihan-Yin
Contributor Author

diff_analysis2.txt

More samples from different benchmarks.

@Ruihan-Yin
Contributor Author

Ruihan-Yin commented May 24, 2023

Can you please confirm if the windows/x86 failures are from your change?

The test that failed in the last run passed in this run, so the failure looks like a random CI failure.

@Ruihan-Yin
Contributor Author

Hi @kunalspathak @tannergooding, based on the regression study attached above, are there further actions needed in this PR?

@kunalspathak
Member

Hi @kunalspathak @tannergooding, based on the regression study attached above, are there further actions needed in this PR?

I will get back to you today. Sorry, I forgot about this.

@kunalspathak
Member

Also, can you run the TP regression locally to find out what is causing the regression?
diff_analysis.txt

From the results, TakeEvexPrefix is the major cause of the regression. Does this indicate that the increase in the size of instrDesc could be the issue?

Is this for the MinOpts benchmarks collection? That's the one showing the 0.14% TP regression.

@kunalspathak
Member

It seems this is coming from the update to HasEmbeddedBroadcast(), which previously returned false but now calls id->idIsEvexbContext().

Member

@kunalspathak kunalspathak left a comment


A few comments. @tannergooding - could you double check the asserts in instr.cpp? I have limited knowledge of that area. Other than that, it looks good.

src/coreclr/jit/emit.h (resolved)
src/coreclr/jit/emitxarch.cpp (outdated, resolved)
case NI_AVX_BroadcastScalarToVector128:
case NI_AVX_BroadcastScalarToVector256:
case NI_SSE3_MoveAndDuplicate:
case NI_AVX512F_BroadcastScalarToVector512:

Did you open the issue for the follow-up work?

@Ruihan-Yin
Contributor Author

Ruihan-Yin commented May 31, 2023

Is this for the MinOpts benchmarks collection? That's the one showing the 0.14% TP regression.

Thanks for pointing that out, @kunalspathak. Judging from the numbers, I believe the results are from the all-contexts collections. I am not sure how to collect the MinOpts context numbers alone; will setting some env vars work?

1. Updated the comment to keep up with the changes in InstrDesc.
2. Removed the unneeded argument in the unrelated method.
@Ruihan-Yin
Contributor Author

Ruihan-Yin commented May 31, 2023

It seems this is coming from the update to HasEmbeddedBroadcast(), which previously returned false but now calls id->idIsEvexbContext().

The conclusion makes sense to me.
I suppose this might be inevitable with the existing code design? Do we need any follow-up issue on this?

@tannergooding
Member

tannergooding commented Jun 1, 2023

Do we need any follow-up issue on this?

An issue explicitly tracking us splitting out the behavior into a new instrDesc (as Bruce suggested above) would be ideal. That may allow us to mitigate the cost and general impact to non EVEX instructions

@Ruihan-Yin
Contributor Author

Do we need any follow-up issue on this?

An issue explicitly tracking us splitting out the behavior into a new instrDesc (as Bruce suggested above) would be ideal. That may allow us to mitigate the cost and general impact to non EVEX instructions

Thanks for the suggestion. This issue tracks that part of the work.

@Ruihan-Yin
Contributor Author

Hi @tannergooding @kunalspathak, apart from the few questions I left above, I think the work in this PR can be considered complete. Or are we expecting additional reviews or work in this PR before merging?

@tannergooding
Member

Just need to finish my review pass on this. I expect to finish it early tomorrow and will hopefully merge when that completes.

@tannergooding tannergooding merged commit 1e029d0 into dotnet:main Jun 2, 2023
@Ruihan-Yin
Contributor Author

Thanks so much for all the help and suggestions!

@Ruihan-Yin Ruihan-Yin deleted the EmbBroadcastEnabling branch June 5, 2023 18:55
@ghost ghost locked as resolved and limited conversation to collaborators Jul 5, 2023