[DAGComb] Do not turn insert_elt into shuffle for single elt vectors. #1287

fhahn · 2020-05-29T14:56:48Z

Currently combineInsertEltToShuffle turns insert_vector_elt into a
vector_shuffle, even if the inserted element is a vector with a single
element. In this case, it is unlikely that the additional shuffle
would be more efficient than an insert_vector_elt.

Additionally, this fixes an infinite cycle in DAGCombine, where
combineInsertEltToShuffle turns an insert_vector_elt into a shuffle,
which gets turned back into an insert_vector_elt/extract_vector_elt by
a custom AArch64 lowering (in visitVECTOR_SHUFFLE).
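
To make the cycle concrete, here is a minimal toy model in plain C++. It does not use LLVM's SelectionDAG API: `Node`, `Kind`, and all three functions are hypothetical stand-ins, and the target rewrite is assumed to reverse the shuffle unconditionally, which is simpler than the real AArch64 lowering. It only illustrates why a guard on single-element sources lets the rewrite loop reach a fixed point.

```cpp
#include <cassert>

// Hypothetical node kinds for this sketch; not LLVM's ISD opcodes.
enum class Kind { InsertElt, Shuffle };

struct Node {
  Kind kind;
  unsigned insertedNumElts; // element count of the vector the inserted value comes from
};

// Models the generic combine (insert -> shuffle). With the guard enabled,
// it leaves inserts from single-element vectors alone.
bool combineInsertToShuffle(Node &n, bool guardSingleElt) {
  if (n.kind != Kind::InsertElt)
    return false;
  if (guardSingleElt && n.insertedNumElts == 1)
    return false;
  n.kind = Kind::Shuffle;
  return true;
}

// Models a target rewrite (shuffle -> insert/extract). Assumption: in this
// toy it always fires; the real lowering is more selective.
bool targetShuffleToInsert(Node &n) {
  if (n.kind != Kind::Shuffle)
    return false;
  n.kind = Kind::InsertElt;
  return true;
}

// Applies both rewrites until nothing changes. Returns the number of
// rounds that changed the node, or -1 if maxSteps rounds all made changes.
int runToFixedPoint(Node n, bool guardSingleElt, int maxSteps = 100) {
  for (int steps = 0; steps < maxSteps; ++steps) {
    bool changed = combineInsertToShuffle(n, guardSingleElt);
    changed |= targetShuffleToInsert(n);
    if (!changed)
      return steps;
  }
  return -1;
}
```

In this sketch, `runToFixedPoint(Node{Kind::InsertElt, 1}, false)` never settles and hits the step limit, while the same node with the guard enabled is left untouched and the loop stops immediately.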

Such insert_vector_elt and extract_vector_elt combinations can be
lowered efficiently using mov on AArch64.

There are 2 test changes in arm64-neon-copy.ll: we now use one or two
mov instructions instead of a single zip1. The second mov in ins1f2 is
needed to move the value into the result register, which I think is not
really related to the DAGCombine fold. In any case, on most uarchs, mov
should be cheaper than zip1. On a Cortex-A75, for example, zip1 is twice
as expensive as mov
(https://developer.arm.com/docs/101398/latest/arm-cortex-a75-software-optimization-guide-v20).

Reviewers: spatel, efriedma, dmgreen, RKSimon

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D80710

(Cherry-picked from d20a3d3)

fhahn · 2020-05-29T14:56:57Z

@swift-ci please test

Gerolf-Apple

Also passed open source review. LGTM.

Gerolf-Apple approved these changes Jun 2, 2020

fhahn merged commit ea3e220 into swiftlang:apple/stable/20200108 Jun 2, 2020

fhahn deleted the dagcombine-cycle branch June 24, 2020 12:59
