Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CostModel][X86] Add initial costs for non-lane-crossing one/two input shuffles #114680

Merged
merged 3 commits into from
Nov 4, 2024

Conversation

RKSimon
Copy link
Collaborator

@RKSimon RKSimon commented Nov 2, 2024

Most of the x86 shuffle instructions operate within each 128-bit subvector lane, but our shuffle costs struggle to handle this and have to fallback to worst case shuffles that reference elements from any lane.

This patch detects shuffle masks that we know are "inlane" and enable us to assume a cheaper shuffle cost.

@llvmbot
Copy link
Member

llvmbot commented Nov 2, 2024

@llvm/pr-subscribers-backend-x86

@llvm/pr-subscribers-llvm-analysis

Author: Simon Pilgrim (RKSimon)

Changes

Most of the x86 shuffle instructions operate within each 128-bit subvector lane, but our shuffle costs struggle to handle this and have to fallback to worst case shuffles that reference elements from any lane.

This patch detects shuffle masks that we know are "inlane" and enable us to assume a cheaper shuffle cost.


Patch is 216.63 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/114680.diff

22 Files Affected:

  • (modified) llvm/lib/Target/X86/X86TargetTransformInfo.cpp (+64)
  • (modified) llvm/test/Analysis/CostModel/X86/reduction.ll (+16-16)
  • (modified) llvm/test/Analysis/CostModel/X86/shuffle-extract_subvector-codesize.ll (+2-2)
  • (modified) llvm/test/Analysis/CostModel/X86/shuffle-extract_subvector-latency.ll (+2-2)
  • (modified) llvm/test/Analysis/CostModel/X86/shuffle-extract_subvector-sizelatency.ll (+2-2)
  • (modified) llvm/test/Analysis/CostModel/X86/shuffle-extract_subvector.ll (+2-2)
  • (modified) llvm/test/Analysis/CostModel/X86/shuffle-insert_subvector-codesize.ll (+16-16)
  • (modified) llvm/test/Analysis/CostModel/X86/shuffle-insert_subvector-latency.ll (+16-16)
  • (modified) llvm/test/Analysis/CostModel/X86/shuffle-insert_subvector-sizelatency.ll (+16-16)
  • (modified) llvm/test/Analysis/CostModel/X86/shuffle-insert_subvector.ll (+16-16)
  • (modified) llvm/test/Analysis/CostModel/X86/shuffle-non-pow-2-codesize.ll (+1-1)
  • (modified) llvm/test/Analysis/CostModel/X86/shuffle-non-pow-2-latency.ll (+1-1)
  • (modified) llvm/test/Analysis/CostModel/X86/shuffle-non-pow-2-sizelatency.ll (+1-1)
  • (modified) llvm/test/Analysis/CostModel/X86/shuffle-non-pow-2.ll (+1-1)
  • (modified) llvm/test/Analysis/CostModel/X86/shuffle-transpose-codesize.ll (+13-13)
  • (modified) llvm/test/Analysis/CostModel/X86/shuffle-transpose-latency.ll (+13-13)
  • (modified) llvm/test/Analysis/CostModel/X86/shuffle-transpose-sizelatency.ll (+13-13)
  • (modified) llvm/test/Analysis/CostModel/X86/shuffle-transpose.ll (+13-13)
  • (modified) llvm/test/Transforms/PhaseOrdering/X86/horiz-math-inseltpoison.ll (+4-3)
  • (modified) llvm/test/Transforms/PhaseOrdering/X86/horiz-math.ll (+4-3)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/alternate-calls-inseltpoison.ll (+4-6)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/alternate-calls.ll (+4-6)
diff --git a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
index 520284d1d7a488..a1374012f4f57d 100644
--- a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
+++ b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
@@ -1766,6 +1766,21 @@ InstructionCost X86TTIImpl::getShuffleCost(
     LT.first = NumOfDests * NumOfShufflesPerDest;
   }
 
+  // Attempt to detect a cheaper inlane shuffle, avoiding subvector permutation.
+  bool IsInLaneShuffle = false;
+  if ((Kind == TTI::SK_PermuteSingleSrc || Kind == TTI::SK_PermuteTwoSrc) &&
+      LT.first == 1 && LT.second.isVector() &&
+      LT.second.getVectorNumElements() == Mask.size() &&
+      (LT.second.is256BitVector() || LT.second.is512BitVector())) {
+    IsInLaneShuffle = true;
+    unsigned NumLanes = LT.second.getSizeInBits() / 128;
+    unsigned NumEltPerLane = Mask.size() / NumLanes;
+    for (auto [I, M] : enumerate(Mask))
+      if (0 <= M)
+        IsInLaneShuffle &=
+            ((M % Mask.size()) / NumEltPerLane) == (I / NumEltPerLane);
+  }
+
   static const CostTblEntry AVX512VBMIShuffleTbl[] = {
       {TTI::SK_Reverse, MVT::v64i8, 1}, // vpermb
       {TTI::SK_Reverse, MVT::v32i8, 1}, // vpermb
@@ -1897,6 +1912,25 @@ InstructionCost X86TTIImpl::getShuffleCost(
       if (auto KindCost = Entry->Cost[CostKind])
         return LT.first * *KindCost;
 
+  static const CostTblEntry AVX2InLaneShuffleTbl[] = {
+      {TTI::SK_PermuteSingleSrc, MVT::v16i16, 1}, // vpshufb
+      {TTI::SK_PermuteSingleSrc, MVT::v16f16, 1}, // vpshufb
+      {TTI::SK_PermuteSingleSrc, MVT::v32i8,  1}, // vpshufb
+
+      {TTI::SK_PermuteTwoSrc, MVT::v4f64,     2}, // 2*vshufpd + vblendpd
+      {TTI::SK_PermuteTwoSrc, MVT::v8f32,     2}, // 2*vshufps + vblendps
+      {TTI::SK_PermuteTwoSrc, MVT::v4i64,     2}, // 2*vpshufd + vpblendd
+      {TTI::SK_PermuteTwoSrc, MVT::v8i32,     2}, // 2*vpshufd + vpblendd
+      {TTI::SK_PermuteTwoSrc, MVT::v16i16,    2}, // 2*vpshufb + vpor
+      {TTI::SK_PermuteTwoSrc, MVT::v16f16,    2}, // 2*vpshufb + vpor
+      {TTI::SK_PermuteTwoSrc, MVT::v32i8,     2}, // 2*vpshufb + vpor
+  };
+
+  if (IsInLaneShuffle && ST->hasAVX2())
+    if (const auto *Entry =
+            CostTableLookup(AVX2InLaneShuffleTbl, Kind, LT.second))
+      return LT.first * Entry->Cost;
+
   static const CostTblEntry AVX2ShuffleTbl[] = {
       {TTI::SK_Broadcast, MVT::v4f64, 1},  // vbroadcastpd
       {TTI::SK_Broadcast, MVT::v8f32, 1},  // vbroadcastps
@@ -1973,6 +2007,36 @@ InstructionCost X86TTIImpl::getShuffleCost(
     if (const auto *Entry = CostTableLookup(XOPShuffleTbl, Kind, LT.second))
       return LT.first * Entry->Cost;
 
+  static const CostTblEntry AVX1InLaneShuffleTbl[] = {
+      {TTI::SK_PermuteSingleSrc, MVT::v4f64,  1},  // vpermilpd
+      {TTI::SK_PermuteSingleSrc, MVT::v4i64,  1},  // vpermilpd
+      {TTI::SK_PermuteSingleSrc, MVT::v8f32,  1},  // vpermilps
+      {TTI::SK_PermuteSingleSrc, MVT::v8i32,  1},  // vpermilps
+
+      {TTI::SK_PermuteSingleSrc, MVT::v16i16, 4}, // vextractf128 + 2*pshufb
+                                                  // + vpor + vinsertf128
+      {TTI::SK_PermuteSingleSrc, MVT::v16f16, 4}, // vextractf128 + 2*pshufb
+                                                  // + vpor + vinsertf128
+      {TTI::SK_PermuteSingleSrc, MVT::v32i8,  4}, // vextractf128 + 2*pshufb
+                                                  // + vpor + vinsertf128
+
+      {TTI::SK_PermuteTwoSrc, MVT::v4f64,     2}, // 2*vshufpd + vblendpd
+      {TTI::SK_PermuteTwoSrc, MVT::v8f32,     2}, // 2*vshufps + vblendps
+      {TTI::SK_PermuteTwoSrc, MVT::v4i64,     2}, // 2*vpermilpd + vblendpd
+      {TTI::SK_PermuteTwoSrc, MVT::v8i32,     2}, // 2*vpermilps + vblendps
+      {TTI::SK_PermuteTwoSrc, MVT::v16i16,    9}, // 2*vextractf128 + 4*pshufb
+                                                  // + 2*vpor + vinsertf128
+      {TTI::SK_PermuteTwoSrc, MVT::v16f16,    9}, // 2*vextractf128 + 4*pshufb
+                                                  // + 2*vpor + vinsertf128
+      {TTI::SK_PermuteTwoSrc, MVT::v32i8,     9}, // 2*vextractf128 + 4*pshufb
+                                                  // + 2*vpor + vinsertf128
+  };
+
+  if (IsInLaneShuffle && ST->hasAVX())
+    if (const auto *Entry =
+            CostTableLookup(AVX1InLaneShuffleTbl, Kind, LT.second))
+      return LT.first * Entry->Cost;
+
   static const CostTblEntry AVX1ShuffleTbl[] = {
       {TTI::SK_Broadcast, MVT::v4f64, 2},  // vperm2f128 + vpermilpd
       {TTI::SK_Broadcast, MVT::v8f32, 2},  // vperm2f128 + vpermilps
diff --git a/llvm/test/Analysis/CostModel/X86/reduction.ll b/llvm/test/Analysis/CostModel/X86/reduction.ll
index 9fe127a28320a1..e4686f50d22bcf 100644
--- a/llvm/test/Analysis/CostModel/X86/reduction.ll
+++ b/llvm/test/Analysis/CostModel/X86/reduction.ll
@@ -73,9 +73,9 @@ define fastcc i32 @reduction_cost_int(<8 x i32> %rdx) {
 ; AVX1-LABEL: 'reduction_cost_int'
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %rdx.shuf = shufflevector <8 x i32> %rdx, <8 x i32> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 poison, i32 poison, i32 poison, i32 poison>
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %bin.rdx = add <8 x i32> %rdx, %rdx.shuf
-; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %rdx.shuf.2 = shufflevector <8 x i32> %bin.rdx, <8 x i32> undef, <8 x i32> <i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %rdx.shuf.2 = shufflevector <8 x i32> %bin.rdx, <8 x i32> undef, <8 x i32> <i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %bin.rdx.2 = add <8 x i32> %bin.rdx, %rdx.shuf.2
-; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %rdx.shuf.3 = shufflevector <8 x i32> %bin.rdx.2, <8 x i32> undef, <8 x i32> <i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %rdx.shuf.3 = shufflevector <8 x i32> %bin.rdx.2, <8 x i32> undef, <8 x i32> <i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %bin.rdx.3 = add <8 x i32> %bin.rdx.2, %rdx.shuf.3
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %r = extractelement <8 x i32> %bin.rdx.3, i32 0
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 %r
@@ -442,7 +442,7 @@ define fastcc double @no_pairwise_reduction4double(<4 x double> %rdx, double %f1
 ; AVX1-LABEL: 'no_pairwise_reduction4double'
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %rdx.shuf = shufflevector <4 x double> %rdx, <4 x double> undef, <4 x i32> <i32 2, i32 3, i32 poison, i32 poison>
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %bin.rdx = fadd <4 x double> %rdx, %rdx.shuf
-; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %rdx.shuf7 = shufflevector <4 x double> %bin.rdx, <4 x double> undef, <4 x i32> <i32 1, i32 poison, i32 poison, i32 poison>
+; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %rdx.shuf7 = shufflevector <4 x double> %bin.rdx, <4 x double> undef, <4 x i32> <i32 1, i32 poison, i32 poison, i32 poison>
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %bin.rdx8 = fadd <4 x double> %bin.rdx, %rdx.shuf7
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %r = extractelement <4 x double> %bin.rdx8, i32 0
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret double %r
@@ -506,9 +506,9 @@ define fastcc float @no_pairwise_reduction8float(<8 x float> %rdx, float %f1) {
 ; AVX1-LABEL: 'no_pairwise_reduction8float'
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %rdx.shuf3 = shufflevector <8 x float> %rdx, <8 x float> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 poison, i32 poison, i32 poison, i32 poison>
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %bin.rdx4 = fadd <8 x float> %rdx, %rdx.shuf3
-; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %rdx.shuf = shufflevector <8 x float> %bin.rdx4, <8 x float> undef, <8 x i32> <i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %rdx.shuf = shufflevector <8 x float> %bin.rdx4, <8 x float> undef, <8 x i32> <i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %bin.rdx = fadd <8 x float> %bin.rdx4, %rdx.shuf
-; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %rdx.shuf7 = shufflevector <8 x float> %bin.rdx, <8 x float> undef, <8 x i32> <i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %rdx.shuf7 = shufflevector <8 x float> %bin.rdx, <8 x float> undef, <8 x i32> <i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %bin.rdx8 = fadd <8 x float> %bin.rdx, %rdx.shuf7
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %r = extractelement <8 x float> %bin.rdx8, i32 0
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret float %r
@@ -602,7 +602,7 @@ define fastcc i64 @no_pairwise_reduction4i64(<4 x i64> %rdx, i64 %f1) {
 ; AVX1-LABEL: 'no_pairwise_reduction4i64'
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %rdx.shuf = shufflevector <4 x i64> %rdx, <4 x i64> undef, <4 x i32> <i32 2, i32 3, i32 poison, i32 poison>
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %bin.rdx = add <4 x i64> %rdx, %rdx.shuf
-; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %rdx.shuf7 = shufflevector <4 x i64> %bin.rdx, <4 x i64> undef, <4 x i32> <i32 1, i32 poison, i32 poison, i32 poison>
+; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %rdx.shuf7 = shufflevector <4 x i64> %bin.rdx, <4 x i64> undef, <4 x i32> <i32 1, i32 poison, i32 poison, i32 poison>
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %bin.rdx8 = add <4 x i64> %bin.rdx, %rdx.shuf7
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %r = extractelement <4 x i64> %bin.rdx8, i32 0
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i64 %r
@@ -708,9 +708,9 @@ define fastcc i32 @no_pairwise_reduction8i32(<8 x i32> %rdx, i32 %f1) {
 ; AVX1-LABEL: 'no_pairwise_reduction8i32'
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %rdx.shuf3 = shufflevector <8 x i32> %rdx, <8 x i32> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 poison, i32 poison, i32 poison, i32 poison>
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %bin.rdx4 = add <8 x i32> %rdx, %rdx.shuf3
-; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %rdx.shuf = shufflevector <8 x i32> %bin.rdx4, <8 x i32> undef, <8 x i32> <i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %rdx.shuf = shufflevector <8 x i32> %bin.rdx4, <8 x i32> undef, <8 x i32> <i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %bin.rdx = add <8 x i32> %bin.rdx4, %rdx.shuf
-; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %rdx.shuf7 = shufflevector <8 x i32> %bin.rdx, <8 x i32> undef, <8 x i32> <i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %rdx.shuf7 = shufflevector <8 x i32> %bin.rdx, <8 x i32> undef, <8 x i32> <i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %bin.rdx8 = add <8 x i32> %bin.rdx, %rdx.shuf7
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %r = extractelement <8 x i32> %bin.rdx8, i32 0
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 %r
@@ -888,7 +888,7 @@ define fastcc double @pairwise_reduction4double(<4 x double> %rdx, double %f1) {
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %rdx.shuf.0.1 = shufflevector <4 x double> %rdx, <4 x double> undef, <4 x i32> <i32 1, i32 3, i32 poison, i32 poison>
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %bin.rdx = fadd <4 x double> %rdx.shuf.0.0, %rdx.shuf.0.1
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %rdx.shuf.1.0 = shufflevector <4 x double> %bin.rdx, <4 x double> undef, <4 x i32> <i32 0, i32 poison, i32 poison, i32 poison>
-; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %rdx.shuf.1.1 = shufflevector <4 x double> %bin.rdx, <4 x double> undef, <4 x i32> <i32 1, i32 poison, i32 poison, i32 poison>
+; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %rdx.shuf.1.1 = shufflevector <4 x double> %bin.rdx, <4 x double> undef, <4 x i32> <i32 1, i32 poison, i32 poison, i32 poison>
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %bin.rdx8 = fadd <4 x double> %rdx.shuf.1.0, %rdx.shuf.1.1
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %r = extractelement <4 x double> %bin.rdx8, i32 0
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret double %r
@@ -968,11 +968,11 @@ define fastcc float @pairwise_reduction8float(<8 x float> %rdx, float %f1) {
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %rdx.shuf.0.0 = shufflevector <8 x float> %rdx, <8 x float> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 poison, i32 poison, i32 poison, i32 poison>
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %rdx.shuf.0.1 = shufflevector <8 x float> %rdx, <8 x float> undef, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 poison, i32 poison, i32 poison, i32 poison>
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %bin.rdx = fadd <8 x float> %rdx.shuf.0.0, %rdx.shuf.0.1
-; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %rdx.shuf.1.0 = shufflevector <8 x float> %bin.rdx, <8 x float> undef, <8 x i32> <i32 0, i32 2, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
-; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %rdx.shuf.1.1 = shufflevector <8 x float> %bin.rdx, <8 x float> undef, <8 x i32> <i32 1, i32 3, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %rdx.shuf.1.0 = shufflevector <8 x float> %bin.rdx, <8 x float> undef, <8 x i32> <i32 0, i32 2, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %rdx.shuf.1.1 = shufflevector <8 x float> %bin.rdx, <8 x float> undef, <8 x i32> <i32 1, i32 3, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %bin.rdx8 = fadd <8 x float> %rdx.shuf.1.0, %rdx.shuf.1.1
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %rdx.shuf.2.0 = shufflevector <8 x float> %bin.rdx8, <8 x float> undef, <8 x i32> <i32 0, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
-; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %rdx.shuf.2.1 = shufflevector <8 x float> %bin.rdx8, <8 x float> undef, <8 x i32> <i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %rdx.shuf.2.1 = shufflevector <8 x float> %bin.rdx8, <8 x float> undef, <8 x i32> <i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %bin.rdx9 = fadd <8 x float> %rdx.shuf.2.0, %rdx.shuf.2.1
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %r = extractelement <8 x float> %bin.rdx9, i32 0
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret float %r
@@ -1088,7 +1088,7 @@ define fastcc i64 @pairwise_reduction4i64(<4 x i64> %rdx, i64 %f1) {
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %rdx.shuf.0.1 = shufflevector <4 x i64> %rdx, <4 x i64> undef, <4 x i32> <i32 1, i32 3, i32 poison, i32 poison>
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %bin.rdx = add <4 x i64> %rdx.shuf.0.0, %rdx.shuf.0.1
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %rdx.shuf.1.0 = shufflevector <4 x i64> %bin.rdx, <4 x i64> undef, <4 x i32> <i32 0, i32 poison, i32 poison, i32 poison>
-; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %rdx.shuf.1.1 = shufflevector <4 x i64> %bin.rdx, <4 x i64> undef, <4 x i32> <i32 1, i32 poison, i32 poison, i32 poison>
+; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %rdx.shuf.1.1 = shufflevector <4 x i64> %bin.rdx, <4 x i64> undef, <4 x i32> <i32 1, i32 poison, i32 poison, i32 poison>
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %bin.rdx8 = add <4 x i64> %rdx.shuf.1.0, %rdx.shuf.1.1
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %r = extractelement <4 x i64> %bin.rdx8, i32 0
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i64 %r
@@ -1222,11 +1222,11 @@ define fastcc i32 @pairwise_reduction8i32(<8 x i32> %rdx, i32 %f1) {
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %rdx.shuf.0.0 = shufflevector <8 x i32> %rdx, <8 x i32> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 poison, i32 poison, i32 poison, i32 poison>
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %rdx.shuf.0.1 = shufflevector <8 x i32> %rdx, <8 x i32> undef, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 poison, i32 poison, i32 poison, i32 poison>
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %bin.rdx = add <8 x i32> %rdx.shuf.0.0, %rdx.shuf.0.1
-; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %rdx.shuf.1.0 = shufflevector <8 x i32> %bin.rdx, <8 x i32> undef, <8 x i32> <i32 0, i32 2, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
-; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %rdx.shuf.1.1 = shufflevector <8 x i32> %bin.rdx, <8 x i32> undef, <8 x i32> <i32 1, i32 3, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %rdx.shuf.1.0 = shufflevector <8 x i32> %bin.rdx, <8 x i32> undef, <8 x i32> <i32 0, i32 2, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %rdx.shuf.1.1 = shufflevector <8 x i32> %bin.rdx, <8 x i32> undef, <8 x i32> <i32 1, i32 3, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %bin.rdx8 = add <8 x i32> %rdx.shuf.1.0, %rdx.shuf.1.1
 ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %rdx.shuf.2.0 = shufflevector <8 x i32> %bin.rdx8, <8 x i32> undef, <8 x i32> <i32 0, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
-; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %rdx.shuf.2.1 = shufflevector <8 x i32> %bin.rdx8, <8 x i32> undef, <8 x i32> <i32 1, i32 poiso...
[truncated]

Copy link

github-actions bot commented Nov 2, 2024

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff 30213e99b86a078c9d472264c7edeea9aa638dd4 9fc7a66fe904f64e367afcc918756a944008a5b8 --extensions cpp -- llvm/lib/Target/X86/X86TargetTransformInfo.cpp
View the diff from clang-format here.
diff --git a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
index 7f362f48c2..f4679fce2c 100644
--- a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
+++ b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
@@ -1917,15 +1917,15 @@ InstructionCost X86TTIImpl::getShuffleCost(
   static const CostTblEntry AVX2InLaneShuffleTbl[] = {
       {TTI::SK_PermuteSingleSrc, MVT::v16i16, 1}, // vpshufb
       {TTI::SK_PermuteSingleSrc, MVT::v16f16, 1}, // vpshufb
-      {TTI::SK_PermuteSingleSrc, MVT::v32i8,  1}, // vpshufb
-
-      {TTI::SK_PermuteTwoSrc, MVT::v4f64,     2}, // 2*vshufpd + vblendpd
-      {TTI::SK_PermuteTwoSrc, MVT::v8f32,     2}, // 2*vshufps + vblendps
-      {TTI::SK_PermuteTwoSrc, MVT::v4i64,     2}, // 2*vpshufd + vpblendd
-      {TTI::SK_PermuteTwoSrc, MVT::v8i32,     2}, // 2*vpshufd + vpblendd
-      {TTI::SK_PermuteTwoSrc, MVT::v16i16,    2}, // 2*vpshufb + vpor
-      {TTI::SK_PermuteTwoSrc, MVT::v16f16,    2}, // 2*vpshufb + vpor
-      {TTI::SK_PermuteTwoSrc, MVT::v32i8,     2}, // 2*vpshufb + vpor
+      {TTI::SK_PermuteSingleSrc, MVT::v32i8, 1},  // vpshufb
+
+      {TTI::SK_PermuteTwoSrc, MVT::v4f64, 2},  // 2*vshufpd + vblendpd
+      {TTI::SK_PermuteTwoSrc, MVT::v8f32, 2},  // 2*vshufps + vblendps
+      {TTI::SK_PermuteTwoSrc, MVT::v4i64, 2},  // 2*vpshufd + vpblendd
+      {TTI::SK_PermuteTwoSrc, MVT::v8i32, 2},  // 2*vpshufd + vpblendd
+      {TTI::SK_PermuteTwoSrc, MVT::v16i16, 2}, // 2*vpshufb + vpor
+      {TTI::SK_PermuteTwoSrc, MVT::v16f16, 2}, // 2*vpshufb + vpor
+      {TTI::SK_PermuteTwoSrc, MVT::v32i8, 2},  // 2*vpshufb + vpor
   };
 
   if (IsInLaneShuffle && ST->hasAVX2())
@@ -2010,28 +2010,28 @@ InstructionCost X86TTIImpl::getShuffleCost(
       return LT.first * Entry->Cost;
 
   static const CostTblEntry AVX1InLaneShuffleTbl[] = {
-      {TTI::SK_PermuteSingleSrc, MVT::v4f64,  1},  // vpermilpd
-      {TTI::SK_PermuteSingleSrc, MVT::v4i64,  1},  // vpermilpd
-      {TTI::SK_PermuteSingleSrc, MVT::v8f32,  1},  // vpermilps
-      {TTI::SK_PermuteSingleSrc, MVT::v8i32,  1},  // vpermilps
+      {TTI::SK_PermuteSingleSrc, MVT::v4f64, 1}, // vpermilpd
+      {TTI::SK_PermuteSingleSrc, MVT::v4i64, 1}, // vpermilpd
+      {TTI::SK_PermuteSingleSrc, MVT::v8f32, 1}, // vpermilps
+      {TTI::SK_PermuteSingleSrc, MVT::v8i32, 1}, // vpermilps
 
       {TTI::SK_PermuteSingleSrc, MVT::v16i16, 4}, // vextractf128 + 2*pshufb
                                                   // + vpor + vinsertf128
       {TTI::SK_PermuteSingleSrc, MVT::v16f16, 4}, // vextractf128 + 2*pshufb
                                                   // + vpor + vinsertf128
-      {TTI::SK_PermuteSingleSrc, MVT::v32i8,  4}, // vextractf128 + 2*pshufb
+      {TTI::SK_PermuteSingleSrc, MVT::v32i8, 4},  // vextractf128 + 2*pshufb
                                                   // + vpor + vinsertf128
 
-      {TTI::SK_PermuteTwoSrc, MVT::v4f64,     2}, // 2*vshufpd + vblendpd
-      {TTI::SK_PermuteTwoSrc, MVT::v8f32,     2}, // 2*vshufps + vblendps
-      {TTI::SK_PermuteTwoSrc, MVT::v4i64,     2}, // 2*vpermilpd + vblendpd
-      {TTI::SK_PermuteTwoSrc, MVT::v8i32,     2}, // 2*vpermilps + vblendps
-      {TTI::SK_PermuteTwoSrc, MVT::v16i16,    9}, // 2*vextractf128 + 4*pshufb
-                                                  // + 2*vpor + vinsertf128
-      {TTI::SK_PermuteTwoSrc, MVT::v16f16,    9}, // 2*vextractf128 + 4*pshufb
-                                                  // + 2*vpor + vinsertf128
-      {TTI::SK_PermuteTwoSrc, MVT::v32i8,     9}, // 2*vextractf128 + 4*pshufb
-                                                  // + 2*vpor + vinsertf128
+      {TTI::SK_PermuteTwoSrc, MVT::v4f64, 2},  // 2*vshufpd + vblendpd
+      {TTI::SK_PermuteTwoSrc, MVT::v8f32, 2},  // 2*vshufps + vblendps
+      {TTI::SK_PermuteTwoSrc, MVT::v4i64, 2},  // 2*vpermilpd + vblendpd
+      {TTI::SK_PermuteTwoSrc, MVT::v8i32, 2},  // 2*vpermilps + vblendps
+      {TTI::SK_PermuteTwoSrc, MVT::v16i16, 9}, // 2*vextractf128 + 4*pshufb
+                                               // + 2*vpor + vinsertf128
+      {TTI::SK_PermuteTwoSrc, MVT::v16f16, 9}, // 2*vextractf128 + 4*pshufb
+                                               // + 2*vpor + vinsertf128
+      {TTI::SK_PermuteTwoSrc, MVT::v32i8, 9},  // 2*vextractf128 + 4*pshufb
+                                               // + 2*vpor + vinsertf128
   };
 
   if (IsInLaneShuffle && ST->hasAVX())

@RKSimon RKSimon force-pushed the x86-inlane-shuffle-costs branch from 826bd3b to 66c2387 Compare November 3, 2024 12:04
Comment on lines 1571 to 1575
IsInLaneShuffle = true;
for (auto [I, M] : enumerate(Mask))
if (0 <= M)
IsInLaneShuffle &=
((M % Mask.size()) / NumEltsPerLane) == (I / NumEltsPerLane);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IsInLaneShuffle = all_of(enumerate(Mask), [&](const auto &P) { return P.value() == PoisonMaskElem || ((P.value() % Mask.size()) / NumEltsPerLane) == (P.index() / NumEltsPerLane); });

…t shuffles

Most of the x86 shuffle instructions operate within each 128-bit subvector lane, but our shuffle costs struggle to handle this and have to fallback to worst case shuffles that reference elements from any lane.

This patch detects shuffle masks that we know are "inlane" and enable us to assume a cheaper shuffle cost.
@RKSimon RKSimon force-pushed the x86-inlane-shuffle-costs branch from 66c2387 to 6b79cb6 Compare November 3, 2024 14:43
bool IsInLaneShuffle = false;
if (BaseTp->getPrimitiveSizeInBits() > 0 &&
(BaseTp->getPrimitiveSizeInBits() % 128) == 0 &&
Mask.size() == BaseTp->getElementCount().getKnownMinValue()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we also check that the original type of the vector element is the same as the type of the "legalized" (machine) vector type?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I think checking the scalar sizes match should be enough - I'll add that shortly

@@ -1564,6 +1564,7 @@ InstructionCost X86TTIImpl::getShuffleCost(
bool IsInLaneShuffle = false;
if (BaseTp->getPrimitiveSizeInBits() > 0 &&
(BaseTp->getPrimitiveSizeInBits() % 128) == 0 &&
BaseTp->getScalarSizeInBits() == LT.second.getScalarSizeInBits() &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will it work for pointers?

Copy link
Collaborator Author

@RKSimon RKSimon Nov 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, thats what BaseTp->getPrimitiveSizeInBits() > 0 is checking for as ptr vectors return 0

Copy link
Member

@alexey-bataev alexey-bataev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG

@RKSimon RKSimon merged commit ac1869a into llvm:main Nov 4, 2024
7 of 8 checks passed
@RKSimon RKSimon deleted the x86-inlane-shuffle-costs branch November 4, 2024 10:19
PhilippRados pushed a commit to PhilippRados/llvm-project that referenced this pull request Nov 6, 2024
…t shuffles (llvm#114680)

Most of the x86 shuffle instructions operate within each 128-bit subvector lane, but our shuffle costs struggle to handle this and have to fallback to worst case shuffles that reference elements from any lane.

This patch detects shuffle masks that we know are "inlane" and enable us to assume a cheaper shuffle cost.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants