
[RISCV] Allow non-power-of-2 vectors for VLS code generation #97010

Open

kito-cheng wants to merge 2 commits into main from kitoc/non-pow-2-types
Conversation

kito-cheng (Member)
SLP supports non-power-of-2 vectors [1], so we should consider supporting this
for RISC-V vector code generation. It is natural to support non-power-of-2 VLS
vectors for the vector extension, as VL does not impose any constraints on this.

In theory, we could support any length, but we want to prevent the
number of MVTs from growing too quickly. Therefore, we only add v3, v5,
v7 and v15.

[1] #77790


NOTE: this PR depends on #96481, so you can ignore ValueTypes.td and AMDGPUISelLowering.cpp during review.
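
For intuition, here is a dependency-free sketch (hypothetical helper name; the real logic is the getContainerForFixedLengthVector hunk in RISCVISelLowering.cpp shown in the clang-format report at the bottom of this page) of how a non-power-of-2 VLS element count would pick its scalable container, assuming RVVBitsPerBlock = 64:

#include <algorithm>
#include <bit>

// Round a non-power-of-2 VLS element count up to the next power of 2
// (v3 -> 4, v5/v7 -> 8, v15 -> 16), then derive the scalable container's
// element count the same way as for the existing power-of-2 types.
unsigned getContainerNumElts(unsigned NumVLSElts, unsigned MinVLen,
                             unsigned MaxELen,
                             unsigned RVVBitsPerBlock = 64) {
  NumVLSElts = std::bit_ceil(NumVLSElts); // no-op for power-of-2 counts
  unsigned NumElts = (NumVLSElts * RVVBitsPerBlock) / MinVLen;
  return std::max(NumElts, RVVBitsPerBlock / MaxELen);
}

// e.g. v3i32 with MinVLen=128, MaxELen=64: bit_ceil(3) == 4, so the
// container is nxv2i32 -- the same one v4i32 uses.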


llvmbot commented Jun 28, 2024

@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-backend-risc-v

Author: Kito Cheng (kito-cheng)

Changes



Patch is 152.72 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/97010.diff

32 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/ValueTypes.td (+224-201)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp (+2-1)
  • (modified) llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp (-2)
  • (modified) llvm/lib/Target/RISCV/RISCVISelLowering.cpp (+30-9)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-bitreverse-vp.ll (+89-107)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-bswap-vp.ll (+2-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-ctlz-vp.ll (+100-206)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-ctpop-vp.ll (+43-106)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-cttz-vp.ll (+100-206)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-extract.ll (+10-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp.ll (+20-66)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp2i.ll (+122-130)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-i2fp.ll (+2-6)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-insert.ll (+8-25)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-splat.ll (+1-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-interleaved-access.ll (+5-45)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-llrint.ll (+3-24)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-lrint.ll (+6-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-non-power-of-2.ll (-8)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-reduction-formation.ll (+13-83)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-setcc-fp-vp.ll (+1-1)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-reverse.ll (+24-28)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-stepvector.ll (+1-1)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vfadd-vp.ll (+2-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vfdiv-vp.ll (+2-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vfma-vp.ll (+2-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vfmul-vp.ll (+2-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vfmuladd-vp.ll (+2-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vfsub-vp.ll (+2-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/memcpy-inline.ll (+9-57)
  • (modified) llvm/test/CodeGen/RISCV/srem-seteq-illegal-types.ll (+48-40)
  • (modified) llvm/test/CodeGen/RISCV/urem-seteq-illegal-types.ll (+47-34)
diff --git a/llvm/include/llvm/CodeGen/ValueTypes.td b/llvm/include/llvm/CodeGen/ValueTypes.td
index 963b6a71de380..9f2c4806a6141 100644
--- a/llvm/include/llvm/CodeGen/ValueTypes.td
+++ b/llvm/include/llvm/CodeGen/ValueTypes.td
@@ -83,210 +83,233 @@ def v1i1    : VTVec<1,    i1, 17>;  //    1 x i1 vector value
 def v2i1    : VTVec<2,    i1, 18>;  //    2 x i1 vector value
 def v3i1    : VTVec<3,    i1, 19>;  //    3 x i1 vector value
 def v4i1    : VTVec<4,    i1, 20>;  //    4 x i1 vector value
-def v8i1    : VTVec<8,    i1, 21>;  //    8 x i1 vector value
-def v16i1   : VTVec<16,   i1, 22>;  //   16 x i1 vector value
-def v32i1   : VTVec<32,   i1, 23>;  //   32 x i1 vector value
-def v64i1   : VTVec<64,   i1, 24>;  //   64 x i1 vector value
-def v128i1  : VTVec<128,  i1, 25>;  //  128 x i1 vector value
-def v256i1  : VTVec<256,  i1, 26>;  //  256 x i1 vector value
-def v512i1  : VTVec<512,  i1, 27>;  //  512 x i1 vector value
-def v1024i1 : VTVec<1024, i1, 28>;  // 1024 x i1 vector value
-def v2048i1 : VTVec<2048, i1, 29>;  // 2048 x i1 vector value
-
-def v128i2  : VTVec<128,  i2, 30>;   //  128 x i2 vector value
-def v256i2  : VTVec<256,  i2, 31>;   //  256 x i2 vector value
-
-def v64i4   : VTVec<64,   i4, 32>;   //   64 x i4 vector value
-def v128i4  : VTVec<128,  i4, 33>;   //  128 x i4 vector value
-
-def v1i8    : VTVec<1,    i8, 34>;  //    1 x i8 vector value
-def v2i8    : VTVec<2,    i8, 35>;  //    2 x i8 vector value
-def v3i8    : VTVec<3,    i8, 36>;  //    3 x i8 vector value
-def v4i8    : VTVec<4,    i8, 37>;  //    4 x i8 vector value
-def v8i8    : VTVec<8,    i8, 38>;  //    8 x i8 vector value
-def v16i8   : VTVec<16,   i8, 39>;  //   16 x i8 vector value
-def v32i8   : VTVec<32,   i8, 40>;  //   32 x i8 vector value
-def v64i8   : VTVec<64,   i8, 41>;  //   64 x i8 vector value
-def v128i8  : VTVec<128,  i8, 42>;  //  128 x i8 vector value
-def v256i8  : VTVec<256,  i8, 43>;  //  256 x i8 vector value
-def v512i8  : VTVec<512,  i8, 44>;  //  512 x i8 vector value
-def v1024i8 : VTVec<1024, i8, 45>;  // 1024 x i8 vector value
-
-def v1i16   : VTVec<1,   i16, 46>;  //   1 x i16 vector value
-def v2i16   : VTVec<2,   i16, 47>;  //   2 x i16 vector value
-def v3i16   : VTVec<3,   i16, 48>;  //   3 x i16 vector value
-def v4i16   : VTVec<4,   i16, 49>;  //   4 x i16 vector value
-def v8i16   : VTVec<8,   i16, 50>;  //   8 x i16 vector value
-def v16i16  : VTVec<16,  i16, 51>;  //  16 x i16 vector value
-def v32i16  : VTVec<32,  i16, 52>;  //  32 x i16 vector value
-def v64i16  : VTVec<64,  i16, 53>;  //  64 x i16 vector value
-def v128i16 : VTVec<128, i16, 54>;  // 128 x i16 vector value
-def v256i16 : VTVec<256, i16, 55>;  // 256 x i16 vector value
-def v512i16 : VTVec<512, i16, 56>;  // 512 x i16 vector value
-
-def v1i32    : VTVec<1,    i32, 57>;  //    1 x i32 vector value
-def v2i32    : VTVec<2,    i32, 58>;  //    2 x i32 vector value
-def v3i32    : VTVec<3,    i32, 59>;  //    3 x i32 vector value
-def v4i32    : VTVec<4,    i32, 60>;  //    4 x i32 vector value
-def v5i32    : VTVec<5,    i32, 61>;  //    5 x i32 vector value
-def v6i32    : VTVec<6,    i32, 62>;  //    6 x f32 vector value
-def v7i32    : VTVec<7,    i32, 63>;  //    7 x f32 vector value
-def v8i32    : VTVec<8,    i32, 64>;  //    8 x i32 vector value
-def v9i32    : VTVec<9,    i32, 65>;  //    9 x i32 vector value
-def v10i32   : VTVec<10,   i32, 66>;  //   10 x i32 vector value
-def v11i32   : VTVec<11,   i32, 67>;  //   11 x i32 vector value
-def v12i32   : VTVec<12,   i32, 68>;  //   12 x i32 vector value
-def v16i32   : VTVec<16,   i32, 69>;  //   16 x i32 vector value
-def v32i32   : VTVec<32,   i32, 70>;  //   32 x i32 vector value
-def v64i32   : VTVec<64,   i32, 71>;  //   64 x i32 vector value
-def v128i32  : VTVec<128,  i32, 72>;  //  128 x i32 vector value
-def v256i32  : VTVec<256,  i32, 73>;  //  256 x i32 vector value
-def v512i32  : VTVec<512,  i32, 74>;  //  512 x i32 vector value
-def v1024i32 : VTVec<1024, i32, 75>;  // 1024 x i32 vector value
-def v2048i32 : VTVec<2048, i32, 76>;  // 2048 x i32 vector value
-
-def v1i64   : VTVec<1,   i64, 77>;  //   1 x i64 vector value
-def v2i64   : VTVec<2,   i64, 78>;  //   2 x i64 vector value
-def v3i64   : VTVec<3,   i64, 79>;  //   3 x i64 vector value
-def v4i64   : VTVec<4,   i64, 80>;  //   4 x i64 vector value
-def v8i64   : VTVec<8,   i64, 81>;  //   8 x i64 vector value
-def v16i64  : VTVec<16,  i64, 82>;  //  16 x i64 vector value
-def v32i64  : VTVec<32,  i64, 83>;  //  32 x i64 vector value
-def v64i64  : VTVec<64,  i64, 84>;  //  64 x i64 vector value
-def v128i64 : VTVec<128, i64, 85>;  // 128 x i64 vector value
-def v256i64 : VTVec<256, i64, 86>;  // 256 x i64 vector value
-
-def v1i128  : VTVec<1,  i128, 87>;  //  1 x i128 vector value
-
-def v1f16    : VTVec<1,    f16,  88>;  //    1 x f16 vector value
-def v2f16    : VTVec<2,    f16,  89>;  //    2 x f16 vector value
-def v3f16    : VTVec<3,    f16,  90>;  //    3 x f16 vector value
-def v4f16    : VTVec<4,    f16,  91>;  //    4 x f16 vector value
-def v8f16    : VTVec<8,    f16,  92>;  //    8 x f16 vector value
-def v16f16   : VTVec<16,   f16,  93>;  //   16 x f16 vector value
-def v32f16   : VTVec<32,   f16,  94>;  //   32 x f16 vector value
-def v64f16   : VTVec<64,   f16,  95>;  //   64 x f16 vector value
-def v128f16  : VTVec<128,  f16,  96>;  //  128 x f16 vector value
-def v256f16  : VTVec<256,  f16,  97>;  //  256 x f16 vector value
-def v512f16  : VTVec<512,  f16,  98>;  //  512 x f16 vector value
-
-def v2bf16   : VTVec<2,   bf16,  99>;  //    2 x bf16 vector value
-def v3bf16   : VTVec<3,   bf16, 100>;  //    3 x bf16 vector value
-def v4bf16   : VTVec<4,   bf16, 101>;  //    4 x bf16 vector value
-def v8bf16   : VTVec<8,   bf16, 102>;  //    8 x bf16 vector value
-def v16bf16  : VTVec<16,  bf16, 103>;  //   16 x bf16 vector value
-def v32bf16  : VTVec<32,  bf16, 104>;  //   32 x bf16 vector value
-def v64bf16  : VTVec<64,  bf16, 105>;  //   64 x bf16 vector value
-def v128bf16 : VTVec<128, bf16, 106>;  //  128 x bf16 vector value
-
-def v1f32    : VTVec<1,    f32, 107>;  //    1 x f32 vector value
-def v2f32    : VTVec<2,    f32, 108>;  //    2 x f32 vector value
-def v3f32    : VTVec<3,    f32, 109>;  //    3 x f32 vector value
-def v4f32    : VTVec<4,    f32, 110>;  //    4 x f32 vector value
-def v5f32    : VTVec<5,    f32, 111>;  //    5 x f32 vector value
-def v6f32    : VTVec<6,    f32, 112>;  //    6 x f32 vector value
-def v7f32    : VTVec<7,    f32, 113>;  //    7 x f32 vector value
-def v8f32    : VTVec<8,    f32, 114>;  //    8 x f32 vector value
-def v9f32    : VTVec<9,    f32, 115>;  //    9 x f32 vector value
-def v10f32   : VTVec<10,   f32, 116>;  //   10 x f32 vector value
-def v11f32   : VTVec<11,   f32, 117>;  //   11 x f32 vector value
-def v12f32   : VTVec<12,   f32, 118>;  //   12 x f32 vector value
-def v16f32   : VTVec<16,   f32, 119>;  //   16 x f32 vector value
-def v32f32   : VTVec<32,   f32, 120>;  //   32 x f32 vector value
-def v64f32   : VTVec<64,   f32, 121>;  //   64 x f32 vector value
-def v128f32  : VTVec<128,  f32, 122>;  //  128 x f32 vector value
-def v256f32  : VTVec<256,  f32, 123>;  //  256 x f32 vector value
-def v512f32  : VTVec<512,  f32, 124>;  //  512 x f32 vector value
-def v1024f32 : VTVec<1024, f32, 125>;  // 1024 x f32 vector value
-def v2048f32 : VTVec<2048, f32, 126>;  // 2048 x f32 vector value
-
-def v1f64    : VTVec<1,    f64, 127>;  //    1 x f64 vector value
-def v2f64    : VTVec<2,    f64, 128>;  //    2 x f64 vector value
-def v3f64    : VTVec<3,    f64, 129>;  //    3 x f64 vector value
-def v4f64    : VTVec<4,    f64, 130>;  //    4 x f64 vector value
-def v8f64    : VTVec<8,    f64, 131>;  //    8 x f64 vector value
-def v16f64   : VTVec<16,   f64, 132>;  //   16 x f64 vector value
-def v32f64   : VTVec<32,   f64, 133>;  //   32 x f64 vector value
-def v64f64   : VTVec<64,   f64, 134>;  //   64 x f64 vector value
-def v128f64  : VTVec<128,  f64, 135>;  //  128 x f64 vector value
-def v256f64  : VTVec<256,  f64, 136>;  //  256 x f64 vector value
-
-def nxv1i1  : VTScalableVec<1,  i1, 137>;  // n x  1 x i1  vector value
-def nxv2i1  : VTScalableVec<2,  i1, 138>;  // n x  2 x i1  vector value
-def nxv4i1  : VTScalableVec<4,  i1, 139>;  // n x  4 x i1  vector value
-def nxv8i1  : VTScalableVec<8,  i1, 140>;  // n x  8 x i1  vector value
-def nxv16i1 : VTScalableVec<16, i1, 141>;  // n x 16 x i1  vector value
-def nxv32i1 : VTScalableVec<32, i1, 142>;  // n x 32 x i1  vector value
-def nxv64i1 : VTScalableVec<64, i1, 143>;  // n x 64 x i1  vector value
-
-def nxv1i8  : VTScalableVec<1,  i8, 144>;  // n x  1 x i8  vector value
-def nxv2i8  : VTScalableVec<2,  i8, 145>;  // n x  2 x i8  vector value
-def nxv4i8  : VTScalableVec<4,  i8, 146>;  // n x  4 x i8  vector value
-def nxv8i8  : VTScalableVec<8,  i8, 147>;  // n x  8 x i8  vector value
-def nxv16i8 : VTScalableVec<16, i8, 148>;  // n x 16 x i8  vector value
-def nxv32i8 : VTScalableVec<32, i8, 149>;  // n x 32 x i8  vector value
-def nxv64i8 : VTScalableVec<64, i8, 150>;  // n x 64 x i8  vector value
-
-def nxv1i16  : VTScalableVec<1,  i16, 151>;  // n x  1 x i16 vector value
-def nxv2i16  : VTScalableVec<2,  i16, 152>;  // n x  2 x i16 vector value
-def nxv4i16  : VTScalableVec<4,  i16, 153>;  // n x  4 x i16 vector value
-def nxv8i16  : VTScalableVec<8,  i16, 154>;  // n x  8 x i16 vector value
-def nxv16i16 : VTScalableVec<16, i16, 155>;  // n x 16 x i16 vector value
-def nxv32i16 : VTScalableVec<32, i16, 156>;  // n x 32 x i16 vector value
-
-def nxv1i32  : VTScalableVec<1,  i32, 157>;  // n x  1 x i32 vector value
-def nxv2i32  : VTScalableVec<2,  i32, 158>;  // n x  2 x i32 vector value
-def nxv4i32  : VTScalableVec<4,  i32, 159>;  // n x  4 x i32 vector value
-def nxv8i32  : VTScalableVec<8,  i32, 160>;  // n x  8 x i32 vector value
-def nxv16i32 : VTScalableVec<16, i32, 161>;  // n x 16 x i32 vector value
-def nxv32i32 : VTScalableVec<32, i32, 162>;  // n x 32 x i32 vector value
-
-def nxv1i64  : VTScalableVec<1,  i64, 163>;  // n x  1 x i64 vector value
-def nxv2i64  : VTScalableVec<2,  i64, 164>;  // n x  2 x i64 vector value
-def nxv4i64  : VTScalableVec<4,  i64, 165>;  // n x  4 x i64 vector value
-def nxv8i64  : VTScalableVec<8,  i64, 166>;  // n x  8 x i64 vector value
-def nxv16i64 : VTScalableVec<16, i64, 167>;  // n x 16 x i64 vector value
-def nxv32i64 : VTScalableVec<32, i64, 168>;  // n x 32 x i64 vector value
-
-def nxv1f16  : VTScalableVec<1,  f16, 169>;  // n x  1 x  f16 vector value
-def nxv2f16  : VTScalableVec<2,  f16, 170>;  // n x  2 x  f16 vector value
-def nxv4f16  : VTScalableVec<4,  f16, 171>;  // n x  4 x  f16 vector value
-def nxv8f16  : VTScalableVec<8,  f16, 172>;  // n x  8 x  f16 vector value
-def nxv16f16 : VTScalableVec<16, f16, 173>;  // n x 16 x  f16 vector value
-def nxv32f16 : VTScalableVec<32, f16, 174>;  // n x 32 x  f16 vector value
-
-def nxv1bf16  : VTScalableVec<1,  bf16, 175>;  // n x  1 x bf16 vector value
-def nxv2bf16  : VTScalableVec<2,  bf16, 176>;  // n x  2 x bf16 vector value
-def nxv4bf16  : VTScalableVec<4,  bf16, 177>;  // n x  4 x bf16 vector value
-def nxv8bf16  : VTScalableVec<8,  bf16, 178>;  // n x  8 x bf16 vector value
-def nxv16bf16 : VTScalableVec<16, bf16, 179>;  // n x 16 x bf16 vector value
-def nxv32bf16 : VTScalableVec<32, bf16, 180>;  // n x 32 x bf16 vector value
-
-def nxv1f32  : VTScalableVec<1,  f32, 181>;  // n x  1 x  f32 vector value
-def nxv2f32  : VTScalableVec<2,  f32, 182>;  // n x  2 x  f32 vector value
-def nxv4f32  : VTScalableVec<4,  f32, 183>;  // n x  4 x  f32 vector value
-def nxv8f32  : VTScalableVec<8,  f32, 184>;  // n x  8 x  f32 vector value
-def nxv16f32 : VTScalableVec<16, f32, 185>;  // n x 16 x  f32 vector value
-
-def nxv1f64  : VTScalableVec<1,  f64, 186>;  // n x  1 x  f64 vector value
-def nxv2f64  : VTScalableVec<2,  f64, 187>;  // n x  2 x  f64 vector value
-def nxv4f64  : VTScalableVec<4,  f64, 188>;  // n x  4 x  f64 vector value
-def nxv8f64  : VTScalableVec<8,  f64, 189>;  // n x  8 x  f64 vector value
-
-def x86mmx    : ValueType<64,   190>;  // X86 MMX value
-def Glue      : ValueType<0,    191>;  // Pre-RA sched glue
-def isVoid    : ValueType<0,    192>;  // Produces no value
-def untyped   : ValueType<8,    193> { // Produces an untyped value
+def v5i1    : VTVec<5,    i1, 21>;  //    5 x i1 vector value
+def v7i1    : VTVec<7,    i1, 22>;  //    7 x i1 vector value
+def v8i1    : VTVec<8,    i1, 23>;  //    8 x i1 vector value
+def v15i1   : VTVec<15,   i1, 24>;  //   15 x i1 vector value
+def v16i1   : VTVec<16,   i1, 25>;  //   16 x i1 vector value
+def v32i1   : VTVec<32,   i1, 26>;  //   32 x i1 vector value
+def v64i1   : VTVec<64,   i1, 27>;  //   64 x i1 vector value
+def v128i1  : VTVec<128,  i1, 28>;  //  128 x i1 vector value
+def v256i1  : VTVec<256,  i1, 29>;  //  256 x i1 vector value
+def v512i1  : VTVec<512,  i1, 30>;  //  512 x i1 vector value
+def v1024i1 : VTVec<1024, i1, 31>;  // 1024 x i1 vector value
+def v2048i1 : VTVec<2048, i1, 32>;  // 2048 x i1 vector value
+
+def v128i2  : VTVec<128,  i2, 33>;   //  128 x i2 vector value
+def v256i2  : VTVec<256,  i2, 34>;   //  256 x i2 vector value
+
+def v64i4   : VTVec<64,   i4, 35>;   //   64 x i4 vector value
+def v128i4  : VTVec<128,  i4, 36>;   //  128 x i4 vector value
+
+def v1i8    : VTVec<1,    i8, 37>;  //    1 x i8 vector value
+def v2i8    : VTVec<2,    i8, 38>;  //    2 x i8 vector value
+def v3i8    : VTVec<3,    i8, 39>;  //    3 x i8 vector value
+def v4i8    : VTVec<4,    i8, 40>;  //    4 x i8 vector value
+def v5i8    : VTVec<5,    i8, 41>;  //    5 x i8 vector value
+def v7i8    : VTVec<7,    i8, 42>;  //    7 x i8 vector value
+def v8i8    : VTVec<8,    i8, 43>;  //    8 x i8 vector value
+def v15i8   : VTVec<15,   i8, 44>;  //   15 x i8 vector value
+def v16i8   : VTVec<16,   i8, 45>;  //   16 x i8 vector value
+def v32i8   : VTVec<32,   i8, 46>;  //   32 x i8 vector value
+def v64i8   : VTVec<64,   i8, 47>;  //   64 x i8 vector value
+def v128i8  : VTVec<128,  i8, 48>;  //  128 x i8 vector value
+def v256i8  : VTVec<256,  i8, 49>;  //  256 x i8 vector value
+def v512i8  : VTVec<512,  i8, 50>;  //  512 x i8 vector value
+def v1024i8 : VTVec<1024, i8, 51>;  // 1024 x i8 vector value
+
+def v1i16   : VTVec<1,   i16, 52>;  //   1 x i16 vector value
+def v2i16   : VTVec<2,   i16, 53>;  //   2 x i16 vector value
+def v3i16   : VTVec<3,   i16, 54>;  //   3 x i16 vector value
+def v4i16   : VTVec<4,   i16, 55>;  //   4 x i16 vector value
+def v5i16   : VTVec<5,   i16, 56>;  //   5 x i16 vector value
+def v7i16   : VTVec<7,   i16, 57>;  //   7 x i16 vector value
+def v8i16   : VTVec<8,   i16, 58>;  //   8 x i16 vector value
+def v15i16  : VTVec<15,  i16, 59>;  //  15 x i16 vector value
+def v16i16  : VTVec<16,  i16, 60>;  //  16 x i16 vector value
+def v32i16  : VTVec<32,  i16, 61>;  //  32 x i16 vector value
+def v64i16  : VTVec<64,  i16, 62>;  //  64 x i16 vector value
+def v128i16 : VTVec<128, i16, 63>;  // 128 x i16 vector value
+def v256i16 : VTVec<256, i16, 64>;  // 256 x i16 vector value
+def v512i16 : VTVec<512, i16, 65>;  // 512 x i16 vector value
+
+def v1i32    : VTVec<1,    i32, 66>;  //    1 x i32 vector value
+def v2i32    : VTVec<2,    i32, 67>;  //    2 x i32 vector value
+def v3i32    : VTVec<3,    i32, 68>;  //    3 x i32 vector value
+def v4i32    : VTVec<4,    i32, 69>;  //    4 x i32 vector value
+def v5i32    : VTVec<5,    i32, 70>;  //    5 x i32 vector value
+def v6i32    : VTVec<6,    i32, 71>;  //    6 x i32 vector value
+def v7i32    : VTVec<7,    i32, 72>;  //    7 x i32 vector value
+def v8i32    : VTVec<8,    i32, 73>;  //    8 x i32 vector value
+def v9i32    : VTVec<9,    i32, 74>;  //    9 x i32 vector value
+def v10i32   : VTVec<10,   i32, 75>;  //   10 x i32 vector value
+def v11i32   : VTVec<11,   i32, 76>;  //   11 x i32 vector value
+def v12i32   : VTVec<12,   i32, 77>;  //   12 x i32 vector value
+def v15i32   : VTVec<15,   i32, 78>;  //   15 x i32 vector value
+def v16i32   : VTVec<16,   i32, 79>;  //   16 x i32 vector value
+def v32i32   : VTVec<32,   i32, 80>;  //   32 x i32 vector value
+def v64i32   : VTVec<64,   i32, 81>;  //   64 x i32 vector value
+def v128i32  : VTVec<128,  i32, 82>;  //  128 x i32 vector value
+def v256i32  : VTVec<256,  i32, 83>;  //  256 x i32 vector value
+def v512i32  : VTVec<512,  i32, 84>;  //  512 x i32 vector value
+def v1024i32 : VTVec<1024, i32, 85>;  // 1024 x i32 vector value
+def v2048i32 : VTVec<2048, i32, 86>;  // 2048 x i32 vector value
+
+def v1i64   : VTVec<1,   i64, 87>;  //   1 x i64 vector value
+def v2i64   : VTVec<2,   i64, 88>;  //   2 x i64 vector value
+def v3i64   : VTVec<3,   i64, 89>;  //   3 x i64 vector value
+def v4i64   : VTVec<4,   i64, 90>;  //   4 x i64 vector value
+def v5i64   : VTVec<5,   i64, 91>;  //   5 x i64 vector value
+def v7i64   : VTVec<7,   i64, 92>;  //   7 x i64 vector value
+def v8i64   : VTVec<8,   i64, 93>;  //   8 x i64 vector value
+def v15i64  : VTVec<15,  i64, 94>;  //  15 x i64 vector value
+def v16i64  : VTVec<16,  i64, 95>;  //  16 x i64 vector value
+def v32i64  : VTVec<32,  i64, 96>;  //  32 x i64 vector value
+def v64i64  : VTVec<64,  i64, 97>;  //  64 x i64 vector value
+def v128i64 : VTVec<128, i64, 98>;  // 128 x i64 vector value
+def v256i64 : VTVec<256, i64, 99>;  // 256 x i64 vector value
+
+def v1i128  : VTVec<1,  i128, 100>;  //  1 x i128 vector value
+
+def v1f16    : VTVec<1,    f16, 101>;  //    1 x f16 vector value
+def v2f16    : VTVec<2,    f16, 102>;  //    2 x f16 vector value
+def v3f16    : VTVec<3,    f16, 103>;  //    3 x f16 vector value
+def v4f16    : VTVec<4,    f16, 104>;  //    4 x f16 vector value
+def v5f16    : VTVec<5,    f16, 105>;  //    5 x f16 vector value
+def v7f16    : VTVec<7,    f16, 106>;  //    7 x f16 vector value
+def v8f16    : VTVec<8,    f16, 107>;  //    8 x f16 vector value
+def v15f16   : VTVec<15,   f16, 108>;  //   15 x f16 vector value
+def v16f16   : VTVec<16,   f16, 109>;  //   16 x f16 vector value
+def v32f16   : VTVec<32,   f16, 110>;  //   32 x f16 vector value
+def v64f16   : VTVec<64,   f16, 111>;  //   64 x f16 vector value
+def v128f16  : VTVec<128,  f16, 112>;  //  128 x f16 vector value
+def v256f16  : VTVec<256,  f16, 113>;  //  256 x f16 vector value
+def v512f16  : VTVec<512,  f16, 114>;  //  512 x f16 vector value
+
+def v2bf16   : VTVec<2,   bf16, 115>;  //    2 x bf16 vector value
+def v3bf16   : VTVec<3,   bf16, 116>;  //    3 x bf16 vector value
+def v4bf16   : VTVec<4,   bf16, 117>;  //    4 x bf16 vector value
+def v8bf16   : VTVec<8,   bf16, 118>;  //    8 x bf16 vector value
+def v15bf16  : VTVec<15,  bf16, 119>;  //   15 x bf16 vector value
+def v16bf16  : VTVec<16,  bf16, 120>;  //   16 x bf16 vector value
+def v32bf16  : VTVec<32,  bf16, 121>;  //   32 x bf16 vector value
+def v64bf16  : VTVec<64,  bf16, 122>;  //   64 x bf16 vector value
+def v128bf16 : VTVec<128, bf16, 123>;  //  128 x bf16 vector value
+
+def v1f32    : VTVec<1,    f32, 124>;  //    1 x f32 vector value
+def v2f32    : VTVec<2,    f32, 125>;  //    2 x f32 vector value
+def v3f32    : VTVec<3,    f32, 126>;  //    3 x f32 vector value
+def v4f32    : VTVec<4,    f32, 127>;  //    4 x f32 vector value
+def v5f32    : VTVec<5,    f32, 128>;  //    5 x f32 vector value
+def v6f32    : VTVec<6,    f32, 129>;  //    6 x f32 vector value
+def v7f32    : VTVec<7,    f32, 130>;  //    7 x f32 vector value
+def v8f32    : VTVec<8,    f32, 131>;  //    8 x f32 vector value
+def v9f32    : VTVec<9,    f32, 132>;  //    9 x f32 vector value
+def v10f32   : VTVec<10,   f32, 133>;  //   10 x f32 vector value
+def v11f32   : VTVec<11,   f32, 134>;  //   11 x f32 vector value
+def v12f32   : VTVec<12,   f32, 135>;  //   12 x f32 vector value
+def v15f32   : VTVec<15,   f32, 136>;  //   15 x f32 vector value
+def v16f32   : VTVec<16,   f32, 137>;  //   16 x f32 vector value
+def v32f32   : VTVec<32,   f32, 138>;  //   32 x f32 vector value
+def v64f32   : VTVec<64,   f32, 139>;  //   64 x f32 vector value
+def v128f32  : VTVec<128,  f32, 140>;  //  128 x f32 vector value
+def v256f32  : VTVec<256,  f32, 141>;  //  256 x f32 vector value
+def v512f32  : VTVec<512,  f32, 142>;  //  512 x f32 vector value
+def v1024f32 : VTVec<1024, f32, 143>;  // 1024 x f32 vector value
+def v2048f32 : VTVec<2048, f32, 144>;  // 2048 x f32 vector value
+
+def v1f64    : VTVec<1,    f64, 145>;  //   ...
[truncated]


lukel97 commented Jun 28, 2024

Nice! I think this will be very important to have for RISC-V's SLP.

As an alternative to creating new MVTs for odd sizes though, have you considered just letting SelectionDAG widen them to the next legal VL? Aside from reductions and loads/stores, increasing the VL shouldn't impact performance. I also landed a patch a few years ago that should widen loads and stores of illegal fixed-length vector sizes to VP ops: https://reviews.llvm.org/D148713

And then to avoid VL toggles from the discrepancy between the widened ops' VLs and the load/store VLs, I think my original plan was to take advantage of https://llvm.org/devmtg/2023-05/slides/Posters/01-Albano-VectorPredictionPoster.pdf somehow to reduce the VL of the widened ops.
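
A minimal sketch of that widening scheme, assuming the VP-based approach of D148713: arithmetic happens in the next power-of-2 type, while the original element count survives as the explicit vector length (EVL) of the VP loads/stores. WidenedOp and widenToNextPow2 are illustrative names, not LLVM API:

#include <bit>

struct WidenedOp {
  unsigned WideNumElts; // power-of-2 type the arithmetic is performed in
  unsigned EVL;         // original element count, becomes the VL of vp.load/vp.store
};

WidenedOp widenToNextPow2(unsigned NumElts) {
  // v3 -> {4, 3}, v5/v7 -> {8, 5/7}, v15 -> {16, 15}
  return {std::bit_ceil(NumElts), NumElts};
}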

Review thread on llvm/include/llvm/CodeGen/ValueTypes.td:

-def aarch64svcount : ValueType<16, 199>;  // AArch64 predicate-as-counter
-def spirvbuiltin   : ValueType<0,  200>;  // SPIR-V's builtin type
+def aarch64svcount : ValueType<16, 220>;  // AArch64 predicate-as-counter
+def spirvbuiltin   : ValueType<0,  221>;  // SPIR-V's builtin type
Contributor

I think we are likely to exceed the maximum number of ValueTypes. 😨

Member Author

I tried to add everything from v1 to v12, but only 4 slots were left, so I only added v3, v5, v7, and v15 for now...but increasing the MVT size to 16 bits wouldn't impact memory usage too much IMO :P

wangpc-pp (Contributor) commented Jun 28, 2024

Yes. But it's very hard to increase the MVT size to 16 bits because a lot of places (for example, the MatcherTable) assume that an MVT is one byte. We should be careful about adding more MVT types.
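
A simplified illustration of that one-byte assumption (not the actual MatcherTable encoding, just the failure mode it implies): each SimpleValueType is emitted as a single byte, so the MVT enum must stay below 256.

#include <cassert>
#include <cstdint>

using EncodedVT = uint8_t; // the implicit one-byte slot in the emitted tables

EncodedVT encodeVT(unsigned SimpleVTValue) {
  // This patch pushes the last ValueType to 221, leaving little headroom.
  assert(SimpleVTValue <= 0xFF && "MVT enum no longer fits in one byte");
  return static_cast<EncodedVT>(SimpleVTValue);
}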

jrtc27 (Collaborator) commented Jul 2, 2024

You also need to be mindful that downstreams can add MVTs. Even if you don't yet need 16 bits upstream, if there isn't much space left for downstreams, you should support 16 bits upstream rather than force downstreams to figure out how to make that work.

kito-cheng (Member Author)

> As an alternative to creating new MVTs for odd sizes though, have you considered just letting SelectionDAG widen them to the next legal VL? Aside from reductions and loads/stores, increasing the VL shouldn't impact performance. I also landed a patch a few years ago that should widen loads and stores of illegal fixed-length vector sizes to VP ops: https://reviews.llvm.org/D148713

Yeah, my first intuition was not to go that way, since it may take more work than just adding MVTs. However, that reminds me we may have another way of doing it: at the LLVM IR level, i.e. in CodeGenPrepare, which may save us from having to do it all over again in GlobalISel for non-power-of-2 support :P

> And then to avoid VL toggles from the discrepancy between the widened ops' VLs and the load/store VLs, I think my original plan was to take advantage of https://llvm.org/devmtg/2023-05/slides/Posters/01-Albano-VectorPredictionPoster.pdf somehow to reduce the VL of the widened ops.

One possibility is just doing that in the backend; RISC-V GCC already does that in RTL, so I guess it should be doable on MI in SSA form, but that would be a much more complicated approach.

https://github.com/gcc-mirror/gcc/blob/master/gcc/config/riscv/riscv-avlprop.cc

@@ -349,6 +349,7 @@ AMDGPUTargetLowering::AMDGPUTargetLowering(const TargetMachine &TM,
   setTruncStoreAction(MVT::v2f64, MVT::v2f16, Expand);

   setTruncStoreAction(MVT::v3i32, MVT::v3i8, Expand);
+  setTruncStoreAction(MVT::v5i32, MVT::v5i8, Expand);
Contributor

Seems suspicious that we don't need a v5i32->v5i16 case. Probably missing test coverage.


preames commented Jul 2, 2024

I'm focusing my comment on the codegen changes to understand the merits of this approach over alternatives.

Glancing through, I see a couple major categories of changes:

  • VL-only changes: the code sequence doesn't change. Depending on the microarchitecture, this can be either profitable or useless.
  • LD/ST-related changes; more discussion below.
  • Splitting vs. widening: we appear to have some cases where we split arithmetic operations that could instead be widened to the next legal type, e.g. some of the FP/int conversion cases. (Making the non-power-of-2 types legal sidesteps this choice.)
  • Just poor codegen: for example, some of the sum reductions should be a single masked vmv.v.x instead of a series of vslides.

I think we can improve the codegen for the non-zero VL quite a ways before we worry about the LD/ST cases. I'd like to see us do so because it makes it easier to assess the alternatives.

On LD/ST specifically, I do not have a super strong opinion on which approach to take. To an extent, I'm happy to defer to the person driving this forward. I have a mild preference for something which works for any VL (not just 3, 7, 15), but that preference is not strong.


preames commented Aug 22, 2024

> I'm focusing my comment on the codegen changes to understand the merits of this approach over alternatives.

Can I ask that you rebase this? I've now landed a couple of pragmatic improvements for the current lowering without the additional value types, and I'm curious to see what remaining difference we have here.

…16, f16, i64, f64

This patch is a preliminary step to prepare RISC-V for supporting more VLS type
code generation. The currently affected targets are x86, AArch64, and AMDGPU:

- x86: The code generation order and register usage are different, but the
       generated instructions remain the same.

- AArch64: There is a slight change in a GlobalISel dump.

- AMDGPU: TruncStore from MVT::v5i32 to MVT::v5i8 was previously illegal
          because MVT::v5i8 did not exist. Now, it must be explicitly declared
          as Expand. Additionally, the calling convention needs to correctly
          handle the newly added non-power-of-2 vector types.

[RISCV] Allow non-power-of-2 vectors for VLS code generation

SLP supports non-power-of-2 vectors [1], so we should consider supporting this
for RISC-V vector code generation. It is natural to support non-power-of-2 VLS
vectors for the vector extension, as VL does not impose any constraints on this.

In theory, we could support any length, but we want to prevent the
number of MVTs from growing too quickly. Therefore, we only add v3, v5,
v7 and v15.

[1] llvm#77790
kito-cheng force-pushed the kitoc/non-pow-2-types branch from c59d3ac to cdd3b10 on August 26, 2024 at 15:18
kito-cheng (Member Author)

Rebased; I got one more failure after the rebase, but it's CodeGen/AMDGPU/trunc-store.ll, so I think it should be fine for now, and I will fix it if we decide to take this approach.


⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff 4bf68aaca2ec11ffde3ee4c30e9761a144434a92 cdd3b1034e8aec7bdfb0f4c5d840ab3a9f689285 --extensions cpp -- llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp llvm/lib/Target/RISCV/RISCVISelLowering.cpp
The diff from clang-format:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index 2d63ffee38..11b48e8c1d 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -1195,7 +1195,8 @@ void AMDGPUTargetLowering::analyzeFormalArgumentsCompute(
 
       if (NumRegs == 1) {
         // This argument is not split, so the IR type is the memory type.
-        if (ArgVT.isExtended() || (ArgVT.isVector() && !ArgVT.isPow2VectorType())) {
+        if (ArgVT.isExtended() ||
+            (ArgVT.isVector() && !ArgVT.isPow2VectorType())) {
           // We have an extended type, like i24, so we should just use the
           // register type.
           MemVT = RegisterVT;
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 7277bac973..9e2c7f22b1 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -2630,10 +2630,9 @@ static MVT getContainerForFixedLengthVector(const TargetLowering &TLI, MVT VT,
     // each fractional LMUL we support SEW between 8 and LMUL*ELEN.
     unsigned NumVLSElts = VT.getVectorNumElements();
     if (!isPowerOf2_32(NumVLSElts))
-       NumVLSElts = llvm::NextPowerOf2 (NumVLSElts);
+      NumVLSElts = llvm::NextPowerOf2(NumVLSElts);
 
-    unsigned NumElts =
-        (NumVLSElts * RISCV::RVVBitsPerBlock) / MinVLen;
+    unsigned NumElts = (NumVLSElts * RISCV::RVVBitsPerBlock) / MinVLen;
     NumElts = std::max(NumElts, RISCV::RVVBitsPerBlock / MaxELen);
 
     return MVT::getScalableVectorVT(EltVT, NumElts);
@@ -3583,7 +3582,7 @@ static SDValue lowerBuildVectorOfConstants(SDValue Op, SelectionDAG &DAG,
     // codegen across RV32 and RV64.
     unsigned NumViaIntegerBits = std::clamp(NumElts, 8u, Subtarget.getXLen());
     if (!isPowerOf2_32(NumViaIntegerBits))
-       NumViaIntegerBits = llvm::NextPowerOf2 (NumViaIntegerBits);
+      NumViaIntegerBits = llvm::NextPowerOf2(NumViaIntegerBits);
     NumViaIntegerBits = std::min(NumViaIntegerBits, Subtarget.getELen());
     // If we have to use more than one INSERT_VECTOR_ELT then this
     // optimization is likely to increase code size; avoid peforming it in
@@ -3627,7 +3626,7 @@ static SDValue lowerBuildVectorOfConstants(SDValue Op, SelectionDAG &DAG,
       // If we're producing a smaller vector than our minimum legal integer
       // type, bitcast to the equivalent (known-legal) mask type, and extract
       // our final mask.
-      if (IntegerViaVecVT == MVT::v1i8){
+      if (IntegerViaVecVT == MVT::v1i8) {
         assert(IntegerViaVecVT == MVT::v1i8 && "Unexpected mask vector type");
         Vec = DAG.getBitcast(MVT::v8i1, Vec);
         Vec = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, VT, Vec,
