
[RISCV] VLS cost model issues #62576

Closed
topperc opened this issue May 5, 2023 · 4 comments · Fixed by #99594

Comments

@topperc
Collaborator

topperc commented May 5, 2023

These are issues we identified in our downstream tree. I'm not sure whether any of them have been fixed recently.

<2 x i8> → <2 x float> is not costed as being 2 instructions, a vzext+vfwcvt.
<2 x float> → <2 x i8> is not costed as being 2 instructions, a vfncvt+vnsrl.
Scalar fmul/fsub float cost is 2, but the vector cost is 1.

cc: @preames @bubba

@llvmbot
Member

llvmbot commented May 5, 2023

@llvm/issue-subscribers-backend-risc-v

@jacquesguan
Contributor

jacquesguan commented May 10, 2023

The costs of cast and FP arithmetic instructions were modified in ecd7a01 and db07d79; I think the costs are right on the main branch.

@preames
Collaborator

preames commented May 15, 2023

These all look fine to me. This is from a recent TOT snapshot. The fptoXi versions look slightly debatable, since we're discounting two vsetvlis rather than our usual one, but otherwise I don't see obvious issues here.
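The accounting above can be sketched as a toy model (my own illustration, not LLVM's actual implementation): the cost of a fixed-length conversion is roughly the number of machine instructions in its lowering, with vsetvli/vsetivli treated as free.

```python
# Toy cost tally for the RVV lowerings shown below (sitofp widens via
# sign-extend + widening convert; fptosi narrows via convert + narrowing
# shift). vsetvli/vsetivli are discounted, as the cost model assumes.
def vls_cost(lowering):
    """Count non-vsetvli instructions in an RVV lowering."""
    return sum(1 for insn in lowering if not insn.startswith("vset"))

sitofp_v2i8_v2f32 = ["vsetivli", "vsext.vf2", "vfwcvt.f.x.v"]
fptosi_v2f32_v2i8 = ["vsetivli", "vfncvt.rtz.x.f.w", "vsetvli", "vnsrl.wi"]

assert vls_cost(sitofp_v2i8_v2f32) == 2  # matches the reported cost of 2
assert vls_cost(fptosi_v2f32_v2i8) == 2  # cost 2, but two vsetvlis discounted
```

The debatable part is exactly the second case: four instructions are emitted but two of them are vsetvlis, so the discount is doubled relative to the usual single toggle.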

$ cat craig-fp-convert-cost.ll
; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py
; RUN: opt -passes="print<cost-model>" 2>&1 -disable-output -mtriple=riscv64 -mattr=+v,+f,+d,+zfh,+experimental-zvfh < %s | FileCheck %s

define <2 x float> @v2i8_sitofp_v2f32(<2 x i8> %a) {
; CHECK-LABEL: 'v2i8_sitofp_v2f32'
; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %res = sitofp <2 x i8> %a to <2 x float>
; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: ret <2 x float> %res
;
  %res = sitofp <2 x i8> %a to <2 x float>
  ret <2 x float> %res
}

define <2 x float> @v2i8_uitofp_v2f32(<2 x i8> %a) {
; CHECK-LABEL: 'v2i8_uitofp_v2f32'
; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %res = uitofp <2 x i8> %a to <2 x float>
; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: ret <2 x float> %res
;
  %res = uitofp <2 x i8> %a to <2 x float>
  ret <2 x float> %res
}

define <2 x i8> @v2i8_fptosi_v2f32(<2 x float> %a) {
; CHECK-LABEL: 'v2i8_fptosi_v2f32'
; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %res = fptosi <2 x float> %a to <2 x i8>
; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: ret <2 x i8> %res
;
  %res = fptosi <2 x float> %a to <2 x i8>
  ret <2 x i8> %res
}

define <2 x i8> @v2i8_fptoui_v2f32(<2 x float> %a) {
; CHECK-LABEL: 'v2i8_fptoui_v2f32'
; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %res = fptoui <2 x float> %a to <2 x i8>
; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: ret <2 x i8> %res
;
  %res = fptoui <2 x float> %a to <2 x i8>
  ret <2 x i8> %res
}

$ ./llc -march=riscv64 -mattr=+v < craig-fp-convert-cost.ll
	.text
	.attribute	4, 16
	.attribute	5, "rv64i2p1_f2p2_d2p2_v1p0_zicsr2p0_zve32f1p0_zve32x1p0_zve64d1p0_zve64f1p0_zve64x1p0_zvl128b1p0_zvl32b1p0_zvl64b1p0"
	.file	"<stdin>"
	.globl	v2i8_sitofp_v2f32               # -- Begin function v2i8_sitofp_v2f32
	.p2align	2
	.type	v2i8_sitofp_v2f32,@function
	.variant_cc	v2i8_sitofp_v2f32
v2i8_sitofp_v2f32:                      # @v2i8_sitofp_v2f32
	.cfi_startproc
# %bb.0:
	vsetivli	zero, 2, e16, mf4, ta, ma
	vsext.vf2	v9, v8
	vfwcvt.f.x.v	v8, v9
	ret
.Lfunc_end0:
	.size	v2i8_sitofp_v2f32, .Lfunc_end0-v2i8_sitofp_v2f32
	.cfi_endproc
                                        # -- End function
	.globl	v2i8_uitofp_v2f32               # -- Begin function v2i8_uitofp_v2f32
	.p2align	2
	.type	v2i8_uitofp_v2f32,@function
	.variant_cc	v2i8_uitofp_v2f32
v2i8_uitofp_v2f32:                      # @v2i8_uitofp_v2f32
	.cfi_startproc
# %bb.0:
	vsetivli	zero, 2, e16, mf4, ta, ma
	vzext.vf2	v9, v8
	vfwcvt.f.xu.v	v8, v9
	ret
.Lfunc_end1:
	.size	v2i8_uitofp_v2f32, .Lfunc_end1-v2i8_uitofp_v2f32
	.cfi_endproc
                                        # -- End function
	.globl	v2i8_fptosi_v2f32               # -- Begin function v2i8_fptosi_v2f32
	.p2align	2
	.type	v2i8_fptosi_v2f32,@function
	.variant_cc	v2i8_fptosi_v2f32
v2i8_fptosi_v2f32:                      # @v2i8_fptosi_v2f32
	.cfi_startproc
# %bb.0:
	vsetivli	zero, 2, e16, mf4, ta, ma
	vfncvt.rtz.x.f.w	v9, v8
	vsetvli	zero, zero, e8, mf8, ta, ma
	vnsrl.wi	v8, v9, 0
	ret
.Lfunc_end2:
	.size	v2i8_fptosi_v2f32, .Lfunc_end2-v2i8_fptosi_v2f32
	.cfi_endproc
                                        # -- End function
	.globl	v2i8_fptoui_v2f32               # -- Begin function v2i8_fptoui_v2f32
	.p2align	2
	.type	v2i8_fptoui_v2f32,@function
	.variant_cc	v2i8_fptoui_v2f32
v2i8_fptoui_v2f32:                      # @v2i8_fptoui_v2f32
	.cfi_startproc
# %bb.0:
	vsetivli	zero, 2, e16, mf4, ta, ma
	vfncvt.rtz.xu.f.w	v9, v8
	vsetvli	zero, zero, e8, mf8, ta, ma
	vnsrl.wi	v8, v9, 0
	ret
.Lfunc_end3:
	.size	v2i8_fptoui_v2f32, .Lfunc_end3-v2i8_fptoui_v2f32
	.cfi_endproc
                                        # -- End function
	.section	".note.GNU-stack","",@progbits

@topperc
Collaborator Author

topperc commented May 15, 2023

Looks like I made a mistake here. I was acting on someone else's analysis in an internal bug report. Our tree is relatively up to date, and those commits are from six months and a year ago, so I need to figure out where the disconnect was on our side.

sgundapa pushed a commit to sgundapa/upstream_effort that referenced this issue Jul 23, 2024
…lvm#99594)

I was comparing some SPEC CPU 2017 benchmarks across rva22u64 and
rva22u64_v, and noticed that in a few cases that rva22u64_v was
considerably slower.

One of them was 519.lbm_r, which has a large loop that was being
unprofitably vectorized. It has an if/else in the loop which requires
large amounts of predication when vectorized, but despite the loop
vectorizer taking this into account the vector cost came out as cheaper
than the scalar.

It looks like the reason for this is because we cost scalar floating
point ops as 2, but their vector equivalents as 1 (for LMUL 1). This
comes from how we use BasicTTIImpl for scalars which treats floats as
twice as expensive as integers.

This patch doubles the cost of vector floating point arithmetic ops so
that they're at least as expensive as their scalar counterparts, which
gives a 13% speedup on 519.lbm_r at -O3 on the spacemit-x60.

Fixes llvm#62576 (the last point there about scalar fsub/fmul)
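The mis-costing described in the commit message can be illustrated with toy arithmetic (the numbers below are assumptions for illustration, not values from LLVM): with vector FP ops at half the scalar cost, a heavily predicated loop body can look cheaper vectorized even when it isn't, and doubling the vector FP cost flips the decision.

```python
# Toy profitability comparison for a loop with an if/else body.
# Assumed numbers: 8 FP ops per iteration, VF=4, predication executes
# both branches (factor 2) plus a fixed masking overhead.
SCALAR_FP_COST = 2  # BasicTTIImpl treats floats as twice as expensive as ints

def scalar_loop_cost(fp_ops, vf):
    return fp_ops * SCALAR_FP_COST * vf  # vf scalar iterations

def vector_loop_cost(fp_ops, both_branches, mask_overhead, vec_fp_cost):
    # Predication executes both sides of the if/else, plus masking.
    return fp_ops * both_branches * vec_fp_cost + mask_overhead

scalar = scalar_loop_cost(fp_ops=8, vf=4)                              # 64
before = vector_loop_cost(8, both_branches=2, mask_overhead=40,
                          vec_fp_cost=1)                               # 56
after = vector_loop_cost(8, both_branches=2, mask_overhead=40,
                         vec_fp_cost=2)                                # 72

assert before < scalar <= after  # the patch flips the decision to scalar
```

With the old cost of 1, the vectorized body (56) undercuts the scalar loop (64) despite the predication overhead; once vector FP ops cost at least as much as their scalar counterparts (72), the unprofitable vectorization is rejected.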
yuxuanchen1997 pushed a commit that referenced this issue Jul 25, 2024
…99594)
