
[RISCV] VLS cost model issues #62576

Closed
topperc opened this issue May 5, 2023 · 4 comments · Fixed by #99594

Comments

@topperc
Collaborator

topperc commented May 5, 2023

These are issues we identified in our downstream tree. I'm not sure whether any of them have been fixed recently.

<2 x i8> → <2 x float> is not costed as being 2 instructions, a vzext+vfwcvt.
<2 x float> → <2 x i8> is not costed as being 2 instructions, a vfncvt+vnsrl.
Scalar fmul/fsub float cost is 2, but the vector cost is 1.

cc: @preames @bubba

@llvmbot
Member

llvmbot commented May 5, 2023

@llvm/issue-subscribers-backend-risc-v

@jacquesguan
Contributor

jacquesguan commented May 10, 2023

The costs of cast and FP arithmetic instructions were modified in ecd7a01 and db07d79; I think the costs are right on the main branch.

@preames
Collaborator

preames commented May 15, 2023

These all look fine to me. This is from a recent TOT snapshot. The fptoXi versions look slightly debatable, since we're discounting two vsetvlis rather than our usual one, but otherwise I don't see obvious issues here.
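The accounting above can be sketched as a toy model (my own illustration, not LLVM's actual implementation): the cost of a fixed-length conversion is roughly the number of machine instructions in its lowering, with vsetvli/vsetivli treated as free.

```python
# Toy cost tally for the RVV lowerings shown below (sitofp widens via
# sign-extend + widening convert; fptosi narrows via convert + narrowing
# shift). vsetvli/vsetivli are discounted, as the cost model assumes.
def vls_cost(lowering):
    """Count non-vsetvli instructions in an RVV lowering."""
    return sum(1 for insn in lowering if not insn.startswith("vset"))

sitofp_v2i8_v2f32 = ["vsetivli", "vsext.vf2", "vfwcvt.f.x.v"]
fptosi_v2f32_v2i8 = ["vsetivli", "vfncvt.rtz.x.f.w", "vsetvli", "vnsrl.wi"]

assert vls_cost(sitofp_v2i8_v2f32) == 2  # matches the reported cost of 2
assert vls_cost(fptosi_v2f32_v2i8) == 2  # cost 2, but two vsetvlis discounted
```

The debatable part is exactly the second case: four instructions are emitted but two of them are vsetvlis, so the discount is doubled relative to the usual single toggle.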

$ cat craig-fp-convert-cost.ll
; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py
; RUN: opt -passes="print<cost-model>" 2>&1 -disable-output -mtriple=riscv64 -mattr=+v,+f,+d,+zfh,+experimental-zvfh < %s | FileCheck %s

define <2 x float> @v2i8_sitofp_v2f32(<2 x i8> %a) {
; CHECK-LABEL: 'v2i8_sitofp_v2f32'
; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %res = sitofp <2 x i8> %a to <2 x float>
; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: ret <2 x float> %res
;
  %res = sitofp <2 x i8> %a to <2 x float>
  ret <2 x float> %res
}

define <2 x float> @v2i8_uitofp_v2f32(<2 x i8> %a) {
; CHECK-LABEL: 'v2i8_uitofp_v2f32'
; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %res = uitofp <2 x i8> %a to <2 x float>
; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: ret <2 x float> %res
;
  %res = uitofp <2 x i8> %a to <2 x float>
  ret <2 x float> %res
}

define <2 x i8> @v2i8_fptosi_v2f32(<2 x float> %a) {
; CHECK-LABEL: 'v2i8_fptosi_v2f32'
; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %res = fptosi <2 x float> %a to <2 x i8>
; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: ret <2 x i8> %res
;
  %res = fptosi <2 x float> %a to <2 x i8>
  ret <2 x i8> %res
}

define <2 x i8> @v2i8_fptoui_v2f32(<2 x float> %a) {
; CHECK-LABEL: 'v2i8_fptoui_v2f32'
; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %res = fptoui <2 x float> %a to <2 x i8>
; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: ret <2 x i8> %res
;
  %res = fptoui <2 x float> %a to <2 x i8>
  ret <2 x i8> %res
}

$ ./llc -march=riscv64 -mattr=+v < craig-fp-convert-cost.ll
	.text
	.attribute	4, 16
	.attribute	5, "rv64i2p1_f2p2_d2p2_v1p0_zicsr2p0_zve32f1p0_zve32x1p0_zve64d1p0_zve64f1p0_zve64x1p0_zvl128b1p0_zvl32b1p0_zvl64b1p0"
	.file	"<stdin>"
	.globl	v2i8_sitofp_v2f32               # -- Begin function v2i8_sitofp_v2f32
	.p2align	2
	.type	v2i8_sitofp_v2f32,@function
	.variant_cc	v2i8_sitofp_v2f32
v2i8_sitofp_v2f32:                      # @v2i8_sitofp_v2f32
	.cfi_startproc
# %bb.0:
	vsetivli	zero, 2, e16, mf4, ta, ma
	vsext.vf2	v9, v8
	vfwcvt.f.x.v	v8, v9
	ret
.Lfunc_end0:
	.size	v2i8_sitofp_v2f32, .Lfunc_end0-v2i8_sitofp_v2f32
	.cfi_endproc
                                        # -- End function
	.globl	v2i8_uitofp_v2f32               # -- Begin function v2i8_uitofp_v2f32
	.p2align	2
	.type	v2i8_uitofp_v2f32,@function
	.variant_cc	v2i8_uitofp_v2f32
v2i8_uitofp_v2f32:                      # @v2i8_uitofp_v2f32
	.cfi_startproc
# %bb.0:
	vsetivli	zero, 2, e16, mf4, ta, ma
	vzext.vf2	v9, v8
	vfwcvt.f.xu.v	v8, v9
	ret
.Lfunc_end1:
	.size	v2i8_uitofp_v2f32, .Lfunc_end1-v2i8_uitofp_v2f32
	.cfi_endproc
                                        # -- End function
	.globl	v2i8_fptosi_v2f32               # -- Begin function v2i8_fptosi_v2f32
	.p2align	2
	.type	v2i8_fptosi_v2f32,@function
	.variant_cc	v2i8_fptosi_v2f32
v2i8_fptosi_v2f32:                      # @v2i8_fptosi_v2f32
	.cfi_startproc
# %bb.0:
	vsetivli	zero, 2, e16, mf4, ta, ma
	vfncvt.rtz.x.f.w	v9, v8
	vsetvli	zero, zero, e8, mf8, ta, ma
	vnsrl.wi	v8, v9, 0
	ret
.Lfunc_end2:
	.size	v2i8_fptosi_v2f32, .Lfunc_end2-v2i8_fptosi_v2f32
	.cfi_endproc
                                        # -- End function
	.globl	v2i8_fptoui_v2f32               # -- Begin function v2i8_fptoui_v2f32
	.p2align	2
	.type	v2i8_fptoui_v2f32,@function
	.variant_cc	v2i8_fptoui_v2f32
v2i8_fptoui_v2f32:                      # @v2i8_fptoui_v2f32
	.cfi_startproc
# %bb.0:
	vsetivli	zero, 2, e16, mf4, ta, ma
	vfncvt.rtz.xu.f.w	v9, v8
	vsetvli	zero, zero, e8, mf8, ta, ma
	vnsrl.wi	v8, v9, 0
	ret
.Lfunc_end3:
	.size	v2i8_fptoui_v2f32, .Lfunc_end3-v2i8_fptoui_v2f32
	.cfi_endproc
                                        # -- End function
	.section	".note.GNU-stack","",@progbits

@topperc
Collaborator Author

topperc commented May 15, 2023

Looks like I made a mistake here. I was acting on someone else's analysis in an internal bug report. Our tree is relatively up to date, and those commits are from six months and a year ago, so I need to figure out where the disconnect was on our side.

sgundapa pushed a commit to sgundapa/upstream_effort that referenced this issue Jul 23, 2024
…lvm#99594)

I was comparing some SPEC CPU 2017 benchmarks across rva22u64 and
rva22u64_v, and noticed that in a few cases that rva22u64_v was
considerably slower.

One of them was 519.lbm_r, which has a large loop that was being
unprofitably vectorized. It has an if/else in the loop which requires
large amounts of predication when vectorized, but despite the loop
vectorizer taking this into account the vector cost came out as cheaper
than the scalar.

It looks like the reason for this is because we cost scalar floating
point ops as 2, but their vector equivalents as 1 (for LMUL 1). This
comes from how we use BasicTTIImpl for scalars which treats floats as
twice as expensive as integers.

This patch doubles the cost of vector floating point arithmetic ops so
that they're at least as expensive as their scalar counterparts, which
gives a 13% speedup on 519.lbm_r at -O3 on the spacemit-x60.

Fixes llvm#62576 (the last point there about scalar fsub/fmul)
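The mis-costing described in the commit message can be illustrated with toy arithmetic (the numbers below are assumptions for illustration, not values from LLVM): with vector FP ops at half the scalar cost, a heavily predicated loop body can look cheaper vectorized even when it isn't, and doubling the vector FP cost flips the decision.

```python
# Toy profitability comparison for a loop with an if/else body.
# Assumed numbers: 8 FP ops per iteration, VF=4, predication executes
# both branches (factor 2) plus a fixed masking overhead.
SCALAR_FP_COST = 2  # BasicTTIImpl treats floats as twice as expensive as ints

def scalar_loop_cost(fp_ops, vf):
    return fp_ops * SCALAR_FP_COST * vf  # vf scalar iterations

def vector_loop_cost(fp_ops, both_branches, mask_overhead, vec_fp_cost):
    # Predication executes both sides of the if/else, plus masking.
    return fp_ops * both_branches * vec_fp_cost + mask_overhead

scalar = scalar_loop_cost(fp_ops=8, vf=4)                              # 64
before = vector_loop_cost(8, both_branches=2, mask_overhead=40,
                          vec_fp_cost=1)                               # 56
after = vector_loop_cost(8, both_branches=2, mask_overhead=40,
                         vec_fp_cost=2)                                # 72

assert before < scalar <= after  # the patch flips the decision to scalar
```

With the old cost of 1, the vectorized body (56) undercuts the scalar loop (64) despite the predication overhead; once vector FP ops cost at least as much as their scalar counterparts (72), the unprofitable vectorization is rejected.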
yuxuanchen1997 pushed a commit that referenced this issue Jul 25, 2024
…99594)
