[RISCV] VLS cost model issues #62576
@llvm/issue-subscribers-backend-risc-v
These all look fine to me. This is from a recent TOT snapshot. The fptoXi versions look slightly debatable, as we're discounting two vsetvlis rather than our normal 1, but otherwise I don't see obvious issues here.
Looks like I made a mistake here. I was acting on someone else's analysis in an internal bug report. Our tree is relatively up to date and those commits are from 6 months and a year ago, so I need to figure out where the disconnect was on our side.
…lvm#99594)

I was comparing some SPEC CPU 2017 benchmarks across rva22u64 and rva22u64_v, and noticed that in a few cases rva22u64_v was considerably slower. One of them was 519.lbm_r, which has a large loop that was being unprofitably vectorized. It has an if/else in the loop which requires large amounts of predication when vectorized, but despite the loop vectorizer taking this into account, the vector cost came out as cheaper than the scalar.

It looks like the reason for this is that we cost scalar floating point ops as 2, but their vector equivalents as 1 (for LMUL 1). This comes from how we use BasicTTIImpl for scalars, which treats floats as twice as expensive as integers. This patch doubles the cost of vector floating point arithmetic ops so that they're at least as expensive as their scalar counterparts, which gives a 13% speedup on 519.lbm_r at -O3 on the spacemit-x60.

Fixes llvm#62576 (the last point there, about scalar fsub/fmul)
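For context, here is a minimal IR sketch (my own, not taken from the patch) that can be fed to LLVM's cost-model printer to observe the scalar/vector discrepancy described above; the exact costs printed will depend on the LLVM version and target flags:

```llvm
; fmul-cost.ll -- hypothetical reproducer, not from the patch itself.
; Print per-instruction cost estimates with something like:
;   opt -mtriple=riscv64 -mattr=+v -passes='print<cost-model>' \
;       -disable-output fmul-cost.ll

define float @scalar_fmul(float %a, float %b) {
  ; BasicTTIImpl costs scalar FP arithmetic as 2
  %r = fmul float %a, %b
  ret float %r
}

define <4 x float> @vector_fmul(<4 x float> %a, <4 x float> %b) {
  ; before the patch this was costed 1 at LMUL 1, i.e. cheaper than scalar
  %r = fmul <4 x float> %a, %b
  ret <4 x float> %r
}
```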
These are issues we identified in our downstream; not sure if any have been fixed recently. (An IR sketch of the conversions follows the list.)

- <2 x i8> → <2 x float> is not costed as being 2 instructions, a vzext + vfwcvt.
- <2 x float> → <2 x i8> is not costed as being 2 instructions, a vfncvt + vnsrl.
- Scalar fmul/fsub float cost is 2, but the vector cost is 1.
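A minimal standalone sketch of the two conversions from the list, which could be run through the same cost-model printer as above; the function names are mine, and I picked the unsigned variants to match the vzext lowering mentioned in the first item:

```llvm
; conv-cost.ll -- illustrative only.

define <2 x float> @widen(<2 x i8> %v) {
  ; expected lowering: vzext.vf2 (i8 -> i16) + vfwcvt.f.xu.v (i16 -> f32)
  %f = uitofp <2 x i8> %v to <2 x float>
  ret <2 x float> %f
}

define <2 x i8> @narrow(<2 x float> %v) {
  ; expected lowering: vfncvt.rtz.xu.f.w (f32 -> i16) + vnsrl (i16 -> i8)
  %i = fptoui <2 x float> %v to <2 x i8>
  ret <2 x i8> %i
}
```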
cc: @preames @bubba