2x performance regression on nightly 2018-11-10 vs 2018-11-11 #56009
Comments
Extra info, in case it's relevant:
@ljedrz I noticed you did a lot of LLVM backend work on 2018-11-10. Do you know if this is related/expected?
I don't think so, these were mostly ergonomics.
@ljedrz thanks for taking a look.
I can see a different PR you could take a look at: #55650. It has a change related to integer shifts.
Taking the diff between the two nightly hashes (36a50c2...6e9b842), the only change affecting codegen in there (not counting emscripten) is #55650. And seeing how mohrezaei/num-integer@5be6ea1#diff-1d6fa51a8a2182204ba56d113464a194R80 uses `rotate_right`, this is quite likely indeed the culprit.
@nikic @ljedrz I think you're right. Edit: Apparently this only happens with `-C target-cpu=native` (I really wish `RUSTFLAGS` was not an env variable; it makes keeping track of these things very hard).
The optimizations happening here are really amazing. We have a simple piece of code:

```rust
fn is_pow10_u64(v: u64) -> bool {
    let mut hash: u32 = v as u32;
    hash ^= hash >> 3;
    hash = hash.rotate_right(1);
    hash ^= hash >> 23;
    v == POWER10_HASH_U64[(hash & 31) as usize]
}
```

The two versions produce very different assembly from this. 2018-11-10 does a very clever optimization (described below) and then manages to vectorize the optimized code. 2018-11-11 translates the code nearly verbatim and doesn't do any vectorization, just plain unrolling. This is the relevant assembly for that function, annotated.

before (11-10):

```asm
movq (%r10), %rdx
addq $8, %r10
movl %edx, %edi
shrl $3, %edi // hash >> 3
xorl %edx, %edi
movl %edi, %eax
shrl %eax // hash >> 1 !? see below
shrl $24, %edi // hash >> 24 !? see below
xorl %eax, %edi
andl $31, %edi // hash & 31
xorl %eax, %eax
cmpq %rdx, (%r8,%rdi,8)
sete %al
addl %eax, %ecx
cmpq %r10, %r9
```

after (11-11):

```asm
movq (%r11,%rdx,8), %rcx
movl %ecx, %esi
shrl $3, %esi // hash >> 3
xorl %ecx, %esi
rorl $1, %esi // hash.rotate_right(1)
movl %esi, %eax
shrl $23, %eax // hash >> 23
xorl %esi, %eax
andl $31, %eax // hash & 31
xorl %esi, %esi
cmpq %rcx, (%r9,%rax,8)
sete %sil
addl %esi, %edi
addq $1, %rdx
cmpq %rdx, %r8
```

You'll recognize the constants in the 11-11 output pretty easily, but the 11-10 output looks bizarre, as it doesn't have these constants. The optimization performed in 11-10, which is missing in 11-11, is rewriting the function like so:

```rust
fn is_pow10_u64(v: u64) -> bool {
    let mut hash: u32 = v as u32;
    hash ^= hash >> 3;
    hash = (hash >> 24) ^ (hash >> 1);
    v == POWER10_HASH_U64[(hash & 31) as usize]
}
```

Which is why there is no rotate or shift by 23. If I hand-optimize the method, both 11-10 and 11-11 produce the same speed via SIMD vectorization. Hope that helps narrow this down. That's as far as I can take the analysis.
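To make the demanded-bits reasoning behind that rewrite concrete: only `hash & 31` is ever consumed, so the rotate-then-shift chain and the relaxed two-shift form only need to agree on the low 5 bits. A quick self-contained check of that equivalence (the harness and the names `low5_rotate`/`low5_shifts` are mine, not from the original code):

```rust
// Low 5 bits of the hash as written, with rotate_right(1).
fn low5_rotate(v: u64) -> u32 {
    let mut hash: u32 = v as u32;
    hash ^= hash >> 3;
    hash = hash.rotate_right(1);
    hash ^= hash >> 23;
    hash & 31
}

// Low 5 bits of the hash as LLVM rewrote it on 11-10: the rotate is
// relaxed into plain shifts because only bits 0..5 are demanded.
fn low5_shifts(v: u64) -> u32 {
    let mut hash: u32 = v as u32;
    hash ^= hash >> 3;
    hash = (hash >> 24) ^ (hash >> 1);
    hash & 31
}

#[test]
fn rotate_relaxation_agrees() {
    // All 20 u64 powers of ten, plus a pseudo-random sweep of inputs.
    let mut p = 1u64;
    for _ in 0..20 {
        assert_eq!(low5_rotate(p), low5_shifts(p));
        p = p.wrapping_mul(10);
    }
    for v in (0..1_000_000u64).map(|i| i.wrapping_mul(0x9E37_79B9_7F4A_7C15)) {
        assert_eq!(low5_rotate(v), low5_shifts(v));
    }
}
```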
Okay, looks like demanded bits simplification is not performed for funnel shifts yet. I can submit a patch for LLVM, but as it's going to take some time until that propagates back to us, it's probably best to revert the use of funnel shift intrinsics for now. It should be enough to drop the calls in libcore; the rest of the implementation work can stay.
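Concretely, "dropping the calls" means rotates go back to the plain shift/or form. Here is a minimal sketch of that form for u32 (it mirrors the classic formulation, not necessarily the exact libcore source): LLVM still pattern-matches it into a hardware rotate in the backend, but, unlike the funnel shift intrinsic at the time, the middle end can also simplify it via demanded-bits analysis first.

```rust
// Rotate right without a funnel-shift intrinsic: two shifts and an OR.
// The `% 32` keeps both shift amounts in range (n == 0 would otherwise
// produce an overflowing `x << 32`).
fn rotate_right_u32(x: u32, n: u32) -> u32 {
    let n = n % 32;
    (x >> n) | (x << ((32 - n) % 32))
}

fn main() {
    // Agrees with the built-in method on a couple of spot checks.
    assert_eq!(rotate_right_u32(0xDEAD_BEEF, 1), 0xDEAD_BEEFu32.rotate_right(1));
    assert_eq!(rotate_right_u32(0xDEAD_BEEF, 0), 0xDEAD_BEEF);
}
```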
@nikic we definitely can update our LLVM mid-cycle, though I don’t think funnel shifts justify that.
I've submitted https://reviews.llvm.org/D54666 upstream. Unfortunately it's not actually able to handle the case reported here, because the rotate result has more than one use. Some support for funnel shift vectorization has also landed recently; not sure if the LLVM upgrade in #55835 has those changes yet.
Just a random thought here: would it make sense to have an optimization that converts rotates into shifts first (based on demanded bits)?
Never mind, I see the LLVM code is already meant to do that.
Does the original code for this still exist? I wanted to check whether the recent LLVM update mitigates this by vectorizing the rotation, but the current code in num-integer doesn't seem to use rotations anymore.
I did make a branch locally to preserve it. I've now pushed it to GitHub. You'll have to clone my fork and check out that branch.
https://reviews.llvm.org/D55563 should be the last piece to handle the case encountered here. I've also tested locally, and the last LLVM upgrade didn't change anything here wrt vectorization. Locally I can only reproduce a 20% rather than 2x difference; probably my CPU is too old.
The recent LLVM update pulled in the fix mentioned above. I've checked in the asm that the relaxation from rotate to shift does work now. Unfortunately I can no longer reproduce any perf difference between nightly-2018-11-10, nightly-2018-11-11 and the current nightly. Maybe you could give this another try and check whether you get comparable performance from nightly-2018-11-10 and the current one?
Looks good. Procedure to reproduce:
My results are as follows:
Thanks for following up on this and feel free to close the ticket when appropriate.
I noticed a 2x performance regression in the nightlies while writing new code for num-integer.
I bisected the regression to 2018-11-11.
To reproduce:
The first line in the benchmarks shows the regression. On my machine it goes from ~4300 ns/iter to ~9300 ns/iter.
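For context, here is a minimal sketch of the shape of such a benchmark. It is a reconstruction, not the actual bench from the num-integer branch: `hash32`, the table construction, and the input set are all my assumptions.

```rust
use std::hint::black_box;
use std::time::Instant;

// The hash from the analysis above; only the low 5 bits are used as an index.
fn hash32(v: u64) -> usize {
    let mut hash: u32 = v as u32;
    hash ^= hash >> 3;
    hash = hash.rotate_right(1);
    hash ^= hash >> 23;
    (hash & 31) as usize
}

fn main() {
    // Build the 32-entry lookup table: each u64 power of ten goes in the
    // slot its hash selects (assumes the hash is collision-free over the
    // powers of ten, which is the premise of the original table). Filler
    // slots hold 1, itself a power of ten, so lookups stay correct.
    let mut table = [1u64; 32];
    let mut p = 1u64;
    loop {
        table[hash32(p)] = p;
        match p.checked_mul(10) {
            Some(next) => p = next,
            None => break,
        }
    }

    let inputs: Vec<u64> = (1..1_000_000u64).collect();
    let start = Instant::now();
    let hits = inputs
        .iter()
        .filter(|&&v| v == table[hash32(black_box(v))])
        .count();
    println!("{} powers of ten found in {:?}", hits, start.elapsed());
}
```

Building with optimizations and `-C target-cpu=native` (the configuration from the discussion above) is what exposes the difference between the two nightlies.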