
2x performance regression on nightly 2018-11-10 vs 2018-11-11 #56009

Closed

mohrezaei opened this issue Nov 16, 2018 · 18 comments
Assignees
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@mohrezaei

mohrezaei commented Nov 16, 2018

I noticed a 2x performance regression in the nightlies while writing new code for num-integer.
I bisected the regression to 2018-11-11.

To reproduce:

rustup install nightly-2018-11-10
rustup install nightly-2018-11-11
git clone https://github.com/mohrezaei/num-integer
cd num-integer
set your RUSTFLAGS="-C target-cpu=native"
cargo +nightly-2018-11-10 bench benchp10_u64_10000_random
cargo +nightly-2018-11-11 bench benchp10_u64_10000_random

The first line in the benchmark output shows the regression. On my machine it goes from ~4300 ns/iter to ~9300 ns/iter.
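For readers without the repo handy, the shape of such a benchmark can be approximated with a plain timing loop. This is a hand-rolled sketch, not the crate's actual #[bench] harness; the xorshift generator, seed, and batch size are illustrative:

```rust
use std::time::Instant;

// Feed a batch of pseudo-random u64 values through a predicate and
// report how many nanoseconds the whole batch took -- roughly what
// `cargo bench` reports per iteration for a 10_000-element batch.
fn time_batch<F: Fn(u64) -> bool>(work: F, batch: usize) -> (u64, u128) {
    let mut x: u64 = 0x9E37_79B9_7F4A_7C15; // arbitrary nonzero seed
    let mut hits = 0u64; // keeps the loop's result observable
    let start = Instant::now();
    for _ in 0..batch {
        x ^= x << 13; // xorshift64 step
        x ^= x >> 7;
        x ^= x << 17;
        if work(x) {
            hits += 1;
        }
    }
    (hits, start.elapsed().as_nanos())
}

fn main() {
    let (hits, ns) = time_batch(|v| v.is_power_of_two(), 10_000);
    println!("{} hits, {} ns per batch", hits, ns);
}
```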

@mohrezaei
Author

Extra info, in case it's relevant:

Default host: x86_64-pc-windows-msvc

installed toolchains
--------------------

stable-x86_64-pc-windows-msvc
nightly-2018-11-10-x86_64-pc-windows-msvc
nightly-2018-11-11-x86_64-pc-windows-msvc
nightly-x86_64-pc-windows-msvc

active toolchain
----------------

stable-x86_64-pc-windows-msvc (default)
rustc 1.30.1 (1433507eb 2018-11-07)
> rustup run nightly-2018-11-10 rustc --version
rustc 1.32.0-nightly (36a50c29f 2018-11-09)
> rustup run nightly-2018-11-11 rustc --version
rustc 1.32.0-nightly (6e9b84296 2018-11-10)

@mohrezaei
Author

@ljedrz I noticed you did a lot of LLVM backend work on 2018-11-10. Do you know if this is related/expected?

@ljedrz
Contributor

ljedrz commented Nov 16, 2018

I don't think so, these were mostly ergonomics.

@mohrezaei
Author

@ljedrz thanks for taking a look.

@ljedrz
Contributor

ljedrz commented Nov 16, 2018

I can see a different PR you could take a look at: #55650. It has a change related to integer shifts.

@nikic
Contributor

nikic commented Nov 16, 2018

Taking the diff between the two nightly hashes: 36a50c2...6e9b842

The only change affecting codegen in there (not counting emscripten) is #55650. And seeing how mohrezaei/num-integer@5be6ea1#diff-1d6fa51a8a2182204ba56d113464a194R80 uses rotate_right, this is quite likely indeed the culprit.

@mohrezaei
Author

mohrezaei commented Nov 16, 2018

@nikic @ljedrz I think you're right. Edit: It's either that or an inlining issue. When I try --emit=asm, I get the same slow results because --emit=asm disables LTO. I'll have a closer look at rotate.

Apparently this only happens with -C target-cpu=native (I really wish RUSTFLAGS weren't an env variable; it makes keeping track of these things very hard).

@zackmdavis zackmdavis added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Nov 16, 2018
@mohrezaei
Author

mohrezaei commented Nov 16, 2018

The optimizations happening here are really amazing. We have a simple piece of code:

fn is_pow10_u64(v: u64) -> bool {
    let mut hash: u32 = v as u32;
    hash ^= hash >> 3;
    hash = hash.rotate_right(1);
    hash ^= hash >> 23;
    v == POWER10_HASH_U64[(hash & 31) as usize]
}

The two versions produce very different assembly from this. 2018-11-10 performs a very clever optimization (described below) and then manages to vectorize the optimized code. 2018-11-11 translates the code nearly verbatim and doesn't do any vectorization, just plain unrolling.
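The function above references a 32-entry POWER10_HASH_U64 table that isn't shown in the snippet. A self-contained reconstruction (my own sketch, with the table derived at runtime rather than copied from the crate) verifies that the hash is collision-free for the 20 powers of 10 that fit in a u64:

```rust
// Reconstruction of the benchmark's hot path; the table-building code
// is my own, only the hash steps come from the issue.

fn hash_index(v: u64) -> usize {
    let mut hash: u32 = v as u32;
    hash ^= hash >> 3;
    hash = hash.rotate_right(1);
    hash ^= hash >> 23;
    (hash & 31) as usize
}

// Build the 32-entry table: each power of 10 representable in a u64
// (10^0 through 10^19) lands in its own slot; unused slots stay 0.
fn build_table() -> [u64; 32] {
    let mut table = [0u64; 32];
    let mut p: u64 = 1;
    for _ in 0..20 {
        let idx = hash_index(p);
        assert_eq!(table[idx], 0, "hash collision"); // perfect-hash check
        table[idx] = p;
        p = p.wrapping_mul(10); // wraps only after the last power is stored
    }
    table
}

fn is_pow10_u64(table: &[u64; 32], v: u64) -> bool {
    v == table[hash_index(v)]
}

fn main() {
    let table = build_table();
    assert!(is_pow10_u64(&table, 1));
    assert!(is_pow10_u64(&table, 10_000_000_000));
    assert!(!is_pow10_u64(&table, 0));
    assert!(!is_pow10_u64(&table, 999));
    println!("perfect hash verified");
}
```

Since unused slots hold 0 and slot 0 is occupied by 10^0 = 1, a non-power input can never compare equal to its slot's entry, so the lookup is exact.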

This is the relevant assembly for that function, annotated:

before (11-10):
	movq	(%r10), %rdx
	addq	$8, %r10
	movl	%edx, %edi
	shrl	$3, %edi // hash >> 3
	xorl	%edx, %edi
	movl	%edi, %eax
	shrl	%eax // hash >> 1 !? see below
	shrl	$24, %edi // hash >> 24 !? see below
	xorl	%eax, %edi
	andl	$31, %edi // hash & 31
	xorl	%eax, %eax
	cmpq	%rdx, (%r8,%rdi,8)
	sete	%al
	addl	%eax, %ecx
	cmpq	%r10, %r9


after (11-11):

	movq	(%r11,%rdx,8), %rcx
	movl	%ecx, %esi
	shrl	$3, %esi // hash >> 3
	xorl	%ecx, %esi
	rorl	$1, %esi // hash.rotate_right(1)
	movl	%esi, %eax
	shrl	$23, %eax // hash >> 23
	xorl	%esi, %eax
	andl	$31, %eax // hash & 31
	xorl	%esi, %esi
	cmpq	%rcx, (%r9,%rax,8)
	sete	%sil
	addl	%esi, %edi
	addq	$1, %rdx
	cmpq	%rdx, %r8

You'll recognize the constants in the 11-11 output pretty easily, but the 11-10 output looks bizarre because it doesn't have these constants. The optimization performed in 11-10, which is missing in 11-11, is rewriting the function like so:

fn is_pow10_u64(v: u64) -> bool {
    let mut hash: u32 = v as u32;
    hash ^= hash >> 3;
    hash = (hash >> 24) ^ (hash >> 1);
    v == POWER10_HASH_U64[(hash & 31) as usize]
}

Which is why there is no rotate or shift-by-23 in the 11-10 assembly. If I hand-optimize the method this way, both 11-10 and 11-11 produce the same speed via SIMD vectorization.
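The equivalence of the two forms, for the 5 bits that survive the & 31, can be sanity-checked directly. This is my own check, not code from the crate:

```rust
// rotate_right(1) moves bit i+1 of `hash` to bit i, so bits 0..4 of the
// rotated value are (hash >> 1) & 31, and the bits 23..27 picked up by
// the `>> 23` are (hash >> 24) & 31. Bit 0 of `hash` lands in bit 31,
// which the 5-bit mask never sees -- hence the rewrite is exact.
fn index_rotate(mut hash: u32) -> u32 {
    hash = hash.rotate_right(1);
    hash ^= hash >> 23;
    hash & 31
}

fn index_shifts(hash: u32) -> u32 {
    ((hash >> 24) ^ (hash >> 1)) & 31
}

fn main() {
    let samples = [0u32, 1, 31, 32, 0x8000_0000, 0xFFFF_FFFF, 0xDEAD_BEEF];
    for &h in &samples {
        assert_eq!(index_rotate(h), index_shifts(h));
    }
    for h in 0..1_000_000u32 {
        assert_eq!(index_rotate(h), index_shifts(h));
    }
    println!("rewrite agrees on all tested inputs");
}
```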

Hope that helps narrow this down. That's as far as I can take the analysis.

@nikic
Contributor

nikic commented Nov 16, 2018

Okay, looks like demanded-bits simplification is not performed for funnel shifts yet. I can submit a patch for LLVM, but as it's going to take some time until that propagates back to us, it's probably best to revert the use of funnel shift intrinsics for now. It should be enough to drop the calls in libcore; the rest of the implementation work can stay.

@nagisa
Member

nagisa commented Nov 17, 2018

@nikic we definitely can update our LLVM mid-cycle, though I don’t think funnel shifts justify that.

@nikic
Contributor

nikic commented Nov 17, 2018

I've submitted https://reviews.llvm.org/D54666 upstream. Unfortunately it's not actually able to handle the case reported here, because the rotate result has more than one use.

Some support for funnel shift vectorization has also landed recently, not sure if the LLVM upgrade in #55835 has those changes yet.

@mohrezaei
Author

Just a random thought here: would it make sense to have an optimization that converts rotates into shifts first (based on demanded bits)?

@mohrezaei
Author

Never mind, I see the LLVM code is already meant to do that.

@nikic
Contributor

nikic commented Nov 29, 2018

Does the original code for this still exist? I wanted to check whether the recent LLVM update mitigates this by vectorizing the rotation, but the current code in num-integer doesn't seem to use rotations anymore.

@mohrezaei
Author

I did make a branch locally to preserve it, and I've now pushed it to GitHub. You'll have to clone my fork and run git checkout 5be6ea1.

@nikic nikic added the A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. label Dec 15, 2018
@nikic
Contributor

nikic commented Dec 15, 2018

https://reviews.llvm.org/D55563 should be the last piece to handle the case encountered here.

I've also tested locally, and the last LLVM upgrade didn't change anything here with respect to vectorization. Locally I can only reproduce a 20% rather than a 2x difference; probably my CPU is too old.

@nikic
Contributor

nikic commented Jan 30, 2019

The recent LLVM update pulled in the fix mentioned above. I've checked in the asm that the relaxation from rotate to shift does work now.

Unfortunately I can no longer reproduce any perf difference between nightly-2018-11-10, nightly-2018-11-11 and current nightly using -C target-cpu=native. I remember seeing a difference when I tried this before, but now it's all the same.

Maybe you could give this another try and check whether you get comparable performance from nightly-2018-11-10 and the current one?

@mohrezaei
Author

Looks good. Procedure to reproduce:

rustup install nightly-2018-11-10
rustup install nightly-2018-11-11
rustup install nightly-2019-01-29
git clone https://github.com/mohrezaei/num-integer
cd num-integer
git checkout 5be6ea1
set your RUSTFLAGS="-C target-cpu=native"
cargo +nightly-2018-11-10 bench benchp10_u64_10000_random
cargo +nightly-2018-11-11 bench benchp10_u64_10000_random
cargo +nightly-2019-01-29 bench benchp10_u64_10000_random

My results are as follows:

2018-11-10:
test benchp10_u64_10000_random        ... bench:       4,176 ns/iter (+/- 53)

2018-11-11:
test benchp10_u64_10000_random        ... bench:       8,323 ns/iter (+/- 14)

2019-01-29:
test benchp10_u64_10000_random        ... bench:       4,241 ns/iter (+/- 26)

Thanks for following up on this and feel free to close the ticket when appropriate.
😄

@nagisa nagisa closed this as completed Jan 30, 2019