
2x performance regression on nightly 2018-11-10 vs 2018-11-11 #56009

Closed

mohrezaei opened this issue Nov 16, 2018 · 18 comments
Assignees
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@mohrezaei

mohrezaei commented Nov 16, 2018

I noticed a 2x performance regression in the nightlies while writing new code for num-integer.
I bisected the regression to 2018-11-11.

To reproduce:

rustup install nightly-2018-11-10
rustup install nightly-2018-11-11
git clone https://github.com/mohrezaei/num-integer
cd num-integer
set your RUSTFLAGS="-C target-cpu=native"
cargo +nightly-2018-11-10 bench benchp10_u64_10000_random
cargo +nightly-2018-11-11 bench benchp10_u64_10000_random

The first line in the benchmark output shows the regression. On my machine it goes from ~4300 ns/iter to ~9300 ns/iter.
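For readers without the repo handy, the shape of such a benchmark can be approximated with a plain timing loop. This is a hand-rolled sketch, not the crate's actual #[bench] harness; the xorshift generator, seed, and batch size are illustrative:

```rust
use std::time::Instant;

// Feed a batch of pseudo-random u64 values through a predicate and
// report how many nanoseconds the whole batch took -- roughly what
// `cargo bench` reports per iteration for a 10_000-element batch.
fn time_batch<F: Fn(u64) -> bool>(work: F, batch: usize) -> (u64, u128) {
    let mut x: u64 = 0x9E37_79B9_7F4A_7C15; // arbitrary nonzero seed
    let mut hits = 0u64; // keeps the loop's result observable
    let start = Instant::now();
    for _ in 0..batch {
        x ^= x << 13; // xorshift64 step
        x ^= x >> 7;
        x ^= x << 17;
        if work(x) {
            hits += 1;
        }
    }
    (hits, start.elapsed().as_nanos())
}

fn main() {
    let (hits, ns) = time_batch(|v| v.is_power_of_two(), 10_000);
    println!("{} hits, {} ns per batch", hits, ns);
}
```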

@mohrezaei
Author

Extra info, in case it's relevant:

Default host: x86_64-pc-windows-msvc

installed toolchains
--------------------

stable-x86_64-pc-windows-msvc
nightly-2018-11-10-x86_64-pc-windows-msvc
nightly-2018-11-11-x86_64-pc-windows-msvc
nightly-x86_64-pc-windows-msvc

active toolchain
----------------

stable-x86_64-pc-windows-msvc (default)
rustc 1.30.1 (1433507eb 2018-11-07)
> rustup run nightly-2018-11-10 rustc --version
rustc 1.32.0-nightly (36a50c29f 2018-11-09)
> rustup run nightly-2018-11-11 rustc --version
rustc 1.32.0-nightly (6e9b84296 2018-11-10)

@mohrezaei
Author

@ljedrz I noticed you did a lot of LLVM backend work on 2018-11-10. Do you know if this is related/expected?

@ljedrz
Contributor

ljedrz commented Nov 16, 2018

I don't think so, these were mostly ergonomics.

@mohrezaei
Author

@ljedrz thanks for taking a look.

@ljedrz
Contributor

ljedrz commented Nov 16, 2018

I can see a different PR you could take a look at: #55650. It has a change related to integer shifts.

@nikic
Contributor

nikic commented Nov 16, 2018

Taking the diff between the two nightly hashes: 36a50c2...6e9b842

The only change affecting codegen in there (not counting emscripten) is #55650. And seeing how mohrezaei/num-integer@5be6ea1#diff-1d6fa51a8a2182204ba56d113464a194R80 uses rotate_right, this is quite likely indeed the culprit.

@mohrezaei
Author

mohrezaei commented Nov 16, 2018

@nikic @ljedrz I think you're right. Edit: It's either that or an inlining issue. When I try --emit=asm, I get the same slow results because --emit=asm disables LTO. I'll have a closer look at rotate.

Apparently this only happens with -C target-cpu=native (I really wish RUSTFLAGS weren't an env variable; it makes keeping track of these things very hard).

@zackmdavis zackmdavis added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Nov 16, 2018
@mohrezaei
Author

mohrezaei commented Nov 16, 2018

The optimizations happening here are really amazing. We have a simple piece of code:

fn is_pow10_u64(v: u64) -> bool {
    let mut hash: u32 = v as u32;
    hash ^= hash >> 3;
    hash = hash.rotate_right(1);
    hash ^= hash >> 23;
    v == POWER10_HASH_U64[(hash & 31) as usize]
}

The two versions produce very different assembly from this. 2018-11-10 performs a very clever optimization (described below) and then manages to vectorize the optimized code. 2018-11-11 translates the code nearly verbatim and doesn't do any vectorization, just plain unrolling.
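The function above references a 32-entry POWER10_HASH_U64 table that isn't shown in the snippet. A self-contained reconstruction (my own sketch, with the table derived at runtime rather than copied from the crate) verifies that the hash is collision-free for the 20 powers of 10 that fit in a u64:

```rust
// Reconstruction of the benchmark's hot path; the table-building code
// is my own, only the hash steps come from the issue.

fn hash_index(v: u64) -> usize {
    let mut hash: u32 = v as u32;
    hash ^= hash >> 3;
    hash = hash.rotate_right(1);
    hash ^= hash >> 23;
    (hash & 31) as usize
}

// Build the 32-entry table: each power of 10 representable in a u64
// (10^0 through 10^19) lands in its own slot; unused slots stay 0.
fn build_table() -> [u64; 32] {
    let mut table = [0u64; 32];
    let mut p: u64 = 1;
    for _ in 0..20 {
        let idx = hash_index(p);
        assert_eq!(table[idx], 0, "hash collision"); // perfect-hash check
        table[idx] = p;
        p = p.wrapping_mul(10); // wraps only after the last power is stored
    }
    table
}

fn is_pow10_u64(table: &[u64; 32], v: u64) -> bool {
    v == table[hash_index(v)]
}

fn main() {
    let table = build_table();
    assert!(is_pow10_u64(&table, 1));
    assert!(is_pow10_u64(&table, 10_000_000_000));
    assert!(!is_pow10_u64(&table, 0));
    assert!(!is_pow10_u64(&table, 999));
    println!("perfect hash verified");
}
```

Since unused slots hold 0 and slot 0 is occupied by 10^0 = 1, a non-power input can never compare equal to its slot's entry, so the lookup is exact.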

This is the relevant assembly for that function, annotated:

before (11-10):
	movq	(%r10), %rdx
	addq	$8, %r10
	movl	%edx, %edi
	shrl	$3, %edi // hash >> 3
	xorl	%edx, %edi
	movl	%edi, %eax
	shrl	%eax // hash >> 1 !? see below
	shrl	$24, %edi // hash >> 24 !? see below
	xorl	%eax, %edi
	andl	$31, %edi // hash & 31
	xorl	%eax, %eax
	cmpq	%rdx, (%r8,%rdi,8)
	sete	%al
	addl	%eax, %ecx
	cmpq	%r10, %r9


after (11-11):

	movq	(%r11,%rdx,8), %rcx
	movl	%ecx, %esi
	shrl	$3, %esi // hash >> 3
	xorl	%ecx, %esi
	rorl	$1, %esi // hash.rotate_right(1)
	movl	%esi, %eax
	shrl	$23, %eax // hash >> 23
	xorl	%esi, %eax
	andl	$31, %eax // hash & 31
	xorl	%esi, %esi
	cmpq	%rcx, (%r9,%rax,8)
	sete	%sil
	addl	%esi, %edi
	addq	$1, %rdx
	cmpq	%rdx, %r8

You'll recognize the constants in the 11-11 output pretty easily, but the 11-10 output looks bizarre because it doesn't have these constants. The optimization performed in 11-10, which is missing in 11-11, is rewriting the function like so:

fn is_pow10_u64(v: u64) -> bool {
    let mut hash: u32 = v as u32;
    hash ^= hash >> 3;
    hash = (hash >> 24) ^ (hash >> 1);
    v == POWER10_HASH_U64[(hash & 31) as usize]
}

Which is why there is no rotate or shift-by-23 in the 11-10 assembly. If I hand-optimize the method this way, both 11-10 and 11-11 produce the same speed via SIMD vectorization.
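The equivalence of the two forms, for the 5 bits that survive the & 31, can be sanity-checked directly. This is my own check, not code from the crate:

```rust
// rotate_right(1) moves bit i+1 of `hash` to bit i, so bits 0..4 of the
// rotated value are (hash >> 1) & 31, and the bits 23..27 picked up by
// the `>> 23` are (hash >> 24) & 31. Bit 0 of `hash` lands in bit 31,
// which the 5-bit mask never sees -- hence the rewrite is exact.
fn index_rotate(mut hash: u32) -> u32 {
    hash = hash.rotate_right(1);
    hash ^= hash >> 23;
    hash & 31
}

fn index_shifts(hash: u32) -> u32 {
    ((hash >> 24) ^ (hash >> 1)) & 31
}

fn main() {
    let samples = [0u32, 1, 31, 32, 0x8000_0000, 0xFFFF_FFFF, 0xDEAD_BEEF];
    for &h in &samples {
        assert_eq!(index_rotate(h), index_shifts(h));
    }
    for h in 0..1_000_000u32 {
        assert_eq!(index_rotate(h), index_shifts(h));
    }
    println!("rewrite agrees on all tested inputs");
}
```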

Hope that helps narrow this down. That's as far as I can take the analysis.

@nikic
Contributor

nikic commented Nov 16, 2018

Okay, looks like demanded-bits simplification is not performed for funnel shifts yet. I can submit a patch for LLVM, but as it's going to take some time until that propagates back to us, it's probably best to revert the use of funnel shift intrinsics for now. It should be enough to drop the calls in libcore; the rest of the implementation work can stay.

@nagisa
Member

nagisa commented Nov 17, 2018

@nikic we definitely can update our LLVM mid-cycle, though I don’t think funnel shifts justify that.

@nikic
Contributor

nikic commented Nov 17, 2018

I've submitted https://reviews.llvm.org/D54666 upstream. Unfortunately it's not actually able to handle the case reported here, because the rotate result has more than one use.

Some support for funnel shift vectorization has also landed recently, not sure if the LLVM upgrade in #55835 has those changes yet.

@mohrezaei
Author

Just a random thought here: would it make sense to have an optimization that converts rotates into shifts first (based on demanded bits)?

@mohrezaei
Author

Never mind, I see the LLVM code is already meant to do that.

@nikic
Contributor

nikic commented Nov 29, 2018

Does the original code for this still exist? I wanted to check whether the recent LLVM update mitigates this by vectorizing the rotation, but the current code in num-integer doesn't seem to use rotations anymore.

@mohrezaei
Author

I did make a branch locally to preserve it, and I've now pushed it to GitHub. You'll have to clone my fork and run git checkout 5be6ea1.

@nikic nikic added the A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. label Dec 15, 2018
@nikic
Contributor

nikic commented Dec 15, 2018

https://reviews.llvm.org/D55563 should be the last piece to handle the case encountered here.

I've also tested locally, and the last LLVM upgrade didn't change anything here with respect to vectorization. Locally I can only reproduce a 20% rather than a 2x difference; probably my CPU is too old.

@nikic
Contributor

nikic commented Jan 30, 2019

The recent LLVM update pulled in the fix mentioned above. I've checked in the asm that the relaxation from rotate to shift does work now.

Unfortunately I can no longer reproduce any perf difference between nightly-2018-11-10, nightly-2018-11-11 and current nightly using -C target-cpu=native. I remember seeing a difference when I tried this before, but now it's all the same.

Maybe you could give this another try and check whether you get comparable performance from nightly-2018-11-10 and the current one?

@mohrezaei
Author

Looks good. Procedure to reproduce:

rustup install nightly-2018-11-10
rustup install nightly-2018-11-11
rustup install nightly-2019-01-29
git clone https://github.com/mohrezaei/num-integer
cd num-integer
git checkout 5be6ea1
set your RUSTFLAGS="-C target-cpu=native"
cargo +nightly-2018-11-10 bench benchp10_u64_10000_random
cargo +nightly-2018-11-11 bench benchp10_u64_10000_random
cargo +nightly-2019-01-29 bench benchp10_u64_10000_random

My results are as follows:

2018-11-10:
test benchp10_u64_10000_random        ... bench:       4,176 ns/iter (+/- 53)

2018-11-11:
test benchp10_u64_10000_random        ... bench:       8,323 ns/iter (+/- 14)

2019-01-29:
test benchp10_u64_10000_random        ... bench:       4,241 ns/iter (+/- 26)

Thanks for following up on this and feel free to close the ticket when appropriate.
😄

@nagisa nagisa closed this as completed Jan 30, 2019