Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Newtype impedes vectorization #24963

Closed
bluss opened this issue Apr 29, 2015 · 9 comments
Closed

Newtype impedes vectorization #24963

bluss opened this issue Apr 29, 2015 · 9 comments
Labels
A-codegen Area: Code generation

Comments

@bluss
Copy link
Member

bluss commented Apr 29, 2015

In the following example, the newtype is not zero-cost in practice since it seems to impede optimizations in llvm. The plain u32 sum vectorizes while Foo(u32) does not. The newtype fold needs 3 times the runtime of the plain u32 fold.

rustc version: rustc 1.1.0-nightly (97d4e76 2015-04-27) (built 2015-04-28)

code (playpen link)

(The code has been updated)

#![crate_type="lib"]
#![feature(test)]
extern crate test;

#[inline(never)]
pub fn folds(x: &[u32]) -> u32 { x.iter().fold(0, |a, &b| a + b) }

#[derive(Copy, Clone)]
pub struct Foo<T>(T);

#[inline(never)]
pub fn folds_foo(x: &[Foo<u32>]) -> Foo<u32> { x.iter().fold(Foo(0), |a, &b| Foo(a.0 + b.0)) }

#[bench]
fn folds1(b: &mut test::Bencher)
{
    let xs = test::black_box(vec![1; 1024]);
    b.iter(|| {
        folds(&xs)
    })
}

#[bench]
fn folds2(b: &mut test::Bencher)
{
    let xs = test::black_box(vec![Foo(1); 1024]);
    b.iter(|| {
        folds_foo(&xs)
    })
}

bench results vary with compilation setting

// rustc -C opt-level=3 --test

running 2 tests
test folds1 ... bench:       206 ns/iter (+/- 5)
test folds2 ... bench:       609 ns/iter (+/- 6)

// rustc -C opt-level=3 -C target-cpu=corei7-avx --test
running 2 tests
test folds1 ... bench:       131 ns/iter (+/- 1)
test folds2 ... bench:       192 ns/iter (+/- 3)
@Stebalien
Copy link
Contributor

That second one should be |a, &b| which is slightly faster but still 3x slower on my machine. Destructuring (|Foo(a), &Foo(b)| Foo(a + b)) is the same as |a, b| and slower than |a, &b|. Also, getting rid of the generic doesn't help.

@sanxiyn sanxiyn added the A-codegen Area: Code generation label Apr 30, 2015
@bluss
Copy link
Member Author

bluss commented Apr 30, 2015

I agree, that's what I was intending @Stebalien, but it shouldn't affect codegen either. When I tried it doesn't affect benchmarks. Vectorization is cool btw, compiling with corei7-avx decreases the plain u32 fold's runtime even more, increasing the benchmark difference. :-)

@bluss
Copy link
Member Author

bluss commented Apr 30, 2015

Updated code & bench.

@Aatch
Copy link
Contributor

Aatch commented May 7, 2015

Ah goddammit, I actually wrote a patch that unwrapped newtypes, at least for simple cases (which includes this one), but abandoned it because I couldn't see a noticable difference in optimisation/performance.

@bluss
Copy link
Member Author

bluss commented May 7, 2015

Not vectorizing isn't actually correct. It just does it less efficiently. That's some puzzle to try to understand.

@bluss
Copy link
Member Author

bluss commented May 25, 2015

cc @dotdash if you have time & interest

@bluss
Copy link
Member Author

bluss commented Mar 25, 2016

Triage: Still an issue with rustc 1.9.0-nightly (98f0a9128 2016-03-23)

@Mark-Simulacrum
Copy link
Member

I feel like this is fixed judging by the bench results below. Please reopen if that's not the case, preferably with a summary of what we should be looking for to close this issue.

Without target-cpu:

$ ./test --bench

running 2 tests
test folds1 ... bench:          44 ns/iter (+/- 8)
test folds2 ... bench:          45 ns/iter (+/- 2)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured

and with:

$ rustc -Copt-level=3 -C target-cpu=corei7-avx --test test.rs
$ ./test --bench

running 2 tests
test folds1 ... bench:          40 ns/iter (+/- 0)
test folds2 ... bench:          40 ns/iter (+/- 0)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured

@bluss
Copy link
Member Author

bluss commented May 2, 2017

Nice! I can confirm that too.

According to playpen right now, not fixed in stable not fixed in rustc 1.18.0-beta.1 (4dce67253 2017-04-25)

But it is fixed in rustc 1.19.0-nightly (777ee2079 2017-05-01)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-codegen Area: Code generation
Projects
None yet
Development

No branches or pull requests

5 participants