strong performance issue when using struct instead of multiple variables #85354

Open
ZuseZ4 opened this issue May 16, 2021 · 23 comments
Labels
C-enhancement: An issue proposing an enhancement or a PR with one.
C-optimization: An issue highlighting optimization opportunities or PRs implementing such.
I-slow: Problems and improvements with respect to performance of generated code.
T-compiler: Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@ZuseZ4
Contributor

ZuseZ4 commented May 16, 2021

I ran into an issue while working on a prime sieve.
I grouped three related variables (two arrays and an integer) into a struct to have fewer variables being passed around.
While profiling I noticed that the execution time went up from 2.8s to 7s when using the struct.
Based on some hints on Discord I gather that shouldn't be the case: https://discord.com/channels/273534239310479360/273541522815713281/843272480852672513

Here is my actual struct:

pub struct Aux {
    pub sieve: [u64;_AUX_SIEVE_WORDS_ as usize], 
    pub primes: [u32;_NUMBER_OF_AUX_PRIMES_],
    pub base: u32,
}

I created a GitHub repo showing the different versions: https://github.com/ZuseZ4/perf_comp (the signature differences are sketched below the branch list)

Branch with_struct: I'm passing the struct by ref => 7s
Branch no_struct: I'm passing both arrays by ref, the integer by value => 2.8s
Branch no_struct_ref: I'm passing both arrays and the integer by ref => 2.8s
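
For clarity, a minimal sketch of the three signatures the branches compare (illustrative only, not the actual repo code; the do_work_* names are made up, element types are shown as slices for brevity, and Aux is the struct above):

// with_struct: everything behind one &mut Aux
fn do_work_struct(aux: &mut Aux) { /* ... */ }

// no_struct: both arrays by reference, the integer by value
fn do_work_no_struct(sieve: &mut [u64], primes: &[u32], base: u32) { /* ... */ }

// no_struct_ref: both arrays and the integer by reference
fn do_work_no_struct_ref(sieve: &mut [u64], primes: &[u32], base: &u32) { /* ... */ }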

Setup:
I use release mode, lto = true, and target-cpu = native. (The last two seem to have no effect.)
rustc 1.52.1 (9bc8c42 2021-05-09)
(1.54.0-nightly (fe72845 2021-05-16) gives the same performance)
Ubuntu 20.04 (5.4.0-73-generic)
AMD® Athlon 200ge with 2x16 GB DDR4-3200

I could understand that an extra layer of indirection, and therefore (if not removed by optimizations) an extra pointer dereference, could cause some performance cost, but I'm surprised by the size of the difference.

Did I miss something?

@leonardo-m

It could be a good idea to minimize the code that shows the problem (and copy it here, if you want).

@ZuseZ4
Contributor Author

ZuseZ4 commented May 16, 2021

The GitHub version is already quite minimized @leonardo-m, but I have now also removed all assertions and a bit of the initialization.
Removing more code led to execution times that were either too short or too long to measure.

Side note: the compiler doesn't notice that I'm no longer using the results, so the entire calculation could be removed.
This doesn't change when removing all time measurements / println calls.
However, that's probably less interesting than the struct issue (see the black_box note after the listing).

On my PC I now get execution times of ~4s with the struct vs 1.3s without the struct:

Edit: I found some more things to remove; the relevant update_aux_sieve function is now only 6 LOC.
I hope that's small enough to be valuable.

use std::time::Instant;

pub struct Aux {
    pub sieve: [u64;1 << 12],
    pub primes: [u32;6536],
}

pub const MARK_MASK: [u64;64] = {
    let mut res = [0; 64];
    let mut i = 0;
    while i < 64 {
        res[i] = 1 << i;
        i += 1;
    }
    res
};

#[inline(always)]
fn mark_2(arr: &mut [u64], index: usize) { arr[index >> 6] |= MARK_MASK[index & 63usize]; }
#[inline]
const fn test_2(arr: &[u64], index: usize) -> bool { arr[index >> 6] & MARK_MASK[index & 63usize] == 0 }

#[inline]
pub fn update_aux_sieve(aux_sieve: &mut [u64], aux_primes: &[u32]) {
    for i in 0..6536 {
        let mut j = aux_primes[i] * aux_primes[i];
        while j < (1 << 18) { 
            mark_2(aux_sieve, j as usize);
            j += aux_primes[i];
        }
    }
}

#[inline]
pub fn update_aux_struct_sieve(aux: &mut Aux) {
    for i in 0..6536 {
        let mut j = aux.primes[i] * aux.primes[i];
        while j < (1 << 18) { 
            mark_2(&mut aux.sieve, j as usize);
            j += aux.primes[i];
        }
    }
}


pub fn init_aux_primes() -> ([u64;1 << 12], [u32;6536]) {
    let mut aux_sieve:  [u64;1 << 12] = [0;1 << 12];
    let mut aux_primes: [u32;6536] = [1;6536];
    let start = Instant::now();
    println!("Initializing aux primes");
    for i in (3..256).step_by(2) {
        if test_2(&mut aux_sieve, i >> 1usize) {
            for j in (((i*i) >> 1)..32768).step_by(i) {
                mark_2(&mut aux_sieve, j as usize);
            }
        }
    }
    let mut j = 0;
    for i in 8..32768 {
        if test_2(&mut aux_sieve, i) {
            aux_primes[j] = 2 * (i as u32) + 1;
            j += 1;
        }
    }

    let end = Instant::now();
    let duration = end.duration_since(start);
    eprintln!("init_aux_primes took, {:?}\n", duration);
    (aux_sieve, aux_primes)
}

pub fn main() {

    let (sieve, primes)  = init_aux_primes();
    let mut aux = Aux { sieve, primes };

    let start = Instant::now();
    for _ in 0..8192 {
        //update_aux_struct_sieve(&mut aux); // slow
        update_aux_sieve(&mut aux.sieve, &aux.primes); // fast
    }
    let end = Instant::now();
    let duration = end.duration_since(start);
    eprintln!("duration {:?}\n", duration);
}
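
Regarding the side note above about unused results: a common way to make sure a benchmark result cannot be optimized away is to pass it through std::hint::black_box (stable since Rust 1.66; on the 2021 toolchains used here it was still unstable). A minimal, self-contained sketch, not part of the original program:

use std::hint::black_box;
use std::time::Instant;

fn main() {
    let mut data = [0u64; 1 << 12];
    let start = Instant::now();
    for i in 0usize..(1 << 18) {
        data[(i >> 6) & 0xFFF] |= 1u64 << (i & 63);
    }
    // black_box hides `data` from the optimizer, so the writes above
    // cannot be proven dead and deleted.
    black_box(&data);
    eprintln!("took {:?}", start.elapsed());
}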

@hellow554
Contributor

For me the runtime does not differ... neither with your original example, nor with your second one.

945.0161ms vs 954.0564ms

What compiler version do you use (rustc --version --verbose)?

@ZuseZ4
Contributor Author

ZuseZ4 commented May 17, 2021

I'm on the latest stable version:

rustc 1.52.1 (9bc8c42 2021-05-09)
binary: rustc
commit-hash: 9bc8c42
commit-date: 2021-05-09
host: x86_64-unknown-linux-gnu
release: 1.52.1
LLVM version: 12.0.0

@hellow554
Contributor

OK. What exactly do you do to reproduce the issue?
Please give step-by-step instructions, from cloning the repo to seeing the actual behavior, and please use a clean directory for that (ideally cd $(mktemp -d)).

@ZuseZ4
Contributor Author

ZuseZ4 commented May 17, 2021

Sure, I created a new branch for that. I haven't used mktemp before, but I created a fresh directory. Inside it I ran:

git clone https://github.com/ZuseZ4/perf_comp
cd perf_comp
git checkout simple_example
cargo run --release 
vim src/main.rs // now call the fast path on line 82 instead of the slow on line 81.
cargo run --release // the exec time changed from 4s to 1.3s  

@hellow554
Contributor

hellow554 commented May 17, 2021

(simple_example)> cargo run --release
   Compiling perf_comp v0.1.0 (/tmp/tmp.tNq5F6nEPZ/perf_comp)
    Finished release [optimized] target(s) in 0.51s
     Running `target/release/perf_comp`
Initializing aux primes
init_aux_primes took, 112.9µs

duration 950.4344ms

(simple_example)> vim src/main.rs
(simple_example)> cargo run --release
   Compiling perf_comp v0.1.0 (/tmp/tmp.tNq5F6nEPZ/perf_comp)
    Finished release [optimized] target(s) in 0.27s
     Running `target/release/perf_comp`
Initializing aux primes
init_aux_primes took, 111.4µs

duration 913.6341ms

I'm sorry.... I can't reproduce your issue... :(

@ZuseZ4
Contributor Author

ZuseZ4 commented May 17, 2021

Sure, I updated it.
I also just checked with mktemp and still get the same numbers.
Unfortunately I couldn't test it on a server due to an outdated Rust version.

/t/t/perf_comp (simple_example)> cargo run --release
   Compiling perf_comp v0.1.0 (/tmp/tmp.el9PvYMGpW/perf_comp)
    Finished release [optimized] target(s) in 0.89s
     Running `target/release/perf_comp`
Initializing aux primes
init_aux_primes took, 310.209µs

duration 4.192843814s

/t/t/perf_comp (simple_example)> vim src/main.rs
/t/t/perf_comp (simple_example)> cargo run --release
   Compiling perf_comp v0.1.0 (/tmp/tmp.el9PvYMGpW/perf_comp)
    Finished release [optimized] target(s) in 0.43s
     Running `target/release/perf_comp`
Initializing aux primes
init_aux_primes took, 240.856µs

duration 1.472958919s

Nightly gives me the same performance.

That's awkward, but thanks for testing!
I guess I will need to wait for someone else to see whether it's just a local issue.

@ZuseZ4
Contributor Author

ZuseZ4 commented May 17, 2021

https://play.rust-lang.org/?version=stable&mode=release&edition=2018&gist=19862feadf5a335ae4765e31d0a9c1fd

We probably don't want to use the playground for serious benchmarking; at least for me the execution time varies greatly there.
However, slow and fast come out roughly equal on it.
On my own PC I get:


/t/t/perf_comp> cargo +nightly run --release  
   Compiling perf_comp v0.1.0 (/tmp/tmp.el9PvYMGpW/perf_comp)
    Finished release [optimized] target(s) in 0.94s
     Running `target/release/perf_comp`
Initializing aux primes
init_aux_primes took, 159.632µs

slow 2.137172522s

fast 739.105792ms

@hellow554
Contributor

I'm calling in @jonas-schievink for further action. Could you add the correct tags, and do you think it's worth calling the ICEBreaker-Cleanup-Crew?

@jonas-schievink
Contributor

Hmm, the slow version is consistently slower for me (on 1.52.0 as well as on rustc 1.54.0-nightly (881c1ac40 2021-05-08)). Here are the timings on 1.52.0, with -O:

Slow:

~> ./sieve
Initializing aux primes
init_aux_primes took, 105.982µs

duration 2.959215981s

~> ./sieve
Initializing aux primes
init_aux_primes took, 212.863µs

duration 2.948748385s

Fast:

~> ./sieve
Initializing aux primes
init_aux_primes took, 108.476µs

duration 695.428093ms

~> ./sieve
Initializing aux primes
init_aux_primes took, 214.827µs

duration 706.368691ms

@jonas-schievink jonas-schievink added C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels May 17, 2021
@jonas-schievink
Contributor

(this is with the code in #85354 (comment), compiled directly with rustc)

@ZuseZ4
Contributor Author

ZuseZ4 commented Jun 6, 2021

I added two simplified functions, 1a taking the arrays directly and 1b taking the struct.
They only access one of the two arrays, and there the compiler was able to achieve the same performance for both.

I also ran flamegraph on 2a and 2b.
2b (taking the struct) is roughly 4x slower than 2a.
2a spends 100% of its time in mark_2(..).
2b spends 50% of its time in mark_2(..) (in absolute numbers still twice as long as 2a).
2b spends the other 50% in its own loop body (something that takes no time in 2a).

Removing the always from #[inline(always)] on mark_2 does not affect 2a, but increases the runtime of 2b by roughly 30%.

I tried looking at the generated code, but I wasn't really able to figure out which building block belongs where.
https://play.rust-lang.org/?version=stable&mode=release&edition=2018&gist=a47d269d93e1a85ecb65577a60e47544

use std::time::Instant;

#[derive(Clone)]
pub struct Aux {
    pub sieve: [u64;1 << 12],
    pub primes: [u32;6536],
}

pub const MARK_MASK: [u64;64] = {
    let mut res = [0; 64];
    let mut i = 0;
    while i < 64 {
        res[i] = 1 << i;
        i += 1;
    }
    res
};

#[inline(always)]
fn mark_2(arr: &mut [u64], index: usize) { arr[index >> 6] |= MARK_MASK[index & 63usize]; }

#[inline]
pub fn foo_1a(aux_sieve: &mut [u64], _aux_primes: &[u32]) {
    for j in 0..(1<<18) {
      mark_2(aux_sieve, j as usize);
    }
}

#[inline]
pub fn foo_1b(aux: &mut Aux) {
    for j in 0..(1<<18) {
      mark_2(&mut aux.sieve, j as usize);
    }
}

#[inline]
pub fn foo_2a(aux_sieve: &mut [u64], aux_primes: &[u32]) {
    for i in 0..6536 {
        let mut j = aux_primes[i] * aux_primes[i];
        while j < (1 << 18) {
            mark_2(aux_sieve, j as usize);
            j += aux_primes[i];
        }
    }
}

#[inline]
pub fn foo_2b(aux: &mut Aux) { // SLOW
    for i in 0..6536 {
        let mut j = aux.primes[i] * aux.primes[i];
        while j < (1 << 18) {
            mark_2(&mut aux.sieve, j as usize);
            j += aux.primes[i];
        }
    }
}

pub fn init_aux_primes() -> ([u64;1 << 12], [u32;6536]) {
    
    let mut aux_sieve:  [u64;1 << 12] = [0;1 << 12];
    for i in (3..256).step_by(2) {
        for j in (((i*i) >> 1)..32768).step_by(i) {
            mark_2(&mut aux_sieve, j as usize);
        }
    }
    
    let mut aux_primes: [u32;6536] = [1;6536];
    for j in 0..6536 {
      aux_primes[j] = 2 * (j as u32) + 1;
    }

    (aux_sieve, aux_primes)
}

pub fn main() {

    let (sieve, primes)  = init_aux_primes();
    let mut aux1 = Aux { sieve, primes };
    let mut aux2 = aux1.clone();
    let mut aux3 = aux1.clone();
    let mut aux4 = aux1.clone();
    
    //////////// 1a
    let start = Instant::now();
    for _ in 0..4096 {
        foo_1a(&mut aux1.sieve, &aux1.primes);
    }
    let end = Instant::now();
    let duration = end.duration_since(start);
    eprintln!("1a {:?}\n", duration);
    
    //////////// 1b
    let start = Instant::now();
    for _ in 0..4096 {
        foo_1b(&mut aux2);
    }
    let end = Instant::now();
    let duration = end.duration_since(start);
    eprintln!("1b {:?}\n", duration);

    //////////// 2a
    let start = Instant::now();
    for _ in 0..4096 {
        foo_2a(&mut aux3.sieve, &aux3.primes);
    }
    let end = Instant::now();
    let duration = end.duration_since(start);
    eprintln!("2a {:?}\n", duration);

    //////////// 2b
    let start = Instant::now();
    for _ in 0..4096 {
        foo_2b(&mut aux4);
    }
    let end = Instant::now();
    let duration = end.duration_since(start);
    eprintln!("2b {:?}\n", duration);
    
}

@AngelicosPhosphoros
Contributor

@ZuseZ4 I don't think this is related to the small POD case, because that one is more about integer packing.

I think your problem is related to #85570.

However, it would be cool if you could provide a smaller example.

@ZuseZ4
Contributor Author

ZuseZ4 commented Jun 7, 2021

@AngelicosPhosphoros Thanks for the hint. I wouldn't have expected such a large impact from mutable noalias. However, if it's due to missed vectorization, that probably makes sense.

I was able to break it down further; the difference is still around a factor of 4.
I wasn't able to show any difference when using only one array of the struct, which makes sense if it's an aliasing issue (a sketch of a possible workaround follows after the code).

https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=530bac9311ddf86c4622f2806d52939e

use std::time::Instant;

#[derive(Clone)]
pub struct Aux {
    pub sieve: [u64;4096],
    pub primes: [u32;5000],
}

#[inline]
pub fn foo_1a(aux_sieve: &mut [u64], aux_primes: &[u32]) {
    for i in 0..10_000_000 {
        let mut j = 0;
        while j < 4096 {
            aux_sieve[j as usize] |= 1234;
            j += aux_primes[i%5000];
        }
    }
}

#[inline]
pub fn foo_1b(aux: &mut Aux) { // SLOW
    for i in 0..10_000_000 {
        let mut j = 0;
        while j < 4096 {
            aux.sieve[j as usize] |= 1234;
            j += aux.primes[i%5000];
        }
    }
}

pub fn main() { 
    let sieve:  [u64;4096] = [0;4096];
    let primes: [u32;5000] = [10;5000];

    let mut aux1 = Aux { sieve, primes };
    let mut aux2 = aux1.clone();
    
    //////////// 2a
    let start = Instant::now();
    foo_1a(&mut aux1.sieve, &aux1.primes);
    let end = Instant::now();
    let duration = end.duration_since(start);
    eprintln!("1a {:?}\n", duration);

    //////////// 2b
    let start = Instant::now();
    foo_1b(&mut aux2);
    let end = Instant::now();
    let duration = end.duration_since(start);
    eprintln!("2b {:?}\n", duration);
}
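
One reading of the aliasing hypothesis above: inside foo_1b both the |= store and the primes read go through the same &mut Aux pointer, so LLVM must assume the store could clobber the prime and re-reads it from memory on every inner-loop iteration (the extra add edx, dword ptr [...] inside the loop, shown later in this thread). Assuming that is the cause, a minimal sketch of a workaround is to hoist the prime into a local before the inner loop; this variant is an illustration added here, not code from the repo:

#[inline]
pub fn foo_1b_hoisted(aux: &mut Aux) {
    for i in 0..10_000_000 {
        // Copy the prime once per outer iteration so the inner loop no
        // longer reloads it through the same pointer the |= writes use.
        let p = aux.primes[i % 5000];
        let mut j = 0;
        while j < 4096 {
            aux.sieve[j as usize] |= 1234;
            j += p;
        }
    }
}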

@hellow554
Contributor

I still find it funny that on my machine it's equally fast:

# rustc a.rs -O && ./a
1a 1.742168s

2b 1.7540583s

But Jonas was able to confirm this.
I'm very confused about what's happening here...

@ZuseZ4
Contributor Author

ZuseZ4 commented Jun 7, 2021

I thought it might be related to some hardware-specific optimizations,
so I benchmarked it on my PC as well as on a university cluster, with target-cpu=native turned off and on.
I'm using 1.52.1 on my PC (Ryzen 2600) and 1.41.1 on the university cluster (Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz).

Want to get even more confused?

I used
rustc -O -C target-cpu=native main.rs && ./main
rustc -O main.rs && ./main
and took the average over 10 runs.

PC with native: 2s vs 6.8s
PC without native: 1.9s vs 7.3s
Cluster with native: 2.1s vs 2.4s
Cluster without native: 7.5s vs 2.6s
I can try to get a newer rustc version running on the cluster, if needed. But I guess the generated code on playground from the current version might be more interesting:

1a:
https://play.rust-lang.org/?version=stable&mode=release&edition=2018&gist=5760dad6f76126eb1d84870823a763f3
1b:
https://play.rust-lang.org/?version=stable&mode=release&edition=2018&gist=f1e42cff2965fc0d6d71357ca207e4bd

From what I can tell, the generated asm differs in only two lines, in .LBB5_4 and .LBB5_5 respectively:
1b contains an extra line (add edx, dword ptr [rsp + 4*rax + 52768]) in .LBB5_5,
1a contains an extra line (mov eax, dword ptr [rsp + 4*rdx + 52768]) in .LBB5_4.

I can hardly imagine that this causes any difference. They are both vectorized (see .LBB5_1:).
Let me check what is generated on my local PC.

@ZuseZ4
Contributor Author

ZuseZ4 commented Jun 8, 2021

Here we go. I'm not sure if I missed something, so I'll post the steps here. Let me know if you need more details.
I used cargo-asm to make the output more readable (https://crates.io/crates/cargo-asm).
I placed the functions into their own library and ran

cargo asm test_perf::perf::foo_1a --rust

I am surprised that 1b is much shorter than 1a.
Which optimization pass(es) could be relevant here / how do we check why it generates different asm here?

For 1a I got:

file /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/iter/range.rs does not exist!
file /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/cmp.rs does not exist!
 pub fn foo_1a(aux_sieve: &mut [u64], aux_primes: &[u32]) {
 push    rax
 mov     r8, rdx
 xor     r9d, r9d
 movabs  r10, 3777893186295716171
.LBB0_1:
 mov     rax, r9
 mul     r10
 shr     rdx, 10
 imul    rax, rdx, 5000
 mov     r11, r9
 sub     r11, rax
 cmp     r11, rcx
 while j < 4096 {
 jae     .LBB0_7
 add     r9, 1
 xor     edx, edx
.LBB0_3:
 aux_sieve[j as usize] |= 1234;
 mov     eax, edx
 aux_sieve[j as usize] |= 1234;
 cmp     rax, rsi
 jae     .LBB0_9
 or      word, ptr, [rdi, +, 8*rax], 1234
 j += aux_primes[i%5000];
 add     edx, dword, ptr, [r8, +, 4*r11]
 while j < 4096 {
 cmp     edx, 4096
 while j < 4096 {
 jb      .LBB0_3
     cmp     r9, 10000000
     jne     .LBB0_1
 }
 pop     rax
 ret
.LBB0_7:
 aux_sieve[j as usize] |= 1234;
 test    rsi, rsi
 je      .LBB0_8
 or      word, ptr, [rdi], 1234
 j += aux_primes[i%5000];
 lea     rdx, [rip, +, .L__unnamed_1]
 mov     rdi, r11
 mov     rsi, rcx
 call    qword, ptr, [rip, +, _ZN4core9panicking18panic_bounds_check17hd3a5ad32cddc7d96E@GOTPCREL]
 ud2
.LBB0_8:
 xor     edx, edx
.LBB0_9:
 aux_sieve[j as usize] |= 1234;
 mov     edi, edx
 aux_sieve[j as usize] |= 1234;
 lea     rdx, [rip, +, .L__unnamed_2]
 call    qword, ptr, [rip, +, _ZN4core9panicking18panic_bounds_check17hd3a5ad32cddc7d96E@GOTPCREL]
 ud2

For 1b I got:

file /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/cmp.rs does not exist!
file /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/num/uint_macros.rs does not exist!
file /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/iter/range.rs does not exist!
 pub fn foo_1b(aux: &mut Aux) {
 xor     ecx, ecx
 movabs  r8, 3777893186295716171
.LBB1_2:
 mov     rax, rcx
 mul     r8
 shr     rdx, 10
 imul    rdx, rdx, 5000
 mov     rax, rcx
 sub     rax, rdx
 xor     edx, edx
.LBB1_3:
 aux.sieve[j as usize] |= 1234;
 mov     esi, edx
 aux.sieve[j as usize] |= 1234;
 or      word, ptr, [rdi, +, 8*rsi], 1234
 j += aux.primes[i%5000];
 add     edx, dword, ptr, [rdi, +, 4*rax, +, 32768]
 while j < 4096 {
 cmp     edx, 4096
 while j < 4096 {
 jb      .LBB1_3
     add     rcx, 1
     cmp     rcx, 10000000
     jne     .LBB1_2
 }
 ret

Since I'm not 100% sure how cargo-asm works, I will also attach the two asm files directly (native vs. non-native lib.s), as well as a diff. There are some differences, but to me they don't look relevant.
diff.txt

rustc -O lib.rs --crate-type=lib -C target-cpu=native --emit asm -o lib_native_asm.s

lib_asm.txt
lib_native_asm.txt
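
Regarding the question above about which optimization pass might be relevant: one way to check (a suggestion added here, not something tried in the thread) is to have LLVM print its optimization remarks, or to diff the optimized IR of the two functions:

rustc -O -C remark=all -C debuginfo=1 --crate-type=lib lib.rs    # remarks (missed vectorization, inlining, ...) go to stderr
rustc -O --emit=llvm-ir --crate-type=lib lib.rs -o lib.ll        # inspect foo_1a / foo_1b in lib.ll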

@leonardo-m

As a general rule, before benchmarking you need to make your code correct, idiomatic, good, and nice, in that order. This saves you from silly mistakes and lets you see everything more clearly.

Your two functions aren't equivalent, as you can see from the asm you have shown: in foo_1a you lose the compile-time knowledge of the array lengths (inlining sometimes regains such information, but you can't rely too much on that), which is visible in the two panic paths. So the code becomes this (which is panic-free):

use std::time::Instant;

const N1: usize = 4_096;
const N1B: u32 = N1 as _;
const N2: usize = 5_000;
const K: u64 = 1_234;
const LIM: usize = 10_000_000;

#[derive(Clone)]
pub struct Aux {
    pub sieve: [u64; N1],
    pub primes: [u32; N2],
}

#[inline]
//pub fn foo_1a(data1: &mut [u64], data2: &[u32]) {
pub fn foo_1a(data1: &mut [u64; N1], data2: &[u32; N2]) {
    for i in 0 .. LIM {
        let mut j = 0;
        while j < N1B {
            data1[j as usize] |= K;
            j += data2[i % N2];
        }
    }
}

#[inline]
pub fn foo_1b(aux: &mut Aux) { // SLOW
    for i in 0 .. LIM {
        let mut j = 0;
        while j < N1B {
            aux.sieve[j as usize] |= K;
            j += aux.primes[i % N2];
        }
    }
}

pub fn main() {
    let sieve = [0_u64; N1];
    let primes = [10_u32; N2];

    let mut aux1 = Aux { sieve, primes };
    let mut aux2 = aux1.clone();

    let start = Instant::now();
    foo_1a(&mut aux1.sieve, &aux1.primes);
    let end = Instant::now();
    let duration = end.duration_since(start);
    println!("1a {:?}\n", duration);

    //////////// 2b
    let start = Instant::now();
    foo_1b(&mut aux2);
    let end = Instant::now();
    let duration = end.duration_since(start);
    println!("1b {:?}\n", duration);
}

Now there is a run-time performance difference, perhaps caused by how LLVM chooses to order the memory accesses (instructions like add esi, dword ptr [rdi + 4*rdx + 32768]). From here you can start looking for the causes (prefetching, or other cache/CPU interactions that aren't obvious to me).
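
To make the point about array lengths concrete, here is a small hypothetical pair of functions (an illustration added here, not from the thread): with a plain slice the length is only known at run time, so a bounds check and panic path remain; with a fixed-size array and an index that is provably in range, the check can typically be optimized out.

// Slice: length unknown at compile time, bounds check + panic path stay.
pub fn or_slice(data: &mut [u64], j: u32) {
    data[(j % 4096) as usize] |= 1;
}

// Fixed-size array: the length 4096 is a compile-time fact and
// (j % 4096) is provably < 4096, so the bounds check disappears.
pub fn or_array(data: &mut [u64; 4096], j: u32) {
    data[(j % 4096) as usize] |= 1;
}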

@hellow554
Contributor

hellow554 commented Jun 8, 2021

Godbolt outputs this (with the example from leonardo):

example::foo_1a:
        xor     r9d, r9d
        movabs  r8, 3777893186295716171
.LBB0_2:
        mov     rax, r9
        mul     r8
        shr     rdx, 10
        imul    rax, rdx, 5000
        mov     rdx, r9
        sub     rdx, rax
        mov     eax, dword ptr [rsi + 4*rdx]
        xor     edx, edx
.LBB0_3:
        mov     ecx, edx
        or      word ptr [rdi + 8*rcx], 1234
        add     edx, eax
        cmp     edx, 4096
        jb      .LBB0_3
        add     r9, 1
        cmp     r9, 10000000
        jne     .LBB0_2
        ret

example::foo_1b:
        xor     ecx, ecx
        movabs  r8, 3777893186295716171
.LBB1_2:
        mov     rax, rcx
        mul     r8
        shr     rdx, 10
        imul    rdx, rdx, 5000
        mov     rax, rcx
        sub     rax, rdx
        xor     edx, edx
.LBB1_3:
        mov     esi, edx
        or      word ptr [rdi + 8*rsi], 1234
        add     edx, dword ptr [rdi + 4*rax + 32768]
        cmp     edx, 4096
        jb      .LBB1_3
        add     rcx, 1
        cmp     rcx, 10000000
        jne     .LBB1_2
        ret

so the code is almost identical. Can you post what your "native" CPU target corresponds to (rustc --print target-cpus | grep native)? Mine is skylake.

@ZuseZ4
Contributor Author

ZuseZ4 commented Jun 8, 2021

My Ryzen 2600X is identified correctly as znver1:
native - Select the CPU of the current host (currently znver1).
However, the code I posted (from cargo-asm), as well as the lib_asm.txt file, was generated without the native instructions.
So this part seems fine.

Regarding the code changes, the code generated from my newest lib version is identical to what @hellow554 posted.
The performance gap seems to be a bit larger now, but I guess our issue is making 1b faster, not 1a:
1a 1.691864221s
1b 7.77066248s

I guess you also have the asm from main.rs, but just to preserve it I'll post it here as well:

file /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/cmp.rs does not exist!
file /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/clone.rs does not exist!
file /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/iter/range.rs does not exist!
file /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/fmt/mod.rs does not exist!
file /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/num/uint_macros.rs does not exist!
 pub fn main() {
 push    r15
 push    r14
 push    r13
 push    r12
 push    rbx
 mov     eax, 125664
 call    __rust_probestack
 sub     rsp, rax
 mov     eax, 36
 movaps  xmm0, xmmword, ptr, [rip, +, .LCPI5_0]
.LBB5_1:
 movups  xmmword, ptr, [rsp, +, 4*rax, -, 16], xmm0
 movups  xmmword, ptr, [rsp, +, 4*rax], xmm0
 movups  xmmword, ptr, [rsp, +, 4*rax, +, 16], xmm0
 movups  xmmword, ptr, [rsp, +, 4*rax, +, 32], xmm0
 movups  xmmword, ptr, [rsp, +, 4*rax, +, 48], xmm0
 movups  xmmword, ptr, [rsp, +, 4*rax, +, 64], xmm0
 movups  xmmword, ptr, [rsp, +, 4*rax, +, 80], xmm0
 movups  xmmword, ptr, [rsp, +, 4*rax, +, 96], xmm0
 movups  xmmword, ptr, [rsp, +, 4*rax, +, 112], xmm0
 movups  xmmword, ptr, [rsp, +, 4*rax, +, 128], xmm0
 add     rax, 40
 cmp     rax, 5036
 jne     .LBB5_1
 lea     rdi, [rsp, +, 20128]
 xor     ebx, ebx
 mov     r15, qword, ptr, [rip, +, memset@GOTPCREL]
 let mut aux1 = Aux { sieve, primes };
 mov     edx, 32768
 xor     esi, esi
 call    r15
 let mut aux1 = Aux { sieve, primes };
 lea     rdi, [rsp, +, 52896]
 lea     r14, [rsp, +, 128]
 mov     r12, qword, ptr, [rip, +, memcpy@GOTPCREL]
 let mut aux1 = Aux { sieve, primes };
 mov     edx, 20000
 mov     rsi, r14
 call    r12
 lea     rdi, [rsp, +, 72896]
     mov     edx, 32768
     xor     esi, esi
     call    r15
 #[derive(Clone)]
 lea     rdi, [rsp, +, 105664]
     mov     edx, 20000
     mov     rsi, r14
     call    r12
 let start = Instant::now();
 call    qword, ptr, [rip, +, _ZN3std4time7Instant3now17h4b133928c0a48a41E@GOTPCREL]
 mov     r15, rax
 mov     r14, rdx
 movabs  r12, 3777893186295716171
.LBB5_4:
 mov     rax, rbx
 mul     r12
 shr     rdx, 10
 imul    rax, rdx, 5000
 mov     rcx, rbx
 sub     rcx, rax
 mov     eax, dword, ptr, [rsp, +, 4*rcx, +, 52896]
 xor     ecx, ecx
.LBB5_5:
 data1[j as usize] |= K;
 mov     edx, ecx
 data1[j as usize] |= K;
 or      word, ptr, [rsp, +, 8*rdx, +, 20128], 1234
 j += data2[i % N2];
 add     ecx, eax
 while j < N1B {
 cmp     ecx, 4096
 while j < N1B {
 jb      .LBB5_5
     add     rbx, 1
     cmp     rbx, 10000000
     jne     .LBB5_4
 let end = Instant::now();
 mov     rbx, qword, ptr, [rip, +, _ZN3std4time7Instant3now17h4b133928c0a48a41E@GOTPCREL]
 call    rbx
 mov     qword, ptr, [rsp, +, 64], rax
 mov     qword, ptr, [rsp, +, 72], rdx
 lea     rdi, [rsp, +, 64]
 let duration = end.duration_since(start);
 mov     rsi, r15
 mov     rdx, r14
 call    qword, ptr, [rip, +, _ZN3std4time7Instant14duration_since17hc096810677d2cca4E@GOTPCREL]
 mov     qword, ptr, [rsp, +, 80], rax
 mov     dword, ptr, [rsp, +, 88], edx
 lea     rax, [rsp, +, 80]
 println!("1a {:?}\n", duration);
 mov     qword, ptr, [rsp], rax
 mov     r15, qword, ptr, [rip, +, _ZN57_$LT$core..time..Duration$u20$as$u20$core..fmt..Debug$GT$3fmt17h0e458f4f306000a0E@GOTPCREL]
 mov     qword, ptr, [rsp, +, 8], r15
     lea     rax, [rip, +, .L__unnamed_2]
     mov     qword, ptr, [rsp, +, 16], rax
     mov     qword, ptr, [rsp, +, 24], 2
     mov     qword, ptr, [rsp, +, 32], 0
     mov     r13, rsp
     mov     qword, ptr, [rsp, +, 48], r13
     mov     qword, ptr, [rsp, +, 56], 1
     lea     rdi, [rsp, +, 16]
 call    qword, ptr, [rip, +, _ZN3std2io5stdio6_print17ha58a4cba11598dcaE@GOTPCREL]
 let start = Instant::now();
 call    rbx
 mov     rbx, rax
 mov     r14, rdx
 xor     ecx, ecx
.LBB5_8:
 mov     rax, rcx
 mul     r12
 shr     rdx, 10
 imul    rdx, rdx, 5000
 mov     rax, rcx
 sub     rax, rdx
 xor     edx, edx
.LBB5_9:
 aux.sieve[j as usize] |= K;
 mov     esi, edx
 aux.sieve[j as usize] |= K;
 or      word, ptr, [rsp, +, 8*rsi, +, 72896], 1234
 j += aux.primes[i % N2];
 add     edx, dword, ptr, [rsp, +, 4*rax, +, 105664]
 while j < N1B {
 cmp     edx, 4096
 while j < N1B {
 jb      .LBB5_9
     add     rcx, 1
     cmp     rcx, 10000000
     jne     .LBB5_8
 let end = Instant::now();
 call    qword, ptr, [rip, +, _ZN3std4time7Instant3now17h4b133928c0a48a41E@GOTPCREL]
 mov     qword, ptr, [rsp, +, 96], rax
 mov     qword, ptr, [rsp, +, 104], rdx
 lea     rdi, [rsp, +, 96]
 let duration = end.duration_since(start);
 mov     rsi, rbx
 mov     rdx, r14
 call    qword, ptr, [rip, +, _ZN3std4time7Instant14duration_since17hc096810677d2cca4E@GOTPCREL]
 mov     qword, ptr, [rsp, +, 112], rax
 mov     dword, ptr, [rsp, +, 120], edx
 lea     rax, [rsp, +, 112]
 println!("1b {:?}\n", duration);
 mov     qword, ptr, [rsp], rax
 mov     qword, ptr, [rsp, +, 8], r15
     lea     rax, [rip, +, .L__unnamed_3]
     mov     qword, ptr, [rsp, +, 16], rax
     mov     qword, ptr, [rsp, +, 24], 2
     mov     qword, ptr, [rsp, +, 32], 0
     mov     qword, ptr, [rsp, +, 48], r13
     mov     qword, ptr, [rsp, +, 56], 1
     lea     rdi, [rsp, +, 16]
 call    qword, ptr, [rip, +, _ZN3std2io5stdio6_print17ha58a4cba11598dcaE@GOTPCREL]
 }
 add     rsp, 125664
 pop     rbx
 pop     r12
 pop     r13
 pop     r14
 pop     r15
 ret

@hellow554
Contributor

Hey @ZuseZ4,

out of curiosity, can you take your examples and run them again today with the latest compiler? The LLVM version has been bumped a few times since; maybe it's no longer performing badly?

@ZuseZ4
Contributor Author

ZuseZ4 commented Jun 3, 2022

@hellow554 Good question:
Based on the code from leonardo-m

➜  RustPerf rustc -O -C target-cpu=native main.rs && ./main
1a 5.785559692s

1b 1.999194356s

➜  RustPerf rustc -O  main.rs && ./main
1a 1.801210071s

1b 2.404140135s

Using rustc 1.60 and Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz

I'll talk to my LLVM (GSoC) supervisor tomorrow anyway, so unless you have any new thoughts I'll bring it up there and see what can be done about it.

@workingjubilee workingjubilee added the C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such label Oct 8, 2023