strong performance issue when using struct instead of multiple variables #85354
It could be a good idea to minimize the code that shows the problem (and copy it here, if you want).
The GitHub version is already quite minimized @leonardo-m, but I now also removed all assertions and a little bit of initialization. Side note: the compiler doesn't notice that I'm not using the results anymore, so the entire calculation could be removed (see the note after the code below). On my PC I now have execution times of ~4s with the struct vs 1.3s without the struct.
Edit: found some more things to remove; the relevant update_aux_sieve function now ended up being only 6 LOC.
use std::time::Instant;

pub struct Aux {
    pub sieve: [u64; 1 << 12],
    pub primes: [u32; 6536],
}

pub const MARK_MASK: [u64; 64] = {
    let mut res = [0; 64];
    let mut i = 0;
    while i < 64 {
        res[i] = 1 << i;
        i += 1;
    }
    res
};

#[inline(always)]
fn mark_2(arr: &mut [u64], index: usize) { arr[index >> 6] |= MARK_MASK[index & 63usize]; }

#[inline]
const fn test_2(arr: &[u64], index: usize) -> bool { arr[index >> 6] & MARK_MASK[index & 63usize] == 0 }

#[inline]
pub fn update_aux_sieve(aux_sieve: &mut [u64], aux_primes: &[u32]) {
    for i in 0..6536 {
        let mut j = aux_primes[i] * aux_primes[i];
        while j < (1 << 18) {
            mark_2(aux_sieve, j as usize);
            j += aux_primes[i];
        }
    }
}

#[inline]
pub fn update_aux_struct_sieve(aux: &mut Aux) {
    for i in 0..6536 {
        let mut j = aux.primes[i] * aux.primes[i];
        while j < (1 << 18) {
            mark_2(&mut aux.sieve, j as usize);
            j += aux.primes[i];
        }
    }
}

pub fn init_aux_primes() -> ([u64; 1 << 12], [u32; 6536]) {
    let mut aux_sieve: [u64; 1 << 12] = [0; 1 << 12];
    let mut aux_primes: [u32; 6536] = [1; 6536];
    let start = Instant::now();
    println!("Initializing aux primes");
    for i in (3..256).step_by(2) {
        if test_2(&mut aux_sieve, i >> 1usize) {
            for j in (((i * i) >> 1)..32768).step_by(i) {
                mark_2(&mut aux_sieve, j as usize);
            }
        }
    }
    let mut j = 0;
    for i in 8..32768 {
        if test_2(&mut aux_sieve, i) {
            aux_primes[j] = 2 * (i as u32) + 1;
            j += 1;
        }
    }
    let end = Instant::now();
    let duration = end.duration_since(start);
    eprintln!("init_aux_primes took, {:?}\n", duration);
    (aux_sieve, aux_primes)
}

pub fn main() {
    let (sieve, primes) = init_aux_primes();
    let mut aux = Aux { sieve, primes };
    let start = Instant::now();
    for _ in 0..8192 {
        //update_aux_struct_sieve(&mut aux); // slow
        update_aux_sieve(&mut aux.sieve, &aux.primes); // fast
    }
    let end = Instant::now();
    let duration = end.duration_since(start);
    eprintln!("duration {:?}\n", duration);
}
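As a side note on the unused-results concern above: on newer compilers where std::hint::black_box is stable (it was stabilized long after the 1.52/1.54 versions discussed in this thread), passing the sieve through it after the loop is an easy way to keep the optimizer from proving the whole computation dead. A minimal sketch of the main above with that guard added, reusing the definitions from the snippet (not part of the original report):

use std::hint::black_box; // requires a compiler where this is stable
use std::time::Instant;

pub fn main() {
    let (sieve, primes) = init_aux_primes();
    let mut aux = Aux { sieve, primes };
    let start = Instant::now();
    for _ in 0..8192 {
        update_aux_sieve(&mut aux.sieve, &aux.primes);
    }
    // Treat the sieve as "used" so the loop above cannot be removed as dead code.
    black_box(&aux.sieve);
    let duration = start.elapsed();
    eprintln!("duration {:?}\n", duration);
}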
For me the runtime does not differ... neither with your original example, nor with your second one: 945.0161ms vs 954.0564ms. What compiler version do you use?
Ok. To reproduce the issue, you do exactly what?
Sure, I created a new branch for that. I haven't used mktemp before, but I created a new directory; inside of it I ran:
I'm sorry.... I can't reproduce your issue... :(
Sure, updated it.
Nightly gives me the same performance. That's awkward, but thanks for testing!
We probably don't want to use the playground for serious benchmarking; at least for me, the execution time varies greatly.
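For numbers less noisy than the playground (or ad-hoc Instant timing), one option not used in this thread is a criterion benchmark. This is only a sketch: it assumes the Aux type and the two update functions from the snippet above are exposed by a library crate (the crate name perf_comp and the benchmark names are my own choice), with criterion added as a dev-dependency and a [[bench]] target using harness = false.

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use perf_comp::{init_aux_primes, update_aux_sieve, update_aux_struct_sieve, Aux};

fn bench_sieves(c: &mut Criterion) {
    let (sieve, primes) = init_aux_primes();
    let mut aux = Aux { sieve, primes };
    // Benchmark the struct-based variant and the split-argument variant side by side.
    c.bench_function("struct", |b| {
        b.iter(|| update_aux_struct_sieve(black_box(&mut aux)))
    });
    c.bench_function("no_struct", |b| {
        b.iter(|| update_aux_sieve(black_box(&mut aux.sieve), black_box(&aux.primes)))
    });
}

criterion_group!(benches, bench_sieves);
criterion_main!(benches);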
I'm calling @jonas-schievink for further actions. Could you add the correct tags, and do you think it's worth calling the
Hmm, the slow version is consistently slower for me (on 1.52.0, as well as
Slow:
Fast:
(this is with the code in #85354 (comment), compiled directly with
I added two simplified functions (1a and 1b), one taking the struct and the other taking the arrays directly. I also ran flamegraph on 2a and 2b. Removing the always from mark_2's #[inline(always)] does not affect 2a, but increases the runtime of 2b by roughly 30%. I tried looking at the generated code, but I wasn't really able to figure out which building block belongs where.
@AngelicosPhosphoros Thanks for your hint. I wouldn't have expected such a large impact from mutable noalias. However, if it's due to missed vectorization, that probably makes sense. I was able to break it down further; the difference is still around a factor of 4.
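If the missing noalias information is what forces the reload of aux.primes[i] inside the inner loop, one workaround to test is to copy the prime into a local before the inner loop, so the stores into aux.sieve can no longer be assumed to change it. This is only a sketch against the original Aux/mark_2 definitions from the top of the thread (the function name is hypothetical, and it only addresses the reload, not any vectorization differences):

#[inline]
pub fn update_aux_struct_sieve_hoisted(aux: &mut Aux) {
    for i in 0..6536 {
        // Plain u32 copy: no later store into aux.sieve can alias it.
        let p = aux.primes[i];
        let mut j = p * p;
        while j < (1 << 18) {
            mark_2(&mut aux.sieve, j as usize);
            j += p;
        }
    }
}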
I still find it funny that on my machine it's equally fast:
But Jonas was able to confirm this.
I thought it might be related to some HW-specific optimizations. Want to get even more confused? I used
Private with native: 2s vs 6.8s
1a:
The asm code generated differs only in two lines, from what I can tell, in
I can hardly imagine that this causes any difference. They are both vectorized (see .LBB5_1:).
Here we go. Not sure if I missed something, so I'll post the steps here. Let me know if you need more details.
I am surprised that 1b is much shorter than 1a. For 1a I got:
file /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/iter/range.rs does not exist!
file /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/cmp.rs does not exist!
pub fn foo_1a(aux_sieve: &mut [u64], aux_primes: &[u32]) {
push rax
mov r8, rdx
xor r9d, r9d
movabs r10, 3777893186295716171
.LBB0_1:
mov rax, r9
mul r10
shr rdx, 10
imul rax, rdx, 5000
mov r11, r9
sub r11, rax
cmp r11, rcx
while j < 4096 {
jae .LBB0_7
add r9, 1
xor edx, edx
.LBB0_3:
aux_sieve[j as usize] |= 1234;
mov eax, edx
aux_sieve[j as usize] |= 1234;
cmp rax, rsi
jae .LBB0_9
or word, ptr, [rdi, +, 8*rax], 1234
j += aux_primes[i%5000];
add edx, dword, ptr, [r8, +, 4*r11]
while j < 4096 {
cmp edx, 4096
while j < 4096 {
jb .LBB0_3
cmp r9, 10000000
jne .LBB0_1
}
pop rax
ret
.LBB0_7:
aux_sieve[j as usize] |= 1234;
test rsi, rsi
je .LBB0_8
or word, ptr, [rdi], 1234
j += aux_primes[i%5000];
lea rdx, [rip, +, .L__unnamed_1]
mov rdi, r11
mov rsi, rcx
call qword, ptr, [rip, +, _ZN4core9panicking18panic_bounds_check17hd3a5ad32cddc7d96E@GOTPCREL]
ud2
.LBB0_8:
xor edx, edx
.LBB0_9:
aux_sieve[j as usize] |= 1234;
mov edi, edx
aux_sieve[j as usize] |= 1234;
lea rdx, [rip, +, .L__unnamed_2]
call qword, ptr, [rip, +, _ZN4core9panicking18panic_bounds_check17hd3a5ad32cddc7d96E@GOTPCREL]
ud2
For 1b I got:
file /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/cmp.rs does not exist!
file /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/num/uint_macros.rs does not exist!
file /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/iter/range.rs does not exist!
pub fn foo_1b(aux: &mut Aux) {
xor ecx, ecx
movabs r8, 3777893186295716171
.LBB1_2:
mov rax, rcx
mul r8
shr rdx, 10
imul rdx, rdx, 5000
mov rax, rcx
sub rax, rdx
xor edx, edx
.LBB1_3:
aux.sieve[j as usize] |= 1234;
mov esi, edx
aux.sieve[j as usize] |= 1234;
or word, ptr, [rdi, +, 8*rsi], 1234
j += aux.primes[i%5000];
add edx, dword, ptr, [rdi, +, 4*rax, +, 32768]
while j < 4096 {
cmp edx, 4096
while j < 4096 {
jb .LBB1_3
add rcx, 1
cmp rcx, 10000000
jne .LBB1_2
}
ret

Since I'm not 100% sure about how cargo-asm works, I will also append the two asm files directly (native vs not native lib.s), as well as a diff. There are some differences, but to me they don't look relevant.
As a general rule, before benchmarking you need to make your code correct, idiomatic, good, and nice, in that order. This saves you from silly mistakes and lets you see everything better. Your two functions aren't equal, as you can see from the asm you have shown: in foo_1a you are losing the compile-time knowledge of the array lengths (inlining sometimes regains such information, but you can't rely too much on this); you can see it from the two panics. So the code becomes this (which is panic-free):
use std::time::Instant;

const N1: usize = 4_096;
const N1B: u32 = N1 as _;
const N2: usize = 5_000;
const K: u64 = 1_234;
const LIM: usize = 10_000_000;

#[derive(Clone)]
pub struct Aux {
    pub sieve: [u64; N1],
    pub primes: [u32; N2],
}

#[inline]
//pub fn foo_1a(data1: &mut [u64], data2: &[u32]) {
pub fn foo_1a(data1: &mut [u64; N1], data2: &[u32; N2]) {
    for i in 0 .. LIM {
        let mut j = 0;
        while j < N1B {
            data1[j as usize] |= K;
            j += data2[i % N2];
        }
    }
}

#[inline]
pub fn foo_1b(aux: &mut Aux) { // SLOW
    for i in 0 .. LIM {
        let mut j = 0;
        while j < N1B {
            aux.sieve[j as usize] |= K;
            j += aux.primes[i % N2];
        }
    }
}

pub fn main() {
    let sieve = [0_u64; N1];
    let primes = [10_u32; N2];
    let mut aux1 = Aux { sieve, primes };
    let mut aux2 = aux1.clone();

    let start = Instant::now();
    foo_1a(&mut aux1.sieve, &aux1.primes);
    let end = Instant::now();
    let duration = end.duration_since(start);
    println!("1a {:?}\n", duration);

    //////////// 2b
    let start = Instant::now();
    foo_1b(&mut aux2);
    let end = Instant::now();
    let duration = end.duration_since(start);
    println!("1b {:?}\n", duration);
}

Now there's a run-time performance difference, perhaps caused by how LLVM chooses to order the memory accesses (instructions like
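For completeness, the struct-based entry point can also reuse the fast two-argument version by forwarding the fields as separate references, the same pattern the original report's main already uses; each parameter of foo_1a then carries its own aliasing guarantee. A sketch (the wrapper name is hypothetical, and whether the benefit survives inlining of foo_1a should be measured rather than assumed):

pub fn foo_1b_split(aux: &mut Aux) {
    // Disjoint field borrows: &mut sieve and &primes become separate parameters,
    // so foo_1a sees two references that cannot alias each other.
    foo_1a(&mut aux.sieve, &aux.primes);
}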
Godbolt outputs this (with the example from leonardo):

example::foo_1a:
xor r9d, r9d
movabs r8, 3777893186295716171
.LBB0_2:
mov rax, r9
mul r8
shr rdx, 10
imul rax, rdx, 5000
mov rdx, r9
sub rdx, rax
mov eax, dword ptr [rsi + 4*rdx]
xor edx, edx
.LBB0_3:
mov ecx, edx
or word ptr [rdi + 8*rcx], 1234
add edx, eax
cmp edx, 4096
jb .LBB0_3
add r9, 1
cmp r9, 10000000
jne .LBB0_2
ret
example::foo_1b:
xor ecx, ecx
movabs r8, 3777893186295716171
.LBB1_2:
mov rax, rcx
mul r8
shr rdx, 10
imul rdx, rdx, 5000
mov rax, rcx
sub rax, rdx
xor edx, edx
.LBB1_3:
mov esi, edx
or word ptr [rdi + 8*rsi], 1234
add edx, dword ptr [rdi + 4*rax + 32768]
cmp edx, 4096
jb .LBB1_3
add rcx, 1
cmp rcx, 10000000
jne .LBB1_2
ret

So the code is almost identical. Can you post what your "native" CPU target corresponds to? (rustc --print target-cpus | grep native). Mine is skylake.
My Ryzen 2600X is identified correctly as znver1:

Regarding the code changes, the code generated from my newest lib version is identical to what @hellow554 posted. I guess you also have the asm from main.rs, however just to save it I'll also post it here:
file /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/cmp.rs does not exist!
file /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/clone.rs does not exist!
file /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/iter/range.rs does not exist!
file /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/fmt/mod.rs does not exist!
file /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/num/uint_macros.rs does not exist!
pub fn main() {
push r15
push r14
push r13
push r12
push rbx
mov eax, 125664
call __rust_probestack
sub rsp, rax
mov eax, 36
movaps xmm0, xmmword, ptr, [rip, +, .LCPI5_0]
.LBB5_1:
movups xmmword, ptr, [rsp, +, 4*rax, -, 16], xmm0
movups xmmword, ptr, [rsp, +, 4*rax], xmm0
movups xmmword, ptr, [rsp, +, 4*rax, +, 16], xmm0
movups xmmword, ptr, [rsp, +, 4*rax, +, 32], xmm0
movups xmmword, ptr, [rsp, +, 4*rax, +, 48], xmm0
movups xmmword, ptr, [rsp, +, 4*rax, +, 64], xmm0
movups xmmword, ptr, [rsp, +, 4*rax, +, 80], xmm0
movups xmmword, ptr, [rsp, +, 4*rax, +, 96], xmm0
movups xmmword, ptr, [rsp, +, 4*rax, +, 112], xmm0
movups xmmword, ptr, [rsp, +, 4*rax, +, 128], xmm0
add rax, 40
cmp rax, 5036
jne .LBB5_1
lea rdi, [rsp, +, 20128]
xor ebx, ebx
mov r15, qword, ptr, [rip, +, memset@GOTPCREL]
let mut aux1 = Aux { sieve, primes };
mov edx, 32768
xor esi, esi
call r15
let mut aux1 = Aux { sieve, primes };
lea rdi, [rsp, +, 52896]
lea r14, [rsp, +, 128]
mov r12, qword, ptr, [rip, +, memcpy@GOTPCREL]
let mut aux1 = Aux { sieve, primes };
mov edx, 20000
mov rsi, r14
call r12
lea rdi, [rsp, +, 72896]
mov edx, 32768
xor esi, esi
call r15
#[derive(Clone)]
lea rdi, [rsp, +, 105664]
mov edx, 20000
mov rsi, r14
call r12
let start = Instant::now();
call qword, ptr, [rip, +, _ZN3std4time7Instant3now17h4b133928c0a48a41E@GOTPCREL]
mov r15, rax
mov r14, rdx
movabs r12, 3777893186295716171
.LBB5_4:
mov rax, rbx
mul r12
shr rdx, 10
imul rax, rdx, 5000
mov rcx, rbx
sub rcx, rax
mov eax, dword, ptr, [rsp, +, 4*rcx, +, 52896]
xor ecx, ecx
.LBB5_5:
data1[j as usize] |= K;
mov edx, ecx
data1[j as usize] |= K;
or word, ptr, [rsp, +, 8*rdx, +, 20128], 1234
j += data2[i % N2];
add ecx, eax
while j < N1B {
cmp ecx, 4096
while j < N1B {
jb .LBB5_5
add rbx, 1
cmp rbx, 10000000
jne .LBB5_4
let end = Instant::now();
mov rbx, qword, ptr, [rip, +, _ZN3std4time7Instant3now17h4b133928c0a48a41E@GOTPCREL]
call rbx
mov qword, ptr, [rsp, +, 64], rax
mov qword, ptr, [rsp, +, 72], rdx
lea rdi, [rsp, +, 64]
let duration = end.duration_since(start);
mov rsi, r15
mov rdx, r14
call qword, ptr, [rip, +, _ZN3std4time7Instant14duration_since17hc096810677d2cca4E@GOTPCREL]
mov qword, ptr, [rsp, +, 80], rax
mov dword, ptr, [rsp, +, 88], edx
lea rax, [rsp, +, 80]
println!("1a {:?}\n", duration);
mov qword, ptr, [rsp], rax
mov r15, qword, ptr, [rip, +, _ZN57_$LT$core..time..Duration$u20$as$u20$core..fmt..Debug$GT$3fmt17h0e458f4f306000a0E@GOTPCREL]
mov qword, ptr, [rsp, +, 8], r15
lea rax, [rip, +, .L__unnamed_2]
mov qword, ptr, [rsp, +, 16], rax
mov qword, ptr, [rsp, +, 24], 2
mov qword, ptr, [rsp, +, 32], 0
mov r13, rsp
mov qword, ptr, [rsp, +, 48], r13
mov qword, ptr, [rsp, +, 56], 1
lea rdi, [rsp, +, 16]
call qword, ptr, [rip, +, _ZN3std2io5stdio6_print17ha58a4cba11598dcaE@GOTPCREL]
let start = Instant::now();
call rbx
mov rbx, rax
mov r14, rdx
xor ecx, ecx
.LBB5_8:
mov rax, rcx
mul r12
shr rdx, 10
imul rdx, rdx, 5000
mov rax, rcx
sub rax, rdx
xor edx, edx
.LBB5_9:
aux.sieve[j as usize] |= K;
mov esi, edx
aux.sieve[j as usize] |= K;
or word, ptr, [rsp, +, 8*rsi, +, 72896], 1234
j += aux.primes[i % N2];
add edx, dword, ptr, [rsp, +, 4*rax, +, 105664]
while j < N1B {
cmp edx, 4096
while j < N1B {
jb .LBB5_9
add rcx, 1
cmp rcx, 10000000
jne .LBB5_8
let end = Instant::now();
call qword, ptr, [rip, +, _ZN3std4time7Instant3now17h4b133928c0a48a41E@GOTPCREL]
mov qword, ptr, [rsp, +, 96], rax
mov qword, ptr, [rsp, +, 104], rdx
lea rdi, [rsp, +, 96]
let duration = end.duration_since(start);
mov rsi, rbx
mov rdx, r14
call qword, ptr, [rip, +, _ZN3std4time7Instant14duration_since17hc096810677d2cca4E@GOTPCREL]
mov qword, ptr, [rsp, +, 112], rax
mov dword, ptr, [rsp, +, 120], edx
lea rax, [rsp, +, 112]
println!("1b {:?}\n", duration);
mov qword, ptr, [rsp], rax
mov qword, ptr, [rsp, +, 8], r15
lea rax, [rip, +, .L__unnamed_3]
mov qword, ptr, [rsp, +, 16], rax
mov qword, ptr, [rsp, +, 24], 2
mov qword, ptr, [rsp, +, 32], 0
mov qword, ptr, [rsp, +, 48], r13
mov qword, ptr, [rsp, +, 56], 1
lea rdi, [rsp, +, 16]
call qword, ptr, [rip, +, _ZN3std2io5stdio6_print17ha58a4cba11598dcaE@GOTPCREL]
}
add rsp, 125664
pop rbx
pop r12
pop r13
pop r14
pop r15
ret
Hey @ZuseZ4, out of curiosity, can you take your examples and try to run them again today with the latest compiler? The LLVM version has been bumped a few times; maybe it's not performing badly anymore?
@hellow554 Good question:
Using rustc 1.60 and an Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz.
I'll talk to my LLVM (GSoC) supervisor tomorrow anyway, so unless you have any new thoughts I can bring it up there and see what can be done about it.
I ran into an issue when working on a prime sieve.
I grouped three related variables (two arrays and an integer) into a struct to have fewer variables being passed around.
While profiling I noticed that the execution time went up from 2.8s to 7s when using the struct.
Based on some Discord hints, I guess that shouldn't be the case: https://discord.com/channels/273534239310479360/273541522815713281/843272480852672513
Here is my actual struct:
I created a GitHub repo showing the different versions: https://github.com/ZuseZ4/perf_comp
Branch with_struct: I'm passing the struct by ref => 7s
Branch no_struct: I'm passing both arrays by ref, the integer by value => 2.8s
Branch no_struct_ref: I'm passing both arrays and the integer by ref => 2.8s
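Purely as an illustration of what the three branches compare (the linked repo is the authoritative source; the names below and the extra integer field are hypothetical, not taken from it):

// with_struct: everything travels in one struct, passed by mutable reference
pub struct SieveState {
    pub sieve: [u64; 1 << 12],
    pub primes: [u32; 6536],
    pub count: usize, // the "integer" mentioned above
}
pub fn update_with_struct(state: &mut SieveState) { /* ... */ }

// no_struct: both arrays by reference, the integer by value
pub fn update_no_struct(sieve: &mut [u64], primes: &[u32], count: usize) { /* ... */ }

// no_struct_ref: both arrays and the integer by reference
pub fn update_no_struct_ref(sieve: &mut [u64], primes: &[u32], count: &usize) { /* ... */ }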
Setup:
I use release mode, lto=true and target-cpu=native. (The last two seem to have no effect.)
rustc 1.52.1 (9bc8c42 2021-05-09)
(1.54.0-nightly (fe72845 2021-05-16) gives the same performance)
Ubuntu 20.04 (5.4.0-73-generic)
AMD® Athlon 200ge with 2x16 GB DDR4-3200
I could understand that an extra layer of indirection, and therefore (if not removed by optimizations) a pointer dereference, could cause some performance cost, but I'm surprised by the large difference.
Did I miss something?