Loop without side-effect is not eliminated. Leads to O(n) instead of O(1) runtime #79308
Comments
It looks like this can be solved with more
On godbolt it can be made loop-free by adding
example::vec_cast:
mov rax, rdi
mov rcx, qword ptr [rsi]
mov rdx, qword ptr [rsi + 8]
mov rdi, qword ptr [rsi + 16]
mov rsi, rcx
test rdi, rdi
je .LBB0_2
lea rsi, [rcx + 8*rdi]
.LBB0_2:
sub rsi, rcx
sar rsi, 3
mov qword ptr [rax], rcx
mov qword ptr [rax + 8], rdx
mov qword ptr [rax + 16], rsi
ret
And branchless with
example::vec_cast:
mov rax, rdi
mov rcx, qword ptr [rsi]
mov rdx, qword ptr [rsi + 8]
mov rsi, qword ptr [rsi + 16]
test rsi, rsi
lea rsi, [rcx + 8*rsi]
cmove rsi, rcx
sub rsi, rcx
sar rsi, 3
mov qword ptr [rdi], rcx
mov qword ptr [rdi + 8], rdx
mov qword ptr [rdi + 16], rsi
ret
Using I failed to reproduce this improvement locally, but that might be due to argument escaping issues when attempting to put the space-separated list into
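As a reading aid for the branchless listing above, the arithmetic can be sketched in Rust. This is only an interpretation of the assembly, not code from the standard library: the function name `len_after_cast` and its parameters are hypothetical, with `start` standing for the buffer address and `len` for the element count loaded from the input Vec, and 8-byte elements assumed.

```rust
/// Hypothetical helper that mirrors the branchless assembly listing.
/// `start` models the buffer address, `len` the input element count.
pub fn len_after_cast(start: usize, len: usize) -> usize {
    // lea rsi, [rcx + 8*rsi] followed by cmove rsi, rcx:
    // the end address is start + 8 * len, or start itself when len == 0.
    let end = if len == 0 {
        start
    } else {
        start.wrapping_add(len.wrapping_mul(8))
    };
    // sub rsi, rcx followed by sar rsi, 3:
    // byte distance divided by the 8-byte element size.
    ((end.wrapping_sub(start) as isize) >> 3) as usize
}
```

Under these assumptions the result always equals `len`, which is why the fully optimized version can simply copy the three fields of the Vec.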
Surprisingly, targeting a current CPU results in only one extra pass being needed to achieve the loop-free result:
Inserting one loop deletion pass directly into
It looks like the loop is now removed on nightly. On nightly:
example::vec_cast:
mov rax, rdi
mov rcx, qword ptr [rsi]
movups xmm0, xmmword ptr [rsi + 8]
mov qword ptr [rdi], rcx
movups xmmword ptr [rdi + 8], xmm0
ret
On stable:
example::vec_cast:
mov rax, rdi
mov rdi, qword ptr [rsi]
mov r8, qword ptr [rsi + 8]
mov rsi, qword ptr [rsi + 16]
mov rcx, rdi
test rsi, rsi
je .LBB0_7
lea r9, [8*rsi - 8]
mov ecx, r9d
shr ecx, 3
add ecx, 1
mov rdx, rdi
and rcx, 7
je .LBB0_4
neg rcx
mov rdx, rdi
.LBB0_3:
add rdx, 8
inc rcx
jne .LBB0_3
.LBB0_4:
lea rcx, [rdi + 8*rsi]
cmp r9, 56
jb .LBB0_7
lea rsi, [rdi + 8*rsi]
sub rsi, rdx
.LBB0_6:
add rsi, -64
jne .LBB0_6
.LBB0_7:
sub rcx, rdi
sar rcx, 3
mov qword ptr [rax], rdi
mov qword ptr [rax + 8], r8
mov qword ptr [rax + 16], rcx
ret
That's only true for
Something that was brought up on Zulip is that this case also fails for
Seems to work in the other direction, weirdly: https://rust.godbolt.org/z/onbz6he9c
pub struct Foo(usize);
#[inline(never)]
pub fn vec_cast(input: Vec<usize>) -> Vec<Foo> {
input.into_iter().map(|e| Foo(e)).collect()
}
is optimised away, whilst
pub struct Foo(usize);
#[inline(never)]
pub fn vec_cast(input: Vec<Foo>) -> Vec<usize> {
input.into_iter().map(|e| e.0).collect()
}
is not
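For reference, both directions of the cast above are semantically a no-op on the element values, which is what makes the O(1) result legal in principle. The sketch below only checks that equivalence and says nothing about the generated code; the function names `wrap_all` and `unwrap_all` are hypothetical, and the field is made public here purely so the example is self-contained.

```rust
pub struct Foo(pub usize);

/// usize -> Foo: the direction reported above as optimised away.
pub fn wrap_all(input: Vec<usize>) -> Vec<Foo> {
    input.into_iter().map(Foo).collect()
}

/// Foo -> usize: the direction reported above as not optimised away.
pub fn unwrap_all(input: Vec<Foo>) -> Vec<usize> {
    input.into_iter().map(|e| e.0).collect()
}
```

A round trip such as `unwrap_all(wrap_all(v))` should give back the original values in the same order, with the same length.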
…ulacrum
Add codegen tests for additional cases where noop iterators get optimized away
Optimizations have improved over time and now LLVM manages to optimize more in-place-collect noop-iterators to O(1) functions. This updates the codegen test to match. Many but not all cases reported in rust-lang#79308 work now.
Is there a case here that still doesn't optimize? It looks like all the existing godbolt links do.
Looks good, I'll update the existing codegen test.
Adding the following method as part of a benchmark to library/alloc/benches/vec.rs
which exercises this specialization in Vec
results in the following assembly (extracted with objdump):
The ghidra decompile for the same function (comments are mine):
Note the useless loop.
The number of loop iterations (or rather the pointer increments) is needed to calculate the new length of the output Vec. LLVM already manages to hoist lVar3 = lVar1 + lVar4 * 8; but then it fails to eliminate the now-useless loop.
The issue does not occur if one uses input.into_iter().flat_map(|e| None).collect() instead, which always results in length == 0.
I tried several variations of the loop (e.g. replacing try_fold with a simple while let Some() ...) but it generally results in the same or worse assembly.
Note: The assembly looks somewhat different if I run this on godbolt, but the decrementing loop without side-effect is still there. I assume the differences are due to LTO or some other compiler settings.
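To make the description above concrete, here is a deliberately simplified model of what the leftover loop computes, next to the closed form it could be replaced with. This is not the in-place collect code from the standard library and is not expected to reproduce the missed optimization; the names `len_by_walking`, `len_closed_form`, and `always_empty` are hypothetical and exist only for illustration.

```rust
/// Simplified model: advance through the source one element at a time and
/// count the steps, the way the leftover loop counts pointer increments.
pub fn len_by_walking(src: &[usize]) -> usize {
    let mut written = 0usize;
    let mut iter = src.iter();
    while let Some(_elem) = iter.next() {
        written += 1;
    }
    written
}

/// The O(1) answer the loop is equivalent to for a no-op map.
pub fn len_closed_form(src: &[usize]) -> usize {
    src.len()
}

/// The flat_map variant mentioned above: every element maps to None,
/// so the output is always empty.
pub fn always_empty(input: Vec<usize>) -> Vec<usize> {
    input.into_iter().flat_map(|_e| None::<usize>).collect()
}
```

Both length functions agree for every input, so the walk carries no information beyond the already-known element count.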
Tested on commit a1a13b2 2020-11-21 22:46
@rustbot modify labels: +I-slow