
Loop without side-effect is not eliminated. Leads to O(n) instead of O(1) runtime #79308


Closed
the8472 opened this issue Nov 22, 2020 · 9 comments · Fixed by #109891
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
C-enhancement Category: An issue proposing an enhancement or a PR with one.
I-slow Issue: Problems and improvements with respect to performance of generated code.
T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

the8472 commented Nov 22, 2020

Adding the following function as part of a benchmark to library/alloc/benches/vec.rs

#[repr(transparent)]
pub struct Foo(usize);


#[inline(never)]
pub fn vec_cast(input: Vec<Foo>) -> Vec<usize> {
    input.into_iter().map(|e| unsafe { std::mem::transmute(e) }).collect()
}

which exercises the in-place collect specialization in Vec

results in the following assembly (extracted with objdump):

0000000000086130 <collectionsbenches::vec::vec_cast>:
   86130:	48 8b 0e             	mov    (%rsi),%rcx
   86133:	48 89 f8             	mov    %rdi,%rax
   86136:	48 8b 56 08          	mov    0x8(%rsi),%rdx
   8613a:	48 8b 7e 10          	mov    0x10(%rsi),%rdi
   8613e:	48 89 ce             	mov    %rcx,%rsi
   86141:	48 85 ff             	test   %rdi,%rdi
   86144:	74 10                	je     86156 <collectionsbenches::vec::vec_cast+0x26>
   86146:	48 8d 34 f9          	lea    (%rcx,%rdi,8),%rsi
   8614a:	48 c1 e7 03          	shl    $0x3,%rdi
   8614e:	66 90                	xchg   %ax,%ax
   86150:	48 83 c7 f8          	add    $0xfffffffffffffff8,%rdi ; <= RDI unused from here onwards
   86154:	75 fa                	jne    86150 <collectionsbenches::vec::vec_cast+0x20>
   86156:	48 29 ce             	sub    %rcx,%rsi
   86159:	48 89 08             	mov    %rcx,(%rax)
   8615c:	48 89 50 08          	mov    %rdx,0x8(%rax)
   86160:	48 c1 fe 03          	sar    $0x3,%rsi
   86164:	48 89 70 10          	mov    %rsi,0x10(%rax)
   86168:	c3                   	retq   

The ghidra decompile for the same function (comments are mine):

void collectionsbenches::vec::vec_cast(long *param_1,long *param_2)

{
  long lVar1;
  long lVar2;
  long lVar3;
  long lVar4;
  
  lVar1 = *param_2; // pointer
  lVar2 = param_2[1]; // capacity
  lVar4 = param_2[2]; // len
  lVar3 = lVar1;
  if (lVar4 != 0) {
    lVar3 = lVar1 + lVar4 * 8; // end pointer of vec::IntoIter
    lVar4 = lVar4 << 3; // len in bytes
    do {
      lVar4 = lVar4 + -8;
    } while (lVar4 != 0); // <== lVar4 unused from here onwards
  }
  *param_1 = lVar1; // pointer
  param_1[1] = lVar2; // capacity
  param_1[2] = lVar3 - lVar1 >> 3; // len from pointer difference
  return;
}

Note the useless loop.

The number of loop iterations (or rather the pointer increments) is needed to calculate the new length of the output Vec. LLVM already manages to hoist lVar3 = lVar1 + lVar4 * 8; but then it fails to eliminate the now-useless loop.
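In other words, all the loop's result is used for is a pointer-difference division. A sketch of that computation (names are mine, not from the compiler output):

```rust
// The element stride for usize on x86-64 is 8 bytes; the new length of the
// output Vec is just the pointer difference divided by that stride. This is
// the value the loop above computes one decrement at a time.
const STRIDE: usize = std::mem::size_of::<usize>();

fn recovered_len(start: usize, end: usize) -> usize {
    (end - start) / STRIDE
}
```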

The issue does not occur if one uses input.into_iter().flat_map(|e| None).collect() instead, which always results in length == 0.
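For reference, that variant would look like this (my reconstruction, assuming the same Foo newtype as above):

```rust
#[repr(transparent)]
pub struct Foo(usize);

// flat_map with a closure that always yields None: the collected Vec is
// statically known to be empty, so no counting loop is needed.
#[inline(never)]
pub fn vec_cast_empty(input: Vec<Foo>) -> Vec<usize> {
    input.into_iter().flat_map(|_e| None).collect()
}
```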

I tried several variations of the loop (e.g. replacing try_fold with a simple while let Some() ...), but they generally result in the same or worse assembly.

Note: The assembly looks somewhat different when I run this on godbolt, but the decrementing loop without side effects is still there. I assume the differences are due to LTO or other compiler settings.

Tested on commit a1a13b2 2020-11-21 22:46

@rustbot modify labels: +I-slow

@rustbot rustbot added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Nov 22, 2020
@jonas-schievink jonas-schievink added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-enhancement Category: An issue proposing an enhancement or a PR with one. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Nov 22, 2020
the8472 commented Nov 23, 2020

It looks like this can be solved by throwing more passes at it.

On godbolt it can be made loop-free by adding -C passes="loop-deletion dce loop-deletion"
godbolt link

example::vec_cast:
        mov     rax, rdi
        mov     rcx, qword ptr [rsi]
        mov     rdx, qword ptr [rsi + 8]
        mov     rdi, qword ptr [rsi + 16]
        mov     rsi, rcx
        test    rdi, rdi
        je      .LBB0_2
        lea     rsi, [rcx + 8*rdi]
.LBB0_2:
        sub     rsi, rcx
        sar     rsi, 3
        mov     qword ptr [rax], rcx
        mov     qword ptr [rax + 8], rdx
        mov     qword ptr [rax + 16], rsi
        ret

And branchless with -C passes="loop-deletion dce loop-deletion simplifycfg instcombine simplifycfg"
godbolt link

example::vec_cast:
        mov     rax, rdi
        mov     rcx, qword ptr [rsi]
        mov     rdx, qword ptr [rsi + 8]
        mov     rsi, qword ptr [rsi + 16]
        test    rsi, rsi
        lea     rsi, [rcx + 8*rsi]
        cmove   rsi, rcx
        sub     rsi, rcx
        sar     rsi, 3
        mov     qword ptr [rdi], rcx
        mov     qword ptr [rdi + 8], rdx
        mov     qword ptr [rdi + 16], rsi
        ret

Using opt-level=3 as starting point requires the same passes to be added to achieve these results, i.e. it does not provide a better baseline.

I failed to reproduce this improvement locally, but that might be due to argument-escaping issues when putting the space-separated pass list into RUSTFLAGS.
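For what it's worth, one way around the whitespace splitting (an untested sketch; assumes the flags go through a cargo recent enough to support CARGO_ENCODED_RUSTFLAGS) is the 0x1f-separated variant of the variable:

```shell
# RUSTFLAGS is split on whitespace, so `-C passes="a b c"` gets mangled.
# CARGO_ENCODED_RUSTFLAGS separates arguments with the 0x1f byte instead,
# keeping the space-separated pass list inside a single argument:
export CARGO_ENCODED_RUSTFLAGS="$(printf -- '-C\x1fpasses=loop-deletion dce loop-deletion')"
```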

the8472 commented Nov 23, 2020

Surprisingly, targeting a current CPU means only one extra pass is needed to achieve the loop-free result:
-O -Ctarget-cpu=znver2 -C passes="loop-deletion"
godbolt link

the8472 commented Dec 6, 2020

Inserting one loop-deletion pass directly into PassManagerBuilder::addFunctionSimplificationPasses also results in the branchless version, without having to change target-cpu or add other passes.
So I assume the extra passes above are only needed because of the way -Cpasses works.

AngelicosPhosphoros commented

It looks like the loop is removed on nightly now.
godbolt link

On nightly:

example::vec_cast:
        mov     rax, rdi
        mov     rcx, qword ptr [rsi]
        movups  xmm0, xmmword ptr [rsi + 8]
        mov     qword ptr [rdi], rcx
        movups  xmmword ptr [rdi + 8], xmm0
        ret

On stable:

example::vec_cast:
        mov     rax, rdi
        mov     rdi, qword ptr [rsi]
        mov     r8, qword ptr [rsi + 8]
        mov     rsi, qword ptr [rsi + 16]
        mov     rcx, rdi
        test    rsi, rsi
        je      .LBB0_7
        lea     r9, [8*rsi - 8]
        mov     ecx, r9d
        shr     ecx, 3
        add     ecx, 1
        mov     rdx, rdi
        and     rcx, 7
        je      .LBB0_4
        neg     rcx
        mov     rdx, rdi
.LBB0_3:
        add     rdx, 8
        inc     rcx
        jne     .LBB0_3
.LBB0_4:
        lea     rcx, [rdi + 8*rsi]
        cmp     r9, 56
        jb      .LBB0_7
        lea     rsi, [rdi + 8*rsi]
        sub     rsi, rdx
.LBB0_6:
        add     rsi, -64
        jne     .LBB0_6
.LBB0_7:
        sub     rcx, rdi
        sar     rcx, 3
        mov     qword ptr [rax], rdi
        mov     qword ptr [rax + 8], r8
        mov     qword ptr [rax + 16], rcx
        ret

the8472 commented Apr 3, 2021

That's only true for Copy types because I implemented a specialization for that purpose. The suboptimal assembly still exists for types not implementing Copy, e.g. newtypes. I'll update my initial post.
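To illustrate the distinction (a sketch; CopyFoo and PlainFoo are names of my own, not from the issue):

```rust
#[derive(Clone, Copy)]
#[repr(transparent)]
pub struct CopyFoo(usize);

#[repr(transparent)]
pub struct PlainFoo(usize);

// Copy wrapper: eligible for the specialization, so on nightly this compiles
// down to a plain move of the Vec's (pointer, capacity, length) fields.
#[inline(never)]
pub fn cast_copy(input: Vec<CopyFoo>) -> Vec<usize> {
    input.into_iter().map(|e| e.0).collect()
}

// Non-Copy wrapper: at the time of this comment, still produced the
// side-effect-free counting loop.
#[inline(never)]
pub fn cast_plain(input: Vec<PlainFoo>) -> Vec<usize> {
    input.into_iter().map(|e| e.0).collect()
}
```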

the8472 commented Jul 31, 2021

Something that was brought up on Zulip: this case also fails for Copy types (i.e. with the TrustedRandomAccess specialization) when the wrapper type is repr(C):

https://rust.godbolt.org/z/5oEnTcMqd
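The variant in question is along these lines (my reconstruction; the exact snippet is behind the godbolt link):

```rust
// Same newtype cast as before, but the wrapper is repr(C) rather than
// repr(transparent); this was enough to defeat the optimization.
#[derive(Clone, Copy)]
#[repr(C)]
pub struct Foo(usize);

#[inline(never)]
pub fn vec_cast(input: Vec<Foo>) -> Vec<usize> {
    input.into_iter().map(|e| e.0).collect()
}
```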

cristicbz commented Jan 25, 2022

Weirdly, it seems to work in the other direction: https://rust.godbolt.org/z/onbz6he9c

pub struct Foo(usize);

#[inline(never)]
pub fn vec_cast(input: Vec<usize>) -> Vec<Foo> {
    input.into_iter().map(|e| Foo(e)).collect()
}

is optimised away, whilst

pub struct Foo(usize);

#[inline(never)]
pub fn vec_cast(input: Vec<Foo>) -> Vec<usize> {
    input.into_iter().map(|e| e.0).collect()
}

is not.

bors added a commit to rust-lang-ci/rust that referenced this issue Apr 18, 2022
…ulacrum

Add codegen tests for additional cases where noop iterators get optimized away

Optimizations have improved over time and now LLVM manages to optimize more in-place-collect noop-iterators to O(1) functions. This updates the codegen test to match.

Many but not all cases reported in rust-lang#79308 work now.
nikic commented Apr 3, 2023

Is there a case here that still doesn't optimize? It looks like all the existing godbolt links do.

the8472 commented Apr 3, 2023

Looks good, I'll update the existing codegen test.
