Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unnecessary local variable and memcpy not elided during struct initialization #77955

Closed
tgnottingham opened this issue Oct 14, 2020 · 2 comments
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@tgnottingham
Copy link
Contributor

Here is a case where a local variable isn't optimized away, which introduces an unnecessary copy:

struct S {
    item: [u64; 17],
}

pub fn f() {
    let item = [1u64; 17];  
    unsafe { ext(&S{item}) };
}

extern {
    fn ext(s: &S);
}

item is created on the stack, and then copied into the S temporary. But item is otherwise unused, so the compiler could just populate the temporary directly.

Assembly at -C opt-level=3:

.LCPI0_0:
        .quad   1
        .quad   1
example::f:
        pushq   %rbx
        subq    $272, %rsp
        movaps  .LCPI0_0(%rip), %xmm0
        movaps  %xmm0, (%rsp)
        movaps  %xmm0, 16(%rsp)
        movaps  %xmm0, 32(%rsp)
        movaps  %xmm0, 48(%rsp)
        movaps  %xmm0, 64(%rsp)
        movaps  %xmm0, 80(%rsp)
        movaps  %xmm0, 96(%rsp)
        movaps  %xmm0, 112(%rsp)
        movq    $1, 128(%rsp)
        leaq    136(%rsp), %rbx
        movq    %rsp, %rsi
        movl    $136, %edx
        movq    %rbx, %rdi
        callq   *memcpy@GOTPCREL(%rip)
        movq    %rbx, %rdi
        callq   *ext@GOTPCREL(%rip)
        addq    $272, %rsp
        popq    %rbx
        retq

The problem occurs if the array initializer is 1u64, but not if it's 0u64. With 1u64, the generated IR uses a loop to populate the array, whereas with 0u64, it doesn't. The issue is likely that LLVM is unable to see through the relative complexity of the loop to notice that item can be eliminated.

Here is the LLVM-IR before any optimization passes:

define void @_ZN3lib1f17ha3b7b123ae9e0a2fE() unnamed_addr #0 {
start:
  %item = alloca [17 x i64], align 8
  %_3 = alloca %S, align 8
  %0 = getelementptr inbounds [17 x i64], [17 x i64]* %item, i64 0, i64 0
  %1 = getelementptr inbounds [17 x i64], [17 x i64]* %item, i64 0, i64 0
  %2 = getelementptr inbounds [17 x i64], [17 x i64]* %item, i64 0, i64 17
  br label %repeat_loop_header

bb1:                                              ; preds = %repeat_loop_next
  %3 = bitcast %S* %_3 to i8*
  call void @llvm.lifetime.end.p0i8(i64 136, i8* %3)
  ret void

repeat_loop_header:                               ; preds = %repeat_loop_body, %start
  %4 = phi i64* [ %1, %start ], [ %6, %repeat_loop_body ]
  %5 = icmp ne i64* %4, %2
  br i1 %5, label %repeat_loop_body, label %repeat_loop_next

repeat_loop_body:                                 ; preds = %repeat_loop_header
  store i64 1, i64* %4, align 8
  %6 = getelementptr inbounds i64, i64* %4, i64 1
  br label %repeat_loop_header

repeat_loop_next:                                 ; preds = %repeat_loop_header
  %7 = bitcast %S* %_3 to i8*
  call void @llvm.lifetime.start.p0i8(i64 136, i8* %7)
  %8 = bitcast %S* %_3 to [17 x i64]*
  %9 = bitcast [17 x i64]* %8 to i8*
  %10 = bitcast [17 x i64]* %item to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 8 %9, i8* align 8 %10, i64 136, i1 false)
  call void @ext(%S* noalias readonly align 8 dereferenceable(136) %_3)
  br label %bb1
}

If you replace the array with a struct with 17 u64 members, the problem also goes away. Again, this causes the IR to not use a loop to initialize item.

This was found while investigating #77613. I think the cause is slightly different, though I would guess it's the same general idea--LLVM can't see how to make some optimization through multiple basic blocks.

#45663 is also related, but may or may not have the same root cause.

Reproduces from 1.0.

@tgnottingham tgnottingham added the C-bug Category: This is a bug. label Oct 14, 2020
@jonas-schievink jonas-schievink added C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. and removed C-bug Category: This is a bug. labels Oct 14, 2020
@nikic
Copy link
Contributor

nikic commented Oct 17, 2020

https://rust.godbolt.org/z/haxG5T Typical stores to alloca followed by memcpy pattern.

@nikic nikic added the A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. label Oct 17, 2020
@nikic
Copy link
Contributor

nikic commented Mar 12, 2021

This no longer generates memcpy on nightly, due to the LLVM 12 update. The extra SROA after simple unrolling breaks it down.

@nikic nikic closed this as completed Mar 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

3 participants