
Compiler fails to optimize out bounds checks on 32-bit x86 and ARM #22924

Closed

rprichard opened this issue Mar 1, 2015 · 3 comments

Assignees
emberian

Labels
A-codegen (Area: Code generation), I-slow (Issue: Problems and improvements with respect to performance of generated code)

Comments

@rprichard (Contributor)

Test case:

#[inline(never)]
pub fn my_write(dst: &mut [u8]) {
    if dst.len() >= 41 {
        dst[40] = 17;
    }
}

fn main() {
    let mut buf = [0u8; 41];
    my_write(&mut buf);
}

The 32-bit x86 and ARM disassembly for my_write includes two checks of the slice length and a call to panic_bounds_check. The bounds check is optimized away on x86_64.
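For comparison, a variant that avoids indexed access entirely has no panic path at all, because get_mut returns an Option instead of panicking. This is just an illustrative alternative, not part of the test case above:

#[inline(never)]
pub fn my_write_checked(dst: &mut [u8]) {
    // get_mut performs the length check itself and returns None when index
    // 40 is out of bounds, so no call to panic_bounds_check is generated.
    if let Some(byte) = dst.get_mut(40) {
        *byte = 17;
    }
}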

Here's the x86 assembly:

_ZN8my_write20he7b6161ce2fc6d9beaaE:
    subl    $12, %esp
    cmpl    $41, 4(%ecx)
    jb  .LBB0_3
    movl    4(%ecx), %eax
    cmpl    $41, %eax
    jb  .LBB0_4
    movl    (%ecx), %eax
    movb    $17, 40(%eax)
.LBB0_3:
    addl    $12, %esp
    retl
.LBB0_4:
    movl    %eax, 8(%esp)
    movl    $40, 4(%esp)
    movl    $panic_bounds_check_loc20, (%esp)
    calll   _ZN9panicking18panic_bounds_check20hdb38771954ce4aaf58rE

LLVM IR:

; Function Attrs: noinline uwtable
define internal fastcc void @_ZN8my_write20he7b6161ce2fc6d9beaaE({ i8*, i32 }* noalias nocapture dereferenceable(8)) unnamed_addr #0 {
entry-block:
  %1 = bitcast { i8*, i32 }* %0 to i8*
  %2 = bitcast { i8*, i32 }* %0 to i64*
  %3 = load i64* %2, align 4
  %.sroa.3.0.extract.shift.i.i = lshr i64 %3, 32
  %.sroa.3.0.extract.trunc.i.i = trunc i64 %.sroa.3.0.extract.shift.i.i to i32
  %4 = icmp ugt i32 %.sroa.3.0.extract.trunc.i.i, 40
  %5 = trunc i64 %3 to i32
  %6 = inttoptr i32 %5 to i8*
  br i1 %4, label %then-block-17-, label %next-block

then-block-17-:                                   ; preds = %entry-block
  %7 = getelementptr inbounds { i8*, i32 }* %0, i32 0, i32 1
  %8 = load i32* %7, align 4
  %9 = icmp ult i32 %8, 41
  br i1 %9, label %cond, label %next, !prof !0

next:                                             ; preds = %then-block-17-
  %10 = getelementptr inbounds i8* %6, i32 40
  store i8 17, i8* %10, align 1
  br label %next-block

cond:                                             ; preds = %then-block-17-
  tail call void @_ZN9panicking18panic_bounds_check20hdb38771954ce4aaf58rE({ %str_slice, i32 }* noalias readonly dereferenceable(12) @panic_bounds_check_loc20, i32 40, i32 %8)
  unreachable

next-block:                                       ; preds = %entry-block, %next
  tail call void @llvm.lifetime.end(i64 8, i8* %1)
  ret void
}

Compiler version:

rustc 1.0.0-nightly (e233987ce 2015-02-27) (built 2015-02-28)
binary: rustc
commit-hash: e233987ce1de88a48db2ce612019ba644d3cf5dd
commit-date: 2015-02-27
build-date: 2015-02-28
host: i686-unknown-linux-gnu
release: 1.0.0-nightly

My ARM compiler is older -- 2015-02-18 or so.

@rprichard (Contributor, Author)

The problem goes away if I stick this at the start of my_write:

let dst = dst;
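In context, the workaround looks like this (a minimal sketch; the comment is my reading of the IR above, where the length is read once through the combined i64 load and once through a separate i32 load):

#[inline(never)]
pub fn my_write(dst: &mut [u8]) {
    // Rebinding presumably copies the fat pointer (data pointer + length)
    // into a fresh local, so both length reads below come from the same
    // location and LLVM can prove the second check redundant.
    let dst = dst;
    if dst.len() >= 41 {
        dst[40] = 17;
    }
}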

@kmcallister added the I-slow and A-codegen labels Mar 2, 2015
@dotdash (Contributor) commented Mar 4, 2015

http://llvm.org/bugs/show_bug.cgi?id=22786

We could work around this by special-casing slices and copying them with two separate loads/stores. Not sure if it's worth it.
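A rough source-level analogy of that idea, purely for illustration (the actual suggestion concerns how rustc copies the fat pointer internally, not user code):

#[inline(never)]
pub fn my_write_split(dst: &mut [u8]) {
    let len = dst.len();        // length read once
    let ptr = dst.as_mut_ptr(); // data pointer read once
    if len >= 41 {
        // Sound: len >= 41 implies index 40 is in bounds.
        unsafe { *ptr.offset(40) = 17; }
    }
}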

@emberian (Member)

@dotdash this should obviously be fixed upstream; it will help more than just us, I assume.

@emberian self-assigned this Mar 25, 2015
dotdash added a commit to dotdash/rust that referenced this issue Jun 20, 2015
This has a number of advantages compared to creating a copy in memory
and passing a pointer. The obvious one is that we don't have to put the
data into memory but can keep it in registers. Since we're currently
passing a pointer anyway (instead of using e.g. a known offset on the
stack, which is what the `byval` attribute would achieve), we only use a
single additional register for each fat pointer, but save at least two
pointers worth of stack in exchange (sometimes more because more than
one copy gets eliminated). On archs that pass arguments on the stack, we
save a pointer worth of stack even without considering the omitted
copies.

Additionally, LLVM can optimize the code a lot better, to a large degree
because lots of copies are gone or can be optimized away. We can also now
emit attributes like nonnull on the data and/or vtable pointers contained
in the fat pointer, potentially allowing for even more optimizations.

This results in LLVM passes being about 3-7% faster (depending on the
crate), and the resulting code is also a few percent smaller, for
example:

   text    data  filename
5671479 3941461  before/librustc-d8ace771.so
5447663 3905745  after/librustc-d8ace771.so

1944425 2394024  before/libstd-d8ace771.so
1896769 2387610  after/libstd-d8ace771.so

I had to remove a call in the backtrace-debuginfo test, because LLVM can
now merge the tails of some blocks when optimizations are turned on,
which can't correctly preserve line info.

Fixes rust-lang#22924

Cc rust-lang#22891 (at least for fat pointers the code is good now)
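As a rough illustration of the calling-convention change: the "before" shape below matches the IR in the report, while the "after" shape is an assumption based on this commit message, not copied from actual compiler output.

// A &mut [u8] argument is a fat pointer: (data pointer, length).
#[inline(never)]
pub fn my_write(dst: &mut [u8]) {
    // Before: roughly  void @my_write({ i8*, i32 }* dereferenceable(8))
    //         -- the fat pointer is spilled to memory and passed by reference.
    // After (assumed): roughly  void @my_write(i8* nonnull %data, i32 %len)
    //         -- pointer and length are passed as separate immediate values,
    //            so the length is a single SSA value and the duplicate
    //            bounds check can be folded away.
    if dst.len() >= 41 {
        dst[40] = 17;
    }
}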
bors added a commit that referenced this issue Jun 20, 2015, with the same commit message as above.