
Write on &mut [u8] and Cursor<&mut [u8]> doesn't optimize very well. #44099

Open · oyvindln opened this issue Aug 26, 2017 · 6 comments
Labels: C-enhancement (An issue proposing an enhancement or a PR with one) · C-optimization (An issue highlighting optimization opportunities or PRs implementing such) · I-slow (Problems and improvements with respect to performance of generated code) · T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue)

Comments

oyvindln (Contributor) commented Aug 26, 2017

Calling write on a mutable slice (or on one wrapped in a Cursor) with one byte, or a small number of bytes, results in a call to memcpy after optimization (opt-level=3), rather than a simple store as one would expect:

use std::io::Write; // the Write trait must be in scope for the `write` method

pub fn one_byte(mut buf: &mut [u8], byte: u8) {
    let _ = buf.write(&[byte]);
}

Results in:

define void @_ZN6cursor8one_byte17h68c172d435558ab9E(i8* nonnull, i64, i8) unnamed_addr #0 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
_ZN4core3ptr13drop_in_place17hc17de44f7e6456c9E.exit:
  %_10.sroa.0 = alloca i8, align 1
  call void @llvm.lifetime.start(i64 1, i8* nonnull %_10.sroa.0)
  store i8 %2, i8* %_10.sroa.0, align 1
  %3 = icmp ne i64 %1, 0
  %_0.0.sroa.speculated.i.i.i = zext i1 %3 to i64
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull %0, i8* nonnull %_10.sroa.0, i64 %_0.0.sroa.speculated.i.i.i, i32 1, i1 false), !noalias !0
  call void @llvm.lifetime.end(i64 1, i8* nonnull %_10.sroa.0)
  ret void
}
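
The Cursor variant from the issue title behaves the same way; a minimal sketch for reference (the function name here is illustrative):

use std::io::{Cursor, Write};

// Same one-byte write, but through Cursor<&mut [u8]> as in the issue title.
pub fn one_byte_cursor(buf: &mut Cursor<&mut [u8]>, byte: u8) {
    let _ = buf.write(&[byte]);
}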

copy_from_slice seems to be part of the issue here. If I change the write implementation on mutable slices to use the following loop instead of copy_from_slice:

// `data` is the input slice, `a` the destination split off the buffer,
// and `amt` the number of bytes to copy.
for (&input, output) in data[..amt].iter().zip(a.iter_mut()) {
    *output = input;
}

the LLVM IR looks much nicer:

define void @_ZN6cursor8one_byte17h68c172d435558ab9E(i8* nonnull, i64, i8) unnamed_addr #0 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
start:
  %3 = icmp eq i64 %1, 0
  br i1 %3, label %_ZN4core3ptr13drop_in_place17hc17de44f7e6456c9E.exit, label %"_ZN84_$LT$core..iter..Zip$LT$A$C$$u20$B$GT$$u20$as$u20$core..iter..iterator..Iterator$GT$4next17he84ad69753d1c347E.exit.preheader.i"

"_ZN84_$LT$core..iter..Zip$LT$A$C$$u20$B$GT$$u20$as$u20$core..iter..iterator..Iterator$GT$4next17he84ad69753d1c347E.exit.preheader.i": ; preds = %start
  store i8 %2, i8* %0, align 1, !noalias !0
  br label %_ZN4core3ptr13drop_in_place17hc17de44f7e6456c9E.exit

_ZN4core3ptr13drop_in_place17hc17de44f7e6456c9E.exit: ; preds = %start, %"_ZN84_$LT$core..iter..Zip$LT$A$C$$u20$B$GT$$u20$as$u20$core..iter..iterator..Iterator$GT$4next17he84ad69753d1c347E.exit.preheader.i"
  ret void
}

The for loop will still result in vector operations on longer slices, but I'm unsure whether this change could cause a slowdown on very long slices, as the system memcpy implementation may be better tuned, and it doesn't really solve the underlying issue. There seems to be a problem with optimizing copy_from_slice calls that follow split_at_mut, and probably other calls that involve slice operations. (I tried altering the write function to use unsafe code and build a temporary slice from raw pointers instead, but that didn't help.)
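
For context, here is a self-contained approximation of that change, with the manual loop in place of copy_from_slice (the SliceWriter wrapper and its shape are a sketch; libstd implements Write directly on &mut [u8]):

use std::cmp;
use std::io::{self, Write};
use std::mem;

// Illustrative stand-in for <&mut [u8] as Write>; a wrapper type keeps
// this sketch self-contained.
struct SliceWriter<'a>(&'a mut [u8]);

impl<'a> Write for SliceWriter<'a> {
    fn write(&mut self, data: &[u8]) -> io::Result<usize> {
        let amt = cmp::min(data.len(), self.0.len());
        // Split off the written prefix, then advance the buffer.
        let (a, b) = mem::replace(&mut self.0, &mut []).split_at_mut(amt);
        // Element-wise copy instead of a.copy_from_slice(&data[..amt]).
        for (&input, output) in data[..amt].iter().zip(a.iter_mut()) {
            *output = input;
        }
        self.0 = b;
        Ok(amt)
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}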

This happens on both nightly rustc 1.21.0-nightly (2aeb5930f 2017-08-25) and stable (1.19) on x86_64-unknown-linux-gnu. (Not sure if the memcpy behaviour could be different on other platforms.)

oyvindln changed the title from "Write on &mut [u8] and Cursor<&mut [u8> doesn't optimize very well." to "Write on &mut [u8] and Cursor<&mut [u8]> doesn't optimize very well." Aug 26, 2017
mattico (Contributor) commented Aug 26, 2017

I thought I remembered a recent PR to encourage more memcpy use for optimization, but seeing #37573 it seems this has been an issue for a while.

If we do special-case small copies in trans, perhaps that optimization will no longer be needed.

oyvindln (Contributor, Author) commented Aug 26, 2017

I managed to reduce this a bit further:

use std::cmp;

#[inline]
fn write(buf: &mut [u8], data: &[u8]) {
    // With this condition it optimizes to a store; without it we get memcpy calls.
    if buf.len() < 1 {
        return;
    }
    // This also gets rid of the memcpy:
    // let amt = data.len();
    // But this doesn't:
    // let amt = buf.len();
    let amt = cmp::min(data.len(), buf.len());
    // Slice both sides to `amt` so the lengths match
    // (copy_from_slice panics on mismatched lengths).
    buf[..amt].copy_from_slice(&data[..amt]);
}

pub fn write_byte(buf: &mut [u8], byte: u8) {
    write(buf, &[byte]);
}

It looks like the optimizer has some issues fully optimizing copy_from_slice if there is a possibility that amt in this example could be 0.

oyvindln (Contributor, Author) commented Aug 26, 2017

Wrapping the innards of (a local copy of) copy_from_slice in if dst.len() > 0 seems to solve the issue. Is this a viable workaround, or could it possibly cause regressions in other cases?

EDIT: Actually, that would change the behaviour slightly (a length mismatch with an empty destination would no longer trigger the assert), and checking after the assert call doesn't work...
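
A sketch of the variant being discussed, using a hypothetical local stand-in for copy_from_slice specialized to u8 (the real method is generic over [T]):

use std::ptr;

// Hypothetical local copy of copy_from_slice. The early return has to come
// before the length assert for the memcpy to optimize away, which is the
// slight behaviour change noted above: a length mismatch with an empty
// destination no longer panics.
fn copy_from_slice_local(dst: &mut [u8], src: &[u8]) {
    if dst.is_empty() {
        return;
    }
    assert_eq!(dst.len(), src.len(), "source slice length does not match destination");
    // Safe: lengths are equal and both slices are valid for dst.len() bytes.
    unsafe {
        ptr::copy_nonoverlapping(src.as_ptr(), dst.as_mut_ptr(), dst.len());
    }
}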

oyvindln (Contributor, Author) commented:

Even further simplified:

use std::cmp;
use std::ptr;

pub fn test(buf: &mut [u8], src: &[u8]) {
    let amt = cmp::min(buf.len(), src.len());
    // Copy 0 or 1 bytes.
    let amt = cmp::min(amt, 1);
    unsafe {
        ptr::copy_nonoverlapping(src.as_ptr(), buf.as_mut_ptr(), amt);
    }
}

Mark-Simulacrum added the C-enhancement, I-slow, and T-compiler labels Aug 26, 2017
arielb1 (Contributor) commented Aug 27, 2017

Looks like https://llvm.org/bugs/show_bug.cgi?id=31001 aka #37573

EDIT: LLVM bug is now llvm/llvm-project#30349

oyvindln (Contributor, Author) commented Aug 27, 2017

Indeed. It doesn't look like there have been any updates on that LLVM bug in the meantime. Based on the discussion in the PR, it seems adding a check to write would be the preferred solution.

That said, maybe it would be an idea to add methods to Read/Write that read or write a single byte.
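
For illustration, a minimal sketch of what such an extension could look like (the WriteByte trait and write_u8 name are hypothetical, not an actual std API):

use std::io::{self, Write};

// Hypothetical single-byte write helper of the kind suggested above.
trait WriteByte {
    fn write_u8(&mut self, byte: u8) -> io::Result<()>;
}

impl<W: Write + ?Sized> WriteByte for W {
    fn write_u8(&mut self, byte: u8) -> io::Result<()> {
        self.write_all(&[byte])
    }
}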

Maybe we could also add a note to the copy_from_slice documentation stating that it may be sub-optimal for small slices whose size isn't a compile-time constant.
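
Something along these lines, perhaps (suggested wording only, not actual documentation):

/// # Performance note
///
/// For small slices whose length is not a compile-time constant, this
/// method may compile to a call to memcpy rather than to inline stores.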

workingjubilee added the C-optimization label Oct 8, 2023