Skip to content

Commit 2d91939

Browse files
committed
Auto merge of #107634 - scottmcm:array-drain, r=thomcc
Improve the `array::map` codegen The `map` method on arrays [is documented as sometimes performing poorly](https://doc.rust-lang.org/std/primitive.array.html#note-on-performance-and-stack-usage), and after [a question on URLO](https://users.rust-lang.org/t/try-trait-residual-o-trait-and-try-collect-into-array/88510?u=scottmcm) prompted me to take another look at the core [`try_collect_into_array`](https://github.com/rust-lang/rust/blob/7c46fb2111936ad21a8e3aa41e9128752357f5d8/library/core/src/array/mod.rs#L865-L912) function, I had some ideas that ended up working better than I'd expected. There's three main ideas in here, split over three commits: 1. Don't use `array::IntoIter` when we can avoid it, since that seems to not get SRoA'd, meaning that every step writes things like loop counters into the stack unnecessarily 2. Don't return arrays in `Result`s unnecessarily, as that doesn't seem to optimize away even with `unwrap_unchecked` (perhaps because it needs to get moved into a new LLVM type to account for the discriminant) 3. Don't distract LLVM with all the `Option` dances when we know for sure we have enough items (like in `map` and `zip`). This one's a larger commit as to do it I ended up adding a new `pub(crate)` trait, but hopefully those changes are still straight-forward. (No libs-api changes; everything should be completely implementation-detail-internal.) It's still not completely fixed -- I think it needs pcwalton's `memcpy` optimizations still (#103830) to get further -- but this seems to go much better than before. And the remaining `memcpy`s are just `transmute`-equivalent (`[T; N] -> ManuallyDrop<[T; N]>` and `[MaybeUninit<T>; N] -> [T; N]`), so hopefully those will be easier to remove with LLVM16 than the previous subobject copies 🤞 r? `@thomcc` As a simple example, this test ```rust pub fn long_integer_map(x: [u32; 64]) -> [u32; 64] { x.map(|x| 13 * x + 7) } ``` On nightly <https://rust.godbolt.org/z/xK7548TGj> takes `sub rsp, 808` ```llvm start: %array.i.i.i.i = alloca [64 x i32], align 4 %_3.sroa.5.i.i.i = alloca [65 x i32], align 4 %_5.i = alloca %"core::iter::adapters::map::Map<core::array::iter::IntoIter<u32, 64>, [closure@/app/example.rs:2:11: 2:14]>", align 8 ``` (and yes, that's a 6**5**-element array `alloca` despite 6**4**-element input and output) But with this PR it's only `sub rsp, 520` ```llvm start: %array.i.i.i.i.i.i = alloca [64 x i32], align 4 %array1.i.i.i = alloca %"core::mem::manually_drop::ManuallyDrop<[u32; 64]>", align 4 ``` Similarly, the loop it emits on nightly is scalar-only and horrifying ```nasm .LBB0_1: mov esi, 64 mov edi, 0 cmp rdx, 64 je .LBB0_3 lea rsi, [rdx + 1] mov qword ptr [rsp + 784], rsi mov r8d, dword ptr [rsp + 4*rdx + 528] mov edi, 1 lea edx, [r8 + 2*r8] lea r8d, [r8 + 4*rdx] add r8d, 7 .LBB0_3: test edi, edi je .LBB0_11 mov dword ptr [rsp + 4*rcx + 272], r8d cmp rsi, 64 jne .LBB0_6 xor r8d, r8d mov edx, 64 test r8d, r8d jne .LBB0_8 jmp .LBB0_11 .LBB0_6: lea rdx, [rsi + 1] mov qword ptr [rsp + 784], rdx mov edi, dword ptr [rsp + 4*rsi + 528] mov r8d, 1 lea esi, [rdi + 2*rdi] lea edi, [rdi + 4*rsi] add edi, 7 test r8d, r8d je .LBB0_11 .LBB0_8: mov dword ptr [rsp + 4*rcx + 276], edi add rcx, 2 cmp rcx, 64 jne .LBB0_1 ``` whereas with this PR it's unrolled and vectorized ```nasm vpmulld ymm1, ymm0, ymmword ptr [rsp + 64] vpaddd ymm1, ymm1, ymm2 vmovdqu ymmword ptr [rsp + 328], ymm1 vpmulld ymm1, ymm0, ymmword ptr [rsp + 96] vpaddd ymm1, ymm1, ymm2 vmovdqu ymmword ptr [rsp + 360], ymm1 ``` (though sadly still stack-to-stack)
2 parents 2008188 + bb77860 commit 2d91939

File tree

16 files changed

+395
-142
lines changed

16 files changed

+395
-142
lines changed

library/core/src/array/drain.rs

+76
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,76 @@
1+
use crate::iter::{TrustedLen, UncheckedIterator};
2+
use crate::mem::ManuallyDrop;
3+
use crate::ptr::drop_in_place;
4+
use crate::slice;
5+
6+
/// A situationally-optimized version of `array.into_iter().for_each(func)`.
7+
///
8+
/// [`crate::array::IntoIter`]s are great when you need an owned iterator, but
9+
/// storing the entire array *inside* the iterator like that can sometimes
10+
/// pessimize code. Notable, it can be more bytes than you really want to move
11+
/// around, and because the array accesses index into it SRoA has a harder time
12+
/// optimizing away the type than it does iterators that just hold a couple pointers.
13+
///
14+
/// Thus this function exists, which gives a way to get *moved* access to the
15+
/// elements of an array using a small iterator -- no bigger than a slice iterator.
16+
///
17+
/// The function-taking-a-closure structure makes it safe, as it keeps callers
18+
/// from looking at already-dropped elements.
19+
pub(crate) fn drain_array_with<T, R, const N: usize>(
20+
array: [T; N],
21+
func: impl for<'a> FnOnce(Drain<'a, T>) -> R,
22+
) -> R {
23+
let mut array = ManuallyDrop::new(array);
24+
// SAFETY: Now that the local won't drop it, it's ok to construct the `Drain` which will.
25+
let drain = Drain(array.iter_mut());
26+
func(drain)
27+
}
28+
29+
/// See [`drain_array_with`] -- this is `pub(crate)` only so it's allowed to be
30+
/// mentioned in the signature of that method. (Otherwise it hits `E0446`.)
31+
// INVARIANT: It's ok to drop the remainder of the inner iterator.
32+
pub(crate) struct Drain<'a, T>(slice::IterMut<'a, T>);
33+
34+
impl<T> Drop for Drain<'_, T> {
35+
fn drop(&mut self) {
36+
// SAFETY: By the type invariant, we're allowed to drop all these.
37+
unsafe { drop_in_place(self.0.as_mut_slice()) }
38+
}
39+
}
40+
41+
impl<T> Iterator for Drain<'_, T> {
42+
type Item = T;
43+
44+
#[inline]
45+
fn next(&mut self) -> Option<T> {
46+
let p: *const T = self.0.next()?;
47+
// SAFETY: The iterator was already advanced, so we won't drop this later.
48+
Some(unsafe { p.read() })
49+
}
50+
51+
#[inline]
52+
fn size_hint(&self) -> (usize, Option<usize>) {
53+
let n = self.len();
54+
(n, Some(n))
55+
}
56+
}
57+
58+
impl<T> ExactSizeIterator for Drain<'_, T> {
59+
#[inline]
60+
fn len(&self) -> usize {
61+
self.0.len()
62+
}
63+
}
64+
65+
// SAFETY: This is a 1:1 wrapper for a slice iterator, which is also `TrustedLen`.
66+
unsafe impl<T> TrustedLen for Drain<'_, T> {}
67+
68+
impl<T> UncheckedIterator for Drain<'_, T> {
69+
unsafe fn next_unchecked(&mut self) -> T {
70+
// SAFETY: `Drain` is 1:1 with the inner iterator, so if the caller promised
71+
// that there's an element left, the inner iterator has one too.
72+
let p: *const T = unsafe { self.0.next_unchecked() };
73+
// SAFETY: The iterator was already advanced, so we won't drop this later.
74+
unsafe { p.read() }
75+
}
76+
}

0 commit comments

Comments
 (0)