Skip to content

Lack of autovectorization of loops over references into contiguous memory #106011

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
the8472 opened this issue Aug 25, 2024 · 1 comment
Open

Comments

@the8472
Copy link

the8472 commented Aug 25, 2024

Rust iterators over slices yield references to each item. Even when iterating over references to primitives and applying simple, pure functions LLVM only manages to unroll the loops but fails to vectorize them.

Case 1 - slice.iter().fold()

https://rust.godbolt.org/z/fn3W8PGE3

The second function from rust-lang/rust#113789

pub fn check_range_fold(keys: &[u32], max: u32) -> bool {
    keys.iter().fold(true, |a, x| a && *x < max)
}

results in this loop body

.LBB0_5:
        cmp     dword ptr [rdi + 4*r8], edx
        setb    r9b
        and     r9b, al
        cmp     dword ptr [rdi + 4*r8 + 4], edx
        setb    al
        cmp     dword ptr [rdi + 4*r8 + 8], edx
        setb    r10b
        and     r10b, al
        and     r10b, r9b
        cmp     dword ptr [rdi + 4*r8 + 12], edx
        setb    al
        and     al, r10b
        add     r8, 4
        cmp     rsi, r8
        jne     .LBB0_5

This is unrolled and the values are loaded unconditionally, so it's not an issue of dereferencability or profitability of loading the data.

Expected behavior: Vectorizes with vpcmpud

Case 2 - manually unrolled slice.iter().all()

https://rust.godbolt.org/z/dxEsqj3fx

This is derived from an attempt to implement manual unrolling for iterator.all() - which is short-circuiting - to convince LLVM that vectorization would be profitable. Again the unrolling happens and the values are loaded unconditionally but they're not vectorized.

#![feature(slice_as_chunks)]

pub fn all_auto(a: &[u32], b: u32) -> bool {
    all_generic_simple(a, |&x| x == b)
}


fn all_generic_simple<'a>(a: &'a [u32], mut b: impl FnMut(&'a u32) -> bool) -> bool {
    const UNROLL: usize = 4;

    let (chunks, remainder) = a.as_chunks::<UNROLL>();

    for chunk in chunks {
        let chunk: &[u32; 4] = chunk;

        if !(b(&chunk[0]) && b(&chunk[1]) && b(&chunk[2]) && b(&chunk[3])) {
            return false;
        }
    }

    remainder.iter().all(b)
}
.LBB0_1:
        test    rsi, rsi
        je      .LBB0_2
        cmp     dword ptr [rax], edx
        jne     .LBB0_12
        cmp     dword ptr [rax + 4], edx
        jne     .LBB0_12
        cmp     dword ptr [rax + 8], edx
        jne     .LBB0_12
        add     rsi, -16
        lea     r9, [rax + 16]
        cmp     dword ptr [rax + 12], edx
        mov     rax, r9
        je      .LBB0_1

Expected behavior: Vectorizes with vpcmpeqd

Curiously moving the unrolled short-circuiting comparison into a function - which still gets inlined - and bloating its return type with an additional dummy value causes it to vectorize.

.LBB0_2:
        vpcmpeqd        k0, xmm0, xmmword ptr [rdi + rcx]
        kmovd   r8d, k0
        cmp     r8b, 15
        jne     .LBB0_10
        add     rcx, 16
        cmp     rax, rcx
        jne     .LBB0_2

Rust issue similar to this case: rust-lang/rust#105259

Edit: removed the 3rd case, I missed an important step to keep things in vector registers.

@nikic
Copy link
Contributor

nikic commented Aug 25, 2024

From a quick look at the first case only, the problem is that we don't know that *x is (unconditionally) noundef. Vectorization works if you replace the logical and with a frozen bitwise and: https://llvm.godbolt.org/z/56dcW4z6s

Probably LoopVectorize should support logical and/or reductions by inserting a freeze operation (cc @fhahn).

@the8472 the8472 changed the title Lack of or poor autovectorization of loops over references into contiguous memory Lack of autovectorization of loops over references into contiguous memory Aug 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants