
align_to prefix max length not taken into account in optimization #72356

Open
hsivonen opened this issue May 19, 2020 · 4 comments
Labels
A-autovectorization - Area: Autovectorization, which can impact perf or code size
A-LLVM - Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
A-SIMD - Area: SIMD (Single Instruction Multiple Data)
C-bug - Category: This is a bug.
I-slow - Issue: Problems and improvements with respect to performance of generated code.
T-compiler - Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@hsivonen (Member)

I tried this code:

pub fn count_non_ascii_sse2(buffer: &[u8]) -> u64 {
    let mut count = 0;
    let (prefix, simd, suffix) = unsafe { buffer.align_to::<core::arch::x86_64::__m128i>() };
    for &b in prefix {
        if b >= 0x80 {
            count += 1;
        }
    }
    for &s in simd {
        count += unsafe { core::arch::x86_64::_mm_movemask_epi8(s) }.count_ones() as u64;
    }
    for &b in suffix {
        if b >= 0x80 {
            count += 1;
        }
    }
    count
}

Godbolt link

I expected to see this happen: the compiler should conclude that prefix.len() < 16 and therefore not emit an autovectorized version of the first scalar loop.

Instead, this happened: the compiler emitted an autovectorized version of the first scalar loop even though the prefix is never long enough for the vectorized path to be useful.
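The expectation rests on a property of align_to that holds on real targets: the unaligned prefix is shorter than the alignment of the target type. A minimal sketch (my own illustration, not from the issue) checking the related partition property at runtime; note the align_to documentation does not guarantee the short-prefix split (under Miri, for example, everything may land in the prefix):

```rust
fn main() {
    let buf = [0u8; 100];
    // Reinterpret the aligned middle of the byte buffer as u128 values.
    let (prefix, mid, suffix) = unsafe { buf.align_to::<u128>() };
    // The three slices always partition the original buffer exactly.
    assert_eq!(
        prefix.len() + mid.len() * core::mem::size_of::<u128>() + suffix.len(),
        buf.len()
    );
    // Observed on real targets (but not contractually guaranteed):
    // the prefix is shorter than the target type's alignment.
    assert!(prefix.len() < core::mem::align_of::<u128>());
    println!("prefix={} mid={} suffix={}", prefix.len(), mid.len(), suffix.len());
}
```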

Meta

rustc --version --verbose:

rustc 1.45.0-nightly (a74d1862d 2020-05-14)
@hsivonen hsivonen added the C-bug Category: This is a bug. label May 19, 2020
@jonas-schievink jonas-schievink added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. A-SIMD Area: SIMD (Single Instruction Multiple Data) I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels May 19, 2020
@workingjubilee (Member)

Does the equivalent C or C++ code perform this optimization with clang or gcc?

@curiouscat2018

What about adding an assume intrinsic (plus the arithmetic to compute the prefix length) inside buffer.align_to? In my experiment this removes the redundant vectorization; it might be difficult to optimize on the LLVM side instead.

    let (prefix, simd, suffix) = unsafe { buffer.align_to::<core::arch::x86_64::__m128i>() };
    unsafe { std::intrinsics::assume(prefix.len() < 16); }

https://rust.godbolt.org/z/3ePss8P8Y
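For context, here is a sketch of the suggested hint applied at the call site rather than inside align_to itself. std::intrinsics::assume is unstable; on stable Rust (1.81+), std::hint::assert_unchecked expresses the same hint. This is my own adaptation, not code from the issue, and whether LLVM actually drops the vectorized prefix loop still has to be verified in the generated assembly:

```rust
#[cfg(target_arch = "x86_64")]
pub fn count_non_ascii_sse2(buffer: &[u8]) -> u64 {
    use core::arch::x86_64::{__m128i, _mm_movemask_epi8};
    let mut count = 0;
    let (prefix, simd, suffix) = unsafe { buffer.align_to::<__m128i>() };
    // UB if false: promise the optimizer that the unaligned prefix is
    // shorter than the 16-byte alignment of __m128i. This holds on real
    // targets but not, e.g., under Miri, where align_to may put the
    // whole buffer in the prefix.
    unsafe { std::hint::assert_unchecked(prefix.len() < 16) };
    for &b in prefix {
        if b >= 0x80 {
            count += 1;
        }
    }
    for &s in simd {
        // movemask extracts the high bit of each of the 16 bytes.
        count += unsafe { _mm_movemask_epi8(s) }.count_ones() as u64;
    }
    for &b in suffix {
        if b >= 0x80 {
            count += 1;
        }
    }
    count
}

#[cfg(target_arch = "x86_64")]
fn main() {
    // Bytes 0x80..=0xFF have the high bit set: exactly 128 of them.
    let v: Vec<u8> = (0u8..=255).collect();
    assert_eq!(count_non_ascii_sse2(&v), 128);
    println!("ok");
}

#[cfg(not(target_arch = "x86_64"))]
fn main() {}
```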

@thomcc (Member)

thomcc commented Jun 24, 2022

Does the equivalent C or C++ code perform this optimization with clang or gcc?

Yeah, in practice align_to often results in bad codegen because of this. Unfortunately, just adding an assume inside it isn't quite right: under Miri (and in other edge cases) align_to may legitimately put everything in the prefix slice.

This is one of the reasons I really dislike the align_to API (IMO it should return an Option in the case where it can't align, rather than putting everything in the prefix).

@thomcc (Member)

thomcc commented Jun 27, 2022

(See also my comments in #53020 (comment), although the issue is otherwise unrelated)

nagisa added a commit to nagisa/rust that referenced this issue Jul 16, 2022
This generalizes the previous `stride == 1` special case to apply to any
situation where the requested alignment is divisible by the stride. This
in turn allows the test case from rust-lang#98809 produce ideal assembly, along
the lines of:

    leaq 15(%rdi), %rax
    andq $-16, %rax

This also produces pretty high quality code for situations where the
alignment of the input pointer isn’t known:

    pub unsafe fn ptr_u32(slice: *const u32) -> *const u32 {
        slice.offset(slice.align_offset(16) as isize)
    }

    // =>

    movl %edi, %eax
    andl $3, %eax
    leaq 15(%rdi), %rcx
    andq $-16, %rcx
    subq %rdi, %rcx
    shrq $2, %rcx
    negq %rax
    sbbq %rax, %rax
    orq  %rcx, %rax
    leaq (%rdi,%rax,4), %rax

Here LLVM is smart enough to replace the `usize::MAX` special case with
a branch-less bitwise-OR approach, where the mask is constructed using
the neg and sbb instructions. This appears to work across various
architectures I’ve tried.

This change ends up introducing more branches and code in situations
where there is less knowledge of the arguments. For example when the
requested alignment is entirely unknown. This use-case was never really
a focus of this function, so I’m not particularly worried, especially
since llvm-mca is saying that the new code is still appreciably faster,
despite all the new branching.

Fixes rust-lang#98809.
Sadly, this does not help with rust-lang#72356.
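For reference, the align_offset contract the commit above relies on: it returns the number of elements (not bytes) to advance so the pointer becomes aligned, or usize::MAX when alignment cannot be reached. A small sketch (my own illustration):

```rust
fn main() {
    let xs = [0u32; 8];
    let p = xs.as_ptr();
    // The offset is counted in elements of the pointee type (u32, stride 4),
    // so at most 3 elements are needed to reach 16-byte alignment.
    let off = p.align_offset(16);
    if off != usize::MAX {
        assert!(off < 4);
        // Advancing by `off` elements reaches a 16-byte-aligned address.
        let aligned = unsafe { p.add(off) };
        assert_eq!(aligned as usize % 16, 0);
    }
    println!("align_offset(16) = {}", off);
}
```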
bors added a commit to rust-lang-ci/rust that referenced this issue Jul 16, 2022
Add a special case for align_offset /w stride != 1
@workingjubilee workingjubilee added the A-autovectorization Area: Autovectorization, which can impact perf or code size label Jul 22, 2023