-
Notifications
You must be signed in to change notification settings - Fork 12.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
align_to prefix max length not taken into account in optimization #72356
Comments
Does the equivalent C or C++ code perform this optimization with clang or gcc? |
What about adding assume intrinsic (and math to calculate length) inside buffer.align_to? This solves redundant vectorization according to experiment. It might be difficult to optimize on llvm side. let (prefix, simd, suffix) = unsafe { buffer.align_to::<core::arch::x86_64::__m128i>() };
unsafe {std::intrinsics::assume(prefix.len() < 16);} |
Yeah, in practice align_to often results in bad codegen because of this. Unfortunately, just adding an assume inside it isn't quite right, because under miri and in other cases it will all end up in the prefix slice. This is one of the reasons I really dislike the align_to API (IMO it should return an Option in the case where it can't align, rather than putting everything in the prefix). |
(See also my comments in #53020 (comment), although the issue is otherwise unrelated) |
This generalizes the previous `stride == 1` special case to apply to any situation where the requested alignment is divisible by the stride. This in turn allows the test case from rust-lang#98809 produce ideal assembly, along the lines of: leaq 15(%rdi), %rax andq $-16, %rax This also produces pretty high quality code for situations where the alignment of the input pointer isn’t known: pub unsafe fn ptr_u32(slice: *const u32) -> *const u32 { slice.offset(slice.align_offset(16) as isize) } // => movl %edi, %eax andl $3, %eax leaq 15(%rdi), %rcx andq $-16, %rcx subq %rdi, %rcx shrq $2, %rcx negq %rax sbbq %rax, %rax orq %rcx, %rax leaq (%rdi,%rax,4), %rax Here LLVM is smart enough to replace the `usize::MAX` special case with a branch-less bitwise-OR approach, where the mask is constructed using the neg and sbb instructions. This appears to work across various architectures I’ve tried. This change ends up introducing more branches and code in situations where there is less knowledge of the arguments. For example when the requested alignment is entirely unknown. This use-case was never really a focus of this function, so I’m not particularly worried, especially since llvm-mca is saying that the new code is still appreciably faster, despite all the new branching. Fixes rust-lang#98809. Sadly, this does not help with rust-lang#72356.
…ark-Simulacrum Add a special case for align_offset /w stride != 1 This generalizes the previous `stride == 1` special case to apply to any situation where the requested alignment is divisible by the stride. This in turn allows the test case from rust-lang#98809 produce ideal assembly, along the lines of: leaq 15(%rdi), %rax andq $-16, %rax This also produces pretty high quality code for situations where the alignment of the input pointer isn’t known: pub unsafe fn ptr_u32(slice: *const u32) -> *const u32 { slice.offset(slice.align_offset(16) as isize) } // => movl %edi, %eax andl $3, %eax leaq 15(%rdi), %rcx andq $-16, %rcx subq %rdi, %rcx shrq $2, %rcx negq %rax sbbq %rax, %rax orq %rcx, %rax leaq (%rdi,%rax,4), %rax Here LLVM is smart enough to replace the `usize::MAX` special case with a branch-less bitwise-OR approach, where the mask is constructed using the neg and sbb instructions. This appears to work across various architectures I’ve tried. This change ends up introducing more branches and code in situations where there is less knowledge of the arguments. For example when the requested alignment is entirely unknown. This use-case was never really a focus of this function, so I’m not particularly worried, especially since llvm-mca is saying that the new code is still appreciably faster, despite all the new branching. Fixes rust-lang#98809. Sadly, this does not help with rust-lang#72356.
I tried this code:
Godbolt link
I expected to see this happen: I expected the compiler to conclude that
prefix.len() < 16
and, therefore, not emit autovectorization for the first scalar loop.Instead, this happened: The compiler emitted an autovectorization for the first scalar loop even though the prefix is never long enough for the autovectorization to be useful.
Meta
rustc --version --verbose
:The text was updated successfully, but these errors were encountered: