Poor optimization of iter().skip() #101814
Comments
Rewriting the function the following way produces the same assembly as the better optimized variant. It seems the issue occurs when iterators and a for loop are used together.
pub fn test_3(a: [i32; 10]) -> i32 {
    a.iter().skip(8).fold(0, |sum, v| sum + v)
}
This general class of problem is well known -- optimization of exterior iteration in Rust is very challenging. Using interior iteration (as in the previous comment) will generally optimize much better. That said, in this case optimization is likely feasible. Looking at the IR (https://rust.godbolt.org/z/cevdKWcTn) there is a clear opportunity for peeling based on phi invariance here, which should allow follow-on optimization. Would have to investigate closer to find out why it does not trigger.
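As a small illustration of "interior iteration" in the sense used above, here are two more forms in which the iterator drives the traversal itself rather than a `for` loop calling `next()` step by step. The function names are hypothetical, and whether a given form optimizes as well as `fold` on a particular compiler version is an assumption to verify in the generated assembly.

```rust
/// Interior iteration via `for_each`: the closure is handed to the iterator,
/// which runs the whole traversal internally.
pub fn sum_for_each(a: [i32; 10]) -> i32 {
    let mut sum = 0;
    a.iter().skip(8).for_each(|v| sum += v);
    sum
}

/// Interior iteration via `sum`, another consuming adapter that owns the loop.
pub fn sum_adapter(a: [i32; 10]) -> i32 {
    a.iter().skip(8).sum()
}
```

Both functions add up the last two elements of the array, like the `fold` variant from the previous comment.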
Thanks, I wasn't aware that the compiler can have this kind of trouble with exterior iteration, but it's understandable - I will leave this issue open if you're saying that this case has the potential to improve.
With #[inline(always)] the body of default() will be inlined into external crates but the body will still contain calls to the LZOxide::new(), ParamsOxide::new(DEFAULT_FLAGS), Box::default() and DictOxide::new(DEFAULT_FLAGS). This ends up causing a copy of the large LZOxide to end up on the stack when used with Box::default as seen in: rust-lang/rust#101814
I took a closer look, and the reason this doesn't peel is a set of checks in canPeel(): https://github.com/llvm/llvm-project/blob/2769ceb0e7a4b4f11c2bf5bd21fd69c154c17ff8/llvm/lib/Transforms/Utils/LoopPeel.cpp#L88 We have a non-exiting latch here, and because of that the non-latch exits are also not terminated by unreachable. It should be possible to relax these requirements, but it would need some effort to support branch weight updates.
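To make the peeling idea concrete, here is a hand-written sketch in Rust. `SkipModel` is a hypothetical, simplified model of a skip adapter (an assumption for illustration, not the standard library's source), and `sum_peeled` shows by hand the shape a peeled loop would have. It does not reproduce what LLVM's LoopPeel pass actually emits.

```rust
/// Hypothetical, simplified model of a skip adapter over a slice iterator.
pub struct SkipModel<'a> {
    iter: std::slice::Iter<'a, i32>,
    n: usize,
}

impl<'a> Iterator for SkipModel<'a> {
    type Item = &'a i32;

    fn next(&mut self) -> Option<&'a i32> {
        if self.n > 0 {
            // Only the first call can take this branch: it jumps over `n`
            // elements and resets the counter to zero.
            let n = std::mem::take(&mut self.n);
            self.iter.nth(n)
        } else {
            self.iter.next()
        }
    }
}

/// Unpeeled: every iteration re-checks `n`, although it is zero after the first.
pub fn sum_unpeeled(a: &[i32], n: usize) -> i32 {
    let mut sum = 0;
    for v in (SkipModel { iter: a.iter(), n }) {
        sum += v;
    }
    sum
}

/// Peeled by hand: the first iteration is pulled out of the loop, so the
/// remaining loop is a plain traversal with no `n` check.
pub fn sum_peeled(a: &[i32], n: usize) -> i32 {
    let mut iter = a.iter();
    let mut sum = 0;
    let first = if n > 0 { iter.nth(n) } else { iter.next() };
    if let Some(v) = first {
        sum += v;
        for v in iter {
            sum += v;
        }
    }
    sum
}
```

After the first iteration the skip counter no longer changes, which is the invariance that makes peeling attractive: once the first iteration is separated, the rest of the loop is simple enough for follow-on optimizations.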
Upstream patch: https://reviews.llvm.org/D134803
Fixed by the LLVM 16 upgrade.
Add codegen tests for issues fixed by LLVM 16 Fixes rust-lang#75978. Fixes rust-lang#99960. Fixes rust-lang#101048. Fixes rust-lang#101082. Fixes rust-lang#101814. Fixes rust-lang#103132. Fixes rust-lang#103327.
An unfortunate find is that .skip(1) is actually slower than .collect::<Vec<_>>()[1..].to_vec(); the poor performance of .skip() has already been noted in rust-lang/rust/issues/101814.
Using iter().skip() leads to poorer optimization than a manually written loop over a range.
https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=b7ed8bf9e4fc3341a92f301fa5185cc5
This produces the following asm output:
Considering the zero-cost abstraction rule and the fact that the compiler knows the size of the array, it should optimize test_1 to at least the same form as test_2 where it correctly detected that we only need two values summed. Instead, there's quite a chunk of asm with lots of branches.
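The playground source and the assembly listing are not reproduced in this thread. As a rough reconstruction from the description above (an assumption, not a copy of the linked gist), the two functions being compared could look like this:

```rust
/// Assumed shape of `test_1`: a `for` loop over `iter().skip(8)`.
pub fn test_1(a: [i32; 10]) -> i32 {
    let mut sum = 0;
    for v in a.iter().skip(8) {
        sum += v;
    }
    sum
}

/// Assumed shape of `test_2`: a manual loop over the index range 8..10.
pub fn test_2(a: [i32; 10]) -> i32 {
    let mut sum = 0;
    for i in 8..10 {
        sum += a[i];
    }
    sum
}
```

Both functions return `a[8] + a[9]`. The report is that only the second form compiled down to that two-element sum on the affected compiler versions.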
The issue is present both in the stable version (1.63.0) and nightly/beta channels.