Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Poor optimization of iter().skip() #101814

Closed
Tearth opened this issue Sep 14, 2022 · 6 comments · Fixed by #109895
Closed

Poor optimization of iter().skip() #101814

Tearth opened this issue Sep 14, 2022 · 6 comments · Fixed by #109895
Assignees
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. E-needs-test Call for participation: An issue has been fixed and does not reproduce, but no test has been added. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@Tearth
Copy link

Tearth commented Sep 14, 2022

Using iter().skip() functions leads to poor optimization compared to the manually done loop with range.
https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=b7ed8bf9e4fc3341a92f301fa5185cc5

pub fn test_1(a: [i32; 10]) -> i32 {
    let mut sum = 0;
    for v in a.iter().skip(8) {
        sum += v;
    }
    
    sum
}

pub fn test_2(a: [i32; 10]) -> i32 {
    let mut sum = 0;
    for index in 8..10 {
        sum += a[index];
    }
    
    sum
}

This produces the following asm output:

playground::test_1:
	movq	%rdi, %r8
	addq	$40, %r8
	xorl	%esi, %esi
	movl	$8, %edx
	xorl	%eax, %eax
	testb	$1, %sil
	jne	.LBB0_2

.LBB0_5:
	leaq	-1(%rdx), %rsi
	movq	%r8, %rcx
	subq	%rdi, %rcx
	shrq	$2, %rcx
	cmpq	%rsi, %rcx
	jbe	.LBB0_4
	leaq	(%rdi,%rdx,4), %rdi

.LBB0_2:
	cmpq	%r8, %rdi
	je	.LBB0_4
	testq	%rdi, %rdi
	je	.LBB0_4
	addl	(%rdi), %eax
	addq	$4, %rdi
	movb	$1, %sil
	xorl	%edx, %edx
	testb	$1, %sil
	je	.LBB0_5
	jmp	.LBB0_2

.LBB0_4:
	retq

playground::test_2:
	movl	36(%rdi), %eax
	addl	32(%rdi), %eax
	retq

Considering the zero-cost abstraction rule and the fact that the compiler knows the size of the array, it should optimize test_1 to at least the same form as test_2 where it correctly detected that we only need two values summed. Instead, there's quite a chunk of asm with lots of branches.

The issue is present both in the stable version (1.63.0) and nightly/beta channels.

@Tearth Tearth added the C-bug Category: This is a bug. label Sep 14, 2022
@nikic nikic added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. labels Sep 14, 2022
@MatiF100
Copy link

MatiF100 commented Sep 14, 2022

Rewriting the function the following way produces the same assembly as the better optimized variant. Seems like the issue happens when using both iterators and for loop at the same time.
https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=bcb4551c22ef765a682c1d6c41eb285f

pub fn test_3(a: [i32; 10]) -> i32 {
    a.iter().skip(8).fold(0, |sum, v| sum + v)
}

@nikic
Copy link
Contributor

nikic commented Sep 14, 2022

This general class of problem is well known -- optimization of exterior iteration in Rust is very challenging. Using interior iteration (as in the previous comment) will generally optimize much better.

That said, in this case optimization is likely feasible. Looking at the IR (https://rust.godbolt.org/z/cevdKWcTn) there is a clear opportunity for peeling based on phi invariance here, which should allow follow-on optimization. Would have to investigate closer to find out why it does not trigger.

@Tearth
Copy link
Author

Tearth commented Sep 14, 2022

Thanks, I wasn't aware that the compiler can have this kind of trouble with exterior iterations, but it's understandable - I will leave this issue open if you're saying that this case has the potential to improve.

jrmuizel added a commit to jrmuizel/miniz_oxide that referenced this issue Sep 15, 2022
With #[inline(always)] the body of default() will be inlined into
external crates but the body will still contain calls to the
LZOxide::new(), ParamsOxide::new(DEFAULT_FLAGS), Box::default() and
DictOxide::new(DEFAULT_FLAGS). This ends up causing a copy of the large
LZOxide to end up on the stack when used with Box::default as seen in:
rust-lang/rust#101814
oyvindln pushed a commit to Frommi/miniz_oxide that referenced this issue Sep 15, 2022
With #[inline(always)] the body of default() will be inlined into
external crates but the body will still contain calls to the
LZOxide::new(), ParamsOxide::new(DEFAULT_FLAGS), Box::default() and
DictOxide::new(DEFAULT_FLAGS). This ends up causing a copy of the large
LZOxide to end up on the stack when used with Box::default as seen in:
rust-lang/rust#101814
@nikic
Copy link
Contributor

nikic commented Sep 27, 2022

I took a closer look, and the reason why this doesn't peel are multiple checks in canPeel(): https://github.com/llvm/llvm-project/blob/2769ceb0e7a4b4f11c2bf5bd21fd69c154c17ff8/llvm/lib/Transforms/Utils/LoopPeel.cpp#L88 We have a non-exiting latch here, and because of that the non-latch exits are also not terminated by unreachable. It should be possible to relax these requirements, but would need some effort to support branch weight updates.

@nikic nikic self-assigned this Sep 27, 2022
@nikic
Copy link
Contributor

nikic commented Sep 28, 2022

Upstream patch: https://reviews.llvm.org/D134803

@nikic nikic added the E-needs-test Call for participation: An issue has been fixed and does not reproduce, but no test has been added. label Apr 3, 2023
@nikic
Copy link
Contributor

nikic commented Apr 3, 2023

Fixed by the LLVM 16 upgrade.

@Noratrieb Noratrieb added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Apr 5, 2023
JohnTitor added a commit to JohnTitor/rust that referenced this issue Apr 11, 2023
bors added a commit to rust-lang-ci/rust that referenced this issue Apr 12, 2023
@bors bors closed this as completed in 73f40d4 Apr 12, 2023
Nekuskus added a commit to Nekuskus/Advent-of-Code-2023 that referenced this issue Mar 21, 2024
An unfortunate find is that .skip(1) is actually slower than .collect::<Vec<_>>[1..].to_vec(), poor performance of .skip() has already been noted here rust-lang/rust/issues/101814.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. E-needs-test Call for participation: An issue has been fixed and does not reproduce, but no test has been added. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants