2x benchmark loss in rayon-hash from multiple codegen-units #47665
Comments
@cuviper I wonder if we can add any |
@nikomatsakis you mean in libstd? Neither But I hope we don't have to generally recommend |
I'm not entirely sure this isn't expected; 2x is a bit much though. We might need to wait on the heuristics that @michaelwoerister has planned. |
@cuviper I'm not sure that being generic affects their need for an inlining hint. It is true that those functions will be instantiated in the downstream crate, but I think they are still isolated into their own codegen unit. (ThinLTO is of course supposed to help here.) I'm not sure what's the best fix though. |
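For readers following the inlining discussion, here is a minimal sketch of the kind of hint being debated. The `fold_with` helper is hypothetical and is not taken from rayon, rayon-hash, or libstd. It only shows where an `#[inline]` attribute would sit on a small generic function. The attribute is a hint to the compiler, so whether the call actually gets inlined across codegen units still depends on how the code is partitioned and optimized.

```rust
// Hypothetical helper for illustration only; not part of rayon or libstd.
// `#[inline]` is a hint: the compiler may still decline to inline the call.
#[inline]
fn fold_with<T, F>(items: &[T], init: u64, mut f: F) -> u64
where
    F: FnMut(u64, &T) -> u64,
{
    let mut acc = init;
    for item in items {
        // The closure call is the part that benefits most from inlining.
        acc = f(acc, item);
    }
    acc
}

fn main() {
    let values: Vec<u32> = (1..=10).collect();
    // Widen each u32 to u64 before adding, as a summing benchmark would.
    let total = fold_with(&values, 0, |acc, &v| acc + u64::from(v));
    println!("total = {total}");
}
```

Being generic, `fold_with` is instantiated in whichever crate calls it, which is the situation described in the comment above.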
FWIW, the gap has closed some (perhaps from LLVM 6?), with parallel codegen now "only" 40% slower:
The profile of
19.74 │14fa0: cmpq $0x0,(%r12,%rbx,8)
6.54 │14fa5: ↓ jne 14fb2 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x1a2>
21.06 │14fa7: add $0x1,%rbx
│14fab: cmp %rbp,%rbx
4.04 │14fae: ↑ jb 14fa0 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x190>
0.04 │14fb0: ↓ jmp 14fde <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x1ce>
28.36 │14fb2: mov 0x0(%r13,%rbx,4),%eax
4.34 │14fb7: movq $0x1,0x58(%rsp)
5.45 │14fc0: mov %rax,0x60(%rsp)
0.84 │14fc5: mov %r14,%rdi
3.80 │14fc8: → callq 16380 <<u64 as core::iter::traits::Sum>::sum>
2.52 │14fcd: add %rax,%r15
2.07 │14fd0: add $0x1,%rbx
│14fd4: cmp %rbp,%rbx
0.26 │14fd7: ↑ jb 14fa0 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x190>
versus
0.63 │14092: lea (%rdi,%rcx,4),%rbp
0.01 │14096: nopw %cs:0x0(%rax,%rax,1)
25.16 │140a0: cmpq $0x0,(%rsi,%rcx,8)
7.68 │140a5: ↓ jne 140b6 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x176>
28.34 │140a7: add $0x1,%rcx
0.21 │140ab: add $0x4,%rbp
│140af: cmp %rdx,%rcx
2.63 │140b2: ↑ jb 140a0 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x160>
0.04 │140b4: ↓ jmp 140ce <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x18e>
24.54 │140b6: test %rbp,%rbp
0.00 │140b9: ↓ je 140ce <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x18e>
0.32 │140bb: add $0x1,%rcx
5.10 │140bf: mov 0x0(%rbp),%ebp
3.31 │140c2: add %rbp,%rax
│140c5: cmp %rdx,%rcx
1.40 │140c8: ↑ jb 14092 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x152> |
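As a rough reading aid, the loop in both listings can be sketched in Rust. This is an interpretation of the disassembly, not the actual rayon-hash source. It assumes the array indexed with an 8-byte stride holds `u64` hash slots where zero means empty, and that the array indexed with a 4-byte stride holds `u32` values. The names `sum_occupied`, `hashes`, and `values` are made up for the sketch.

```rust
// Interpretive sketch of the hot loop; names and data layout are assumptions.
fn sum_occupied(hashes: &[u64], values: &[u32]) -> u64 {
    let mut total = 0u64;
    for (hash, value) in hashes.iter().zip(values) {
        // Corresponds to `cmpq $0x0,(...,8)`: skip slots whose hash is zero.
        if *hash != 0 {
            // The second listing does this add directly in the loop;
            // the first listing reaches it through a call to `Sum::sum`.
            total += u64::from(*value);
        }
    }
    total
}

fn main() {
    let hashes = [0u64, 7, 0, 9];
    let values = [10u32, 20, 30, 40];
    println!("sum = {}", sum_occupied(&hashes, &values));
}
```

Read this way, the visible difference between the two listings is that the first one stores the value to the stack and calls `<u64 as core::iter::traits::Sum>::sum` for each occupied slot, while the second adds the value to the accumulator directly.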
@rust-lang/cargo, maybe |
If |
Yeah, I was thinking that too after I wrote that. |
@michaelwoerister today Cargo uses separate profiles for However, long-term there is a desire to phase out all special profiles except release and dev (at least in the narrow sense of profiles, as "flags for the compiler"). See also rust-lang/rfcs#2282 which suggests adding user-defined profiles. |
The question of whether or not to have an In that case, we might consider some other profiles (e.g., no optimization at all, or mild optimization), but I sort of suspect that the use cases are thinner. In my experience, the only reason to want -O0 was for debuginfo, and let's premise that our default profile keeps debuginfo intact. Something like opt-dev I guess corresponds to "I don't care about debuginfo but I'm not benchmarking" -- that might be better served by customizing the "default" build settings for your project? Still, it seems like there is ultimately maybe a need for three profiles:
Probably though this conversation should be happening somewhere else. =) |
Yeah, this is closely connected to the work on revamping profiles in Cargo. |
I just benchmarked as part of updating to rayon-1.0, and it's back to 2x slowdown, with |
Is this still an issue of concern? |
Well
I was originally sad about |
I'm seeing a huge slowdown in rayon-hash benchmarks, resolved with -Ccodegen-units=1. rayon_set_sum_parallel is the showcase for this crate, and it suffers the most from CGU.
From profiling and disassembly, this seems to mostly be a lost inlining opportunity. In the slower build, the profile is split 35% bridge_unindexed_producer_consumer, 34% Iterator::fold, 28% Sum::sum, and the hot loop in the first looks like:
With CGU=1, 96% of the profile is in bridge_unindexed_producer_consumer, with this hot loop: