Less codegen parallelism than expected with -C codegen-units=16 #64913

Open
SimonSapin opened this issue Sep 30, 2019 · 17 comments
Labels
A-codegen (Area: Code generation), I-compiletime (Issue: Problems and improvements with respect to compile times), T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue)

Comments

@SimonSapin
Contributor

SimonSapin commented Sep 30, 2019

This is the output of cargo build -Z timings for Servo, on a 14-core / 28-thread machine, filtered with "Min unit time" set to 10 seconds.

Edit: with Rust nightly-2019-09-28 (a4fd0099519ccbda2e31718d46f1ff0e546886d6)

For most of the codegen of the script crate, only 3 to 4 threads appear to be CPU-bound, leaving the other cores idle, even though codegen-units=16 is the default. (Results are similar if I specify it explicitly.)

Shouldn’t CPU usage be much closer to min(ncpu, codegen-units)?
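For reference, the graph above comes from Cargo's unstable timings feature; a minimal invocation (flag spelling as of those nightlies) looks like:

```sh
# Unstable: record per-crate compile times and concurrency, and write
# an HTML report (cargo-timing.html) into the working directory.
cargo +nightly build -Z timings
```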

@csmoe csmoe added the A-codegen Area: Code generation label Sep 30, 2019
@jonas-schievink jonas-schievink added I-compiletime Issue: Problems and improvements with respect to compile times. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Sep 30, 2019
@andjo403
Contributor

The problem is the single-threaded MIR-to-LLVM-IR codegen, which cannot produce work fast enough to keep all the threads of the parallel LLVM-IR-to-machine-code codegen busy. As this is a debug build there is not much for LLVM to do, resulting in low CPU usage; it may look better for a release build.

[image: -Z self-profile trace of the debug build]

Thread 0 shows the MIR-to-LLVM-IR translation; the other rows show the parallel LLVM-IR-to-machine-code codegen (logical work units, not real threads).

The image is the result of passing the -Z self-profile option to rustc and then filtering a lot of traces out of the output of crox --compact_thread, because Chrome was not able to show the 1.4 GB JSON file on my computer :)
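Roughly the workflow behind the image (crate and file names illustrative; crox invocation details vary by measureme version):

```sh
# Have rustc record self-profiling events (unstable flag); this writes
# profile data files next to the build, one set per rustc invocation.
RUSTFLAGS="-Z self-profile" cargo +nightly build

# Convert the profile to Chrome trace format, dropping short events so
# the resulting JSON stays small enough for chrome://tracing to load.
crox --collapse-threads --minimum-duration=1000 <profile-file-prefix>
```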

@SimonSapin
Contributor Author

Thanks for looking into it @andjo403! Below is -Z timings for release mode. It’s notable that linking is almost 3× as fast. As for codegen of the script crate, we do see good CPU utilization, but only for some of that time. I’ll try -Z self-profile at some point.

[image: Screenshot from 2019-09-30 23-34-00]

@alexcrichton
Member

This is actually something you may want to investigate with the script crate more deeply, but it's unfortunately difficult to do so. This may also be indicative of a compiler bug!

Translation to LLVM IR is currently single-threaded, so in debug mode we're firing off LLVM modules to get codegen'd in the background, but they codegen so quickly that CPU usage never gets off the ground (most of the time is spent translating to LLVM IR). The release mode case is similar, but has one more synchronization point. Instead of going directly to codegen, the phases look like:

  • One thread translates to LLVM IR
  • As each LLVM module is produced, it is optimized in the background, in parallel
  • When all modules are optimized, a brief moment of serial work happens to perform ThinLTO analysis
  • Then, in parallel and all at once, all modules are optimized again with ThinLTO

From the start of script's purple bar in the release mode case you see humps of large CPU usage; that's where the "translate to LLVM IR" thread produced a bunch of modules that are all being optimized on worker threads. You then see a long consistent spike of optimization, and most likely this is the parallel ThinLTO optimization (the brief moment of single-core usage before it being the ThinLTO analysis pass). The script crate still has a long plateau of single-core usage at the end, though. This likely means that there's one CGU in particular that takes a huge amount of time to optimize, leaving it running long after all the other ones have finished.

This looks very similar to a case I diagnosed a long time ago and explained awhile back. We don't have -Z trans-time-graph any more, but you may be able to manually inspect object files to look for huge functions or something like that. (-Z self-profile can surely help here too.)

@andjo403
Contributor

andjo403 commented Oct 1, 2019

Did a run with -Z self-profile for the release build, and @alexcrichton is correct: there is one CGU that takes extra long. I do not know what that CGU contains, as I do not know how to build only the script crate in a way that saves the LLVM IR files.
[image: -Z self-profile trace of the release build]
Edit: the log is filtered so that all events shorter than 1000 µs are removed: crox --collapse-threads --minimum-duration=1000

@alexcrichton
Member

Man those graphs are so cool! It's so visually appalling what's happening there. You can easily see the waterfall of codegen units being spawned for optimization, you can see the brief serial period, and then everything getting optimized at once. Neato!

But yeah, from that we can clearly see that the ThinLTO passes are taking forever on one particular codegen unit, and that its normal optimization passes also take ~2× longer than all the other CGUs'. All that needs to be done is to find the name of that CGU in the rlib and dig it out. The best way to do that is probably to:

  • Use ar x to extract the contents of the rlib
  • Look for an object file *.o of a matching name, as well as a *.z file.
  • The *.z file follows a strange bytecode encoding which you'd need custom code to extract and look at. We should probably change that to *.gz at some point.
  • You should be able to poke around the *.o file and perhaps just look for huge functions. If something looks abnormally massive, that may mean it's taking awhile.

In general though I don't really know great ways to dig into this. I'm not sure how to take a slow CGU in LLVM and then diagnose why it's slow. Step one is getting the LLVM IR out, but that's difficult enough to do with rustc.

One thing that may also work is to compile with RUSTFLAGS=-Csave-temps, which should spew artifacts all over the place, but there should be at least one intermediate bytecode file with the matching filename somewhere.
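A sketch of that inspection workflow (rlib, object, and CGU names are illustrative):

```sh
# Extract the rlib's members: one object file per CGU plus the
# compressed bytecode files mentioned above.
ar x libscript-<hash>.rlib

# Scan the suspect CGU's object file for abnormally large functions;
# --size-sort puts the biggest symbols last.
nm --print-size --size-sort --demangle script-<hash>.<cgu>.o | tail

# Or keep all intermediates and disassemble the matching LLVM bytecode.
RUSTFLAGS=-Csave-temps cargo build --release
llvm-dis <matching-cgu>.bc   # produces human-readable .ll
```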

@andjo403
Contributor

andjo403 commented Oct 4, 2019

Made a little POC adding the clang time-trace feature to the self-profiling, and with it I can see that 34% of the LLVM time for the long CGU is spent compiling <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once. I do not know how the LLVM passes are executed, but there are 16565 occurrences of attempts to optimize this function.

@jdm
Contributor

jdm commented Oct 4, 2019

That likely comes from the huge number of generated functions from this code generator calling this panic-catching helper.
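A minimal sketch of why this multiplies work for LLVM; the helper below is hypothetical, not Servo's actual code, but it has the shape described above: every distinct closure type F instantiates <AssertUnwindSafe<F> as FnOnce<()>>::call_once again.

```rust
use std::panic::{self, AssertUnwindSafe};

// Hypothetical panic-catching helper. Each generated binding passes a
// distinct closure type `F`, so each call site monomorphizes a fresh
// `<AssertUnwindSafe<F> as FnOnce<()>>::call_once` for LLVM to chew on.
fn guard<F: FnOnce() -> R, R>(f: F) -> Option<R> {
    panic::catch_unwind(AssertUnwindSafe(f)).ok()
}

fn main() {
    // Two closure types -> two separate instantiations; a code
    // generator emitting thousands of bindings produces thousands.
    let a = guard(|| 1 + 1);
    let b = guard(|| "hello".len());
    println!("{:?} {:?}", a, b);
}
```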

@SimonSapin
Contributor Author

If a single generic function has many instantiations, do they all end up in the same CGU?

@andjo403
Contributor

andjo403 commented Oct 5, 2019

I think all generic functions that are instantiated in the same module end up in the same CGU, if I understand the description here: https://github.com/rust-lang/rust/blob/master/src/librustc_mir/monomorphize/partitioning.rs#L44

@michaelwoerister do you have some idea of how to split the CGU more?

rustc itself also has some generic functions that make one CGU larger than the rest.
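One way to check where instantiations land, assuming the unstable -Zprint-mono-items flag (used by the compiler's own codegen-units tests) still behaves this way:

```sh
# Print every monomorphized item together with the CGU(s) it was
# assigned to, then pick out the instantiations of one generic fn.
rustc +nightly --emit=metadata -Zprint-mono-items=lazy lib.rs \
    | grep call_once
```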

@andjo403
Contributor

andjo403 commented Oct 6, 2019

Have been looking in more detail at what happens in the partitioning for script, and here are the stats:

| CGU | Not merged | Post-merge estimated size | Post-inlining estimated size | Optimize time (s) | LTO time (s) | Total time (s) |
|----:|:----------:|--------------------------:|-----------------------------:|------------------:|-------------:|---------------:|
| 0   | yes        | 848956                    | 1467083                      | 41.66             | 69.97        | 111.63         |
| 1   | yes        | 524765                    | 585184                       | 25.65             | 18.41        | 44.06          |
| 2   | yes        | 400296                    | 819963                       | 85.67             | 70.48        | 156.15         |
| 3   |            | 321669                    | 564648                       | 37.86             | 71.88        | 109.74         |
| 4   |            | 298598                    | 413214                       | 22.13             | 42.82        | 64.95          |
| 5   |            | 273580                    | 385766                       | 20.98             | 39.33        | 60.31          |
| 6   |            | 257079                    | 567954                       | 27.82             | 48.04        | 75.86          |
| 7   |            | 244523                    | 390903                       | 23.22             | 47.90        | 71.12          |
| 8   |            | 234306                    | 484797                       | 29.75             | 50.10        | 79.85          |
| 9   |            | 226011                    | 614406                       | 47.01             | 52.94        | 99.95          |
| 10  | yes        | 220040                    | 220122                       | 12.98             | 30.26        | 43.24          |
| 11  |            | 219207                    | 380161                       | 20.67             | 30.72        | 51.39          |
| 12  |            | 208098                    | 355017                       | 20.58             | 38.35        | 58.93          |
| 13  |            | 194715                    | 323289                       | 20.27             | 39.03        | 59.30          |
| 14  |            | 188629                    | 439558                       | 28.28             | 40.81        | 69.09          |
| 15  |            | 356672                    | 1221227                      | 89.49             | 118.34       | 207.83         |

So, some conclusions from the data:

  • the merging needs to look at the inlining data to get a better estimate of the size.
    In this case CGU 15, which after inlining is the second largest according to the estimates (and takes the longest time to compile), was nevertheless merged
  • four CGUs are so large from the beginning that they are not merged with any other CGU, and after merging three are still larger than the rest, so we need to be able to split large CGUs to get more even sizes
  • the time estimate is sometimes way off, e.g. CGU 0 is estimated to be the largest but takes about half the time of CGU 15

@SimonSapin
Contributor Author

What is CGU merging?

@andjo403
Contributor

andjo403 commented Oct 6, 2019

It is part of the partitioning algorithm, used when -Ccodegen-units=<val> has been passed or incremental compilation is off; the default is 16 CGUs.
For incremental compilation there are sometimes thousands of CGUs, and as the partitioning algorithm is the same, the CGUs need to be merged down to match the requested number of CGUs (sketched below).
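A minimal sketch of that merge phase, assuming each CGU is represented only by its estimated size (the real rustc code tracks much more, but merges smallest-first in the same spirit):

```rust
// Repeatedly merge the two smallest CGUs until only the requested
// number (16 by default) remain. Note that if `sizes` holds
// pre-inlining estimates, a CGU that balloons after inlining (like
// CGU 15 in the table above) can still get merged, which is the first
// point in the stats comment.
fn merge_cgus(mut sizes: Vec<u64>, target: usize) -> Vec<u64> {
    while sizes.len() > target {
        sizes.sort_unstable_by(|a, b| b.cmp(a)); // descending
        let a = sizes.pop().unwrap(); // smallest
        let b = sizes.pop().unwrap(); // second smallest
        sizes.push(a + b); // merged CGU's estimated size is the sum
    }
    sizes
}
```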

@SimonSapin
Contributor Author

So, does "unmerged" mean that one unit for the purpose of codegen only contains one unit for the purpose of incremental?

@andjo403
Contributor

andjo403 commented Oct 7, 2019

Yes, "unmerged" meant a CGU that was not merged with any other during the merge phase. Will rename it to "not merged".

Centril added a commit to Centril/rust that referenced this issue Oct 8, 2019
add more info in debug traces for gcu merging

to help in investigation of CGU partitioning problems e.g rust-lang#64913
Centril added a commit to Centril/rust that referenced this issue Oct 9, 2019
add more info in debug traces for gcu merging

to help in investigation of CGU partitioning problems e.g rust-lang#64913
@andjo403
Contributor

Have made #65281 to handle the first of my points from above.
But I also started to think about this case: when LTO is used, should the inline functions even be added to all CGUs?

@andjo403
Contributor

Hmm, seems like someone tested the LTO idea already: #65052

@andjo403
Contributor

I wonder if it is possible to look not only at the size of the CGUs but also at which inline items they have in common, to get more items with internal linkage and less duplication between CGUs.

bors added a commit that referenced this issue Oct 17, 2019
for a more even partitioning inline before merge

consider the size of the inlined items for a more even partitioning

for me this change takes the compile time for script-servo-opt from 306s to 249s

edit: the times are for a 32-thread CPU.

cc #64913