rustc: Default 32 codegen units at O0 #44853

Merged: 1 commit merged on Sep 29, 2017


alexcrichton
Member

This commit changes rustc's default to 32 codegen units when compiling in debug
mode, typically an `opt-level=0` compilation. Since their inception, codegen
units have matured quite a bit, gaining features such as:

  • Parallel translation and codegen, allowing codegen units to be processed even
    more quickly.
  • Deterministic and reliable partitioning through the same infrastructure as
    incremental compilation.
  • Global rate limiting through the `jobserver` crate to avoid overloading the
    system.
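The global rate limiting in the last bullet can be illustrated with a counting token pool. This is a minimal in-process analogy only, assuming a fixed number of tokens: `TokenPool` and `run_units` are hypothetical names, not the `jobserver` crate's API, and the real jobserver shares tokens across processes (Cargo and the rustc instances it spawns) using the GNU make jobserver protocol. The sketch spawns one thread per codegen unit but lets at most `limit` of them work at once.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Counting token pool: at most `limit` holders at a time.
/// Hypothetical in-process analogy for a jobserver, NOT the `jobserver` crate's API.
struct TokenPool {
    available: Mutex<usize>,
    cond: Condvar,
}

impl TokenPool {
    fn new(limit: usize) -> Self {
        TokenPool { available: Mutex::new(limit), cond: Condvar::new() }
    }

    /// Block until a token is free, then take it.
    fn acquire(&self) {
        let mut free = self.available.lock().unwrap();
        while *free == 0 {
            free = self.cond.wait(free).unwrap();
        }
        *free -= 1;
    }

    /// Give a token back and wake one waiting thread.
    fn release(&self) {
        *self.available.lock().unwrap() += 1;
        self.cond.notify_one();
    }
}

/// Spawn one thread per simulated codegen unit, but let at most `limit` of them
/// work at the same time. Returns (units completed, peak observed concurrency).
fn run_units(units: usize, limit: usize) -> (usize, usize) {
    let pool = Arc::new(TokenPool::new(limit));
    let active = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    let done = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..units)
        .map(|_| {
            let (pool, active, peak, done) =
                (Arc::clone(&pool), Arc::clone(&active), Arc::clone(&peak), Arc::clone(&done));
            thread::spawn(move || {
                pool.acquire();
                let now = active.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                // Stand-in for optimizing and emitting one codegen unit.
                thread::sleep(Duration::from_millis(5));
                active.fetch_sub(1, Ordering::SeqCst);
                done.fetch_add(1, Ordering::SeqCst);
                pool.release();
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    (done.load(Ordering::SeqCst), peak.load(Ordering::SeqCst))
}

fn main() {
    // 32 codegen units competing for 4 tokens (e.g. a 4-thread laptop).
    let (completed, peak) = run_units(32, 4);
    println!("completed {completed} units, peak concurrency {peak}");
}
```

Raising the unit count past the token count does not oversubscribe the machine in this model; the extra units simply wait for a token.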

The largest benefit of codegen units has always been faster compilation through
parallel processing of modules on the LLVM side of things, using all of the cores
on build machines, which typically have many. Some downsides have been fixed
through the features above, but the major remaining downside is that using
codegen units reduces opportunities for inlining and optimization. This,
however, doesn't matter much during debug builds!

In this commit the default number of codegen units for debug builds has been
raised from 1 to 32. This should let most `cargo build` compiles that are
bottlenecked on translation and/or code generation immediately see speedups
through parallelization on available cores.
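A simplified model of the default described above, as a sketch only: `default_codegen_units` is a hypothetical helper, not rustc's internal code, and it assumes an explicit setting (for example `-C codegen-units=N` passed to rustc, or `codegen-units` in a Cargo profile) always takes precedence over the built-in default.

```rust
/// Hypothetical helper modelling the default described in this PR
/// (not rustc's internal code): an explicit setting always wins; otherwise
/// opt-level=0 builds get 32 codegen units and optimized builds keep 1.
fn default_codegen_units(opt_level: u8, explicit: Option<usize>) -> usize {
    match explicit {
        // e.g. `-C codegen-units=N`, or `codegen-units = N` in a Cargo profile
        Some(n) => n,
        // Debug builds: many small units so LLVM work can run in parallel.
        None if opt_level == 0 => 32,
        // Optimized builds: a single unit avoids the cross-unit inlining losses.
        None => 1,
    }
}

fn main() {
    let cases: [(u8, Option<usize>); 4] = [(0, None), (0, Some(1)), (2, None), (3, Some(16))];
    for (opt_level, explicit) in cases {
        println!(
            "opt-level={opt_level} explicit={explicit:?} -> {} codegen units",
            default_codegen_units(opt_level, explicit)
        );
    }
}
```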

Work is being done to *always* enable multiple codegen units (and therefore
parallel codegen), but it requires at least #44841 to land and be stabilized,
so stay tuned if you're interested in that aspect!

@alexcrichton
Member Author

r? @michaelwoerister

@rust-highfive
Collaborator

r? @nikomatsakis

(rust_highfive has picked a reviewer for you, use r? to override)

@alexcrichton force-pushed the debug-codegen-units branch 3 times, most recently from a6c4f73 to 8e09bfb on September 25, 2017 23:06
@ishitatsuyuki
Contributor

I feel this is a Cargo thing.

@daboross
Contributor

@ishitatsuyuki Cargo can definitely control this, but we want a sane default when running raw rustc too, right?

This makes it so that if nothing else is configured (via rustc parameters or Cargo configuration), rustc defaults to a highly parallel build of a single crate.

Beginner projects that don't use Cargo shouldn't be stuck with slow builds just because they aren't using Cargo. This gives a sane default for all uses of rustc.

@michaelwoerister
Member

This is where the jobserver (for CPU resource management) and async-llvm (for peak memory consumption) really pay off :)

r=me with the tests fixed.

@michaelwoerister
Member

On a separate note: I don't like how we often duplicate things between `sess.opts` and `sess.opts.cg`/`sess.opts.debugging_opts`, where only one of them holds the correct value but both are accessible. But that's not something to solve in this PR.

@michaelwoerister
Member

What's the situation with perf.rlo? Are we still limiting benchmarking to a single core there?
cc @Mark-Simulacrum

@Mark-Simulacrum
Member

To my knowledge, all benchmarks on perf.rlo currently use 8 threads of parallelism. @alexcrichton may be able to correct me if I'm misremembering.

@alexcrichton
Member Author

@bors: r=michaelwoerister

@bors
Contributor

bors commented Sep 26, 2017

📌 Commit 9e35b79 has been approved by michaelwoerister

@alexcrichton
Member Author

@michaelwoerister Agreed that the duplication is unfortunate! I'd hope that one day we could use accessor functions for these rather than accessing fields directly, but I agree this is probably best left for a future PR.

@mersinvald
Contributor

mersinvald commented Sep 26, 2017

Forwarding a comment that I accidentally left in #44841:

I've just run some build-time benchmarks on my project, which uses a lot of popular Rust libraries and codegen (diesel, hyper, serde, tokio, futures, reqwest), on my Intel Core i5 laptop (Skylake, 2 cores/4 threads) and got these results:

```
1 unit:    92 secs
2 units:   81 secs
4 units:   83 secs
8 units:   85 secs
16 units:  90 secs
32 units:  102 secs
```

Cargo profile:

```toml
# The development profile, used for `cargo build`.
[profile.dev]
opt-level = 0
debug = true
lto = false
debug-assertions = true
codegen-units = N
```

rustc 1.22.0-nightly (17f56c5 2017-09-21)

As expected, the best results come from a number of codegen units matching the number of CPUs, and 32 is way too many for an average machine.

Did you consider selecting the number of codegen units based on the number of CPUs, for example with the `num_cpus` crate?

Thank you for working on compile times!
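The CPU-based alternative suggested here could look roughly like the following. This is a hypothetical heuristic, not what this PR implements: it uses the standard library's `std::thread::available_parallelism` (stabilized in Rust 1.59, well after this discussion, and typically reporting logical CPUs) in place of the `num_cpus` crate so the example has no dependencies, and `cpu_based_codegen_units` plus its upper bound of 32 are illustrative assumptions.

```rust
use std::num::NonZeroUsize;
use std::thread;

/// Hypothetical heuristic (NOT what this PR implements): use one codegen
/// unit per available CPU, clamped to the range 1..=max.
fn cpu_based_codegen_units(cpus: usize, max: usize) -> usize {
    cpus.clamp(1, max)
}

fn main() {
    // `available_parallelism` can return an error on some platforms; fall back to 1.
    let cpus = thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1);
    println!("{cpus} CPUs -> {} codegen units", cpu_based_codegen_units(cpus, 32));
}
```

One trade-off of this design: the unit count, and therefore how a crate gets partitioned, would vary from machine to machine.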

@alexcrichton
Member Author

@mersinvald fascinating!

First up though, can you clarify what you were measuring? Was it a cargo build of the entire workspace? Just one crate? From a fresh target directory?

Locally I ran `cargo build --all` for the entire workspace with the latest nightly (`rustc 1.22.0-nightly (6c476ce46 2017-09-25)`) and got the following timings:

```
cgus= 1 Duration { secs: 97, nanos: 667334447 }
cgus= 2 Duration { secs: 91, nanos: 974436776 }
cgus= 4 Duration { secs: 88, nanos: 860553853 }
cgus= 8 Duration { secs: 86, nanos: 138881102 }
cgus=16 Duration { secs: 84, nanos: 37066957 }
cgus=32 Duration { secs: 84, nanos: 907956016 }
```

Oddly, though, the build times sometimes varied a lot:

```
cgus= 1 Duration { secs: 99, nanos: 325352640 }
cgus= 2 Duration { secs: 90, nanos: 759988443 }
cgus= 4 Duration { secs: 88, nanos: 654634816 }
cgus= 8 Duration { secs: 86, nanos: 381700701 }
cgus=16 Duration { secs: 87, nanos: 243640455 }
cgus=32 Duration { secs: 47, nanos: 899950745 }
```

I've got an 8-core machine locally, but in theory the number of cores relative to the number of codegen units should have little effect on compile time. The number of codegen units is deliberately set high here to help ensure that no single codegen unit spends too long in the optimizer, ideally keeping all available cores busy throughout compilation. Additionally, more cgus should mean lower peak memory usage for rustc itself, thanks to asynchronous translation/codegen.

Are you sure you didn't have anything else running in the background when you were collecting those timings? And were the timings you got reproducible?
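The load-balancing argument above can be sketched with a toy scheduling model. It assumes each codegen unit has a known, independent cost and that units are handed greedily to the least-loaded core; `makespan` is a hypothetical helper, and the model ignores per-unit overhead, so on its own it cannot explain measurements where more units were slower. It compares one workload split coarsely (8 units, one of them large) with the same total work split into 22 smaller units, on 8 cores.

```rust
/// Toy model: hand each codegen unit (in order) to the least-loaded core
/// and return the time at which the last core finishes (the "makespan").
/// Assumes unit costs are known and independent; ignores per-unit overhead.
fn makespan(unit_costs: &[u64], cores: usize) -> u64 {
    assert!(cores > 0, "need at least one core");
    let mut load = vec![0u64; cores];
    for &cost in unit_costs {
        // `min_by_key` returns the first core among equally loaded ones.
        let idx = (0..cores).min_by_key(|&i| load[i]).unwrap();
        load[idx] += cost;
    }
    load.into_iter().max().unwrap_or(0)
}

fn main() {
    // Same total work (96), split two ways, scheduled on 8 cores.
    let coarse: [u64; 8] = [40, 8, 8, 8, 8, 8, 8, 8]; // one large unit dominates
    let mut fine: Vec<u64> = vec![5; 8]; // the large unit split into 8 pieces...
    fine.extend(std::iter::repeat(4u64).take(14)); // ...the rest into 14 pieces
    println!("coarse: {}", makespan(&coarse, 8));
    println!("fine:   {}", makespan(&fine, 8));
}
```

With these numbers the coarse split finishes at 40 time units, while the finer split finishes at 13, close to the ideal 96 / 8 = 12.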

@arielb1 added the S-waiting-on-bors label Sep 26, 2017
@mersinvald
Contributor

mersinvald commented Sep 26, 2017

@alexcrichton

> First up though, can you clarify what you were measuring?

I'm building with `cargo build` from the root of the repo, so it's a whole-workspace build. Before every iteration I run `cargo clean`.

Good point about background tasks, though. I disabled everything that could eat up CPU to get steadier results and re-ran each build three times:

```
1:  80.79 79.97 80.47 (avg 80.41)
2:  74.5 73.91 74.75  (avg 74.38)
4:  75.84 75.48 75.83 (avg 75.71)
8:  79.39 78.7 78.27  (avg 78.78)
16: 80.68 80.70 80.75 (avg 80.71)
32: 83.89 83.81 84.26 (avg 83.98)
```

The results seem quite steady and reproducible.

2-4 units are optimal for my setup.
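To summarize repeated runs like these, a small helper can average each row and pick the fastest setting. A minimal sketch: `mean`, `fastest`, and `RUNS` are illustrative names, and `RUNS` simply copies the timings listed above. The computed means may differ in the last digit from the hand-computed averages because of rounding.

```rust
/// Timings from the comment above: (codegen units, three runs in seconds).
const RUNS: [(u32, [f64; 3]); 6] = [
    (1, [80.79, 79.97, 80.47]),
    (2, [74.5, 73.91, 74.75]),
    (4, [75.84, 75.48, 75.83]),
    (8, [79.39, 78.7, 78.27]),
    (16, [80.68, 80.70, 80.75]),
    (32, [83.89, 83.81, 84.26]),
];

/// Arithmetic mean of a set of timings.
fn mean(samples: &[f64]) -> f64 {
    samples.iter().sum::<f64>() / samples.len() as f64
}

/// The codegen-unit count with the lowest mean time, and that mean.
fn fastest(runs: &[(u32, [f64; 3])]) -> Option<(u32, f64)> {
    runs.iter()
        .map(|(cgus, times)| (*cgus, mean(times)))
        .min_by(|a, b| a.1.total_cmp(&b.1))
}

fn main() {
    for (cgus, times) in &RUNS {
        println!("{cgus:>2} units: avg {:.2}s", mean(times));
    }
    if let Some((cgus, avg)) = fastest(&RUNS) {
        println!("fastest: {cgus} units ({avg:.2}s)");
    }
}
```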

By the way, I updated rustc to the latest nightly before running the new tests, so the version is now:

@alexcrichton
Member Author

@mersinvald Hm, so if you're on Linux, would you mind poking around with `perf`? Could you try a `perf record` with 2 cgus and another with 32 cgus? I think `perf diff` will work for comparing the two.

Otherwise, though, this is indeed curious! It may also be worth drilling into specific crates, going one at a time and running rustc by hand. If one crate takes much longer with 32 codegen units than with 2, that's something to investigate. Whole-workspace builds tend to be hard to drill into :(

@mersinvald
Contributor

@alexcrichton OK, I'll run perf tomorrow.

@mersinvald
Contributor

mersinvald commented Sep 29, 2017

@alexcrichton I've collected perf data for a clean `cargo build`:

https://drive.google.com/drive/folders/0B28cL71oGfpOTVRKMUZjTlFQU2M?usp=sharing

The diff was made with `perf diff perf.data.2 perf.data.32 > diff`.

Hope it will help.

I don't think I can interpret this data myself, but if you need me to run perf on some specific crates, feel free to ask; I'm happy to help.

@bors
Contributor

bors commented Sep 29, 2017

⌛ Testing commit 9e35b79 with merge d514263...

bors added a commit that referenced this pull request Sep 29, 2017
@bors
Contributor

bors commented Sep 29, 2017

☀️ Test successful - status-appveyor, status-travis
Approved by: michaelwoerister
Pushing d514263 to master...

@bors merged commit 9e35b79 into rust-lang:master Sep 29, 2017
@alexcrichton deleted the debug-codegen-units branch September 30, 2017 08:00
japaric added a commit to rust-embedded/cortex-m-quickstart that referenced this pull request Oct 2, 2017
rust-lang/rust#44853 changed the default number of codegen units from 1 to 32 for the dev profile.
Unfortunately this broke our dev builds, so we are reverting the change in Cargo.toml.
@bluss added the relnotes label Oct 9, 2017
Labels: relnotes (marks issues that should be documented in the release notes of the next release), S-waiting-on-bors (status: waiting on bors to run and complete tests; bors will change the label on completion)