Much higher compile times with -Z threads=8 than with -Z threads=1 #117755

Open
Shnatsel opened this issue Nov 9, 2023 · 13 comments
Labels
A-parallel-compiler Area: parallel compiler C-bug Category: This is a bug. I-compiletime Issue: Problems and improvements with respect to compile times. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@Shnatsel
Member

Shnatsel commented Nov 9, 2023

When compiling cargo audit from git on commit b6baecc0ea4e2d115e4e10b10c2196b33d42c1da, I'm seeing the project build in 19 seconds on my machine with -Z threads=1 but it takes 25 seconds with -Z threads=8.

I am using a 6-core desktop CPU, so no chance of NUMA issues. I'm also seeing 25s compile times with -Z threads=6, matching the CPU core count.

I've captured Samply profiles but they are hard to make sense of due to the sheer number of threads (4000 for a single thread, 6000 for multiple threads). They are too big for sharing via firefox.dev, so please find them attached:
profile-1-thread.json.gz
profile-8-threads.json.gz

Meta

rustc --version --verbose:

rustc 1.75.0-nightly (fdaaaf9f9 2023-11-08)
binary: rustc
commit-hash: fdaaaf9f923281ab98b865259aa40fbf93d72c7a
commit-date: 2023-11-08
host: x86_64-unknown-linux-gnu
release: 1.75.0-nightly
LLVM version: 17.0.4
@Shnatsel Shnatsel added the C-bug Category: This is a bug. label Nov 9, 2023
@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Nov 9, 2023
@Shnatsel
Member Author

Shnatsel commented Nov 9, 2023

@Noratrieb Noratrieb added T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. A-parallel-compiler Area: parallel compiler I-compiletime Issue: Problems and improvements with respect to compile times. and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Nov 9, 2023
@mjguzik
Contributor

mjguzik commented Nov 10, 2023

I confirm the issue, compiling the same thing:

cargo build -r 606.59s user 33.28s system 1596% cpu 40.078 total
RUSTFLAGS="-Z threads=8" cargo build -r 771.61s user 49.88s system 1660% cpu 49.476 total

Test system has 24 cores.

rustc 1.75.0-nightly (0f44eb3 2023-11-09)
binary: rustc
commit-hash: 0f44eb3
commit-date: 2023-11-09
host: x86_64-unknown-linux-gnu
release: 1.75.0-nightly
LLVM version: 17.0.4
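
The two `time` lines above can be cross-checked with a little arithmetic: the "% cpu" column is (user + system) / wall-clock time, and comparing user + system between the runs shows how much extra CPU work the threaded build performed. A minimal sketch, using only the figures quoted above; the `Run` type and helper names are made up for illustration:

```rust
/// One `time`-style measurement: user and system CPU seconds plus wall-clock seconds.
struct Run {
    user: f64,
    system: f64,
    wall: f64,
}

impl Run {
    /// Total CPU seconds consumed across all cores.
    fn cpu_seconds(&self) -> f64 {
        self.user + self.system
    }

    /// Average number of cores kept busy (the "% cpu" column divided by 100).
    fn avg_busy_cores(&self) -> f64 {
        self.cpu_seconds() / self.wall
    }
}

/// Relative increase from `before` to `after`, in percent.
fn percent_increase(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

fn main() {
    // Figures copied from the two measurements above.
    let default_build = Run { user: 606.59, system: 33.28, wall: 40.078 };
    let threads_8 = Run { user: 771.61, system: 49.88, wall: 49.476 };

    println!(
        "average busy cores: {:.2} (default) vs {:.2} (-Z threads=8)",
        default_build.avg_busy_cores(),
        threads_8.avg_busy_cores()
    );
    println!(
        "total CPU time increase: {:.1}%",
        percent_increase(default_build.cpu_seconds(), threads_8.cpu_seconds())
    );
}
```

Both runs keep roughly 16 of the 24 cores busy on average, so the longer wall-clock time comes from the threaded build doing more total CPU work (about 28% more), not from cores sitting idle.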

@Kobzol
Contributor

Kobzol commented Nov 10, 2023

Do you have the same flags set for both builds? Setting RUSTFLAGS overrides any rustflags in config.toml; did you perhaps have some options there?

@mjguzik
Contributor

mjguzik commented Nov 10, 2023

In my case this is a fresh clone, 0 local changes.

@Shnatsel
Member Author

Shnatsel commented Nov 10, 2023

I did have mold configured as the linker in config.toml. After removing the config the parallel frontend is still slower, although not by quite as much.

Single thread: 23.29s
8 threads: 25.70s

That's still a 10% regression.

@Shnatsel
Member Author

I've re-measured cargo build --timings and updated the post above to provide a correct baseline, without mold.

I still see syn v1.0.109 compilation time going from 2.3s to 7.4s, tokio v1.29.1 going up from 2.5s to 5.2s, aho-corasick v1.0.2 from 1.1s to 5.1s, and the compilation time of many other crates also increasing. See the full output of --timings for details.

@mjguzik
Contributor

mjguzik commented Nov 10, 2023

I reran with RUSTFLAGS="-Z threads=1", got:

RUSTFLAGS="-Z threads=1" cargo build -r 591.51s user 30.31s system 1561% cpu 39.825 total

which is about the same as without the -Z flag.

@Shnatsel
Member Author

It's curious that a system with a higher core count is seeing a greater regression: 39.825s to 49.476s is a roughly 24% increase in compilation time, compared to a 10% increase on my 6-core CPU.
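
The two slowdowns being compared can be recomputed from the wall-clock figures quoted in this thread. A minimal sketch; the helper name `percent_increase` is just for illustration:

```rust
/// Relative increase from `before` to `after`, in percent.
fn percent_increase(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

fn main() {
    // Wall-clock seconds quoted earlier in the thread.
    // 6-core machine, without mold: single thread vs 8 threads.
    let six_core = percent_increase(23.29, 25.70);
    // 24-core machine: -Z threads=1 vs -Z threads=8.
    let many_core = percent_increase(39.825, 49.476);

    println!("6-core machine:  {six_core:.1}% slower");
    println!("24-core machine: {many_core:.1}% slower");
}
```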

@mjguzik
Contributor

mjguzik commented Nov 10, 2023

It's not particularly curious: adding more threads to a workload that suffers from a scalability problem does tend to increase total run time, and I have more cores to exercise the problem at the same time.

I tried to get a differential flamegraph based on perf record output, but perf report ended up executing for almost 2h(!) before I killed it, bogged down in communicating with addr2line, which kept failing to resolve anything (it was making forward progress, just incredibly slowly, and the result was useless anyway). This was on Debian 12, for interested parties.

@bjorn3
Member

bjorn3 commented Nov 10, 2023

Try perf report --no-inline. That will skip addr2line at the cost of not showing inlined functions.

@nicoburns

nicoburns commented Dec 18, 2024

I'm seeing a similar issue (slower with 8 threads than 1) when compiling the following:

  • Repo: https://github.com/DioxusLabs/blitz
  • Commit [3bb203e89](https://github.com/DioxusLabs/blitz/commit/3bb203e899a3292a5401c305e4b707927759beab)
  • Command: cargo +nightly build --release --package interactive --timings (with RUSTFLAGS="-Z threads=8" vs. without)

Something I noticed was that there was a clear pattern in the timings:

  • Large crates were compiling decently faster (e.g. the stylo crate compiled in ~21s vs ~35s)
  • Small crates were compiling much slower. Crates that were compiling in 0.1s or 0.3s were now taking 2-5 whole seconds (once_cell is a particularly bad example: 0.16s with 1 thread, 4.95s with 8 threads!)

I wonder if some of the scalability issues could be fixed with a simple heuristic based on how big the crate is, disabling parallelism for very small crates.
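
To make the idea concrete, here is a minimal sketch of what such a heuristic could look like. Everything in it is hypothetical: rustc has no `frontend_threads` function, source size in bytes is only one possible proxy for crate size, and the 64 KiB constant is an arbitrary placeholder rather than a measured threshold.

```rust
/// Hypothetical tuning constant: bytes of source that justify one frontend thread.
/// The value is a placeholder, not something measured.
const BYTES_PER_THREAD: u64 = 64 * 1024;

/// Pick a thread count for one crate: tiny crates stay single-threaded,
/// larger crates scale up to the number of threads the user requested.
fn frontend_threads(source_bytes: u64, requested: usize) -> usize {
    let by_size = usize::try_from(source_bytes / BYTES_PER_THREAD).unwrap_or(usize::MAX);
    // Always use at least one thread, never more than requested.
    by_size.clamp(1, requested.max(1))
}

fn main() {
    for (name, bytes) in [("tiny", 10_000u64), ("medium", 200_000), ("huge", 10_000_000)] {
        println!("{name}: {} thread(s)", frontend_threads(bytes, 8));
    }
}
```

A real heuristic would need a better size estimate than raw bytes (for example, macro-heavy crates expand to far more work than their source size suggests), so treat this only as an illustration of the shape of the decision.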

1 Thread:

Image

8 threads:

Image

See attached files for full timings:

build-timings.zip

@mjguzik
Contributor

mjguzik commented Dec 18, 2024

Now that's a blast from the past, I forgot all about this as I dropped Rust.

Looking at this now I think the issue is pretty clear: builds prior to the change were already progressing with parallelism amongst crates (but still single-threaded within a given crate) -- any idle CPU time got scooped up. Now that there is threading for everything there is extra overhead just to spawn these threads (and then have them compete with each other), but they have no idle CPU time to fill in. As a result for small crates there is just more overhead for no benefit.
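
The argument above can be illustrated with a deliberately crude cost model. This is a toy under stated assumptions, not a description of how rustc or cargo schedule work: every crate is the same size, work splits perfectly across threads, and each thread adds a fixed overhead. All parameter names and values are invented for the illustration.

```rust
/// Toy estimate of wall-clock time for building `crates` equally sized crates.
/// Assumes perfect work splitting and a fixed CPU cost per spawned thread;
/// all inputs must be at least 1.
fn toy_wall_time(
    cores: u32,
    crates: u32,
    work_per_crate: f64,
    threads_per_crate: u32,
    overhead_per_thread: f64,
) -> f64 {
    // Total CPU seconds: the real work plus the per-thread overhead, for every crate.
    let total_cpu = f64::from(crates)
        * (work_per_crate + f64::from(threads_per_crate) * overhead_per_thread);
    // We can never use more cores than exist, nor more than there are threads.
    let parallelism = f64::from(cores.min(crates * threads_per_crate));
    total_cpu / parallelism
}

fn main() {
    let (cores, work, overhead) = (6, 1.0, 0.05); // hypothetical values

    // Many crates in flight: the cores are already saturated by crate-level parallelism.
    println!("24 crates, 1 thread each:  {:.2}s", toy_wall_time(cores, 24, work, 1, overhead));
    println!("24 crates, 8 threads each: {:.2}s", toy_wall_time(cores, 24, work, 8, overhead));

    // One crate on its own: extra threads fill otherwise idle cores.
    println!("1 crate, 1 thread:  {:.2}s", toy_wall_time(cores, 1, work, 1, overhead));
    println!("1 crate, 8 threads: {:.2}s", toy_wall_time(cores, 1, work, 8, overhead));
}
```

In this model the saturated case gets slower with more threads because only the overhead term grows, while the lone crate gets faster, which matches the pattern reported earlier in the thread of large crates speeding up and small ones slowing down.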

Some heuristic for thread count would definitely be welcome, but what this probably wants on top of it is a make-like jobserver.

@bjorn3
Member

bjorn3 commented Dec 18, 2024

We are already using the jobserver protocol inside rustc to limit thread spawning, with cargo taking the place of make if cargo itself doesn't receive a jobserver pipe. This has been the case since forever, to limit the number of threads running LLVM. I'm not entirely sure how rustc-rayon reacts when it wants to spawn a new thread but the jobserver says no. Does it block? Or does it put the work item on the queue of another thread?
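
The two possible behaviours in that question can be sketched with a toy token pool. This is not how rustc, rustc-rayon, or the real jobserver protocol (which passes tokens over a pipe) is implemented; `TokenPool` and `run_limited` are hypothetical names, and the pool is an in-process stand-in built from a mutex and a condition variable.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Toy stand-in for a jobserver: a fixed pool of tokens shared by all workers.
struct TokenPool {
    available: Mutex<usize>,
    released: Condvar,
}

impl TokenPool {
    fn new(tokens: usize) -> Self {
        TokenPool { available: Mutex::new(tokens), released: Condvar::new() }
    }

    /// "Does it block?": wait until a token is free, then take it.
    fn acquire(&self) {
        let mut n = self.available.lock().unwrap();
        while *n == 0 {
            n = self.released.wait(n).unwrap();
        }
        *n -= 1;
    }

    /// The alternative: report failure so the caller can queue the work elsewhere.
    fn try_acquire(&self) -> bool {
        let mut n = self.available.lock().unwrap();
        if *n == 0 {
            false
        } else {
            *n -= 1;
            true
        }
    }

    /// Return a token and wake one waiter, if any.
    fn release(&self) {
        *self.available.lock().unwrap() += 1;
        self.released.notify_one();
    }
}

/// Run `jobs` tasks on separate threads, each holding a token while it works.
/// Returns the highest number of tasks observed working at the same time.
fn run_limited(jobs: usize, tokens: usize) -> usize {
    let pool = Arc::new(TokenPool::new(tokens));
    let running = Arc::new(Mutex::new((0usize, 0usize))); // (current, peak)
    let handles: Vec<_> = (0..jobs)
        .map(|_| {
            let pool = Arc::clone(&pool);
            let running = Arc::clone(&running);
            thread::spawn(move || {
                pool.acquire();
                {
                    let mut r = running.lock().unwrap();
                    r.0 += 1;
                    r.1 = r.1.max(r.0);
                }
                thread::sleep(Duration::from_millis(5)); // pretend to do work
                running.lock().unwrap().0 -= 1;
                pool.release();
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let peak = running.lock().unwrap().1;
    peak
}

fn main() {
    let peak = run_limited(16, 4);
    println!("peak concurrent workers with 4 tokens: {peak}");
}
```

With the blocking `acquire`, surplus threads still exist but sit idle until a token frees up; with `try_acquire`, the caller learns immediately that no token is available and can keep the work on an already running thread instead.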

8 participants