Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci: Conditionally build parallel compiler on try #59417

Closed
wants to merge 5 commits into from

Conversation

alexcrichton
Copy link
Member

@alexcrichton alexcrichton commented Mar 25, 2019

This commit configures Travis/AppVeyor to conditionally compile parallel
compilers on @bors: try. This is an experiment currently to see how
this plays out, but the intention is that if the commit message contains
the term "parallel-compiler" then when @bors: try is issued it will
perform differently than the try branch does today, building three
compilers: Linux, macOS, and Windows. We currently have no try
builders for macOS or Windows due to typical capacity issues, so it's
intended that this is only very sparingly used from time to time when
necessary.

cc #48685, tracking issue for parallel compilation

@rust-highfive
Copy link
Collaborator

r? @nikomatsakis

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Mar 25, 2019
@alexcrichton
Copy link
Member Author

@bors: try

Let's see what happens...

@bors

This comment has been minimized.

@bors

This comment has been minimized.

@bors bors added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 25, 2019
@alexcrichton

This comment has been minimized.

@bors

This comment has been minimized.

@bors

This comment has been minimized.

@alexcrichton alexcrichton force-pushed the parallel-binaries branch 4 times, most recently from b47a1c3 to acf826e Compare March 25, 2019 19:59
@alexcrichton

This comment has been minimized.

@bors

This comment has been minimized.

@alexcrichton

This comment has been minimized.

@bors

This comment has been minimized.

bors added a commit that referenced this pull request Mar 25, 2019
ci: Conditionally build parallel compiler on `try`

This commit configures Travis/AppVeyor to conditionally compile parallel
compilers on `@bors: try`. This is an experiment currently to see how
this plays out, but the intention is that if the commit message contains
the term "parallel-compiler" then when `@bors: try` is issued it will
perform differently than the try branch does today, building three
compilers: Linux, macOS, and Windows. We currently have no `try`
builders for macOS or Windows due to typical capacity issues, so it's
intended that this is only very sparingly used from time to time when
necessary.

[parallel-compiler]
@alexcrichton

This comment has been minimized.

@rust-highfive

This comment has been minimized.

@bors

This comment has been minimized.

@alexcrichton
Copy link
Member Author

@bors: try

@bors
Copy link
Contributor

bors commented Apr 1, 2019

⌛ Trying commit 4382c25 with merge 9743253127ed10bcfb5b046647599d211f0b0a0b...

@alexcrichton
Copy link
Member Author

@bors: try

Ok I'm gonna run a test real quick and see what happens if we just enable compiling a parallel compiler but it defaults to 1 thread instead of num_cpus. I'm curious what sort of perf slowdown we'll see

@bors
Copy link
Contributor

bors commented Apr 1, 2019

⌛ Trying commit c4fa1e0 with merge cb23a46f60b0233f366367dd11b5fdc241bce328...

@Zoxc
Copy link
Contributor

Zoxc commented Apr 1, 2019

There's another parallel compiler bug introduced in #58929, but it should still work with a single thread.

@bors
Copy link
Contributor

bors commented Apr 1, 2019

☀️ Try build successful - checks-travis
Build commit: cb23a46f60b0233f366367dd11b5fdc241bce328

@alexcrichton
Copy link
Member Author

Let's see if we can get some ballpark numbers...

@rust-timer build cb23a46f60b0233f366367dd11b5fdc241bce328

@rust-timer
Copy link
Collaborator

Success: Queued cb23a46f60b0233f366367dd11b5fdc241bce328 with parent eab3eb3, comparison URL.

@rust-timer
Copy link
Collaborator

Finished benchmarking try commit cb23a46f60b0233f366367dd11b5fdc241bce328

@alexcrichton
Copy link
Member Author

Ok so that looks like it's a blanket 2-3% slowdown across the board if we enable parallel compilation and peg rustc to just one thread. That's... frankly amazing!

I think given those kind of numbers it might actually be feasible to just turn this on by default for all releases. We'd get a small slowdown which we'd quickly be able to recover many times over by bumping up the default number of threads.

@Zoxc I'm curious, do you think that rustc might be ready for this transition? Concretely I'd imagine that we'd turn on the parallel_compiler #[cfg] by default for CI (like done in this PR), but we'd still default to -Zthreads=1 (or something like that). We'd afterwards likely quickly remove the #[cfg(parallel_compiler)] annotations, only using the parallel versions.

Later once ready we could default threads to greater than 1, but until then we could allow testing on nightly via the unstable -Zthreads option for local investigations.

@alexcrichton
Copy link
Member Author

I've also started some chat on zulip about this!

@mati865
Copy link
Contributor

mati865 commented Apr 2, 2019

Wall-time doesn't look convincing, maybe there is a lot of blocking?

@alexcrichton
Copy link
Member Author

Ah yes sorry my mistake! The interesting metric here indeed is wall time, not instructions executed. That comparison URL is located here and is indeed unfortunately less inspiring, showing a blanket ~10% slowdown, which I think is a bit too serious to land just now.

@Zoxc in addition to the question above about whether the compiler is ready for this, are you aware of low hanging fruit for optimizing?

@alexcrichton
Copy link
Member Author

Hm well so it's also probably worthwhile noting the magnitude of changes here, it looks like all 10%+ changes are in tiny crates, regressing from 5 to 6 seconds for example. Larger crates like servo ones do regress but still on the order of a second in a 15 second compilation already, so I don't think it's necessarily damning evidence

@mati865
Copy link
Contributor

mati865 commented Apr 2, 2019

packed-simd already takes very long time to build and it regresses here by ~10%. That's quite worrying and could be worth checking.

@Zoxc
Copy link
Contributor

Zoxc commented Apr 2, 2019

Turning on parallel_compiler with a single thread should be fine from a correctness standpoint.

I'm not aware of any low hanging fruit, there might be some though as I've mostly focused on making the code run in parallel. Maybe @nnethercote wants to help here and look for some excessive locking?

I'm working on getting rid of Arcs in the query system, but I suspect the overhead comes mostly from locking, and maybe some from Rayon. I can measure this though.

@nnethercote
Copy link
Contributor

One distinct downside of (true, multi-thread) parallelization is that it makes benchmarking much harder. Wall time is the true metric of interest, but it's so noisy that e.g. 1% improvements/regression are impossible to spot. Currently we can use instruction counts as a proxy for wall time, because instruction counts have much less variation, and they're a pretty good proxy. But true parallelization will majorly reduce the usefulness of instruction counts as a metric.

It's a hard problem, I'm not sure what to do about it. It's one reason why I'm biased toward trying to improve coarse-grained parallelization (e.g. pipelining) rather than fine-grained parallelization.

@alexcrichton
Copy link
Member Author

@nnethercote that's a good point! I suspect though we can solve it by always passing -Zthreads=1 to benchmarks on perf.r-l.o?

@alexcrichton
Copy link
Member Author

@Zoxc I don't mind also doing some profiling to look into this, the packed_simd case pointed out by @mati865 is a good one to look into, so I'll try to investigate that and post some results here tomorrow

@nnethercote
Copy link
Contributor

@alexcrichton: Let's assume we reach the point where we are shipping a parallel compiler.
Passing -Zthreads=1 would mean that instruction counts is still a good proxy for wall time, which is good... but it would also mean that we are measuring something that we aren't shipping, which is bad :(

@Zoxc
Copy link
Contributor

Zoxc commented Apr 2, 2019

I opened #59649, #59647, #59644 and #59641 to find out what the cause of the regressions are on a high level.

@Zoxc
Copy link
Contributor

Zoxc commented Apr 3, 2019

Seems like the locks are to blame for most of the regressions.

@alexcrichton
Copy link
Member Author

Ok I think that we've identified an actionable way to go (thanks @Zoxc) so I don't think we're going to want to take this approach where @bors: try builds parallel compilers and we recommend rustup-toolchain-install-master for testing.

To that end I'm going to close this and I've opened up #59667 to track further work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-blocked Status: Blocked on something else such as an RFC or other implementation work. T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.