
Parallelize rustc via multi-process approach #47518

Closed
michaelwoerister opened this issue Jan 17, 2018 · 17 comments
Labels

  • A-concurrency (Area: Concurrency)
  • A-process (Area: `std::process` and `std::env`)
  • C-enhancement (Category: An issue proposing an enhancement or a PR with one.)
  • I-compilemem (Issue: Problems and improvements with respect to memory usage during compilation.)
  • I-compiletime (Issue: Problems and improvements with respect to compile times.)
  • T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.)
  • WG-compiler-parallel (Working group: Parallelizing the compiler)

Comments

@michaelwoerister
Member

michaelwoerister commented Jan 17, 2018

For big crates, the Rust compiler can be stuck in single-threaded execution for quite some time because only the last phase of compilation is properly parallelized. This issue describes one particular approach for making most of compilation parallel.

Basic Concept: Spawn multiple rustc processes that compile "vertical slices" of a crate

The compiler's internal architecture has become rather flexible and demand-driven over the last couple of years, and one could imagine implementing an option that allows the compiler to compile just part of a crate. Given a deterministic partitioning of a crate, one could then run multiple compilation processes for compiling disjoint parts of the crate in parallel and then stitch those parts together in a final step. This is very similar to a traditional compiler & linker setup.
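
As a very rough illustration of the orchestration this would require, here is a minimal sketch in Rust; the `-Zpartition` flag, the `--stitch` step, and the file names are purely hypothetical placeholders for whatever mechanism the compiler would actually expose:

```rust
use std::process::Command;

fn main() -> std::io::Result<()> {
    const NUM_PARTITIONS: usize = 4;

    // Spawn one rustc process per "vertical slice" of the crate.
    // The `-Zpartition=i/n` flag is hypothetical; it stands in for whatever
    // mechanism would tell rustc which deterministic partition to compile.
    let children: Vec<_> = (0..NUM_PARTITIONS)
        .map(|i| {
            Command::new("rustc")
                .arg("src/lib.rs")
                .arg(format!("-Zpartition={}/{}", i, NUM_PARTITIONS))
                .arg("-o")
                .arg(format!("target/slice-{}.part", i))
                .spawn()
        })
        .collect::<Result<_, _>>()?;

    // The processes run independently; the only coordination point is
    // waiting for all of them to finish.
    for mut child in children {
        child.wait()?;
    }

    // A final, equally hypothetical stitching step combines the slices into
    // the usual crate artifacts, analogous to a traditional link step.
    Command::new("rustc")
        .arg("--stitch")
        .args((0..NUM_PARTITIONS).map(|i| format!("target/slice-{}.part", i)))
        .status()?;

    Ok(())
}
```

The point is only that the parent process does nothing beyond spawning the slices and waiting for them; all of the real work stays inside ordinary rustc invocations.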

Advantages

  • The compiler itself could keep its current architecture, which is optimized for single-threaded execution. This sidesteps the problem of having to either find one architecture that works well for both single- and multi-threaded settings or maintain two purpose-built architectures at the same time.
  • Almost no coordination is needed between the parallel processes, which removes a great deal of complexity from maintenance and from reasoning about performance.
  • Using multiple processes potentially scales to multi-machine setups, something that might become relevant for big projects with distributed build systems.
  • The implementation of this feature can be confined to a small number of special modules instead of touching almost every aspect of the compiler.
  • If low memory consumption rather than short compile times is the priority, compiling the parts serially instead of in parallel would reduce peak memory usage (as @Zoxc pointed out below).

Disadvantages

  • Overall, more redundant work will be done for compiling a single crate. Parsing, macro expansion, and coherence checking (to name a few) will be re-done by every process. Things that heavily rely on caching, like trait selection, will have to start from an empty cache for each process. (Note though that each individual process can run in incremental mode.)
  • Doing redundant things in parallel is likely to increase peak memory consumption.
  • The stitching phase does not work very well with our existing crate metadata format.
  • Error messages would have to be handled specially in order to avoid interleaving and duplication.

Conclusion

I am not particularly advocating for following this approach. This issue is meant to provide input for a wider discussion on how to bring more parallelism to the compilation process. This approach is kind of brute-force. However, I have to say, after thinking about it a little I am surprised to actually find it viable :)

cc @rust-lang/compiler

@michaelwoerister added the A-concurrency, I-compiletime, T-compiler, and I-compilemem labels on Jan 17, 2018
@eddyb
Member

eddyb commented Jan 17, 2018

I still think compiling dependent crates in parallel with codegen is better.
Alternatively, we could try the MIR-only rlibs approach and rely on (incremental) codegen units at the very final binary compilation, which would be the only step doing codegen.

@michaelwoerister
Member Author

I'm thinking of cases like the librustc crate that takes more than a minute to generate metadata on my (rather fast) machine. MIR-only rlibs are an orthogonal concern.

@Zoxc
Contributor

Zoxc commented Jan 17, 2018

Another benefit of this approach is that we can split a crate into n parts and only compile one at a time, reducing overall memory consumption.

@retep998
Member

Some crates are rather heavy on parsing and expansion, such as winapi, which spends 17% of its time on just that. Duplicating that work across multiple processes might not be the best idea.

@estebank
Contributor

I wonder whether cargo could have a way to recompile the current crate multiple times with different flags and automatically update Cargo.toml with the "best"-performing flags for any given crate.

@michaelwoerister
Member Author

@Zoxc Added it to the list.
@retep998 I imagine this to be opt-in on a crate-by-crate basis. I also imagine that at least macro expansion will become splittable at some point. This would be desirable for incremental compilation as well.
@estebank Sounds like a neat subcommand.

@Zoxc
Contributor

Zoxc commented Jan 20, 2018

We could also scale the compiler to multiple machines by distributing codegen units compiled to LLVM bitcode and running optimizations on multiple machines. This would be very effective for release builds, given how LLVM dominates the build time and is already parallel. My plan to parallelize the compiler using Rayon would ensure we could generate and send LLVM bitcode even faster, making this more effective.

This has a number of advantages:

  • Doesn't need a complete toolchain, just a single binary with a matching LLVM library. This could quickly be downloaded from rust-lang.org
  • Avoids the need to send source files to build servers. The build server does not need access to a file system at all
  • Allows easy load balancing by doing work stealing of bitcode files
  • It avoids duplicate work. This is especially valuable if the machines are also used for other things (like C++ compilation)
  • We do not have to deal with error messages, as no errors can occur once LLVM bitcode is generated

The disadvantage is that only LLVM optimization and code generation can be distributed, though that is a large portion of the compile time.
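
To make the shape of this concrete, here is a minimal coordinator-side sketch; `send_to_worker` is a hypothetical stand-in for whatever transport would ship bitcode to a build machine, and a shared queue stands in for real work stealing:

```rust
use std::collections::VecDeque;
use std::path::{Path, PathBuf};
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical helper: in a real setup this would ship one LLVM bitcode file
// to `worker`, run optimization and codegen there, and return the resulting
// object file. Here it only simulates that by renaming the path.
fn send_to_worker(worker: &str, bitcode: &Path) -> PathBuf {
    println!("sending {} to {}", bitcode.display(), worker);
    bitcode.with_extension("o")
}

fn distribute(bitcode_files: Vec<PathBuf>, workers: &[&'static str]) -> Vec<PathBuf> {
    // A shared queue of bitcode files stands in for work stealing: every
    // worker repeatedly pulls the next unit of work, which load-balances
    // uneven codegen units across machines.
    let queue = Arc::new(Mutex::new(VecDeque::from(bitcode_files)));
    let objects = Arc::new(Mutex::new(Vec::new()));

    let handles: Vec<_> = workers
        .iter()
        .map(|&worker| {
            let queue = Arc::clone(&queue);
            let objects = Arc::clone(&objects);
            thread::spawn(move || loop {
                // Hold the lock only long enough to pop one item.
                let next = queue.lock().unwrap().pop_front();
                match next {
                    Some(bc) => objects.lock().unwrap().push(send_to_worker(worker, &bc)),
                    None => break,
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    Arc::try_unwrap(objects).unwrap().into_inner().unwrap()
}

fn main() {
    let bitcode: Vec<PathBuf> = (0..8).map(|i| PathBuf::from(format!("cgu{}.bc", i))).collect();
    let objects = distribute(bitcode, &["build1.example.org", "build2.example.org"]);
    println!("got {} object files back", objects.len());
}
```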

This seems like a good idea to me, especially if we could make it easy to set up.

Distributing work across multiple machines also seems to be an effective way to speed up bors. Does @rust-lang/infra have any opinions on this?

@aidanhs
Member

aidanhs commented Jan 20, 2018

#44675 (comment) indicated that tweaking codegen-units decreased bootstrap time but increased time taken to run tests. So even if we distributed and sped up compilation of rustc itself, that's only one part of the story for bors times.

I can see us trying it if it was available, just wanted to note it may not be an easy win.

@Zoxc
Contributor

Zoxc commented Jan 20, 2018

@aidanhs I expect that ThinLTO will bring performance with multiple codegen units on par with a single one. We may have to wait a bit for that though. Updating LLVM would be a good start.

@jkordish added the C-enhancement label on Feb 1, 2018
@michaelwoerister
Member Author

@Zoxc's variant would mesh well with MIR-only RLIBs.

@Zoxc
Contributor

Zoxc commented Jan 24, 2019

Another way to distribute work across multiple machines is to send the whole crate as source code to all of them. Each machine could then run a single rustc instance. We'd distribute the CGUs to be generated equally between the rustc instances and invoke the query for generating each CGU. With an on-demand architecture, this should only do the work required to actually generate the CGU (only typeck and borrowck for the functions involved in the CGU, etc.). We can apply work stealing if the time spent on each CGU is uneven.

This scheme isn't as efficient as the one I proposed above, since parsing and other things would be done per machine, but other things like type checking could scale better.
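
For concreteness, the initial split could be as simple as a round-robin assignment of CGUs to machines; in this minimal sketch, `compile_cgu` is a hypothetical stand-in for forcing the query that generates one CGU on the local rustc instance:

```rust
// Hypothetical stand-in for invoking the query that generates one codegen
// unit on this machine (typeck/borrowck/codegen only for the items it needs).
fn compile_cgu(cgu_index: usize) {
    println!("compiling CGU {}", cgu_index);
}

// Round-robin assignment: machine `machine_index` of `num_machines` owns
// every CGU whose index is congruent to it modulo the machine count.
fn cgus_for_machine(num_cgus: usize, num_machines: usize, machine_index: usize) -> Vec<usize> {
    (0..num_cgus).filter(|cgu| cgu % num_machines == machine_index).collect()
}

fn main() {
    let (num_cgus, num_machines, machine_index) = (16, 4, 1);
    for cgu in cgus_for_machine(num_cgus, num_machines, machine_index) {
        compile_cgu(cgu);
    }
}
```

Work stealing would then move CGUs between machines whenever one finishes its share early.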

@michaelwoerister
Member Author

That's roughly what I proposed here originally (+work stealing, maybe?).

@Zoxc
Contributor

Zoxc commented Jan 24, 2019

@michaelwoerister And it would use a single parallel rustc instance per machine, instead of multiple rustc instances per machine, like you proposed.

@tidux

tidux commented Dec 17, 2019

If we're compiling across multiple machines, https://github.com/distcc/distcc might be a good reference point.

@workingjubilee added the A-process and WG-compiler-parallel labels on Jul 22, 2023
@workingjubilee
Member

I am referring this issue to WG-compiler-parallel for assessment, given that they are focusing on parallelism (via the distinct, multithreaded approach), and I believe they should accept or reject it.

@michaelwoerister
Member Author

For what it's worth, I mostly opened this issue to explore the design space a bit. I don't think there's a reason to keep it open.

@nnethercote
Contributor

With the parallel front-end having just shipped to nightly, I think we can close this.
