
parallelize LLVM optimization and codegen passes #16367

Merged
merged 14 commits into from
Sep 6, 2014

Conversation

spernsteiner
Contributor

This branch adds support for running LLVM optimization and codegen on different parts of a crate in parallel. Instead of translating the crate into a single LLVM compilation unit, rustc now distributes items in the crate among several compilation units, and spawns worker threads to optimize and codegen each compilation unit independently. This improves compile times on multicore machines, at the cost of worse performance in the compiled code. The intent is to speed up build times during development without sacrificing too much optimization.

On the machine I tested this on, librustc build time with -O went from 265 seconds (master branch, single-threaded) to 115s (this branch, with 4 threads), a speedup of 2.3x. For comparison, the build time without -O was 90s (single-threaded). Bootstrapping rustc using 4 threads gets a 1.6x speedup over the default settings (870s vs. 1380s), and building librustc with the resulting stage2 compiler takes 1.3x as long as the master branch (44s vs. 55s, single threaded, ignoring time spent in LLVM codegen).

The user-visible changes from this branch are two new codegen flags:

  • -C codegen-units=N: Distribute items across N compilation units.
  • -C codegen-threads=N: Spawn N worker threads for running optimization and codegen. (It is possible to set codegen-threads larger than codegen-units, but this is not very useful.)

Internal changes to the compiler are described in detail in the individual commit messages.

Note: The first commit on this branch is copied from #16359, which this branch depends on.

r? @nick29581

@metajack
Contributor

metajack commented Aug 8, 2014

Have you clocked the resulting code performance difference? This seems like it would be great for servo.

@brson
Contributor

brson commented Aug 8, 2014

Very excited about this.

I don't see the words 'codegen-threads' in this patch. Are you sure it exists? What happens when you specify --codegen-units but not --codegen-threads?

@liigo
Contributor

liigo commented Aug 9, 2014

Replace the two with --codegen-tasks?

On Aug 9, 2014, 7:45 AM, "Brian Anderson" notifications@github.com wrote:

Very excited about this.

I don't see the words 'codegen-threads' in this patch. Are you sure it
exists? What happens when you specify --codegen-units but not
--codegen-threads?

// For LTO purposes, the bytecode of this library is also
// inserted into the archive. We currently do this only when
// codegen_units == 1, so we don't have to deal with multiple
// bitcode files per crate.
Member


Could each module be linked into one module to be emitted? (is that too timing-intensive?)

Otherwise, could we name the files bytecodeN.deflate and list how many bytecodes are in the metadata?

Contributor Author


Could each module be linked into one module to be emitted? (is that too timing-intensive?)

I haven't tried this yet. I don't think it would take much longer than we already spend linking the object files together. The only problem is, all the LLVM modules are in separate contexts, so we would need to serialize each one and then deserialize into a shared context for linking.

Otherwise, could we name the files bytecodeN.deflate and list how many bytecodes are in the metadata?

Yeah, this is my preferred solution (which I also haven't tried implementing yet).

@alexcrichton
Member

Could you describe some of the difficulties with sharing the Session across worker threads? Was it mainly that Rc is used liberally inside of it? If so, do you think it would ever be feasible to share the Session in the worker threads?

@alexcrichton
Member

This is also some super amazing work, I'm incredibly excited to see where this goes! Major props @epdtry! 🐟

}

match config.opt_level {
Some(opt_level) => {
Member


While you're at it, could you 4-space tab this match?

@asterite

asterite commented Aug 9, 2014

I don't know if it's applicable, but the way we do it in Crystal is to have one llvm module for each "logical unit". In our case each logical unit is a class or a module. Maybe in Rust a "logical unit" is a struct, an array, etc., together with all its impls. Then you can also have another logical unit for the top level functions.

Then we fire up N threads and each one takes a task (an llvm module) to compile. This greatly reduces the compilation time. When you split your whole program into N modules and fire up N threads to compile those (as you are proposing here), if a thread finishes early it's left without a job to do, so it becomes idle. With M smaller modules and N threads, with M > N, when a thread finishes it can start working on another module, reducing the idle time.
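The many-small-modules scheme can be sketched as a shared work queue, where an idle thread immediately picks up the next unfinished module. (A rough sketch: compile_module and the queue layout are illustrative stand-ins, not Crystal's or rustc's actual internals.)

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Stand-in for optimizing and codegen'ing one LLVM module.
fn compile_module(id: usize) -> usize {
    id // pretend the result is the index of the emitted object file
}

// M modules, N threads, M > N: threads pop from a shared queue until it is
// empty, so no thread sits idle while work remains.
fn compile_all(num_modules: usize, num_threads: usize) -> Vec<usize> {
    let queue = Arc::new(Mutex::new((0..num_modules).collect::<Vec<_>>()));
    let done = Arc::new(Mutex::new(Vec::new()));
    let handles: Vec<_> = (0..num_threads)
        .map(|_| {
            let (queue, done) = (Arc::clone(&queue), Arc::clone(&done));
            thread::spawn(move || loop {
                // A thread that finishes early immediately takes the next module.
                let next = queue.lock().unwrap().pop();
                match next {
                    Some(id) => done.lock().unwrap().push(compile_module(id)),
                    None => break,
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let mut objects = Arc::try_unwrap(done).unwrap().into_inner().unwrap();
    objects.sort();
    objects
}
```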

Additionally, before compiling each module we write its bitcode to a .bc file in a hidden directory (.crystal, in our case). We then compare that .bc file to the .bc file generated by the previous run. If they turn out to be the same (and this will be true as long as you don't modify any impl of that logical unit), we can safely reuse the .o file of the previous run. This, again, dramatically reduces the time to recompile a project that had minimal changes.

Bits of the source code implementing this behaviour are here and here, in case you want to take a look.
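That reuse check can be sketched as a straight byte comparison against the bitcode saved by the previous run. (The cache directory layout and names here are illustrative, not Crystal's or rustc's actual scheme.)

```rust
use std::fs;
use std::path::Path;

// Reuse the previous run's .o file if the freshly generated bitcode for this
// logical unit is byte-for-byte identical to what we saved last time.
fn can_reuse_object(cache_dir: &Path, unit: &str, new_bitcode: &[u8]) -> bool {
    match fs::read(cache_dir.join(format!("{}.bc", unit))) {
        Ok(old_bitcode) => old_bitcode == new_bitcode,
        Err(_) => false, // no cached bitcode: must compile from scratch
    }
}
```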

@vadimcn
Contributor

vadimcn commented Aug 9, 2014

^THIS^ !!! Please, please implement incremental compilation!

Though I wonder if (transitively) unchanged modules could be culled right after resolution pass, based on source file timestamps and inter-module dependency info, which should be available at that point.
But even if they are culled after translation, it would be a major boon in day-to-day development.

@nrc
Member

nrc commented Aug 9, 2014

I would say that the 'logical unit' in Rust is a module, they tend to be fairly small (at least relative to crate size) and are naturally self contained. It is probably worth getting data (at least) for smaller units - thanks for the idea!

Incremental compilation is the next part of the project - looking forward to what comes out of that :-)

@spernsteiner
Contributor Author

@metajack:

Have you clocked the resulting code performance difference?

Compiling rustc and all libraries using 4 compilation units produces a rustc that takes about 25% longer to run.

@brson:

I don't see the words 'codegen-threads' in this patch. Are you sure it exists?

It's a codegen flag, so the flag name codegen-threads is generated by a macro from the variable name codegen_threads.
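A minimal sketch of that trick (hypothetical, not rustc's actual options macro): stringify! captures the field identifier, and replacing underscores yields the user-facing flag name.

```rust
// Derive the command-line flag name from a field identifier, e.g.
// `codegen_threads` -> "codegen-threads".
macro_rules! flag_name {
    ($field:ident) => {
        stringify!($field).replace('_', "-")
    };
}
```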

What happens when you specify --codegen-units but not --codegen-threads?

rustc generates several compilation units, then runs optimization and codegen for them all sequentially.

@alexcrichton:

Could you describe some of the difficulties with sharing the Session across worker threads?

Like most of rustc's major data structures, Session uses RefCell all over the place. I suppose we could share it using a Mutex, if we changed how the ownership is handled and were careful about the lifetimes of the mutex guards.

@asterite:

When you split your whole program in N modules and fire up N threads to compile those (as you are proposing here), if a thread finishes early its left without a job to do, so a thread becomes idle.

One way I tried to address this problem was by adding some basic load balancing: rustc tries to make each LLVM module roughly the same size, so that each worker thread gets the same amount of work to do. I also made codegen-units and codegen-threads separate flags so that you can have several smaller modules per worker thread. (Though in the testing I've done so far, it doesn't seem to help.)
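The load balancing described here can be sketched as a greedy assignment: each item lands in whichever compilation unit is currently smallest. (The item "sizes" stand in for rustc's instruction-count estimates; this is an illustrative sketch, not the actual code.)

```rust
// Greedy size balancing: put each item into the currently smallest unit,
// so units end up with roughly equal amounts of work.
fn assign_units(item_sizes: &[u64], num_units: usize) -> Vec<usize> {
    assert!(num_units > 0);
    let mut unit_size = vec![0u64; num_units];
    item_sizes
        .iter()
        .map(|&size| {
            // Find the unit with the least work so far (ties -> lowest index).
            let mut unit = 0;
            for i in 1..num_units {
                if unit_size[i] < unit_size[unit] {
                    unit = i;
                }
            }
            unit_size[unit] += size;
            unit
        })
        .collect()
}
```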

@vadimcn:

Though I wonder if (transitively) unchanged modules could be culled right after resolution pass, based on source file timestamps and inter-module dependency info, which should be available at that point.

This is basically my next project. Translation is the #2 time sink in rustc (after LLVM passes), so culling modules (or finer-grained items) before translation seems like the way to go.

@spernsteiner
Contributor Author

@alexcrichton:
I think I've fixed all the things you mentioned, except that I haven't implemented LTO against separately compiled libraries yet.

Regarding ld -r (since Github has unhelpfully collapsed that line comment), I haven't seen any problems yet on Linux or OSX. On both I have bootstrapped rustc and run the test suite normally with no problems. On Linux I have also run the test suite with codegen-units > 1, also with no problems. I haven't tested it on Windows yet.

@alexcrichton
Member

Like most of rustc's major data structures, Session uses RefCell all over the place. I suppose we could share it using a Mutex, if we changed how the ownership is handled and were careful about the lifetimes of the mutex guards.

I would definitely expect an Arc<Mutex<Session>> to be passed around (maybe Option<Session> so it could be unwrapped). I'm not entirely sure if this could be done because Rc<T> isn't Send, and I think that the session has a bunch of Rc pointers, but I'm not sure how hard it would be to get rid of those.

It looked like it would make parts of this much nicer to have access to the raw session rather than duplicating some logic here and there, but it may not be worth it in the end.

Regarding ld -r (since Github has unhelpfully collapsed that line comment), I haven't seen any problems yet on Linux or OSX

I don't think that any of our tests actually use the object file emitted, they just emit it. I also recall that the linker always succeeded in creating an object, but the object itself was just unusable (for one reason or another). Again though, this could all just be misremembering, or some bug which has since been fixed!

@spernsteiner
Contributor Author

I don't think that any of our tests actually use the object file emitted, they just emit it.

The run-pass tests link the object into an executable, run the resulting binary, and check that it works. At least one step in that process should fail if ld -r emits a bad object file.

@alexcrichton
Member

Oh dear, I must be overlooking a test! I only see two instances of emit=.*obj in the codebase, one is the output-type-permutations run-make test (no linking involved there), and the other is the codegen tests (no linking involved either). What was the test that uses the output of ld -r?

@spernsteiner
Contributor Author

OK, let me back up. I think the relevant part of the design was unclear.

On the master branch, rustc produces a single object file crate.o. Then it feeds crate.o into the linker to produce an executable or shared object.

On this branch, rustc produces several object files crate.0.o, crate.1.o, etc. It feeds those into ld -r to produce a combined object file crate.o. Then crate.o is used to produce the final executable/library just like before. (That's why this branch does not need any changes to link_dylib and such.)

So, on this branch, any test that involves compiling and running Rust code will end up using ld -r as part of the linking process.
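The partial-link step can be sketched as building the ld -r invocation (paths and argument order here are illustrative; the real code in rustc's back::link handles more cases):

```rust
use std::path::Path;
use std::process::Command;

// Combine crate.0.o, crate.1.o, ... into one relocatable crate.o, so the
// later link of the final executable/library runs unchanged.
fn partial_link_cmd(output: &Path, inputs: &[&Path]) -> Command {
    let mut cmd = Command::new("ld");
    cmd.arg("-r"); // -r: emit a relocatable object file, not an executable
    cmd.arg("-o").arg(output);
    cmd.args(inputs);
    cmd
}
```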

@alexcrichton
Member

Oh wow, I missed that entirely, I thought it was only used for OutputTypeObject! Sorry I missed that!

In that case, I'm definitely willing to trust ld -r.

codegen_units: uint = (1, parse_uint,
"divide crate into N units for optimization and codegen"),
codegen_threads: uint = (1, parse_uint,
"number of worker threads to use when running codegen"),
Member


Given there was no benefit to having different values here, let's just have one option.

@nrc
Member

nrc commented Aug 23, 2014

OK, looks good! r=me with all the changes (most of which are nits, TBH) and with Alex's review. @alexcrichton r? (specifically the stuff in back and concerning linking, about which I have no idea).

@spernsteiner
Contributor Author

@vadimcn: Wow, nice detective work! I would never have expected rm to behave like that.

Here's a second reason why the Windows bot should never have had a problem to begin with: ld is supposed to ignore --force-exe-suffix if -r is specified - and it's apparently worked that way since binutils version 2.7, released in 1996. I don't know what kind of mingw build we've got installed on those Windows bots, but it's definitely not an up-to-date MSYS2, and on top of that it seems to have picked up some strange patches at some point.

Anyway, I've added a workaround that should handle these inconsistencies between Windows toolchains. Now (on Windows) rustc always adds .exe to the output file name when running ld -r, and then after linking renames the output file to the actual desired name. This should give the correct behavior no matter how ld handles --force-exe-suffix -r.
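The naming half of that workaround can be sketched as follows (an illustrative sketch, not the actual rustc code):

```rust
use std::path::{Path, PathBuf};

// Always give `ld -r` an output name that already ends in ".exe"; whether or
// not this ld's --force-exe-suffix handling kicks in with -r, the output
// lands at a known name, and we rename it back to the desired name after
// linking.
fn ld_r_output_name(desired: &Path) -> PathBuf {
    let mut name = desired.as_os_str().to_owned();
    name.push(".exe");
    PathBuf::from(name)
}
// After `ld -r` succeeds: std::fs::rename(ld_r_output_name(out), out)
```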

r? @alexcrichton

@alexcrichton
Member

Well then, that's a new segfault I've never seen before!

@l0kod l0kod mentioned this pull request Aug 31, 2014
Break up `CrateContext` into `SharedCrateContext` and `LocalCrateContext`.  The
local piece corresponds to a single compilation unit, and contains all
LLVM-related components.  (LLVM data structures are tied to a specific
`LLVMContext`, and we will need separate `LLVMContext`s to safely run
multithreaded optimization.)  The shared piece contains data structures that
need to be shared across all compilation units, such as the `ty::ctxt` and some
tables related to crate metadata.
Refactor the code in `llvm::back` that invokes LLVM optimization and codegen
passes so that it can be called from worker threads.  (Previously, it used
`&Session` extensively, and `Session` is not `Share`.)  The new code can handle
multiple compilation units, by compiling each unit to `crate.0.o`, `crate.1.o`,
etc., and linking together all the `crate.N.o` files into a single `crate.o`
using `ld -r`.  The later linking steps can then be run unchanged.

The new code preserves the behavior of `--emit`/`-o` when building a single
compilation unit.  With multiple compilation units, the `--emit=asm/ir/bc`
options produce multiple files, so combinations like `--emit=ir -o foo.ll` will
not actually produce `foo.ll` (they instead produce several `foo.N.ll` files).

The new code supports `-Z lto` only when using a single compilation unit.
Compiling with multiple compilation units and `-Z lto` will produce an error.
(I can't think of any good reason to do such a thing.)  Linking with `-Z lto`
against a library that was built as multiple compilation units will also fail,
because the rlib does not contain a `crate.bytecode.deflate` file.  This could
be supported in the future by linking together the `crate.N.bc` files produced
when compiling the library into a single `crate.bc`, or by making the LTO code
support multiple `crate.N.bytecode.deflate` files.
When inlining an item from another crate, use the original symbol from that
crate's metadata instead of generating a new symbol using the `ast::NodeId` of
the inlined copy.  This requires exporting symbols in the crate metadata in a
few additional cases.  Having predictable symbols for inlined items will be
useful later to avoid generating duplicate object code for inlined items.
Rotate between compilation units while translating.  The "worker threads"
commit added support for multiple compilation units, but only translated into
one, leaving the rest empty.  With this commit, `trans` rotates between various
compilation units while translating, using a simple strategy: upon entering a
module, switch to translating into whichever compilation unit currently
contains the fewest LLVM instructions.

Most of the actual changes here involve getting symbol linkage right, so that
items translated into different compilation units will link together properly
at the end.
…t glue

Use a shared lookup table of previously-translated monomorphizations/glue
functions to avoid translating those functions in every compilation unit where
they're used.  Instead, the function will be translated in whichever
compilation unit uses it first, and the remaining compilation units will link
against that original definition.
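The shared table can be sketched as a map from a monomorphization's identity to the unit that first translated it. (MonoId, Lookup, and the tuple layout are hypothetical names for illustration, not rustc's actual types.)

```rust
use std::collections::HashMap;

// (generic fn name, concrete type args) -- a stand-in for rustc's real key.
type MonoId = (String, Vec<String>);

enum Lookup {
    NeedsTranslation,   // first use: emit the definition in this unit
    LinkAgainst(usize), // already translated in compilation unit N
}

// The first compilation unit to need a monomorphization translates it;
// later units just link against the existing definition.
fn lookup_or_claim(
    table: &mut HashMap<MonoId, usize>,
    id: MonoId,
    current_unit: usize,
) -> Lookup {
    match table.get(&id) {
        Some(&unit) => Lookup::LinkAgainst(unit),
        None => {
            table.insert(id, current_unit);
            Lookup::NeedsTranslation
        }
    }
}
```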
Add a post-processing pass to `trans` that converts symbols from external to
internal when possible.  Translation with multiple compilation units initially
makes most symbols external, since it is not clear when translating a
definition whether that symbol will need to be accessed from another
compilation unit.  This final pass internalizes symbols that are not reachable
from other crates and not referenced from other compilation units, so that LLVM
can perform more aggressive optimizations on those symbols.
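The pass can be sketched as a filter over the crate's defined symbols (hypothetical names; the real check consults the reachability analysis and per-unit reference sets):

```rust
use std::collections::HashSet;

// A symbol can be made internal only if it is neither reachable from other
// crates nor referenced from a compilation unit other than its own.
fn internalizable<'a>(
    symbols: &[&'a str],
    reachable: &HashSet<&str>,       // visible to other crates
    cross_unit_refs: &HashSet<&str>, // used from another compilation unit
) -> Vec<&'a str> {
    symbols
        .iter()
        .copied()
        .filter(|s| !reachable.contains(s) && !cross_unit_refs.contains(s))
        .collect()
}
```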
Adjust the handling of `#[inline]` items so that they get translated into every
compilation unit that uses them.  This is necessary to preserve the semantics
of `#[inline(always)]`.

Crate-local `#[inline]` functions and statics are blindly translated into every
compilation unit.  Cross-crate inlined items and monomorphizations of
`#[inline]` functions are translated the first time a reference is seen in each
compilation unit.  When using multiple compilation units, inlined items are
given `available_externally` linkage whenever possible to avoid duplicating
object code.
@spernsteiner
Contributor Author

Older versions of OSX's ld64 linker parse object files using variable-size stack-allocated buffers for some temporary data structures. The bus error seen on 6a60448 occurs because the object file contains too much stuff (mainly, too many unwinding table entries), and those stack-allocated buffers overflow the 8MB stack limit of the parser thread. This 8MB stack size is hard-coded inside ld64, so we can't work around the bug by bumping up stack size with ulimit -s.

On master, the librustc build works fine because rustc.o requires about 5MB of stack to parse. This branch triggers a stack overflow because it uses ld -r to generate rustc.o (even with -C codegen-units=1), and ld -r adds a __compact_unwind section to the generated object file. Parsing librustc's __compact_unwind section uses an additional 4MB of stack, which puts the parser thread over its 8MB limit. There is an undocumented flag -no_compact_unwind which is supposed to suppress the generation of the __compact_unwind section, but this flag is ignored when passed in combination with -r.

Newer versions of ld64 fix the stack overflow bug, by having the object file parser use malloc when the required buffer size is large. Unfortunately, according to Wikipedia, the fixed ld64 versions (224.1+) are available only with XCode 5+, for OSX 10.8+, while Rust is supposed to support building on OSX 10.7. I'm not sure if there is any way to install newer ld64 on older versions of OSX.

The latest commit on this branch avoids running ld -r when building with only a single compilation unit (which is probably a good idea regardless of the ld64 bug). This will let librustc build without errors (giant object file, but no ld -r doubling its stack use), and the separate compilation tests should also pass (ld -r, but tiny object files). It doesn't fix the underlying problem, though - if anyone using XCode 4 tries to build a large crate with parallel codegen enabled, they will get a nasty segfault from the linker. (Though note that rustc master can already trigger the same error without ld -r, for crates with about twice as many functions as librustc.)

@alexcrichton
Member

@epdtry, oh my, that is quite the investigation! That's quite unfortunate that we'll segfault on older versions of OSX. It looks like there's not a whole lot we can do right now though. I'm sad that this may mean that we have to turn off parallel codegen for rustc itself by default (at least for osx), but we can cross that bridge later!

@alexcrichton
Member

Also, major major props for that investigation, that must have been quite a beast to track down!

bors added a commit that referenced this pull request Sep 6, 2014
@bors bors closed this Sep 6, 2014
@bors bors merged commit 6d2d47b into rust-lang:master Sep 6, 2014
@l0kod
Contributor

l0kod commented Sep 6, 2014

Awesome work! Looks great for an efficient #2369.

Though I wonder if (transitively) unchanged modules could be culled right after resolution pass, based on source file timestamps and inter-module dependency info, which should be available at that point.

I think the Ninja build system uses hashes of the compilation commands and source files (all dependencies) instead of relying on timestamps. @bors should like it ;)
Shake can do it as well for source files: http://neilmitchell.blogspot.fr/2014/06/shake-file-hashesdigests.html

@japaric
Member

japaric commented Sep 6, 2014

Q: Is this flag ignored if --test is passed to the compiler?

I just tried rustc --test -L target/deps -C codegen-units=8 src/lib.rs on my library that has 300+ tests and the compile time is still 20 seconds, and CPU usage remained at 100% (one thread).

Did I do something wrong? (Also -C codegen-threads=8 returns error: unknown codegen option)

@spernsteiner
Contributor Author

@japaric,

The flag is not ignored, it's just that your library is small enough that it doesn't get much benefit from this patch (especially when optimization is turned off).

With -C codegen-units=1 (the default):

time: 1.879 s   translation
  time: 0.142 s llvm function passes
  time: 0.067 s llvm module passes
  time: 3.547 s codegen passes
  time: 0.000 s codegen passes
time: 4.264 s   LLVM passes
  time: 0.408 s running linker
time: 0.409 s   linking

real    0m16.718s

And with -C codegen-units=4:

time: 2.927 s   translation
time: 0.054 s   llvm function passes
time: 0.055 s   llvm function passes
time: 0.056 s   llvm function passes
time: 0.055 s   llvm function passes
time: 0.022 s   llvm module passes
time: 0.025 s   llvm module passes
time: 0.025 s   llvm module passes
time: 0.026 s   llvm module passes
time: 1.422 s   codegen passes
time: 1.443 s   codegen passes
time: 1.448 s   codegen passes
time: 0.000 s   codegen passes
time: 1.474 s   codegen passes
time: 1.875 s   LLVM passes
  time: 0.489 s running linker
time: 0.492 s   linking

real    0m15.472s

Since rustc spends only 4 seconds in LLVM passes to begin with, there is not much room for improvement. Setting codegen-units=4 reduces the time by about 2.5s, but also slows down translation and linking, so the overall benefit is tiny.

Also -C codegen-threads=8 returns error: unknown codegen option

I removed that flag because in my testing I found no benefit from setting codegen-threads != codegen-units.

@japaric
Member

japaric commented Sep 6, 2014

@epdtry Thanks for the detailed info!

It seems that the bottleneck is the type checking phase in my particular case, and now I'm wondering if spending 50%+ of the time in that phase is normal (but that's off-topic).

bors added a commit to rust-lang-ci/rust that referenced this pull request Jan 21, 2024
fix: Make `value_ty` query fallible