| 1 | +--- |
| 2 | +layout: post |
| 3 | +title: Faster compilation with the parallel front-end in nightly |
| 4 | +author: Nicholas Nethercote |
| 5 | +team: The Parallel Rustc Working Group <https://www.rust-lang.org/governance/teams/compiler#Parallel%20rustc%20working%20group> |
| 6 | +--- |
| 7 | + |
| 8 | +The Rust compiler's front-end can now use parallel execution to significantly |
| 9 | +reduce compile times. To try it, run the nightly compiler with the `-Z |
| 10 | +threads=8` option. This feature is currently experimental, and we aim to ship |
| 11 | +it in the stable compiler in 2024. |
| 12 | + |
| 13 | +Keep reading to learn why a parallel front-end is needed and how it works, or |
| 14 | +just skip ahead to the [How to use it](parallel-rustc.html#how-to-use-it) |
| 15 | +section. |
| 16 | + |
| 17 | +## Compile times and parallelism |
| 18 | + |
| 19 | +Rust compile times are a perennial concern. The [Compiler Performance Working |
| 20 | +Group](https://www.rust-lang.org/governance/teams/compiler#Compiler%20performance%20working%20group) |
| 21 | +has continually improved compiler performance for several years. For example, |
| 22 | +in the first 10 months of 2023, there were mean reductions in compile time of |
| 23 | +[13%](https://perf.rust-lang.org/compare.html?start=2023-01-01&end=2023-10-31&stat=wall-time&nonRelevant=true), |
| 24 | +in peak memory use of |
| 25 | +[15%](https://perf.rust-lang.org/compare.html?start=2023-01-01&end=2023-10-31&stat=max-rss&nonRelevant=true), |
| 26 | +and in binary size of |
| 27 | +[7%](https://perf.rust-lang.org/compare.html?start=2023-01-01&end=2023-10-31&stat=size%3Alinked_artifact&nonRelevant=true), |
| 28 | +as measured by our performance suite. |
| 29 | + |
| 30 | +However, at this point the compiler has been heavily optimized and new |
| 31 | +improvements are hard to find. There is no low-hanging fruit remaining. |
| 32 | + |
| 33 | +But there is one piece of large but high-hanging fruit: parallelism. Current |
| 34 | +Rust compiler users benefit from two kinds of parallelism, and the newly |
| 35 | +parallel front-end adds a third kind. |
| 36 | + |
| 37 | +### Existing interprocess parallelism |
| 38 | + |
| 39 | +When you compile a Rust program, Cargo launches multiple rustc processes, |
| 40 | +compiling multiple crates in parallel. This works well. Try compiling a large |
| 41 | +Rust program with the `-j1` flag to disable this parallelization and it will |
| 42 | +take a lot longer than normal. |
| 43 | + |
| 44 | +You can visualize this parallelism if you build with Cargo's |
| 45 | +[`--timings`](https://doc.rust-lang.org/cargo/reference/timings.html) flag, |
| 46 | +which produces a chart showing how the crates are compiled. The following image |
| 47 | +shows the timeline when building [ripgrep](https://crates.io/crates/ripgrep) on |
| 48 | +a machine with 28 virtual cores. |
| 49 | + |
| 50 | + |
| 51 | + |
| 52 | +There are 60 horizontal lines, each one representing a distinct process. Their |
| 53 | +durations range from a fraction of a second to multiple seconds. Most of them |
| 54 | +are rustc, and the few orange ones are build scripts. The first twenty all |
| 55 | +start at the same time. This is possible because there are no dependencies |
| 56 | +between the relevant crates. But further down the graph, parallelism reduces as |
| 57 | +crate dependencies increase. Although the compiler can overlap compilation of |
| 58 | +dependent crates somewhat thanks to a feature called [pipelined |
| 59 | +compilation](https://github.com/rust-lang/rust/issues/60988), there is much |
| 60 | +less parallel execution happening towards the end of compilation, and this is |
| 61 | +typical for large Rust programs. Interprocess parallelism is not enough to take |
| 62 | +full advantage of many cores. For more speed, we need parallelism within each process. |
| 63 | + |
| 64 | +### Existing intraprocess parallelism: the back-end |
| 65 | + |
| 66 | +The compiler is split into two halves: the front-end and the back-end. |
| 67 | + |
| 68 | +The front-end does many things, including parsing, type checking, and borrow |
| 69 | +checking. Until this week, it could not use parallel execution. |
| 70 | + |
| 71 | +The back-end performs code generation. It generates code in chunks called |
| 72 | +"codegen units" and then LLVM processes these in parallel. This is a form of |
| 73 | +coarse-grained parallelism. |
| 74 | + |
| 75 | +We can visualize the difference between the serial front-end and the parallel |
| 76 | +back-end. The following image shows the output of a profiler called |
| 77 | +[Samply](https://github.com/mstange/samply/) measuring rustc as it does a |
| 78 | +release build of the final crate in Cargo. The image is superimposed with |
| 79 | +markers that indicate front-end and back-end execution. |
| 80 | + |
| 81 | + |
| 82 | + |
| 83 | +Each horizontal line represents a thread. The main thread is labelled "rustc" |
| 84 | +and is shown at the bottom. It is busy for most of the execution. The other 16 |
| 85 | +threads are LLVM threads, labelled "opt cgu.00" through to "opt cgu.15". There |
| 86 | +are 16 threads because 16 is the default number of codegen units for a release |
| 87 | +build. |
| 88 | + |
| 89 | +There are several things worth noting. |
| 90 | +- Front-end execution takes 10.2 seconds. |
| 91 | +- Back-end execution takes 6.2 seconds, and the LLVM threads are running |
| 92 | + for 5.9 seconds of that. |
| 93 | +- The parallel code generation is highly effective. Imagine if all those LLVM |
| 94 | + threads executed one after another! |
| 95 | +- Even though there are 16 LLVM threads, at no point are all 16 executing at |
| 96 | + the same time, despite this being run on a machine with 28 cores. (The peak |
| 97 | + is 14 or 15.) This is because the main thread translates its internal code |
| 98 | + representation (MIR) to LLVM's code representation (LLVM IR) in serial. This |
| 99 | + takes a brief period for each codegen unit, and explains the staircase shape |
| 100 | + on the left-hand side of the code generation threads. There is some room for |
| 101 | + improvement here. |
| 102 | +- The front-end is entirely serial. There is a lot of room for improvement |
| 103 | + here. |
| 104 | + |
| 105 | +### New intraprocess parallelism: the front-end |
| 106 | + |
| 107 | +The front-end is now capable of parallel execution. It uses |
| 108 | +[Rayon](https://crates.io/crates/rayon) to perform compilation tasks using |
| 109 | +fine-grained parallelism. Many data structures are synchronized by mutexes and |
| 110 | +read-write locks, atomic types are used where appropriate, and many front-end |
| 111 | +operations are made parallel. The addition of parallelism was done by modifying |
| 112 | +a relatively small number of key points in the code. The vast majority of the |
| 113 | +front-end code did not need to be changed. |
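
As a rough illustration of the pattern described above, here is a minimal sketch of fine-grained parallelism over shared state. It is not rustc's code: it uses scoped threads from the standard library instead of Rayon so that it is self-contained, and `check_item` and `check_all` are hypothetical names standing in for real front-end work.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Hypothetical stand-in for one small front-end task (e.g. checking one item).
fn check_item(name: &str) -> usize {
    name.len()
}

/// Processes `items` on `num_threads` worker threads. Results go into a
/// mutex-protected map, and completed items are counted with an atomic.
fn check_all(items: &[&str], num_threads: usize) -> (HashMap<String, usize>, usize) {
    let results = Mutex::new(HashMap::new());
    let done = AtomicUsize::new(0);
    let next = AtomicUsize::new(0); // index of the next unclaimed item

    thread::scope(|s| {
        for _ in 0..num_threads.max(1) {
            s.spawn(|| loop {
                // Claim one item at a time so idle threads pick up remaining work.
                let i = next.fetch_add(1, Ordering::Relaxed);
                let Some(item) = items.get(i) else { break };
                let value = check_item(item);
                results.lock().unwrap().insert(item.to_string(), value);
                done.fetch_add(1, Ordering::Relaxed);
            });
        }
    });

    (results.into_inner().unwrap(), done.into_inner())
}

fn main() {
    let items = ["parse", "typeck", "borrowck", "lint"];
    let (results, done) = check_all(&items, 8);
    println!("{done} items checked: {results:?}");
}
```

Only the code that touches shared state (the map and the counters) needs synchronization. The per-item work in `check_item` is ordinary serial code, which mirrors the point that most front-end code did not need to change.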
| 114 | + |
| 115 | +When the parallel front-end is enabled and configured to use eight threads, we |
| 116 | +get the following Samply profile when compiling the same example as before. |
| 117 | + |
| 118 | + |
| 119 | + |
| 120 | +Again, there are several things worth noting. |
| 121 | +- Front-end execution takes 5.9 seconds (down from 10.2 seconds). |
| 122 | +- Back-end execution takes 5.3 seconds (down from 6.2 seconds), and the LLVM |
| 123 | + threads are running for 4.9 seconds of that (down from 5.9 seconds). |
| 124 | +- There are seven additional threads labelled "rustc" operating in the |
| 125 | + front-end. The reduced front-end time shows they are reasonably effective, |
| 126 | + but the thread utilization is patchy, with the eight threads all having |
| 127 | + periods of inactivity. There is room for significant improvement here. |
| 128 | +- Eight of the LLVM threads start at the same time. This is because the eight |
| 129 | + "rustc" threads create the LLVM IR for eight codegen units in parallel. (For |
| 130 | + seven of those threads that is the only work they do in the back-end.) After |
| 131 | + that, the staircase effect returns because only one "rustc" thread does LLVM |
| 132 | + IR generation while seven or more LLVM threads are active. If the number of |
| 133 | + threads used by the front-end were changed to 16, the staircase shape would |
| 134 | + disappear entirely, though in this case the final execution time would barely |
| 135 | + change. |
| 136 | + |
| 137 | +### Putting it all together |
| 138 | + |
| 139 | +Rust compilation has long benefited from interprocess parallelism, via Cargo, |
| 140 | +and from intraprocess parallelism in the back-end. It can now also benefit from |
| 141 | +intraprocess parallelism in the front-end. |
| 142 | + |
| 143 | +You might wonder how interprocess parallelism and intraprocess parallelism |
| 144 | +interact. If we have 20 parallel rustc invocations and each one can have up to |
| 145 | +16 threads running, could we end up with hundreds of threads on a machine with |
| 146 | +only tens of cores, resulting in inefficient execution as the OS tries its best |
| 147 | +to schedule them? |
| 148 | + |
| 149 | +Fortunately no. The compiler uses the [jobserver |
| 150 | +protocol](https://www.gnu.org/software/make/manual/html_node/POSIX-Jobserver.html) |
| 151 | +to limit the number of threads it creates. If a lot of interprocess parallelism |
| 152 | +is occurring, intraprocess parallelism will be limited appropriately, and |
| 153 | +the number of threads will not exceed the number of cores. |
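
The idea behind this kind of limit can be sketched with a shared pool of tokens, where a thread may only run while it holds one. The sketch below is an in-process analogy only: the real jobserver shares tokens between processes (on POSIX, through a pipe or FIFO), and `TokenPool` and `peak_concurrency` are hypothetical names, not part of rustc or Cargo.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Condvar, Mutex};
use std::thread;

/// A shared pool of job tokens. A thread must hold a token while it runs.
struct TokenPool {
    available: Mutex<usize>,
    released: Condvar,
}

impl TokenPool {
    fn new(tokens: usize) -> Self {
        TokenPool { available: Mutex::new(tokens), released: Condvar::new() }
    }

    /// Blocks until a token is free, then takes it.
    fn acquire(&self) {
        let mut n = self.available.lock().unwrap();
        while *n == 0 {
            n = self.released.wait(n).unwrap();
        }
        *n -= 1;
    }

    /// Returns a token to the pool and wakes one waiting thread.
    fn release(&self) {
        *self.available.lock().unwrap() += 1;
        self.released.notify_one();
    }
}

/// Spawns `jobs` threads that each need a token, and returns the peak number
/// that were doing work at the same time.
fn peak_concurrency(tokens: usize, jobs: usize) -> usize {
    let pool = TokenPool::new(tokens);
    let running = AtomicUsize::new(0);
    let peak = AtomicUsize::new(0);

    thread::scope(|s| {
        for _ in 0..jobs {
            s.spawn(|| {
                pool.acquire();
                let now = running.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                thread::yield_now(); // stand-in for real compilation work
                running.fetch_sub(1, Ordering::SeqCst);
                pool.release();
            });
        }
    });

    peak.into_inner()
}

fn main() {
    // 20 worker threads compete for 4 tokens (think: 4 cores).
    println!("peak concurrency: {}", peak_concurrency(4, 20));
}
```

However many threads are created, the number doing work at any moment never exceeds the number of tokens, which is the property the paragraph above relies on.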
| 154 | + |
| 155 | +## How to use it |
| 156 | + |
| 157 | +The nightly compiler is now [shipping with the parallel front-end |
| 158 | +enabled](https://github.com/rust-lang/rust/pull/117435). However, **by default |
| 159 | +it runs in single-threaded mode** and won't reduce compile times. |
| 160 | + |
| 161 | +Keen users can opt into multi-threaded mode with the `-Z threads` option. For |
| 162 | +example: |
| 163 | +``` |
| 164 | +$ RUSTFLAGS="-Z threads=8" cargo build --release |
| 165 | +``` |
| 166 | +Alternatively, to opt in from a |
| 167 | +[config.toml](https://doc.rust-lang.org/cargo/reference/config.html) file (for |
| 168 | +one or more projects), add these lines: |
| 169 | +``` |
| 170 | +[build] |
| 171 | +rustflags = ["-Z", "threads=8"] |
| 172 | +``` |
| 173 | +It may be surprising that single-threaded mode is the default. Why parallelize |
| 174 | +the front-end and then run it in single-threaded mode? The answer is simple: |
| 175 | +caution. This is a big change! The parallel front-end has a lot of new code. |
| 176 | +Single-threaded mode exercises most of the new code, but excludes the |
| 177 | +possibility of threading bugs such as deadlocks that can affect multi-threaded |
| 178 | +mode. Even in Rust, parallel programs are harder to write correctly than serial |
| 179 | +programs. For this reason the parallel front-end also won't be shipped in beta |
| 180 | +or stable releases for some time. |
| 181 | + |
| 182 | +### Performance effects |
| 183 | + |
| 184 | +When the parallel front-end is run in single-threaded mode, compilation times |
| 185 | +are typically 0% to 2% slower than with the serial front-end. This should be |
| 186 | +barely noticeable. |
| 187 | + |
| 188 | +When the parallel front-end is run in multi-threaded mode with `-Z threads=8`, |
| 189 | +our [measurements on real-world |
| 190 | +code](https://github.com/rust-lang/compiler-team/issues/681) show that compile |
| 191 | +times can be reduced by up to 50%, though the effects vary widely and depend on |
| 192 | +the characteristics of the code and its build configuration. For example, dev |
| 193 | +builds are likely to see bigger improvements than release builds because |
| 194 | +release builds usually spend more time doing optimizations in the back-end. A |
| 195 | +small number of cases compile more slowly in multi-threaded mode than |
| 196 | +single-threaded mode. These are mostly tiny programs that already compile |
| 197 | +quickly. |
| 198 | + |
| 199 | +We recommend eight threads because this is the configuration we have tested the |
| 200 | +most and it is known to give good results. Values lower than eight will see |
| 201 | +smaller benefits. Values greater than eight will give diminishing returns and |
| 202 | +may even give worse performance. |
| 203 | + |
| 204 | +If a 50% improvement seems low when going from one to eight threads, recall |
| 205 | +from the explanation above that the front-end only accounts for part of compile |
| 206 | +times, and the back-end is already parallel. You can't beat [Amdahl's |
| 207 | +Law](https://en.wikipedia.org/wiki/Amdahl%27s_law). |
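
To make the Amdahl's Law point concrete, here is a small back-of-the-envelope sketch. It plugs in the front-end and back-end times from the serial profile earlier in this post, and it assumes the two phases do not overlap, which is a simplification. The function name `amdahl_speedup` is just for illustration.

```rust
/// Overall speedup when a fraction `p` of the run time is made `s` times
/// faster and the rest is unchanged (Amdahl's Law).
fn amdahl_speedup(p: f64, s: f64) -> f64 {
    1.0 / ((1.0 - p) + p / s)
}

fn main() {
    // Times in seconds from the serial profile above. Treating the two
    // phases as non-overlapping is a simplifying assumption.
    let front_end = 10.2;
    let back_end = 6.2;
    let p = front_end / (front_end + back_end);

    for s in [2.0, 8.0, f64::INFINITY] {
        let speedup = amdahl_speedup(p, s);
        let reduction = 100.0 * (1.0 - 1.0 / speedup);
        println!("front-end {s}x faster: {speedup:.2}x overall, {reduction:.0}% less time");
    }
}
```

Under these assumptions the front-end accounts for roughly 62% of that run, so even an infinitely fast front-end could not cut the total time by more than that. Real front-end speedups with eight threads are well short of 8x, so observed gains are smaller still.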
| 208 | + |
| 209 | +Memory usage can increase significantly in multi-threaded mode. We have seen |
| 210 | +increases of up to 35%. This is unsurprising given that various parts of |
| 211 | +compilation, each of which requires a certain amount of memory, are now |
| 212 | +executing in parallel. |
| 213 | + |
| 214 | +### Correctness |
| 215 | + |
| 216 | +Reliability in single-threaded mode should be high. |
| 217 | + |
| 218 | +In multi-threaded mode there are some known bugs, including deadlocks. If |
| 219 | +compilation hangs, you have probably hit one of them. |
| 220 | + |
| 221 | +### Feedback |
| 222 | + |
| 223 | +If you have any problems with the parallel front-end, please [check the issues |
| 224 | +marked with the "WG-compiler-parallel" |
| 225 | +label](https://github.com/rust-lang/rust/labels/WG-compiler-parallel). |
| 226 | +If your problem does not match any of the existing issues, please file a new |
| 227 | +issue. |
| 228 | + |
| 229 | +For more general feedback, please start a discussion on the [wg-parallel-rustc |
| 230 | +Zulip |
| 231 | +channel](https://rust-lang.zulipchat.com/#narrow/stream/187679-t-compiler.2Fwg-parallel-rustc). |
| 232 | +We are particularly interested to hear about the performance effects on the code you |
| 233 | +care about. |
| 234 | + |
| 235 | +## Future work |
| 236 | + |
| 237 | +We are working to improve the performance of the parallel front-end. As the |
| 238 | +graphs above showed, there is room to improve the utilization of the threads in |
| 239 | +the front-end. We are also ironing out the remaining bugs in multi-threaded |
| 240 | +mode. |
| 241 | + |
| 242 | +We aim to stabilize the `-Z threads` option and ship the parallel front-end |
| 243 | +running by default in multi-threaded mode on stable releases in 2024. |
| 244 | + |
| 245 | +## Acknowledgments |
| 246 | + |
| 247 | +The parallel front-end has been under development for a long time. It was |
| 248 | +started by [@Zoxc](https://github.com/Zoxc/), who also did most of the work for |
| 249 | +several years. After a period of inactivity, the project was revived this year |
| 250 | +by [@SparrowLii](https://github.com/sparrowlii/), who led the effort to get it |
| 251 | +shipped. Other members of the Parallel Rustc Working Group have also been |
| 252 | +involved with reviews and other activities. Many thanks to everyone involved. |