---
layout: post
title: Faster compilation with the parallel front-end in nightly
author: Nicholas Nethercote
team: The Parallel Rustc Working Group <https://www.rust-lang.org/governance/teams/compiler#Parallel%20rustc%20working%20group>
---

The Rust compiler's front-end can now use parallel execution to significantly
reduce compile times. To try it out, run the nightly compiler with the `-Z
threads=8` command line option.

Keep reading to learn why a parallel front-end is needed and how it works, or
just skip ahead to the [How to use it](parallel-rustc.html#how-to-use-it)
section.

## Compile times and parallelism

Rust compile times are a perennial concern. The [Compiler Performance Working
Group](https://www.rust-lang.org/governance/teams/compiler#Compiler%20performance%20working%20group)
has been consistently improving compiler performance for several years. For
example, in the first 10 months of 2023, we achieved mean reductions in compile
time of
[13%](https://perf.rust-lang.org/compare.html?start=2023-01-01&end=2023-10-31&stat=wall-time&nonRelevant=true),
in peak memory use of
[15%](https://perf.rust-lang.org/compare.html?start=2023-01-01&end=2023-10-31&stat=max-rss&nonRelevant=true),
and in binary size of
[7%](https://perf.rust-lang.org/compare.html?start=2023-01-01&end=2023-10-31&stat=size%3Alinked_artifact&nonRelevant=true),
as measured by our performance suite.

Unfortunately, at this point the compiler has been heavily optimized, and there
is no low-hanging fruit left. New improvements are hard to find.

But there is one piece of large but high-hanging fruit: parallelism. Current
Rust compiler users benefit from two kinds of parallelism, and the newly
parallel front-end adds a third kind.

### Existing interprocess parallelism

When you compile a Rust program, Cargo launches multiple rustc processes,
compiling multiple crates in parallel. This works well. Try compiling a large
Rust program with the `-j1` flag to disable this parallelization and it will
take a lot longer than normal.
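
For example, in a Cargo project you can restrict the build to a single process
at a time like this:

```console
# Allow at most one job at a time, i.e. no interprocess parallelism.
cargo build -j 1
```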

You can visualise this parallelism if you build with Cargo's
[`--timings`](https://doc.rust-lang.org/cargo/reference/timings.html) flag,
which produces a chart showing how the crates were compiled. The following
image shows the timeline when building
[ripgrep](https://crates.io/crates/ripgrep) on a machine with 28 virtual cores.
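
(To produce a chart like this for your own project, run something along these
lines; the report is written to `target/cargo-timings/cargo-timing.html`.)

```console
cargo build --timings
```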



There are 60 horizontal lines, each one representing a distinct process. Some
take a tiny fraction of a second, and some take multiple seconds. Most of them
are rustc, and the few orange ones are build scripts. The first twenty run in
parallel. This is possible because there are no dependencies between the
relevant crates. But as we get further down the graph, parallelism reduces as
crate dependencies increase. Although the compiler can overlap compilation of
dependent crates somewhat thanks to a feature called [pipelined
compilation](https://github.com/rust-lang/rust/issues/60988), there is much
less parallel execution happening towards the end of compilation, and this is
typical for large Rust programs. Interprocess parallelism will only take us
some of the way. For more speed, we need parallelism within each process.

### Existing intraprocess parallelism: the back-end

The compiler is split into two halves: the front-end and the back-end.

The front-end parses the code, and performs type checking, borrow checking, and
various other things. Until now it has not used any parallelism.

The back-end performs code generation. It generates code in chunks called
"codegen units" and then LLVM processes these in parallel. This is a form of
coarse-grained parallelism.

We can visualize the difference with a profiler. The following image shows the
output of a profiler called [Samply](https://github.com/mstange/samply/)
measuring rustc doing a release build of the final crate in Cargo. The
image is superimposed with markers that indicate front-end and back-end
execution.



Each horizontal line represents a thread. The main thread is labelled "rustc"
and is shown at the bottom. It is busy for most of the execution. The other 16
threads are LLVM code generation threads, labelled "opt cgu.00" through to "opt
cgu.15". There are 16 threads because 16 is the default number of codegen units
for a release build.

There are several things worth noting.
- Front-end execution takes 10.2 seconds.
- Back-end execution takes 6.2 seconds, and the LLVM threads are running
  for 5.9 seconds of that.
- The parallel code generation is highly effective. Imagine if all those code
  generation threads executed one after another!
- Even though there are 16 code generation threads, at no point are all 16
  executing at the same time, despite this being run on a machine with 28
  cores. (The peak is 14 or 15.) This is because the main thread translates
  its internal code representation (MIR) to LLVM's code representation (LLVM
  IR) in serial. This takes a brief period for each codegen unit, and explains
  the staircase shape on the left-hand side of the code generation threads.
  There is some room for improvement here.
- The front-end is entirely serial. There is a lot of room for improvement
  here.
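
If you want to generate profiles like these for your own builds, Samply can be
installed with Cargo and used to record an arbitrary command. As a rough sketch
(the profiles in this post were taken of individual rustc invocations, whose
exact command lines can be seen with `cargo build -v`):

```console
cargo install samply
# Record a profile and open it in the Firefox Profiler for viewing.
samply record cargo build --release
```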

### New intraprocess parallelism: the front-end

The front-end is now capable of parallel execution. It uses
[Rayon](https://crates.io/crates/rayon) to perform compilation tasks using
fine-grained parallelism. Many data structures are synchronized by mutexes and
read-write locks, atomic types are used where appropriate, and many front-end
operations are made parallel. The addition of parallelism was done by modifying
a relatively small number of key points in the code. The vast majority of the
front-end code did not need to be changed.
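
The details of rustc's internals are beyond the scope of this post, but the
general shape will be familiar to users of Rayon: work items are processed by a
thread pool, and shared state is protected by synchronization. The following is
not rustc code, just a minimal sketch of that pattern, with an invented
`check_item` function standing in for real per-item analysis:

```rust
use rayon::prelude::*;
use std::collections::HashMap;
use std::sync::Mutex;

// Stand-in for some per-item front-end analysis.
fn check_item(name: &str) -> usize {
    name.len()
}

fn main() {
    let items = vec!["foo", "bar", "baz"];
    // A shared data structure protected by a mutex, as many of rustc's now are.
    let results = Mutex::new(HashMap::new());

    // Each item is analysed on some Rayon worker thread.
    items.par_iter().for_each(|item| {
        let result = check_item(item);
        results.lock().unwrap().insert(*item, result);
    });

    println!("{:?}", results.lock().unwrap());
}
```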

When the parallel front-end is enabled and configured to use eight threads, we
get the following Samply profile when compiling the same example as before.


Again, there are several things worth noting.
- Front-end execution takes 5.9 seconds (down from 10.2 seconds).
- Back-end execution takes 5.3 seconds (down from 6.2 seconds), and the LLVM
  threads are running for 4.9 seconds of that (down from 5.9 seconds).
- There are seven additional threads labelled "rustc" operating in the
  front-end. The reduced front-end time shows they are reasonably effective,
  but the thread utilization is patchy, with the eight threads all having
  periods of inactivity. There is room for significant improvement here.
- Eight of the LLVM threads start at the same time. This is because the eight
  "rustc" threads create the LLVM IR for eight codegen units in parallel. (For
  seven of those threads that is the only work they do in the back-end.) After
  that, the staircase effect returns because only one "rustc" thread does LLVM
  IR generation while seven or more other LLVM threads are active. If the
  number of threads used by the front-end were changed to 16, the staircase
  shape would disappear entirely, though in this case the final execution time
  would barely change.

### Putting it all together

Rust compilation has long benefited from interprocess parallelism, via Cargo,
and from intraprocess parallelism in the back-end. It can now also benefit from
intraprocess parallelism in the front-end.

Attentive readers might be wondering how intraprocess parallelism and
interprocess parallelism interact. If we have 20 parallel rustc invocations and
each one can have up to 16 threads running, might we end up with hundreds of
threads on a machine with only tens of cores, resulting in inefficient
execution as the OS tries its best to schedule them?

Fortunately, no. The compiler uses the [jobserver
protocol](https://www.gnu.org/software/make/manual/html_node/POSIX-Jobserver.html)
to limit the number of threads it creates. If a lot of interprocess parallelism
is occurring, intraprocess parallelism will be limited appropriately, and
the number of threads will not exceed the number of cores.
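
For the curious: the protocol amounts to a shared pool of job tokens that
processes and threads must acquire before doing parallel work. The following is
not how Cargo and rustc wire this up internally, just a sketch of the token
idea using the [`jobserver`](https://crates.io/crates/jobserver) crate, which
is the crate Cargo itself uses:

```rust
use jobserver::Client;

fn main() {
    // In a real build Cargo creates the jobserver and rustc inherits it via
    // the environment; here we just create one with four tokens to illustrate.
    let client = Client::new(4).expect("failed to create jobserver");

    let handles: Vec<_> = (0..8)
        .map(|i| {
            let client = client.clone();
            std::thread::spawn(move || {
                // Each worker must hold a token, so at most four of these
                // eight threads do real work at any one time.
                let token = client.acquire().expect("failed to acquire token");
                println!("worker {i} running");
                drop(token); // releases the token for another worker
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}
```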

## How to use it

As of (XXX: date enabled) the nightly compiler is [shipped with the parallel
front-end enabled](https://github.com/rust-lang/rust/pull/117435). However,
**by default it runs in single-threaded mode** and won't reduce compile times.
Keen users who want to try multi-threaded mode should use the `-Z threads=8`
option.
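
The option is a rustc flag, so when building with Cargo one way to pass it
(assuming a rustup-managed nightly toolchain) is via `RUSTFLAGS`:

```console
RUSTFLAGS="-Z threads=8" cargo +nightly build
```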

This default may be surprising. Why parallelize the front-end and then run it
in single-threaded mode? The answer is simple: caution. This is a big change!
The parallel front-end has a lot of new code. Single-threaded mode exercises
most of the new code, but excludes the possibility of threading bugs such as
deadlocks that still occasionally affect multi-threaded mode. Parallel
execution is harder to get right than serial execution, even in Rust. For this
reason the parallel front-end also won't be shipped in Beta or Stable releases
for some time.

### Performance effects

When the parallel front-end is run in single-threaded mode, compilation times
are typically within 2% of the serial front-end, which should be barely
noticeable.

When the parallel front-end is run in multi-threaded mode with `-Z threads=8`,
our [measurements on real-world
code](https://github.com/rust-lang/compiler-team/issues/681) show that
compile times can be reduced by up to 50%, though the effects vary widely and
depend greatly on the characteristics of the code being compiled and the build
configuration. For example, dev builds are likely to see bigger improvements
than release builds because release builds usually spend more time doing
optimizations in the back-end. A small number of cases compile more slowly in
multi-threaded mode than in single-threaded mode.

We recommend eight threads because this is the configuration we have tested the
most and it is known to give good results. Values lower than eight will see
smaller benefits. Values greater than eight will give diminishing returns and
may even give worse performance.

If a 50% improvement seems low when going from one to eight threads, recall
from the explanation above that the front-end only accounts for part of compile
times, and the back-end is already parallel. In the profiles above, for
example, front-end time dropped from 10.2 to 5.9 seconds, but the total compile
time only dropped from roughly 16.4 to 11.2 seconds, a reduction of about a
third. You can't beat [Amdahl's
Law](https://en.wikipedia.org/wiki/Amdahl%27s_law).

Memory usage can increase significantly in multi-threaded mode. This is
unsurprising given that various tasks, each of which requires a certain amount
of memory, are now executing in parallel. We have seen increases of up to 35%.

### Correctness

Reliability in single-threaded mode should be high.

In multi-threaded mode there are some known bugs, including deadlocks. If
compilation hangs, you have probably hit one of them.

### Feedback

If you have any problems with the parallel front-end, please [file an issue
marked with the "WG-compiler-parallel"
label](https://github.com/rust-lang/rust/labels/WG-compiler-parallel).
That link also shows existing known problems.

For more general feedback, please start a discussion on the [wg-parallel-rustc
Zulip
channel](https://rust-lang.zulipchat.com/#narrow/stream/187679-t-compiler.2Fwg-parallel-rustc).
We are particularly interested to hear about the performance effects on the
code you care about.

## Future work

We are working to improve the performance of the parallel front-end. As the
profiles above showed, there is room to improve the utilization of the threads
in the front-end.

We are also working to iron out the remaining bugs in multi-threaded mode. We
aim to stabilize the `-Z threads` option and ship the parallel front-end in
multi-threaded mode in Stable releases in 2024.

## Acknowledgments

The parallel front-end has been under development for a long time. It was
created by [@Zoxc](https://github.com/Zoxc/), who also did most of the work for
several years. After a period of inactivity, the project was revived this year
by [@SparrowLii](https://github.com/sparrowlii/), who led the effort to get it
shipped. Other members of the Parallel Rustc Working Group have also been
involved with reviews and other activities. Many thanks to everyone involved.