---
layout: post
title: Faster compilation with the parallel front-end in nightly
author: Nicholas Nethercote
team: The Parallel Rustc Working Group <https://www.rust-lang.org/governance/teams/compiler#Parallel%20rustc%20working%20group>
---

The Rust compiler's front-end can now use parallel execution to significantly reduce compile times. To try it, run the nightly compiler with the `-Z threads=8` option. This feature is currently experimental, and we aim to ship it in the stable compiler in 2024.

Keep reading to learn why a parallel front-end is needed and how it works, or just skip ahead to the [How to use it](parallel-rustc.html#how-to-use-it) section.

## Compile times and parallelism

Rust compile times are a perennial concern. The [Compiler Performance Working Group](https://www.rust-lang.org/governance/teams/compiler#Compiler%20performance%20working%20group) has continually improved compiler performance for several years. For example, in the first 10 months of 2023, there were mean reductions in compile time of [13%](https://perf.rust-lang.org/compare.html?start=2023-01-01&end=2023-10-31&stat=wall-time&nonRelevant=true), in peak memory use of [15%](https://perf.rust-lang.org/compare.html?start=2023-01-01&end=2023-10-31&stat=max-rss&nonRelevant=true), and in binary size of [7%](https://perf.rust-lang.org/compare.html?start=2023-01-01&end=2023-10-31&stat=size%3Alinked_artifact&nonRelevant=true), as measured by our performance suite.

However, at this point the compiler has been heavily optimized and new improvements are hard to find. There is no low-hanging fruit remaining.

But there is one piece of large but high-hanging fruit: parallelism. Current Rust compiler users benefit from two kinds of parallelism, and the newly parallel front-end adds a third kind.

### Existing interprocess parallelism

When you compile a Rust program, Cargo launches multiple rustc processes, compiling multiple crates in parallel. This works well. Try compiling a large Rust program with the `-j1` flag to disable this parallelization and it will take a lot longer than normal.

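You can see the difference for yourself; here `-j` is Cargo's `--jobs` flag, and clean builds are used so the two runs do comparable work:

```
# Clean build with interprocess parallelism disabled: one rustc process at a time.
$ cargo clean && cargo build -j 1

# Clean build with the default level of parallelism: one job per logical CPU.
$ cargo clean && cargo build
```
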
You can visualise this parallelism if you build with Cargo's [`--timings`](https://doc.rust-lang.org/cargo/reference/timings.html) flag, which produces a chart showing how the crates are compiled. The following image shows the timeline when building [ripgrep](https://crates.io/crates/ripgrep) on a machine with 28 virtual cores.

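You can produce such a chart for any project you build; recent Cargo versions write an HTML report under `target/cargo-timings/` that you can open in a browser:

```
$ cargo build --timings
```
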
![`cargo build --timings` output when compiling ripgrep](../../../../images/inside-rust/2023-11-10-parallel-rustc/cargo-build-timings.png)

There are 60 horizontal lines, each one representing a distinct process. Their durations range from a fraction of a second to multiple seconds. Most of them are rustc, and the few orange ones are build scripts. The first twenty processes all start at the same time. This is possible because there are no dependencies between the relevant crates. But further down the graph, parallelism reduces as crate dependencies increase. Although the compiler can overlap compilation of dependent crates somewhat thanks to a feature called [pipelined compilation](https://github.com/rust-lang/rust/issues/60988), there is much less parallel execution happening towards the end of compilation, and this is typical for large Rust programs. Interprocess parallelism is not enough to take full advantage of many cores. For more speed, we need parallelism within each process.

### Existing intraprocess parallelism: the back-end

The compiler is split into two halves: the front-end and the back-end.

The front-end does many things, including parsing, type checking, and borrow checking. Until this week, it could not use parallel execution.

The back-end performs code generation. It generates code in chunks called "codegen units" and then LLVM processes these in parallel. This is a form of coarse-grained parallelism.

We can visualize the difference between the serial front-end and the parallel back-end. The following image shows the output of a profiler called [Samply](https://github.com/mstange/samply/) measuring rustc as it does a release build of the final crate in Cargo. The image is superimposed with markers that indicate front-end and back-end execution.

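If you want to capture a similar profile of your own builds, one possible approach (a rough sketch, not necessarily the setup used for the image below) is to find the rustc invocation that Cargo runs for the crate you care about and re-run it under samply:

```
# Install the profiler.
$ cargo install samply

# A clean verbose build prints the full rustc command lines Cargo uses.
$ cargo clean && cargo build --release --verbose

# Re-run the invocation of interest under the profiler.
$ samply record <paste the relevant rustc command line here>
```
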
![Samply output when compiling Cargo, serial](../../../../images/inside-rust/2023-11-10-parallel-rustc/samply-serial.png)

Each horizontal line represents a thread. The main thread is labelled "rustc" and is shown at the bottom. It is busy for most of the execution. The other 16 threads are LLVM threads, labelled "opt cgu.00" through to "opt cgu.15". There are 16 threads because 16 is the default number of codegen units for a release build.

There are several things worth noting.
- Front-end execution takes 10.2 seconds.
- Back-end execution takes 6.2 seconds, and the LLVM threads are running for 5.9 seconds of that.
- The parallel code generation is highly effective. Imagine if all those LLVM threads executed one after another!
- Even though there are 16 LLVM threads, at no point are all 16 executing at the same time, despite this being run on a machine with 28 cores. (The peak is 14 or 15.) This is because the main thread translates its internal code representation (MIR) to LLVM's code representation (LLVM IR) in serial. This takes a brief period for each codegen unit, and explains the staircase shape on the left-hand side of the code generation threads. There is some room for improvement here.
- The front-end is entirely serial. There is a lot of room for improvement here.

### New intraprocess parallelism: the front-end

The front-end is now capable of parallel execution. It uses [Rayon](https://crates.io/crates/rayon) to perform compilation tasks using fine-grained parallelism. Many data structures are synchronized by mutexes and read-write locks, atomic types are used where appropriate, and many front-end operations are made parallel. The addition of parallelism was done by modifying a relatively small number of key points in the code. The vast majority of the front-end code did not need to be changed.

When the parallel front-end is enabled and configured to use eight threads, we get the following Samply profile when compiling the same example as before.

![Samply output when compiling Cargo, parallel](../../../../images/inside-rust/2023-11-10-parallel-rustc/samply-parallel.png)

Again, there are several things worth noting.
- Front-end execution takes 5.9 seconds (down from 10.2 seconds).
- Back-end execution takes 5.3 seconds (down from 6.2 seconds), and the LLVM threads are running for 4.9 seconds of that (down from 5.9 seconds).
- There are seven additional threads labelled "rustc" operating in the front-end. The reduced front-end time shows they are reasonably effective, but the thread utilization is patchy, with the eight threads all having periods of inactivity. There is room for significant improvement here.
- Eight of the LLVM threads start at the same time. This is because the eight "rustc" threads create the LLVM IR for eight codegen units in parallel. (For seven of those threads that is the only work they do in the back-end.) After that, the staircase effect returns because only one "rustc" thread does LLVM IR generation while seven or more LLVM threads are active. If the number of threads used by the front-end were changed to 16, the staircase shape would disappear entirely, though in this case the final execution time would barely change.

### Putting it all together

Rust compilation has long benefited from interprocess parallelism, via Cargo, and from intraprocess parallelism in the back-end. It can now also benefit from intraprocess parallelism in the front-end.

You might wonder how interprocess parallelism and intraprocess parallelism interact. If we have 20 parallel rustc invocations and each one can have up to 16 threads running, could we end up with hundreds of threads on a machine with only tens of cores, resulting in inefficient execution as the OS tries its best to schedule them?

Fortunately, no. The compiler uses the [jobserver protocol](https://www.gnu.org/software/make/manual/html_node/POSIX-Jobserver.html) to limit the number of threads it creates. If a lot of interprocess parallelism is occurring, intraprocess parallelism will be limited appropriately, and the number of threads will not exceed the number of cores.

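Roughly speaking, Cargo's `-j`/`--jobs` setting (which defaults to the number of cores) determines how many jobserver tokens are available, and rustc's threads draw from that shared pool. So an invocation like the following should stay close to four units of parallelism overall, even though each rustc process is allowed up to eight front-end threads:

```
$ RUSTFLAGS="-Z threads=8" cargo +nightly build -j 4
```
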
## How to use it

The nightly compiler is now [shipping with the parallel front-end enabled](https://github.com/rust-lang/rust/pull/117435). However, **by default it runs in single-threaded mode** and won't reduce compile times.

Keen users can opt into multi-threaded mode with the `-Z threads` option. For example:

```
$ RUSTFLAGS="-Z threads=8" cargo build --release
```

Alternatively, to opt in from a [config.toml](https://doc.rust-lang.org/cargo/reference/config.html) file (for one or more projects), add these lines:

```
[build]
rustflags = ["-Z", "threads=8"]
```

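You can also pass the flag to a nightly `rustc` directly, which is handy for quick single-file experiments (`hello.rs` here is just a placeholder for your own file):

```
$ rustc +nightly -Z threads=8 hello.rs
```
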
It may be surprising that single-threaded mode is the default. Why parallelize the front-end and then run it in single-threaded mode? The answer is simple: caution. This is a big change! The parallel front-end has a lot of new code. Single-threaded mode exercises most of the new code, but excludes the possibility of threading bugs such as deadlocks that can affect multi-threaded mode. Even in Rust, parallel programs are harder to write correctly than serial programs. For this reason the parallel front-end also won't be shipped in beta or stable releases for some time.

### Performance effects

When the parallel front-end is run in single-threaded mode, compilation times are typically 0% to 2% slower than with the serial front-end. This should be barely noticeable.

When the parallel front-end is run in multi-threaded mode with `-Z threads=8`, our [measurements on real-world code](https://github.com/rust-lang/compiler-team/issues/681) show that compile times can be reduced by up to 50%, though the effects vary widely and depend on the characteristics of the code and its build configuration. For example, dev builds are likely to see bigger improvements than release builds because release builds usually spend more time doing optimizations in the back-end. A small number of cases compile more slowly in multi-threaded mode than single-threaded mode. These are mostly tiny programs that already compile quickly.

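If you want a rough idea of the effect on a project you care about, one simple approach is to time two clean builds, one with the flag and one without:

```
$ cargo clean && time cargo +nightly build
$ cargo clean && time RUSTFLAGS="-Z threads=8" cargo +nightly build
```

(`cargo build --timings` also works here and additionally gives a per-crate breakdown.)
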
We recommend eight threads because this is the configuration we have tested the most and it is known to give good results. Values lower than eight will see smaller benefits. Values greater than eight will give diminishing returns and may even give worse performance.

If a 50% improvement seems low when going from one to eight threads, recall from the explanation above that the front-end only accounts for part of compile times, and the back-end is already parallel. You can't beat [Amdahl's Law](https://en.wikipedia.org/wiki/Amdahl%27s_law).

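As a rough illustration using the serial Cargo profile above: the front-end accounted for 10.2 seconds of roughly 16.4 seconds of total execution. Even if those 10.2 seconds were spread perfectly across eight threads, the total would only drop to about 6.2 + 10.2/8 ≈ 7.5 seconds, a little better than a 2x speedup; in practice the front-end scales less than perfectly, so the overall improvement is smaller.
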
Memory usage can increase significantly in multi-threaded mode. We have seen increases of up to 35%. This is unsurprising given that various parts of compilation, each of which requires a certain amount of memory, are now executing in parallel.

### Correctness

Reliability in single-threaded mode should be high.

In multi-threaded mode there are some known bugs, including deadlocks. If compilation hangs, you have probably hit one of them.

### Feedback

If you have any problems with the parallel front-end, please [check the issues marked with the "WG-compiler-parallel" label](https://github.com/rust-lang/rust/labels/WG-compiler-parallel). If your problem does not match any of the existing issues, please file a new issue.

For more general feedback, please start a discussion on the [wg-parallel-rustc Zulip channel](https://rust-lang.zulipchat.com/#narrow/stream/187679-t-compiler.2Fwg-parallel-rustc). We are particularly interested to hear about the performance effects on the code you care about.

## Future work

We are working to improve the performance of the parallel front-end. As the graphs above showed, there is room to improve the utilization of the threads in the front-end. We are also ironing out the remaining bugs in multi-threaded mode.

We aim to stabilize the `-Z threads` option and ship the parallel front-end running by default in multi-threaded mode on stable releases in 2024.

## Acknowledgments

The parallel front-end has been under development for a long time. It was started by [@Zoxc](https://github.com/Zoxc/), who also did most of the work for several years. After a period of inactivity, the project was revived this year by [@SparrowLii](https://github.com/sparrowlii/), who led the effort to get it shipped. Other members of the Parallel Rustc Working Group have also been involved with reviews and other activities. Many thanks to everyone involved.