Commit 6773597: Add parallel-rustc blog post.
1 parent 2340640 commit 6773597

4 files changed: +239, -0 lines changed

---
layout: post
title: Faster compilation with the parallel front-end in nightly
author: Nicholas Nethercote
team: The Parallel Rustc Working Group <https://www.rust-lang.org/governance/teams/compiler#Parallel%20rustc%20working%20group>
---

The Rust compiler's front-end can now use parallel execution to significantly
reduce compile times. To try it out, run the nightly compiler with the `-Z
threads=8` command line option.

Keep reading to learn why a parallel front-end is needed and how it works, or
just skip ahead to the [How to use it](parallel-rustc.html#how-to-use-it)
section.

## Compile times and parallelism

Rust compile times are a perennial concern. The [Compiler Performance Working
Group](https://www.rust-lang.org/governance/teams/compiler#Compiler%20performance%20working%20group)
has been consistently improving compiler performance for several years. For
example, in the first 10 months of 2023, we achieved mean reductions in compile
time of
[13%](https://perf.rust-lang.org/compare.html?start=2023-01-01&end=2023-10-31&stat=wall-time&nonRelevant=true),
in peak memory use of
[15%](https://perf.rust-lang.org/compare.html?start=2023-01-01&end=2023-10-31&stat=max-rss&nonRelevant=true),
and in binary size of
[7%](https://perf.rust-lang.org/compare.html?start=2023-01-01&end=2023-10-31&stat=size%3Alinked_artifact&nonRelevant=true),
as measured by our performance suite.

Unfortunately, at this point the compiler has been heavily optimized, and there
is no low-hanging fruit left. New improvements are hard to find.

But there is one piece of large but high-hanging fruit: parallelism. Current
Rust compiler users benefit from two kinds of parallelism, and the newly
parallel front-end adds a third kind.

### Existing interprocess parallelism

When you compile a Rust program, Cargo launches multiple rustc processes,
compiling multiple crates in parallel. This works well. Try compiling a large
Rust program with the `-j1` flag to disable this parallelization and it will
take a lot longer than normal.
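
For example, assuming you have a reasonably large Cargo workspace checked out
(these commands are just an illustration; the `cargo clean` steps make the two
builds do comparable work), you can compare the default behaviour with a
single-process build:

```sh
# Default behaviour: Cargo runs as many rustc processes in parallel as it can.
cargo clean
cargo build --release

# Restrict Cargo to a single rustc process at a time and compare the wall time.
cargo clean
cargo build --release -j1
```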

You can visualize this parallelism if you build with Cargo's
[`--timings`](https://doc.rust-lang.org/cargo/reference/timings.html) flag,
which produces a chart showing how the crates were compiled. The following
image shows the timeline when building
[ripgrep](https://crates.io/crates/ripgrep) on a machine with 28 virtual cores.

![`cargo build --timings` output when compiling ripgrep](../../../../images/inside-rust/2023-11-10-parallel-rustc/cargo-build-timings.png)

There are 60 horizontal lines, each one representing a distinct process. Some
take a tiny fraction of a second, and some take multiple seconds. Most of them
are rustc, and the few orange ones are build scripts. The first twenty run in
parallel. This is possible because there are no dependencies between the
relevant crates. But as we get further down the graph, parallelism reduces as
crate dependencies increase. Although the compiler can overlap compilation of
dependent crates somewhat thanks to a feature called [pipelined
compilation](https://github.com/rust-lang/rust/issues/60988), there is much
less parallel execution happening towards the end of compilation, and this is
typical for large Rust programs. Interprocess parallelism will only take us
some of the way. For more speed, we need parallelism within each process.

### Existing intraprocess parallelism: the back-end

The compiler is split into two halves: the front-end and the back-end.

The front-end parses code, does type checking and borrow checking, and various
other things. It currently does not use any parallelism.

The back-end performs code generation. It generates code in chunks called
"codegen units" and then LLVM processes these in parallel. This is a form of
coarse-grained parallelism.

We can visualize the difference with a profiler. The following image shows the
output of a profiler called [Samply](https://github.com/mstange/samply/)
measuring rustc doing a release build of the final crate in Cargo. The
image is superimposed with markers that indicate front-end and back-end
execution.

![Samply output when compiling Cargo, serial](../../../../images/inside-rust/2023-11-10-parallel-rustc/samply-serial.png)

Each horizontal line represents a thread. The main thread is labelled "rustc"
and is shown at the bottom. It is busy for most of the execution. The other 16
threads are LLVM code generation threads, labelled "opt cgu.00" through to "opt
cgu.15". There are 16 threads because 16 is the default number of codegen units
for a release build.

There are several things worth noting.
- Front-end execution takes 10.2 seconds.
- Back-end execution takes 6.2 seconds, and the LLVM threads are running
  for 5.9 seconds of that.
- The parallel code generation is highly effective. Imagine if all those code
  generation threads executed one after another!
- Even though there are 16 code generation threads, at no point are all 16
  executing at the same time, despite this being run on a machine with 28
  cores. (The peak is 14 or 15.) This is because the main thread translates
  its internal code representation (MIR) to LLVM's code representation (LLVM
  IR) in serial. This takes a brief period for each codegen unit, and explains
  the staircase shape on the left-hand side of the code generation threads.
  There is some room for improvement here.
- The front-end is entirely serial. There is a lot of room for improvement
  here.

### New intraprocess parallelism: the front-end

The front-end is now capable of parallel execution. It uses
[Rayon](https://crates.io/crates/rayon) to perform compilation tasks using
fine-grained parallelism. Many data structures are synchronized by mutexes and
read-write locks, atomic types are used where appropriate, and many front-end
operations are made parallel. The addition of parallelism was done by modifying
a relatively small number of key points in the code. The vast majority of the
front-end code did not need to be changed.
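
To give a sense of this style of fine-grained parallelism, here is a minimal,
self-contained sketch of the pattern. It is not rustc's code: the items, the
`check_item` function, and the shared result table are invented for
illustration. It uses Rayon to process work items on a thread pool while a
mutex and an atomic counter keep the shared state consistent.

```rust
use std::sync::Mutex;
use std::sync::atomic::{AtomicUsize, Ordering};

use rayon::prelude::*; // requires the `rayon` crate

// A stand-in for some per-item front-end task.
fn check_item(item: &str) -> Result<usize, String> {
    if item.is_empty() {
        Err("empty item".to_string())
    } else {
        Ok(item.len())
    }
}

fn main() {
    let items = vec!["foo", "bar", "", "baz"];
    let results = Mutex::new(Vec::new()); // shared table, guarded by a mutex
    let errors = AtomicUsize::new(0);     // shared counter, an atomic

    // Rayon spreads the iterations across its thread pool; the mutex and the
    // atomic are what make the concurrent updates safe.
    items.par_iter().for_each(|item| match check_item(item) {
        Ok(len) => results.lock().unwrap().push((item.to_string(), len)),
        Err(_) => {
            errors.fetch_add(1, Ordering::Relaxed);
        }
    });

    println!(
        "{} items checked, {} errors",
        results.lock().unwrap().len(),
        errors.load(Ordering::Relaxed)
    );
}
```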

When the parallel front-end is enabled and configured to use eight threads, we
get the following Samply profile when compiling the same example as before.

![Samply output when compiling Cargo, parallel](../../../../images/inside-rust/2023-11-10-parallel-rustc/samply-parallel.png)

Again, there are several things worth noting.
- Front-end execution takes 5.9 seconds (down from 10.2 seconds).
- Back-end execution takes 5.3 seconds (down from 6.2 seconds), and the LLVM
  threads are running for 4.9 seconds of that (down from 5.9 seconds).
- There are seven additional threads labelled "rustc" operating in the
  front-end. The reduced front-end time shows they are reasonably effective,
  but the thread utilization is patchy, with the eight threads all having
  periods of inactivity. There is room for significant improvement here.
- Eight of the LLVM threads start at the same time. This is because the eight
  "rustc" threads create the LLVM IR for eight codegen units in parallel. (For
  seven of those threads that is the only work they do in the back-end.) After
  that, the staircase effect returns because only one "rustc" thread does LLVM
  IR generation while seven or more other LLVM threads are active. If the
  number of threads used by the front-end were changed to 16, the staircase
  shape would disappear entirely, though in this case the final execution time
  would barely change.

### Putting it all together

Rust compilation has long benefited from interprocess parallelism, via Cargo,
and from intraprocess parallelism in the back-end. It can now also benefit from
intraprocess parallelism in the front-end.

Attentive readers might be wondering how intraprocess parallelism and
interprocess parallelism interact. If we have 20 parallel rustc invocations and
each one can have up to 16 threads running, might we end up with 100s of
threads on a machine with only 10s of cores, resulting in inefficient execution
as the OS tries its best to schedule them?

Fortunately, no. The compiler uses the [jobserver
protocol](https://www.gnu.org/software/make/manual/html_node/POSIX-Jobserver.html)
to limit the number of threads it creates. If a lot of interprocess parallelism
is occurring, intraprocess parallelism will be limited appropriately, and
the number of threads will not exceed the number of cores.
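
The following is a minimal sketch of the idea behind that coordination. It is
not the jobserver protocol itself, nor rustc's implementation; the token
channel, the worker count, and the fake work are all invented for illustration.
A fixed pool of tokens, one per core, is shared by every would-be worker, and a
thread only starts real work once it has acquired a token.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let cores = 4; // pretend the machine has 4 cores

    // A channel primed with `cores` tokens acts as the shared pool.
    let (token_tx, token_rx) = mpsc::channel();
    for _ in 0..cores {
        token_tx.send(()).unwrap();
    }
    let token_rx = Arc::new(Mutex::new(token_rx));

    // Spawn more workers than cores; each must hold a token while it works,
    // so at most `cores` of them are doing real work at any one time.
    let handles: Vec<_> = (0..10)
        .map(|i| {
            let token_rx = Arc::clone(&token_rx);
            let token_tx = token_tx.clone();
            thread::spawn(move || {
                let _token = token_rx.lock().unwrap().recv().unwrap(); // acquire
                println!("worker {i} running");
                // ... the actual work would happen here ...
                token_tx.send(()).unwrap(); // release the token for someone else
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}
```

In the real system the token pool is shared between Cargo and the rustc
processes it launches, which is what keeps the total thread count in check.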

## How to use it

As of (XXX: date enabled), the nightly compiler is [shipped with the parallel
front-end enabled](https://github.com/rust-lang/rust/pull/117435). However,
**by default it runs in single-threaded mode** and won't reduce compile times.
Keen users who want to try multi-threaded mode should use the `-Z threads=8`
option.
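
For example, assuming a nightly toolchain is installed via rustup, you can opt
in for a single build by passing the flag through `RUSTFLAGS`, or by invoking
rustc directly:

```sh
# Make sure a recent nightly toolchain is installed.
rustup toolchain install nightly

# Build a Cargo project with the parallel front-end using eight threads.
RUSTFLAGS="-Z threads=8" cargo +nightly build

# Or pass the flag to rustc directly for a single file.
rustc +nightly -Z threads=8 main.rs
```

Note that changing `RUSTFLAGS` generally causes Cargo to rebuild everything, so
the first build after adding or removing the flag will recompile all crates.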

This default may be surprising. Why parallelize the front-end and then run it
in single-threaded mode? The answer is simple: caution. This is a big change!
The parallel front-end has a lot of new code. Single-threaded mode exercises
most of the new code, but excludes the possibility of threading bugs such as
deadlocks that still occasionally affect multi-threaded mode. Parallel
execution is harder to get right than serial execution, even in Rust. For this
reason the parallel front-end also won't be shipped in Beta or Stable releases
for some time.

### Performance effects

When the parallel front-end is run in single-threaded mode, compilation times
are typically within 2% of the serial front-end, which should be barely
noticeable.

When the parallel front-end is run in multi-threaded mode with `-Z threads=8`,
our [measurements on real-world
code](https://github.com/rust-lang/compiler-team/issues/681) show that
compile times can be reduced by up to 50%, though the effects vary widely and
depend greatly on the characteristics of the code being compiled and the build
configuration. For example, dev builds are likely to see bigger improvements
than release builds because release builds usually spend more time doing
optimizations in the back-end. A small number of cases compile more slowly in
multi-threaded mode than single-threaded mode.

We recommend eight threads because this is the configuration we have tested the
most and it is known to give good results. Values lower than eight will see
smaller benefits. Values greater than eight will give diminishing returns and
may even give worse performance.

If a 50% improvement seems low when going from one to eight threads, recall
from the explanation above that the front-end only accounts for part of compile
times, and the back-end is already parallel. You can't beat [Amdahl's
Law](https://en.wikipedia.org/wiki/Amdahl%27s_law).
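
As a rough illustration using the Cargo profiles above: the serial run spent
about 10.2 s in the front-end and 6.2 s in the back-end, roughly 16.4 s in
total, while the parallel run took about 5.9 s + 5.3 s ≈ 11.2 s. That is an
overall reduction of around 32%, even though the front-end itself sped up by
more than 40%, and even an instantaneous front-end would still leave all of the
back-end time.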

Memory usage can increase significantly in multi-threaded mode. This is
unsurprising given that various tasks, each of which requires a certain amount
of memory, are now executing in parallel. We have seen increases of up to 35%.

### Correctness

Reliability in single-threaded mode should be high.

In multi-threaded mode there are some known bugs, including deadlocks. If
compilation hangs, you have probably hit one of them.

### Feedback

If you have any problems with the parallel front-end, please [file an issue
marked with the "WG-compiler-parallel"
label](https://github.com/rust-lang/rust/labels/WG-compiler-parallel).
That link also shows existing known problems.

For more general feedback, please start a discussion on the [wg-parallel-rustc
Zulip
channel](https://rust-lang.zulipchat.com/#narrow/stream/187679-t-compiler.2Fwg-parallel-rustc).
We are particularly interested to hear about the performance effects on the
code you care about.

## Future work

We are working to improve the performance of the parallel front-end. As the
graphs above showed, there is room to improve the utilization of the threads in
the front-end.

We are also working to iron out the remaining bugs in multi-threaded mode. We
aim to stabilize the `-Z threads` option and ship the parallel front-end in
multi-threaded mode on Stable releases in 2024.

## Acknowledgments

The parallel front-end has been under development for a long time. It was
created by [@Zoxc](https://github.com/Zoxc/), who also did most of the work for
several years. After a period of inactivity, the project was revived this year
by [@SparrowLii](https://github.com/sparrowlii/), who led the effort to get it
shipped. Other members of the Parallel Rustc Working Group have also been
involved with reviews and other activities. Many thanks to everyone involved.