Tracking issue for speeding up rustc via its build configuration #103595

Open
15 of 26 tasks
nnethercote opened this issue Oct 26, 2022 · 28 comments
Labels
  • C-tracking-issue: an issue tracking the progress of sth. like the implementation of an RFC
  • T-compiler: relevant to the compiler team, which will review and decide on the PR/issue
  • WG-compiler-performance: Working group: Compiler Performance

Comments

@nnethercote
Contributor

nnethercote commented Oct 26, 2022

There are several ways to speed up rustc by changing its build configuration, without changing its code: using a single codegen unit (CGU), profile-guided optimization (PGO), link-time optimization (LTO), post-link optimization (via BOLT), and a better allocator (e.g. jemalloc or mimalloc).

This is a tracking issue for doing these for the most popular Tier 1 platforms: Linux64 (x86_64-unknown-linux-gnu), Win64 (x86_64-pc-windows-msvc), and Mac (x86_64-apple-darwin, and more recently aarch64-apple-darwin).

Items marked with [2022] are on the Compiler performance roadmap for 2022.

Single CGU

Benefits: rustc is faster, uses less memory, and has a smaller binary.
Costs: rustc takes longer to build.

PGO

Benefits: rustc is faster.
Costs: rustc takes longer to build.

Other PGO attempts:

LTO

Benefits: rustc is faster.
Costs: rustc takes longer to build.

This is all thin LTO, which gets most of the benefits of fat LTO with a much lower link-time cost.

Other LTO attempts:

BOLT

Benefits: rustc is faster.
Costs: rustc takes longer to build.

BOLT only works on ELF binaries, and is thus Linux-only.

Instruction set

Benefits: rustc is faster?
Costs: rustc won't run on old CPUs.

  • x86_64: Update to v2/v3/APX sometime in the future. So far, the perf. wins haven't been convincing enough to upgrade, because it would reduce compatibility with older CPUs. Some perf. results can be found here.

Linker

Benefits: rustc (linking) is faster.
Costs: hard to get working.

Better allocator

Benefits: rustc is faster.
Costs: rustc uses more memory?

  • Linux64: jemalloc, done some time ago.
  • Win64 [2022]
  • Mac: jemalloc, done some time ago.

Note: #92249 and #92317 tried using two different versions of mimalloc (one 1.7-based, one 2.0-based) instead of jemalloc, but in both cases the speed/memory tradeoff was deemed inferior (the max-rss regressions that were expected to be fixed in the 2.x series still exist as of 2.0.6; see #103944).

Note: we hook in the better allocator by simply overriding malloc/free, rather than using #[global_allocator]. See this Zulip thread for some discussion of the sub-optimality of this approach.
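For illustration, here is a minimal sketch of the #[global_allocator] route that the note above contrasts with. This is not what rustc ships; it assumes the third-party `tikv-jemallocator` crate as a dependency:

```rust
// Minimal sketch of the `#[global_allocator]` approach, NOT rustc's
// actual setup (rustc overrides malloc/free at link time instead).
// Assumes `tikv-jemallocator` is listed as a dependency in Cargo.toml.
use tikv_jemallocator::Jemalloc;

#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // Every Rust-level heap allocation in this binary now goes through
    // jemalloc.
    let v: Vec<u64> = (0..1024).collect();
    println!("sum = {}", v.iter().sum::<u64>());
}
```

One reason for the link-time override is that it also redirects allocations made from C/C++ code such as LLVM, whereas #[global_allocator] only covers allocations made through Rust's allocator APIs.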

About tracking issues

Tracking issues are used to record the overall progress of implementation.
They are also used as hubs connecting to other relevant issues, e.g., bugs or open design questions.
A tracking issue is however not meant for large scale discussion, questions, or bug reports about a feature.
Instead, open a dedicated issue for the specific matter and add the relevant feature gate label.

@nnethercote nnethercote added T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC WG-compiler-performance Working group: Compiler Performance labels Oct 26, 2022
@the8472
Member

the8472 commented Oct 26, 2022

Another thing to try, brought up on Zulip, is aligning the text segment to 2MB so it can be loaded in a way that makes transparent huge pages kick in. But I'm not 100% sure this even works yet: last time I checked, large page support in the page cache was still a WIP, and I haven't seen it in release notes.

@lqd
Member

lqd commented Oct 26, 2022

An update for windows:

@nnethercote
Contributor Author

Another thing to try that was brought up on zulip is aligning the text segment to 2MB so it can be loaded in a way that transparent huge pages kick in

Given that it's not even clear that this works, let's leave it off this issue for now.

@the8472
Member

the8472 commented Oct 27, 2022

It looks like file-backed huge pages are supported now. https://www.kernel.org/doc/html/latest/filesystems/proc.html#meminfo

FileHugePages: memory used for filesystem data (page cache) allocated with huge pages
FilePmdMapped: page cache mapped into userspace with huge pages

I think it was introduced with torvalds/linux@793917d, so it needs at least 5.18.

It also depends on filesystem support.
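For anyone who wants to check their own kernel, a quick sketch that pulls those two counters out of /proc/meminfo:

```rust
// Sketch: print the file-backed huge page counters quoted above.
// Linux-only; per the commit referenced above, executable mappings
// need at least kernel 5.18 to benefit.
use std::fs;

fn main() -> std::io::Result<()> {
    let meminfo = fs::read_to_string("/proc/meminfo")?;
    for line in meminfo.lines() {
        if line.starts_with("FileHugePages") || line.starts_with("FilePmdMapped") {
            println!("{line}");
        }
    }
    Ok(())
}
```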

@tschuett

I read, and actually checked, that rustc links dynamically against rustc_driver. The same article said we dlopen codegen backends. Isn't there a way to build a static binary for the common case: rustc + rustc_driver + LLVM?

@Kobzol
Contributor

Kobzol commented Oct 31, 2022

We could do that in theory, but I'm not sure it would be that useful. Static linking has some benefits, but mostly for tiny crates; the diminishing returns kick in early. rustc is basically a "one-liner" that calls into the entrypoint of rustc_driver, and that is now LTO-optimized.
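To make the "one-liner" point concrete, the rustc binary has roughly this shape (a sketch, not the exact source):

```rust
// Rough shape of the `rustc` binary: a thin stub over the dynamically
// linked rustc_driver library. Since nearly all of the compiler's code
// lives in librustc_driver, the LTO/PGO/BOLT work on that library
// already covers almost everything.
fn main() {
    rustc_driver::main();
}
```

So static linking would mostly save dynamic-loader overhead at startup, rather than enable much new cross-crate optimization beyond what the librustc_driver LTO already provides.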

@tschuett

I would enable LTO over the full binary. Startup time may also be better.

@Kobzol
Contributor

Kobzol commented Oct 31, 2022

Well, the "full binary" is one function call into rustc_driver :) But it's true that I haven't tried benchmarking static linking on top of the current LTO-optimized librustc_driver. I'll try it, to see whether static linking + LTO could provide substantial benefits.

Shipping a statically linked rustc would probably only be possible on some OSes (Linux), and it would increase the size of the distributed artifacts by a nontrivial amount.

@tschuett

My LLVM folder is full of large statically linked LTO'd binaries (OSX).

Could you also statically link in the LLVM backend?

@Kobzol
Contributor

Kobzol commented Oct 31, 2022

In theory, yes. In practice, I'm not sure if our current build system supports it (will check).

@tschuett

No worries.

-rwxr-xr-x 1 xxx staff 129M Jul 28 18:44 clang-15

It links only against system libraries, not against any LLVM libraries.

@Elabajaba

I'm not sure if it helps, but on Windows LLVM can be built with a different allocator using the -DLLVM_INTEGRATED_CRT_ALLOC=path/to/allocator flag (it supports rpmalloc, mimalloc, and snmalloc). I tried this a few months ago: the LLVM builds themselves seemed to work fine, but rustc_llvm failed to build against a version of LLVM built with mimalloc or snmalloc (I didn't end up testing rpmalloc at the time). The errors were:

  • for snmalloc: error: renaming of the library 'INCLUDE' was specified, however this crate contains no '#[link(...)]' attributes referencing this library
  • for mimalloc: the same, except it was 'F' instead of 'INCLUDE'

The LLVM PR (https://reviews.llvm.org/D71786) also showed some pretty major performance gains when it landed.

@jyn514
Member

jyn514 commented Feb 3, 2023

@michaelwoerister also suggested in #49180 that we could set codegen-units=1 for the compiler (we already do that for std).

@the8472
Member

the8472 commented Dec 13, 2023

x86_64: Update to v2/v3/APX sometime in the future. So far, the perf. wins haven't been convincing enough to upgrade, because it will reduce compatibility for older CPUs.

Note that there's a bit of a catch-22. If std were built with a higher baseline, we could add specialized SIMD impls for some important core routines, which would increase the performance delta. But as long as such builds don't exist, it's hardly worth it, because it would only benefit users of -Zbuild-std.

Do any of the ARM targets offer a baseline that's high enough to include fancy SIMD features? Maybe some generic simdification work on those could show the potential benefits that would be unlocked by compiling with a higher baseline.

Or maybe an AVX2 codepath could be added to hashbrown, since that can benefit some users even without build-std. @Amanieu, have there been any experiments in that direction?
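To illustrate the trade-off, here's a hedged sketch of what such an AVX2 codepath looks like under today's v1 baseline: runtime dispatch that a v3 baseline would let the compiler resolve statically (function names are illustrative, not hashbrown's actual API):

```rust
#[cfg(target_arch = "x86_64")]
fn sum_bytes(xs: &[u8]) -> u64 {
    // Under the current v1 baseline the AVX2 path needs a runtime
    // check; with a v3 baseline this dispatch could compile away.
    if is_x86_feature_detected!("avx2") {
        // SAFETY: the runtime check above guarantees AVX2 is available.
        unsafe { sum_bytes_avx2(xs) }
    } else {
        xs.iter().map(|&b| u64::from(b)).sum()
    }
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_bytes_avx2(xs: &[u8]) -> u64 {
    // Illustrative body: #[target_feature] allows LLVM to vectorize
    // this loop with AVX2; a real implementation would use
    // core::arch::x86_64 intrinsics.
    xs.iter().map(|&b| u64::from(b)).sum()
}

fn main() {
    #[cfg(target_arch = "x86_64")]
    println!("{}", sum_bytes(b"hello"));
}
```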

@Mark-Simulacrum
Member

Apple aarch64 should have a very modern baseline.

@Kobzol
Contributor

Kobzol commented Dec 13, 2023

Note that there's a bit of a catch-22. We could start adding specialized SIMD impls for some important core routines if std were built with a higher baseline, which would increase the performance delta

That's a very good point, I agree with that!

I think we could start with the compiler (and its stdlib) to potentially make it faster, while still keeping the actual Linux x64 target without v2/v3/v4 CPU features by default.

Do you have any ideas where we could start using x86 v2 CPU features?

@the8472
Member

the8472 commented Dec 13, 2023

I think that we could start with the compiler (and its stdlib),

Doesn't the compiler link the same stdlib it uses to build programs?

Do you have any ideas where we could start using x86 v2 CPU features?

Adopting the UTF-8 validation impl from simdutf8. Other than that, it'll need some exploration; maybe the stdsimd folks have some ideas on tap.
There's always a gap between "if I had all the most recent CPU features I could..." and more modest optimizations 😅

Maybe the rustc hash could be replaced with a different mixing function? ... Odd, I can't find the feature level of the PCLMULQDQ instruction.
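For reference, the per-word mixing step of the FxHash rustc currently uses looks roughly like this (rotation and constant as in the rustc-hash crate; a sketch, not the exact source). For what it's worth, PCLMULQDQ doesn't appear in any of the v2/v3/v4 level definitions, which may be why its level is hard to pin down.

```rust
// Approximate core of FxHash's per-word mixing step (see the
// rustc-hash crate): rotate, xor in the new word, multiply by a
// constant. Fast, but not a strong mixer -- hence the question above.
const SEED: u64 = 0x51_7c_c1_b7_27_22_0a_95;

fn fx_mix(hash: u64, word: u64) -> u64 {
    (hash.rotate_left(5) ^ word).wrapping_mul(SEED)
}

fn main() {
    // Fold a few words into a hash state.
    let h = [1u64, 2, 3].iter().fold(0u64, |h, &w| fx_mix(h, w));
    println!("{h:#x}");
}
```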

@Kobzol
Contributor

Kobzol commented Dec 13, 2023

Doesn't the compiler link the same stdlib it uses to build programs?

IIRC it doesn't; it has its own copy. Conceptually, it needs to be able to add any (target) stdlib to the resulting program in order to cross-compile. But I might be wrong.

@tschuett

The targets on the Platform Support page look like legit target triples. Could you encode v4 in the triple and ship two Linux versions for x86?

@Kobzol
Contributor

Kobzol commented Dec 14, 2023

In theory yes, but I'm not sure it's the best solution. Maybe we could bump the default target to v2/v3 and keep an unoptimized v1 target for people with old CPUs. In any case, maintaining a target is not free, so maybe there are better solutions.

To clarify, it's a very different thing to ship a v2 compiler and to make the x86 Linux target v2 by default. We're really only considering the first thing for now.

@tschuett

Shipping a highly tuned v4 compiler with -mtune=icelake should give some speedup, and would let you use current CPU features: AVX512, PCLMULQDQ, ..

Haswell was launched June 4, 2013.

x86-64: CMOV, CMPXCHG8B, FPU, FXSR, MMX, OSFXSR, SCE, SSE, SSE2
x86-64-v2: (close to Nehalem) CMPXCHG16B, LAHF-SAHF, POPCNT, SSE3, SSE4.1, SSE4.2, SSSE3
x86-64-v3: (close to Haswell) AVX, AVX2, BMI1, BMI2, F16C, FMA, LZCNT, MOVBE, XSAVE
x86-64-v4: AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL
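As a sketch of how these levels surface to code: any feature in the compile-time baseline (e.g. from building with -C target-cpu=x86-64-v3) is visible via cfg(target_feature), with no runtime check needed:

```rust
// Sketch: compile-time branching on the baseline this crate was built
// with (e.g. RUSTFLAGS="-C target-cpu=x86-64-v3"). AVX2 is in the
// x86-64-v3 feature set.
#[cfg(target_feature = "avx2")]
fn baseline() -> &'static str {
    "built with at least x86-64-v3 (AVX2 in the baseline)"
}

#[cfg(not(target_feature = "avx2"))]
fn baseline() -> &'static str {
    "built with a pre-v3 baseline"
}

fn main() {
    println!("{}", baseline());
}
```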

@Kobzol
Contributor

Kobzol commented Dec 14, 2023

Yes, our measurements show that v3 produces a ~1-3% speedup for the compiler. But on its own that hasn't been worth it so far, because there are non-trivial maintenance costs, plus we would drop some existing users. We'll need to tread carefully.

@the8472
Member

the8472 commented Dec 14, 2023

AVX-512 is not really viable for broad use: AMD only started shipping it recently, and Intel has only shipped it in some market segments (workstation/server chips) and even disabled it on some recent chips due to inconsistencies between P and E cores.
AVX2 has a much larger user base.

See https://store.steampowered.com/hwsurvey/Steam-Hardware-Software-Survey-Welcome-to-Steam?platform=linux
99% for SSE42 (~v2)
93% for AVX2 (~v3)
6.5% for AVX512 (~v4)

bors added a commit to rust-lang-ci/rust that referenced this issue Mar 11, 2024
Build `rustc` with 1CGU on `x86_64-pc-windows-msvc`

Distribute `x86_64-pc-windows-msvc` artifacts built with `rust.codegen-units=1`, like we already do on Linux.

1) effect on code size on `x86_64-pc-windows-msvc`: it's a 3.67% reduction on `rustc_driver.dll`
- before, [`41d97c8a5dea2731b0e56fe97cd7cb79e21cff79`](https://ci-artifacts.rust-lang.org/rustc-builds/41d97c8a5dea2731b0e56fe97cd7cb79e21cff79/rustc-nightly-x86_64-pc-windows-msvc.tar.xz): 137605632
- after, [`704aaa875e4acccc973cbe4579e66afbac425691`](https://ci-artifacts.rust-lang.org/rustc-builds/704aaa875e4acccc973cbe4579e66afbac425691/rustc-nightly-x86_64-pc-windows-msvc.tar.xz): 132551680

2) time it took on CI
- the [first `try` build](https://github.com/rust-lang-ci/rust/actions/runs/8155647651/job/22291592507) took: 1h 31m
- the [second `try` build](https://github.com/rust-lang-ci/rust/actions/runs/8157043594/job/22295790552) took: 1h 32m

3) most recent perf results:
- on a slightly noisy desktop [here](rust-lang#112267 (comment))
- ChrisDenton's results [here](rust-lang#112267 (comment))

Related tracking issue for build configuration: rust-lang#103595
bors added a commit to rust-lang-ci/rust that referenced this issue Mar 11, 2024
Build `rustc` with 1CGU on `x86_64-apple-darwin`

Distribute `x86_64-apple-darwin` artifacts built with `rust.codegen-units=1`, like we already do on Linux.

1) effect on code size on `x86_64-apple-darwin`: it's an 11.14% reduction on `librustc_driver.dylib`
- before, [`41d97c8a5dea2731b0e56fe97cd7cb79e21cff79`](https://ci-artifacts.rust-lang.org/rustc-builds/41d97c8a5dea2731b0e56fe97cd7cb79e21cff79/rustc-nightly-x86_64-apple-darwin.tar.xz): 161232048
- after, [`7549dbdc09f0c4f6cc84002ac03081828054784b`](https://ci-artifacts.rust-lang.org/rustc-builds/7549dbdc09f0c4f6cc84002ac03081828054784b/rustc-nightly-x86_64-apple-darwin.tar.xz): 143256928

2) time it took on CI:
- the [first `try` build](https://github.com/rust-lang-ci/rust/actions/runs/8155512915/job/22291187124) took: 1h 33m
- the [second `try` build](https://github.com/rust-lang-ci/rust/actions/runs/8157057880/job/22295839911) took: 1h 45m

3) most recent perf results on (a noisy) x64 mac are [here](rust-lang#112268 (comment)).

Related tracking issue for build configuration: rust-lang#103595
@klensy
Contributor

klensy commented Nov 10, 2024

Attempted to tune the ColdFuncOpt option for PGO (llvm/llvm-project#69030) in #132779 to decrease file size, but sadly, no good results. Possible reasons: the current Linux dist uses BOLT on top of PGO, which increases size; and I forgot to apply the same optimization to libllvm (it was applied only to rustc_driver).

@davidhewitt
Contributor

With the recent-ish promotion of aarch64-apple-darwin to tier 1, and all recent Apple hardware shipping with that architecture, will this issue be updated to add (or replace) it as a Mac target for optimization? It's not clear to me what the optimization status of that toolchain is, and this seemed the best place to find out.

@nnethercote
Contributor Author

@Kobzol, @lqd: does Mac/ARM64 get all the same treatment as Mac/Intel?

@lqd
Member

lqd commented Nov 30, 2024

It has LTO and jemalloc, but not the 1CGU config. It's not going to be earth-shattering or anything, but I'll run some local tests and open a PR for try builds soon, to see the potential improvements there.

@lqd
Member

lqd commented Dec 2, 2024

I’ll make some local tests and a PR for try builds soon

done in #133747 -- I've updated the OP now that it has merged.
