Exploring PGO for the Rust compiler #79442
Comments
One concern with the "best-effort" approach that I just became aware of is how it affects performance testing: let's say you want to optimize some expensive function …
This might be overcomplicating things, but what if the default was not to use PGO builds, and to enable them only for nightly/beta/stable? Then … The disadvantage is that nightlies will now require a second full build; it will no longer be possible to use the latest build artifacts from bors.
A question that I think is missing is how storing …
A variation on approach 3: Have stage1 gather PGO data while building stage2 for an auto-merge, then save that somewhere so it can be used during the next stage2 build of anything that has that merge as its nearest ancestor in the history.
@andjo403 While by no means a solution, git LFS may be helpful regarding the size of the repo.
That's a good point! Using Git LFS sounds a bit problematic to me because of its reliance on external storage. Maybe the data could be stored in a separate repository that gets pulled in as a submodule? Then one would not have to pull the entire thing.
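For what it's worth, a minimal sketch of the submodule idea (the repository URL and path are made up for illustration):

```sh
# Hypothetical setup: keep the profile data in its own repository and pull it
# into the main repo as a submodule.
git submodule add https://github.com/rust-lang/rustc-pgo-data src/pgo-data

# Contributors who need the data can then fetch it shallowly, without pulling
# the submodule's entire history.
git submodule update --init --depth 1 src/pgo-data
```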
There seems to be a text-based profile data format that looks pretty mergeable:

```
_ZNK4llvm20MemorySSAWrapperPass14verifyAnalysisEv
# Func Hash:
22759827559
# Num Counters:
2
# Counter Values:
0
0
_ZN4llvm9DIBuilder17createNullPtrTypeEv
# Func Hash:
12884901887
# Num Counters:
1
# Counter Values:
0
_ZN4llvm15SmallVectorImplINS_26AArch64GenRegisterBankInfo17PartialMappingIdxEE6appendIPKS2_vEEvT_S7_
# Func Hash:
37713126052
# Num Counters:
3
# Counter Values:
0
0
0
# Num Value Kinds:
1
# ValueKind = IPVK_MemOPSize:
1
# NumValueSites:
1
0
```

I don't know how well supported it is. Surprisingly it seems to be slightly more compact than the binary format (20 MB vs 23 MB and 57 MB vs 64 MB). It also compresses better than the binary format. But it would have to be stored in the repository uncompressed in order to be diffable, right? Or does git have any tricks up its sleeve that allow it to store compressed diffs?
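If it helps, here is a rough sketch of how such text profiles could be produced and inspected with `llvm-profdata` (file names are made up; `--text` selects the text output format instead of the indexed binary one):

```sh
# Merge raw profiles from one or more training runs; --text emits the
# mergeable text format shown above instead of the binary .profdata format.
llvm-profdata merge --text --output=rustc-pgo.proftext \
    run1.profraw run2.profraw

# Inspect individual entries (function hashes, counter values).
llvm-profdata show --all-functions --counts rustc-pgo.proftext | less
```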
A variation of this would be to build a non-PGOed baseline compiler just for perf.rlo runs. That could happen in parallel to building the modified compiler.
This slide deck from 2013 states the following design goals for LLVM's instrumentation based PGO support:
The presentation probably refers to front-end based instrumentation, since the IR-level instrumentation that …

This blog post also talks about function CFG hashes making sure that out-dated profile data is detected and ignored. Interestingly, it also mentions that the dotnet runtime uses a best-effort approach similar to the one described above.
I think it would be quite reasonable to store the PGO data files on S3 or something similar and just have the URL or something similar point to that in CI (or, optionally, local builds). I expect regardless of what we do we'll want it to be optional to enable it. I would not expect us to store them in git or similar because -- at least AFAIK -- inspecting changes to them isn't really feasible/desirable. We basically already do this for the bootstrap compiler (i.e., it's just downloaded by hash/version) and these artifacts would be no different.

One question I have @michaelwoerister is the extent to which we can profile-use artifacts built on different machines -- are there absolute path dependencies here? Do we need some special handling for this? In particular, I would love for local developers to be able to use the same artifacts CI did without too much hassle (i.e., not in docker but just building directly). It sounds like based on what you've said this should not be a problem but would be good to be certain here. (I guess this is part of "reproducible builds" -- do I get the same profiling information across different runs on the same workload? Or does e.g. ASLR make the profiles radically different?)

If the profiles are sufficiently opaque as to not care too much about the producing rustc's origins, one approach might be to use perf.rlo hardware exclusively to generate the instrumented rustc's and profile them. We already build rustc at each commit on perf.rust-lang.org in order to record the bootstrap compile times, and building it in an instrumented fashion would not be too hard, I suspect. Once we had that we could use it to gather profiling data (likely on the perf.rlo benchmarks) and feed that back into the next master commit. This would mean we're always off by one commit's worth of changes but I expect that to be a minor loss.

I think a great next step here would be to get some idea on:
Presuming the answer to these questions is "not much" (5% wall time is probably the limit on current perf hardware; but I imagine that getting better or more hardware would not be too hard if we needed to), then I think a good series of next steps would be:
When it comes to hosting the profile data in version control versus somewhere external, I think the main question to clarify is how (historically) reproducible we want PGO builds to be: If we store profile data in git we can go to any commit and get the exact same build because PGO data is guaranteed to be available. If we host the data externally we have less of a guarantee that the data will still be available after a few months or years. However, after you mentioned the bootstrap compiler also being stored externally, I now realize that we already have "critical" data stored outside of version control. So storing PGO data on S3 would not make things worse at least.
Yes, there are some absolute paths in the profile data. Some symbol names are augmented with the path of their originating source file -- this seems to be necessary for ThinLTO to work properly in all cases. I only discovered this recently. But there is good news:
Overall I think this problem is solvable.
You get the same profile data if (and only if) the workload is deterministic. If there is some source of randomness, like if pointers are being hashed or compared (even without ASLR), then profile data will change. However, if we just store the profile data somewhere, things should be deterministic -- which luckily also happens to be the better approach from a build times perspective.
Not much slower but noticeable, I think. I added that question to the TODO list in the OP.
Quite noticeable. I think a 20-30% slowdown should be expected.
I don't think instruction counts would get a lot noisier -- but maybe I am wrong. Instrumentation code has to access various runtime counters in memory all the time, which might mess with the cache. And it has to write all that data to disk, which might introduce noise too.

Overall I am skeptical about completely switching perf.rlo to using instrumented builds. On the plus side it would solve the unfairness problem mentioned above, and it would make setting this up easier. But I'm a bit worried that it might skew the performance data too much. One thing to consider here is that the accuracy of instrumentation-based profile data collection is quite independent of the underlying hardware, since it works by just counting how many times each branch is taken. So it can be moved to a slow machine without problem and, more importantly, it can be executed on machines with inconsistent performance characteristics (like in a VPS).

I'm also confident that the entire perf.rlo benchmark suite is way too big and that we could get the same profile data quality with something that has 10% of the runtime. So I currently tend to think that we would be better off running data collection separately somewhere, although it can still be based on the perf.rlo framework (running in a special mode) if that makes things easier.
Is that the same compiler that is then used to run the benchmarks? I assumed that it would be much better from a maintainability standpoint to add a couple more "regular" docker-based builds for providing the instrumented compiler (one for Unix, one for Windows). They could even do the data collection right after building (because we don't need to care about hardware performance consistency). The fairness problem mentioned above could also be solved by always using a non-PGOed compiler for perf.rlo benchmarking. In the worst case this would mean a single additional x86-64 Linux dist build, right?
@Mark-Simulacrum I think the first point in your list of action items (adding PGO support to rustbuild) makes sense, regardless of how we proceed exactly. I opened #79562 for discussing that in detail.
OK, so it sounds like using perf.rlo to collect data is likely not a good fit: it's not really needed, since we expect the data to be about as deterministic as what we'd get from running the collection in CI, and it would unacceptably slow down builds.
No, perf.rlo doesn't use the compiler it builds to run benchmarks today.

The unfairness problem indeed seems hard to tackle. I was initially thinking that it wouldn't be that big a deal, but I think the most unfortunate element is that we'd presumably begin to "expect" regressions from changing hot code (since it would lose PGO benefits), and that seems pretty bad. I think using non-PGO builds on perf for now is probably the way to go; we should be able to afford a single perf builder. It'll also be good to have something to compare against in case any weird bugs show up later on, to make sure it's not PGO being buggy.

That said, if we go with the off-by-one approach to data collection, the unfairness problem will spread to nightlies too: if a patch changing hot code lands and ships in a nightly, then that patch will plausibly be a regression to nightly performance. On beta and stable we probably won't see that as much (we can land dummy README-changing patches or something before release). I'm not sure if we should try to mitigate that somehow. Maybe in practice the effects of PGO on even very hot code are minor enough that this is all worrying over nothing.

So maybe it's worth taking a look at doing PGO within a single build cycle (i.e., we build a compiler, collect data, and then build another compiler) in CI. If that's feasible then it removes the unfairness problem and is all around better, I suspect. I think it makes sense to wait until we have support in rustbuild for doing this and then see how much we can fit into e.g. x86_64-linux builders to start: if we can pull off a full PGO cycle, great; if not, we can start taking a look at other options (for example, only doing "perfect" PGO on beta/stable across several CI cycles, and on nightly just using the beta/stable PGO data).
👍
Yes -- I think that would be acceptable though. The unfairness problem is more of an issue for performance measurement where you want to have accurate numbers about a small change. For real-world compile times I don't think it would be noticeable. And for stable and beta you can get rid of the problem "manually" by doing an empty commit so that PGO data effectively can catch up with the actual code.
My estimate is that that would be intolerably slow.
I think so too.
FYI I looked into this a while back and there just isn't any straightforward way to make the workload deterministic. For Firefox builds, we settled on being comfortable with publishing the profile data and making sure that the optimized build step was deterministic given that same input. That means that anyone ought to be able to reproduce the Firefox builds we publish given the same source + profiling data we publish, which seems like a reasonable compromise.

We split the build into three separate tasks: the instrumented build, the profile collection, and the optimized build. This also helped us enable PGO for cross-compiled builds like the macOS build on Linux. If you're going to have a fixed set of profile data that gets updated periodically then that simplifies things further. A lot of the Firefox build choices were made prior to switching all the builds to clang, so some of these things that are possible with LLVM PGO were not possible with MSVC/GCC PGO.
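For illustration, a rough sketch of what those three tasks boil down to in terms of the generic rustc/LLVM PGO machinery (crate name, paths, and the training workload are placeholders; in practice rustbuild would orchestrate this):

```sh
# 1) Instrumented build: compile with profile instrumentation enabled.
rustc -O -Cprofile-generate=/tmp/pgo-data my_program.rs

# 2) Profile collection: run a representative workload (each run drops a
#    .profraw file into the directory above), then merge the raw profiles.
./my_program typical-input-1
./my_program typical-input-2
llvm-profdata merge --output=/tmp/pgo-data/merged.profdata /tmp/pgo-data

# 3) Optimized build: recompile, feeding the merged profile back in.
rustc -O -Cprofile-use=/tmp/pgo-data/merged.profdata my_program.rs
```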
Yes, I think that is the most promising approach and it works well with re-using profile data generated on other machines/platforms.
#80262 added PGO support for the Rust part of the Linux x64 dist builds, and perf.rlo shows the expected speedups for check builds and other test cases that don't invoke LLVM 🎉 I think this is confirmation enough that the results from my blog post can indeed be extrapolated to other systems too.
What needs to be done to allow Windows builds to benefit from PGO?
Is PGO on ice for non-Linux x64 builds?
I noticed that PGO fails in the final step when LTO is enabled for the builds. Not sure why this happens, but I get a …
LLVM 14 (with in-tree support for BOLT) is nearing its release. I'll try to use BOLT to optimize LLVM (just LLVM, not …
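For reference, a rough sketch of what such a BOLT pass over `libLLVM.so` might look like (paths, the training workload, and the exact optimization flags are placeholders and would need tuning):

```sh
# Record branch samples while a representative workload exercises libLLVM.so.
perf record -e cycles:u -j any,u -o perf.data -- \
    rustc -O some_training_crate.rs

# Convert the perf samples into BOLT's profile format for the target binary.
perf2bolt -p perf.data -o perf.fdata /path/to/libLLVM.so

# Rewrite the binary with an optimized code layout based on that profile.
llvm-bolt /path/to/libLLVM.so -o /path/to/libLLVM-bolt.so \
    -data=perf.fdata -reorder-blocks=ext-tsp -reorder-functions=hfsort
```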
@Kobzol what's the current status of PGO? We have it enabled on all nightly builds, right? Can we close this issue now?
It's enabled for x64 Linux and Windows, but not yet for macOS.
This issue is a landing place for discussion of whether and how to apply profile-guided optimization to `rustc`. There is some preliminary investigation of the topic in the Exploring PGO for the Rust compiler post on the Inside Rust blog. The gist of it is that the performance gains offered by PGO look very promising, but we need to confirm the results and find a feasible way of using PGO for rustc.

Let's start with the first point.
Confirming the results
The blog post contains a step-by-step description of how to obtain a PGOed compiler -- but it is rather time consuming to actually do that. In order to make things easier I could provide a branch of the compiler that has all the changes already applied and, more importantly, a pre-recorded, checked-in `.profdata` file for both LLVM and rustc. Alternatively, I could just put up the final toolchain for download somewhere. Even better would be to make it available via rustup somehow. Please comment below on how best to approach this.

Reasons not to do PGO?
Concerns raised so far are:
- This makes `rustc` builds non-reproducible -- something which I don't think is true. With a fixed `.profdata` file, both rustc and Clang should always generate the same output. That is, `-Cprofile-use` and `-fprofile-use` do not introduce any source of randomness, as far as I can tell. So if the `.profdata` file being used is tracked by version control, we should be fine. It would be good to get some kind of further confirmation of that, though.
- If we apply PGO just to stable and beta releases, we don't get enough testing for PGO-specific toolchain bugs.
- It is too much effort to continuously monitor the effect of PGO (e.g. via perf.rlo) because we would need PGOed nightlies in addition to non-PGOed nightlies (the latter of which serve as a baseline).
- Doing PGO might be risky in that it adds another opportunity for LLVM bugs to introduce miscompilations.
- It makes CI more complicated.
- It increases cycle times for the compiler.
The last two points can definitely be true. Finding out whether they have to be is the point of the next section:
Find a feasible way of using PGO for rustc
There are several ways we can bring PGO to rustc:

1. Easy DIY PGO via rustbuild.
2. PGO for beta and stable releases only.
3. A "best-effort" approach, where a checked-in `.profdata` file is always used, even if it is slightly out of date.

Let's go through the points in more detail:
Easy DIY PGO via rustbuild - I think we should definitely do this. There is quite a bit of design space on how to structure the concrete build options (@luser has posted some relevant thoughts in a related topic). But overall it should not be too much work, and since it is completely opt-in, there's also little risk involved. In addition, it is also a necessary intermediate step for the other two options.
PGO for beta and stable releases only - The feasibility of option (2) depends on a few things:
Is it acceptable from a testing point of view to build stable and beta artifacts with different settings than regular CI builds? Arguably beta releases get quite a bit of testing because they are used for building the compiler itself. On the other hand, building the compiler is a quite sensitive task.
Is it technically actually possible to do the long, three-phase compilation process on CI, or would we run into time limits set by the infrastructure? We might be more flexible in this respect now than we have been in the past.
How do we handle cross-compiled toolchains where profile data collection and compilation cannot run on the same system? A simple answer there is: don't do PGO for these targets. A possible better answer is to use profiling data collected on another system. This is even more relevant for the "best-effort" approach as described below.
Personally I'm on the fence about whether I find this approach acceptable or not -- especially given that there is a third option that is potentially quite a bit better.
The "best-effort" approach - Every function entry in a `.profdata` file contains a hash value of the function's control flow graph. This gives LLVM the ability to check if a given entry is safe to use for a given function and, if not, it can just ignore the data and compile the function normally. That would be great news because it would mean that we can use profile data collected from a different version of the compiler and still get PGO for most functions. As a consequence, we could have a `.profdata` file in version control and always use it. An asynchronous automated task could then regularly do data collection and check it into the repository.

PGO works at the LLVM IR level, so everything is still rather platform independent. My guess is that the majority of functions have the same CFG on different platforms, meaning that the profile data can be collected on one platform and then be used on all other platforms. That might massively decrease the amount of complexity for bringing PGO to CI. It would also be great news for targets like macOS where the build hardware is too weak to do the whole 3-phase build.

Function entries are keyed by symbol name, so if the symbol name is the same across platforms (which should be the case with the new symbol mangling scheme), LLVM should have no trouble finding the entry for a given function in a `.profdata` file collected on a different platform.

Overall I came to like this approach quite a bit. Once the `.profdata` file is just another file in the git repository, things become quite simple. If it is enough for that file to be "eventually consistent", we can just always use PGO without thinking about it twice. Profile data collection becomes nicely decoupled from the rest of the build process.

I think the next step is to check whether the various assumptions made above actually hold, leading to the following concrete tasks:
- Check that out-dated profile data is gracefully ignored when compiling with `-Cprofile-use`.
- Check that `-fprofile-use` and `-Cprofile-use` do not affect binary reproducibility (if used with a fixed `.profdata` file); a quick way to sanity-check this is sketched below.

Once we know about all of the above we should be in a good position to decide whether to make an MCP to officially implement this.
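As a minimal sketch of that reproducibility check under the assumption of a fixed profile (crate and file names are made up): compile the same input twice with the same `.profdata` file and compare the outputs bit for bit.

```sh
# Two builds with identical inputs and the same fixed profile data should
# produce bit-identical artifacts if -Cprofile-use introduces no randomness.
rustc -O -Cprofile-use=rustc.profdata -o build1 hello.rs
rustc -O -Cprofile-use=rustc.profdata -o build2 hello.rs
sha256sum build1 build2   # the two hashes should match
```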
Please post any feedback that you might have below!