Exploring PGO for the Rust compiler #79442
Comments
One concern with the "best-effort" approach that I just became aware of is how it affects performance testing: let's say you want to optimize some expensive function …
This might be overcomplicating things, but what if the default was not to use PGO builds, and to enable them only for nightly/beta/stable? Then … The disadvantage is that nightlies will now require a second full build; it will no longer be possible to use the latest build artifacts from bors.
A question that I think is missing is how storing …
A variation on approach 3: Have stage1 gather PGO data while building stage2 for an auto-merge, then save that somewhere so it can be used during the next stage2 build of anything that has that merge as its nearest ancestor in the history.
@andjo403 While by no means a solution, git LFS may be helpful regarding the size of the repo.
That's a good point! Using Git LFS sounds a bit problematic to me because of its reliance on external storage. Maybe the data could be stored in a separate repository that gets pulled in as a submodule? Then one would not have to pull the entire thing.
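For what it's worth, a minimal sketch of the submodule idea (the repository URL and path are made up for illustration):

```sh
# Hypothetical setup: keep the profile data in its own repository and pull it
# into the main repo as a submodule.
git submodule add https://github.com/rust-lang/rustc-pgo-data src/pgo-data

# Contributors who need the data can then fetch it shallowly, without pulling
# the submodule's entire history.
git submodule update --init --depth 1 src/pgo-data
```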
There seems to be a text-based profile data format that looks pretty mergeable:

```
_ZNK4llvm20MemorySSAWrapperPass14verifyAnalysisEv
# Func Hash:
22759827559
# Num Counters:
2
# Counter Values:
0
0
_ZN4llvm9DIBuilder17createNullPtrTypeEv
# Func Hash:
12884901887
# Num Counters:
1
# Counter Values:
0
_ZN4llvm15SmallVectorImplINS_26AArch64GenRegisterBankInfo17PartialMappingIdxEE6appendIPKS2_vEEvT_S7_
# Func Hash:
37713126052
# Num Counters:
3
# Counter Values:
0
0
0
# Num Value Kinds:
1
# ValueKind = IPVK_MemOPSize:
1
# NumValueSites:
1
0
```

I don't know how well supported it is. Surprisingly it seems to be slightly more compact than the binary format (20 MB vs 23 MB and 57 MB vs 64 MB). It also compresses better than the binary format. But it would have to be stored in the repository uncompressed in order to be diffable, right? Or does git have any tricks up its sleeve that allow it to store compressed diffs?
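If it helps, here is a rough sketch of how such text profiles could be produced and inspected with `llvm-profdata` (file names are made up; `--text` selects the text output format instead of the indexed binary one):

```sh
# Merge raw profiles from one or more training runs; --text emits the
# mergeable text format shown above instead of the binary .profdata format.
llvm-profdata merge --text --output=rustc-pgo.proftext \
    run1.profraw run2.profraw

# Inspect individual entries (function hashes, counter values).
llvm-profdata show --all-functions --counts rustc-pgo.proftext | less
```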
A variation of this would be to build a non-PGOed baseline compiler just for perf.rlo runs. That could happen in parallel to building the modified compiler.
This slide deck from 2013 states the following design goals for LLVM's instrumentation based PGO support:
The presentation probably refers to front-end based instrumentation, since the IR-level instrumentation that …

This blog post also talks about function CFG hashes making sure that out-dated profile data is detected and ignored. Interestingly, it also mentions that the dotnet runtime uses a best-effort approach similar to the one described above.
I think it would be quite reasonable to store the PGO data files on S3 or something similar and just have the URL or something similar point to that in CI (or, optionally, local builds). I expect regardless of what we do we'll want it to be optional to enable it. I would not expect us to store them in git or similar because -- at least AFAIK -- inspecting changes to them isn't really feasible/desirable. We basically already do this for the bootstrap compiler (i.e., it's just downloaded by hash/version) and these artifacts would be no different.

One question I have @michaelwoerister is the extent to which we can profile-use artifacts built on different machines -- are there absolute path dependencies here? Do we need some special handling for this? In particular, I would love for local developers to be able to use the same artifacts CI did without too much hassle (i.e., not in docker but just building directly). It sounds like based on what you've said this should not be a problem but would be good to be certain here. (I guess this is part of "reproducible builds" -- do I get the same profiling information across different runs on the same workload? Or does e.g. ASLR make the profiles radically different?)

If the profiles are sufficiently opaque as to not care too much about the producing rustc's origins, one approach might be to use perf.rlo hardware exclusively to generate the instrumented rustc's and profile them. We already build rustc at each commit on perf.rust-lang.org in order to record the bootstrap compile times, and building it in an instrumented fashion would not be too hard, I suspect. Once we had that we could use it to gather profiling data (likely on the perf.rlo benchmarks) and feed that back into the next master commit. This would mean we're always off by one commit's worth of changes but I expect that to be a minor loss.

I think a great next step here would be to get some idea on:
Presuming the answer to these questions is "not much" (5% wall time is probably the limit on current perf hardware; but I imagine that getting better or more hardware would not be too hard if we needed to), then I think a good series of next steps would be:
When it comes to hosting the profile data in version control versus somewhere external, I think the main question to clarify is how (historically) reproducible we want PGO builds to be: If we store profile data in git we can go to any commit and get the exact same build because PGO data is guaranteed to be available. If we host the data externally we have less of a guarantee that the data will still be available after a few months or years. However, after you mentioned the bootstrap compiler also being stored externally, I now realize that we already have "critical" data stored outside of version control. So storing PGO data on S3 would not make things worse at least.
Yes, there are some absolute paths in the profile data. Some symbol names are augmented with the path of their originating source file -- this seems to be necessary for ThinLTO to work properly in all cases. I only discovered this recently. But there is good news:
Overall I think this problem is solvable.
You get the same profile data if (and only if) the workload is deterministic. If there is some source of randomness, like if pointers are being hashed or compared (even without ASLR), then profile data will change. However, if we just store the profile data somewhere, things should be deterministic -- which luckily also happens to be the better approach from a build times perspective.
Not much slower but noticeable, I think. I added that question to the TODO list in the OP.
Quite noticeable. I think a 20-30% slowdown should be expected.
I don't think instruction counts would get a lot noisier -- but maybe I am wrong. Instrumentation code has to access various runtime counters in memory all the time, which might mess with the cache. And it has to write all that data to disk, which might introduce noise too.

Overall I am skeptical about completely switching perf.rlo to using instrumented builds. On the plus side it would solve the unfairness problem mentioned above, and it would make setting this up easier. But I'm a bit worried that it might skew the performance data too much. One thing to consider here is that the accuracy of instrumentation-based profile data collection is quite independent of the underlying hardware, since it works by just counting how many times each branch is taken. So it can be moved to a slow machine without problem and, more importantly, it can be executed on machines with inconsistent performance characteristics (like in a VPS).

I'm also confident that the entire perf.rlo benchmark suite is way too big and that we could get the same profile data quality with something that has 10% of the runtime. So I currently tend to think that we would be better off running data collection separately somewhere, although it can still be based on the perf.rlo framework (running in a special mode) if that makes things easier.
Is that the same compiler that is then used to run the benchmarks? I assumed that it would be much better from a maintainability standpoint to add a couple more "regular" docker-based builds for providing the instrumented compiler (one for Unix, one for Windows). They could even do the data collection right after building (because we don't need to care about hardware performance consistency). The fairness problem mentioned above could also be solved by always using a non-PGOed compiler for perf.rlo benchmarking. In the worst case this would mean a single additional x86-64 Linux dist build, right?
@Mark-Simulacrum I think the first point in your list of action items (adding PGO support to rustbuild) makes sense, regardless of how we proceed exactly. I opened #79562 for discussing that in detail.
OK, so it sounds like using perf.rlo to collect data is likely not a good fit: it's not really needed, since we expect the data to be about as deterministic as what we'd get from running the collection in CI, and it would unacceptably slow down builds.
No, perf.rlo doesn't use the compiler it builds to run benchmarks today.

The unfairness problem indeed seems hard to tackle. I was initially thinking that it wouldn't be that big a deal, but I think the most unfortunate element is that we'd presumably begin to "expect" regressions from changing hot code (since it would lose PGO benefits), and that seems pretty bad. I think using non-PGO builds on perf for now is probably the way to go; we should be able to afford a single perf builder. It'll also be good to have something to compare against in case any weird bugs show up later on, to make sure it's not PGO being buggy.

That said, if we go with the off-by-one approach to data collection, the unfairness problem will spread to nightlies too: if a patch changing hot code lands and ships in a nightly, then that patch will plausibly be a regression to nightly performance. On beta and stable we probably won't see that as much (we can land dummy README-changing patches or something before release). I'm not sure if we should try to mitigate that somehow. Maybe in practice the effects of PGO on even very hot code are minor enough that this is all worrying over nothing.

So maybe it's worth taking a look at doing PGO within a single build cycle (i.e., we build a compiler, collect data, and then build another compiler) in CI. If that's feasible then it removes the unfairness problem and is all around better, I suspect. I think it makes sense to wait until we have support in rustbuild for doing this and then see how much we can fit into e.g. x86_64-linux builders to start: if we can pull off a full PGO cycle, great; if not, we can start taking a look at other options (for example, only doing "perfect" PGO on beta/stable across several CI cycles, and on nightly just using the beta/stable PGO data).
👍
Yes -- I think that would be acceptable though. The unfairness problem is more of an issue for performance measurement where you want to have accurate numbers about a small change. For real-world compile times I don't think it would be noticeable. And for stable and beta you can get rid of the problem "manually" by doing an empty commit so that PGO data effectively can catch up with the actual code.
My estimate is that that would be intolerably slow.
I think so too.
FYI I looked into this a while back and there just isn't any straightforward way to make the workload deterministic. For Firefox builds, we settled on being comfortable with publishing the profile data and making sure that the optimized build step was deterministic given that same input. That means that anyone ought to be able to reproduce the Firefox builds we publish given the same source + profiling data we publish, which seems like a reasonable compromise.

We split the build into three separate tasks: the instrumented build, the profile collection, and the optimized build. This also helped us enable PGO for cross-compiled builds like the macOS build on Linux. If you're going to have a fixed set of profile data that gets updated periodically then that simplifies things further. A lot of the Firefox build choices were made prior to switching all the builds to clang, so some of these things that are possible with LLVM PGO were not possible with MSVC/GCC PGO.
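For illustration, a rough sketch of what those three tasks boil down to in terms of the generic rustc/LLVM PGO machinery (crate name, paths, and the training workload are placeholders; in practice rustbuild would orchestrate this):

```sh
# 1) Instrumented build: compile with profile instrumentation enabled.
rustc -O -Cprofile-generate=/tmp/pgo-data my_program.rs

# 2) Profile collection: run a representative workload (each run drops a
#    .profraw file into the directory above), then merge the raw profiles.
./my_program typical-input-1
./my_program typical-input-2
llvm-profdata merge --output=/tmp/pgo-data/merged.profdata /tmp/pgo-data

# 3) Optimized build: recompile, feeding the merged profile back in.
rustc -O -Cprofile-use=/tmp/pgo-data/merged.profdata my_program.rs
```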
Yes, I think that is the most promising approach and it works well with re-using profile data generated on other machines/platforms.
#80262 added PGO support for the Rust part of the Linux x64 dist builds, and perf.rlo shows the expected speedups for check builds and other test cases that don't invoke LLVM 🎉 I think this is confirmation enough that the results from my blog post can indeed be extrapolated to other systems too.
What needs to be done to allow Windows builds to benefit from PGO?
Is PGO on ice for non-Linux x64 builds?
I noticed that PGO fails in the final step when LTO is enabled for the builds. Not sure why this happens, but I get a …
LLVM 14 (with in-tree support for BOLT) is nearing its release. I'll try to use BOLT to optimize LLVM (just LLVM, not …
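For reference, a rough sketch of what such a BOLT pass over `libLLVM.so` might look like (paths, the training workload, and the exact optimization flags are placeholders and would need tuning):

```sh
# Record branch samples while a representative workload exercises libLLVM.so.
perf record -e cycles:u -j any,u -o perf.data -- \
    rustc -O some_training_crate.rs

# Convert the perf samples into BOLT's profile format for the target binary.
perf2bolt -p perf.data -o perf.fdata /path/to/libLLVM.so

# Rewrite the binary with an optimized code layout based on that profile.
llvm-bolt /path/to/libLLVM.so -o /path/to/libLLVM-bolt.so \
    -data=perf.fdata -reorder-blocks=ext-tsp -reorder-functions=hfsort
```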
@Kobzol what's the current status of PGO? We have it enabled on all nightly builds, right? Can we close this issue now?
It's enabled for x64 Linux and Windows, but not yet for macOS.
This issue is a landing place for discussion of whether and how to apply profile-guided optimization to `rustc`. There is some preliminary investigation of the topic in the Exploring PGO for the Rust compiler post on the Inside Rust blog. The gist of it is that the performance gains offered by PGO look very promising, but we need to confirm the results and find a feasible way of using PGO for rustc.

Let's start with the first point.
Confirming the results
The blog post contains a step-by-step description of how to obtain a PGOed compiler -- but it is rather time consuming to actually do that. In order to make things easier I could provide a branch of the compiler that has all the changes already applied and, more importantly, a pre-recorded, checked-in `.profdata` file for both LLVM and rustc. Alternatively, I could just put up the final toolchain for download somewhere. Even better would be to make it available via rustup somehow. Please comment below on how best to approach this.

Reasons not to do PGO?
Concerns raised so far are:
- This makes `rustc` builds non-reproducible -- something which I don't think is true. With a fixed `.profdata` file, both rustc and Clang should always generate the same output. That is, `-Cprofile-use` and `-fprofile-use` do not introduce any source of randomness, as far as I can tell. So if the `.profdata` file being used is tracked by version control, we should be fine. It would be good to get some kind of further confirmation of that, though.
- If we apply PGO just to stable and beta releases, we don't get enough testing for PGO-specific toolchain bugs.
- It is too much effort to continuously monitor the effect of PGO (e.g. via perf.rlo) because we would need PGOed nightlies in addition to non-PGOed nightlies (the latter of which serve as a baseline).
- Doing PGO might be risky in that it adds another opportunity for LLVM bugs to introduce miscompilations.
- It makes CI more complicated.
- It increases cycle times for the compiler.
The last two points can definitely be true. Finding out whether they have to be is the point of the next section:
Find a feasible way of using PGO for rustc
There are several ways we can bring PGO to rustc:

1. Easy DIY PGO via rustbuild.
2. PGO for beta and stable releases only.
3. A "best-effort" approach, where a checked-in `.profdata` file is always used, even if it is slightly out of date.

Let's go through the points in more detail:
Easy DIY PGO via rustbuild - I think we should definitely do this. There is quite a bit of design space on how to structure the concrete build options (@luser has posted some relevant thoughts in a related topic). But overall it should not be too much work, and since it is completely opt-in, there's also little risk involved. In addition, it is also a necessary intermediate step for the other two options.
PGO for beta and stable releases only - The feasibility of option (2) depends on a few things:
Is it acceptable from a testing point of view to build stable and beta artifacts with different settings than regular CI builds? Arguably beta releases get quite a bit of testing because they are used for building the compiler itself. On the other hand, building the compiler is a quite sensitive task.
Is it technically actually possible to do the long, three-phase compilation process on CI, or would we run into time limits set by the infrastructure? We might be more flexible in this respect now than we have been in the past.
How do we handle cross-compiled toolchains where profile data collection and compilation cannot run on the same system? A simple answer there is: don't do PGO for these targets. A possible better answer is to use profiling data collected on another system. This is even more relevant for the "best-effort" approach as described below.
Personally I'm on the fence about whether I find this approach acceptable or not -- especially given that there is a third option that is potentially quite a bit better.
The "best-effort" approach - Every function entry in a `.profdata` file contains a hash value of the function's control flow graph. This gives LLVM the ability to check if a given entry is safe to use for a given function and, if not, it can just ignore the data and compile the function normally. That would be great news because it would mean that we can use profile data collected from a different version of the compiler and still get PGO for most functions. As a consequence, we could have a `.profdata` file in version control and always use it. An asynchronous automated task could then regularly do data collection and check it into the repository.

PGO works at the LLVM IR level, so everything is still rather platform independent. My guess is that the majority of functions have the same CFG on different platforms, meaning that the profile data can be collected on one platform and then be used on all other platforms. That might massively decrease the amount of complexity for bringing PGO to CI. It would also be great news for targets like macOS where the build hardware is too weak to do the whole 3-phase build.

Function entries are keyed by symbol name, so if the symbol name is the same across platforms (which should be the case with the new symbol mangling scheme), LLVM should have no trouble finding the entry for a given function in a `.profdata` file collected on a different platform.

Overall I came to like this approach quite a bit. Once the `.profdata` file is just another file in the git repository, things become quite simple. If it is enough for that file to be "eventually consistent", we can just always use PGO without thinking about it twice. Profile data collection becomes nicely decoupled from the rest of the build process.

I think the next step is to check whether the various assumptions made above actually hold, leading to the following concrete tasks:
- Check that out-dated profile data is gracefully ignored when compiling with `-Cprofile-use`.
- Check that `-fprofile-use` and `-Cprofile-use` do not affect binary reproducibility (if used with a fixed `.profdata` file); a quick way to sanity-check this is sketched below.

Once we know about all of the above we should be in a good position to decide whether to make an MCP to officially implement this.
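As a minimal sketch of that reproducibility check under the assumption of a fixed profile (crate and file names are made up): compile the same input twice with the same `.profdata` file and compare the outputs bit for bit.

```sh
# Two builds with identical inputs and the same fixed profile data should
# produce bit-identical artifacts if -Cprofile-use introduces no randomness.
rustc -O -Cprofile-use=rustc.profdata -o build1 hello.rs
rustc -O -Cprofile-use=rustc.profdata -o build2 hello.rs
sha256sum build1 build2   # the two hashes should match
```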
Please post any feedback that you might have below!