Enable MemorySSA in MemCpyOpt #82806
Conversation
@bors try @rust-timer queue
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf
⌛ Trying commit 91b2cf71dd4a87c341bc07409a345857ac687981 with merge dd2f66f571385888e89e2f017de490bbc855c499...
☀️ Try build successful - checks-actions
Queued dd2f66f571385888e89e2f017de490bbc855c499 with parent 8fd946c, future comparison URL.
Finished benchmarking try commit (dd2f66f571385888e89e2f017de490bbc855c499): comparison url. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below. Importantly, though, if the results of this run are non-neutral, do not roll this PR up -- it will mask other regressions or improvements in the rollup. @bors rollup=never
@bors try @rust-timer queue
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf
⌛ Trying commit b022c9dbfa16eb4c64ff19b23e7daf85112bb8ae with merge 0628b91ce17035fb5b6a1a99a4f2ab9ab69be7a8...
☀️ Try build successful - checks-actions
Queued 0628b91ce17035fb5b6a1a99a4f2ab9ab69be7a8 with parent 8fd946c, future comparison URL.
Finished benchmarking try commit (0628b91ce17035fb5b6a1a99a4f2ab9ab69be7a8): comparison url. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below. Importantly, though, if the results of this run are non-neutral, do not roll this PR up -- it will mask other regressions or improvements in the rollup. @bors rollup=never
LLVM 12 ships with an implementation of MemCpyOpt which is based on MemorySSA (MSSA) instead of MemoryDependenceAnalysis (MDA). This implementation can eliminate memcpys across basic blocks, and as such fixes many (but not all) failures to eliminate redundant memcpys in Rust code. Unfortunately, this was only enabled by default shortly after LLVM 12 was cut. I think it may be worthwhile to backport the enablement for our purposes.

There are two ways to do this, with corresponding perf results: enable it via a flag, or enable it by backporting the upstream change: https://perf.rust-lang.org/compare.html?start=8fd946c63a6c3aae9788bd459d278cb2efa77099&end=0628b91ce17035fb5b6a1a99a4f2ab9ab69be7a8

The difference is that the first option enables MSSA but keeps the current position of the pass in the pipeline. This has a compile-time cost, because it requires an extra MSSA calculation in most cases. The second one (which is how it was done upstream) also slightly adjusts the position where MemCpyOpt is run, so an existing MSSA calculation can be reused.

In terms of results, we can see improvements on check/debug builds in both cases, which indicates that rustc itself is faster. For opt builds the second version is faster, which is most clearly visible in the bootstrap timings, although there is one outlier with deep-vector-opt.

r? @nagisa
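To make the cross-block aspect concrete, here is a hypothetical sketch in Rust (made up for illustration, not taken from this PR or from rustc-perf) of the kind of pattern whose redundant copy can span basic blocks:

```rust
// Hypothetical illustration; names and sizes are invented.
// The buffer is initialized, then conditionally mutated, so the
// initialization and the final copy into the return slot can end up
// in different basic blocks in the LLVM IR. An analysis limited to a
// single block may fail to prove the intermediate copy redundant,
// while a MemorySSA-based walk of the memory def-use chain can.
pub fn build(flag: bool) -> [u64; 128] {
    let mut tmp = [0u64; 128];
    if flag {
        tmp[0] = 1; // extra block between init and the return copy
    }
    tmp // returning by value may lower to a memcpy into the out slot
}
```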
The message should probably be copied over to the first message so that it ends up in the merge commit description. The results all look pretty good. Sadly, deep-vector-opt has regressed significantly in wall-time too (by ~12%, roughly matching the regression in instruction counts), so it'd be good to look into how representative the test case is of typical Rust code. I suspect it's going to have a huge number of memcpys, or humongous basic blocks, or very many of them. I'm also curious how large the improvement on the runtime of the compiled code is. We see from the compile-time improvements that there is some, but the compiler isn't always a very representative workload. Significant improvements in e.g. even the
I've looked into what happens in the deep-vector-opt case: the input file is https://github.com/rust-lang/rustc-perf/blob/master/collector/benchmarks/deep-vector/src/main.rs, which basically compiles down to 140000 GEPs plus stores of zero. The reason for the compile-time regression in this case is that we now have an additional InstCombine run prior to MemCpyOpt, which tries to (futilely) fold all those instructions. Previously, the stores got combined into a memset a bit earlier, reducing the amount of work InstCombine does. The end result is the same in both cases (the function is optimized away entirely). I'm personally not particularly bothered by this, as this is a degenerate input, and the regression is non-pathological (linear). As for runtime impact, I believe that unnecessary memcpys are a pretty common problem in Rust, but I don't have a way to quantify it beyond the check build results. @jrmuizel might have some insight here; I believe he reported a lot of memcpy optimization problems affecting Servo.
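For context, a minimal sketch of the shape of that input, assuming the real file simply spells the elements out literally (only a few of the ~140000 zeros are shown):

```rust
// Approximation of the deep-vector benchmark's shape. Each literal
// element lowers to a GEP plus a store of zero in LLVM IR, so the
// optimizer sees on the order of 140000 such instruction pairs.
// With the pass reordering, InstCombine now visits all of those
// stores before they are coalesced into a single memset, which is
// where the (linear) compile-time regression comes from. The whole
// function is ultimately optimized away either way.
fn main() {
    let v: Vec<u64> = vec![0, 0, 0, 0, 0, 0, 0, 0 /* ... */];
    let _ = v;
}
```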
https://github.com/jrmuizel/memcpy-find should be a useful tool for evaluating this. I'll try to do some before-and-after comparisons on some projects with this branch if I can find the time.
I confirmed, as expected, that this does fix #56172, which was a reduced test case from a relatively hot path in WebRender.
Yeah, that's exactly what I had hoped for. I think we're pretty comfortable taking a compile-time perf hit here given the runtime perf improvements we're hoping to get, even if they can't be quantified too well. I still think it might make sense to track this kind of thing, but I'm not super concerned about it either. I'm happy to r+ this once the PR is adjusted to the point where it can be landed.
Building wrench with this branch brings the number of memcpys down from 4206 to 3611, and the sum of their sizes from 167776 to 133643.
Upstream report: https://bugs.llvm.org/show_bug.cgi?id=49509
This updates the LLVM submodule to pick up a backported patch to enable MemorySSA-based MemCpyOpt, which is capable of optimizing away memcpys across basic blocks.
📌 Commit 623ca84 has been approved by nagisa
@bors p=1 Let's get this out of the pipeline sooner so that we can start landing other LLVM backports for miscompiles we've encountered.
☀️ Test successful - checks-actions
While the issues that have been closed are indeed fixed to some extent, I'm not particularly convinced they're intrinsically fixed. It feels like you could still get unlucky with the optimizer. And that's not even mentioning that the problems still exist with unoptimized builds. Or with Cranelift, I suppose.
@glandium What more could be done? Indeed, you might still get unlucky with the optimizer, but the closed issues in question can also be generally characterized as "I got unlucky with the optimizer". I'm not sure that a bulletproof resolution exists.
@bstrie There are two dimensions to this problem: the optimization problem, which is mostly resolved by this change, and the language semantics problem, which is not. Rust has neither placement new nor guaranteed NRVO, both of which may be desirable to provide hard guarantees that certain copies do not occur (e.g., that if you box a value, no stack allocation will be created even in completely unoptimized builds). I believe the closed issues all refer to optimization problems, and we have separate issues to track the language semantics.
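As a hedged illustration of that semantics gap (a made-up example, not one of the closed issues): boxing a large value gives no language-level guarantee that it is constructed in place:

```rust
// Illustration only: nothing in Rust's semantics prevents this from
// materializing the full 1 MiB array in a stack temporary and then
// copying it into the heap allocation. In debug builds that copy
// typically happens (and can even overflow the stack for large
// enough values); in optimized builds MemCpyOpt (now MSSA-based)
// can usually elide it, but that is an optimization, not a guarantee.
fn boxed_buffer() -> Box<[u8; 1 << 20]> {
    Box::new([0u8; 1 << 20])
}
```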
LLVM 12 ships with an implementation of MemCpyOpt which is based on MSSA instead of MDA. This implementation can eliminate memcpys across blocks, and as such fixes many (but not all) failures to eliminate redundant memcpys for Rust code. Unfortunately this was only enabled by default shortly after LLVM 12 was cut. This backports the enablement to our LLVM fork.
Perf results: https://perf.rust-lang.org/compare.html?start=8fd946c63a6c3aae9788bd459d278cb2efa77099&end=0628b91ce17035fb5b6a1a99a4f2ab9ab69be7a8
There are improvements on check and debug builds, which indicate that rustc itself has become faster. For opt builds this is, on average, a very minor improvement as well, although there is one significant outlier with deep-vector-opt. This benchmark creates ~140000 zero stores, which are now coalesced into a memset slightly later, resulting in longer compile-time for intermediate passes.