Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable MemorySSA in MemCpyOpt #82806

Merged
merged 1 commit into from
Mar 11, 2021
Merged

Enable MemorySSA in MemCpyOpt #82806

merged 1 commit into from
Mar 11, 2021

Conversation

nikic
Copy link
Contributor

@nikic nikic commented Mar 5, 2021

LLVM 12 ships with an implementation of MemCpyOpt which is based on MSSA instead of MDA. This implementation can eliminate memcpys across blocks, and as such fixes many (but not all) failures to eliminate redundant memcpys for Rust code. Unfortunately this was only enabled by default shortly after LLVM 12 was cut. This backports the enablement to our LLVM fork.

Perf results: https://perf.rust-lang.org/compare.html?start=8fd946c63a6c3aae9788bd459d278cb2efa77099&end=0628b91ce17035fb5b6a1a99a4f2ab9ab69be7a8

There are improvements on check and debug builds, which indicate that rustc itself has become faster. For opt builds this is, on average, a very minor improvement as well, although there is one significant outlier with deep-vector-opt. This benchmark creates ~140000 zero stores, which are now coalesced into a memset slightly later, resulting in longer compile-time for intermediate passes.

@nikic
Copy link
Contributor Author

nikic commented Mar 5, 2021

@bors try @rust-timer queue

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Mar 5, 2021
@bors
Copy link
Contributor

bors commented Mar 5, 2021

⌛ Trying commit 91b2cf71dd4a87c341bc07409a345857ac687981 with merge dd2f66f571385888e89e2f017de490bbc855c499...

@bors
Copy link
Contributor

bors commented Mar 5, 2021

☀️ Try build successful - checks-actions
Build commit: dd2f66f571385888e89e2f017de490bbc855c499 (dd2f66f571385888e89e2f017de490bbc855c499)

@rust-timer
Copy link
Collaborator

Queued dd2f66f571385888e89e2f017de490bbc855c499 with parent 8fd946c, future comparison URL.

@rust-timer
Copy link
Collaborator

Finished benchmarking try commit (dd2f66f571385888e89e2f017de490bbc855c499): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Mar 5, 2021
@nikic
Copy link
Contributor Author

nikic commented Mar 5, 2021

@bors try @rust-timer queue

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Mar 5, 2021
@bors
Copy link
Contributor

bors commented Mar 5, 2021

⌛ Trying commit b022c9dbfa16eb4c64ff19b23e7daf85112bb8ae with merge 0628b91ce17035fb5b6a1a99a4f2ab9ab69be7a8...

@bors
Copy link
Contributor

bors commented Mar 5, 2021

☀️ Try build successful - checks-actions
Build commit: 0628b91ce17035fb5b6a1a99a4f2ab9ab69be7a8 (0628b91ce17035fb5b6a1a99a4f2ab9ab69be7a8)

@rust-timer
Copy link
Collaborator

Queued 0628b91ce17035fb5b6a1a99a4f2ab9ab69be7a8 with parent 8fd946c, future comparison URL.

@rust-timer
Copy link
Collaborator

Finished benchmarking try commit (0628b91ce17035fb5b6a1a99a4f2ab9ab69be7a8): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Mar 5, 2021
@nikic
Copy link
Contributor Author

nikic commented Mar 5, 2021

LLVM 12 ships with an implementation of MemCpyOpt which is based on MSSA instead of MDA. This implementation can eliminate memcpys across blocks, and as such fixes many (but not all) failures to eliminate redundant memcpys for Rust code. Unfortunately this was only enabled by default shortly after LLVM 12 was cut. I think it may be worthwhile to backport the enablement for our purposes. There's two ways to do this, with corresponding perf results:

Enable using -enable-memcpyopt-memoryssa: https://perf.rust-lang.org/compare.html?start=8fd946c63a6c3aae9788bd459d278cb2efa77099&end=dd2f66f571385888e89e2f017de490bbc855c499

Enable by backporting upstream change: https://perf.rust-lang.org/compare.html?start=8fd946c63a6c3aae9788bd459d278cb2efa77099&end=0628b91ce17035fb5b6a1a99a4f2ab9ab69be7a8

The difference is that the first enables MSSA but keeps the current position of the pass in the pipeline. This has compile-time cost, because it requires an extra MSSA calculation in most cases. The second one (which is how it was done upstream) also slightly adjusts the position where MemCpyOpt is run, so an existing MSSA calculation can be reused.

In terms of results, we can see improvements on check/debug builds in both cases, which indicates that rustc itself is faster. For opt builds the second version is faster, which is most clearly visible for the boostrap timings, although there is one outlier with deep-vector-opt.

r? @nagisa

@nagisa
Copy link
Member

nagisa commented Mar 6, 2021

The message should be probably copied over to the first message so that it ends up in the merge commit description.

The results all look pretty good. Sadly the deep-vector-opt has regressed significantly in wall-time (by ~12%, roughly matching the regression in instruction counts) too, so it'd be good to look into how representative the test case is of a typical Rust code. I suspect its going to have a huge number of memcpys or humongous basic blocks or very many of them.

I'm also curious how large is the improvement on the runtime of the code. We see from the compiler time improvements that there is some, but the compiler isn't always a very representative workload. Significant improvements in e.g. even the deep-vector-opt runtime could justify the compile time regressions. Alas I don't think this is a piece of information we can get out of perf :(

@nikic
Copy link
Contributor Author

nikic commented Mar 6, 2021

I've looked into what happens in the deep-vector-opt case: The input file is https://github.com/rust-lang/rustc-perf/blob/master/collector/benchmarks/deep-vector/src/main.rs which basically compiles down to 140000 GEPs + stores of zero. The reason for the compile-time regression in this case is that that we now have an additional InstCombine run prior to MemCpyOpt, which tries to (futilely) fold all those instructions. Previously the stores got combined into a memset a bit earlier, reducing the amount of work InstCombine does. The end result is the same in both cases (the function is optimized away entirely). I'm personally not particularly bothered by this, as this is a degenerate input, and the regression is a non-pathological one (linear).

As to runtime impact, I believe that unnecessary memcpys are a pretty common problem in Rust, but I don't have a way to quantify it beyond the check build results. @jrmuizel might have some insight here, I believe he reported a lot of memcpy optimization problems affecting servo.

@jrmuizel
Copy link
Contributor

jrmuizel commented Mar 6, 2021

https://github.com/jrmuizel/memcpy-find should be a useful tool for evaluating this. I'll try to do some before and after on some projects with this branch if I can find the time.

@jrmuizel
Copy link
Contributor

jrmuizel commented Mar 6, 2021

I confirmed, as expected, that this does fix #56172, which was a reduced test case from a relatively hot path in WebRender.

@nagisa
Copy link
Member

nagisa commented Mar 6, 2021

compiles down to 140000 GEPs + stores of zero

Yeah, that's exactly what I had hoped for. I think we're pretty comfortable taking a comptime perf hit here given the runtime perf improvements we're hoping to get here, even if they can't be quantified too well.

I still think it might make sense to track this kind of thing, but I'm not super concerned about it either.

I'm happy to r+ this once the PR is adjusted to the point where it can be landed.

@jyn514 jyn514 added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Mar 6, 2021
@jrmuizel
Copy link
Contributor

jrmuizel commented Mar 6, 2021

Building wrench with this branch brings the number of memcpy's down from 4206 to 3611 and the sum of sizes from 167776 to 133643

@nikic
Copy link
Contributor Author

nikic commented Mar 10, 2021

Upstream report: https://bugs.llvm.org/show_bug.cgi?id=49509

This updates the LLVM submodule to pick up a backported patch
to enable MemorySSA-based MemCpyOpt, which is capable of optimizing
away memcpy's across basic blocks.
@nikic
Copy link
Contributor Author

nikic commented Mar 11, 2021

Picked up the fix for the PowerPC hang. This also includes the fix for #80810 now, as it landed in the submodule in the meantime.

@bors r=nagisa

@bors
Copy link
Contributor

bors commented Mar 11, 2021

📌 Commit 623ca84 has been approved by nagisa

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Mar 11, 2021
@nagisa
Copy link
Member

nagisa commented Mar 11, 2021

@bors p=1 Lets get this out of the pipeline sooner so that we can start landing other LLVM backports for miscompiles we've encountered.

@bors
Copy link
Contributor

bors commented Mar 11, 2021

⌛ Testing commit 623ca84 with merge 4a8b6f7...

@bors
Copy link
Contributor

bors commented Mar 11, 2021

☀️ Test successful - checks-actions
Approved by: nagisa
Pushing 4a8b6f7 to master...

@glandium
Copy link
Contributor

While the issues that have been closed are indeed fixed to some extent, I'm not particularly convinced they're intrinsically fixed. It feels like you could still be unlucky with the optimizer. And that's not mentioning that the problems still exist with unoptimized builds. Or with cranelift, I suppose.

@bstrie
Copy link
Contributor

bstrie commented Mar 15, 2021

@glandium What more could be done? Indeed, you might still get unlucky with the optimizer, but the closed issues in question can also be generally characterized as "I got unlucky with the optimizer". I'm not sure that a bulletproof resolution exists.

@nikic
Copy link
Contributor Author

nikic commented Mar 15, 2021

@bstrie There's two dimensions to this problem: The optimization problem, which is mostly resolved by this change, and the language semantics problem, which is not. Rust has neither placement new nor guaranteed NRVO, both of which may be desirable to provide hard guarantees that certain copies do not occur (e.g. that if you box a value, no stack allocation will be created even in completely unoptimized builds). I believe the closed issues all refer to optimization problems and we have separate issues to track the language semantics.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. merged-by-bors This PR was explicitly merged by bors. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.