Allow hoisting `vconst` instructions out of loops #5909
Conversation
Staring at some SIMD code and at what LLVM and v8 both generate, it appears that a common technique for SIMD loops is to hoist constants out of loops, since they're nontrivial to rematerialize, unlike integer constants. This commit updates the `loop_hoist_level` calculation with egraphs to have a nonzero default for instructions that have no arguments (e.g. consts), which enables hoisting these instructions out of loops.

Note, though, that for now I've capped the hoisting at one loop level rather than all of them. While theoretically vconsts could move all the way up to the top of the function, I'd be worried about their impact on register pressure and about having to save/restore them around calls or similar, so hopefully, if the hot part of a program is a single loop, hoisting out of one loop is a reasonable-enough heuristic for now.

Locally on x64, with a benchmark that just encodes binary to hex, this saw a 15% performance improvement, taking hex encoding from ~6G/s to ~6.7G/s.
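As a rough illustration of the heuristic described above (a simplified sketch, not Cranelift's actual `loop_hoist_level` code; the function name and signature here are invented for clarity):

```rust
/// Simplified sketch of the hoisting heuristic: given the loop-nest depth of
/// the block an instruction currently lives in and its number of SSA
/// arguments, return the loop depth it should be placed at. A smaller result
/// means the instruction is hoisted out of that many enclosing loops.
fn target_loop_depth(current_depth: usize, num_args: usize) -> usize {
    if num_args == 0 {
        // Zero-argument instructions (vconst and other constants) have
        // nothing tying them to the loop body, but hoisting them all the way
        // to the function entry could pile up live vector registers across
        // calls, so only allow them to leave a single loop.
        current_depth.saturating_sub(1)
    } else {
        // Everything else stays at its current depth unless ordinary LICM
        // can show all of its arguments are loop-invariant.
        current_depth
    }
}
```

The one-level cap is the register-pressure hedge described above: a constant used in a doubly nested hot loop still escapes the inner loop, but does not become live for the entire function.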
I don't have an intuition off-hand for how to trade off register pressure against reloading large constants, so thank you for writing comments on the intuition you're following here. It seems reasonable to me.
I appreciate the filetest showing that `vconst` gets hoisted out of a loop. I'd love to also have a filetest verifying that `vconst` is only hoisted one loop level. But I think this is correct, and fine to merge as-is.
I have a vague sense that it would be nice to come up with primitives at the ISLE level of some sort so we could write rules about this.

I was worried about the impact on SpiderMonkey.wasm in particular, though, because constant rematerialization (of the integer variety) was important there, and there is one giant loop in particular, the interpreter loop, with high register pressure. However, I think we're fine, because remat is a separate mechanism that overrides LICM, and the unchanged filetests in this PR bear that out.
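To make the remat-vs-LICM interaction concrete, here is a conceptual sketch (the types and function are invented for illustration and are not Cranelift's actual elaboration code):

```rust
/// Stand-in for a basic-block id; purely illustrative.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Block(u32);

/// Rematerialization is consulted per use, after hoisting has chosen a home
/// for the value, so a cheap integer constant gets duplicated right next to
/// its user even if the hoisting heuristic parked it outside the loop. Only
/// values that are *not* cheap to recompute (e.g. a `vconst`) keep the single
/// hoisted definition and its longer live range.
fn placement_for_use(hoisted_block: Block, use_block: Block, cheap_to_remat: bool) -> Block {
    if cheap_to_remat {
        use_block
    } else {
        hoisted_block
    }
}
```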
Oh I wonder if this works.... /bench_x64
compilation :: cycles :: benchmarks/bz2/benchmark.wasm
  Δ = 20386224.96 ± 9458874.85 (confidence = 99%)
  commit.so is 1.05x to 1.13x faster than main.so!
  [206699528 221541225.76 248828442] commit.so

instantiation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
  No difference in performance.
  [189136 206701.52 235040] commit.so

instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
  No difference in performance.
  [490272 522666.00 655030] commit.so

instantiation :: cycles :: benchmarks/bz2/benchmark.wasm
  No difference in performance.
  [163288 183903.04 236106] commit.so

compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
  No difference in performance.
  [274605536 301715860.72 334694818] commit.so

execution :: cycles :: benchmarks/bz2/benchmark.wasm
  No difference in performance.
  [112537148 115649290.16 119222514] commit.so

execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
  No difference in performance.
  [8872564 9065157.84 9461668] commit.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
  No difference in performance.
  [6765882046 6891988617.28 7001096270] commit.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
  No difference in performance.
  [1036212126 1046099038.96 1067491184] commit.so
Locally the numbers I got were:
…with everything else at "no difference".
Hmm -- that's kind of unfortunate; I would've hoped for less impact. Probably from increased regalloc time due to higher register pressure? If so, one way we could lessen the impact here is to apply the heuristic only to certain opcodes -- define a predicate called something like … and only hoist instructions whose opcode satisfies it.

(That raises the question too: what other opcodes are being hoisted?)
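A hypothetical sketch of what such an opcode gate could look like (the predicate name and the opcode list are illustrative guesses, not anything from this PR):

```rust
use cranelift_codegen::ir::Opcode;

/// Hypothetical predicate: restrict the zero-argument hoisting heuristic to
/// constants that are genuinely expensive to rematerialize, instead of
/// applying it to every argument-less instruction.
fn is_expensive_constant(op: Opcode) -> bool {
    matches!(op, Opcode::Vconst | Opcode::F32const | Opcode::F64const)
}
```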
Hm ok, as I remeasure this I think that I'm just showing noise on my machine. The exact same code is produced before and after this change.

I know @jameysharp you mentioned awhile back that you went to great lengths to measure small changes in performance; do you know if it'd be easy-ish to use that setup to measure a change like this?
Given that, I'm happy to see this merge then! "Exactly the same code produced" is about as rigorous an argument for "no perf impact" as one can make :-)
For future reference, my notes on low-variance profiling are at https://github.com/bytecodealliance/sightglass/blob/main/docs/cpu-isolation.md if that helps.