Allow hoisting `vconst` instructions out of loops #5909
Conversation
Staring at some SIMD code and at what LLVM and v8 both generate, it appears that a common technique for SIMD loops is to hoist constants out of loops, since they're nontrivial to rematerialize, unlike integer constants. This commit updates the `loop_hoist_level` calculation with egraphs to have a nonzero default for instructions that have no arguments (e.g. consts), which enables hoisting these instructions out of loops.

Note, though, that for now I've capped the hoisting at one loop level rather than all of them. While theoretically vconsts could move all the way up to the top of the function, I'd be worried about their impact on register pressure and about having to save/restore them around calls or similar, so hopefully, if the hot part of a program is a single loop, hoisting out of one loop is a reasonable-enough heuristic for now.

Locally on x64, with a benchmark that just encodes binary to hex, this saw a 15% performance improvement, taking hex encoding from ~6G/s to ~6.7G/s.
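As a rough illustration of the heuristic described above (a simplified sketch, not Cranelift's actual `loop_hoist_level` code; the function name and signature here are invented for clarity):

```rust
/// Simplified sketch of the hoisting heuristic: given the loop-nest depth of
/// the block an instruction currently lives in and its number of SSA
/// arguments, return the loop depth it should be placed at. A smaller result
/// means the instruction is hoisted out of that many enclosing loops.
fn target_loop_depth(current_depth: usize, num_args: usize) -> usize {
    if num_args == 0 {
        // Zero-argument instructions (vconst and other constants) have
        // nothing tying them to the loop body, but hoisting them all the way
        // to the function entry could pile up live vector registers across
        // calls, so only allow them to leave a single loop.
        current_depth.saturating_sub(1)
    } else {
        // Everything else stays at its current depth unless ordinary LICM
        // can show all of its arguments are loop-invariant.
        current_depth
    }
}
```

The one-level cap is the register-pressure hedge described above: a constant used in a doubly nested hot loop still escapes the inner loop, but does not become live for the entire function.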
I don't have an intuition off-hand for how to trade off register pressure against reloading large constants, so thank you for writing comments on the intuition you're following here. It seems reasonable to me.
I appreciate the filetest showing that `vconst` gets hoisted out of a loop. I'd love to also have a filetest verifying that `vconst` is only hoisted one loop level. But I think this is correct, and fine to merge as-is.
I have a vague sense that it would be nice to come up with primitives at the ISLE level of some sort so we could write rules about this.

I was worried about the impact on SpiderMonkey.wasm in particular, though, because constant rematerialization (of the integer variety) was important there, and there is one giant loop in particular, the interpreter loop, with high register pressure. However, I think we're fine, because remat is a separate mechanism that overrides LICM, and the unchanged filetests in this PR bear that out.
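To make the remat-vs-LICM interaction concrete, here is a conceptual sketch (the types and function are invented for illustration and are not Cranelift's actual elaboration code):

```rust
/// Stand-in for a basic-block id; purely illustrative.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Block(u32);

/// Rematerialization is consulted per use, after hoisting has chosen a home
/// for the value, so a cheap integer constant gets duplicated right next to
/// its user even if the hoisting heuristic parked it outside the loop. Only
/// values that are *not* cheap to recompute (e.g. a `vconst`) keep the single
/// hoisted definition and its longer live range.
fn placement_for_use(hoisted_block: Block, use_block: Block, cheap_to_remat: bool) -> Block {
    if cheap_to_remat {
        use_block
    } else {
        hoisted_block
    }
}
```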
Oh I wonder if this works.... /bench_x64
compilation :: cycles :: benchmarks/bz2/benchmark.wasm
  Δ = 20386224.96 ± 9458874.85 (confidence = 99%)
  commit.so is 1.05x to 1.13x faster than main.so!
  [206699528 221541225.76 248828442] commit.so

instantiation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
  No difference in performance.
  [189136 206701.52 235040] commit.so

instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
  No difference in performance.
  [490272 522666.00 655030] commit.so

instantiation :: cycles :: benchmarks/bz2/benchmark.wasm
  No difference in performance.
  [163288 183903.04 236106] commit.so

compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
  No difference in performance.
  [274605536 301715860.72 334694818] commit.so

execution :: cycles :: benchmarks/bz2/benchmark.wasm
  No difference in performance.
  [112537148 115649290.16 119222514] commit.so

execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
  No difference in performance.
  [8872564 9065157.84 9461668] commit.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
  No difference in performance.
  [6765882046 6891988617.28 7001096270] commit.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm
  No difference in performance.
  [1036212126 1046099038.96 1067491184] commit.so
Locally the numbers I got were:
…with everything else at "no difference".
Hmm -- that's kind of unfortunate; I would've hoped for less impact. Probably from increased regalloc time due to higher register pressure? If so, one way we could lessen the impact here is to apply the heuristic only to certain opcodes -- define a predicate called something like … and only hoist instructions whose opcode satisfies it.

(That raises the question too: what other opcodes are being hoisted?)
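A hypothetical sketch of what such an opcode gate could look like (the predicate name and the opcode list are illustrative guesses, not anything from this PR):

```rust
use cranelift_codegen::ir::Opcode;

/// Hypothetical predicate: restrict the zero-argument hoisting heuristic to
/// constants that are genuinely expensive to rematerialize, instead of
/// applying it to every argument-less instruction.
fn is_expensive_constant(op: Opcode) -> bool {
    matches!(op, Opcode::Vconst | Opcode::F32const | Opcode::F64const)
}
```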
Hm ok, as I remeasure this I think that I'm just showing noise on my machine. The exact same code is produced before and after this change.

I know @jameysharp you mentioned awhile back that you went to great lengths to measure small changes in performance; do you know if it'd be easy-ish to use that setup to measure a change like this?
Given that, I'm happy to see this merge then! "Exactly the same code produced" is about as rigorous an argument for "no perf impact" as one can make :-)
For future reference, my notes on low-variance profiling are at https://github.com/bytecodealliance/sightglass/blob/main/docs/cpu-isolation.md if that helps.