
Allow hoisting vconst instructions out of loops #5909

Merged · 2 commits into bytecodealliance:main · Mar 6, 2023

Conversation

alexcrichton (Member)

Staring at some SIMD code and at what LLVM and v8 both generate, it appears that a common technique for SIMD loops is to hoist constants out of loops, since, unlike integer constants, they are nontrivial to rematerialize. This commit updates the `loop_hoist_level` calculation in the egraph pass to give a nonzero default to instructions that have no arguments (e.g. constants), which enables hoisting these instructions out of loops.

Note, though, that for now I've capped this at hoisting out of one loop, not all of them. While vconsts could theoretically move all the way up to the top of the function, I'd be worried about their impact on register pressure and about having to save/restore them around calls or similar, so if the hot part of a program is a single loop then hoisting out of one loop is hopefully a reasonable-enough heuristic for now.

Locally on x64, a benchmark that just encodes binary to hex saw a 15% performance improvement, taking hex encoding from ~6G/s to ~6.7G/s.
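The heuristic described above can be sketched as a standalone function. This is an illustration only, not the actual Cranelift internals; the name `hoist_level` and its signature are invented for this sketch. The idea is that an argument-less instruction may move out of at most one enclosing loop, while anything with operands stays where its operands allow:

```rust
// Illustrative sketch of the "hoist nullary instructions out of one
// loop" heuristic. Not actual Cranelift code.

/// Given the loop-nesting depth at which an instruction currently
/// sits and whether it takes any arguments, return the loop level it
/// may be hoisted to (0 = function entry, outside all loops).
fn hoist_level(loop_depth: u32, has_args: bool) -> u32 {
    if has_args {
        // Instructions with arguments are left alone in this sketch;
        // in practice they hoist only as far as their operands allow.
        loop_depth
    } else {
        // Nullary instructions (e.g. `vconst`) move out of at most one
        // enclosing loop, rather than all the way to function entry,
        // to limit register pressure and save/restores around calls.
        loop_depth.saturating_sub(1)
    }
}

fn main() {
    assert_eq!(hoist_level(2, false), 1); // vconst in a doubly nested loop
    assert_eq!(hoist_level(0, false), 0); // already outside all loops
    assert_eq!(hoist_level(2, true), 2); // e.g. an add with operands
    println!("ok");
}
```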

@github-actions bot added the cranelift label (Issues related to the Cranelift code generator) on Mar 2, 2023
@jameysharp (Contributor) left a comment

I don't have an intuition off-hand for how to trade off register pressure against reloading large constants, so thank you for writing comments on the intuition you're following here. It seems reasonable to me.

I appreciate the filetest showing that vconst gets hoisted out of a loop. I'd love to also have a filetest verifying that vconst is only hoisted one loop level. But I think this is correct, and fine to merge as-is.

@cfallin (Member) commented Mar 2, 2023

I have a vague sense that it would be nice to come up with primitives at the ISLE level of some sort so we could write rules about vconst hoisting specifically, but I won't block anything here on that. This is a totally reasonable approach and heuristic for now, I think.

I was worried about impact on SpiderMonkey.wasm in particular though, because constant rematerialization (of the integer variety) was important there and there is one giant loop in particular, the interpreter loop, with high register pressure. However I think we're fine because remat is a separate mechanism that overrides LICM, and the unchanged filetests in licm.clif seem to confirm this. Just to be extra-safe though, would you mind testing SM via Sightglass before we merge?

@alexcrichton (Member, Author)

Oh I wonder if this works....

/bench_x64

@jlb6740 (Contributor) commented Mar 2, 2023

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

Δ = 20386224.96 ± 9458874.85 (confidence = 99%)

commit.so is 1.05x to 1.13x faster than main.so!

[206699528 221541225.76 248828442] commit.so
[225941572 241927450.72 280238918] main.so

instantiation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

No difference in performance.

[189136 206701.52 235040] commit.so
[191910 213628.96 382480] main.so

instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

No difference in performance.

[490272 522666.00 655030] commit.so
[482422 508389.36 534288] main.so

instantiation :: cycles :: benchmarks/bz2/benchmark.wasm

No difference in performance.

[163288 183903.04 236106] commit.so
[165690 179144.64 191834] main.so

compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

No difference in performance.

[274605536 301715860.72 334694818] commit.so
[275865682 300141379.52 353410662] main.so

execution :: cycles :: benchmarks/bz2/benchmark.wasm

No difference in performance.

[112537148 115649290.16 119222514] commit.so
[112444412 116049765.20 117620050] main.so

execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

No difference in performance.

[8872564 9065157.84 9461668] commit.so
[8896812 9034417.84 9306780] main.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

No difference in performance.

[6765882046 6891988617.28 7001096270] commit.so
[6782510656 6910459005.44 6988000084] main.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

No difference in performance.

[1036212126 1046099038.96 1067491184] commit.so
[1033616880 1045758186.64 1057583540] main.so

@alexcrichton (Member, Author)

Locally the numbers I got were:

  • main.so is 1.06x to 1.08x faster than licm.so! - compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
  • main.so is 1.05x to 1.05x faster than licm.so! - compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
  • licm.so is 1.01x to 1.05x faster than main.so! - execution :: cycles :: benchmarks/bz2/benchmark.wasm

with everything else at "no difference"

@cfallin (Member) commented Mar 3, 2023

  • main.so is 1.06x to 1.08x faster than licm.so! - compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
  • main.so is 1.05x to 1.05x faster than licm.so! - compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

Hmm -- that's kind of unfortunate; I would have hoped for less impact. Probably increased regalloc time due to higher register pressure? (`perf diff` could tell us?)

If so, one way we could lessen the impact here is to apply the heuristic only to certain opcodes -- define a predicate called something like `Opcode::is_expensive_nullary()`, true only for `vconst` initially, and use that to conditionalize your change to the hoist level.

(That raises the question too: what other opcodes are being hoisted?)
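The predicate suggested above might look roughly like the following. This is a hypothetical sketch: the `Opcode` variants and the method itself are illustrative stand-ins, not the actual Cranelift API.

```rust
// Hypothetical sketch of the suggested `is_expensive_nullary`
// predicate. The enum and method are invented for illustration; they
// are not the real Cranelift `Opcode` type.
enum Opcode {
    Vconst,
    Iconst,
    F32const,
}

impl Opcode {
    /// True for argument-less instructions that are expensive to
    /// rematerialize and so worth hoisting out of a loop even at some
    /// cost in register pressure. Initially true only for `vconst`.
    fn is_expensive_nullary(&self) -> bool {
        matches!(self, Opcode::Vconst)
    }
}

fn main() {
    assert!(Opcode::Vconst.is_expensive_nullary());
    assert!(!Opcode::Iconst.is_expensive_nullary());
    assert!(!Opcode::F32const.is_expensive_nullary());
    println!("ok");
}
```

The LICM pass could then consult this predicate when computing the hoist level, leaving cheap nullary constants (like `iconst`) to be rematerialized as before.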

@alexcrichton (Member, Author)

Hm, ok, as I remeasure this I think I'm just seeing noise on my machine. The exact same `*.cwasm` is produced before and after this change for spidermonkey.wasm in the sightglass repository, and the input wasm module validates without the SIMD feature enabled, so there's definitely no `vconst` anywhere in there.

I know @jameysharp you mentioned a while back that you went to great lengths to measure small changes in performance; do you know if it'd be easy-ish to use that setup to measure a change like this?

@cfallin (Member) commented Mar 3, 2023

The exact same *.cwasm is produced before and after this change for spidermonkey.wasm in the sightglass repository,

Given that, I'm happy to see this merge then! "Exactly the same code produced" is about as rigorous an argument for "no perf impact" as one can make :-)

@alexcrichton alexcrichton reopened this Mar 6, 2023
@alexcrichton alexcrichton enabled auto-merge March 6, 2023 15:08
@alexcrichton alexcrichton added this pull request to the merge queue Mar 6, 2023
Merged via the queue into bytecodealliance:main with commit 18ee645 Mar 6, 2023
@alexcrichton alexcrichton deleted the licm-vconst branch March 6, 2023 16:03
@jameysharp (Contributor)

For future reference, my notes on low-variance profiling are at https://github.com/bytecodealliance/sightglass/blob/main/docs/cpu-isolation.md if that helps.
