Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

build: add "fastest-runtime" profile for runtime optimization #8554

Merged
merged 1 commit into from
May 6, 2024

Conversation

dundargoc
Copy link
Contributor

@dundargoc dundargoc commented May 5, 2024

This is extremely useful for cases where the default optimizations just
are not enough.

Background: neovim is interested to
add wasmtime support in neovim/neovim#28415 but
we noticed that including wasmtime, even when not using wasmtime
directly, heavily affects runtime performance. This is not only
reflected in the increased startuptime but affects the runtime
performance overall.

Here are the benchmarks for startuptimes for different configurations.
Important to note is that all of runtime is affected, but the
startuptime is a decent proxy to measure runtime performance:

No wasm
  Time (mean ± σ):      50.5 ms ±   1.5 ms    [User: 32.8 ms, System: 12.3 ms]
  Range (min … max):    48.3 ms …  54.4 ms    56 runs

Wasm, lto=thin
  Time (mean ± σ):     104.9 ms ±   3.5 ms    [User: 86.5 ms, System: 12.7 ms]
  Range (min … max):    99.5 ms … 111.1 ms    26 runs

Wasm, lto=true
  Time (mean ± σ):      83.8 ms ±   2.5 ms    [User: 65.8 ms, System: 12.1 ms]
  Range (min … max):    80.5 ms …  93.3 ms    31 runs

Wasm, lto=true, strip="none", incremental=false, codegen-units=1, panic="abort"
  Time (mean ± σ):      53.1 ms ±   1.0 ms    [User: 35.5 ms, System: 12.5 ms]
  Range (min … max):    50.6 ms …  55.5 ms    54 runs

@dundargoc dundargoc requested review from a team as code owners May 5, 2024 12:16
@dundargoc dundargoc requested review from fitzgen and removed request for a team May 5, 2024 12:16
Cargo.toml Outdated Show resolved Hide resolved
Cargo.toml Outdated Show resolved Hide resolved
Cargo.toml Outdated Show resolved Hide resolved
@dundargoc dundargoc force-pushed the build/performance branch 2 times, most recently from 93a0b10 to 01781b2 Compare May 5, 2024 13:22
@dundargoc dundargoc requested a review from bjorn3 May 5, 2024 15:07
@dundargoc dundargoc force-pushed the build/performance branch from 01781b2 to 1d88acc Compare May 5, 2024 15:57
@github-actions github-actions bot added the wasmtime:c-api Issues pertaining to the C API. label May 5, 2024
@alexcrichton
Copy link
Member

Thanks for the PR here! Is it possible perhaps for CMake to set environment variables for other steps? That's probably the best way to do this rather than adding a static profile. That way you'd be able to set CARGO_PROFILE_RELEASE_* to appropriate values and it can be customized per-project. Or perhaps could knobs be added to the CMake project in Wasmtime to set the env vars?

Otherwise though I also wanted to poke a bit about what the benchmark is. This is just starting a program that's not actually using Wasmtime at all? It's just changing how Wasmtime itself is linked in? If so it sounds like the main issue is that Wasmtime has lots of symbols and they may be all getting processed by the dynamic linker at startup. LTO will reduce the number of symbols, but there should be other knobs to internalize symbols in an executable or dynamic library such that LTO wouldn't be required to get the same performance benefits.

@clason
Copy link

clason commented May 6, 2024

Thanks for the PR here! Is it possible perhaps for CMake to set environment variables for other steps? That's probably the best way to do this rather than adding a static profile. That way you'd be able to set CARGO_PROFILE_RELEASE_* to appropriate values and it can be customized per-project.

That's what we did before, but we thought that the preset could be used by other projects with similar needs.

Otherwise though I also wanted to poke a bit about what the benchmark is. This is just starting a program that's not actually using Wasmtime at all? It's just changing how Wasmtime itself is linked in?

That is correct.

If so it sounds like the main issue is that Wasmtime has lots of symbols and they may be all getting processed by the dynamic linker at startup. LTO will reduce the number of symbols, but there should be other knobs to internalize symbols in an executable or dynamic library such that LTO wouldn't be required to get the same performance benefits.

This is our thinking as well; if you know of such knobs, we'd be more than happy for the hint. Our own interface to wasmtime is extremely small (but we do link to tree-sitter, which uses more of it).

@alexcrichton
Copy link
Member

Ok sounds good, and a preset also sounds good to me!

Is this on macOS, Linux, Windows, or all three? This would likely be flags to the linker invocation that creates the neovim executable itself which would be along the lines of "only export this list of symbols and no more" and that would internalize everything else. Is it possible to construct a list like that, for example is main the only symbol that's necessary? Or are plugins dlopen'd which need to reach into internal symbols?

@clason
Copy link

clason commented May 6, 2024

All three and more (as many as we can build on), with a wide range of compilers. We use only a very limited set (basically, wasm_engine_{new,delete}), but we link to tree-sitter which uses more (which I don't know off the top of my head).

@alexcrichton
Copy link
Member

Oh sorry, to clarify, is the performance regression you're seeing on all platforms? Or only some?

@clason
Copy link

clason commented May 6, 2024

So far we have mostly tested with macOS; of course the exact impact varies with platform and toolchain.

@alexcrichton
Copy link
Member

I'm trying to test my theory with a small script but it's not too successful. The big binaries are slower but they don't get faster when removing the symbols so I don't think this going to bear fruit.

Regardless though I think it's fine to land this, but on reviewing it I think it might be best to call this "lto" in the various places instead of "optimized". I want to avoid confusion between "optimized" and --release builds since both are runtime-wise pretty equally optimized, and "lto" is the main differentiating factor here.

@clason
Copy link

clason commented May 6, 2024

I appreciate your testing! Not married to the name, of course, but note that full LTO is only half the game here -- you also need the other settings for the "optimal" benefit. Unless you consider codegen-units to be part of LTO? Then it'd obviously make sense.

@alexcrichton
Copy link
Member

That's true yeah. I suppose I'm mostly looking for a name that evokes something like "pull out all the stops on optimizing this I'm willing to wait for an hour". That typically includes lto and codegen-units=1. I personally consider "lto" as "take your time please optimize this as much as you can" but that's also just me.

Also, question for y'all, does codegen-units=1 help much here in isolation? If that setting is dropped do you see a difference in perf? I ask because that should only affect generated code quality, not symbol visibility. If codegen-units=1 does help then the mystery deepens in my head for what's going on here (PR can land either way of course, I'm just curious now!)

@clason
Copy link

clason commented May 6, 2024

Yes it does! (We benchmarked each option -- @dundargoc could you edit the PR description to add the individual steps?)

And it's very weird -- but then I don't know any other large C project that statically links a Rust dependency (and we have been quite reticent about it im the past).

If LTO has the full connotation for you, that's fine with us of course!

@alexcrichton
Copy link
Member

Odd! Well in any case happy to r+ and flag for merge with a rename.

Otherwise though if y'all have a series of instructions to reproduce this slowdown I'd be happy to try to dig further in my spare time later.

@dundargoc
Copy link
Contributor Author

dundargoc commented May 6, 2024

This is the benchmarking we did to determine which of the flags seemed to make a difference:

--release
  Time (mean ± σ):     132.1 ms ±   3.3 ms    [User: 114.5 ms, System: 12.3 ms]
  Range (min … max):   122.9 ms … 137.0 ms    22 runs

lto=true
  Time (mean ± σ):      83.5 ms ±   1.9 ms    [User: 66.3 ms, System: 12.0 ms]
  Range (min … max):    80.1 ms …  86.8 ms    34 runs

lto=true, codegen-units=1
  Time (mean ± σ):      80.7 ms ±   1.9 ms    [User: 63.6 ms, System: 11.9 ms]
  Range (min … max):    77.3 ms …  84.3 ms    34 runs

lto=true, codegen-units=1, strip="none"
  Time (mean ± σ):      80.9 ms ±   2.0 ms    [User: 64.0 ms, System: 12.0 ms]
  Range (min … max):    77.1 ms …  84.7 ms    34 runs

lto=true, codegen-units=1, strip="none", panic="abort"
  Time (mean ± σ):      52.0 ms ±   1.2 ms    [User: 35.2 ms, System: 11.8 ms]
  Range (min … max):    49.9 ms …  56.4 ms    51 runs

lto=true, codegen-units=1, strip="none", panic="abort", incremental=false
  Time (mean ± σ):      52.9 ms ±   1.1 ms    [User: 35.4 ms, System: 12.0 ms]
  Range (min … max):    50.4 ms …  54.9 ms    52 runs

So from this we were able to see that lto and panic makes the biggest difference, and that codegen-units=1 does make a small yet noticable difference as well. Note that @clason noticed a bigger difference for codegen-units=1 on his intel macos, so it could be platform-dependent as well.


As for the profile name, I am fine with whatever. lto is just fine. For reference, we got these flags from cargo-wizard and the configuration for this was called fast-runtime. So perhaps that is also a viable profile name? In any case, I'm fine with either, just lemme know what you want me to change it to.

@dundargoc
Copy link
Contributor Author

Oh, right. Reproduction steps. We build neovim in this PR with the following command:

make distclean #just to remove old builds
make CMAKE_EXTRA_FLAGS=-DENABLE_WASMTIME=ON DEPS_CMAKE_FLAGS=-DENABLE_WASMTIME=ON CMAKE_BUILD_TYPE=Release
cmake --install build --prefix bin
hyperfine --warmup 2 "bin/bin/nvim +'quit'"

I point to a local path of wasmtime in cmake.deps/deps.txt so it says

WASMTIME_URL /Users/dundargoc/programs/wasmtime

It's not the most graceful reproductions steps I'm afraid but that's the current tech we're working with at the moment 😅

@alexcrichton
Copy link
Member

"fast-runtime" sounds reasonable to me, although I might bikeshed to "fastest-runtime" perhaps to help head off confusion between the default --release and "fast-runtime" where they're both supposed to be fast

@alexcrichton
Copy link
Member

And no worries! I'll try to kick the tires there and see if I can't bottom something out as to how Wasmtime is affecting things

This is extremely useful for cases where the default optimizations just
are not enough.

Background: [neovim](https://github.com/neovim/neovim) is interested to
add wasmtime support in neovim/neovim#28415 but
we noticed that including wasmtime, even when not using wasmtime
directly, heavily affects runtime performance. This is not only
reflected in the increased startuptime but affects the runtime
performance overall.

Here are the benchmarks for startuptimes for different configurations.
Important to note is that all of runtime is affected, but the
startuptime is a decent proxy to measure runtime performance:

```
No wasm
  Time (mean ± σ):      50.5 ms ±   1.5 ms    [User: 32.8 ms, System: 12.3 ms]
  Range (min … max):    48.3 ms …  54.4 ms    56 runs

Wasm, lto=thin
  Time (mean ± σ):     104.9 ms ±   3.5 ms    [User: 86.5 ms, System: 12.7 ms]
  Range (min … max):    99.5 ms … 111.1 ms    26 runs

Wasm, lto=true
  Time (mean ± σ):      83.8 ms ±   2.5 ms    [User: 65.8 ms, System: 12.1 ms]
  Range (min … max):    80.5 ms …  93.3 ms    31 runs

Wasm, lto=true, strip="none", incremental=false, codegen-units=1, panic="abort"
  Time (mean ± σ):      53.1 ms ±   1.0 ms    [User: 35.5 ms, System: 12.5 ms]
  Range (min … max):    50.6 ms …  55.5 ms    54 runs
```
@dundargoc dundargoc force-pushed the build/performance branch from 1d88acc to 5d06b1f Compare May 6, 2024 22:57
@dundargoc dundargoc changed the title build: add "optimized" profile for runtime optimization build: add "fastest-runtime" profile for runtime optimization May 6, 2024
@alexcrichton alexcrichton added this pull request to the merge queue May 6, 2024
Merged via the queue into bytecodealliance:main with commit ddde6f6 May 6, 2024
24 checks passed
@dundargoc dundargoc deleted the build/performance branch May 7, 2024 09:27
dundargoc added a commit to dundargoc/wasmtime that referenced this pull request May 7, 2024
…dealliance#8554)

This is extremely useful for cases where the default optimizations just
are not enough.

Background: [neovim](https://github.com/neovim/neovim) is interested to
add wasmtime support in neovim/neovim#28415 but
we noticed that including wasmtime, even when not using wasmtime
directly, heavily affects runtime performance. This is not only
reflected in the increased startuptime but affects the runtime
performance overall.

Here are the benchmarks for startuptimes for different configurations.
Important to note is that all of runtime is affected, but the
startuptime is a decent proxy to measure runtime performance:

```
No wasm
  Time (mean ± σ):      50.5 ms ±   1.5 ms    [User: 32.8 ms, System: 12.3 ms]
  Range (min … max):    48.3 ms …  54.4 ms    56 runs

Wasm, lto=thin
  Time (mean ± σ):     104.9 ms ±   3.5 ms    [User: 86.5 ms, System: 12.7 ms]
  Range (min … max):    99.5 ms … 111.1 ms    26 runs

Wasm, lto=true
  Time (mean ± σ):      83.8 ms ±   2.5 ms    [User: 65.8 ms, System: 12.1 ms]
  Range (min … max):    80.5 ms …  93.3 ms    31 runs

Wasm, lto=true, strip="none", incremental=false, codegen-units=1, panic="abort"
  Time (mean ± σ):      53.1 ms ±   1.0 ms    [User: 35.5 ms, System: 12.5 ms]
  Range (min … max):    50.6 ms …  55.5 ms    54 runs
```
@alexcrichton
Copy link
Member

I did a bit of debugging and unfortunately I can't figure out what's going on. The slowdown seems related to symbols but also not related to symbols. I suspect there's bits-and-pieces of the mach-o format that I don't understand which are coming into play here. I couldn't reproduce this slowdown on Linux, only on macOS, and profiling there I find a bit more difficult through instruments rather than perf. Alas!

alexcrichton pushed a commit that referenced this pull request May 7, 2024
* fix(c-api): give a cleaner error message if cargo isn't found (#8497)

Instead of

> Performing build step for
'wasmtime-crate''WASMTIME_CARGO_BINARY-NOTFOUND' is not recognized as an
internal or external command, operable program or batch file.

this will now instead output

> "cargo" was not found. Ensure "cargo" is in PATH. Aborting...

* c-api: use `--release` when MinSizeRel and RelWithDebInfo is used (#8549)

* build: add "fastest-runtime" profile for runtime optimization (#8554)

This is extremely useful for cases where the default optimizations just
are not enough.

Background: [neovim](https://github.com/neovim/neovim) is interested to
add wasmtime support in neovim/neovim#28415 but
we noticed that including wasmtime, even when not using wasmtime
directly, heavily affects runtime performance. This is not only
reflected in the increased startuptime but affects the runtime
performance overall.

Here are the benchmarks for startuptimes for different configurations.
Important to note is that all of runtime is affected, but the
startuptime is a decent proxy to measure runtime performance:

```
No wasm
  Time (mean ± σ):      50.5 ms ±   1.5 ms    [User: 32.8 ms, System: 12.3 ms]
  Range (min … max):    48.3 ms …  54.4 ms    56 runs

Wasm, lto=thin
  Time (mean ± σ):     104.9 ms ±   3.5 ms    [User: 86.5 ms, System: 12.7 ms]
  Range (min … max):    99.5 ms … 111.1 ms    26 runs

Wasm, lto=true
  Time (mean ± σ):      83.8 ms ±   2.5 ms    [User: 65.8 ms, System: 12.1 ms]
  Range (min … max):    80.5 ms …  93.3 ms    31 runs

Wasm, lto=true, strip="none", incremental=false, codegen-units=1, panic="abort"
  Time (mean ± σ):      53.1 ms ±   1.0 ms    [User: 35.5 ms, System: 12.5 ms]
  Range (min … max):    50.6 ms …  55.5 ms    54 runs
```
alexcrichton pushed a commit to alexcrichton/wasmtime that referenced this pull request May 7, 2024
…dealliance#8554)

This is extremely useful for cases where the default optimizations just
are not enough.

Background: [neovim](https://github.com/neovim/neovim) is interested to
add wasmtime support in neovim/neovim#28415 but
we noticed that including wasmtime, even when not using wasmtime
directly, heavily affects runtime performance. This is not only
reflected in the increased startuptime but affects the runtime
performance overall.

Here are the benchmarks for startuptimes for different configurations.
Important to note is that all of runtime is affected, but the
startuptime is a decent proxy to measure runtime performance:

```
No wasm
  Time (mean ± σ):      50.5 ms ±   1.5 ms    [User: 32.8 ms, System: 12.3 ms]
  Range (min … max):    48.3 ms …  54.4 ms    56 runs

Wasm, lto=thin
  Time (mean ± σ):     104.9 ms ±   3.5 ms    [User: 86.5 ms, System: 12.7 ms]
  Range (min … max):    99.5 ms … 111.1 ms    26 runs

Wasm, lto=true
  Time (mean ± σ):      83.8 ms ±   2.5 ms    [User: 65.8 ms, System: 12.1 ms]
  Range (min … max):    80.5 ms …  93.3 ms    31 runs

Wasm, lto=true, strip="none", incremental=false, codegen-units=1, panic="abort"
  Time (mean ± σ):      53.1 ms ±   1.0 ms    [User: 35.5 ms, System: 12.5 ms]
  Range (min … max):    50.6 ms …  55.5 ms    54 runs
```
alexcrichton added a commit that referenced this pull request May 7, 2024
…#8575)

This is extremely useful for cases where the default optimizations just
are not enough.

Background: [neovim](https://github.com/neovim/neovim) is interested to
add wasmtime support in neovim/neovim#28415 but
we noticed that including wasmtime, even when not using wasmtime
directly, heavily affects runtime performance. This is not only
reflected in the increased startuptime but affects the runtime
performance overall.

Here are the benchmarks for startuptimes for different configurations.
Important to note is that all of runtime is affected, but the
startuptime is a decent proxy to measure runtime performance:

```
No wasm
  Time (mean ± σ):      50.5 ms ±   1.5 ms    [User: 32.8 ms, System: 12.3 ms]
  Range (min … max):    48.3 ms …  54.4 ms    56 runs

Wasm, lto=thin
  Time (mean ± σ):     104.9 ms ±   3.5 ms    [User: 86.5 ms, System: 12.7 ms]
  Range (min … max):    99.5 ms … 111.1 ms    26 runs

Wasm, lto=true
  Time (mean ± σ):      83.8 ms ±   2.5 ms    [User: 65.8 ms, System: 12.1 ms]
  Range (min … max):    80.5 ms …  93.3 ms    31 runs

Wasm, lto=true, strip="none", incremental=false, codegen-units=1, panic="abort"
  Time (mean ± σ):      53.1 ms ±   1.0 ms    [User: 35.5 ms, System: 12.5 ms]
  Range (min … max):    50.6 ms …  55.5 ms    54 runs
```

Co-authored-by: dundargoc <33953936+dundargoc@users.noreply.github.com>
carlocab added a commit to Homebrew/homebrew-core that referenced this pull request Sep 8, 2024
This should lead to significat speedups for users. See
bytecodealliance/wasmtime#8554.

While we're here, let's clean up the `cmake` invocation a bit.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
wasmtime:c-api Issues pertaining to the C API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants