Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Baseline compilation in Wasmtime #28

Merged

Conversation

saulecabrera
Copy link
Member

Rendered

This RFC proposes the addition of a new WebAssembly (Wasm) compiler to Wasmtime:
a single-pass or “baseline” compiler. A baseline compiler works to improve
overall compilation performance, yielding faster startup times, at the cost of
less optimized code.

This RFC describes:

  • The motivation for a baseline compiler
  • A high-level overview and description of the structure of the baseline
    compiler
  • A high-level overview of its integration with Wasmtime

cc @cfallin @fitzgen @alexcrichton @bnjbvr

This RFC proposes the addition of a new WebAssembly (Wasm) compiler to Wasmtime:
a single-pass or “baseline” compiler. A baseline compiler works to improve
overall compilation performance, yielding faster startup times, at the cost of
less optimized code.

This RFC describes:

* The motivation for a baseline compiler
* A high-level overview and description of the structure of the baseline
  compiler
* A high-level overview of its integration with Wasmtime
@alexcrichton
Copy link
Member

Personally this all sounds great to me, I'm excited to see this take shape! I'm think that reusing Cranelift backends is a great idea and I think the integration strategy chosen here (no tiering yet) is the right answer for now.

I suspect that various abstractions in Wasmtime which baseline compilation would want to use are not as optimal as they could be as-implemented today. We haven't focused much on compile time performance and Cranelift often dwarfs any cost in Wasmtime or other parts of the system, but I think having a baseline compiler push on these abstractions to optimize them is a great forcing function to have. Mostly just saying this as a word of caution that there may be some surprsing bottlenecks come time for performance measurements but it should all be easy enough to handle.

Lastly one thing I might recommend is if possible getting AArch64 support up and running as soon as possible. In such an architecture-specific context as a baseline compiler adding a second architecture I find can often radically change existing designs so getting that forcing function sooner rather than later may be helpful. That being said I don't think it's necessary to push on it immediately and if y'all would prefer to defer it to the second phase as you've mentioned here I also think that's ok.

Copy link
Member

@fitzgen fitzgen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exciting!

I think reusing Cranelift's instruction encoding but not vcode containers, and therefore avoiding any IR materialization between Wasm and machine code, is a great choice, practical, and complementary to Cranelift's position in the design space.

accepted/wasmtime-baseline-compilation.md Outdated Show resolved Hide resolved
accepted/wasmtime-baseline-compilation.md Outdated Show resolved Hide resolved
accepted/wasmtime-baseline-compilation.md Outdated Show resolved Hide resolved
cfallin added a commit to cfallin/meetings that referenced this pull request Aug 3, 2022
- Discussion of vector endianness (#4566)
- Notification of plans to factor out part of Cranelift's base layer to
  share with Winch, a baseline compiler (bytecodealliance/rfcs#28)
cfallin added a commit to cfallin/meetings that referenced this pull request Aug 3, 2022
- Discussion of vector endianness (bytecodealliance/wasmtime#4566)
- Notification of plans to factor out part of Cranelift's base layer to
  share with Winch, a baseline compiler (bytecodealliance/rfcs#28)
@bjorn3
Copy link
Contributor

bjorn3 commented Aug 4, 2022

This paper may be useful:

https://arxiv.org/abs/2011.13127: Copy-and-Patch Compilation: A fast compilation algorithm for high-level languages and bytecode

Our WebAssembly compiler generates code 4.9X-6.5X faster than Liftoff, the WebAssembly baseline compiler in Google Chrome. The generated code also outperforms Liftoff's by 39%-63% on the Coremark and PolyBenchC WebAssembly benchmarks.

In the future if an interpreter tier is added below the baseline compiler the following is also interesting:

https://arxiv.org/abs/2205.01183: A fast in-place interpreter for WebAssembly

Our evaluation shows that in-place interpretation of Wasm code is space-efficient and fast, achieving performance on-par with interpreting a custom-designed internal format.

Move initial aarch64 support to phase 1 and move reference types to
phase 2
@cfallin
Copy link
Member

cfallin commented Aug 4, 2022

This paper may be useful:

https://arxiv.org/abs/2011.13127: Copy-and-Patch Compilation: A fast compilation algorithm for high-level languages and bytecode

Our WebAssembly compiler generates code 4.9X-6.5X faster than Liftoff, the WebAssembly baseline compiler in Google Chrome. The generated code also outperforms Liftoff's by 39%-63% on the Coremark and PolyBenchC WebAssembly benchmarks.

The paper is indeed really interesting, but the approach it takes is not terribly compatible with the goals that we have; in particular we don't want to depend on a C++ compiler and template specialization, and relocation magic, and then gluing together the machine code that comes from that separate offline process, to generate the actual bits of code. An equally important goal of the baseline compiler is to be straightforward to understand and modify; as such, a very simple "emit these three instructions for this Wasm opcode" mapping is ideal for maintainability.

@saulecabrera
Copy link
Member Author

During today's wasmtime meeting, we discussed @fitzgen's and @alexcrichton's feedback around aarch64 support; we concluded that lowering the priority of Reference Types and bumping the priority of supporting at least a subset of aarch64 is probably the best path forward for the initial implementation. I've made the necessary changes to reflect this decision.

@bjorn3
Copy link
Contributor

bjorn3 commented Aug 5, 2022

in particular we don't want to depend on a C++ compiler and template specialization, and relocation magic, and then gluing together the machine code that comes from that separate offline process, to generate the actual bits of code.

Would it be possible to replace C++ with Cranelift IR? It could then be compiled at build time and at runtime.

@saulecabrera
Copy link
Member Author

saulecabrera commented Aug 5, 2022

in particular we don't want to depend on a C++ compiler and template specialization, and relocation magic, and then gluing together the machine code that comes from that separate offline process, to generate the actual bits of code.

Would it be possible to replace C++ with Cranelift IR? It could then be compiled at build time and at runtime.

Personally I think that even if possible, this approach would still not be compatible with the design principles that we've laid out, mainly around (i) simplicity and (ii) no intermediate representation. In general I think that it might be worth thinking about keeping the implementation as simple and straightforward as possible.

@saulecabrera saulecabrera force-pushed the wasmtime-baseline-compilation branch from 2df33ed to 141a3c6 Compare August 5, 2022 12:59
Copy link
Member

@bnjbvr bnjbvr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm quite excited by this proposal! Having a baseline compiler would help with experimenting with new wasm proposals faster, more easily support some dev-facing features (e.g. debugging), and be quite useful in use-cases where compile-time speed matters more than runtime speed.

I've added a few comments that should help with clarity, but I think this proposal is sound as it stands right now. Thanks for writing this up.

accepted/wasmtime-baseline-compilation.md Outdated Show resolved Hide resolved
accepted/wasmtime-baseline-compilation.md Show resolved Hide resolved
accepted/wasmtime-baseline-compilation.md Outdated Show resolved Hide resolved
accepted/wasmtime-baseline-compilation.md Outdated Show resolved Hide resolved
accepted/wasmtime-baseline-compilation.md Outdated Show resolved Hide resolved
accepted/wasmtime-baseline-compilation.md Outdated Show resolved Hide resolved
accepted/wasmtime-baseline-compilation.md Outdated Show resolved Hide resolved
accepted/wasmtime-baseline-compilation.md Outdated Show resolved Hide resolved
accepted/wasmtime-baseline-compilation.md Outdated Show resolved Hide resolved
@jameysharp
Copy link

I want to expand on @bjorn3's suggestions. But, to be clear, I'm not going to argue that this alternative should replace the current RFC. If this RFC is the direction that you want to pursue, @saulecabrera, then I see no reason the rest of us shouldn't support you in that. That said, here's an alternative that I think has some advantages.

Imagine that for each WebAssembly instruction, we write down a sequence of Cranelift instructions to implement it, with holes for registers and labels. (To keep this short, I'm hand-waving things like control-flow.) That's basically what cranelift/wasm/src/code_translator.rs is, if you squint.

We currently get an optimizing WebAssembly compiler by filling those CLIF sequences with virtual registers from abstract interpretation, concatenating the snippets, and then handing the result off to Cranelift.

We could instead get a baseline compiler from those snippets of CLIF by compiling each snippet to native code in isolation, and then concatenating the compiled snippets. Compiling the snippets to native code can happen during the wasmtime build if desired. The number of variations we need to compile of each snippet is bounded by the number of physical registers used to shadow the wasm stack. This is essentially Copy-and-Patch, but using the compiler we already maintain instead of a C++ compiler.

And we could also get an interpreter by inserting those CLIF snippets into a dispatch loop, also written in CLIF, with register holes filled in by loads and stores to an explicit value stack. The "fast in-place interpreter" paper implemented the interpreter in native assembly because they couldn't control the generated code well enough from a high-level language, but I believe Cranelift IR gives us everything we'd need while being able to compile the interpreter for multiple architectures. (Chris tells me this resembles Spidermonkey's "baseline interpreter".)

I like this approach because we'd get three tiers, all supporting every target architecture, while only having to implement new instructions once. That appeals to me for giving us confidence that the tiers are all equally correct implementations.

@fitzgen
Copy link
Member

fitzgen commented Sep 1, 2022

Compiling the snippets to native code can happen during the wasmtime build if desired. The

I think this would be necessary to meet performance goals, since otherwise we would be materializing a bunch of IR that we wouldn't need to if we didn't do this approach.

Not clear what kind of impact this would have on the Wasmtime crate's compile times, but that is something I would be concerned about.

I like this approach because we'd get three tiers, all supporting every target architecture, while only having to implement new instructions once. That appeals to me for giving us confidence that the tiers are all equally correct implementations.

Yes, this is pretty nice. Tempting. I like using CLIF to side step using a C++ compiler and making a super hacky post-processing linker that chops off prologues and epilogues and patches control flow together. I like having control over the shape of the code that we would paste together, so we can be sure we aren't messing up edge cases.

But also, this is a bit more work to get to an initial baseline compiler implementation than what the RFC proposes.

@cfallin
Copy link
Member

cfallin commented Sep 1, 2022

I do like @jameysharp 's idea here, from first principles: it would be a nice way to be able to obtain a series of compilers from one source-of-truth. We should experiment with this in the future, I think.

The tradeoff space I see right now is, on the one hand, a "handwritten straightforward baseline compiler (this RFC) with the advantages of:

  • Simple to verify by looking: one sees exactly the instructions that will be emitted for any given opcode, locked down in the source and not indirectly from any other toolchain or generation process;
  • Simpler to add new opcodes to than Cranelift. While on the one hand if we assume that all features will be in Cranelift so leveraging this work is "free", on the other hand in practice engines that have a separate baseline tier (like SpiderMonkey) often add features to the baseline compiler first, because it's an order of magnitude simpler. This could allow us to build out early support for some features and experiment more freely before we work out the details of a high-performance, optimizing-tier implementation.
  • A separate implementation adds "verification diversity": we can differentially fuzz Cranelift against this new backend and possibly find issues.

On the other hand:

  • A copy-and-patch-like approach that stitches together Cranelift-compiled fragments can much more flexibly grow its "library" into many op-combining possibilities. This would be a natural design point between an op-by-op baseline compiler / template-JIT (this RFC) and a full optimizing tier. "Macro-op fusion" is still a single-pass approach so retains compile speed while possibly getting pretty decent isel.
  • If we assume that the Cranelift implementation is a sunk cost, then this does give us baseline compiler support for new features as we add them for "free". We just need to pay for the complexity and effort to build the framework once.

also, to amplify and add a bit to @fitzgen's point here

Not clear what kind of impact this would have on the Wasmtime crate's compile times, but that is something I would be concerned about.

I'd be concerned about build complexity as well: this implies a "bootstrap compiler" that is built during the compiler build process to compile another compiler (or data that is embedded into it); that's probably nontrivial cost. And it implies that Cranelift itself is at least a dev-dependency of the baseline compiler, which may or may not be desirable.

On balance I think I still favor the approach in this RFC, but these ideas are intriguing for a possible later exploration (so thanks again for raising them!).

@jameysharp
Copy link

Thanks, Nick and Chris! I generally agree with both of you, although I'd quibble about the value of implementation diversity: we already have plenty of wasm implementations to use for differential testing. I have more thoughts on engineering details such as build time and build complexity, but I don't want to sidetrack discussion on this RFC further. So unless consensus swings towards implementing the copy-and-patch thing, I'll leave it at that.

@saulecabrera
Copy link
Member Author

Thanks for expanding on this @jameysharp; I find your thoughts interesting here, mostly from the perspective of maximizing reusability and correctness. I must admit that one of the biggest unknowns for me, is the build-time complexity that this approach will/might introduce and what will be the ROI of that cost; in paper the ideas seem compelling, but I find it a bit hard to quantify in the scope of Wasmtime without more investment/investigation in that direction.

Expanding on @cfallin's point above around the simplicity to add new features:

One side-effect that I'm hoping to get with the current proposal (which is not explicit in the RFC and perhaps I should add it), is to make easier to add and maintain developer-facing features, the main feature that I have in mind is debugging. I feel that with the Copy-and-Patch approach, this will be more difficult to achieve (this is purely an assumption, so I might be wrong here).

On correctness: FWIW, we discussed (and the RFC states) integrating fuzzing in phase 1 to ensure a correctness-first approach; this doesn't void the fact that having a single source of truth is potentially less error-prone.

In conclusion my feeling is that switching the approach has the potential to bring in a set of desirable benefits, but we'd need to deal with more unknowns/complexity. The approach that we have today, offers a higher level of certainty in terms of maintenance, compilation and runtime performance in the scope of Wasmtime. That said, if there's a way to leave the door open to incrementally investigate this down the road, on top of the current proposal, I think we should consider it.

@jameysharp
Copy link

In conclusion my feeling is that switching the approach has the potential to bring in a set of desirable benefits, but we'd need to deal with more unknowns/complexity. The approach that we have today, offers a higher level of certainty in terms of maintenance, compilation and runtime performance in the scope of Wasmtime.

I agree. My guess is that copy-and-patch would let us maintain less total code in the end, but compared to the RFC, it's definitely not as clear how we get from here to there.

That said, if there's a way to leave the door open to incrementally investigate this down the road, on top of the current proposal, I think we should consider it.

I don't think there's anything about the current RFC that makes it hard to do something like copy-and-patch later. As long as Wasmtime has a path that still uses Cranelift, we can try slicing it up any time. We absolutely don't have to choose one approach or the other.

On the other hand, I don't think the work outlined in this RFC helps much with implementing copy-and-patch. So, if folks felt like that was something we want to do in the short term, then we could save some effort overall by going straight there. But since copy-and-patch has more uncertainty and engineering risk, I think it's reasonable to follow the RFC's plan first.

One side-effect that I'm hoping to get with the current proposal (which is not explicit in the RFC and perhaps I should add it), is to make easier to add and maintain developer-facing features, the main feature that I have in mind is debugging.

I think that would be a great note to add to the RFC, as well as these points that Chris made:

  • Simple to verify by looking: one sees exactly the instructions that will be emitted for any given opcode, locked down in the source and not indirectly from any other toolchain or generation process;

  • Simpler to add new opcodes to than Cranelift. [...] in practice engines that have a separate baseline tier (like SpiderMonkey) often add features to the baseline compiler first, because it's an order of magnitude simpler. This could allow us to build out early support for some features and experiment more freely before we work out the details of a high-performance, optimizing-tier implementation.

@saulecabrera
Copy link
Member Author

saulecabrera commented Jan 18, 2023

The discussion here has stabilized and some considerable pieces of the compiler have landed in Wasmtime. With this, I would like to make a

Motion to finalize with disposition to merge

I've updated the stakeholder list to contain all stakeholders that have participated in Winch related discussions.

Embark Studios

Fastly

Intel

Shopify

Unaffiliated

@saulecabrera
Copy link
Member Author

Given the approvals, I'll move the RFC to the 10-day final comment period which spans until January 28th, 2023.

@cfallin
Copy link
Member

cfallin commented Jan 30, 2023

The FCP has elapsed without any objections; therefore, we're now merging this RFC! Thanks @saulecabrera for defining this and doing the implementation work so far; it's exciting to see.

@cfallin cfallin merged commit 3f0ae11 into bytecodealliance:main Jan 30, 2023
@saulecabrera saulecabrera deleted the wasmtime-baseline-compilation branch January 30, 2023 18:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants