Hand off code to a preinstalled optimized runtime if available #2
This seems like it should be pretty straightforward to implement. You just need some way to RPC with the tool... which could just be passing token streams to STDIN and reading output / errors from STDOUT / STDERR. It would also be possible to use an entirely different runtime for this. Oh, also: the tool should have some form of version check built in. I might be able to poke at this next weekend.
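As a rough sketch of that stdin/stdout handoff idea (the `watt-runtime` binary name and the text-based token stream encoding here are assumptions for illustration, not an actual watt interface):

```rust
use std::io::Write;
use std::process::{Command, Stdio};

/// Spawn a preinstalled optimized runtime, feed it the macro input on stdin,
/// and read the expansion back from stdout. Errors would arrive on stderr.
fn expand_via_external_runtime(wasm_path: &str, input: &str) -> std::io::Result<String> {
    let mut child = Command::new("watt-runtime") // hypothetical binary name
        .arg(wasm_path)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()?;

    // Write the serialized token stream, then drop stdin so the child sees EOF.
    if let Some(mut stdin) = child.stdin.take() {
        stdin.write_all(input.as_bytes())?;
    }

    let output = child.wait_with_output()?;
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}
```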
I am on board with using a JIT runtime for the precompiled one, but we should make sure that it caches the JIT artifacts. In typical usage you might invoke the same macro many times, and we don't want the JIT to need to run on the same code more than once.
wasmtime does indeed have a code cache, fwiw. +cc @sunfishcode
First wanted to say thanks for exploring this space @dtolnay, this is all definitely super useful user experience for eventual stabilization in rustc/cargo themselves! On the topic of an optimized runtime, I'd probably discourage making a watt-specific runtime, since running WebAssembly at speed in the limit of time can be a very difficult project to keep up with. WebAssembly is evolving (albeit somewhat slowly) and as rustc/LLVM keep up it might be a pain to have another runtime to keep up to date and all. Would you be up for having some exploration done to see if an existing general-purpose runtime like wasmtime could be used here? On a technical level it should be possible, using wasi APIs to communicate either with stdin/stdout or files. With wasi/wasmtime it's still somewhat early days, so we can add features there too as necessary! I wouldn't mind setting aside some time to investigate all this if this all sounds reasonable to you @dtolnay?
What I would have in mind by a watt-specific runtime isn't a whole new implementation of WebAssembly from scratch, but some existing maintained runtime like wasmtime wrapped with any additional proc macro specific logic we want compiled in release mode. Maybe that additional logic is nothing and we can use a vanilla wasmtime binary -- I just want to make sure we are running as little as possible in our debug-mode shim because the performance difference is extreme. @alexcrichton what you wrote sounds reasonable to me and I would love if you had time to investigate further. Thanks!
I think an amazing milestone would be when proc macros built for Watt running in Wasmtime are faster than natively compiled proc macros in a typical debug build.
Question: would it be possible to just bundle platform-specific binaries with Watt? You could make a bunch of per-platform crates, each shipping a prebuilt runtime.
I've settled on a strategy where my thinking is that, to at least prove this out, I'm going to attempt to dlopen an optimized runtime shared library. @dtolnay do you have some benchmarks in mind already to game out? Or are you thinking of "let's just compile some crates with serde things"?
A good benchmark to start off would be `derive(Deserialize)` on some simple struct with 6 fields, using the wasm file published in `wa-serde-derive`.
@alexcrichton do you know if it would be possible to bundle platform-specific prebuilt runtime binaries?
Ok this ended up actually being a lot easier than I thought it would be! Note that I don't actually have a "fallback mode" back to the interpreter, I just changed the whole crate and figured that if this panned out we could figure out how to have source code that simultaneously supports both later. The jit code all lives on this branch, but it's a bit of a mess and requires some extra setup to use. I compiled this test program:

```rust
#![allow(dead_code)]

#[derive(wa_serde_derive::Deserialize)]
struct Foo {
    a: f32,
    b: String,
    c: (String, i32),
    d: Option<u32>,
    e: u128,
    f: f64,
}

#[derive(serde_derive::Deserialize)]
struct Bar {
    a: f32,
    b: String,
    c: (String, i32),
    d: Option<u32>,
    e: u128,
    f: f64,
}

fn main() {
    println!("Hello, world!");
}
```

I also instrumented a local checkout of `watt` to time each phase of expansion.
The breakdown of the jit showed the bulk of the time going to instantiating/compiling the wasm module, with only a small slice spent actually executing the macro. Next I was curious about the compile time for the entire project. Here I just included one of the impls above and measured the compile time of the whole test crate, and finally the compile time of the macro dependencies themselves.
Some conclusions: overall this seems promising! Not quite ready for prime time (but then again none of this really is per se), but I think this is a solid path forward. @kazimuth sorry, meant to reply earlier but forgot! I do think we can certainly distribute precompiled runtime binaries as an option too.
Ok dug in a bit more with the help of some wasmtime folks. The wasmtime crate has support for a local code cache on your system, keyed off basically the checksum of the wasm module blob (afaik). That code cache vastly accelerates the instantiation phase since no compilation needs to happen. Above, an instantiation on my machine took 700ms or so, but with the cache enabled it takes 45ms. That means with a cached module, expansion as a whole takes 65.97ms, which splits between loading the cache (45ms), calling the exported function (16ms), creating the import map (3ms), and various bits elsewhere. It looks like loading the cache isn't going to be easy to improve much given where its 45ms goes.
This also doesn't take into account execution time of the macro, which is still slower than the debug mode version, clocking in at 20-24ms vs the 9ms for serde in debug mode. My read from this is that we'll want to heavily cache things (wasmtime's cache, caching instances in-process for lots of expansions, etc.).
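For reference, a minimal sketch of turning on Wasmtime's on-disk compilation cache from the Rust API, so repeated expansions of the same wasm blob skip recompilation. This assumes the `wasmtime` and `anyhow` crates; the long-standing `cache_config_load_default` method is used here, though the configuration API has shifted somewhat across wasmtime versions:

```rust
use wasmtime::{Config, Engine, Module};

fn load_module(wasm_bytes: &[u8]) -> anyhow::Result<Module> {
    let mut config = Config::new();
    // Use the default cache location, keyed off a hash of the module bytes.
    config.cache_config_load_default()?;
    let engine = Engine::new(&config)?;
    // On a cache hit this loads previously compiled code instead of running
    // Cranelift again, which is where the 700ms -> 45ms difference comes from.
    Module::new(&engine, wasm_bytes)
}
```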
That's awesome! I am away at Rust Belt Rust until Sunday so I won't have a lot of time to engage with this until later, but I would be happy to start landing changes in this repo where it makes sense, for example all the simplified signatures in sym.rs in 5925e60. I've added @alexcrichton as a collaborator.
Adding my two cents regarding the Wasmtime cache system: there are a few implementation details (compression of the cached artifacts, how the internal data structures get serialized) that might slightly affect the performance. I'll take a look at the SecondaryMap serialization.
@alexcrichton when I was considering if the Wasmtime cache needs compression, the uncompressed cache had some places with really low entropy. I haven't investigated it, but my guess was that it comes from how the SecondaryMaps get serialized.
Thanks for the info @mrowqa! It's good to know that we've got a lot of knobs if necessary when playing around with the cache here, and we can definitely investigate them going forward! One of my main worries at this point for any viability whatsoever is understanding why the execution of a wasm optimized procedural macro is 2x slower than the execution of the native unoptimized version.
Sorry for the radio silence here, I haven't forgotten about this. I still want to dig in more to investigate the performance of wasm code vs not. It's gonna take some more time though, I haven't had a chance to start.
Some of the poor performance may be caused by the shape of the wasm/native ffi boundary. For example, until #10, strings were copied into wasm byte-by-byte. As string passing is used frequently to convert things like identifiers and literals to and from text, that adds up quickly. It might also be desirable to use a fork of `proc-macro2` inside the wasm blob so more of the work happens without crossing the boundary.
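To illustrate the bulk-copy alternative to byte-by-byte string passing, here is a hedged guest-side sketch; the exported names (`watt_alloc`, `watt_string_in`) are hypothetical, not watt's real ABI. The host calls the exported allocator, writes the whole string into linear memory in one copy, and then passes a pointer/length pair instead of making one host call per byte:

```rust
#[no_mangle]
pub extern "C" fn watt_alloc(len: usize) -> *mut u8 {
    // Hand the host a buffer to fill; `into_boxed_slice` keeps len == capacity,
    // which makes the later `Vec::from_raw_parts(ptr, len, len)` sound.
    let buf = vec![0u8; len].into_boxed_slice();
    Box::into_raw(buf) as *mut u8
}

#[no_mangle]
pub unsafe extern "C" fn watt_string_in(ptr: *mut u8, len: usize) -> u32 {
    // Reconstitute the bytes the host wrote in a single bulk copy.
    let bytes = Vec::from_raw_parts(ptr, len, len);
    let s = String::from_utf8(bytes).expect("host sent invalid utf-8");
    // ... intern `s` somewhere and return a handle to it; 0 is a placeholder ...
    let _ = s;
    0
}
```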
Ok back to some benchmarking. This is based on #11 to gather timing information, so it rules out the issue of bulk-data transfers. The benchmark here is:

```rust
#[derive(Serialize)]
struct S(f32, f32, f32, /* 1000 `f32` fields in total .. */);
```

Here's the timings I'm getting:
So it looks like almost all the time is spent in the imported functions. Taking a look at those with some instrumentation we get:
My conclusion from this is that there's probably lower-hanging fruit than further optimizing the wasm runtime. It appears that we basically get all the bang for the buck necessary with wasmtime as-is. @dtolnay or @mystor, do you have ideas, perhaps looking at this profile, of ways that the watt APIs could be improved?
I should also mention that for this benchmark the interpreter takes 10.99s in debug mode and 1.15s in release mode. If the runtime API calls are themselves optimized then I think it'd definitely be apparent that (as expected) the JIT is at least one order of magnitude faster than the interpreter, if not multiple (debug ~10s in wasm vs 77ms, and release ~500ms in wasm vs 48.29ms).
Wow this is great! Question about the "time in wasm" measurements -- how come there is a 60% difference between debug mode (78ms) and release mode (48ms)? Shouldn't everything going on inside the JIT runtime be the same between those two? Is it including some part of the overhead from the hostfunc calls?
I agree. My first thought for optimizing the boundary is: right now we are routing every proc_macro API call individually out of the JIT. It would be good to experiment with how not to do that. For example we could provide a wasm-compiled version of proc-macro2's fallback implementation that we hand off together with the caller's wasm into the JIT, such that the caller's macro runs against the emulated proc macro library and not real proc_macro calls. Then when their macro returns we translate the resulting emulated TokenStream into a proc_macro::TokenStream.

Basically the only tricky bit is indexing all the spans in the input and remapping each output span into which one of the input spans it corresponds to. The emulated Span type would hold just an index into our list of all input spans.

I believe this would be a large performance improvement because native serde_derive executes in 163ms while wa-serde-derive spends 912ms in hostfuncs -- the effect of this redesign would be that all our hostfunc time is replaced by doing a subset of the work that native serde_derive does, so I would expect the time for the translation from emulated TokenStream to real TokenStream to be less than 163ms in debug mode.
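A rough sketch of the span bookkeeping described above, on the host (proc macro) side of the boundary; the type and field names are hypothetical. Every span seen in the input gets an index, the emulated library inside wasm only ever carries that index around, and on the way out each index is mapped back to a real `proc_macro::Span`:

```rust
struct SpanInterner {
    spans: Vec<proc_macro::Span>,
}

#[derive(Copy, Clone)]
struct EmulatedSpan {
    index: u32, // position in SpanInterner::spans
}

impl SpanInterner {
    // Record a real span from the input and hand out its index.
    fn intern(&mut self, span: proc_macro::Span) -> EmulatedSpan {
        self.spans.push(span);
        EmulatedSpan { index: (self.spans.len() - 1) as u32 }
    }

    // Map an emulated span in the macro output back to the real input span.
    fn resolve(&self, span: EmulatedSpan) -> proc_macro::Span {
        self.spans[span.index as usize]
    }
}
```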
Yeah I was sort of perplexed at that myself. I did a quick check though and nothing appears awry, so it's either normal timing differences (30ms even is basically just variance unless you run it a bunch of times) or, as you mentioned, the various surrounding "cruft". There's a few small pieces before/after the timing locations which could have attributed more to the wasm than was actually spent in wasm in debug mode; I was just sort of crudely timing things by instrumenting all the calls.

I agree with your intuition as well, that makes sense! To achieve that goal I don't think we need anything too fancy; you could imagine a macro author writing something like:

```rust
use watt_proc_macro2::TokenStream; // not a shadow of `proc-macro2`

#[no_mangle]
pub extern "C" fn my_macro(input: TokenStream) -> TokenStream {
    // not necessary since `watt_proc_macro2` has a statically known initialization symbol
    // we call first before we call `my_macro`, and that initialization function does this:
    // proc_macro2::set_wasm_panic_hook();

    let input = input.into_proc_macro2(); // creates a real crates.io `proc_macro2::TokenStream`

    // .. do the real macro on `proc_macro2::TokenStream`, as you usually do
    let ret = ...;

    // and convert back into a watt token stream
    ret.into()
}
```

The conversion from a `watt_proc_macro2::TokenStream` to a real `proc_macro2::TokenStream` (and back) would happen entirely inside the wasm module. Furthermore you could actually imagine this being on steroids:

```rust
use proc_macro2::TokenStream;
use watt::prelude::*;
#[watt::proc_macro]
pub fn my_macro(input: TokenStream) -> TokenStream {
    // ...
}
```

Basically the `#[watt::proc_macro]` attribute would handle all of the ABI glue for you. Anyway, I digress. Basically my main point is that the wasm blob I think will want the translation baked into it. We could play with a few different deserialization/serialization strategies as well to see which is fastest, but it would indeed be pretty slick if everything was internal to the wasm blob until only the very edges of the wasm. Some of this may require coordination in `proc-macro2` upstream.
That sounds good! I don't mind relying on `[patch]` so much for now, since it's only on the part of macro authors and not consumers. I think once the performance is in good shape we can revisit everything from the tooling and integration side.
👍 Ok I'll tinker with this and see what I can come up with.
I think the most important thing is improving the transparency of the API to the optimizer. Many of the specific methods where a lot of time is being spent seem like obvious easy-to-optimize places, so it may be possible to make good progress with a better API alone.

My first reservation about the "send all of the data into wasm eagerly" approach was that extra data, like the text of literals the macro never looks at, has to be copied in regardless. As mentioned, one of the biggest issues there would be spans. On the wasm side, the data would probably look similar to how it looks today, but with spans reduced to plain indexes into the list of input spans.

I'm not sure how much other hidden data is associated with individual tokens beyond spans, but we'd also lose any such information with this model. I'm guessing that there is little enough of that for it to not matter.
Ok so it turns out the API surface needed for this is pretty small. So basically a macro now looks like "serialize everything to a binary blob at the boundary", which retains span information as indexes back into the original input. The timings are looking impressive!
And for each imported function:
This is dramatically faster by minimizing the time spent crossing a chatty boundary. We're spending 9x less time in imported code in debug mode and ~6x less in release mode. It sort of makes sense here that the deserialization of what's probably like a megabyte of source code takes 110ms in debug mode. The "time in wasm" should break down as (a) deserialize the input, (b) do the processing, and (c) serialize the output. I would expect that (b) should be close to the release mode execution time within a small margin (ish), but (a) and (c) are extra work done today. If my assertion about (b) is true (which it probably isn't, since I think Cranelift is still working on perf) then there's room to optimize in (a) and (c). For example @mystor's idea of leaving literals as handles to the watt runtime might make sense, so that once you create a literal its text never needs to cross the boundary.

From this I think I would conclude that overall this looks like a great way forward. I suspect further tweaking, like @mystor mentions, in trying to keep as much string-like data on the host side as possible could push this further.
So actually, as usual, experimenting is faster than actually typing up the comment saying what we may want to experiment with. Here's timing information where string-like data stays on the host side:
And for each imported function:
So that was an easy 30ms win! @dtolnay, to answer your question about the signature, would you be opposed to a macro? Something like a `#[watt::proc_macro]` attribute that generates the exported shim.
It shouldn't require an attribute macro though, right? We control exactly what argument the main entry point receives here. I am imagining something like (pseudocode):

```rust
let raw_token_stream = Val::i32(d.tokenstream.push(input) as i32);
let input_token_stream = raw_to_pm2.call(&[raw_token_stream]).unwrap()[0];
let output_token_stream = main.call(&[input_token_stream]).unwrap()[0];
let raw_token_stream = pm2_to_raw.call(&[output_token_stream]).unwrap()[0];
return d.tokenstream[raw_token_stream].clone();
```

where `raw_to_pm2` and `pm2_to_raw` are conversion functions exported by the wasm module alongside the macro's entry point.
That's possible, but it would require specifying the ABI of those conversion entry points.
Ah, makes sense. Yes, I would be on board with an attribute macro to hide the ABI.
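For concreteness, a hedged sketch of what such an attribute could expand to; everything here (the handle-based `u32` ABI, the `watt_runtime_abi` glue module) is hypothetical and only illustrates how the macro author's code would stay free of ABI details:

```rust
use proc_macro2::TokenStream;

// What the author writes (the attribute would wrap this function):
fn my_macro_impl(input: TokenStream) -> TokenStream {
    input
}

// What the attribute would generate around it:
#[no_mangle]
pub extern "C" fn my_macro(raw_input: u32) -> u32 {
    // Turn the runtime's opaque handle into a real proc_macro2::TokenStream...
    let input: TokenStream = watt_runtime_abi::handle_to_token_stream(raw_input);
    // ...run the author's code unchanged...
    let output = my_macro_impl(input);
    // ...and hand an opaque handle back across the boundary.
    watt_runtime_abi::token_stream_to_handle(output)
}

// Hypothetical module standing in for the generated conversion glue.
mod watt_runtime_abi {
    use proc_macro2::TokenStream;
    pub fn handle_to_token_stream(_raw: u32) -> TokenStream { unimplemented!() }
    pub fn token_stream_to_handle(_ts: TokenStream) -> u32 { unimplemented!() }
}
```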
FWIW I experimented a bit, a while ago, with some really hacky macros around this.
Ok I've sent the culmination of all of this in as #14.
I was curious to see the impact of Wasmtime's recent development since I last added the `WATT_JIT` env var feature to `watt` a few years ago, since quite a lot has changed about Wasmtime in the meantime. The changes in this PR account for some ABI changes which have happened in the C API, which doesn't amount to anything major.

Taking my old benchmark of `#[derive(Serialize)]` on `struct S(f32, ... /* 1000 times */)`, the timings I get for the latest version of `serde_derive` are:

|         | native | watt  | watt (cached) |
|---------|--------|-------|---------------|
| debug   | 156ms  | 280ms | 125ms         |
| release | 70ms   | 257ms | 100ms         |

Using instead `#[derive(Serialize)] struct S(f32)`, the timings I get are:

|         | native | watt  | watt (cached) |
|---------|--------|-------|---------------|
| debug   | 1ms    | 241ms | 41ms          |
| release | 387us  | 205ms | 46ms          |

So for large inputs jit-compiled WebAssembly can be faster than the native `serde_derive` when serde is itself compiled in debug mode. Note that this is almost always the default nowadays since `cargo build --release` will currently build build-dependencies with no optimizations. Only through explicit profile configuration can `serde_derive` be built in optimized mode (as I did to collect the above numbers).

The `watt (cached)` column is where I enabled Wasmtime's global compilation cache to avoid recompiling the module every time the proc-macro is loaded, which is why the timings are much lower. The difference between `watt` and `watt (cached)` is the compile time of the module itself. The 40ms or so in `watt (cached)` is almost entirely overhead of loading the module from cache, which involves decompressing the module from disk and additionally sloshing bytes around. More efficient storage mediums exist for Wasmtime modules, which means that it would actually be pretty easy to shave off a good chunk of time from that. Additionally Wasmtime has a custom C API which significantly differs from the one used in this repository and which would also be significantly faster for calling into the host from wasm. Of the current ~3ms of runtime in wasm itself, that could probably be reduced further with more optimized calls.

Overall this seems like pretty good progress made on Wasmtime in the interim since all my initial work in dtolnay#2. In any case I wanted to post this to get the `WATT_JIT` feature at least working again since otherwise it's segfaulting right now, and perhaps in the future if necessary more perf work can be done!
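As a sketch of the "more efficient storage mediums" point above: recent Wasmtime versions can precompile a module to a native artifact and later map it back in without decompression or recompilation. This is only an illustration of that API (the file layout and caching policy are up to the embedder), and `deserialize_file` is unsafe because the artifact must come from a trusted source:

```rust
use wasmtime::{Engine, Module};

fn precompile(engine: &Engine, wasm: &[u8], path: &std::path::Path) -> anyhow::Result<()> {
    // Run Cranelift once and persist the compiled artifact.
    let artifact = engine.precompile_module(wasm)?;
    std::fs::write(path, artifact)?;
    Ok(())
}

fn load_precompiled(engine: &Engine, path: &std::path::Path) -> anyhow::Result<Module> {
    // Loads the compiled code directly; no recompilation or decompression.
    unsafe { Module::deserialize_file(engine, path) }
}
```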
From some rough tests, Watt macro expansion when compiling the runtime in release mode is about 15x faster than when the runtime is compiled in debug mode.
Maybe we can set it up such that users can run something like

```console
cargo install watt-runtime
```

and then our debug-mode runtime can detect whether that optimized runtime is installed; if it is, hand off the program to it.
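A minimal sketch of that detection/fallback idea, assuming the optimized runtime is exposed either through an environment variable (the `WATT_JIT` approach mentioned later in the thread) or as a `watt-runtime` binary installed into `~/.cargo/bin`; both lookup locations are assumptions, and if neither is found the built-in debug-mode interpreter would be used:

```rust
use std::path::PathBuf;

fn find_optimized_runtime() -> Option<PathBuf> {
    // Explicit opt-in via an environment variable takes precedence.
    if let Some(path) = std::env::var_os("WATT_JIT") {
        let path = PathBuf::from(path);
        if path.exists() {
            return Some(path);
        }
    }

    // Otherwise look for a `cargo install`ed binary under CARGO_HOME (default ~/.cargo).
    let cargo_home = std::env::var_os("CARGO_HOME")
        .map(|h| PathBuf::from(h))
        .or_else(|| std::env::var_os("HOME").map(|h| PathBuf::from(h).join(".cargo")))?;
    let candidate = cargo_home.join("bin").join("watt-runtime");
    if candidate.exists() {
        Some(candidate)
    } else {
        None // caller falls back to the slow built-in interpreter
    }
}
```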