Cranelift: Remove `ArgumentPurpose::StructReturn` #4618

fitzgen · 2022-08-04T21:43:51Z

We used to legalize multi-value returns to using struct-return pointers where
callees would store result values into the given struct-return buffer and
callers would provide the struct-return buffer when calling and then load the
results out of it. We haven't done that for a while and just rely on the calling
convention's normal method of returning multiple values now. The only special
casing that ArgumentPurpose::StructReturn has now is

We legalize signatures that have a StructReturn parameter but no
StructReturn result to add the missing StructReturn result
We automatically insert a copy from a function's StructReturn argument to
its StructReturn result value

This isn't really useful for Cranelift embedders anymore since it doesn't even
handle putting the return values into the struct-return buffer or getting them
out again, has maintenance and cruft overhead for Cranelift hackers, and the
above signature legalization in (1) also imposes performance overhead on all
Cranelift compiles regardless of whether they use struct returns or not. It's
time we removed the vestigial ArgumentPurpose::StructReturn.

Finally, here are the Sightglass benchmark wins we get from this removal:

compilation :: nanoseconds :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 214956202.90 ± 31700992.96 (confidence = 99%)

  main.so is 0.91x to 0.94x faster than no-sret.so!
  no-sret.so is 1.07x to 1.09x faster than main.so!

  [2765571620 2866580329.79 3085702646] main.so
  [2396129997 2651624126.89 2923726602] no-sret.so

compilation :: nanoseconds :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 4176509.17 ± 2835408.05 (confidence = 99%)

  main.so is 0.95x to 0.99x faster than no-sret.so!
  no-sret.so is 1.01x to 1.05x faster than main.so!

  [115737735 133448206.82 149712338] main.so
  [108735836 129271697.65 166386156] no-sret.so

compilation :: nanoseconds :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [77356671 85735828.56 96331117] main.so
  [75824588 84176414.51 94308652] no-sret.so

github-actions · 2022-08-04T23:12:46Z

Subscribe to Label Action

cc @cfallin, @fitzgen

This issue or pull request has been labeled: "cranelift", "cranelift:area:machinst", "cranelift:area:x64", "isle"

Thus the following users have been cc'd because of the following labels:

cfallin: isle
fitzgen: isle

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

cfallin

This is a good speedup for sure!

I'm not terribly fond of the externalization of the legalize step here -- it seems error-prone to require this at each callsite. Is there a way to avoid the clone in another way, perhaps by returning Cow from ensure_struct_return_ptr_is_returned and only cloning the sig if it's actually mutated?

fitzgen · 2022-08-04T23:24:12Z

Returning a Cow doesn't work because the signature gets stored in ABICalle{e,r}Impl and I tried threading a lifetime through those structs but then run into borrow errors where something wants unique access to the thing the signature is inside but the signature is shared borrowed by ABICalle{e,r}Impls.

fitzgen · 2022-08-04T23:24:58Z

That said, I am investigating moving this legalization out of the mach inst code and into the actual legalizer pass we have, which should just modify the signatures in place, instead of doing any clones at all

cfallin · 2022-08-04T23:26:35Z

Alternately: give ABISig an iterator-only interface, and an implementation that sneakily inserts the extra arg in the case where we're mutating now. That's the most efficient I think, at the cost of some refactoring at use-sites...

fitzgen · 2022-08-05T16:53:16Z

I've rebased this on top of #4621 and rewritten it to do this signature legalization in-place in the legalization pass. This gives a nice phase separation and puts responsibility for legalization in the expected place. It is also an even larger speed up (against the remove-unused-fields PR) than the original PR was (against main):

compilation :: nanoseconds :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 181420458.93 ± 26515667.61 (confidence = 99%)

  scratch/libwasmtime_bench_api.main.so is 0.93x to 0.95x faster than wasmtime/target/release/libwasmtime_bench_api.so!
  wasmtime/target/release/libwasmtime_bench_api.so is 1.06x to 1.08x faster than scratch/libwasmtime_bench_api.main.so!

  [2819380338 2928780796.54 3289265318] scratch/libwasmtime_bench_api.main.so
  [2417114451 2747360337.61 3380763132] wasmtime/target/release/libwasmtime_bench_api.so

compilation :: nanoseconds :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 2749394.66 ± 1902797.53 (confidence = 99%)

  scratch/libwasmtime_bench_api.main.so is 0.97x to 0.99x faster than wasmtime/target/release/libwasmtime_bench_api.so!
  wasmtime/target/release/libwasmtime_bench_api.so is 1.01x to 1.04x faster than scratch/libwasmtime_bench_api.main.so!

  [115751117 134432921.19 157572316] scratch/libwasmtime_bench_api.main.so
  [113527530 131683526.53 161295688] wasmtime/target/release/libwasmtime_bench_api.so

compilation :: nanoseconds :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [76366357 86553210.56 97936025] scratch/libwasmtime_bench_api.main.so
  [76233349 85885027.75 100047281] wasmtime/target/release/libwasmtime_bench_api.so

However, I starting looking into how StructReturn was actually used, and... it isn't? Originally, we used to legalize multi-value returns into StructReturn (I originally wrote that code way back when) but that hasn't happened for a while. So I'm wondering: why do we even have StructReturn anymore? Can we completely remove it? It does seem like cg_clif uses it, but it isn't clear to me that cg_clif can't move to regular multi-value returns and let the backends' normal calling convention code handle this? Or if it needs something custom, then it can legalize calls to use struct return pointers when generating the input CLIF. It really seems like StructReturn is useless since cranelift isn't loading return values out of the struct-return space after the call for embedders. Literally the only special handling of it that Cranelift does is to generate a move from the struct return argument to the struct return result. Doesn't seem like this is providing much value to embedders, and it is a maintenance/performance burden for Cranelift itself...

Or if removing StructReturn is unworkable for some reason, we can probably just make it a CLIF error (and enforce this in the verifier) to have a StructReturn parameter but not a StructReturn result, so that we don't have to do this signature legalization anymore. Would prefer to completely remove StructReturn if possible, though.

cc @bjorn3

fitzgen · 2022-08-05T21:47:17Z

Updating this PR to be about removing ArgumentPurpose::StructReturn.

We used to legalize multi-value returns to using struct-return pointers where
callees would store result values into the given struct-return buffer and
callers would provide the struct-return buffer when calling and then load the
results out of it. We haven't done that for a while and just rely on the calling
convention's normal method of returning multiple values now. The only special
casing that ArgumentPurpose::StructReturn has now is

We legalize signatures that have a StructReturn parameter but no
StructReturn result to add the missing StructReturn result
We automatically insert a copy from a function's StructReturn argument to
its StructReturn result value

This isn't really useful for Cranelift embedders anymore since it doesn't even
handle putting the return values into the struct-return buffer or getting them
out again, has maintenance and cruft overhead for Cranelift hackers, and the
above signature legalization in (1) also imposes performance overhead on all
Cranelift compiles regardless of whether they use struct returns or not. It's
time we removed the vestigial ArgumentPurpose::StructReturn.

Finally, here are the Sightglass benchmark wins we get from this removal:

compilation :: nanoseconds :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 214956202.90 ± 31700992.96 (confidence = 99%)

  main.so is 0.91x to 0.94x faster than no-sret.so!
  no-sret.so is 1.07x to 1.09x faster than main.so!

  [2765571620 2866580329.79 3085702646] main.so
  [2396129997 2651624126.89 2923726602] no-sret.so

compilation :: nanoseconds :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 4176509.17 ± 2835408.05 (confidence = 99%)

  main.so is 0.95x to 0.99x faster than no-sret.so!
  no-sret.so is 1.01x to 1.05x faster than main.so!

  [115737735 133448206.82 149712338] main.so
  [108735836 129271697.65 166386156] no-sret.so

compilation :: nanoseconds :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [77356671 85735828.56 96331117] main.so
  [75824588 84176414.51 94308652] no-sret.so

bjorn3 · 2022-08-05T22:05:21Z

The sysv x86_64 abi handles large struct returning by passing a pointer in a specific register, writing the return value to this buffer and returning the pointer again in a different register. This can't as easily be done if StructReturn is removed. Currently simply declaring an argument as StructReturn is enough, but with that removed you did have to pass it as first argument and make sure to return it again as first return value. I expect most people to forget that last step. In addition architectures other than x86_64 may have different requirements (possibly not even expressible without StructReturn), thus making Cranelift IR less target independent and pushing complexity for choosing the right way towards Cranelift users.

We used to legalize multi-value returns to using struct-return pointers where callees would store result values into the given struct-return buffer and callers would provide the struct-return buffer when calling and then load the results out of it. We haven't done that for a while and just rely on the calling convention's normal method of returning multiple values now. The only special casing that `ArgumentPurpose::StructReturn` has now is 1. We legalize signatures that have a `StructReturn` parameter but no `StructReturn` result to add the missing `StructReturn` result 2. We automatically insert a copy from a function's `StructReturn` argument to its `StructReturn` result value This isn't really useful for Cranelift embedders anymore since it doesn't even handle putting the return values into the struct-return buffer or getting them out again, has maintenance and cruft overhead for Cranelift hackers, and the above signature legalization in (1) also imposes performance overhead on all Cranelift compiles regardless of whether they use struct returns or not. It's time we removed the vestigial `ArgumentPurpose::StructReturn`. Finally, here are the Sightglass benchmark wins we get from this removal: ``` compilation :: nanoseconds :: benchmarks/spidermonkey/benchmark.wasm Δ = 214956202.90 ± 31700992.96 (confidence = 99%) main.so is 0.91x to 0.94x faster than no-sret.so! no-sret.so is 1.07x to 1.09x faster than main.so! [2765571620 2866580329.79 3085702646] main.so [2396129997 2651624126.89 2923726602] no-sret.so compilation :: nanoseconds :: benchmarks/pulldown-cmark/benchmark.wasm Δ = 4176509.17 ± 2835408.05 (confidence = 99%) main.so is 0.95x to 0.99x faster than no-sret.so! no-sret.so is 1.01x to 1.05x faster than main.so! [115737735 133448206.82 149712338] main.so [108735836 129271697.65 166386156] no-sret.so compilation :: nanoseconds :: benchmarks/bz2/benchmark.wasm No difference in performance. [77356671 85735828.56 96331117] main.so [75824588 84176414.51 94308652] no-sret.so ```

fitzgen · 2022-08-05T22:14:23Z

ArgumentPurpose::StructReturn doesn't ensure that the argument matches the calling conventions at all though, all it does is copy the value to the returns if missing. No check that it is the first argument or first return. CLIF producers are already on their own here.

cfallin · 2022-08-05T22:19:46Z

@bjorn3 I think that if Cranelift natively had struct types (as LLVM does), it would make more sense to worry about closely following the ABI for aggregate values. In such a hypothetical case, Cranelift would be responsible for everything related to the struct's handling, and the IR producer would simply create the struct and fill it in or consume it.

However, CLIF is at a lower abstraction level: it provides building blocks, but does not actually have support for struct types. Sometimes these building blocks are "fundamental" and cannot be built from others. For example, args that need to be memcpy'd before a call are difficult to handle by generating CLIF instead.

But I think that sret is not such a feature: it simply has the semantics "return this one arg". One can just as well generate CLIF that does that. And as @fitzgen notes, there are many other details the CLIF producer has to get right; removing one building block does not cause a phase-change in difficulty from "everything provided" to "many fiddly details", it just adds one detail to an already large pile.

So then we ask why remove the building block: in this case, it is a small but not insignificant simplification in the ABI code, because it removes the only kind of "legalization" that we currently need to do to the signature. Having one signature that passes all the way through the pipeline, and then lowering ABI-handling code directly from that signature, is a nice simplicity + correctness win. And, performance as well, as @fitzgen shows here.

So unless there is a reason it cannot be done at the CLIF level at all (please do speak up if this is actually the case and we've missed something), I think it makes sense to follow through with this. Does that make some sense at least?

bjorn3 · 2022-08-06T14:46:12Z

I think that if Cranelift natively had struct types (as LLVM does), it would make more sense to worry about closely following the ABI for aggregate values.

LLVM has struct types lowered to an sret argument at IR level too.

But I think that sret is not such a feature: it simply has the semantics "return this one arg".

It has the semantics of pass this argument in the right register and return it in the right register again on x86_64. This is not the case on all architectures:

On x86 there seems to be a difference between what works on x86_64 and what is actually done in C: https://godbolt.org/z/6dnEhK1vK

Arm64 windows uses x8 instead of x0 as implicit return pointer: https://godbolt.org/z/fe95Y8Pdv

Power has an extra crxor 6,6,6 instruction which if I read the manual correctly clear bit 6 of the condition register: https://rust.godbolt.org/z/d66P6q3z4

In other words removing sret will make it impossible to correctly support those architectures in the future.

So then we ask why remove the building block: in this case, it is a small but not insignificant simplification in the ABI code, because it removes the only kind of "legalization" that we currently need to do to the signature.

I think instead of implementing it as legalization, it should be a direct part of the abi handling code for the respective architecture. Especially as it differs between architectures.

bjorn3 · 2022-08-07T15:09:03Z

I have been debugging an abi incompatibility on aarch64 found in https://github.com/bjorn3/rustc_codegen_cranelift/pull/1255. Turns out sret arguments must be passed in the x8 register, while Cranelift currently passes it in x0 like a regular argument. It is not possible to pass a regular argument in x8, you need sret for this.

Edit: Opened #4634

cfallin · 2022-08-08T18:07:40Z

@bjorn3 thanks for the additional details (and for hunting the linked bug!); I agree that given this new context, sret args are needed at the IR level.

jameysharp · 2022-08-10T23:51:10Z

I had a branch lying around changing the return type of ensure_struct_return_ptr_is_returned to a Cow so the clone can at least be deferred to the caller, since there was one caller where a borrow is fine. After rebasing on #4621, now three of the four callers can use a borrow; only ABICalleeImpl::new has to clone. I'd set it aside because I was having trouble figuring out whether it was actually a performance improvement, but maybe I should pick it up again?

cfallin · 2022-08-11T00:23:22Z

@jameysharp that could be a good change; I'd be curious about the performance. @fitzgen has some plans regarding interning/sharing ABISigs as well; it may be worth coordinating plans between the two of you.

bjorn3 · 2022-08-11T09:34:33Z

I think it makes sense to remove ensure_struct_return_ptr_is_returned and instead directly handle sret in the abi impls. That is more flexible and should allow avoiding this allocation unconditionally.

fitzgen · 2022-08-11T20:41:40Z

I agree that doing the legalization on-the-fly is probably best.

I investigated making it Cow but that doesn't end up working out well because you'd need to start threading a lifetime param through ABICalle{e,r}Impl and a bunch of places and it quickly runs into shared vs exclusive borrows of the same stuff.

Working on refactoring all of signature stuff right now.

jameysharp · 2022-08-11T22:17:09Z

My trick was I didn't change any types except the return type of ensure_struct_return_ptr_is_returned. That meant I didn't have to thread lifetimes around anywhere, and the only downside was that sometimes the value would get cloned anyway because one of the other types needed to own a copy. That still gave plenty of opportunities to avoid allocations, even if it didn't eliminate all of them.

Here's my cow-sig branch in case anyone wants to look at it. I'm not currently proposing merging it since it sounds like we should do something different. So I haven't re-evaluated whether it has any performance impact.

fitzgen requested a review from cfallin August 4, 2022 21:43

github-actions bot added cranelift Issues related to the Cranelift code generator cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. cranelift:area:x64 Issues related to x64 codegen isle Related to the ISLE domain-specific language labels Aug 4, 2022

cfallin reviewed Aug 4, 2022

View reviewed changes

fitzgen force-pushed the dont-canonicalize-multiple-times branch from f099952 to a180a41 Compare August 5, 2022 16:19

fitzgen force-pushed the dont-canonicalize-multiple-times branch 2 times, most recently from 2baf5ba to db76f7b Compare August 5, 2022 21:45

fitzgen requested a review from cfallin August 5, 2022 21:46

fitzgen changed the title ~~Cranelift: ABISig::from_func_sig should take an already-canonicalized signature~~ Cranelift: Remove ArgumentPurpose::StructReturn Aug 5, 2022

github-actions bot added the cranelift:area:aarch64 Issues related to AArch64 backend. label Aug 5, 2022

fitzgen force-pushed the dont-canonicalize-multiple-times branch from db76f7b to 53b8344 Compare August 5, 2022 22:08

github-actions bot added the cranelift:docs label Aug 5, 2022

afonso360 mentioned this pull request Aug 7, 2022

Fix sret for AArch64 #4634

Merged

fitzgen closed this Aug 8, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cranelift: Remove `ArgumentPurpose::StructReturn` #4618

Cranelift: Remove `ArgumentPurpose::StructReturn` #4618

fitzgen commented Aug 4, 2022 •

edited

Loading

github-actions bot commented Aug 4, 2022

cfallin left a comment

fitzgen commented Aug 4, 2022

fitzgen commented Aug 4, 2022

cfallin commented Aug 4, 2022

fitzgen commented Aug 5, 2022

fitzgen commented Aug 5, 2022

bjorn3 commented Aug 5, 2022 •

edited

Loading

fitzgen commented Aug 5, 2022

cfallin commented Aug 5, 2022

bjorn3 commented Aug 6, 2022

bjorn3 commented Aug 7, 2022 •

edited

Loading

cfallin commented Aug 8, 2022

jameysharp commented Aug 10, 2022

cfallin commented Aug 11, 2022

bjorn3 commented Aug 11, 2022

fitzgen commented Aug 11, 2022

jameysharp commented Aug 11, 2022

Cranelift: Remove ArgumentPurpose::StructReturn #4618

Cranelift: Remove ArgumentPurpose::StructReturn #4618

Conversation

fitzgen commented Aug 4, 2022 • edited Loading

github-actions bot commented Aug 4, 2022

Subscribe to Label Action

cfallin left a comment

Choose a reason for hiding this comment

fitzgen commented Aug 4, 2022

fitzgen commented Aug 4, 2022

cfallin commented Aug 4, 2022

fitzgen commented Aug 5, 2022

fitzgen commented Aug 5, 2022

bjorn3 commented Aug 5, 2022 • edited Loading

fitzgen commented Aug 5, 2022

cfallin commented Aug 5, 2022

bjorn3 commented Aug 6, 2022

bjorn3 commented Aug 7, 2022 • edited Loading

cfallin commented Aug 8, 2022

jameysharp commented Aug 10, 2022

cfallin commented Aug 11, 2022

bjorn3 commented Aug 11, 2022

fitzgen commented Aug 11, 2022

jameysharp commented Aug 11, 2022

Cranelift: Remove `ArgumentPurpose::StructReturn` #4618

Cranelift: Remove `ArgumentPurpose::StructReturn` #4618

fitzgen commented Aug 4, 2022 •

edited

Loading

bjorn3 commented Aug 5, 2022 •

edited

Loading

bjorn3 commented Aug 7, 2022 •

edited

Loading