Skip to content

Cranelift: Remove ArgumentPurpose::StructReturn #4618

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

fitzgen
Copy link
Member

@fitzgen fitzgen commented Aug 4, 2022

We used to legalize multi-value returns to using struct-return pointers where
callees would store result values into the given struct-return buffer and
callers would provide the struct-return buffer when calling and then load the
results out of it. We haven't done that for a while and just rely on the calling
convention's normal method of returning multiple values now. The only special
casing that ArgumentPurpose::StructReturn has now is

  1. We legalize signatures that have a StructReturn parameter but no
    StructReturn result to add the missing StructReturn result

  2. We automatically insert a copy from a function's StructReturn argument to
    its StructReturn result value

This isn't really useful for Cranelift embedders anymore since it doesn't even
handle putting the return values into the struct-return buffer or getting them
out again, has maintenance and cruft overhead for Cranelift hackers, and the
above signature legalization in (1) also imposes performance overhead on all
Cranelift compiles regardless of whether they use struct returns or not. It's
time we removed the vestigial ArgumentPurpose::StructReturn.

Finally, here are the Sightglass benchmark wins we get from this removal:

compilation :: nanoseconds :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 214956202.90 ± 31700992.96 (confidence = 99%)

  main.so is 0.91x to 0.94x faster than no-sret.so!
  no-sret.so is 1.07x to 1.09x faster than main.so!

  [2765571620 2866580329.79 3085702646] main.so
  [2396129997 2651624126.89 2923726602] no-sret.so

compilation :: nanoseconds :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 4176509.17 ± 2835408.05 (confidence = 99%)

  main.so is 0.95x to 0.99x faster than no-sret.so!
  no-sret.so is 1.01x to 1.05x faster than main.so!

  [115737735 133448206.82 149712338] main.so
  [108735836 129271697.65 166386156] no-sret.so

compilation :: nanoseconds :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [77356671 85735828.56 96331117] main.so
  [75824588 84176414.51 94308652] no-sret.so

@fitzgen fitzgen requested a review from cfallin August 4, 2022 21:43
@github-actions github-actions bot added cranelift Issues related to the Cranelift code generator cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. cranelift:area:x64 Issues related to x64 codegen isle Related to the ISLE domain-specific language labels Aug 4, 2022
@github-actions
Copy link

github-actions bot commented Aug 4, 2022

Subscribe to Label Action

cc @cfallin, @fitzgen

This issue or pull request has been labeled: "cranelift", "cranelift:area:machinst", "cranelift:area:x64", "isle"

Thus the following users have been cc'd because of the following labels:

  • cfallin: isle
  • fitzgen: isle

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

Copy link
Member

@cfallin cfallin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good speedup for sure!

I'm not terribly fond of the externalization of the legalize step here -- it seems error-prone to require this at each callsite. Is there a way to avoid the clone in another way, perhaps by returning Cow from ensure_struct_return_ptr_is_returned and only cloning the sig if it's actually mutated?

@fitzgen
Copy link
Member Author

fitzgen commented Aug 4, 2022

Returning a Cow doesn't work because the signature gets stored in ABICalle{e,r}Impl and I tried threading a lifetime through those structs but then run into borrow errors where something wants unique access to the thing the signature is inside but the signature is shared borrowed by ABICalle{e,r}Impls.

@fitzgen
Copy link
Member Author

fitzgen commented Aug 4, 2022

That said, I am investigating moving this legalization out of the mach inst code and into the actual legalizer pass we have, which should just modify the signatures in place, instead of doing any clones at all

@cfallin
Copy link
Member

cfallin commented Aug 4, 2022

Alternately: give ABISig an iterator-only interface, and an implementation that sneakily inserts the extra arg in the case where we're mutating now. That's the most efficient I think, at the cost of some refactoring at use-sites...

@fitzgen fitzgen force-pushed the dont-canonicalize-multiple-times branch from f099952 to a180a41 Compare August 5, 2022 16:19
@fitzgen
Copy link
Member Author

fitzgen commented Aug 5, 2022

I've rebased this on top of #4621 and rewritten it to do this signature legalization in-place in the legalization pass. This gives a nice phase separation and puts responsibility for legalization in the expected place. It is also an even larger speed up (against the remove-unused-fields PR) than the original PR was (against main):

compilation :: nanoseconds :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 181420458.93 ± 26515667.61 (confidence = 99%)

  scratch/libwasmtime_bench_api.main.so is 0.93x to 0.95x faster than wasmtime/target/release/libwasmtime_bench_api.so!
  wasmtime/target/release/libwasmtime_bench_api.so is 1.06x to 1.08x faster than scratch/libwasmtime_bench_api.main.so!

  [2819380338 2928780796.54 3289265318] scratch/libwasmtime_bench_api.main.so
  [2417114451 2747360337.61 3380763132] wasmtime/target/release/libwasmtime_bench_api.so

compilation :: nanoseconds :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 2749394.66 ± 1902797.53 (confidence = 99%)

  scratch/libwasmtime_bench_api.main.so is 0.97x to 0.99x faster than wasmtime/target/release/libwasmtime_bench_api.so!
  wasmtime/target/release/libwasmtime_bench_api.so is 1.01x to 1.04x faster than scratch/libwasmtime_bench_api.main.so!

  [115751117 134432921.19 157572316] scratch/libwasmtime_bench_api.main.so
  [113527530 131683526.53 161295688] wasmtime/target/release/libwasmtime_bench_api.so

compilation :: nanoseconds :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [76366357 86553210.56 97936025] scratch/libwasmtime_bench_api.main.so
  [76233349 85885027.75 100047281] wasmtime/target/release/libwasmtime_bench_api.so

However, I starting looking into how StructReturn was actually used, and... it isn't? Originally, we used to legalize multi-value returns into StructReturn (I originally wrote that code way back when) but that hasn't happened for a while. So I'm wondering: why do we even have StructReturn anymore? Can we completely remove it? It does seem like cg_clif uses it, but it isn't clear to me that cg_clif can't move to regular multi-value returns and let the backends' normal calling convention code handle this? Or if it needs something custom, then it can legalize calls to use struct return pointers when generating the input CLIF. It really seems like StructReturn is useless since cranelift isn't loading return values out of the struct-return space after the call for embedders. Literally the only special handling of it that Cranelift does is to generate a move from the struct return argument to the struct return result. Doesn't seem like this is providing much value to embedders, and it is a maintenance/performance burden for Cranelift itself...

Or if removing StructReturn is unworkable for some reason, we can probably just make it a CLIF error (and enforce this in the verifier) to have a StructReturn parameter but not a StructReturn result, so that we don't have to do this signature legalization anymore. Would prefer to completely remove StructReturn if possible, though.

cc @bjorn3

@fitzgen fitzgen force-pushed the dont-canonicalize-multiple-times branch 2 times, most recently from 2baf5ba to db76f7b Compare August 5, 2022 21:45
@fitzgen fitzgen requested a review from cfallin August 5, 2022 21:46
@fitzgen
Copy link
Member Author

fitzgen commented Aug 5, 2022

Updating this PR to be about removing ArgumentPurpose::StructReturn.

We used to legalize multi-value returns to using struct-return pointers where
callees would store result values into the given struct-return buffer and
callers would provide the struct-return buffer when calling and then load the
results out of it. We haven't done that for a while and just rely on the calling
convention's normal method of returning multiple values now. The only special
casing that ArgumentPurpose::StructReturn has now is

  1. We legalize signatures that have a StructReturn parameter but no
    StructReturn result to add the missing StructReturn result

  2. We automatically insert a copy from a function's StructReturn argument to
    its StructReturn result value

This isn't really useful for Cranelift embedders anymore since it doesn't even
handle putting the return values into the struct-return buffer or getting them
out again, has maintenance and cruft overhead for Cranelift hackers, and the
above signature legalization in (1) also imposes performance overhead on all
Cranelift compiles regardless of whether they use struct returns or not. It's
time we removed the vestigial ArgumentPurpose::StructReturn.

Finally, here are the Sightglass benchmark wins we get from this removal:

compilation :: nanoseconds :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 214956202.90 ± 31700992.96 (confidence = 99%)

  main.so is 0.91x to 0.94x faster than no-sret.so!
  no-sret.so is 1.07x to 1.09x faster than main.so!

  [2765571620 2866580329.79 3085702646] main.so
  [2396129997 2651624126.89 2923726602] no-sret.so

compilation :: nanoseconds :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 4176509.17 ± 2835408.05 (confidence = 99%)

  main.so is 0.95x to 0.99x faster than no-sret.so!
  no-sret.so is 1.01x to 1.05x faster than main.so!

  [115737735 133448206.82 149712338] main.so
  [108735836 129271697.65 166386156] no-sret.so

compilation :: nanoseconds :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [77356671 85735828.56 96331117] main.so
  [75824588 84176414.51 94308652] no-sret.so

@fitzgen fitzgen changed the title Cranelift: ABISig::from_func_sig should take an already-canonicalized signature Cranelift: Remove ArgumentPurpose::StructReturn Aug 5, 2022
@github-actions github-actions bot added the cranelift:area:aarch64 Issues related to AArch64 backend. label Aug 5, 2022
@bjorn3
Copy link
Contributor

bjorn3 commented Aug 5, 2022

The sysv x86_64 abi handles large struct returning by passing a pointer in a specific register, writing the return value to this buffer and returning the pointer again in a different register. This can't as easily be done if StructReturn is removed. Currently simply declaring an argument as StructReturn is enough, but with that removed you did have to pass it as first argument and make sure to return it again as first return value. I expect most people to forget that last step. In addition architectures other than x86_64 may have different requirements (possibly not even expressible without StructReturn), thus making Cranelift IR less target independent and pushing complexity for choosing the right way towards Cranelift users.

We used to legalize multi-value returns to using struct-return pointers where
callees would store result values into the given struct-return buffer and
callers would provide the struct-return buffer when calling and then load the
results out of it. We haven't done that for a while and just rely on the calling
convention's normal method of returning multiple values now. The only special
casing that `ArgumentPurpose::StructReturn` has now is

1. We legalize signatures that have a `StructReturn` parameter but no
`StructReturn` result to add the missing `StructReturn` result

2. We automatically insert a copy from a function's `StructReturn` argument to
its `StructReturn` result value

This isn't really useful for Cranelift embedders anymore since it doesn't even
handle putting the return values into the struct-return buffer or getting them
out again, has maintenance and cruft overhead for Cranelift hackers, and the
above signature legalization in (1) also imposes performance overhead on all
Cranelift compiles regardless of whether they use struct returns or not. It's
time we removed the vestigial `ArgumentPurpose::StructReturn`.

Finally, here are the Sightglass benchmark wins we get from this removal:

```
compilation :: nanoseconds :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 214956202.90 ± 31700992.96 (confidence = 99%)

  main.so is 0.91x to 0.94x faster than no-sret.so!
  no-sret.so is 1.07x to 1.09x faster than main.so!

  [2765571620 2866580329.79 3085702646] main.so
  [2396129997 2651624126.89 2923726602] no-sret.so

compilation :: nanoseconds :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 4176509.17 ± 2835408.05 (confidence = 99%)

  main.so is 0.95x to 0.99x faster than no-sret.so!
  no-sret.so is 1.01x to 1.05x faster than main.so!

  [115737735 133448206.82 149712338] main.so
  [108735836 129271697.65 166386156] no-sret.so

compilation :: nanoseconds :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [77356671 85735828.56 96331117] main.so
  [75824588 84176414.51 94308652] no-sret.so
```
@fitzgen fitzgen force-pushed the dont-canonicalize-multiple-times branch from db76f7b to 53b8344 Compare August 5, 2022 22:08
@fitzgen
Copy link
Member Author

fitzgen commented Aug 5, 2022

ArgumentPurpose::StructReturn doesn't ensure that the argument matches the calling conventions at all though, all it does is copy the value to the returns if missing. No check that it is the first argument or first return. CLIF producers are already on their own here.

@cfallin
Copy link
Member

cfallin commented Aug 5, 2022

@bjorn3 I think that if Cranelift natively had struct types (as LLVM does), it would make more sense to worry about closely following the ABI for aggregate values. In such a hypothetical case, Cranelift would be responsible for everything related to the struct's handling, and the IR producer would simply create the struct and fill it in or consume it.

However, CLIF is at a lower abstraction level: it provides building blocks, but does not actually have support for struct types. Sometimes these building blocks are "fundamental" and cannot be built from others. For example, args that need to be memcpy'd before a call are difficult to handle by generating CLIF instead.

But I think that sret is not such a feature: it simply has the semantics "return this one arg". One can just as well generate CLIF that does that. And as @fitzgen notes, there are many other details the CLIF producer has to get right; removing one building block does not cause a phase-change in difficulty from "everything provided" to "many fiddly details", it just adds one detail to an already large pile.

So then we ask why remove the building block: in this case, it is a small but not insignificant simplification in the ABI code, because it removes the only kind of "legalization" that we currently need to do to the signature. Having one signature that passes all the way through the pipeline, and then lowering ABI-handling code directly from that signature, is a nice simplicity + correctness win. And, performance as well, as @fitzgen shows here.

So unless there is a reason it cannot be done at the CLIF level at all (please do speak up if this is actually the case and we've missed something), I think it makes sense to follow through with this. Does that make some sense at least?

@bjorn3
Copy link
Contributor

bjorn3 commented Aug 6, 2022

I think that if Cranelift natively had struct types (as LLVM does), it would make more sense to worry about closely following the ABI for aggregate values.

LLVM has struct types lowered to an sret argument at IR level too.

But I think that sret is not such a feature: it simply has the semantics "return this one arg".

It has the semantics of pass this argument in the right register and return it in the right register again on x86_64. This is not the case on all architectures:

On x86 there seems to be a difference between what works on x86_64 and what is actually done in C: https://godbolt.org/z/6dnEhK1vK

Arm64 windows uses x8 instead of x0 as implicit return pointer: https://godbolt.org/z/fe95Y8Pdv

Power has an extra crxor 6,6,6 instruction which if I read the manual correctly clear bit 6 of the condition register: https://rust.godbolt.org/z/d66P6q3z4

In other words removing sret will make it impossible to correctly support those architectures in the future.

So then we ask why remove the building block: in this case, it is a small but not insignificant simplification in the ABI code, because it removes the only kind of "legalization" that we currently need to do to the signature.

I think instead of implementing it as legalization, it should be a direct part of the abi handling code for the respective architecture. Especially as it differs between architectures.

@bjorn3
Copy link
Contributor

bjorn3 commented Aug 7, 2022

I have been debugging an abi incompatibility on aarch64 found in https://github.com/bjorn3/rustc_codegen_cranelift/pull/1255. Turns out sret arguments must be passed in the x8 register, while Cranelift currently passes it in x0 like a regular argument. It is not possible to pass a regular argument in x8, you need sret for this.

Edit: Opened #4634

@afonso360 afonso360 mentioned this pull request Aug 7, 2022
@cfallin
Copy link
Member

cfallin commented Aug 8, 2022

@bjorn3 thanks for the additional details (and for hunting the linked bug!); I agree that given this new context, sret args are needed at the IR level.

@fitzgen fitzgen closed this Aug 8, 2022
@jameysharp
Copy link
Contributor

I had a branch lying around changing the return type of ensure_struct_return_ptr_is_returned to a Cow so the clone can at least be deferred to the caller, since there was one caller where a borrow is fine. After rebasing on #4621, now three of the four callers can use a borrow; only ABICalleeImpl::new has to clone. I'd set it aside because I was having trouble figuring out whether it was actually a performance improvement, but maybe I should pick it up again?

@cfallin
Copy link
Member

cfallin commented Aug 11, 2022

@jameysharp that could be a good change; I'd be curious about the performance. @fitzgen has some plans regarding interning/sharing ABISigs as well; it may be worth coordinating plans between the two of you.

@bjorn3
Copy link
Contributor

bjorn3 commented Aug 11, 2022

I think it makes sense to remove ensure_struct_return_ptr_is_returned and instead directly handle sret in the abi impls. That is more flexible and should allow avoiding this allocation unconditionally.

@fitzgen
Copy link
Member Author

fitzgen commented Aug 11, 2022

I agree that doing the legalization on-the-fly is probably best.

I investigated making it Cow but that doesn't end up working out well because you'd need to start threading a lifetime param through ABICalle{e,r}Impl and a bunch of places and it quickly runs into shared vs exclusive borrows of the same stuff.

Working on refactoring all of signature stuff right now.

@jameysharp
Copy link
Contributor

My trick was I didn't change any types except the return type of ensure_struct_return_ptr_is_returned. That meant I didn't have to thread lifetimes around anywhere, and the only downside was that sometimes the value would get cloned anyway because one of the other types needed to own a copy. That still gave plenty of opportunities to avoid allocations, even if it didn't eliminate all of them.

Here's my cow-sig branch in case anyone wants to look at it. I'm not currently proposing merging it since it sounds like we should do something different. So I haven't re-evaluated whether it has any performance impact.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift:area:aarch64 Issues related to AArch64 backend. cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. cranelift:area:x64 Issues related to x64 codegen cranelift:docs cranelift Issues related to the Cranelift code generator isle Related to the ISLE domain-specific language
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants