Cranelift: Simplify leaf functions that do not use the stack #2960
@akirilov-arm I haven't looked into your patch in detail yet, but the s390x back-end already doesn't allocate any stack frame if it is not needed. The
Thanks very much for tackling this! It's been on the "nice-to-have" todo-list for a long time; I'm happy to see it finally taken care of.
I think that we probably want to address this on aarch64 and x64 at the same time; it seems that almost all the pieces are there, since the ABI implementation is largely shared, unless I am missing something. (If it turns out to be significantly more work, then of course we can get to it later!)
The only uncertainty I have regarding this optimization is how stack unwinding / backtraces are maintained; I see this was discussed some already in #1148. I think that if we have no unwind instructions, the default CFA definition is sufficient since we never adjust SP, at least on SysV platforms. The same should be true for Windows, I think. The metadata that allows a backtrace to map PC to the current (leaf) function should still be present. But we should verify this: could you take one of the tests that depends on unwinding and stackmaps, such as the GC smoketest, and check that it is testing both the no-frame and with-frame cases?
Thanks again for this!
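The unwinding argument above can be illustrated with a toy model. This is only a sketch: `Regs` and `unwind_frameless_leaf` are hypothetical names, not Cranelift or unwinder APIs, and the default rules (CFA = SP + 0, return address in the link register) are assumed AArch64 CIE defaults rather than something taken from this patch.

```rust
/// Toy register state seen by an unwinder.
/// Hypothetical type for illustration; not a Cranelift or gimli API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Regs {
    sp: u64, // stack pointer
    lr: u64, // link register, holds the return address on entry
    pc: u64, // program counter
}

/// Applies only the CIE-level default rules assumed here:
/// CFA = SP + 0, and the return address lives in LR.
/// An empty FDE contributes no further rules.
fn unwind_frameless_leaf(callee: Regs) -> Regs {
    // The leaf never adjusts SP, so the default CFA holds at every PC in it.
    let cfa = callee.sp;
    Regs {
        sp: cfa,       // the caller's SP is the CFA
        lr: callee.lr, // not recoverable from these rules; left unchanged in this model
        pc: callee.lr, // execution resumes at the return address
    }
}
```

If the function did adjust SP or spill the link register, the defaults would stop being valid partway through the function, which is when explicit unwind instructions become necessary.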
I have enabled a couple of additional tests and made some minor changes, but I haven't yet checked how much work the x64 backend is going to require. I am going to do that next, so please do not merge yet.
@cfallin It turns out that there are already tests that use the unwinding information in a suitable way - in particular,
As for whether the defaults are sufficient in the AArch64 case - I checked the code that generates the DWARF Common Information Entry (CIE), and it sets both the Canonical Frame Address (CFA) and the return address correctly (among other things), so an empty Frame Description Entry (FDE) should be fine. In fact, I compiled a simple C function and looked at the unwinding information with
@cfallin It looks like doing this optimization in the x64 backend fails on macOS. I am unable to work on this, so would it be acceptable to submit just the AArch64 part?

- if flags.unwind_info() {
+ if flags.unwind_info() && setup_frame {
I can't say I have looked into this much, but isn't there some way for the unwind info to still unwind leaf functions? Looking at the test where macOS fails, I think this problem will also exist for aarch64?
Well, doesn't the comment next to the code imply that unwinding on Apple silicon is broken irrespective of my changes, so any potential breakages introduced by my patch on that platform shouldn't be a blocker to progressing with the PR (and we don't test that configuration in CI anyway)?
On the other hand, AFAIK there are no known issues with unwinding on x86-64 macOS.
> unwinding on Apple silicon is broken
Right now, unwinding on M1 doesn't panic/crash, it is just materializing an incomplete backtrace, and I would be interested in at least keeping it not crashing. So I'd be happy to try your patch on Apple Silicon, fwiw! (Also it makes sense to me to get this optimization in for aarch64 only at start, and then trying to untangle what's going on on x64.)
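For readers following this thread, the one-line change under review can be sketched roughly as below. This is a simplified model, not the actual backend code: the `Inst` variants and `gen_prologue` are hypothetical stand-ins, and only the `unwind_info && setup_frame` condition comes from the diff.

```rust
/// Hypothetical, simplified stand-ins for backend instructions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Inst {
    /// Save the old frame pointer (and the link register on AArch64).
    PushFrameRegs,
    /// Pseudo-instruction that records the frame setup for unwind info.
    UnwindPushFrameRegs,
    /// Point the frame pointer at the new frame record.
    SetFramePointer,
}

/// Sketch of prologue generation: unwind pseudo-instructions are emitted
/// only when a frame is actually set up, mirroring
/// `flags.unwind_info() && setup_frame` from the diff.
fn gen_prologue(unwind_info: bool, setup_frame: bool) -> Vec<Inst> {
    let mut insts = Vec::new();
    if setup_frame {
        insts.push(Inst::PushFrameRegs);
    }
    if unwind_info && setup_frame {
        insts.push(Inst::UnwindPushFrameRegs);
    }
    if setup_frame {
        insts.push(Inst::SetFramePointer);
    }
    insts
}
```

With `setup_frame` false the prologue is empty, so the unwinder has nothing beyond the CIE defaults to go on. Whether those defaults suffice on a given platform is exactly the open question in this thread.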
@akirilov-arm this looks good, and it's fine to pass on x64 for now -- the infrastructure is now in place, in any case, so it should be much easier after all the work you've done here. Thanks!
As an aside, IMHO we should do something about the expected outputs in compile tests having to change with every cross-cutting change (like frame setup or regalloc or ...); but that's a thought for another issue :-) Sorry about the tedium of updating them here.
I do think we need to resolve the macOS unwinding question -- I'll leave it to @bnjbvr to test and give another r+ on this if it looks good (thanks!).
@cfallin Thanks for the review! I just reverted the x64 changes, but left the commit in the history in case anyone else would like to use it as a starting point in the future.
Will check first thing tomorrow!
Leaf functions that do not use the stack (e.g. do not clobber any callee-saved registers) do not need a frame record. Copyright (c) 2021, Arm Limited.
…ack" This reverts commit a531d78. Copyright (c) 2021, Arm Limited.
Unfortunately can't check because of #3256 (wasmtime testing on aarch64-darwin is busted before your PR), so let's not block it on that, and merge this one; we can get back to it later. Thanks :)
Cranelift has had the ability for some time to identify leaf functions; by Cranelift's definition, a leaf function is one that knows of no other call signatures. bytecodealliance#1148 noted how it would be a good idea to avoid extra frame setup work in leaf functions and bytecodealliance#2960 implemented this for aarch64 and s390x. This improvement was not made for x64 due to some test failures. This change avoids any frame setup for non-stack-using leaf functions in x64.
Leaf functions that do not use the stack (e.g. do not clobber any callee-saved registers) do not need a frame record; this has been discussed in issue #1148. I am not familiar with the ABIs of other architectures, so I don't know if it is safe to apply the same optimization, and that's why only the AArch64 backend does it.
@cfallin I'd appreciate any feedback on how these changes interact with unwinding; in particular, do we need an Inst::Unwind pseudo-instruction for the simple leaf functions we are optimizing?
cc @abrown @bnjbvr @uweigand
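The condition stated in the description (a leaf function that does not use the stack needs no frame record) can be sketched as a small predicate. This is a minimal illustration under assumptions: `FunctionSummary` and its fields are hypothetical and do not correspond to Cranelift's actual ABI data structures.

```rust
/// Hypothetical summary of what a backend knows about a function
/// once register allocation is done.
struct FunctionSummary {
    /// True if the function makes no calls.
    is_leaf: bool,
    /// Bytes of stack needed for spill slots, stack slots, and so on.
    stack_bytes: u32,
    /// Number of callee-saved registers the function clobbers
    /// (each one would have to be saved on the stack).
    clobbered_callee_saves: usize,
}

/// A frame record is needed unless the function is a leaf that
/// touches neither the stack nor any callee-saved register.
fn needs_frame_record(f: &FunctionSummary) -> bool {
    !f.is_leaf || f.stack_bytes > 0 || f.clobbered_callee_saves > 0
}
```

Keeping the check conservative (any stack use at all forces a frame) means existing code generation is unchanged for every function except the simplest leaves.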