-
Notifications
You must be signed in to change notification settings - Fork 790
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tail opcode being emitted for normal methods, destroys JIT optimizations #6329
Comments
As a note, this is the performance benchmark that was being used: https://github.com/EBrown8534/Logging-Benchmark Given a workaround from @NinoFloris the results looked as follows:
However, pre-workaround we were seeing large disparities between C# and F# functions:
You'll note, that in benchmark set 1, the
For the first, and:
For the second. (Each method was modified to include or exclude the Across all the benchmarks this one difference changed execution times dramatically. |
This has been a bugbear of mine for many years! Mix the tail instruction with arguments with value type > 64 bits and performance is terrible (the > 64 bit thing is because arguments can't be passed by registers and so the JIT goes to a lot of effort to set up a stack frame that can be reused) I think it was a mistake to be default insert |
@manofstick How would I go about turning that off for a project? I never have mutually-recursive functions, so it'd be great if I could turn this off because we have some F# code along very hot paths that I'd like to take advantage of this with, without needing to add a manual |
|
As well as
|
@manofstick problem is that these also get output during debug (see sharplab example, just switch to release and check bottom IL methods). In this case the compiler is just wrong, there is nothing to tailcall optimize as there is no recursion |
@NinoFloris However, it also appears that the compiler can help the Jit out by not generating unnecessary .tails. Kevin |
@KevinRansom the problem is that some implementations of jit or even native
compilation don't really care that much about tail. The reason is probably
that C# is not really emitting it and that means they don't have real
pressure. @dsyme had tried to make them aware of the problem, but I assume
it's always the first thing that gets punted.
So (together with always having fable in mind) I think we need to do two
things. A) continue to beg native compilation team and jit team to invest
into tail B) try to emit it only if really needed, because success of A)
might be still miles away
Paul Westcott <notifications@github.com> schrieb am Mo., 18. März 2019,
21:00:
… Check out @mrange <https://github.com/mrange>'s spiel on tail calls here
<https://gist.github.com/mrange/1f26f3bd219da4d407ed6dde56c461dc>...
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#6329 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AADgNDv5ul6oTRQSRjjg8BI59B07o_5dks5vX_BngaJpZM4bvO4q>
.
|
@KevinRansom the jit has never ever worked well in the presence of .tail... Sharplab runs netfx, but how I came to know of tail woes was through coreclr issues like https://github.com/dotnet/coreclr/issues/18406. But searching for "tail" in coreclr shows a ton of other issues like And quite some closed issues where many are only fixed in upcoming 3.0 Search link Tail is a bugfarm and it'll take years before it'll be as fast as just doing callvirt nop instead of tail callvirt |
It's not only that On rare occassions I have seen the jitter replace a I have upvoted issues, I have spammed chat on twitch during .NET community standups about giving
PS: @manofstick now I am in AVX + Struct type performance hell. The jitter does lot of "funny" stuff :) |
@mrange yeah that helper is for "slowtailcall" and the jmp is when the jit decides "fasttailcall" is possible (which indeed is dependent on argument sizes but also Linux or arm architecture doesn't have any fast tail call support yet) |
@AndyAyersMS would it be possible to keep some/most of the optimization and inlining with .tail? Would AggressiveOptimization (dotnet/coreclr#20009) attribute help here? |
I'll have to read the notes above more carefully, but a few quick replies:
|
@forki Yeah the other implementations like Xamarin don't do anything with tailcalls, the arm compiler code thats generated by AOT just ignores it. When I worked for Xamarin we used to recommend that tailcalls were always disabled in a mobile solutions for that reason. |
My understanding this is inaccurate and out-of-date. Please see mono/mono#6914 (comment), e.g. "pretty much everything is known to work". There were some subsequent patches/fixes but my understanding is that based on the work by @jaykrell the comment above is no longer accurate. |
@AndyAyersMS we did notice that even with adding noinline on the c# version it, the c# version, was still faster by quite a bit. It was only once we removed that explicit tail (and removed noinline on the c# method) that we got comparable results within noise range. Are there more optimizations in the jit besides inlining that work less well or not at all in the presence of explicit tail? |
Not many, no -- there is a restriction on the unboxing optimization (see dotnet/coreclr#21890). |
@NinoFloris Check and compare the generated C# and F# IL. By default F# compiler will inline the small methods however this IL inlining is not always profitable due the more instruction, the JIT may able to make a better inlining decision. (#5178). |
.tail is a dilema. On desktop, .tail can tailcall anything, sometimes a perf gain but sometimes a large perf loss. You sometimes have to chose stack or cycles. |
@zpodlovics I've obviously been looking at the IL ;) F# doesn't inline top level methods and functions AFAIK that only happens if they're marked inline. F# does do some optimizations for local functions and closures indeed. |
Something of interest is that the F# compiler does have logic to eliminate emit of "unnecessary" tailcall instructions even when
In this particular case, the optimization is not triggered because the call to List.Add is a virtual call.
There are cases where we want to make tailcalls through virtual calls to .NET libraries - for example when implementing long deep calling control structures like IStructuralComparison. However this is not such a case (adding something to a list is never going to result in unbounded stack usage) and in the vast majority of cases we really don't need to make tailcalls to .NET virtual methods. I think it would be quite reasonable to essentially somehow whitelist pretty much every virtual call in .NET core libraries apart from the IComparable/IEqualityComparer/IStructuralComparable/... APIs. |
I think we can improve how RyuJit handles the interaction of .tail calls and inlining, see dotnet/coreclr#23370. It is too late to get any changes from this into .Net Core 3.0, but addressing this in 3.1 seems quite possible, assuming we can develop reasonable heuristics. I also expect us to fix some of the tail call limitations we see on some ABIs in .Net Core 3.1. And perhaps there are ways to reduce the need for cases that use the |
That sounds like a very important case, though I suppose to complete it would have to be transitive to some depth, e.g. A --> B --> C --> A --> B --> C ad infinitum, where A requires 10, B 20 and C 30..... |
Since the jit is the one issuing the slow tail calls, it is not quite that bad. Once we're done inlining we know all the tail call candidate sites in the method, and we should be able to compute how much arg stack each needs, as well as what the method would normally provide. If there's a mismatch, we just need to boost the method's arg area size to the maximum needed by any of the calls, and make sure the prolog does the right magic. If anyone's up for benchmarking a very early version of the jit changes, feel free to build from my fork on the InlineExplicitTailCalls branch. This doesn't have the right heuristic yet so it may be too aggressive. And it does not try and remove the need for helper assisted tail calls. Let me know if you run into issues with it, or have interesting results. |
@AndyAyersMS How difficult would that be to setup? We have an extremely hot path of F# code in our website that I could spin another cluster node up for and load this version in. We already have timings between all nodes in the cluster, so at the very least I could give you a (relative) performance differential of real-world use-cases (at least, our real-world use-case). I.e. is it overall faster, slower, or no change. |
If you can run against the most recent 3.0 preview SDK, and can build the coreclr repo, then it should not be too bad. You can follow the Using Your Build instructions. If you want something that works with 2.2 or 2.1 the changes in my branch should port over fairly cleanly. I can prop something up for this if you like but it might take a few days. |
@NinoFloris Consider this code:
Is that call from f2 to f1 a "recursive call"? I'm guessing not but if you don't honor that tail call then f3 will stack overflow. |
@jdh30 Yes it really only gets problematic when dynamic dispatch is involved. |
Hi, I arrived this issue via #13398 and even though most of the details go way over my head, I think the recommendation is to disable tail calls (optimisation, that is) . However, I was wondering if the |
It won't, the attribute only enables tail recursion analysis, and has nothing to do with tail prefixes in IL. |
After a benchmarking discussion in the FSSF slack we found the main culprit for the C# / F# speed differences.
tail.
was being emitted for a non recursive method, that turns off JIT inlining and other optimizations.Repro can be found here.
https://sharplab.io/#v2:DYLgZgzgNALiCWwA+B7ADgUwHYAIDKAnhDBgLYCwAUOtvkSaQHQBKArljPKRowMIqk0iDACc8ogG7wAxhghUqMAphwAZFAHMNo1RgkZgOALw4AgsZwAGHEhwAhHBYCMOBZSUr1W0QAoAlMZUjo7AGDA43BAQAIbaEBaExGR8KMCh0pwoWBCMAOLYojKMqvDEADzEIvBYGgB8Pk6WTZZ+ANxBwR2OANplALJhABYoACYAkoLAPgMww+OTAPJomdmMploichDw+mNYwNXVGn61ALpdOKHhWOg4/hb+F9ykAEaiOLOlxZo+ofqGIDUmm0Il0/ygEQgGhwgMqRwCRguwQiW1icjWIxGPlIUL8SOCNzQj0oTzIbxEH0GXy8vz0BhhQO8oLpwAhKBeACsGQByaII/GOSIxOIYrHsjmMAAqKDwMCqNX8eMoyORhOJimUGBwkuiiC8IPuiOVwSuKKiaPiJkSDBSaQwGXgWRy+SwhWkxVKMAqcqO9UazTaF1Jr3enxyNL+9MB+p0LIhOOhsJ9NX5xpVZuF6NMmOxuLcyOe5Mp1J+kYBjJBYIMbM5PL5gTTKqFFtFPnFUplyY0iqoQA===
Another smaller codegen issue can be seen in the IL for
ObjLog<'a>
The C# IL for the same method completely does away with the arg to local dance and just uses the arg directly.
These extraneous load/store patterns confuse the JITs dataflow analysis whether things are being modified or not, hampering enregistration.
https://sharplab.io/#v2:C4LglgNgNAJiDUAfAAgBgATIIwBYDcAsAFDHIDM6ApgHYCuAtugDID2A5m5QE5OUBulCOmIBvYunQBBKOPQAhYgF9ipCsgBM8gIYBnSqw7dRsiQAcuYPluCVmYHcAA82VAD509Sjp1bOO9AC86NSUAO52Ds5YbgAU0agJAJSERBISJphqOMzsMQacPPyC6BBF0JjRHjpsiRliqWmNnt6+XgB0kjAwMfTVyRnKJA0S5JjZBo4AKq557AW8AkKli1Dok+gsAEYAVrXD6PWNTV4+fh1dMVvbbZMsAMrAFtRsMYn9+4OKQA=
If anything the focus should be on making sure
tail.
is only ever emitted for real recursive methods/functionsThe text was updated successfully, but these errors were encountered: