Optimize jump stubs on arm64 #62302
> If distance between b <pcRelDistance> …
Wouldn't it take 4 instructions to populate an address constant in the worst case?
We are not good at hoisting address constant population, so if this is part of a loop, we might regress.
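For concreteness, here is roughly what that worst case looks like (an illustrative C++ emitter of my own, not runtime code): an arbitrary 64-bit address is materialized into a scratch register with one movz plus three movk, so if the JIT re-materializes it inside a loop instead of hoisting it, that's four extra instructions per iteration.

```cpp
#include <cstdint>

// Illustrative sketch (not taken from the runtime sources): building an
// arbitrary 64-bit immediate in x16 takes movz + 3x movk in the worst case.
static void EmitMovImm64(uint32_t* code, uint64_t imm)
{
    code[0] = 0xD2800010u | (uint32_t)(((imm      ) & 0xFFFF) << 5); // movz x16, #imm[15:0]
    code[1] = 0xF2A00010u | (uint32_t)(((imm >> 16) & 0xFFFF) << 5); // movk x16, #imm[31:16], lsl #16
    code[2] = 0xF2C00010u | (uint32_t)(((imm >> 32) & 0xFFFF) << 5); // movk x16, #imm[47:32], lsl #32
    code[3] = 0xF2E00010u | (uint32_t)(((imm >> 48) & 0xFFFF) << 5); // movk x16, #imm[63:48], lsl #48
}
```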
Note that we require that the target is atomically patchable without suspending execution. That makes it close to impossible to split this into multiple instructions. I agree that the whole scheme for how we deal with the precodes and back patching is likely very suboptimal on non-x86 architectures (and maybe even on current x86). I think the more optimal path may look like this:
My bet would be that the bottleneck is caused more by the call + indirect jump combination than by the memory load. Patterns like that used to cause pipeline stalls on x86 in the past, and I think it is likely that they are a problem for arm64 too.
@jkotas thanks for a detailed explanation! 👍
@BruceForstall Right, I was wondering if it's still faster, because otherwise I'd expect native compilers to always prefer doing a memory load from the data section rather than doing 4 movs (e.g. https://godbolt.org/z/cWYsTq6P6). I played around with it locally; from what I read, it takes ~3x fewer cycles to do the 4 movs.
I think we should look into optimizing the jump stubs and friends for arm64. I agree with your initial observation that there is likely a bottleneck.
I guess we're also more likely to hit a jump stub on ARM64; quoting jump-stubs.md:
so even pretty simple TE benchmarks hit that.
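For context, the reason arm64 is more exposed here (as jump-stubs.md describes) is the short reach of direct branches: b/bl encodes a signed 26-bit word offset, roughly ±128MB, versus roughly ±2GB for a rel32 jmp/call on x64. A tiny illustrative range check (my own helper, not runtime code):

```cpp
#include <cstdint>

// Purely illustrative: a direct arm64 b/bl encodes a signed 26-bit word offset,
// i.e. +/-128MB of reach from the branch instruction. Targets outside that
// window have to be routed through a jump stub.
static bool FitsInArm64DirectBranch(uint64_t branchAddr, uint64_t targetAddr)
{
    int64_t delta = (int64_t)(targetAddr - branchAddr);
    return delta >= -(int64_t)0x08000000 && delta < (int64_t)0x08000000; // +/- 2^27 bytes
}
```

With code spread across a 64-bit address space, far more call targets fall outside a ±128MB window than outside a ±2GB one, which is when a jump stub gets allocated.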
Just noticed that this shows up even for a completely empty program.
The following methods request a jump stub on arm64 during compilation of a completely empty program in TC=0:
None of them does that on x64.
Apparently all FCalls use jump stubs, e.g.:

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System;

public class Program
{
    static void Main() => CallCos(3.14);

    [MethodImpl(MethodImplOptions.NoInlining)]
    static double CallCos(double d) => Math.Cos(d);
}
```

```asm
00000000 stp fp, lr, [sp,#-16]!
00000000 mov fp, sp
00000000 bl  System.Math:Cos(double):double  ;; <--- jump stub
00000000 ldp fp, lr, [sp],#16
00000000 ret lr
```

This explains why some microbenchmarks are slow - almost all Math.* functions basically go through double calls.
@jakobbotsch suggested changing these constants: runtime/src/coreclr/pal/src/include/pal/virtual.h, lines 188 to 193 (at 23de817).
```csharp
using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Program
{
    static void Main(string[] args) =>
        BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

    [Benchmark]
    [Arguments(3.14)]
    public double Test(double d) => Math.Cos(d) * Math.Sin(d) * Math.Tan(d); // 3 InternalCalls
}
```
😮
#70707 improved perf here, mainly because of what we used to use before. There are still ways to improve it, though. Moving to Future.
Moving here, since apparently it's also an issue for x64 for large apps: I noticed that BingSNR (when I run it locally on Windows-x64) emits 44k jump stubs (44k calls to …). Can we do anything about this? E.g., just like in #64148, emit 64-bit addresses to precode slots directly in methods.
Notice that the expensive path goes into HostCodeHeap. HostCodeHeap is used for DynamicMethods. Each dynamic method gets its own set of jump stubs that are all freed when the dynamic method is collected. It is how we ensure that the dynamic stubs are not leaking when the dynamic methods are collected. It means the cost of the jump stubs is not amortized for dynamic methods. I think it is why they are expensive.
Yes, I think it would make sense for dynamic methods at least. (Alternatively, we may be able to come up with some sort of ref-counting scheme for jump stubs in dynamic methods so that their cost gets amortized.)
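To make that parenthetical a bit more concrete, here is a very rough sketch of what a shared, ref-counted jump-stub cache for dynamic methods could look like (names and structure are hypothetical, mine rather than the runtime's):

```cpp
#include <mutex>
#include <unordered_map>

// Hypothetical sketch of the ref-counting idea: dynamic methods that need a
// stub for the same target share one stub and bump a ref count, so the stub
// is only freed once the last referencing dynamic method is collected.
class SharedJumpStubCache
{
    struct Entry
    {
        void*  stubCode = nullptr; // the emitted jump-stub code
        size_t refCount = 0;
    };

    std::mutex m_lock;
    std::unordered_map<void*, Entry> m_stubs; // keyed by branch target

public:
    // Called when a dynamic method needs to reach 'target' through a stub.
    void* Acquire(void* target)
    {
        std::lock_guard<std::mutex> guard(m_lock);
        Entry& e = m_stubs[target];
        if (e.refCount++ == 0)
        {
            // allocate executable memory and emit the stub here (elided)
        }
        return e.stubCode;
    }

    // Called when the owning dynamic method is collected.
    void Release(void* target)
    {
        std::lock_guard<std::mutex> guard(m_lock);
        auto it = m_stubs.find(target);
        if (it != m_stubs.end() && --it->second.refCount == 0)
        {
            // free the stub's executable memory here (elided)
            m_stubs.erase(it);
        }
    }
};
```

The point is just that a stub for a given target would outlive any single dynamic method and be freed only when the last referencing dynamic method is collected, so its cost gets amortized the same way it is for regular code heaps.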
Ah, so for this specific project it's the same problem with redundant dynamic methods (at least they look redundant), which they might fix.
Right, there are two different concerns: (1) Is the given usage of dynamic methods warranted? (2) Does the runtime behave efficiently for large projects with a lot of dynamic methods? It is still worth fixing (2) even if the answer to (1) is negative for BingSNR.
On x64 we emit the following code for jump stubs, as I understand from runtime/src/coreclr/vm/amd64/cgenamd64.cpp, lines 505 to 507 (at 70d20f1), while on arm64 we make a memory load (from a data section, via pc): runtime/src/coreclr/vm/arm64/cgencpu.h, lines 294 to 296 (at eeb79b3).
I'm just wondering whether it wouldn't be faster to do what x64 does and emit the constant directly, even if it takes 4 instructions to populate it...
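For reference, here is a sketch of the shapes being compared, based on my reading of those files (treat the encodings and sizes as assumptions rather than verified facts): the current arm64 stub is a pc-relative literal load plus an indirect branch, with the 8-byte target stored right after the code, while the x64 stub is roughly a mov rax, imm64 followed by jmp rax. The literal-load form has one property a movz/movk sequence lacks: the target can be repointed with a single aligned 8-byte store while code is running, which ties into the atomic-patchability constraint @jkotas raises above.

```cpp
#include <cstdint>

// Sketch of the arm64 jump-stub shape as I read cgencpu.h (the encodings are
// my transcription; treat them as an assumption): 16 bytes total, with the
// target stored as data right after the two instructions.
static void EmitArm64JumpStub(uint32_t* code, void* target)
{
    code[0] = 0x58000050; // ldr x16, [pc, #8]  ; load target from the slot below
    code[1] = 0xD61F0200; // br  x16
    *(void**)(code + 2) = target;
}

// Because the target lives in an aligned 8-byte slot, it can be repointed with
// a single pointer-sized store (the required barriers and cache maintenance
// are elided here) without stopping running code.
static void RepointArm64JumpStub(uint32_t* code, void* newTarget)
{
    *(void* volatile*)(code + 2) = newTarget;
}

// The x64 stub is, roughly: mov rax, imm64 (10 bytes) + jmp rax (2 bytes).
// An arm64 "emit the constant directly" variant would be movz + 3x movk + br
// (20 bytes) and could not be retargeted atomically, since four separate
// instructions would have to change.
```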
I'm asking because I have a feeling it could be a bottleneck, if I'm reading the TE traces (Plaintext benchmark) correctly:

cc @dotnet/jit-contrib @jkotas