Question: Fastest way to call non-generic method #13742
Comments
If you're talking about the C# compiler, then I don't think that's true. Code like …
It does, but only for value types: in that case, the JIT can figure out which branches don't apply and eliminate them. But because of the code sharing the runtime uses to implement generics for reference types, the same optimization doesn't work there. (Unless the method is inlined.)
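As a hedged illustration of that point (the names here are invented for the example): for a value-type `T` the JIT compiles a separate specialization in which each `typeof(T)` comparison folds to a constant, while all reference types share one compiled body in which the checks remain real runtime tests.

```csharp
using System;

public static class Dispatcher
{
    // For T = int or T = long (value types), the JIT emits a specialized body
    // and every typeof(T) == typeof(...) test folds to a constant, so the dead
    // branches disappear entirely. For reference types (e.g. string) one shared
    // body is compiled and the comparisons remain actual runtime checks.
    public static int SizeOfKind<T>()
    {
        if (typeof(T) == typeof(int)) return 4;
        if (typeof(T) == typeof(long)) return 8;
        if (typeof(T) == typeof(string)) return -1; // shared generic code: real check
        throw new NotSupportedException();
    }
}
```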
Yes, of course, but not automatically from …
When exactly? I wrote this benchmark and the disassembly is littered with checks:

```csharp
public struct A { public uint a; }
public struct B { public ushort b; }

private static class NonGeneric {
    public static bool TryRead(ref A a, uint from) {
        a = new A { a = from };
        return true;
    }
    public static bool TryRead(ref B b, uint from) {
        b = new B { b = unchecked((ushort)from) };
        return true;
    }
}

public bool TryRead<T>(ref T t, uint from) {
    if (typeof(T) == typeof(A)) return NonGeneric.TryRead(ref Unsafe.As<T, A>(ref t), from);
    else if (typeof(T) == typeof(B)) return NonGeneric.TryRead(ref Unsafe.As<T, B>(ref t), from);
    else throw new InvalidCastException();
}

[Benchmark(OperationsPerInvoke = OperationsPerInvoke)]
public uint Read() {
    var result = 0u;
    A a = default;
    B b = default;
    for (var u = 0u; u < OperationsPerInvoke; u++) {
        if (0 == (u & 1)) {
            TryRead(ref a, u);
            result += a.a;
        } else {
            TryRead(ref b, u);
            result += b.b;
        }
    }
    return result;
}
```
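For reference, a minimal BenchmarkDotNet harness such a snippet would plug into (a sketch; the class name `ReadBenchmarks` and the constant's value are invented):

```csharp
using BenchmarkDotNet.Running;

public partial class ReadBenchmarks
{
    // Hypothetical constant referenced by [Benchmark(OperationsPerInvoke = ...)].
    private const int OperationsPerInvoke = 64 * 1024;

    // The structs, the NonGeneric class, TryRead<T> and Read() from above
    // would live in this class.
    public static void Main() => BenchmarkRunner.Run<ReadBenchmarks>();
}
```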
This isn't a question for Roslyn; it is more a question for the runtime. Perhaps you're asking about the coreclr runtime? Assuming so, transferring to that repo.
Yes, posting here would've been the better choice. I practically have to put `MethodImplOptions.AggressiveInlining` on about everything; however, it pretty much breaks the performance of everything below it (everything it calls). Instead of inlining to a few instructions, the reads for individual fields will have very long lists of moves from `[rsp+...]` slots.

Why does this happen and what can I do to prevent it? E.g. it is surprising it is only a factor of 8, given how much assembly this is to read one field -- and this is just half the code for one field, since it covers only one of the branches:

```asm
M00_L06:
lea r10,[rsp+160h]
mov qword ptr [r10],r9
movzx r9d,byte ptr [rax+10h]
test r9b,2
jne M00_L07
mov r9,qword ptr [rsp+160h]
mov qword ptr [rsp+158h],r9
xor r9d,r9d
mov dword ptr [rsp+14Ch],r9d
mov r9,qword ptr [rsp+158h]
mov qword ptr [rsp+130h],r9
mov r9,qword ptr [rsp+130h]
mov qword ptr [rsp+128h],r9
mov r9,qword ptr [rsp+128h]
mov qword ptr [rsp+138h],r9
mov r9,qword ptr [rsp+138h]
mov qword ptr [rsp+150h],r9
lea r9,[rsp+14Ch]
mov r10,qword ptr [rsp+150h]
movzx r10d,word ptr [r10]
mov word ptr [r9],r10w
mov r9,qword ptr [rsp+150h]
mov qword ptr [rsp+108h],r9
mov r9,qword ptr [rsp+108h]
mov qword ptr [rsp+100h],r9
mov r9,qword ptr [rsp+100h]
mov qword ptr [rsp+120h],r9
mov r9,qword ptr [rsp+120h]
mov qword ptr [rsp+0F8h],r9
mov r9,qword ptr [rsp+0F8h]
add r9,2
mov qword ptr [rsp+0F0h],r9
mov r9,qword ptr [rsp+0F0h]
mov qword ptr [rsp+118h],r9
mov r9,qword ptr [rsp+118h]
mov qword ptr [rsp+0E8h],r9
mov r9,qword ptr [rsp+0E8h]
mov qword ptr [rsp+0D8h],r9
mov r9,qword ptr [rsp+0D8h]
mov qword ptr [rsp+0D0h],r9
mov r9,qword ptr [rsp+0D0h]
mov qword ptr [rsp+0E0h],r9
mov r9,qword ptr [rsp+0E0h]
mov qword ptr [rsp+140h],r9
mov r9,qword ptr [rsp+140h]
mov qword ptr [rsp+0C8h],r9
mov r9,qword ptr [rsp+0C8h]
mov qword ptr [rsp+0C0h],r9
mov r9,qword ptr [rsp+0C0h]
mov qword ptr [rsp+158h],r9
mov r9d,dword ptr [rsp+14Ch]
movzx r9d,r9w
mov dword ptr [rsp+20Eh],r9d
mov r9,qword ptr [rsp+158h]
mov qword ptr [rsp+160h],r9
jmp M00_L08
M00_L07:
```

In comparison, the code that reads a two-byte field, when benchmarked alone, basically just produces:

```asm
mov ecx, word ptr [rax]
```

I am guessing one reason is some automatically introduced locals due to casts and/or dereferences, but how can I make them go away, short of writing quite verbose, repetitive code?
@ericwj sorry for not seeing this earlier. I'm trying to understand this a bit better -- is the assembly listing just above from the sources a bit further above? The …
No, it is from code using the same pattern, but with some 36 instead of 2 types to branch on at the generic entry point.
@AndyAyersMS the main difference from the benchmark is that the code size is not micro.
If you can share a code example where you're seeing this problem, I'd be happy to investigate further. I don't see these issues in the small repro above.
I happen to have rewritten everything @AndyAyersMS, so I don't have this anymore. I don't see the exact problem from above appearing anymore. Could of course have to do with running on 3.1, which I don't mind. I still do have the issue that I need to slap `MethodImplOptions.AggressiveInlining` on about everything.

I kind of reproduced the general idea here. This will need Windows. The deserialization is still comparatively simple, because all fields are exactly primitive types. In my project there often are fields with variable sizes or of some string kind and such that need more logic. Writing that sometimes produces complaints about scope-escaping `ref`s.

Related question: can you clarify for me -- writing code like this, do I need to pin anything?
No, you don't need to pin anything.
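For illustration, a minimal sketch (hypothetical names) of why no pinning is needed with the ref-based style: the GC tracks managed refs and updates them if the object moves, unlike raw pointers, which would require `fixed`.

```csharp
using System.Runtime.CompilerServices;

public static class RefReader
{
    // No pinning needed: if the GC relocates `buffer`, the managed ref into it
    // is updated automatically. A byte* would require fixed/GCHandle instead.
    // Caller must ensure offset + 1 < buffer.Length.
    public static ushort ReadUInt16(byte[] buffer, int offset)
        => Unsafe.ReadUnaligned<ushort>(ref buffer[offset]);
}
```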
I've cloned your repo and see similar data. Where should I start investigating?
Perhaps have a look at …. I would say for starters it shouldn't be necessary to always have to put `MethodImplOptions.AggressiveInlining` on everything.

Let me focus on one example. Without the attribute:

```asm
; Benchmarks.Benchmarks.FromByteRef_NoInlining()
(...preamble...)
M00_L00:
(...A...)
lea rdx,[rsp+2C]
mov rcx,rdi
call Benchmarks.NoInlining.TryReadGeneric[[Benchmarks.B, Benchmarks]](Byte ByRef, Benchmarks.B ByRef)
(...etc...)
; Benchmarks.NoInlining.TryReadGeneric[[Benchmarks.B, Benchmarks]](Byte ByRef, Benchmarks.B ByRef)
mov rax,offset Benchmarks.NoInlining.ReadField[[System.UInt16, System.Private.CoreLib]](Byte ByRef, UInt16 ByRef)
jmp rax
; Benchmarks.NoInlining.ReadField[[System.UInt16, System.Private.CoreLib]](Byte ByRef, UInt16 ByRef)
movzx eax,word ptr [rcx]
mov [rdx],ax
lea rax,[rcx+2]
ret
```

With the attribute that whole bunch reduces to:

```asm
; Benchmarks.Benchmarks.FromByteRef_FullInlining()
(...preamble...)
M00_L00:
(...A...)
lea r8,[rsp+24]
movzx eax,word ptr [rdx]
mov [r8],ax
(...etc...)
```

And the performance difference is 6.23x on your machine (175 ns versus 28 ns average) for a list of some 26 or so of these sequences, which consistently look like the first bunch without the attribute and like the second bunch with it. I have little idea how exactly inlining is decided, but seeing that the result is 3 machine code instructions, somehow the cost estimated before deciding to inline should come out as bearable. The follow-up would be about composing code like this in a larger program and having it be inlined consistently regardless of what method it is called from.
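A hedged C# reconstruction of what `TryReadGeneric`/`ReadField` presumably look like (only the names come from the listings above; the bodies and the `ref byte` cursor convention are assumptions):

```csharp
using System.Runtime.CompilerServices;

public static class NoInlining
{
    // Without any attribute this compiles to the "jmp ReadField" stub in the
    // first listing; marking both methods with
    // [MethodImpl(MethodImplOptions.AggressiveInlining)] collapses the whole
    // chain into the movzx/mov pair of the second listing.
    public static ref byte TryReadGeneric<T>(ref byte cursor, ref T value) where T : unmanaged
        => ref ReadField(ref cursor, ref value);

    // Reads one unmanaged field, stores it through the ref, and returns the
    // advanced cursor (the movzx / mov / lea triple above).
    public static ref byte ReadField<T>(ref byte cursor, ref T value) where T : unmanaged
    {
        value = Unsafe.ReadUnaligned<T>(ref cursor);
        return ref Unsafe.Add(ref cursor, Unsafe.SizeOf<T>());
    }
}
```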
Looks like a regression in your preview build or some weirdness on your machine that … That said, there is weirdness here too: in the preamble, for every `cursor = ref location;` line, it for example does

```asm
lea r8,[rdx+4]
lea rax,[rdx+4]
lea r9,[rdx+4]
lea r10,[rdx+4]
lea r11,[rdx+4]
lea rdi,[rdx+4]
lea rbx,[rdx+4]
lea rbp,[rdx+4]
lea r14,[rdx+4]
lea r15,[rdx+4]
```

and then piecemeal

```asm
mov r12,r8
mov r12,rax
mov r12,r9
mov r12,r10
mov r12,r11
mov r12,rdi
mov r12,rbx
mov r12,rbp
mov r12,r14
mov r12,r15
```

before it starts resorting to `lea r12,[rdx+4]` instead of just doing that from the start. I mean, from my point of view it has been determined that all of these compute the same address.
The jit is throughput constrained, and the jit inliner is by nature fairly conservative. That means the jit will miss out on automatically inlining methods like these. We could in principle work up a projective analysis to try to predict that the impact of an inlinee is much smaller than it might appear from the IL size, but it's tricky to engineer such things without adversely impacting jit throughput. Most large methods won't be good inlines, and the larger a method is, the more costly it is to analyze. The status quo for now is to rely on `MethodImplOptions.AggressiveInlining`.
Perhaps... codegen for the two is somewhat different for me... does it look similar in your case?
Yes, let's look into that bit. In … Note 849 and 1426 are redundant computations, but they value number differently because of the "field seq" argument to the value num func:
(I think this happens because we just have raw offsets and so run through the "not a field seq" clause in ….) As a result we create a number of identical expressions in the loop preheader:

```asm
G_M40746_IG02:
mov rdx, gword ptr [rcx+8]
cmp dword ptr [rdx+8], 0
jbe G_M40746_IG10
add rdx, 16
xor ecx, ecx
lea r8, bword ptr [rdx+4]
lea rax, bword ptr [rdx+4]
lea r9, bword ptr [rdx+4]
lea r10, bword ptr [rdx+4]
lea r11, bword ptr [rdx+4]
lea rdi, bword ptr [rdx+4]
lea rbx, bword ptr [rdx+4]
lea rbp, bword ptr [rdx+4]
lea r14, bword ptr [rdx+4]
lea r15, bword ptr [rdx+4]
lea r12, bword ptr [rdx+4]
```

None of these registers is further modified, so they should all collapse into just one CSE. Likely any fix here is going to be post-5.0, so I will mark this as future. cc @briansull @dotnet/jit-contrib
I tested it on …:

```asm
; Benchmarks.Benchmarks.FromByteRef_FullInlining()
lea r8,[rsp+40]
mov esi,[rdx]
mov [r8],esi
mov r8,rax
lea rsi,[rsp+44]
movzx r13d,word ptr [r8]
mov [rsi],r13w
add r8,2
lea rsi,[rsp+46]
movzx r8d,byte ptr [r8]
mov [rsi],r8b
lea r8,[rsp+48]
; Benchmarks.Benchmarks.FromByteRef_ManualFieldForField()
mov r12d,[rdx]
mov [rsp+30],r12d
mov r12,rax
movzx r13d,word ptr [r12]
mov [rsp+34],r13w
add r12,2
movzx r12d,byte ptr [r12]
mov [rsp+36],r12b
mov r12d,[rdx]
; Benchmarks.Benchmarks.FromByteRef_ManualFieldForField_BigData()
mov r9d,[rax]
mov [rsp+30],r9d
add rax,4
movzx r9d,word ptr [rax]
mov [rsp+34],r9w
add rax,2
movzx r9d,byte ptr [rax]
mov [rsp+36],r9b
inc rax
mov r9d,[rax]
```

The last one is new and it pretty much puts … Let me push that with the previous results in …
I'm fine with it not being inlined automatically. I am thinking maybe ReadyToRun might be able to produce inlining for these situations, such that the JIT can completely skip the analysis even?
In light of this, what is the reason lots of code in the runtime goes through the trouble of getting raw pointers? And to put it bluntly, it is a bit of a surprise, and shocking, just how bad the resulting code can be. Hence I loathe a bit having lots of API that resorts to pointers.

Maybe all this is because of the kind of code that I write for some of my projects, and I just have a slightly different perspective than the average C# user, but how great would it be if it became increasingly unnecessary to write C++ for an increasing number of use cases.

It is a bit hard to connect all this back to my projects. I guess my benchmarks help you very concretely analyse a scenario, but I'm left with the idea that it is hard and a lot of work to be sure that my low-level code is performing properly. These benchmarks certainly show a few code changes might produce code that is a factor of 4-10 slower.
I'm not sure. Could be that pointers were the only viable option in the past? Maybe @GrabYourPitchforks can weigh in here.
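To make the trade-off concrete, a hedged sketch (invented names) contrasting the two styles under discussion: the pointer version must pin the array with `fixed` and needs an unsafe context, while the ref version lets the GC track the reference.

```csharp
using System.Runtime.CompilerServices;

public static class TwoStyles
{
    // Pointer style: requires pinning the array for the duration of the read.
    // Assumes buffer.Length >= 4.
    public static unsafe uint ReadWithPointer(byte[] buffer)
    {
        fixed (byte* p = buffer)
        {
            return *(uint*)p;
        }
    }

    // Ref style: no pinning; the GC updates the managed ref if the array moves.
    public static uint ReadWithRef(byte[] buffer)
        => Unsafe.ReadUnaligned<uint>(ref buffer[0]);
}
```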
I agree that there is too much "performance fragility" in our system. Relatively minor changes to source can lead to large performance differences. It is something we need to improve on, and …
I don't think we can skip the analysis; it would mean inlining more aggressively in the hope that good things then follow. And while some good things would follow, there would be a dramatic increase in code size and most of these extra inlines would either not improve perf or would degrade perf. We might be able to afford a more in-depth analysis when prejitting. It is one of the areas we are exploring.
Even if the jit did look at the IL, you might think it would be obvious to the jit that all those calls to Unsafe will produce little or no code, and that some of them will evaluate to constants, and so only a fraction of the IL in this method will end up mattering. But it takes a fair amount of work at jit time to determine the targets of the calls and the impact of the code they execute. Just resolving the tokens in the IL to runtime data structures describing the callees is costly.
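For instance (a hedged example, not code from the thread): this method's IL contains three `call` instructions, yet for a concrete value type `T` each one folds away -- but the jit can only discover that after resolving every token, which is the cost being described.

```csharp
using System.Runtime.CompilerServices;

public static class Folding
{
    // IL: calls to Unsafe.SizeOf, Unsafe.Add and Unsafe.ReadUnaligned.
    // Machine code for T = int: roughly a single mov from [cursor+4], because
    // SizeOf<T>() is a jit-time constant and Add/ReadUnaligned are intrinsics
    // that emit no actual call.
    public static T ReadSecond<T>(ref byte cursor) where T : unmanaged
    {
        ref byte second = ref Unsafe.Add(ref cursor, Unsafe.SizeOf<T>());
        return Unsafe.ReadUnaligned<T>(ref second);
    }
}
```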
Maybe this can be fixed at the compiler level? Especially for the `Unsafe` methods.
Can this be configured? Even perhaps through some …?
My question isn't about misuse. I am wondering why it is preferred, or at least used, over using `ref`s.
You can influence this by setting …
Does this mean I have to build the runtime myself, or how do I get that? I am not aware of whether anything from dot.net is a checked build. I am also slightly confused about the actual status of … This is the other reason why I have considered hosting …
I have a generic method which obtains a value, and I would like it to be the public API - or the entry point for obtaining those values in generic enumerations and the like - for a quite sizeable number of types whose values in fact aren't obtained in a generic way: there are different methods, or let it be overloads, that obtain the actual value.
My question is whether there is a really fast way to divert from a generic entry point to a non-generic implementation.
So for example let's say I have:
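A minimal sketch of such a pair (`Known` and `Reader` are invented names; the `out` parameter matches the discussion below):

```csharp
using System;

public struct Known { public int Value; }

public static class Reader<T>
{
    // The generic entry point that should be the public API.
    public static bool TryRead<U>(out U value)
        => throw new NotImplementedException(); // should forward to the overload below

    // The non-generic implementation that actually obtains the value.
    public static bool TryRead(out Known value)
    {
        value = new Known { Value = 42 };
        return true;
    }
}
```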
Where the first method is supposed to call the second method if `typeof(T) == typeof(U)`. It doesn't matter if the `<T>` appears on the type or the method, hence I put it in both places. The binder will always pick a generic method, even if there is an overload with the actual type argument, so there is no automatic way of doing this.

It was a slight surprise to me that the runtime just doesn't understand the pattern of switching on the `typeof(T)`. For one, doing this with `switch` is impossible since `typeof(...)` is not constant, not even for primitive types. Were `T t` not `out`, pattern matching could be done with `switch (value) { U u => TryRead(ref u), ... }`, yet that produces a giant `if (isinst t, U)` series which is about 4 times slower than a straightforward cast, which can't be done because the type is not known. Chaining a series of `if (typeof(T) == typeof(U))` checks, etcetera (where only `T` is a generic argument), produces a very bulky and slow series of comparisons, even though the intuition is that these might be optimized away at runtime when the type of `T` is known.

One thing I know for sure is that if the types on which to switch are all simple `unmanaged` types, it is beautifully inlineable to chain a series of `if (Unsafe.SizeOf<T>() == 1/2/4/8)` checks, as sketched below. (Hence, bye bye bulky `System.Buffers.Binary.BinaryPrimitives`, except the `ReverseEndianness` method.)

I'm sure it's quite okay performance-wise to just create a generic interface or abstract class and implement it by deriving from that, such that virtual dispatch does the work, but in this case it produces a very long list of classes which are tedious to write, and I consider this approach a maintenance headache.
Please comment.
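As referenced above, a hedged sketch of the `Unsafe.SizeOf<T>()` chain (the big-endian read and all names are assumptions, and a little-endian host is assumed): for a concrete unmanaged `T`, `Unsafe.SizeOf<T>()` folds to a jit-time constant, so all but one branch disappear and the method inlines to a read plus at most a byte swap.

```csharp
using System;
using System.Buffers.Binary;
using System.Runtime.CompilerServices;

public static class SizeDispatch
{
    // Reads a big-endian value of any 1/2/4/8-byte unmanaged type.
    // For a concrete T only one branch survives jitting.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static T ReadBigEndian<T>(ref byte source) where T : unmanaged
    {
        if (Unsafe.SizeOf<T>() == 1)
        {
            byte v = source;
            return Unsafe.As<byte, T>(ref v);
        }
        if (Unsafe.SizeOf<T>() == 2)
        {
            ushort v = BinaryPrimitives.ReverseEndianness(Unsafe.ReadUnaligned<ushort>(ref source));
            return Unsafe.As<ushort, T>(ref v);
        }
        if (Unsafe.SizeOf<T>() == 4)
        {
            uint v = BinaryPrimitives.ReverseEndianness(Unsafe.ReadUnaligned<uint>(ref source));
            return Unsafe.As<uint, T>(ref v);
        }
        if (Unsafe.SizeOf<T>() == 8)
        {
            ulong v = BinaryPrimitives.ReverseEndianness(Unsafe.ReadUnaligned<ulong>(ref source));
            return Unsafe.As<ulong, T>(ref v);
        }
        throw new NotSupportedException();
    }
}
```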
category:cq
theme:basic-cq
skill-level:intermediate
cost:medium