Question: Fastest way to call non-generic method #13742
Comments
If you're talking about the C# compiler, then I don't think that's true. Code like …
It does, but only for value types: in that case, the JIT can figure out which branches don't apply and eliminate them. But because of the code sharing the runtime uses to implement generics for reference types, the same optimization doesn't work there. (Unless the method is inlined.)
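As a hedged illustration of that point (the names here are invented for the example): for a value-type `T` the JIT compiles a separate specialization in which each `typeof(T)` comparison folds to a constant, while all reference types share one compiled body in which the checks remain real runtime tests.

```csharp
using System;

public static class Dispatcher
{
    // For T = int or T = long (value types), the JIT emits a specialized body
    // and every typeof(T) == typeof(...) test folds to a constant, so the dead
    // branches disappear entirely. For reference types (e.g. string) one shared
    // body is compiled and the comparisons remain actual runtime checks.
    public static int SizeOfKind<T>()
    {
        if (typeof(T) == typeof(int)) return 4;
        if (typeof(T) == typeof(long)) return 8;
        if (typeof(T) == typeof(string)) return -1; // shared generic code: real check
        throw new NotSupportedException();
    }
}
```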
Yes, of course, but not automatically from …
When exactly? I wrote this benchmark and the disassembly is littered with checks:

```csharp
public struct A { public uint a; }
public struct B { public ushort b; }

private static class NonGeneric {
    public static bool TryRead(ref A a, uint from) {
        a = new A { a = from };
        return true;
    }
    public static bool TryRead(ref B b, uint from) {
        b = new B { b = unchecked((ushort)from) };
        return true;
    }
}

public bool TryRead<T>(ref T t, uint from) {
    if (typeof(T) == typeof(A)) return NonGeneric.TryRead(ref Unsafe.As<T, A>(ref t), from);
    else if (typeof(T) == typeof(B)) return NonGeneric.TryRead(ref Unsafe.As<T, B>(ref t), from);
    else throw new InvalidCastException();
}

[Benchmark(OperationsPerInvoke = OperationsPerInvoke)]
public uint Read() {
    var result = 0u;
    A a = default;
    B b = default;
    for (var u = 0u; u < OperationsPerInvoke; u++) {
        if (0 == (u & 1)) {
            TryRead(ref a, u);
            result += a.a;
        } else {
            TryRead(ref b, u);
            result += b.b;
        }
    }
    return result;
}
```
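For reference, a minimal BenchmarkDotNet harness such a snippet would plug into (a sketch; the class name `ReadBenchmarks` and the constant's value are invented):

```csharp
using BenchmarkDotNet.Running;

public partial class ReadBenchmarks
{
    // Hypothetical constant referenced by [Benchmark(OperationsPerInvoke = ...)].
    private const int OperationsPerInvoke = 64 * 1024;

    // The structs, the NonGeneric class, TryRead<T> and Read() from above
    // would live in this class.
    public static void Main() => BenchmarkRunner.Run<ReadBenchmarks>();
}
```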
This isn't a question for Roslyn; it is more a question for the runtime. Perhaps you're asking about the coreclr runtime? Assuming so, transferring to that repo.
Yes, posting here would've been the better choice. I practically have to put `MethodImplOptions.AggressiveInlining` on about everything; however, it pretty much breaks the performance of everything below it (everything it calls). Instead of inlining to a few instructions, the reads for individual fields will have very long lists of moves from `[rsp+...]` slots.

Why does this happen and what can I do to prevent it? E.g. it is surprising it is only a factor of 8, given how much assembly this is to read one field -- and this is just half the code for one field, since it covers only one of the branches:

```asm
M00_L06:
lea r10,[rsp+160h]
mov qword ptr [r10],r9
movzx r9d,byte ptr [rax+10h]
test r9b,2
jne M00_L07
mov r9,qword ptr [rsp+160h]
mov qword ptr [rsp+158h],r9
xor r9d,r9d
mov dword ptr [rsp+14Ch],r9d
mov r9,qword ptr [rsp+158h]
mov qword ptr [rsp+130h],r9
mov r9,qword ptr [rsp+130h]
mov qword ptr [rsp+128h],r9
mov r9,qword ptr [rsp+128h]
mov qword ptr [rsp+138h],r9
mov r9,qword ptr [rsp+138h]
mov qword ptr [rsp+150h],r9
lea r9,[rsp+14Ch]
mov r10,qword ptr [rsp+150h]
movzx r10d,word ptr [r10]
mov word ptr [r9],r10w
mov r9,qword ptr [rsp+150h]
mov qword ptr [rsp+108h],r9
mov r9,qword ptr [rsp+108h]
mov qword ptr [rsp+100h],r9
mov r9,qword ptr [rsp+100h]
mov qword ptr [rsp+120h],r9
mov r9,qword ptr [rsp+120h]
mov qword ptr [rsp+0F8h],r9
mov r9,qword ptr [rsp+0F8h]
add r9,2
mov qword ptr [rsp+0F0h],r9
mov r9,qword ptr [rsp+0F0h]
mov qword ptr [rsp+118h],r9
mov r9,qword ptr [rsp+118h]
mov qword ptr [rsp+0E8h],r9
mov r9,qword ptr [rsp+0E8h]
mov qword ptr [rsp+0D8h],r9
mov r9,qword ptr [rsp+0D8h]
mov qword ptr [rsp+0D0h],r9
mov r9,qword ptr [rsp+0D0h]
mov qword ptr [rsp+0E0h],r9
mov r9,qword ptr [rsp+0E0h]
mov qword ptr [rsp+140h],r9
mov r9,qword ptr [rsp+140h]
mov qword ptr [rsp+0C8h],r9
mov r9,qword ptr [rsp+0C8h]
mov qword ptr [rsp+0C0h],r9
mov r9,qword ptr [rsp+0C0h]
mov qword ptr [rsp+158h],r9
mov r9d,dword ptr [rsp+14Ch]
movzx r9d,r9w
mov dword ptr [rsp+20Eh],r9d
mov r9,qword ptr [rsp+158h]
mov qword ptr [rsp+160h],r9
jmp M00_L08
M00_L07:
```

In comparison, the code that reads a two-byte field, when benchmarked alone, basically just produces:

```asm
mov ecx, word ptr [rax]
```

I am guessing one reason is some automatically introduced locals due to casts and/or dereferences, but how can I make them go away, short of writing quite verbose, repetitive code?
@ericwj sorry for not seeing this earlier. I'm trying to understand this a bit better -- is the assembly listing just above from the sources a bit further above? The …
No, it is from code using the same pattern, but with some 36 instead of 2 types to branch on at the generic entry point.
@AndyAyersMS the main difference from the benchmark is that the code size is not micro.
If you can share a code example where you're seeing this problem, I'd be happy to investigate further. I don't see these issues in the small repro above.
I happen to have rewritten everything @AndyAyersMS, so I don't have this anymore. I don't see the exact problem from above appearing anymore. Could of course have to do with running on 3.1, which I don't mind. I still do have the issue that I need to slap `MethodImplOptions.AggressiveInlining` on about everything.

I kind of reproduced the general idea here. This will need Windows. The deserialization is still comparatively simple, because all fields are exactly primitive types. In my project there often are fields with variable sizes or of some string kind and such that need more logic. Writing that sometimes produces complaints about scope-escaping `ref`s.

Related question: can you clarify for me -- writing code like this, do I need to pin anything?
No, you don't need to pin anything.
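For illustration, a minimal sketch (hypothetical names) of why no pinning is needed with the ref-based style: the GC tracks managed refs and updates them if the object moves, unlike raw pointers, which would require `fixed`.

```csharp
using System.Runtime.CompilerServices;

public static class RefReader
{
    // No pinning needed: if the GC relocates `buffer`, the managed ref into it
    // is updated automatically. A byte* would require fixed/GCHandle instead.
    // Caller must ensure offset + 1 < buffer.Length.
    public static ushort ReadUInt16(byte[] buffer, int offset)
        => Unsafe.ReadUnaligned<ushort>(ref buffer[offset]);
}
```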
I've cloned your repo and see similar data. Where should I start investigating?
Perhaps have a look at …. I would say for starters it shouldn't be necessary to always have to put `MethodImplOptions.AggressiveInlining` on everything.

Let me focus on one example. Without the attribute:

```asm
; Benchmarks.Benchmarks.FromByteRef_NoInlining()
(...preamble...)
M00_L00:
(...A...)
lea rdx,[rsp+2C]
mov rcx,rdi
call Benchmarks.NoInlining.TryReadGeneric[[Benchmarks.B, Benchmarks]](Byte ByRef, Benchmarks.B ByRef)
(...etc...)
; Benchmarks.NoInlining.TryReadGeneric[[Benchmarks.B, Benchmarks]](Byte ByRef, Benchmarks.B ByRef)
mov rax,offset Benchmarks.NoInlining.ReadField[[System.UInt16, System.Private.CoreLib]](Byte ByRef, UInt16 ByRef)
jmp rax
; Benchmarks.NoInlining.ReadField[[System.UInt16, System.Private.CoreLib]](Byte ByRef, UInt16 ByRef)
movzx eax,word ptr [rcx]
mov [rdx],ax
lea rax,[rcx+2]
ret
```

With the attribute that whole bunch reduces to:

```asm
; Benchmarks.Benchmarks.FromByteRef_FullInlining()
(...preamble...)
M00_L00:
(...A...)
lea r8,[rsp+24]
movzx eax,word ptr [rdx]
mov [r8],ax
(...etc...)
```

And the performance difference is 6.23x on your machine (175 ns versus 28 ns average) for a list of some 26 or so of these sequences, which consistently look like the first bunch without the attribute and like the second bunch with it. I have little idea how exactly inlining is decided, but seeing that the result is 3 machine code instructions, somehow the cost estimated before deciding to inline should come out as bearable. The follow-up would be about composing code like this in a larger program and having it be inlined consistently regardless of what method it is called from.
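A hedged C# reconstruction of what `TryReadGeneric`/`ReadField` presumably look like (only the names come from the listings above; the bodies and the `ref byte` cursor convention are assumptions):

```csharp
using System.Runtime.CompilerServices;

public static class NoInlining
{
    // Without any attribute this compiles to the "jmp ReadField" stub in the
    // first listing; marking both methods with
    // [MethodImpl(MethodImplOptions.AggressiveInlining)] collapses the whole
    // chain into the movzx/mov pair of the second listing.
    public static ref byte TryReadGeneric<T>(ref byte cursor, ref T value) where T : unmanaged
        => ref ReadField(ref cursor, ref value);

    // Reads one unmanaged field, stores it through the ref, and returns the
    // advanced cursor (the movzx / mov / lea triple above).
    public static ref byte ReadField<T>(ref byte cursor, ref T value) where T : unmanaged
    {
        value = Unsafe.ReadUnaligned<T>(ref cursor);
        return ref Unsafe.Add(ref cursor, Unsafe.SizeOf<T>());
    }
}
```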
Looks like a regression in your preview build or some weirdness on your machine that … That said, there is weirdness here too: in the preamble, for every `cursor = ref location;` line, it for example does

```asm
lea r8,[rdx+4]
lea rax,[rdx+4]
lea r9,[rdx+4]
lea r10,[rdx+4]
lea r11,[rdx+4]
lea rdi,[rdx+4]
lea rbx,[rdx+4]
lea rbp,[rdx+4]
lea r14,[rdx+4]
lea r15,[rdx+4]
```

and then piecemeal

```asm
mov r12,r8
mov r12,rax
mov r12,r9
mov r12,r10
mov r12,r11
mov r12,rdi
mov r12,rbx
mov r12,rbp
mov r12,r14
mov r12,r15
```

before it starts resorting to `lea r12,[rdx+4]` instead of just doing that from the start. I mean, from my point of view it has been determined that all of these compute the same address.
The jit is throughput constrained, and the jit inliner is by nature fairly conservative. That means the jit will miss out on automatically inlining methods like these. We could in principle work up a projective analysis to try to predict that the impact of an inlinee is much smaller than it might appear from the IL size, but it's tricky to engineer such things without adversely impacting jit throughput. Most large methods won't be good inlines, and the larger a method is, the more costly it is to analyze. The status quo for now is to rely on `MethodImplOptions.AggressiveInlining`.
Perhaps... codegen for the two is somewhat different for me... does it look similar in your case?
Yes, let's look into that bit. In … Note 849 and 1426 are redundant computations, but they value number differently because of the "field seq" argument to the value num func:
(I think this happens because we just have raw offsets and so run through the "not a field seq" clause in ….) As a result we create a number of identical expressions in the loop preheader:

```asm
G_M40746_IG02:
mov rdx, gword ptr [rcx+8]
cmp dword ptr [rdx+8], 0
jbe G_M40746_IG10
add rdx, 16
xor ecx, ecx
lea r8, bword ptr [rdx+4]
lea rax, bword ptr [rdx+4]
lea r9, bword ptr [rdx+4]
lea r10, bword ptr [rdx+4]
lea r11, bword ptr [rdx+4]
lea rdi, bword ptr [rdx+4]
lea rbx, bword ptr [rdx+4]
lea rbp, bword ptr [rdx+4]
lea r14, bword ptr [rdx+4]
lea r15, bword ptr [rdx+4]
lea r12, bword ptr [rdx+4]
```

None of these registers is further modified, so they should all collapse into just one CSE. Likely any fix here is going to be post-5.0, so I will mark this as future. cc @briansull @dotnet/jit-contrib
I tested it on …:

```asm
; Benchmarks.Benchmarks.FromByteRef_FullInlining()
lea r8,[rsp+40]
mov esi,[rdx]
mov [r8],esi
mov r8,rax
lea rsi,[rsp+44]
movzx r13d,word ptr [r8]
mov [rsi],r13w
add r8,2
lea rsi,[rsp+46]
movzx r8d,byte ptr [r8]
mov [rsi],r8b
lea r8,[rsp+48]
; Benchmarks.Benchmarks.FromByteRef_ManualFieldForField()
mov r12d,[rdx]
mov [rsp+30],r12d
mov r12,rax
movzx r13d,word ptr [r12]
mov [rsp+34],r13w
add r12,2
movzx r12d,byte ptr [r12]
mov [rsp+36],r12b
mov r12d,[rdx]
; Benchmarks.Benchmarks.FromByteRef_ManualFieldForField_BigData()
mov r9d,[rax]
mov [rsp+30],r9d
add rax,4
movzx r9d,word ptr [rax]
mov [rsp+34],r9w
add rax,2
movzx r9d,byte ptr [rax]
mov [rsp+36],r9b
inc rax
mov r9d,[rax]
```

The last one is new and it pretty much puts … Let me push that with the previous results in …
I'm fine with it not being inlined automatically. I am thinking maybe ReadyToRun might be able to produce inlining for these situations, such that the JIT can completely skip the analysis even?
In light of this, what is the reason lots of code in the runtime goes through the trouble of getting raw pointers? And to put it bluntly, it is a bit of a surprise, and shocking, just how bad the resulting code can be. Hence I loathe a bit having lots of API that resorts to pointers.

Maybe all this is because of the kind of code that I write for some of my projects, and I just have a slightly different perspective than the average C# user, but how great would it be if it became increasingly unnecessary to write C++ for an increasing number of use cases.

It is a bit hard to connect all this back to my projects. I guess my benchmarks help you very concretely analyse a scenario, but I'm left with the idea that it is hard and a lot of work to be sure that my low-level code is performing properly. These benchmarks certainly show a few code changes might produce code that is a factor of 4-10 slower.
I'm not sure. Could be that pointers were the only viable option in the past? Maybe @GrabYourPitchforks can weigh in here.
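To make the trade-off concrete, a hedged sketch (invented names) contrasting the two styles under discussion: the pointer version must pin the array with `fixed` and needs an unsafe context, while the ref version lets the GC track the reference.

```csharp
using System.Runtime.CompilerServices;

public static class TwoStyles
{
    // Pointer style: requires pinning the array for the duration of the read.
    // Assumes buffer.Length >= 4.
    public static unsafe uint ReadWithPointer(byte[] buffer)
    {
        fixed (byte* p = buffer)
        {
            return *(uint*)p;
        }
    }

    // Ref style: no pinning; the GC updates the managed ref if the array moves.
    public static uint ReadWithRef(byte[] buffer)
        => Unsafe.ReadUnaligned<uint>(ref buffer[0]);
}
```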
I agree that there is too much "performance fragility" in our system. Relatively minor changes to source can lead to large performance differences. It is something we need to improve on, and …
I don't think we can skip the analysis; it would mean inlining more aggressively in the hope that good things then follow. And while some good things would follow, there would be a dramatic increase in code size and most of these extra inlines would either not improve perf or would degrade perf. We might be able to afford a more in-depth analysis when prejitting. It is one of the areas we are exploring.
Even if the jit did look at the IL, you might think it would be obvious to the jit that all those calls to Unsafe will produce little or no code, and that some of them will evaluate to constants, and so only a fraction of the IL in this method will end up mattering. But it takes a fair amount of work at jit time to determine the targets of the calls and the impact of the code they execute. Just resolving the tokens in the IL to runtime data structures describing the callees is costly.
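For instance (a hedged example, not code from the thread): this method's IL contains three `call` instructions, yet for a concrete value type `T` each one folds away -- but the jit can only discover that after resolving every token, which is the cost being described.

```csharp
using System.Runtime.CompilerServices;

public static class Folding
{
    // IL: calls to Unsafe.SizeOf, Unsafe.Add and Unsafe.ReadUnaligned.
    // Machine code for T = int: roughly a single mov from [cursor+4], because
    // SizeOf<T>() is a jit-time constant and Add/ReadUnaligned are intrinsics
    // that emit no actual call.
    public static T ReadSecond<T>(ref byte cursor) where T : unmanaged
    {
        ref byte second = ref Unsafe.Add(ref cursor, Unsafe.SizeOf<T>());
        return Unsafe.ReadUnaligned<T>(ref second);
    }
}
```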
Maybe this can be fixed at the compiler level? Especially for the `Unsafe` methods.
Can this be configured? Even perhaps through some …?
My question isn't about misuse. I am wondering why it is preferred, or at least used, over using `ref`s.
You can influence this by setting …
Does this mean I have to build the runtime myself, or how do I get that? I am not aware of whether anything from dot.net is a checked build. I am also slightly confused about the actual status of … This is the other reason why I have considered hosting …
I have a generic method which obtains a value, and I would like it to be the public API - or the entry point for obtaining those values in generic enumerations and the like - for a quite sizeable number of types whose values in fact aren't obtained in a generic way: there are different methods, or let it be overloads, that obtain the actual value.
My question is whether there is a really fast way to divert from a generic entry point to a non-generic implementation.
So for example let's say I have:
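A minimal sketch of such a pair (`Known` and `Reader` are invented names; the `out` parameter matches the discussion below):

```csharp
using System;

public struct Known { public int Value; }

public static class Reader<T>
{
    // The generic entry point that should be the public API.
    public static bool TryRead<U>(out U value)
        => throw new NotImplementedException(); // should forward to the overload below

    // The non-generic implementation that actually obtains the value.
    public static bool TryRead(out Known value)
    {
        value = new Known { Value = 42 };
        return true;
    }
}
```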
Where the first method is supposed to call the second method if `typeof(T) == typeof(U)`. It doesn't matter if the `<T>` appears on the type or the method, hence I put it in both places. The binder will always pick a generic method, even if there is an overload with the actual type argument, so there is no automatic way of doing this.

It was a slight surprise to me that the runtime just doesn't understand the pattern of switching on the `typeof(T)`. For one, doing this with `switch` is impossible since `typeof(...)` is not constant, not even for primitive types. Were `T t` not `out`, pattern matching could be done with `switch (value) { U u => TryRead(ref u), ... }`, yet that produces a giant `if (isinst t, U)` series which is about 4 times slower than a straightforward cast, which can't be done because the type is not known. Chaining a series of `if (typeof(T) == typeof(U))` checks, etcetera (where only `T` is a generic argument), produces a very bulky and slow series of comparisons, even though the intuition is that these might be optimized away at runtime when the type of `T` is known.

One thing I know for sure is that if the types on which to switch are all simple `unmanaged` types, it is beautifully inlineable to chain a series of `if (Unsafe.SizeOf<T>() == 1/2/4/8)` checks, as sketched below. (Hence, bye bye bulky `System.Buffers.Binary.BinaryPrimitives`, except the `ReverseEndianness` method.)

I'm sure it's quite okay performance-wise to just create a generic interface or abstract class and implement it by deriving from that, such that virtual dispatch does the work, but in this case it produces a very long list of classes which are tedious to write, and I consider this approach a maintenance headache.
Please comment.
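As referenced above, a hedged sketch of the `Unsafe.SizeOf<T>()` chain (the big-endian read and all names are assumptions, and a little-endian host is assumed): for a concrete unmanaged `T`, `Unsafe.SizeOf<T>()` folds to a jit-time constant, so all but one branch disappear and the method inlines to a read plus at most a byte swap.

```csharp
using System;
using System.Buffers.Binary;
using System.Runtime.CompilerServices;

public static class SizeDispatch
{
    // Reads a big-endian value of any 1/2/4/8-byte unmanaged type.
    // For a concrete T only one branch survives jitting.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static T ReadBigEndian<T>(ref byte source) where T : unmanaged
    {
        if (Unsafe.SizeOf<T>() == 1)
        {
            byte v = source;
            return Unsafe.As<byte, T>(ref v);
        }
        if (Unsafe.SizeOf<T>() == 2)
        {
            ushort v = BinaryPrimitives.ReverseEndianness(Unsafe.ReadUnaligned<ushort>(ref source));
            return Unsafe.As<ushort, T>(ref v);
        }
        if (Unsafe.SizeOf<T>() == 4)
        {
            uint v = BinaryPrimitives.ReverseEndianness(Unsafe.ReadUnaligned<uint>(ref source));
            return Unsafe.As<uint, T>(ref v);
        }
        if (Unsafe.SizeOf<T>() == 8)
        {
            ulong v = BinaryPrimitives.ReverseEndianness(Unsafe.ReadUnaligned<ulong>(ref source));
            return Unsafe.As<ulong, T>(ref v);
        }
        throw new NotSupportedException();
    }
}
```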
category:cq
theme:basic-cq
skill-level:intermediate
cost:medium