JIT: Dead code at runtime not completely removed #13824

Closed
Thealexbarney opened this issue Nov 25, 2019 · 6 comments
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), optimization

@Thealexbarney

The JIT doesn't properly remove dead code in certain instances. Here's an example that does some method dispatching based on CPU features:

public class C
{
    public static bool IsAvxSupported()
    {
        return Avx.IsSupported;
    }

    public static void A(int a, Span<byte> b)
    {
        if (IsAvxSupported())
            FuncAvx(a, b);
        else
            FuncNoAvx(a, b);
    }

    public static void A2(int a, Span<byte> b)
    {
        if (Avx.IsSupported)
            FuncAvx(a, b);
        else
            FuncNoAvx(a, b);
    }

    public static void A3(int a, Span<byte> b)
    {
        if (IsAvxSupported())
            FuncAvx(a, b);
        else
        {
            // Removed 
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void FuncAvx(int a, Span<byte> b) { }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void FuncNoAvx(int a, Span<byte> b) { }
}

A moves some arguments around on the stack unnecessarily:

G_M13277_IG01:
       sub      rsp, 56
       vzeroupper 
       xor      rax, rax
       mov      qword ptr [rsp+28H], rax

G_M13277_IG02:
       vmovdqu  xmm0, qword ptr [rdx]
       vmovdqu  qword ptr [rsp+28H], xmm0

G_M13277_IG03:
       lea      rdx, bword ptr [rsp+28H]
       call     C:FuncAvx(int,struct)
       nop      

G_M13277_IG04:
       add      rsp, 56
       ret

Manually inlining IsAvxSupported (as in A2) or removing the code from the false branch (as in A3) makes the JIT remove all those instructions, leaving just the call:

G_M62574_IG01:
       sub      rsp, 40
       nop      

G_M62574_IG02:
       call     C:FuncAvx(int,struct)
       nop      

G_M62574_IG03:
       add      rsp, 40
       ret   

category:cq
theme:optimization
skill-level:expert
cost:large

@AndyAyersMS
Member

This particular issue is both specific to this coding pattern and somewhat general.

The general issue is that the jit is fairly slow at propagating information from callees to callers, so wrapping things like Avx.IsSupported in methods often leads to missed optimization opportunities like you see here. This slowness is fairly fundamental to the structure of the early part of the jit. Because it imports the entire caller before inlining the callee, there is no way callee information can influence caller importation.

The specific issue is that, because of this slowness, a copy optimization in morph does not fire when jitting A. The optimization depends on early reference counts to determine whether a struct reference is the "last use" of the struct, and is easily confused by "dead" references that will ultimately be optimized away. So it is a particularly vulnerable optimization, as it tries to do something flow-sensitive before the jit is fully able to understand flow.

Morph actually folds the branch before it processes either call, but it can't determine that, because of this fold, one of the calls is now unreachable; that the early reference counts should therefore be adjusted; and that the copy optimization can then safely fire on the surviving call. Instead it thinks both calls are live and processes them both, introducing copies of the local b, as is generally required when passing a structure to a callee. Only later does the jit eliminate the dead code, and by that point there is nothing left that can eliminate the now-unnecessary copy.
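
As a rough, hypothetical C# rendering (the explicit temps and the folded condition are illustrative only; the real transformation happens on the jit's IR), this is approximately the shape morph is working with for A:

public static void A(int a, Span<byte> b)
{
    if (true)                     // IsAvxSupported() already folded by morph
    {
        Span<byte> tmp1 = b;      // defensive copy: 'b' does not look like a last use
        FuncAvx(a, tmp1);
    }
    else
    {
        Span<byte> tmp2 = b;      // unreachable, but still counted as a use of 'b'
        FuncNoAvx(a, tmp2);       // removed only later, after the copies are already in place
    }
}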

We might be able to rely on block ref counts to eliminate unreachable blocks after inlining, and add logic to the postorder post-inline callback (currently fgLateDevirtualization) to try to optimize branches if inline substitution happened in the tree. This would let the jit eliminate the dead code earlier and unblock the morph optimization in this case. Doing this might save a bit of throughput too; there's no point in morphing code that we are just going to delete once we understand which blocks are truly reachable.

Probably worth prototyping, though early block ref counts might not be accurate enough to do this safely.

A related issue is that we end up being handicapped early in the jit by the lack of full flow graph connectivity; I've been burned by this in other places too (e.g. dotnet/coreclr#27113).

cc @dotnet/jit-contrib

@AndyAyersMS
Member

Results from a simple, crude prototype which gets the same codegen for A and A2:

Total bytes of diff: -1046 (-0.00% of base)
    diff is an improvement.

Top file regressions (bytes):
          22 : System.Transactions.Local.dasm (0.02% of base)
          10 : Microsoft.CodeAnalysis.CSharp.dasm (0.00% of base)
           6 : System.Diagnostics.Process.dasm (0.01% of base)
           2 : System.Net.Mail.dasm (0.00% of base)
           2 : System.Net.Security.dasm (0.00% of base)

Top file improvements (bytes):
        -764 : System.Private.CoreLib.dasm (-0.02% of base)
        -224 : System.Memory.dasm (-0.09% of base)
         -78 : Microsoft.Diagnostics.Tracing.TraceEvent.dasm (-0.00% of base)
         -17 : System.Threading.Tasks.Parallel.dasm (-0.01% of base)
          -5 : System.Private.Xml.Linq.dasm (-0.00% of base)

12 total files with Code Size differences (6 improved, 6 regressed), 117 unchanged.

Top method regressions (bytes):
          22 ( 8.24% of base) : System.Transactions.Local.dasm - TransactionScope:SetCurrent(Transaction):this
          13 ( 1.69% of base) : System.Private.CoreLib.dasm - Buffer:Memmove(byref,byref,long) (9 methods)
          10 ( 2.19% of base) : Microsoft.CodeAnalysis.CSharp.dasm - MatchSymbols:AreTypesEqual(TypeSymbol,TypeSymbol):bool:this
           9 ( 4.55% of base) : System.Private.CoreLib.dasm - MemoryExtensions:IndexOf(Span`1,ReadOnlySpan`1):int (6 methods)
           9 ( 4.55% of base) : System.Private.CoreLib.dasm - MemoryExtensions:IndexOf(ReadOnlySpan`1,ReadOnlySpan`1):int (6 methods)

Top method improvements (bytes):
         -81 (-93.10% of base) : System.Private.CoreLib.dasm - Environment:get_Is64BitOperatingSystem():bool
         -78 (-92.86% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - ETWTraceEventSource:GetOSPointerSize():int
         -70 (-1.49% of base) : System.Memory.dasm - SequenceReader`1:TryReadToAnyInternal(byref,ReadOnlySpan`1,bool,int):bool:this (6 methods)
         -53 (-2.38% of base) : System.Memory.dasm - SequenceReader`1:TryReadToAny(byref,ReadOnlySpan`1,bool):bool:this (12 methods)
         -40 (-1.00% of base) : System.Private.CoreLib.dasm - Utf8Parser:TryParse(ReadOnlySpan`1,byref,byref,ushort):bool (16 methods)

Top method regressions (percentages):
          22 ( 8.24% of base) : System.Transactions.Local.dasm - TransactionScope:SetCurrent(Transaction):this
           9 ( 6.25% of base) : System.Private.CoreLib.dasm - Span`1:Clear():this (7 methods)
           9 ( 4.55% of base) : System.Private.CoreLib.dasm - MemoryExtensions:IndexOf(Span`1,ReadOnlySpan`1):int (6 methods)
           9 ( 4.55% of base) : System.Private.CoreLib.dasm - MemoryExtensions:IndexOf(ReadOnlySpan`1,ReadOnlySpan`1):int (6 methods)
          10 ( 2.19% of base) : Microsoft.CodeAnalysis.CSharp.dasm - MatchSymbols:AreTypesEqual(TypeSymbol,TypeSymbol):bool:this

Top method improvements (percentages):
         -81 (-93.10% of base) : System.Private.CoreLib.dasm - Environment:get_Is64BitOperatingSystem():bool
         -78 (-92.86% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - ETWTraceEventSource:GetOSPointerSize():int
         -39 (-58.21% of base) : System.Private.CoreLib.dasm - MemoryExtensions:IndexOfAny(Span`1,long,long,long):int
         -39 (-58.21% of base) : System.Private.CoreLib.dasm - MemoryExtensions:IndexOfAny(ReadOnlySpan`1,long,long,long):int
         -37 (-56.92% of base) : System.Private.CoreLib.dasm - MemoryExtensions:IndexOfAny(Span`1,int,int,int):int

79 total methods with Code Size differences (57 improved, 22 regressed), 208854 unchanged.

Interestingly, the biggest percentage wins come from removing useless pinvoke frame setup. See, for example, Environment:get_Is64BitOperatingSystem():bool:

;; current jit
G_M40889_IG01:
       push     rbp
       push     r15
       push     r14
       push     r13
       push     r12
       push     rdi
       push     rsi
       push     rbx
       sub      rsp, 88
       lea      rbp, [rsp+90H]
						;; bbWeight=1    PerfScore 8.75
G_M40889_IG02:
       lea      rcx, [rbp-88H]
       mov      rdx, r10
       call     CORINFO_HELP_INIT_PINVOKE_FRAME
       mov      qword ptr [rbp-48H], rax
       mov      rdx, rsp
       mov      qword ptr [rbp-68H], rdx
       mov      rdx, rbp
       mov      qword ptr [rbp-58H], rdx
						;; bbWeight=1    PerfScore 5.25
G_M40889_IG03:
       mov      eax, 1
       mov      rdx, qword ptr [rbp-48H]
       mov      byte  ptr [rdx+12], 1
						;; bbWeight=1    PerfScore 2.25
G_M40889_IG04:
       lea      rsp, [rbp-38H]
       pop      rbx
       pop      rsi
       pop      rdi
       pop      r12
       pop      r13
       pop      r14
       pop      r15
       pop      rbp
       ret      
						;; bbWeight=1    PerfScore 5.50
;; prototype
G_M40888_IG01:
						;; bbWeight=1    PerfScore 0.00
G_M40888_IG02:
       mov      eax, 1
						;; bbWeight=1    PerfScore 0.25
G_M40888_IG03:
       ret      
						;; bbWeight=1    PerfScore 1.00 

The prototype can only very crudely remove unreachable blocks, and it fails on one of the jit-diff assemblies, so the logic may still be amiss somewhere. A more functional version would need a more robust way to clean things up -- likely flagging methods where we suspect unreachable code during inlining, and then running a cleanup post-pass on those flagged methods.

@AndyAyersMS
Member

Also note that A2 should really be able to tail call FuncAvx. This doesn't happen right now because we don't do the copy elimination opt until after we've decided the call can't be a fast tail call (because it has a byref arg). But if that byref arg comes from a byref param of the method, it should be OK to tail call, provided we actually do the copy optimization.

A second simple, crude prototype produces this code for A2:

       E95B4AFEFF           jmp      C:FuncAvx(int,System.Span`1[Byte])

And once we support caller-byref params for fast tail calls, we can broaden the copy elimination opt further, since we know all tail-call uses must be last uses.

For future reference, some other things to play with in that copy opt clause:

  • we just need to know that the call site is not in a loop (double-check that TR loops are handled right), not that the method as a whole might have a loop
  • does it handle args that are byref struct fields of byref struct params? (see the sketch after this list)
  • should we try something similar when inlining at a last-use call (we won't have ref counts)?
  • is there some cheap and more robust way to detect last use?
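
For the struct-field bullet above, a hypothetical shape of the question (types and names invented for illustration): the argument is itself an implicit-by-ref struct that is a field of a struct param which also arrived by implicit byref.

struct Inner { public long A, B; }                    // 16 bytes: passed by implicit byref on win-x64
struct Outer { public Inner Field; public long C; }   // also passed by implicit byref

static class FieldOfParamExample
{
    static void Callee(Inner i) { }

    static void Caller(Outer o)       // 'o' arrives as a reference to the caller's copy
    {
        Callee(o.Field);              // does the copy opt treat this field access as a last use?
    }
}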

This prototype sees a fair number of hits:

Total bytes of diff: -9014 (-0.02% of base)
    diff is an improvement.

Top file regressions (bytes):
          14 : xunit.core.dasm (0.02% of base)
          14 : xunit.execution.dotnet.dasm (0.01% of base)
          10 : xunit.assert.dasm (0.01% of base)
           2 : Microsoft.DotNet.ProjectModel.dasm (0.00% of base)
           2 : System.IO.Compression.dasm (0.00% of base)

Top file improvements (bytes):
       -2024 : Microsoft.CodeAnalysis.VisualBasic.dasm (-0.04% of base)
       -1772 : System.Private.CoreLib.dasm (-0.04% of base)
       -1442 : Microsoft.CodeAnalysis.CSharp.dasm (-0.03% of base)
        -826 : System.Data.Common.dasm (-0.06% of base)
        -426 : System.Threading.Tasks.Dataflow.dasm (-0.05% of base)

57 total files with Code Size differences (52 improved, 5 regressed), 72 unchanged.

Top method regressions (bytes):
          14 ( 2.78% of base) : xunit.assert.dasm - TypeErasedEqualityComparer:EqualsGeneric(Vector`1,Vector`1):bool:this (7 methods)
          14 ( 2.78% of base) : xunit.core.dasm - TypeErasedEqualityComparer:EqualsGeneric(Vector`1,Vector`1):bool:this (7 methods)
          14 ( 2.56% of base) : xunit.execution.dotnet.dasm - TypeErasedEqualityComparer:EqualsGeneric(Vector`1,Vector`1):bool:this (7 methods)
           4 ( 1.51% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - Binder:DecodeIdentifierType(SyntaxToken,AsClauseSyntax,Func`1,byref,DiagnosticBag):TypeSymbol:this
           4 ( 3.57% of base) : System.Security.Cryptography.Algorithms.dasm - DSACng:ImportEncryptedPkcs8PrivateKey(ReadOnlySpan`1,ReadOnlySpan`1,byref):this (2 methods)

Top method improvements (bytes):
        -140 (-27.78% of base) : System.Threading.Tasks.Dataflow.dasm - <>c__DisplayClass9_0:<.ctor>b__3(KeyValuePair`2):this (14 methods)
        -140 (-27.78% of base) : System.Threading.Tasks.Dataflow.dasm - <>c__DisplayClass9_0:<.ctor>b__4(KeyValuePair`2):this (14 methods)
        -136 (-64.45% of base) : System.Memory.dasm - ReadOnlySequence`1:Slice(int,SequencePosition):ReadOnlySequence`1:this (7 methods)
        -136 (-64.76% of base) : System.Memory.dasm - ReadOnlySequence`1:Slice(SequencePosition,int):ReadOnlySequence`1:this (7 methods)
        -100 (-68.03% of base) : System.Private.CoreLib.dasm - Vector:SquareRoot(Vector`1):Vector`1 (6 methods)

Top method regressions (percentages):
           2 ( 4.65% of base) : System.Runtime.Serialization.Formatters.dasm - SerializationEvents:InvokeOnDelegate(Object,StreamingContext,List`1)
           2 ( 4.55% of base) : System.Private.Xml.dasm - XmlAsyncCheckWriter:WriteValue(DateTimeOffset):this
           2 ( 4.44% of base) : System.Private.DataContractSerialization.dasm - XmlJsonWriter:WriteValue(Decimal):this
           2 ( 4.44% of base) : System.Private.DataContractSerialization.dasm - XmlJsonWriter:WriteValue(Guid):this
           2 ( 4.44% of base) : System.Private.Xml.dasm - XmlAsyncCheckWriter:WriteValue(Decimal):this

Top method improvements (percentages):
         -20 (-80.00% of base) : System.Data.Common.dasm - SqlDecimal:Add(SqlDecimal,SqlDecimal):SqlDecimal
         -20 (-80.00% of base) : System.Data.Common.dasm - SqlDecimal:Multiply(SqlDecimal,SqlDecimal):SqlDecimal
         -20 (-80.00% of base) : System.Data.Common.dasm - SqlDecimal:Divide(SqlDecimal,SqlDecimal):SqlDecimal
         -20 (-80.00% of base) : System.Data.Common.dasm - SqlDouble:Add(SqlDouble,SqlDouble):SqlDouble
         -20 (-80.00% of base) : System.Data.Common.dasm - SqlDouble:Subtract(SqlDouble,SqlDouble):SqlDouble

1088 total methods with Code Size differences (1050 improved, 38 regressed), 207845 unchanged.

@AndyAyersMS
Member

The intersection of copy avoidance for last-use implicit by-ref structs at calls and tail calls is interesting. I'm not sure I completely understand it yet.

Imagine we have a chain of methods that tail call one another, A->B->C, where B and C have by-ref struct params. A passes a local struct (by ref) to B, and B passes it directly to C.

A cannot safely tail call B, because the struct lives in A's frame, which is trashed on entry to B. But A can avoid copying the struct at the call site if that call is its last use.

One might think B can safely tail call C only if it avoids making a local copy.

But dotnet/coreclr#5394 shows this isn't always true; a chain of slow tail calls via helpers can "reach back" to earlier frames and trash them as well. The fix in dotnet/coreclr#5394 is to have B copy the struct if it is going to slow tail call C. But this would seemingly make it unsafe for B to tail call C, unless the slow helper does something magical.
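
A hypothetical C# shape of that chain (names, struct layout, and attributes invented for illustration):

using System.Runtime.CompilerServices;

// 32 bytes: passed by implicit byref on win-x64, so callees receive a reference.
struct BigStruct { public long X0, X1, X2, X3; }

static class Chain
{
    static void A()
    {
        BigStruct local = default;
        B(local);      // last use: the call-site copy could be elided, but A must not
                       // tail call B, because 'local' lives in A's frame
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void B(BigStruct s)
    {
        C(s);          // forwarding the incoming reference without a copy looks tail-callable,
                       // but a slow (helper-based) tail call can reach back into earlier
                       // frames (dotnet/coreclr#5394), so B may need its own copy
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void C(BigStruct s) { }
}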

It's possible the proposed new portable fast tail call mechanism (#341) will have similar reach-back abilities. Need to look into this more closely.

@jakobbotsch
Member

But dotnet/coreclr#5394 shows this isn't always true; a chain of slow tail calls via helpers can "reach back" to earlier frames and trash them as well. The fix in dotnet/coreclr#5394 is to have B copy the struct if it is going to slow tail call C. But this would seemingly make it unsafe for B to tail call C, unless the slow helper does something magical.

JIT_TailCall is invoked with all args to be passed and will set them up in the frame of the caller's caller with special copying. For args passed implicitly by ref, it will copy the contents, not just the reference:

if (ArgIterator::IsArgPassedByRef(cbArg)) {

It's possible the proposed new portable fast tail call mechanism (#341) will have similar reach-back abilities. Need to look into this more closely.

This optimization should be ok for the new tail call helpers. From the JIT's perspective the helpers will not do anything special with the arguments. They will simply be copied into TLS using standard IL instructions.

@msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits added this to the Future milestone Jan 31, 2020
@AndyAyersMS
Member

Fixed by #1751.
