Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial support for partially jitted methods #34522

Closed
wants to merge 11 commits into from

Conversation

AndyAyersMS
Copy link
Member

Introduce a new form of patchpoint for rarely-executed blocks in a method
(typically, blocks with throws). Instead of jitting the code in those blocks,
add code to call a new patchpoint helper (similar to the existing one) which
finds or creates the jitted code for the method beyond this point, and then
transitions control to that code.

Suppressing codegen for exception throws in this way gives similar benefits
to the throw helper transformation, without needing to modify source. This
currently works only in Tier0, but in principle could be extended to handle
optimized code.

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 3, 2020
@AndyAyersMS
Copy link
Member Author

Jit diffs (tier0, pmi) shows how much space all these throw sequences take up. Regressions seem to be PMI artifacts.

Total bytes of diff: -329106 (-4.56% of base)
    diff is an improvement.

Top file improvements (bytes):
     -329106 : System.Private.CoreLib.dasm (-4.56% of base)

1 total files with Code Size differences (1 improved, 0 regressed), 0 unchanged.

Top method regressions (bytes):
          86 (12.55% of base) : System.Private.CoreLib.dasm - RuntimeType:ValidateGenericArguments(MemberInfo,ref,Exception) (1 base, 2 diff methods)
           1 ( 0.21% of base) : System.Private.CoreLib.dasm - Task:InternalCancel(bool):bool:this
           1 ( 0.10% of base) : System.Private.CoreLib.dasm - <DisposeAsync>d__35:MoveNext():this

Top method improvements (bytes):
       -3426 (-60.62% of base) : System.Private.CoreLib.dasm - UnmanagedMemoryAccessor:ReadArray(long,ref,int,int):int:this (6 methods)
       -3354 (-68.00% of base) : System.Private.CoreLib.dasm - UnmanagedMemoryAccessor:WriteArray(long,ref,int,int):this (6 methods)
       -2472 (-65.61% of base) : System.Private.CoreLib.dasm - UnmanagedMemoryAccessor:Read(long,byref):this (6 methods)
       -2466 (-64.25% of base) : System.Private.CoreLib.dasm - UnmanagedMemoryAccessor:Write(long,byref):this (6 methods)
       -2201 (-33.84% of base) : System.Private.CoreLib.dasm - RuntimeType:InvokeMember(String,int,Binder,Object,ref,ref,CultureInfo,ref):Object:this
       -1881 (-47.24% of base) : System.Private.CoreLib.dasm - Marshal:CopyToManaged(long,ref,int,int) (7 methods)
       -1851 (-37.46% of base) : System.Private.CoreLib.dasm - CustomAttributeBuilder:InitCustomAttributeBuilder(ConstructorInfo,ref,ref,ref,ref,ref):this
       -1806 (-34.73% of base) : System.Private.CoreLib.dasm - SafeBuffer:ReadArray(long,ref,int,int):this (6 methods)
       -1806 (-34.73% of base) : System.Private.CoreLib.dasm - SafeBuffer:WriteArray(long,ref,int,int):this (6 methods)
       -1728 (-48.92% of base) : System.Private.CoreLib.dasm - TaskFactory:CheckMultiContinuationTasksAndCopy(ref):ref (8 methods)
       -1700 (-22.55% of base) : System.Private.CoreLib.dasm - CLRIPropertyValueImpl:CoerceArrayValue(int):ref:this (7 methods)
       -1540 (-55.82% of base) : System.Private.CoreLib.dasm - Vector`1:CopyTo(ref,int):this (6 methods)
       -1500 (-31.15% of base) : System.Private.CoreLib.dasm - MapToCollectionAdapter:CopyTo(ref,int):this (7 methods)
       -1474 (-29.34% of base) : System.Private.CoreLib.dasm - DictionaryKeyCollection`2:CopyTo(ref,int):this (7 methods)
       -1462 (-29.31% of base) : System.Private.CoreLib.dasm - DictionaryValueCollection`2:CopyTo(ref,int):this (7 methods)
       -1448 (-39.63% of base) : System.Private.CoreLib.dasm - VectorToCollectionAdapter:CopyTo(ref,int):this (7 methods)
       -1263 (-31.64% of base) : System.Private.CoreLib.dasm - IdnMapping:PunycodeDecode(String):String
       -1232 (-15.15% of base) : System.Private.CoreLib.dasm - TlsOverPerCoreLockedStacksArrayPool`1:Return(ref,bool):this (9 methods)
        -990 (-27.45% of base) : System.Private.CoreLib.dasm - CLRIPropertyValueImpl:UnboxArray(Type):ref:this (6 methods)
        -931 (-27.72% of base) : System.Private.CoreLib.dasm - ValueTuple`2:System.Collections.IStructuralComparable.CompareTo(Object,IComparer):int:this (7 methods)

Top method regressions (percentages):
          86 (12.55% of base) : System.Private.CoreLib.dasm - RuntimeType:ValidateGenericArguments(MemberInfo,ref,Exception) (1 base, 2 diff methods)
           1 ( 0.21% of base) : System.Private.CoreLib.dasm - Task:InternalCancel(bool):bool:this
           1 ( 0.10% of base) : System.Private.CoreLib.dasm - <DisposeAsync>d__35:MoveNext():this

Top method improvements (percentages):
        -480 (-79.21% of base) : System.Private.CoreLib.dasm - EncoderExceptionFallbackBuffer:Fallback(ushort,ushort,int):bool:this
        -497 (-79.14% of base) : System.Private.CoreLib.dasm - GregorianCalendar:IsLeapMonth(int,int,int):bool:this
        -308 (-77.19% of base) : System.Private.CoreLib.dasm - JapaneseCalendar:ToFourDigitYear(int):int:this
        -308 (-77.19% of base) : System.Private.CoreLib.dasm - TaiwanCalendar:ToFourDigitYear(int):int:this
        -371 (-76.18% of base) : System.Private.CoreLib.dasm - GC:RegisterForFullGCNotification(int,int)
        -297 (-75.77% of base) : System.Private.CoreLib.dasm - GregorianCalendar:GetLeapMonth(int,int):int:this
        -297 (-75.38% of base) : System.Private.CoreLib.dasm - GregorianCalendar:ToFourDigitYear(int):int:this
        -297 (-75.19% of base) : System.Private.CoreLib.dasm - GregorianCalendar:GetMonthsInYear(int,int):int:this
        -731 (-75.05% of base) : System.Private.CoreLib.dasm - GregorianCalendar:IsLeapDay(int,int,int,int):bool:this
        -297 (-75.00% of base) : System.Private.CoreLib.dasm - ChineseLunisolarCalendar:GetGregorianYear(int,int):int:this
        -297 (-75.00% of base) : System.Private.CoreLib.dasm - KoreanLunisolarCalendar:GetGregorianYear(int,int):int:this
        -300 (-75.00% of base) : System.Private.CoreLib.dasm - JulianCalendar:ToFourDigitYear(int):int:this
        -301 (-74.32% of base) : System.Private.CoreLib.dasm - HijriCalendar:CheckYearMonthRange(int,int,int)
        -301 (-74.32% of base) : System.Private.CoreLib.dasm - PersianCalendar:CheckYearMonthRange(int,int,int)
        -145 (-73.23% of base) : System.Private.CoreLib.dasm - Char:ConvertToUtf32_ThrowInvalidArgs(int)
        -297 (-72.62% of base) : System.Private.CoreLib.dasm - HijriCalendar:ToFourDigitYear(int):int:this
        -297 (-72.62% of base) : System.Private.CoreLib.dasm - PersianCalendar:ToFourDigitYear(int):int:this
        -191 (-71.54% of base) : System.Private.CoreLib.dasm - PersianCalendar:CheckYearRange(int,int)
        -297 (-71.05% of base) : System.Private.CoreLib.dasm - HebrewCalendar:ToFourDigitYear(int):int:this
        -297 (-71.05% of base) : System.Private.CoreLib.dasm - UmAlQuraCalendar:ToFourDigitYear(int):int:this

2379 total methods with Code Size differences (2376 improved, 3 regressed), 20139 unchanged.

This gives us a ballpark estimate of how much code size we might save if we throw-helperized everything in corelib (biased on the high side because of Tier0 codegen; on the low side because there is no inlining). We can't yet apply this to R2R/Tier1 because we don't support transitions from optimized code.

With this optimization we don't need to throw helperize, it more or less does that automagically. Note there are pass-through benefits in many cases as prologs are simpler and we have less zeroing, since the string formatting done in many throws creates structs; these costs are now "deferred" to the remainder method that is jitted only if the exceptional path is reached.

Would guess that this ought to give us ~2% startup wins in scenarios where Tier0 jit time is prominent (~5% faster jitting, jit is 50% of total startup cost).

Would be nice to try and apply this to funclets (at least catch/fault funclets) too; haven't yet looked into how feasible that might be.

We could also potentially apply this more broadly based on IBC, if we had high confidence that IBC "zero" counts were reflective of how most of the code is actually used.

cc @dotnet/jit-contrib @jkotas

@AndyAyersMS
Copy link
Member Author

Will need a new jit guid, if/when we get closer to merging.

@AndyAyersMS
Copy link
Member Author

Testing ongoing, but a few early returns:

  • looks like no-op phase detection post-check is firing for OSR, will need to tweak that part.
  • uncommon patchpoints seem to be happening in handlers

@EgorBo
Copy link
Member

EgorBo commented Apr 4, 2020

We could also potentially apply this more broadly based on IBC

Or Unsafe.Unlikely 🙂

So if JIT doesn't compile rarely used blocks - doesn't it make sense to allow inliner to inline more? E.g. don't count throw new instructions when it estimates native code size (or count them as ThrowHelpers)

@AndyAyersMS
Copy link
Member Author

@EgorBo currently this only works for unoptimized codegen (no inlining). Agree things get quite interesting once we can handle patchpoints in optimized codegen, but that's a ways off.

I need to look more carefully at what sort of costs partial jitting imposes, both in terms of time and memory. For example, if the continuation piece is already jitted, the transition time from the original method to the continuation is the sum of:

  • helper call cost
  • map from code address to loader allocator
  • hash lookup in patchpoint table to find continuation entry point
  • two calls to Virtual Unwind (unwind helper, unwind original method)
  • update and restore context
  • continuation method prolog/epilog

Would guess this time is dominated by VirtualUnwind, so maybe 100,000 instructions? Say ~ 50us? Need to measure. This limits the granularity at which partial compilation pays off -- the not-compiled region has to be relatively costly/slow/unlikely.

It might be possible to backpatch the patchpoint call and turn it into a jump to the continuation -- that would reduce the transition time quite a bit (basically just the prolog/epilog, which should be 10's of instructions). That would make this technique viable for a wider variety of cases.

Initial transition time would be quite a bit higher, as the jit needs to run; for RyuJit the minimum jitting time is on the order of 200us (?) and there is more work on the runtime side too.

@AndyAyersMS
Copy link
Member Author

Remaining failure in the "jit-experimental / uncommon" leg seems to be a dataflow bug. I could work around by not optimizing the Tier0 uncommon fragments, but we should fix the bug. Not clear how easy it will be to repro w/o the unusual dataflow created by the changes in the PR. Will give it a try.

@AndyAyersMS
Copy link
Member Author

Think I've fixed the last bug in the pri1 tests.

Here's a sample of the codegen this produces, for a simple program:

using System;

class E : Exception
{
    public E(double v)
    {
        m_v = v;
    }

    double m_v;
}

class X
{
    public static void Main(string[] args)
    {
        if (args.Length == 1)
        {
            throw new E(11.3);
        }
        Console.WriteLine("hello, world\n");
    }
}

Initial jitting of Main leaves a helper call in place of the throw:

; Assembly listing for method X:Main(System.String[])
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-0 compilation
; MinOpts code
; rbp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00    ] (  1,  1   )     ref  ->  [rbp+0x10]   class-hnd
;  V01 OutArgs      [V01    ] (  1,  1   )  lclBlk (32) [rsp+0x00]   "OutgoingArgSpace"
;  V02 tmp1         [V02    ] (  1,  1   )     ref  ->  [rbp-0x08]   must-init class-hnd exact "NewObj constructor temp"
;
; Lcl frame size = 48

G_M3786_IG01:
       55                   push     rbp
       4883EC30             sub      rsp, 48
       488D6C2430           lea      rbp, [rsp+30H]
       33C0                 xor      rax, rax
       488945F8             mov      qword ptr [rbp-08H], rax
       48894D10             mov      gword ptr [rbp+10H], rcx
                                                ;; bbWeight=1    PerfScore 4.00
G_M3786_IG02:
       488B4D10             mov      rcx, gword ptr [rbp+10H]
       83790801             cmp      dword ptr [rcx+8], 1
       750A                 jne      SHORT G_M3786_IG04
                                                ;; bbWeight=1    PerfScore 4.00
G_M3786_IG03:
       B906000000           mov      ecx, 6
       E828044C5F           call     CORINFO_HELP_UNCOMMON_PATCHPOINT
                                                ;; bbWeight=0    PerfScore 0.00
G_M3786_IG04:
       48B99031A16CA4010000 mov      rcx, 0x1A46CA13190
       488B09               mov      rcx, gword ptr [rcx]
       E8F6F8FFFF           call     System.Console:WriteLine(System.String)
       90                   nop
                                                ;; bbWeight=1    PerfScore 3.50
G_M3786_IG05:
       488D6500             lea      rsp, [rbp]
       5D                   pop      rbp
       C3                   ret
                                                ;; bbWeight=1    PerfScore 2.00

; Total bytes of code 65, prolog size 16, PerfScore 20.00, (MethodHash=69d3f135) for method X:Main(System.String[])
; ============================================================

at runtime when the helper is called, it creates an OSR method for the bit we didn't compile:

; Assembly listing for method X:Main(System.String[])
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; OSR variant for entry point 0x6
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;* V00 arg0         [V00    ] (  0,  0   )     ref  ->  zero-ref    class-hnd
;  V01 OutArgs      [V01    ] (  1,  1   )  lclBlk (32) [rsp+0x00]   "OutgoingArgSpace"
;  V02 tmp1         [V02,T00] (  4,  0   )     ref  ->  rsi         class-hnd exact "NewObj constructor temp"
;
; Lcl frame size = 32

G_M3786_IG01:
       56                   push     rsi
       4883EC20             sub      rsp, 32
       C5F877               vzeroupper
                                                ;; bbWeight=1    PerfScore 2.25
G_M3786_IG02:
       48B9E0CF21DBF97F0000 mov      rcx, 0x7FF9DB21CFE0
       E819A78D5F           call     CORINFO_HELP_NEWSFAST
       488BF0               mov      rsi, rax
       488BCE               mov      rcx, rsi
       E88E10FFFF           call     System.Exception:.ctor():this
       C5FB100516000000     vmovsd   xmm0, qword ptr [reloc @RWD00]
       C5FB114678           vmovsd   qword ptr [rsi+120], xmm0
       488BCE               mov      rcx, rsi
       E839A64B5F           call     CORINFO_HELP_THROW
       CC                   int3
                                                ;; bbWeight=0    PerfScore 0.00
RWD00  dq       402699999999999Ah


; Total bytes of code 56, prolog size 8, PerfScore 8.05, (MethodHash=69d3f135) for method X:Main(System.String[])
; ============================================================

and we see the expected result:

Unhandled exception. E: Exception of type 'E' was thrown.
   at X.Main(String[] args)

We're optimizing the OSR part which is probably not the right thing here, but if the OSR continuation contains more code than just a throw -- say the throw was in a try -- we might want to optimize.

The full Tier0 version of the method is 103 bytes,. Partial Tier0 + OSR is 121. This delta would go down some if we didn't optimize the OSR part.

@AndyAyersMS
Copy link
Member Author

Remaining failures in "jit-experimental" are all from write-through.

@AndyAyersMS
Copy link
Member Author

@dotnet/jit-contrib think this is ready for some feedback.

I'm going to try and measure startup impact from this version.

Possible next steps:

  • Jit the deferred continuation at Tier0 instead of Tier1. Likely this means allowing OSR from the continuation, since it may have loops and rarely executed blocks.
  • If the continuation can have patchpoints we'll need to think through the dynamics ... there could be some interesting effects. I think the transition between first continuation and second should be able to pop off the first's frame extension, as any live state must live on the original frame. I think this may just work with the current transition logic. That also means the first continuation can simply reuse the descriptor from the original method (and same for any subsequent continuation). Current design requires us to make a copy of this data, we might look into ways we could share. We might be able to link from continuation to original method in the runtime, and so redirect patchpoint info lookup that way.
  • Because of this continuation frame popping, the overall stack usage of a method with deferral might be less than if we had jitted the whole method in one go, and it is possible the net code size might be less overall too (fewer large stack offsets?). If so, we might consider compiling very large Tier0 methods as chunks.
  • Look for other easy to identify cases besides throws where deferral makes sense. IBC data could be one source -- say a method that is called frequently and has cold paths -- but we currently don't have access to IBC data when jitting.
  • Try and more carefully measure overheads -- if we compile say half of a method and defer the rest, then later need to compile the rest, what is the net impact on overall jit time? If we think Tier0 jit time is roughly C + KS, where C and K are constants and S is some measure of the method size, then jitting two halves costs 2C + KS, so the cost delta is just C. Would be nice to try and verify and get ballpark estimates on the constants, or refute this model.

@AndyAyersMS AndyAyersMS marked this pull request as ready for review April 8, 2020 18:08
@AndyAyersMS
Copy link
Member Author

@dotnet/jit-contrib anyone want to take a look? This was (to me) a surprisingly small change on top of OSR, and I think it opens the door to some interesting possibilities.

Introduce a new form of patchpoint for rarely-executed blocks in a method
(typically, blocks with throws). Instead of jitting the code in those blocks,
add code to call a new patchpoint helper (similar to the existing one) which
finds or creates the jitted code for the method beyond this point, and then
transitions control to that code.

Suppressing codegen for exception throws in this way gives similar benefits
to the throw helper transformation, without needing to modify source. This
currently works only in Tier0, but in principle could be extended to handle
optimized code.
There may also be an issue (not restricted to OSR) for try exits;
left a todo there for now.
}
else
{
BitVecOps::IntersectionD(apTraits, block->bbAssertionIn, firstTryBlock->bbAssertionIn);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comment could be added here stating that for the normal case we use the assertion state from the first block of the try.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

BitVecOps::IntersectionD(apTraits, block->bbAssertionIn, firstTryBlock->bbAssertionIn);
}

// TODO -- possible bug: we may need to merge in the out set for every block in the try range.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still want this TODO ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're safe now, could not find a way to break us here. Will remove this.

@@ -396,6 +396,8 @@ CONFIG_INTEGER(JitGuardedDevirtualizationGuessBestClass, W("JitGuardedDevirtuali

// Enable insertion of patchpoints into Tier0 methods with loops.
CONFIG_INTEGER(TC_OnStackReplacement, W("TC_OnStackReplacement"), 0)
// Enable uncommon patchpoints in Tier0 methods
CONFIG_INTEGER(TC_UncommonPatchpoint, W("TC_UncommonPatchpoint"), 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add " (by default this is disabled)" to the comment.

compiler->lvaTable[ppCounterLclNum].lvType = TYP_INT;
if (compiler->doesMethodHavePatchpoints())
{
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this Empty block or add some code or JITDUMP.

//
// ==>
//
// call JIT_UNCOMMON_PATHCPOINT (ilOffset)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this method header comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's trying to say that we delete all the statements in the block and add a call; I'll clarify.

" unexpected context IP 0x%p\n", ip, GetIP(&frameContext));
EEPOLICY_HANDLE_FATAL_ERROR(COR_E_EXECUTIONENGINE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a change in behavior, but it now matches the comments.

{
// Should be fatal
STRESS_LOG2(LF_TIEREDCOMPILATION, LL_INFO10, "Jit_UncommonPatchpoint: patchpoint (0x%p) TRANSITION"
" unexpected context IP 0x%p\n", ip, GetIP(&frameContext));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says that this should be fatal, but it isn't

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, will update to match the other cases like this.

@AndyAyersMS
Copy link
Member Author

@briansull thanks -- updated per your feedback.

I must have rebased somewhere along the way too, so had to force push.

@AndyAyersMS
Copy link
Member Author

Had to do another merge (conflict with another of my changes).

@AndyAyersMS
Copy link
Member Author

I'm going to close this for now, will revisit once 5.0 has branched off.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants