This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Patch vtable slots and similar when tiering is enabled #21292

Merged
merged 12 commits
Jan 12, 2019

Conversation

kouvel
Member

@kouvel kouvel commented Nov 30, 2018

For a method eligible for tiered compilation and vtable slot backpatching:

  • The entry point to the final code is versionable, as for any method eligible for tiering
  • It does not have a precode (HasPrecode() returns false)
  • It does not have a stable entry point (HasStableEntryPoint() returns false)
  • A call to the method may be:
    • An indirect call through the MethodTable's backpatchable vtable slot
    • A direct call to a backpatchable FuncPtrStub, perhaps through a JumpStub
    • For interface methods, an indirect call through the virtual stub dispatch (VSD) indirection cell to a backpatchable DispatchStub or a ResolveStub that refers to a backpatchable ResolveCacheEntry
  • The purpose is that typical calls to the method have no additional overhead when tiering is enabled

Recording and backpatching slots:

  • In order for all vtable slots for the method to be backpatchable:
    • A vtable slot initially points to the MethodDesc's temporary entry point, even when the method is inherited by a derived type (the slot's value is not copied from the parent)
    • The temporary entry point always points to the prestub and is never backpatched, in order to be able to discover new vtable slots through which the method may be called
    • The prestub, as part of DoBackpatch(), records any slots that are transitioned from the temporary entry point to the method's at-the-time current, non-prestub entry point
    • Any further changes to the method's entry point cause recorded slots to be backpatched in BackpatchEntryPointSlots()
  • In order for the FuncPtrStub to be backpatchable:
    • After the FuncPtrStub is created and exposed, it is patched to point to the method's at-the-time current entry point if necessary
    • Any further changes to the method's entry point cause the FuncPtrStub to be backpatched in BackpatchEntryPointSlots()
  • In order for VSD entities to be backpatchable:
    • A DispatchStub's entry point target is aligned and recorded for backpatching in BackpatchEntryPointSlots()
    • A ResolveCacheEntry's entry point target is recorded for backpatching in BackpatchEntryPointSlots()

Slot lifetime and management of recorded slots:

  • A slot is recorded in the LoaderAllocator in which the slot is allocated, see RecordAndBackpatchEntryPointSlot()
  • An inherited slot that has a shorter lifetime than the MethodDesc, when recorded, needs to be accessible by the MethodDesc for backpatching, so the dependent LoaderAllocator with the slot to backpatch is also recorded in the MethodDesc's LoaderAllocator, see MethodDescBackpatchInfo::AddDependentLoaderAllocatorsWithSlotsToBackpatch_Locked()
  • At the end of a LoaderAllocator's lifetime, the LoaderAllocator is unregistered from dependency LoaderAllocators, see MethodDescBackpatchInfoTracker::ClearDependencyMethodDescEntryPointSlotsToBackpatchHash()
  • When a MethodDesc's entry point changes, backpatching also includes iterating over recorded dependent LoaderAllocators to backpatch the relevant slots recorded there, see BackpatchEntryPointSlots()

Synchronization between entry point changes and backpatching slots:

  • A global lock is used to ensure that all recorded backpatchable slots corresponding to a MethodDesc point to the same entry point, see DoBackpatch() and BackpatchEntryPointSlots() for examples

Typical slot value transitions:

  • Initially, the slot contains the method's temporary entry point, which always points to the prestub (see above)
  • After the tier 0 JIT completes, the slot is transitioned to the tier 0 entry point, and the slot is recorded for backpatching
  • When tiered compilation decides to begin counting calls for the method, the slot is transitioned to the temporary entry point (call counting currently happens in the prestub)
  • When the call count reaches the tier 1 threshold, the slot is transitioned to the tier 0 entry point and a tier 1 JIT is scheduled
  • After the tier 1 JIT completes, the slot is transitioned to the tier 1 entry point

Due to startup time perf issues:

  • IsEligibleForTieredCompilation() is called more frequently with this change and in hotter paths. I chose to use a MethodDesc flag to store that information for fast retrieval. The flag is initialized by DetermineAndSetIsEligibleForTieredCompilation().
  • Initially, I experimented with allowing a tiered vtable method to have a precode, allocating a new precode that would also serve as the stable entry point when a direct call is necessary. That also makes recording a new slot optional: in the event of an OOM, the slot may just point to the stable entry point. However, there are a large number of such methods and the allocations were slowing down startup perf, so I had to eliminate precodes for tiered vtable methods, which in turn means that recording slots is necessary for versionability.

@jkotas @davidwrighton @noahfalk @AndyAyersMS

@kouvel kouvel added the area-VM label Nov 30, 2018
@kouvel kouvel added this to the 3.0 milestone Nov 30, 2018
@kouvel kouvel self-assigned this Nov 30, 2018
@jkotas
Member

jkotas commented Nov 30, 2018

Do you have any numbers for the costs (memory, time) of this extra bookkeeping?

@kouvel
Member Author

kouvel commented Nov 30, 2018

For perf, here is what I have so far. I'm in the process of gathering data for some more tests, and more samples for some cases to weed out error; I'll update inline once I have more data. Startup perf is slightly but measurably lower in some JitBench tests, and the same as before in TechEmpower ASP.NET tests. Steady-state perf is better in several TechEmpower ASP.NET tests.

For memory, I counted the number of methods eligible for vtable slot backpatching; in JitBench tests it's approximately 1/3 of the methods eligible for tiering. Typically, no slots are recorded for a method eligible for vtable slot backpatching, and in the same tests, when a method does have recorded slots, there are typically only one or two. From that high-level info I figured the memory impact would be minimal and would scale proportionally with the number of virtual methods. I can gather more specific data if you would like, let me know.

JitBench - Before

       Benchmark                Metric           NoR2R NoTiering            NoR2R                Default
-----------------------  --------------------  --------------------  --------------------  -------------------
Dotnet_Build_HelloWorld         Duration (ms)      2890 (2884-2906)  1962.5 (1958.5-1967)     1185 (1180-1190)
        Csc_Hello_World         Duration (ms)  1872.5 (1868.5-1878)       1002 (997-1012)      459 (458-462.5)
      Csc_Roslyn_Source         Duration (ms)      3622 (3603-3642)      3248 (3230-3264)     2306 (2296-2313)
  Empty Console Program         Duration (ms)        39.5 (39-41.5)          39 (38.5-40)           39 (39-40)
             MusicStore          Startup (ms)      1714 (1710-1726)      1061 (1056-1068)        526 (518-529)
             MusicStore    First Request (ms)  1667.5 (1662.5-1672)         926 (918-932)      409 (408-413.5)
             MusicStore  Median Response (ms)         1.99 (1.97-2)      2.04 (2.01-2.05)       2 (1.985-2.03)
               AllReady          Startup (ms)      2433 (2426-2440)      1687 (1684-1697)     1098 (1094-1104)
               AllReady    First Request (ms)      1022 (1011-1030)       594.5 (591-600)      337.5 (332-339)
               AllReady  Median Response (ms)     2.51 (2.505-2.56)    2.59 (2.575-2.605)    2.635 (2.6-2.695)
               Word2Vec         Training (ms)   36454 (36047-36798)   39364 (39148-39830)  39354 (38870-39745)
               Word2Vec     First Search (ms)          24 (24-24.5)          89 (89-89.5)         86 (85-86.5)
               Word2Vec    Median Search (ms)    21.46 (21.4-21.54)   21.53 (21.38-21.84)  21.46 (21.44-21.57)

JitBench - After

       Benchmark                Metric           NoR2R NoTiering            NoR2R               Default
-----------------------  --------------------  --------------------  -------------------  -------------------
Dotnet_Build_HelloWorld         Duration (ms)      2916 (2904-2924)     1980 (1978-1992)     1204 (1198-1210)
        Csc_Hello_World         Duration (ms)      1885 (1880-1896)     1012 (1005-1019)      468 (466.5-469)
      Csc_Roslyn_Source         Duration (ms)      3635 (3614-3650)     3222 (3188-3242)     2274 (2252-2301)
  Empty Console Program         Duration (ms)            40 (39-41)         40 (39-40.5)         39.5 (39-40)
             MusicStore          Startup (ms)      1715 (1708-1728)     1076 (1066-1082)        532 (529-535)
             MusicStore    First Request (ms)      1666 (1662-1677)        930 (926-940)      419 (417-422.5)
             MusicStore  Median Response (ms)     1.98 (1.97-1.995)    1.99 (1.98-2.005)       1.97 (1.945-2)
               AllReady          Startup (ms)    2449.5 (2443-2452)     1710 (1703-1716)     1114 (1108-1122)
               AllReady    First Request (ms)  1017.5 (1015-1020.5)  597.5 (595.5-603.5)        338 (337-347)
               AllReady  Median Response (ms)      2.51 (2.49-2.53)    2.54 (2.505-2.55)    2.555 (2.54-2.62)
               Word2Vec         Training (ms)   36452 (36172-36904)  39280 (38960-39920)  39693 (38999-40534)
               Word2Vec     First Search (ms)            24 (24-24)       86 (85.5-86.5)         86 (85-87.5)
               Word2Vec    Median Search (ms)   21.51 (21.46-21.56)  21.27 (21.23-21.33)  21.52 (21.44-21.96)

TechEmpower ASP.NET

  • With fragile ngen for CoreLib, no R2R
             Untiered TieredBefore TieredAfter
Json Windows KestrelLibuv
  Averages  464858.00    448243.00   464925.00
    Diff %                   -3.57        0.01
Json Linux KestrelLibuv
  Averages  435850.67    414764.00   426399.33
    Diff %                   -4.84       -2.17
MvcJil Windows KestrelSockets
  Averages  163697.67    159851.33   164040.25
    Diff %                   -2.35        0.21
MvcJil Linux KestrelSockets
  Averages  159984.25    155610.00   159347.67
    Diff %                   -2.73       -0.40
MvcJson Windows KestrelSockets
  Averages  162821.33    156610.17   162611.50
    Diff %                   -3.81       -0.13
MvcJson Linux KestrelSockets
  Averages  155805.75    152314.33   156347.00
    Diff %                   -2.24        0.35
MvcJson Windows KestrelLibuv
  Averages  147469.00    146242.67   146963.00
    Diff %                   -0.83       -0.34
MvcJson Linux KestrelLibuv
  Averages  156528.75    151799.25   155429.50
    Diff %                   -3.02       -0.70
MvcPlaintext Windows KestrelSockets
  Averages  516013.75    494174.75   519025.50
    Diff %                   -4.23        0.58
MvcPlaintext Linux KestrelSockets
  Averages  471352.50    451681.25   472128.25
    Diff %                   -4.17        0.16
MvcPlaintext Windows KestrelLibuv
  Averages  404184.25    361472.25   401474.00
    Diff %                  -10.57       -0.67
MvcPlaintext Linux KestrelLibuv
  Averages  403010.33    385628.00   400861.33
    Diff %                   -4.31       -0.53
Plaintext Windows KestrelSockets
  Averages 1927906.33   1886305.75  1926567.25
    Diff %                   -2.16       -0.07
Plaintext Linux KestrelSockets
  Averages 1684935.00   1619076.33  1675223.67
    Diff %                   -3.91       -0.58
ResponseCachingPlaintextCached Windows KestrelSockets
  Averages  800636.33    781698.50   820939.67
    Diff %                   -2.37        2.54
ResponseCachingPlaintextCached Linux KestrelSockets
  Averages  727953.00    710223.67   739248.00
    Diff %                   -2.44        1.55
ResponseCachingPlaintextCachedDelete Windows KestrelSockets
  Averages 1794292.25   1720943.75  1795011.50
    Diff %                   -4.09        0.04
ResponseCachingPlaintextCachedDelete Linux KestrelSockets
  Averages 1595171.75   1543908.75  1591782.75
    Diff %                   -3.21       -0.21
ResponseCachingPlaintextVaryByCached Windows KestrelSockets
  Averages  685215.00    669478.67   699582.00
    Diff %                   -2.30        2.10
ResponseCachingPlaintextVaryByCached Linux KestrelSockets
  Averages  605933.00    618759.00   637792.00
    Diff %                    2.12        5.26
StaticFiles Windows KestrelSockets
  Averages 1696266.00   1667242.75  1711140.50
    Diff %                   -1.71        0.88
StaticFiles Linux KestrelSockets
  Averages 1482655.75   1429070.00  1487667.50
    Diff %                   -3.61        0.34

@kouvel
Member Author

kouvel commented Nov 30, 2018

I believe most of the slight startup time degradation comes from finding and dealing with MethodDescs when copying slots to derived types. Using the MethodData cache reduced much of that, but GetMethodDescFromSlot and GetMethodDescFromSlotAddress still appear to take more time during startup than before.

@kouvel
Member Author

kouvel commented Nov 30, 2018

Updated commit description

@kouvel
Member Author

kouvel commented Dec 4, 2018

Some concrete data on memory impact is below. I have not accounted for unused space in arrays/hashtables and with the numbers below it looks like the amount of unused space would be small.

  • VirtualInfo count = number of MethodDescVirtualInfo objects created
  • SlotsToBackpatch count = number of slots recorded for backpatching in the same LoaderAllocator as the MethodDesc
  • Dependent LoaderAllocator count = number of dependent LoaderAllocators that have slots to backpatch and are recorded in MethodDescVirtualInfo
  • LoaderAllocator dependency MethodDesc count = number of MethodDescs in a different LoaderAllocator for which slots have been recorded in this LoaderAllocator (element count of new hash table in loaderallocator.hpp)
  • Dependent SlotsToBackpatch count = number of slots recorded in a LoaderAllocator that is different from the corresponding MethodDesc's LoaderAllocator
  • ctvm = called tiered vtable method

The number of slots recorded per called tiered vtable method is low. By byte count, the MethodDescVirtualInfo itself appears to be more significant than the slots recorded in it; there may be opportunities to tune the one- and two-slot cases to have less overhead. From the numbers below, though, the overhead looks fairly low to me.

JitBench MusicStore

Tiered method count:                               72129
Tiered vtable method count:                        21039
Called tiered method count:                        13068
Called tiered vtable method count:                  3620
VirtualInfo count:                                  1677 (  40 B each,      67080 B,  18.53 B/ctvm)
SlotsToBackpatch count:                             4088 (   8 B each,      32704 B,   9.03 B/ctvm,  1.13 slots/ctvm)
Dependent LoaderAllocator count:                       5 (   8 B each,         40 B,   0.01 B/ctvm)
LoaderAllocator dependency MethodDesc count:           5 (  32 B each,        160 B,   0.04 B/ctvm)
Dependent SlotsToBackpatch count:                      5 (   8 B each,         40 B,   0.01 B/ctvm,  0.00 slots/ctvm)
FuncPtrStub count for tiered vtable methods:         213 (  16 B each,       3408 B,   0.94 B/ctvm)
Total:                                                   (                 103432 B,  28.57 B/ctvm,  1.13 slots/ctvm)

JitBench AllReady

Tiered method count:                               75582
Tiered vtable method count:                        21308
Called tiered method count:                        12634
Called tiered vtable method count:                  3456
VirtualInfo count:                                  1821 (  40 B each,      72840 B,  21.08 B/ctvm)
SlotsToBackpatch count:                             7204 (   8 B each,      57632 B,  16.68 B/ctvm,  2.08 slots/ctvm)
Dependent LoaderAllocator count:                       0 (   8 B each,          0 B,   0.00 B/ctvm)
LoaderAllocator dependency MethodDesc count:           0 (  32 B each,          0 B,   0.00 B/ctvm)
Dependent SlotsToBackpatch count:                      0 (   8 B each,          0 B,   0.00 B/ctvm,  0.00 slots/ctvm)
FuncPtrStub count for tiered vtable methods:         174 (  16 B each,       2784 B,   0.81 B/ctvm)
Total:                                                   (                 133256 B,  38.56 B/ctvm,  2.08 slots/ctvm)

JitBench Word2Vec

Tiered method count:                                2573
Tiered vtable method count:                          326
Called tiered method count:                          259
Called tiered vtable method count:                    24
VirtualInfo count:                                     1 (  40 B each,         40 B,   1.67 B/ctvm)
SlotsToBackpatch count:                                1 (   8 B each,          8 B,   0.33 B/ctvm,  0.04 slots/ctvm)
Dependent LoaderAllocator count:                       0 (   8 B each,          0 B,   0.00 B/ctvm)
LoaderAllocator dependency MethodDesc count:           0 (  32 B each,          0 B,   0.00 B/ctvm)
Dependent SlotsToBackpatch count:                      0 (   8 B each,          0 B,   0.00 B/ctvm,  0.00 slots/ctvm)
FuncPtrStub count for tiered vtable methods:           2 (  16 B each,         32 B,   1.33 B/ctvm)
Total:                                                   (                     80 B,   3.33 B/ctvm,  0.04 slots/ctvm)

JitBench CscBuildRoslynSource (repeated in same process many times)

Tiered method count:                              112458
Tiered vtable method count:                        34166
Called tiered method count:                        16700
Called tiered vtable method count:                  5337
VirtualInfo count:                                  2008 (  40 B each,      80320 B,  15.05 B/ctvm)
SlotsToBackpatch count:                             7055 (   8 B each,      56440 B,  10.58 B/ctvm,  1.32 slots/ctvm)
Dependent LoaderAllocator count:                       7 (   8 B each,         56 B,   0.01 B/ctvm)
LoaderAllocator dependency MethodDesc count:           7 (  32 B each,        224 B,   0.04 B/ctvm)
Dependent SlotsToBackpatch count:                      9 (   8 B each,         72 B,   0.01 B/ctvm,  0.00 slots/ctvm)
FuncPtrStub count for tiered vtable methods:         259 (  16 B each,       4144 B,   0.78 B/ctvm)
Total:                                                   (                 141256 B,  26.47 B/ctvm,  1.32 slots/ctvm)

JitBench CscBuildHelloWorld (repeated in same process many times)

Tiered method count:                               83313
Tiered vtable method count:                        22174
Called tiered method count:                         6723
Called tiered vtable method count:                  1552
VirtualInfo count:                                   728 (  40 B each,      29120 B,  18.76 B/ctvm)
SlotsToBackpatch count:                             1402 (   8 B each,      11216 B,   7.23 B/ctvm,  0.90 slots/ctvm)
Dependent LoaderAllocator count:                       0 (   8 B each,          0 B,   0.00 B/ctvm)
LoaderAllocator dependency MethodDesc count:           0 (  32 B each,          0 B,   0.00 B/ctvm)
Dependent SlotsToBackpatch count:                      0 (   8 B each,          0 B,   0.00 B/ctvm,  0.00 slots/ctvm)
FuncPtrStub count for tiered vtable methods:          94 (  16 B each,       1504 B,   0.97 B/ctvm)
Total:                                                   (                  41840 B,  26.96 B/ctvm,  0.90 slots/ctvm)

Json Windows KestrelSockets

Tiered method count:                               23673
Tiered vtable method count:                         6175
Called tiered method count:                         3550
Called tiered vtable method count:                   575
VirtualInfo count:                                   292 (  40 B each,      11680 B,  20.31 B/ctvm)
SlotsToBackpatch count:                              472 (   8 B each,       3776 B,   6.57 B/ctvm,  0.82 slots/ctvm)
Dependent LoaderAllocator count:                       0 (   8 B each,          0 B,   0.00 B/ctvm)
LoaderAllocator dependency MethodDesc count:           0 (  32 B each,          0 B,   0.00 B/ctvm)
Dependent SlotsToBackpatch count:                      0 (   8 B each,          0 B,   0.00 B/ctvm,  0.00 slots/ctvm)
FuncPtrStub count for tiered vtable methods:           7 (  16 B each,        112 B,   0.19 B/ctvm)
Total:                                                   (                  15568 B,  27.07 B/ctvm,  0.82 slots/ctvm)

MvcJil Windows KestrelSockets

Tiered method count:                               38132
Tiered vtable method count:                         9688
Called tiered method count:                         6210
Called tiered vtable method count:                   943
VirtualInfo count:                                   483 (  40 B each,      19320 B,  20.49 B/ctvm)
SlotsToBackpatch count:                              955 (   8 B each,       7640 B,   8.10 B/ctvm,  1.01 slots/ctvm)
Dependent LoaderAllocator count:                       2 (   8 B each,         16 B,   0.02 B/ctvm)
LoaderAllocator dependency MethodDesc count:           2 (  32 B each,         64 B,   0.07 B/ctvm)
Dependent SlotsToBackpatch count:                      2 (   8 B each,         16 B,   0.02 B/ctvm,  0.00 slots/ctvm)
FuncPtrStub count for tiered vtable methods:           8 (  16 B each,        128 B,   0.14 B/ctvm)
Total:                                                   (                  27184 B,  28.83 B/ctvm,  1.01 slots/ctvm)

MvcJson Windows KestrelSockets

Tiered method count:                               32286
Tiered vtable method count:                         8559
Called tiered method count:                         5761
Called tiered vtable method count:                   932
VirtualInfo count:                                   471 (  40 B each,      18840 B,  20.21 B/ctvm)
SlotsToBackpatch count:                              927 (   8 B each,       7416 B,   7.96 B/ctvm,  0.99 slots/ctvm)
Dependent LoaderAllocator count:                       2 (   8 B each,         16 B,   0.02 B/ctvm)
LoaderAllocator dependency MethodDesc count:           2 (  32 B each,         64 B,   0.07 B/ctvm)
Dependent SlotsToBackpatch count:                      2 (   8 B each,         16 B,   0.02 B/ctvm,  0.00 slots/ctvm)
FuncPtrStub count for tiered vtable methods:          10 (  16 B each,        160 B,   0.17 B/ctvm)
Total:                                                   (                  26512 B,  28.45 B/ctvm,  1.00 slots/ctvm)

MvcPlaintext Windows KestrelSockets

Tiered method count:                               28795
Tiered vtable method count:                         7526
Called tiered method count:                         5109
Called tiered vtable method count:                   815
VirtualInfo count:                                   427 (  40 B each,      17080 B,  20.96 B/ctvm)
SlotsToBackpatch count:                              833 (   8 B each,       6664 B,   8.18 B/ctvm,  1.02 slots/ctvm)
Dependent LoaderAllocator count:                       2 (   8 B each,         16 B,   0.02 B/ctvm)
LoaderAllocator dependency MethodDesc count:           2 (  32 B each,         64 B,   0.08 B/ctvm)
Dependent SlotsToBackpatch count:                      2 (   8 B each,         16 B,   0.02 B/ctvm,  0.00 slots/ctvm)
FuncPtrStub count for tiered vtable methods:           7 (  16 B each,        112 B,   0.14 B/ctvm)
Total:                                                   (                  23952 B,  29.39 B/ctvm,  1.02 slots/ctvm)

Plaintext Windows KestrelSockets

Tiered method count:                               20085
Tiered vtable method count:                         5081
Called tiered method count:                         3048
Called tiered vtable method count:                   490
VirtualInfo count:                                   258 (  40 B each,      10320 B,  21.06 B/ctvm)
SlotsToBackpatch count:                              409 (   8 B each,       3272 B,   6.68 B/ctvm,  0.83 slots/ctvm)
Dependent LoaderAllocator count:                       0 (   8 B each,          0 B,   0.00 B/ctvm)
LoaderAllocator dependency MethodDesc count:           0 (  32 B each,          0 B,   0.00 B/ctvm)
Dependent SlotsToBackpatch count:                      0 (   8 B each,          0 B,   0.00 B/ctvm,  0.00 slots/ctvm)
FuncPtrStub count for tiered vtable methods:           3 (  16 B each,         48 B,   0.10 B/ctvm)
Total:                                                   (                  13640 B,  27.84 B/ctvm,  0.83 slots/ctvm)

ResponseCachingPlaintextCached Windows KestrelSockets

Tiered method count:                               21323
Tiered vtable method count:                         5283
Called tiered method count:                         3429
Called tiered vtable method count:                   570
VirtualInfo count:                                   297 (  40 B each,      11880 B,  20.84 B/ctvm)
SlotsToBackpatch count:                              505 (   8 B each,       4040 B,   7.09 B/ctvm,  0.89 slots/ctvm)
Dependent LoaderAllocator count:                       0 (   8 B each,          0 B,   0.00 B/ctvm)
LoaderAllocator dependency MethodDesc count:           0 (  32 B each,          0 B,   0.00 B/ctvm)
Dependent SlotsToBackpatch count:                      0 (   8 B each,          0 B,   0.00 B/ctvm,  0.00 slots/ctvm)
FuncPtrStub count for tiered vtable methods:           4 (  16 B each,         64 B,   0.11 B/ctvm)
Total:                                                   (                  15984 B,  28.04 B/ctvm,  0.89 slots/ctvm)

ResponseCachingPlaintextCachedDelete Windows KestrelSockets

Tiered method count:                               21716
Tiered vtable method count:                         5518
Called tiered method count:                         3102
Called tiered vtable method count:                   481
VirtualInfo count:                                   254 (  40 B each,      10160 B,  21.12 B/ctvm)
SlotsToBackpatch count:                              397 (   8 B each,       3176 B,   6.60 B/ctvm,  0.83 slots/ctvm)
Dependent LoaderAllocator count:                       0 (   8 B each,          0 B,   0.00 B/ctvm)
LoaderAllocator dependency MethodDesc count:           0 (  32 B each,          0 B,   0.00 B/ctvm)
Dependent SlotsToBackpatch count:                      0 (   8 B each,          0 B,   0.00 B/ctvm,  0.00 slots/ctvm)
FuncPtrStub count for tiered vtable methods:           5 (  16 B each,         80 B,   0.17 B/ctvm)
Total:                                                   (                  13416 B,  27.89 B/ctvm,  0.83 slots/ctvm)

ResponseCachingPlaintextVaryByCached Windows KestrelSockets

Tiered method count:                               21335
Tiered vtable method count:                         5284
Called tiered method count:                         3456
Called tiered vtable method count:                   572
VirtualInfo count:                                   301 (  40 B each,      12040 B,  21.05 B/ctvm)
SlotsToBackpatch count:                              485 (   8 B each,       3880 B,   6.78 B/ctvm,  0.85 slots/ctvm)
Dependent LoaderAllocator count:                       0 (   8 B each,          0 B,   0.00 B/ctvm)
LoaderAllocator dependency MethodDesc count:           0 (  32 B each,          0 B,   0.00 B/ctvm)
Dependent SlotsToBackpatch count:                      0 (   8 B each,          0 B,   0.00 B/ctvm,  0.00 slots/ctvm)
FuncPtrStub count for tiered vtable methods:           3 (  16 B each,         48 B,   0.08 B/ctvm)
Total:                                                   (                  15968 B,  27.92 B/ctvm,  0.85 slots/ctvm)

StaticFiles Windows KestrelSockets

Tiered method count:                               20309
Tiered vtable method count:                         5091
Called tiered method count:                         3108
Called tiered vtable method count:                   497
VirtualInfo count:                                   262 (  40 B each,      10480 B,  21.09 B/ctvm)
SlotsToBackpatch count:                              422 (   8 B each,       3376 B,   6.79 B/ctvm,  0.85 slots/ctvm)
Dependent LoaderAllocator count:                       0 (   8 B each,          0 B,   0.00 B/ctvm)
LoaderAllocator dependency MethodDesc count:           0 (  32 B each,          0 B,   0.00 B/ctvm)
Dependent SlotsToBackpatch count:                      0 (   8 B each,          0 B,   0.00 B/ctvm,  0.00 slots/ctvm)
FuncPtrStub count for tiered vtable methods:           4 (  16 B each,         64 B,   0.13 B/ctvm)
Total:                                                   (                  13920 B,  28.01 B/ctvm,  0.85 slots/ctvm)

@davidwrighton
Member

// Keep in-sync with MethodDesc::IsEligibleForTieredCompilation()

Change this to MethodDesc::DetermineAndSetIsEligibleForTieredCompilation


Refers to: src/vm/methodtablebuilder.cpp:6952 in c294ed4.

@davidwrighton
Member

Could you write up a description of the various stages a vtable entry goes through, starting from initial, which points to prestub, to the end case where its finally entirely using a tier 2 pointer?

@kouvel
Member Author

kouvel commented Dec 5, 2018

Updated perf numbers and added count of new FuncPtrStubs to memory numbers inline above. Rebased for conflicts. Addressed feedback in the last two commits.

@davidwrighton
Member

Why is all this infrastructure labeled with Virtual this and VTable that? I don't see any fundamental reason why this stuff should be tied to virtual functions. It seems to me that this would also be useful for other places where we hold a function pointer, that ideally would just point to the target method address. For instance, in R2R code all calls to other methods go through an indirection cell, and this concept of a backpatchable address could be used to improve the performance of those calls for applications which are relatively short lived.

Also, I'm in the process of building out infrastructure to support making interface dispatch calls to interfaces which only have 1 implementation. This would involve placing code pointers directly into VSD indirection cells. Interestingly enough, this exactly matches the logic here, except that those cells need to be able to be updated to not be target address pointers. It feels like that would be a straightforward addition to this logic, but I'd love to hear your opinion.

@davidwrighton
Member

Finally, I've been looking at this code for a few days, and I've concluded that the memory management is correct and, to the best of my ability to code review, the implementation is correct, but I'm uneasy with the way that cross-loader-allocator data is handled.

  1. The data structure is implemented in such a way that a similar data structure would require rebuilding it all from scratch. While that is necessary for this purpose, it feels like a data structure that should be written in a generic fashion.
  2. The cost of a LoaderAllocatorSet per MethodDesc that goes multi-LoaderAllocator seems somewhat high in terms of memory consumption. The memory-use numbers presented above are not representative of the true cost here: you are not really measuring any collectible LoaderAllocator scenarios, nor the minimum cost of having a LoaderAllocatorSet, which is ~32 bytes + 7 * pointer size. I have been thinking that instead of attempting to use a LoaderAllocatorSet, we could play some games with a hashtable and separate storage for the nodes, such that we store the nodes in the appropriate LoaderAllocator, but a hashtable of pointers in the LoaderAllocator of the controlling method. This would reduce lookup costs to a single O(~1) lookup, reduce memory usage by removing the need for duplicate LoaderAllocatorSets, and be roughly equivalent in cleanup costs, as cleaning the one master hash is probably no more expensive than cleaning an item out of a large collection of small sets.

Overall, I believe the current state is acceptable for checkin, but I would like to see the data structure management turn into a generic service of the LoaderAllocator concept in the next few months.

@kouvel
Member Author

kouvel commented Dec 7, 2018

Thanks for the feedback. You're right, the actual storage doesn't need to be specific to virtualness. I'll go ahead and remove association with virtualness for the storage portion.

As discussed offline, the LoaderAllocatorSet can be eliminated in favor of using a multi-hash instead and that could simplify some of the data structures and save some space. I'll look into how easy it would be to make that change as well, so that it would be easy to move to the generic service in the future.

@kouvel
Member Author

kouvel commented Dec 12, 2018

Rebased, last three commits are new. After discussing further offline, I kept the storage mechanism as it was, and did some cleanup to make it easier to move to a new storage mechanism in the future and to use the same backpatching logic for other purposes.

@kouvel
Member Author

kouvel commented Dec 14, 2018

@dotnet-bot test this please

@kouvel
Member Author

kouvel commented Dec 14, 2018

@davidwrighton, would you be able to make a pass over the latest commits (starting from the one titled "Rename VirtualInfo -> BackpatchInfo")? I'd like to get this into 3.0 preview 2 (I'm told that would be around mid Jan) with some bake time to flush things out.

@kouvel
Member Author

kouvel commented Dec 15, 2018

@davidwrighton I see you're on vacation, this can wait, have a good time!

Member

@noahfalk noahfalk left a comment


Sorry it's taken me so long to get to this, Kount, with my December vacation, but I spent a decent chunk of time this weekend to get this cleared away. I don't have the expertise in VSD/MethodTable construction that I suspect David or Jan might have, but from what I can tell it looks quite solid. Most of my comments wound up being cosmetic, around naming, factoring, and commenting suggestions.

src/vm/frames.cpp (outdated, resolved)
src/vm/method.cpp (outdated, resolved)
src/vm/precode.cpp (resolved)
src/vm/precode.cpp (resolved)
Documentation/design-docs/code-versioning.md (outdated, resolved)
src/vm/methoddescbackpatchinfo.h (outdated, resolved)
src/vm/methoddescbackpatchinfo.h (outdated, resolved)
src/vm/methoddescbackpatchinfo.h (outdated, resolved)
src/vm/prestub.cpp (resolved)
Member

@davidwrighton davidwrighton left a comment


:shipit:

@@ -880,10 +880,10 @@ EXTERN_C PCODE VirtualMethodFixupWorker(TransitionBlock * pTransitionBlock, CORC
INSTALL_MANAGED_EXCEPTION_DISPATCHER;
INSTALL_UNWIND_AND_CONTINUE_HANDLER_NO_PROBE;

if (pMD->IsTieredVtableMethod())
if (pMD->IsVersionableWithVtableSlotBackpatch())
Member


It still seems a little odd to me to refer to vtable slots specifically when we actually patch a variety of different slots. However, given that the comments accurately describe all the slots that are handled, I think it's a minor concern and I can be fine with it.

Member Author


Yea, I went back and forth on that. The VSD slots that are recorded are sort of extensions of vtable slots, in the sense that they copy the value of a vtable slot and so must be backpatched whenever the source vtable slot changes. They are not actually vtable slots but slots that cache vtable slot values, so I thought the naming was close enough. Another reason to include "vtable" or something similar in the name is that the versioning scheme is specific to methods that have a vtable slot, which would have different restrictions (like not having a precode or stable entry point, for example) than another versioning scheme that may apply to non-vtable slots or caller slots.


MethodDescBackpatchInfo *backpatchInfo =
mdLoaderAllocator->GetMethodDescBackpatchInfoTracker()->GetOrAddBackpatchInfo_Locked(this);
if (slotLoaderAllocator == mdLoaderAllocator)
Member


As above, only a suggestion for the future. (I unresolved this to ensure visibility; I'm not sure what happens if I respond to a resolved conversation.)

I'm not sure when it would be a common case.

The case I am referring to is: module A.dll with method A in it calls module B.dll with method B in it. Assuming neither module is collectible, they should both have AppDomain lifetime. This should make it OK to store A.dll's slot in method B's backpatch info. If our customers don't go crazy with collectible assemblies, this should be true for most cross-module calls.

SetEntryPointToBackpatch_Locked(entryPoint, isPrestubEntryPoint);
}

void MethodDesc::SetCodeEntryPoint(PCODE entryPoint)
Member


Although I don't want to rathole on it and I'm fine if you want to check it in as-is, I do think there is some method to the madness (no pun intended?) in the naming, and this new method isn't following the convention as I understand it. Let me explain my understanding of it; then I'm happy to chat about it offline if you want to discuss other ideas in the future, or let it lie if not.

In the context of MethodDesc, the runtime has taken the somewhat generic CS terms 'entry point' and 'native code' and turned them into proper nouns 'EntryPoint' and 'NativeCode' that have specific meanings that are much narrower than the generic CS terms might suggest. Some addresses only meet the criteria of an EntryPoint, others only meet the criteria of NativeCode, and some addresses meet the criteria of both, which allows them to be referred to by either term.

I think of 'EntryPoint's as the values returned by GetMethodEntryPoint()/GetTemporaryEntryPoint()/GetStableEntryPoint() and they have properties:

  1. It can be called with the managed ABI, as defined by the MethodDesc's signature
  2. All callable methods have at least one such address
  3. Invoking it always executes the current version of the method
  4. It can be stored in the method's slot

I think of 'NativeCode' as the value returned by GetNativeCode() and it has properties:

  1. It is always the direct address of jitted (or prejitted) code, never of any other code generator.
  2. It is generally the jitted code that implements the default version of the method, however EnC is a special case (which I hope to eliminate in the future) that uses NativeCode as the jitted code for the most recent version of the method.
  3. NativeCode is the only thing that is permitted to be stored in the NativeCodeSlot, but it does not have to be stored in a NativeCodeSlot if it is possible to compute it from other fields.

Using those criteria, most of the names in the MethodDesc code make sense to me. Some examples:

  1. When a non-versionable method is jitted, the jitted code qualifies as both an 'EntryPoint' and 'NativeCode'.
  2. The start address of a Precode qualifies as an EntryPoint but it isn't NativeCode. The target of the Precode might be NativeCode, but it is never that method's EntryPoint.
  3. For a method implemented by a stub, the target of the Precode is neither an EntryPoint, nor NativeCode.
  4. For versionable code with JumpStamp, the NativeCode is also the EntryPoint. The jitted code for other versions is neither the EntryPoint nor the NativeCode.

When I look at SetCodeEntryPoint, it doesn't seem to be consistent with those patterns (and on further inspection neither was my suggested name PublishNativeCode, so I take back that suggestion). By default I assumed 'EntryPoint' in that name carries the same meaning as EntryPoint elsewhere, but it doesn't match up, because the values in the 2nd and 3rd scopes of the if/else are not EntryPoint values. Right now my best attempt to make it fit my pre-existing understanding is to define 'CodeEntryPoint' as a new term, used only by that method, whose meaning is distinct from both 'NativeCode' and 'EntryPoint'. I hope this makes the challenge a little more apparent.

For a method eligible for tiered compilation and vtable slot backpatching:
  - The entry point to the final code is versionable, as for any method eligible for tiering
  - It does not have a precode (`HasPrecode()` returns false)
  - It does not have a stable entry point (`HasStableEntryPoint()` returns false)
  - A call to the method may be:
    - An indirect call through the `MethodTable`'s backpatchable vtable slot
    - A direct call to a backpatchable `FuncPtrStub`, perhaps through a `JumpStub`
    - For interface methods, an indirect call through the virtual stub dispatch (VSD) indirection cell to a backpatchable `DispatchStub` or a `ResolveStub` that refers to a backpatchable `ResolveCacheEntry`
  - The purpose is that typical calls to the method have no additional overhead when tiering is enabled

Recording and backpatching slots:
  - In order for all vtable slots for the method to be backpatchable:
    - A vtable slot initially points to the `MethodDesc`'s temporary entry point, even when the method is inherited by a derived type (the slot's value is not copied from the parent)
    - The temporary entry point always points to the prestub and is never backpatched, in order to be able to discover new vtable slots through which the method may be called
    - The prestub, as part of `DoBackpatch()`, records any slots that are transitioned from the temporary entry point to the method's at-the-time current, non-prestub entry point
    - Any further changes to the method's entry point cause recorded slots to be backpatched in `BackpatchEntryPointSlots()`
  - In order for the `FuncPtrStub` to be backpatchable:
    - After the `FuncPtrStub` is created and exposed, it is patched to point to the method's at-the-time current entry point if necessary
    - Any further changes to the method's entry point cause the `FuncPtrStub` to be backpatched in `BackpatchEntryPointSlots()`
  - In order for VSD entities to be backpatchable:
    - A `DispatchStub`'s entry point target is aligned and recorded for backpatching in `BackpatchEntryPointSlots()`
      - The `DispatchStub` was modified on x86 and x64 such that the entry point target is aligned to a pointer to make it backpatchable
    - A `ResolveCacheEntry`'s entry point target is recorded for backpatching in `BackpatchEntryPointSlots()`
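The recording flow above can be sketched in minimal standalone C++ (names invented for illustration; the real logic lives in `DoBackpatch()` and `BackpatchEntryPointSlots()`). A slot starts at the temporary entry point (the prestub), the first call through it records and patches the slot, and any later entry point change backpatches every recorded slot:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct MethodSketch {
    uintptr_t temporaryEntryPoint;   // always the prestub, never backpatched
    uintptr_t currentEntryPoint;     // tier-0 code now, tier-1 code later
    std::vector<uintptr_t *> recordedSlots;

    // Prestub path (DoBackpatch analogue): discover a new slot through
    // which the method is called, record it, and patch it forward.
    void CallThroughSlot(uintptr_t *slot) {
        if (*slot == temporaryEntryPoint) {
            recordedSlots.push_back(slot);
            *slot = currentEntryPoint;
        }
    }

    // Tier-up path (BackpatchEntryPointSlots analogue): every recorded
    // slot is updated to the new entry point.
    void SetEntryPoint(uintptr_t newEntryPoint) {
        currentEntryPoint = newEntryPoint;
        for (uintptr_t *slot : recordedSlots)
            *slot = newEntryPoint;
    }
};
```

Because the temporary entry point itself is never patched, any not-yet-seen slot still funnels its first call through the prestub, which is what makes discovery of new slots possible.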

Slot lifetime and management of recorded slots:
  - A slot is recorded in the `LoaderAllocator` in which the slot is allocated, see `RecordAndBackpatchEntryPointSlot()`
  - An inherited slot that has a shorter lifetime than the `MethodDesc`, when recorded, needs to be accessible by the `MethodDesc` for backpatching, so the dependent `LoaderAllocator` with the slot to backpatch is also recorded in the `MethodDesc`'s `LoaderAllocator`, see `MethodDescVirtualInfo::AddDependentLoaderAllocatorsWithSlotsToBackpatch_Locked()`
  - At the end of a `LoaderAllocator`'s lifetime, the `LoaderAllocator` is unregistered from dependency `LoaderAllocators`, see `LoaderAllocator::ClearDependencyMethodDescEntryPointSlotsToBackpatchHash()`
  - When a `MethodDesc`'s entry point changes, backpatching also includes iterating over recorded dependent `LoaderAllocators` to backpatch the relevant slots recorded there, see `BackpatchEntryPointSlots()`
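A minimal sketch, with invented names, of the lifetime rule above: a dependent LoaderAllocator registers itself with the method's LoaderAllocator when it records a slot, and unregisters at the end of its own lifetime so backpatching never touches freed slots.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

struct DependentLA;

struct MethodLA {
    // Dependent LoaderAllocators holding slots to backpatch.
    std::unordered_set<DependentLA *> dependents;
};

struct DependentLA {
    MethodLA *registeredWith = nullptr;
    std::vector<uintptr_t *> slots;   // slots allocated in this allocator

    void RecordSlot(MethodLA &methodLA, uintptr_t *slot) {
        slots.push_back(slot);
        methodLA.dependents.insert(this);
        registeredWith = &methodLA;
    }

    ~DependentLA() {                  // end of lifetime: unregister
        if (registeredWith != nullptr)
            registeredWith->dependents.erase(this);
    }
};

// Backpatching iterates the recorded dependents as well as any slots
// the method's own allocator holds (omitted here for brevity).
void Backpatch(MethodLA &methodLA, uintptr_t newEntryPoint) {
    for (DependentLA *dep : methodLA.dependents)
        for (uintptr_t *slot : dep->slots)
            *slot = newEntryPoint;
}
```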

Synchronization between entry point changes and backpatching slots:
  - A global lock is used to ensure that all recorded backpatchable slots corresponding to a `MethodDesc` point to the same entry point, see `DoBackpatch()` and `BackpatchEntryPointSlots()` for examples
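The rule above can be sketched as follows (simplified, invented names): both the recording path and the entry-point-change path serialize on one global lock, so a newly recorded slot can never be left pointing at a stale entry point.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <vector>

std::mutex g_backpatchLock;                 // the single global lock
std::vector<uintptr_t *> g_recordedSlots;
uintptr_t g_currentEntryPoint = 0;

// DoBackpatch analogue: record under the lock and patch the slot to the
// entry point that is current at that moment.
void RecordSlot(uintptr_t *slot) {
    std::lock_guard<std::mutex> hold(g_backpatchLock);
    g_recordedSlots.push_back(slot);
    *slot = g_currentEntryPoint;
}

// BackpatchEntryPointSlots analogue: change the entry point and patch
// every recorded slot under the same lock, so all slots agree.
void BackpatchEntryPointSlots(uintptr_t newEntryPoint) {
    std::lock_guard<std::mutex> hold(g_backpatchLock);
    g_currentEntryPoint = newEntryPoint;
    for (uintptr_t *slot : g_recordedSlots)
        *slot = newEntryPoint;
}
```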

Due to startup time perf issues:
  - `IsEligibleForTieredCompilation()` is called more frequently with this change and in hotter paths. I chose to use a `MethodDesc` flag to store that information for fast retrieval. The flag is initialized by `DetermineAndSetIsEligibleForTieredCompilation()`.
  - Initially, I experimented with allowing a tiered vtable method to have a precode, and allocated a new precode that would also be the stable entry point when a direct call is necessary. That also allows recording a new slot to be optional - in the event of an OOM, the slot may just point to the stable entry point. There are a large number of such methods and the allocations were slowing down startup perf. So, I had to eliminate precodes for tiered vtable methods and that in turn means that recording slots is necessary for versionability.

@jkotas @davidwrighton @noahfalk @AndyAyersMS
@kouvel
Member Author

kouvel commented Jan 11, 2019

Rebased for conflicts

@kouvel
Member Author

kouvel commented Jan 11, 2019

@dotnet-bot test Tizen armel Cross Checked Innerloop Build and Test

1 similar comment
@kouvel
Member Author

kouvel commented Jan 12, 2019

@dotnet-bot test Tizen armel Cross Checked Innerloop Build and Test

@kouvel
Member Author

kouvel commented Jan 12, 2019

The Tizen armel failure seems to be an infra issue that is unrelated to this change:

16:14:39 ERROR: Unable to find project for artifact copy: dotnet_corefx/master/tizen_armel_cross_release
16:14:39 This may be due to incorrect project name or permission settings; see help for project name in job configuration.

@kouvel kouvel merged commit 37b9d85 into dotnet:master Jan 12, 2019
@kouvel kouvel deleted the PatchVtableSlot branch January 12, 2019 02:02
kouvel added a commit to kouvel/coreclr that referenced this pull request Jan 23, 2019
Fixes https://github.com/dotnet/coreclr/issues/22103
- There were reports of build failure from dotnet#21292, worked around it for now with a todo
kouvel added a commit that referenced this pull request Jan 24, 2019
Fixes https://github.com/dotnet/coreclr/issues/22103
- There were reports of build failure from #21292, worked around it for now with a todo
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
…r#21292)

Patch vtable slots and similar when tiering is enabled

For a method eligible for code versioning and vtable slot backpatch:
  - It does not have a precode (`HasPrecode()` returns false)
  - It does not have a stable entry point (`HasStableEntryPoint()` returns false)
  - A call to the method may be:
    - An indirect call through the `MethodTable`'s backpatchable vtable slot
    - A direct call to a backpatchable `FuncPtrStub`, perhaps through a `JumpStub`
    - For interface methods, an indirect call through the virtual stub dispatch (VSD) indirection cell to a backpatchable `DispatchStub` or a `ResolveStub` that refers to a backpatchable `ResolveCacheEntry`
  - The purpose is that typical calls to the method have no additional overhead when code versioning is enabled

Recording and backpatching slots:
  - In order for all vtable slots for the method to be backpatchable:
    - A vtable slot initially points to the `MethodDesc`'s temporary entry point, even when the method is inherited by a derived type (the slot's value is not copied from the parent)
    - The temporary entry point always points to the prestub and is never backpatched, in order to be able to discover new vtable slots through which the method may be called
    - The prestub, as part of `DoBackpatch()`, records any slots that are transitioned from the temporary entry point to the method's at-the-time current, non-prestub entry point
    - Any further changes to the method's entry point cause recorded slots to be backpatched in `BackpatchEntryPointSlots()`
  - In order for the `FuncPtrStub` to be backpatchable:
    - After the `FuncPtrStub` is created and exposed, it is patched to point to the method's at-the-time current entry point if necessary
    - Any further changes to the method's entry point cause the `FuncPtrStub` to be backpatched in `BackpatchEntryPointSlots()`
  - In order for VSD entities to be backpatchable:
    - A `DispatchStub`'s entry point target is aligned and recorded for backpatching in `BackpatchEntryPointSlots()`
      - The `DispatchStub` was modified on x86 and x64 such that the entry point target is aligned to a pointer to make it backpatchable
    - A `ResolveCacheEntry`'s entry point target is recorded for backpatching in `BackpatchEntryPointSlots()`

Slot lifetime and management of recorded slots:
  - A slot is recorded in the `LoaderAllocator` in which the slot is allocated, see `RecordAndBackpatchEntryPointSlot()`
  - An inherited slot that has a shorter lifetime than the `MethodDesc`, when recorded, needs to be accessible by the `MethodDesc` for backpatching, so the dependent `LoaderAllocator` with the slot to backpatch is also recorded in the `MethodDesc`'s `LoaderAllocator`, see `MethodDescBackpatchInfo::AddDependentLoaderAllocator_Locked()`
  - At the end of a `LoaderAllocator`'s lifetime, the `LoaderAllocator` is unregistered from dependency `LoaderAllocators`, see `MethodDescBackpatchInfoTracker::ClearDependencyMethodDescEntryPointSlots()`
  - When a `MethodDesc`'s entry point changes, backpatching also includes iterating over recorded dependent `LoaderAllocators` to backpatch the relevant slots recorded there, see `BackpatchEntryPointSlots()`

Synchronization between entry point changes and backpatching slots:
  - A global lock is used to ensure that all recorded backpatchable slots corresponding to a `MethodDesc` point to the same entry point, see `DoBackpatch()` and `BackpatchEntryPointSlots()` for examples

Due to startup time perf issues:
  - `IsEligibleForTieredCompilation()` is called more frequently with this change and in hotter paths. I chose to use a `MethodDesc` flag to store that information for fast retrieval. The flag is initialized by `DetermineAndSetIsEligibleForTieredCompilation()`.
  - Initially, I experimented with allowing a method versionable with vtable slot backpatch to have a precode, and allocated a new precode that would also be the stable entry point when a direct call is necessary. That also allows recording a new slot to be optional - in the event of an OOM, the slot may just point to the stable entry point. There are a large number of such methods and the allocations were slowing down startup perf. So, I had to eliminate precodes for methods versionable with vtable slot backpatch and that in turn means that recording slots is necessary for versionability.

Commit migrated from dotnet/coreclr@37b9d85
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
Fixes https://github.com/dotnet/coreclr/issues/22103
- There were reports of build failure from dotnet/coreclr#21292, worked around it for now with a todo

Commit migrated from dotnet/coreclr@29d442f