JIT: De-abstraction in .NET 10 #108913

Open
Tracked by #108988
AndyAyersMS opened this issue Oct 16, 2024 · 8 comments
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), User Story (a single user-facing feature; can be grouped under an epic)
Milestone: 10.0.0

AndyAyersMS commented Oct 16, 2024

De-Abstraction

In .NET 10 we hope to further enhance the JIT's ability to remove abstraction overhead from code.

Stack Allocation Improvements

See #104936

During .NET 10 we would like to implement 2-3 enhancements to stack allocation of ref class instances. Priority may be given to issues that enable array de-abstraction (see below).

Delegate GDV Improvements

We currently only GDV instance delegates. We'd like to extend support to static delegates.
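To illustrate the distinction, here is a hedged sketch (type and method names are illustrative, not runtime internals): a delegate created from a lambda that captures a local closes over a closure object and is an "instance" delegate, which GDV can already guess; a delegate created from a static method group has no target object and is a "static" delegate, which today is not guessed.

```csharp
using System;
using System.Linq;

class DelegateGdvExample
{
    // The call site below is megamorphic from the JIT's point of view;
    // GDV profiles the delegate and guesses its likely target under a guard.
    static int SumWith(int[] data, Func<int, int> f)
    {
        int sum = 0;
        foreach (int i in data) sum += f(i);
        return sum;
    }

    static int Double(int x) => 2 * x; // static method: produces a static delegate

    static void Main()
    {
        int[] data = Enumerable.Range(1, 10).ToArray();
        int local = 3;
        // Capturing lambda: closure object target, an "instance" delegate (GDV'd today).
        Console.WriteLine(SumWith(data, x => x * local));
        // Static method group: no target object, a "static" delegate (not yet GDV'd).
        Console.WriteLine(SumWith(data, Double));
    }
}
```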

PGO Improvements

We currently lose a bit of performance in R2R compiled methods with Tiered PGO, because the instrumented version of the method doesn't collect profile data for inlinees.

#44372

Inlining Improvements

We'd like to enable inlining of methods with EH.

#108900

Array Enumeration De-Abstraction

Completed Work:

Todo:

  • enable promotion in cases where the collection type is only known via GDV
  • enable full set of loop opts for enumerator loops (when enumerator is stack allocated & promoted)

Background

The goal of this work is to (to the best of our ability) eliminate the abstraction penalty for cases where an IEnumerable<T> is iterated via foreach, and the underlying collection is an array (or is very likely to be an array).

In previous releases we've built a number of optimizations that can reduce abstraction overhead. But there is still a lot of room for improvement, especially in cases like the above, where the abstraction pattern involves several abstract objects acting in concert.

What is the Abstraction Penalty?

Consider the following pair of benchmark methods that both sum up an integer array:

    static readonly int[] s_ro_array;

    public int foreach_static_readonly_array()
    {
        int sum = 0;
        foreach (int i in s_ro_array) sum += i;
        return sum;
    }

    public int foreach_static_readonly_array_via_interface()
    {
        IEnumerable<int> e = s_ro_array;
        int sum = 0;
        foreach (int i in e) sum += i;
        return sum;
    }

These two methods do the exact same computation, yet benchmarking shows the second method takes 4.5x as long as the first (with 512 element arrays, using very early .NET 10 bits incorporating #108604 and #108153):

| Method | Mean | Ratio |
|---|---|---|
| foreach_static_readonly_array | 147.69 ns | 1.00 |
| foreach_static_readonly_array_via_interface | 665.28 ns | 4.50 |

This sort of overhead from an abstract presentation of computation is commonly known as the abstraction penalty.

Note that things used to be far worse; .NET 6's ratio here is 12.69.

| Method | Runtime | Mean | Allocated | Ratio |
|---|---|---|---|---|
| foreach_static_readonly_array | .NET 10.0 | 149.5 ns | - | 1.00 |
| foreach_static_readonly_array_via_interface | .NET 10.0 | 665.1 ns | - | 4.45 |
| foreach_static_readonly_array_via_interface | .NET 9.0 | 830.2 ns | 32 B | 5.55 |
| foreach_static_readonly_array_via_interface | .NET 8.0 | 951.5 ns | 32 B | 6.36 |
| foreach_static_readonly_array_via_interface | .NET 6.0 | 1,896.7 ns | 32 B | 12.69 |

Why is there an abstraction penalty?

The IL generated for foreach_static_readonly_array_via_interface is expressed in the shape of the abstract enumeration pattern: first e.GetEnumerator() is called on the abstract collection to produce an abstract enumerator, and then the loop iterates via MoveNext() and get_Current() interface calls on this enumerator, all wrapped in a try/finally to properly dispose the enumerator should an exception arise.
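In source terms, the pattern above corresponds to roughly this expansion (the method name SumLowered is illustrative; the shape follows the standard C# foreach lowering):

```csharp
using System;
using System.Collections.Generic;

class ForeachLowering
{
    // Roughly what `foreach (int i in e) sum += i;` expands to when `e` is
    // an IEnumerable<int>: every step is an interface call the JIT must see through.
    static int SumLowered(IEnumerable<int> e)
    {
        int sum = 0;
        IEnumerator<int> en = e.GetEnumerator(); // interface call: make the enumerator
        try
        {
            while (en.MoveNext())                // interface call per iteration
            {
                int i = en.Current;              // interface call per iteration
                sum += i;
            }
        }
        finally
        {
            en.Dispose();                        // ensures cleanup even on exceptions
        }
        return sum;
    }

    static void Main() => Console.WriteLine(SumLowered(new[] { 1, 2, 3, 4 }));
}
```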

Seeing through all this to the actual simple computation going on in the loop requires a surprising amount of optimization machinery. In past releases we've built many of the necessary pieces, and now it's time to get them all working together to remove the remaining overhead.

In particular we need to leverage:

  • Tiered compilation, and in particular dynamic PGO
  • Guarded Devirtualization
  • Object Stack Allocation
  • Loop Cloning
  • Physical Promotion

More generally the JIT will need to rely on PGO to determine the (likely) underlying type for the collection.

Why focus on Arrays?

Arrays are the most common and also the simplest collection type. Assuming all goes well, we may try to stretch the optimization to cover List<T>.

What needs to be done?

When the collection is an array, the enumerator is an instance of the ref class SZGenericArrayEnumerator<T>. Thanks to #108153 we can devirtualize (under guard) and inline the enumerator constructor, and devirtualize and inline calls on the enumerator. And in some cases we can even stack allocate the enumerator (note in the table above, .NET 10 no longer has allocations for the benchmarks).
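A simplified sketch of what such an enumerator looks like may help (the real SZGenericArrayEnumerator<T> lives in CoreLib; the class below is illustrative, but its index/end fields mirror the two fields visible at rdx+0x08 and rdx+0x0C in the disassembly below):

```csharp
using System;

// Illustrative stand-in for the runtime's array enumerator: a small ref class
// holding the array, a cursor, and the end index. Stack allocating and
// promoting this object is what turns the interface loop back into a simple loop.
class ArrayEnumeratorSketch<T>
{
    private readonly T[] _array;
    private int _index;              // cursor (starts before the first element)
    private readonly int _endIndex;  // array length

    public ArrayEnumeratorSketch(T[] array)
    {
        _array = array;
        _index = -1;
        _endIndex = array.Length;
    }

    public bool MoveNext() => ++_index < _endIndex;
    public T Current => _array[_index];
}

class Demo
{
    static void Main()
    {
        var e = new ArrayEnumeratorSketch<int>(new[] { 5, 6, 7 });
        int sum = 0;
        while (e.MoveNext()) sum += e.Current;
        Console.WriteLine(sum);
    }
}
```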

Current inner loop codegen:

;; foreach_static_readonly_array 

       align    [8 bytes for IG03]
						;; size=32 bbWeight=1.00 PerfScore 3.24
G_M1640_IG03:  ;; offset=0x0020
       add      eax, dword ptr [rcx]
       add      rcx, 4
       dec      edx
       jne      SHORT G_M1640_IG03

;; foreach_static_readonly_array_via_interface (rdx is the enumerator object)
;; both enumerator and array access do bounds checks, even though array size is "known"

G_M36467_IG03:  ;; offset=0x0053
       mov      r8, gword ptr [rdx+0x10]
       cmp      ecx, dword ptr [r8+0x08]
       jae      SHORT G_M36467_IG09
       mov      ecx, ecx
       add      eax, dword ptr [r8+4*rcx+0x10]
						;; size=17 bbWeight=513.82 PerfScore 4752.82
G_M36467_IG04:  ;; offset=0x0064
       mov      ecx, dword ptr [rdx+0x08]
       inc      ecx
       mov      ebx, dword ptr [rdx+0x0C]
       cmp      ecx, ebx
       jae      SHORT G_M36467_IG06
						;; size=12 bbWeight=514.82 PerfScore 2831.50
G_M36467_IG05:  ;; offset=0x0070
       mov      dword ptr [rdx+0x08], ecx
       mov      ecx, dword ptr [rdx+0x08]
       cmp      ecx, dword ptr [rdx+0x0C]
       jb       SHORT G_M36467_IG03

However, we cannot yet fully optimize the enumeration loop:

  • We may fail to prove the enumerator can't escape (and so can't stack allocate)
  • Even if we can prove the enumerator can't escape, we may fail to stack allocate (the allocation site may be in a loop)
  • Even if we stack allocate the enumerator, we may think it is address exposed (and so fail to promote). There are several sub-problems here:
    • If we're able to learn the enumerator type w/o GDV (as in the examples above), we run into the complication that the SZGenericArrayEnumerator constructor has an optimization for empty arrays, where instead of constructing a new enumerator instance, it returns a static instance. So at the enumerator use sites there is some ambiguity about which object is enumerating. For cases where the array length is known this ambiguity gets resolved, but too late in the phase order.
    • If we rely on GDV, then there are three reaching definitions, and (as far as we know) even the unknown collection type could produce an instance of SZGenericArrayEnumerator, so all three definitions can reach through the enumerator GDV tests (see JIT: Support for devirtualizing array interface methods #108153 (comment) for a picture and more notes). And we may get confused by the try/finally or try/fault which will also contain a reference to the enumerator (for GDV).
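The empty-array special case driving the first sub-problem can be sketched as follows (EnumeratorSource is a hypothetical name, not CoreLib's; it only models the two possible origins of the enumerator object):

```csharp
using System;

// Sketch of the ambiguity: GetEnumerator may return either a fresh instance
// (a stack-allocation candidate) or a shared static instance (which is not),
// so enumerator use sites see two possible reaching objects.
class EnumeratorSource<T>
{
    public static readonly EnumeratorSource<T> Empty =
        new EnumeratorSource<T>(Array.Empty<T>());

    private readonly T[] _array;
    private EnumeratorSource(T[] array) => _array = array;

    public static EnumeratorSource<T> GetEnumerator(T[] array)
        => array.Length == 0
            ? Empty                           // shared static instance
            : new EnumeratorSource<T>(array); // fresh allocation

    public int Length => _array.Length;
}

class Demo
{
    static void Main()
    {
        // Empty array: the shared static instance comes back.
        Console.WriteLine(ReferenceEquals(
            EnumeratorSource<int>.GetEnumerator(new int[0]),
            EnumeratorSource<int>.Empty));
        // Non-empty array: a fresh instance comes back.
        Console.WriteLine(
            EnumeratorSource<int>.GetEnumerator(new[] { 1 }).Length);
    }
}
```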

While these may seem like small problems, the solutions are not obvious. Either we need to disentangle the code paths for each possibility early (basically an early round of cloning: not just of the enumeration loop, but of all the code from the enumerator creation sites to the last use of the enumerator, and possibly the EH regions), or we need to make our escape and address propagation logic flow-sensitive and contextual and introduce runtime disambiguation for the reaching values (that is, at each enumerator use site, test whether the enumerator is the just-allocated SZGenericArrayEnumerator instance that we hope to stack allocate and promote).

The cloning route seems more viable, but it is costly to duplicate all that code and we'll have to do it well in advance of knowing whether the rest of the optimizations pay off. So we might need to be able to undo it if there's no real benefit.
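At the source level, the end state the cloning route aims for looks something like the sketch below (the JIT would perform this on its IR under a GDV-style guard, not in C#; the guard and method here are illustrative):

```csharp
using System;
using System.Collections.Generic;

class CloningSketch
{
    // Split enumeration into a specialized path, where the collection is known
    // to be exactly int[] and no enumerator object is needed at all, and an
    // unspecialized fallback. On the fast path the enumerator has a single,
    // known origin, so escape analysis and promotion become tractable.
    static int Sum(IEnumerable<int> e)
    {
        if (e is int[] array)                  // what the GDV guard tests at runtime
        {
            int sum = 0;
            for (int i = 0; i < array.Length; i++) sum += array[i];
            return sum;
        }
        else
        {
            int sum = 0;
            foreach (int i in e) sum += i;     // generic interface-based fallback
            return sum;
        }
    }

    static void Main()
    {
        Console.WriteLine(Sum(new[] { 1, 2, 3 }));
        Console.WriteLine(Sum(new List<int> { 4, 5 }));
    }
}
```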

@AndyAyersMS AndyAyersMS added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI User Story A single user-facing feature. Can be grouped under an epic. labels Oct 16, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Oct 16, 2024
@AndyAyersMS AndyAyersMS added this to the 10.0.0 milestone Oct 16, 2024
dotnet-policy-service bot commented:

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@dotnet-policy-service dotnet-policy-service bot removed the untriaged New issue has not been triaged by the area owner label Oct 16, 2024
AndyAyersMS added a commit to AndyAyersMS/performance that referenced this issue Oct 16, 2024
AndyAyersMS (Member, Author) commented:

I worked up a suite of related benchmarks: dotnet/performance#4522. Ideally all these (save for the sum_static_array_via_local) should run in roughly the same time.

With recent main we see:

| Method | Mean | Ratio | Allocated |
|---|---|---|---|
| foreach_static_readonly_array | 148.63 ns | 1.00 | - |
| foreach_static_readonly_array_in_loop | 147.69 ns | 0.99 | - |
| foreach_static_readonly_array_via_local | 147.83 ns | 0.99 | - |
| foreach_static_readonly_array_via_interface | 662.23 ns | 4.46 | - |
| foreach_static_readonly_array_via_interface_property | 687.38 ns | 4.63 | 32 B |
| foreach_static_readonly_array_via_interface_property_in_loop | 913.82 ns | 6.15 | 32 B |
| foreach_static_array | 146.91 ns | 0.99 | - |
| foreach_static_array_via_local | 146.75 ns | 0.99 | - |
| foreach_static_array_via_interface | 663.73 ns | 4.47 | - |
| foreach_static_array_via_interface_property | 658.78 ns | 4.43 | 32 B |
| foreach_static_array_via_interface_property_in_loop | 917.07 ns | 6.17 | 32 B |
| foreach_member_array | 146.92 ns | 0.99 | - |
| foreach_member_array_via_local | 145.51 ns | 0.98 | - |
| foreach_member_array_via_interface | 660.08 ns | 4.44 | - |
| foreach_member_array_via_interface_property | 658.77 ns | 4.43 | 32 B |
| foreach_member_array_via_interface_property_in_loop | 917.22 ns | 6.17 | 32 B |
| foreach_opaque_array_via_interface | 656.48 ns | 4.42 | 32 B |
| foreach_opaque_array_via_interface_in_loop | 916.89 ns | 6.17 | 32 B |
| sum_static_array_via_local | 43.93 ns | 0.30 | - |

Note there is a penalty for enumerating an array inside another loop. That's something else we should try to fix.

sum_static_array_via_local uses the LINQ Sum extension, which pulls a span out of the enumerable and then runs a vectorized sum. So one might say the "true" abstraction penalty is even higher. However, we'll stick with a simple foreach as our baseline here.
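A hand-rolled equivalent of that span-extraction strategy looks like this (SumViaSpan is illustrative, not the actual LINQ implementation, which recognizes more collection shapes internally):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class SpanSumSketch
{
    // If the underlying collection can be recognized as an array, we can sum
    // over a span in a tight, bounds-check-friendly (and vectorizable) loop
    // instead of going through the enumerator protocol at all.
    static int SumViaSpan(IEnumerable<int> e)
    {
        ReadOnlySpan<int> span = e is int[] array ? array : e.ToArray();
        int sum = 0;
        for (int i = 0; i < span.Length; i++) sum += span[i];
        return sum;
    }

    static void Main() => Console.WriteLine(SumViaSpan(new[] { 10, 20, 30 }));
}
```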

@JulieLeeMSFT JulieLeeMSFT moved this to Team User Stories in .NET Core CodeGen Oct 17, 2024
LoopedBard3 pushed a commit to dotnet/performance that referenced this issue Oct 18, 2024
Add some benchmarks for array deabstraction. See dotnet/runtime#108913
AndyAyersMS commented Oct 19, 2024

Seems like #102965 could be useful here (it enhances local morph's ability to propagate addresses in loops). However, I will also need local morph to be more aggressive with handlers.

Looks like in addition to per-loop summaries I'd need per-try summaries (at least for try/finally/fault), and if a local is not defined within a try, we can propagate any assertion about that local true at try entry to the try's handler.

The computation of the summaries is simple enough; the question is mainly how to do this efficiently, and whether it is generally useful or special purpose. It's possible that in foreach constructs the try and the loop exactly or nearly coincide, so being able to leverage the loop summary when computing the try summary makes sense.

BB15 [0001]  2  0    BB10,BB19           35309. 35309 [00D..01B)-> BB24(0.462),BB23(0.538) ( cond ) T0      try {       i IBC keep bwd
BB16 [0002]  1  0    BB25                35240. 35240 [00F..010)-> BB19(1),BB18(0)         ( cond ) T0                  i IBC bwd bwd-target
BB18 [0028]  1  0    BB16                  0        0 [00F..010)-> BB19(1)                 (always) T0                  i IBC rare bwd
BB19 [0029]  2  0    BB16,BB18           35240. 35240 [00F..01A)-> BB15(1)                 (always) T0                  i IBC bwd
BB23 [0032]  1  0    BB15                18993. 18993 [01A..01B)-> BB25(1)                 (always) T0                  i IBC bwd
BB24 [0033]  1  0    BB15                16316. 16316 [01A..01B)-> BB25(1)                 (always) T0                  i IBC bwd
BB25 [0034]  2  0    BB23,BB24           35309. 35309 [01A..022)-> BB16(0.998),BB27(0.00195)   ( cond ) T0      }           i IBC bwd bwd

***************  Natural loop graph
L00 header: BB15
  Members (7): [BB15..BB25]
  Entry: BB10 -> BB15
  Exit: BB25 -> BB27
  Back: BB19 -> BB15

AndyAyersMS (Member, Author) commented:

@EgorBo points out that array interface methods (like ICollection<T>.Count) are not currently handled via GDV, for instance when col is string[]:

    int Test(ICollection<string> col) => col.Count;
G_M15003_IG01:  ;; offset=0x0000
       sub      rsp, 40
						;; size=4 bbWeight=1 PerfScore 0.25
G_M15003_IG02:  ;; offset=0x0004
       mov      rcx, rdx
       mov      r11, 0x7FFCCA100210      ; code for System.Collections.Generic.ICollection`1[System.__Canon]:get_Count():int:this
       call     [r11]System.Collections.Generic.ICollection`1[System.__Canon]:get_Count():int:this
       nop      
						;; size=17 bbWeight=1 PerfScore 3.75
G_M15003_IG03:  ;; offset=0x0015
       add      rsp, 40
       ret      

The JIT is able to devirtualize (under GDV) but the resulting method can't be inlined, so we drop the GDV:

impDevirtualizeCall: Trying to devirtualize interface call:
    class for 'this' is System.Collections.Generic.ICollection`1[System.String] (attrib 20200400)
    base method is System.Collections.Generic.ICollection`1[System.__Canon]::get_Count
Considering guarded devirtualization at IL offset 1 (0x1)
Likely classes for call [000001] on class 00007FFCFE715F20 (System.Collections.Generic.ICollection`1[System.String])
  1) 00007FFCFE675188 (System.String[]) [likelihood:100%]
Accepting type System.String[] with likelihood 100 as a candidate
GDV likely: resolveVirtualMethod (method 00007FFCFE5B0680 class 00007FFCFE675188 context 00007FFCFE715F21)
interface call would invoke method System.SZArrayHelper:get_Count[System.Object]():int:this
Marking call [000001] as guarded devirtualization candidate; will guess for class System.String[]

info.compCompHnd->canTailCall returned false for call [000001]

CheckCanInline: fetching method info for inline candidate get_Count -- context 00007FFCFEDDEF70
Method context: System.SZArrayHelper:get_Count[System.Object]():int:this
INLINER: during 'impMarkInlineCandidate for GDV' result 'failed this callee' reason 'cannot get method info' for 'ArrayDeabstraction:Test(System.Collections.Generic.ICollection`1[System.String]):int:this' calling 'System.Collections.Generic.ICollection`1[System.__Canon]:get_Count():int:this'

(Note the runtime maps all ref types to object in the method sig, but that should not cause problems). The blocker here is that the method returned by devirtualization is wrapped in an instantiating stub.

If the JIT can learn that the collection type is string[] without GDV, then the call is devirtualized, but again not inlined:

G_M19360_IG01:  ;; offset=0x0000
       sub      rsp, 40
						;; size=4 bbWeight=1 PerfScore 0.25
G_M19360_IG02:  ;; offset=0x0004
       mov      rcx, 0x26C4E800C60      ; data for ArrayDeabstraction:s_str_array
       mov      rcx, gword ptr [rcx]
       cmp      dword ptr [rcx], ecx
       call     System.SZArrayHelper:get_Count[System.Object]():int:this
       nop      
						;; size=21 bbWeight=1 PerfScore 6.50
G_M19360_IG03:  ;; offset=0x0019
       add      rsp, 40
       ret 

In these cases it seems like the JIT has all the information it needs to call the underlying method properly (and for Count on an array, the generic context will be ignored).

AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Oct 25, 2024
The JIT recently enabled devirtualization of `GetEnumerator`, but other methods
were inhibited from devirtualization because the runtime was returning an
instantiating stub instead of the actual method. This blocked inlining and
the JIT currently will not GDV unless it can also inline.

So for instance `ICollection<T>.Count` would not devirtualize.

We think we know enough to pass the right inst parameter (the exact method
desc) so enable this for the array case, at least for normal jitting.

For NAOT, array devirtualization happens via normal paths thanks to `Array<T>`, so
it should already work for these cases. For R2R we don't do array interface devirt (yet).

There was an existing field on `CORINFO_DEVIRTUALIZATION_INFO` to record the
need for an inst parameter, but it was unused and so I renamed it and use it
for this case.

Contributes to dotnet#108913.
AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Oct 25, 2024
For empty arrays, `GetEnumerator` returns a static instance rather
than a new instance. This hinders the JIT's ability to optimize when
it is able to stack allocate the new instance.

Detect when a stack allocation comes from an array `GetEnumerator`
and fold the branch in the inlined enumerator method to always take
the new instance path (since it is now cheap, allocation free, and
is functionally equivalent).

Contributes to dotnet#108913.
AndyAyersMS (Member, Author) commented:

In these cases it seems like the JIT has all the information it needs to call the underlying method properly (and for Count on an array, the generic context will be ignored).

The actual distinction is whether the array type is shared or not... (e.g. string[]). #109209 is one attempt to handle these, but it is not clear how far we can go.

Even if we enable GDV by finding inline info, we end up blocking inlining, because the GetEnumerator refers to a static field, and it seems like in some cases we're not passing the right method desc.

AndyAyersMS added a commit that referenced this issue Oct 28, 2024
For empty arrays, `GetEnumerator` returns a static instance rather
than a new instance. This hinders the JIT's ability to optimize when
it is able to stack allocate the new instance.

Detect when a stack allocation comes from an array `GetEnumerator`
and fold the branch in the inlined enumerator method to always take
the new instance path (since it is now cheap, allocation free, and
is functionally equivalent).

Contributes to #108913.
AndyAyersMS added a commit that referenced this issue Oct 28, 2024
(#109209)

The JIT recently enabled devirtualization of `GetEnumerator`, but some other methods
were inhibited from devirtualization because the runtime was returning an
instantiating stub instead of the actual method. This blocked inlining, and
the JIT currently will not GDV unless it can also inline.

So for instance `ICollection<T>.Count` would not devirtualize.

We think we know enough to pass the right inst parameter (the exact method
desc) so enable this for the array case, at least for normal jitting.

For NAOT, array devirtualization happens via normal paths thanks to `Array<T>`, so
it should already work for these cases. For R2R we don't do array interface devirt (yet).

There was an existing field on `CORINFO_DEVIRTUALIZATION_INFO` to record the
need for an inst parameter, but it was unused and so I renamed it and use it
for this case.

Contributes to #108913.
AndyAyersMS (Member, Author) commented:

With the recently merged PRs we're able to handle a broad range of cases where the JIT learns the collection is an array via its own data flow.

For cases where the collection is likely an array there is still work to do...

AndyAyersMS (Member, Author) commented:

Spent some time looking at what it might take to get normal loop opts to kick in. The graph below shows the situation during loop inversion.

We focus just on the loop.

[Image: flow graph of the enumerator loop during loop inversion]

The block colors:

  • blue blocks are the keys to figuring out the loop; they are the residue of the enumerator
  • orange is the loop payload
  • gray is the preheader; it's also the vestige of a not yet fully optimized GDV. Morph could possibly remove this, but it has optimized a conditional just above, and it looks like we're not blocking propagation of facts from now-unreachable blocks.

Upstream from BB08, V17 is set to -1 and V18 to the array length.

As is, the loop is not an obvious candidate for inversion; the exit test in BB13 is in a different block than the entry BB07. These can be merged, but it takes a bit of work. First we need to tail duplicate BB13, then optimize the branches; that leaves something like

[Image: flow graph after tail duplicating BB13 and optimizing branches]

and now BB07 is both loop entry and exit, so can be duplicated:

[Image: flow graph after duplicating BB07]

We still have the "wraparound IV" case to sort out (V14, V17), but perhaps we can handle that now.

So the key seems to be to at least do the early flow opt where we tail duplicate the BB13 test block (this is the vestige of the MoveNext() enumerator call) and optimize the flow.

jakobbotsch (Member) commented:

I generalized loop inversion in #109346, and some form of loop inversion does now kick in for the simple case, but it still doesn't seem to be enough for the remaining loop opts to kick in... in fact, it seems we now can't eliminate the bounds check either. I haven't dug deeper into what goes wrong; hopefully it's something simple.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI User Story A single user-facing feature. Can be grouped under an epic.
Projects
Status: Team User Stories
Development

No branches or pull requests

2 participants