Coalescing calls to non-returning throw helpers? #9205
Comments
I think it would break line numbers in PDBs? (Debugging; there might be more.)

@AndyAyersMS said in dotnet/coreclr#7580 (comment):

```asm
G_M55934_IG94:
       call     ThrowHelper:ThrowArgumentOutOfRange_IndexException()
G_M55934_IG95:
       call     ThrowHelper:ThrowArgumentOutOfRange_IndexException()
       ...
```
Too bad we can't just pass the IL offset as an argument. Also, in the example there are a lot of redundant branches; I'm kind of surprised that some part of the optimizer can't sort this out. For instance:

```asm
G_M35549_IG03:
       8B4108        mov      eax, dword ptr [rcx+8]
       3BC2          cmp      eax, edx
       0F82C0000000  jb       G_M35549_IG12
       448BC8        mov      r9d, eax
       442BCA        sub      r9d, edx
       453BC8        cmp      r9d, r8d
       0F82B1000000  jb       G_M35549_IG12
G_M35549_IG04:
       4883C110      add      rcx, 16
       4C63D2        movsxd   r10, edx
       4A8D0C91      lea      rcx, bword ptr [rcx+4*r10]
       3BC2          cmp      eax, edx    // same compare as in the block above
       0F82A3000000  jb       G_M35549_IG13
       453BC8        cmp      r9d, r8d    // ditto
       0F829A000000  jb       G_M35549_IG13
G_M35549_IG05:
       4C8BD1        mov      r10, rcx
       458BD8        mov      r11d, r8d
       3BC2          cmp      eax, edx    // ditto
       0F8291000000  jb       G_M35549_IG14
       453BC8        cmp      r9d, r8d    // ditto
       0F8288000000  jb       G_M35549_IG14
```
Actually, I should say that when the jit is optimizing it generally won't take pains to maintain unique IL offsets, so coalescing those calls would be something we'd consider fair game. Something like this already happens now for jit-induced throws, e.g. array bounds checks, and also with the recently introduced constant return value merging. However, to generalize this to user-provided throws the jit would need a potentially jit-time-expensive match to coalesce blocks. Since this particular optimization is potentially expensive and mainly saves code size (and here, size in the "cold" section), it typically doesn't stack up well versus other opportunities.
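To make the distinction concrete, here is a small sketch; the helper and method names are mine, not from the thread. The explicit guard calls are the user-provided throws that would need a jit-time-expensive block match, while the array accesses carry the jit-inserted bounds-check throws that are already commoned today.

```csharp
using System;

static class MyThrowHelper
{
    // Hypothetical stand-in for the corelib ThrowHelper discussed above.
    public static void ThrowArgumentOutOfRange_IndexException()
        => throw new ArgumentOutOfRangeException("index");
}

static class Example
{
    static int UserThrows(int[] a, int i, int j)
    {
        // User-provided throws: each guard emits its own call to the helper;
        // coalescing these call sites is what this issue asks for.
        if ((uint)i >= (uint)a.Length) MyThrowHelper.ThrowArgumentOutOfRange_IndexException();
        if ((uint)j >= (uint)a.Length) MyThrowHelper.ThrowArgumentOutOfRange_IndexException();

        // Jit-induced throws: the implicit bounds checks below already share a
        // single jit-inserted range-check-fail block within the method.
        return a[i] + a[j];
    }
}
```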
In theory this should be easy: those blocks are produced during inlining by importing the same block from the same method, so they should be identical. But in practice it's probably "yeah, good luck with guaranteeing that the importer generates identical IR from the same IL block", unfortunately.
Not sure how you got that code. When I try this the result is even funnier:

```asm
G_M55922_IG02:
       48B9A88B6778F97F0000  mov      rcx, 0x7FF978678BA8
       BA14000000            mov      edx, 20
       E8C814835F            call     CORINFO_HELP_NEWARR_1_VC
       33C0                  xor      eax, eax
       83F814                cmp      eax, 20
       7741                  ja       SHORT G_M55922_IG05
       B801000000            mov      eax, 1
       83F814                cmp      eax, 20
       7737                  ja       SHORT G_M55922_IG05
G_M55922_IG03:
       B801000000            mov      eax, 1
       83F814                cmp      eax, 20
       7732                  ja       SHORT G_M55922_IG06
       B802000000            mov      eax, 2
       83F814                cmp      eax, 20
       772D                  ja       SHORT G_M55922_IG07
       B803000000            mov      eax, 3
       83F814                cmp      eax, 20
       7728                  ja       SHORT G_M55922_IG08
       B804000000            mov      eax, 4
       83F814                cmp      eax, 20
       7723                  ja       SHORT G_M55922_IG09
       B805000000            mov      eax, 5
       83F814                cmp      eax, 20
       771E                  ja       SHORT G_M55922_IG10
```

I'll take a look, it's just too lame to ignore this :)
There are probably more ways to look at this but one way goes like this:
If AssertionProp is changed to always morph, even if it did not propagate anything, then the generated code reduces to:

```asm
G_M55922_IG01:
       4883EC28              sub      rsp, 40
G_M55922_IG02:
       48B9A88B6778F97F0000  mov      rcx, 0x7FF978678BA8
       BA14000000            mov      edx, 20
       E8C814845F            call     CORINFO_HELP_NEWARR_1_VC
       90                    nop
G_M55922_IG03:
       4883C428              add      rsp, 40
       C3                    ret
; Total bytes of code 30, prolog size 4 for method Program:Test()
```
@mikedn I was looking at a slightly different example case:

```csharp
[MethodImpl(MethodImplOptions.NoInlining)]
private static int Test1(int[] arr, int offset, int count)
{
    Span<int> s1 = new Span<int>(arr, offset, count);
    Span<int> s2 = new Span<int>(arr, offset, count);
    Span<int> s3 = new Span<int>(arr, offset, count);
    Span<int> s4 = new Span<int>(arr, offset, count);
    Span<int> s5 = new Span<int>(arr, offset, count);
    Span<int> s6 = new Span<int>(arr, offset, count);
    return s1[1] + s2[2] + s3[3] + s4[4] + s5[5] + s6[6];
}
```
Right, assertion prop could likely catch these, but currently it only handles EQ and NE. Hmm, maybe I'll play a bit with it. I'm curious what CQ improvements handling additional relops could provide, and at what cost.
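For what it's worth, here is the kind of pattern I'd expect additional relops to help with; the method is a sketch of my own, not code from the thread. The explicit argument validation mirrors the checks the `Span<T>(array, start, length)` constructor performs, so if assertion prop tracked the relational facts from the guard (rather than only EQ/NE) it could prove the inlined constructor's repeated compares redundant.

```csharp
using System;

static class RelopSketch
{
    static int First(int[] arr, int offset, int count)
    {
        // Explicit checks establish: offset <= arr.Length and count <= arr.Length - offset.
        if ((uint)offset > (uint)arr.Length ||
            (uint)count > (uint)(arr.Length - offset))
            throw new ArgumentOutOfRangeException();

        // After inlining, the Span constructor repeats essentially the same
        // compares; relational assertions from the guard above could remove them.
        var s = new Span<int>(arr, offset, count);
        return s.Length > 0 ? s[0] : 0;
    }
}
```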
Wonder how hard it would be to change assertion prop from a forward dense to a backward sparse or on-demand approach, especially given that we have SSA and that relatively few operations benefit from or generate assertions. Trying to guess which facts might matter later seems inevitably suboptimal.
I was wondering the same thing recently, but from the perspective of RangeCheck. That one is mostly demand driven (it does have a "build SSA def map" pre-pass, but that's actually unnecessary), but it relies on AssertionProp doing a lot of work upfront, and every time it merges assertions it has to scan all live assertions looking for the one(s) suitable for merging. And after all this work, RangeCheck only manages to eliminate 5% of the (corelib) range checks that pass through it. Attempting to redo AssertionProp is likely too much for me, even as an experiment. But I think I could experiment a bit with having RangeCheck compute its own assertions on demand.
Yeah, a quick attempt at removing redundant
The regression from The `if (inputBuffer[foundIndex] == '\0') inputBuffer[foundIndex] = '\\';`
We got better about optimizing the bounds checks from the span indexer by making it an intrinsic, but we didn't do anything about the bounds checks in I think we have some places that can only optimize
Revisiting this; accessing multiple arrays in different ways gets away with a single
It might be interesting to do a crude prototype (one that might not have all the needed safety checks) to get a read on how much code size we could save. In the jit it would be nice to do this early to cut down on IR volume, so run it, say, just after inlining. It doesn't have to be a generalized tail merge; it can just special-case throw blocks and literally pattern match the simplest cases.

If that looks promising, this would primarily be a size-on-disk / jit TP optimization. There are some wins at runtime from keeping cold code compact: we're talking about the cost of partial hot/cold cache lines and hot/cold code pages. The impact of these can be subtle, and I have seen some indications that in realistic apps we are losing more perf than one might expect from poor cache/TLB density. And this could also help cold startup some. So again, this might be worth a look if we see a decent amount of compression.

A more robust fix for the runtime impact would be to implement hot/cold splitting, as this would let us sequester larger sequences of cold code (say, the voluminous exception object construction one sees without helpers). Splitting was supported by fragile NGEN but is not supported for R2R or jitted code (and is/was not yet enabled for all ISAs/OSes). I think this is mainly a runtime/R2R issue; the jit has most of the logic it needs already.
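To sketch what "special-case throw blocks and pattern match the simplest cases" could look like, here is a toy version of the idea over a made-up CFG type. The `Block` type and field names are mine, not the jit's `BasicBlock`: pick one canonical single-statement noreturn-call block per callee and reroute every other predecessor to it.

```csharp
using System.Collections.Generic;
using System.Linq;

sealed class Block
{
    public string NoReturnCallee;   // non-null if the block is just "call SomeNoReturnHelper()"
    public List<Block> Preds { get; } = new List<Block>();
    public Block Target;            // simplified: each predecessor jumps to exactly one target
}

static class SimpleThrowTailMergeSketch
{
    public static void Run(List<Block> blocks)
    {
        // Group single-statement noreturn-call blocks by the helper they call.
        foreach (var group in blocks.Where(b => b.NoReturnCallee != null)
                                    .GroupBy(b => b.NoReturnCallee))
        {
            Block canonical = group.First();
            foreach (Block duplicate in group.Skip(1))
            {
                // Reroute every predecessor of the duplicate to the canonical
                // block; the duplicate becomes unreachable and its IR can be
                // cleaned up by ordinary unreachable-block removal later.
                foreach (Block pred in duplicate.Preds)
                {
                    pred.Target = canonical;
                    canonical.Preds.Add(pred);
                }
                duplicate.Preds.Clear();
            }
        }
    }
}
```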
Here's a prototype SimpleThrowTailMerge that can merge many common throw helper blocks. Jit-Diffs reports:
The size win is modest, though still perhaps worth pursuing.
For the regressions: I suspect the block commoning, plus the fact that we don't model that noreturn call blocks have no successor flow until we've morphed the calls, ends up spoofing

I suppose we should introduce some variant of BBJ_THROW for the noreturn call case, but knowledge of the possible jump kinds is scattered all over. So maybe just a new BBF instead (or in addition)...?
Fixed my prototype to handle EH better, and to avoid most problematic flow messes by choosing canonical examples starting from the end of the block list, rather than the start. Updated diffs:
Tried extending this to BBJ_THROW in ExtendThrowTailMergeToThrows, with various search limits (e.g. cutting off the search after 20 stmts). Did not find much in the way of additional merge opportunities. Delta from above was:
so, as I had suspected, we do not see duplicate throws very often.
Look for blocks with single-statement noreturn calls, and try to reroute flow so there's just one such call that all predecessors target. Resolves #14770.

Note this impairs debuggability of optimized code a bit, as it can change which line of code apparently invokes a throw helper in a backtrace. But since we're already commoning jit-inserted throw helpers (like array index OOB) this is not breaking any new ground. We could also handle commoning BBJ_THROW blocks, with some extra effort, but prototyping indicates duplicate throws are pretty rare.

This phase runs just before `optOptimizeFlow`, so that we can leverage the ref counts and predecessor lists to ensure we make correct flow updates. It doesn't bother trying to clean out IR; that happens naturally as blocks become unreferenced.

In some cases these noreturn helpers end up being tail call candidates. We now suppress tail calling noreturn methods if there is more than one such call site in the method, hoping that instead we can merge the calls.
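A hedged illustration of the tail-call interaction described above, using a hypothetical helper of my own rather than anything from the change itself:

```csharp
using System;

static class TailCallSketch
{
    // Hypothetical noreturn helper (name is mine, not from the change).
    static void ThrowOutOfRange() => throw new ArgumentOutOfRangeException();

    static int Clamp(int x, int limit)
    {
        if (x < 0) ThrowOutOfRange();        // call site 1
        if (x > limit) ThrowOutOfRange();    // call site 2
        // With more than one call site to the same noreturn helper in the
        // method, the jit now keeps calls like these as ordinary calls rather
        // than tail calls, so the merge phase can route both paths to a single
        // call block.
        return x;
    }
}
```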
Thanks, @AndyAyersMS.
Repro:
The generated asm includes this at the end:
Can/should these be coalesced into a common call point?
category:cq
theme:basic-cq
skill-level:expert
cost:medium