On Stack Replacement Next Steps #33658
And if you want to try this out (x64 only):
By default a transition requires ~10,000 executions of a patchpoint. Unlike tiered compilation, transitions are currently synchronous, so there's no lag/delay once this threshold is reached. To enable "aggressive OSR" -- where an OSR method is created the first time a patchpoint is encountered -- also set
To see OSR log messages from the runtime, also add
Design sketch for better supporting patchpoints within try regions:
Started working on the try region fix noted just above. It seems like we ought to wait until after the importer has done its "transitive closure" to add the OSR-inspired try region "step blocks" -- if we add them before importation then we will make all the code within tries reachable, and this may be hard to undo. If, after importation, an enclosing try's entry block is unreachable, then the step block can unconditionally transfer to the enclosed try entry and/or the actual entry point. Still checking whether it's ok for a try entry to have multiple preds or whether we actually expect there to be a unique fall-through pred.
Here's an example from the tailrecursetry test case with my current changes main..FixMidTryPatchpoint. This is adding the new flow pre-importation; as noted above that might not be the best approach overall. Seems like perhaps we should defer this until
This change adds control flow to ensure that an OSR method for a patchpoint nested in try regions enters those try regions from the first block of each try rather than mid-try. This lets these OSR methods conform to the data flow expectation that the only way control flow can enter a try is via its first block. See #33658 for more details on the approach taken here. Fixes #35687.
If a method has multiple patchpoints and multiple active threads, we can trigger an "OSR storm" of sorts.
This seems kind of unfortunate, but I'm not sure there's any easy fix. One could imagine, say, deferring OSR generation for a method if there's already an OSR method being built at any patchpoint in the method, in the hope that other threads eventually hit and can use the one OSR method being created. But there's no guarantee that any of these threads will ever hit other patchpoints.
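As a purely illustrative sketch of that deferral idea (not something implemented; the runtime's real patchpoint bookkeeping is native code, and every name below is invented): track whether any patchpoint in a Tier0 method already has an OSR compile in flight, and if so let other threads keep running the Tier0 code instead of kicking off more OSR compiles for sibling patchpoints.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical sketch only: "defer OSR generation for a method if there's already
// an OSR method being built at any patchpoint in the method".
static class OsrCompileThrottle
{
    // Keyed by the Tier0 method; an entry exists while an OSR compile is in flight.
    private static readonly ConcurrentDictionary<IntPtr, byte> s_inFlight = new();

    // True  => this thread should build the OSR method for the patchpoint it hit.
    // False => some other patchpoint in the same method is already being compiled,
    //          so keep running Tier0 code and hope to reuse that OSR method later.
    public static bool TryBeginOsrCompile(IntPtr tier0Method) =>
        s_inFlight.TryAdd(tier0Method, 0);

    public static void EndOsrCompile(IntPtr tier0Method) =>
        s_inFlight.TryRemove(tier0Method, out _);
}
```

As noted above, the catch is that a deferred thread may never reach another patchpoint, so it could stay in Tier0 code until the in-flight OSR method (or a Tier1 recompile) shows up.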
(1) Been looking into what would be required to enable QJFL=1 via OSR. There are still some methods that OSR can't handle, and some loops within methods that OSR can't handle. So the proposal is that with QJFL=1 and OSR=1 (which would become the default on x64, say), and provided the JIT could guarantee escape from the Tier0 method via OSR, we'd jit at Tier0. If the JIT can't guarantee this then we instead optimize. So a much larger set of methods would jit at Tier0. We could reserve QJFL=2, say, to mean "always jit at Tier0" regardless, if we wanted such capability. Bailing out of Tier0+OSR because of the "method" checks is easy enough, as it can use the same pattern we already use for QJFL=0. But for the per-loop failures it's not as clear how to handle the fallback, since we need to import to detect the problem. If these cases are rare enough, it could work like we do for the minopts fallback: fail the Tier0 jit request with OSR set, and then the driver could request a new jit with OSR not set. Or we could try to cleanly architect a jit reset that bails from mid-import back to pre-import somehow, but there's a lot of ambient jit state around....
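For concreteness, here is a hedged paraphrase of that policy as a single decision function. This is not the actual runtime/JIT interface; the enum, names, and parameters are invented, and it only models methods that contain loops (the case QJFL governs).

```csharp
// Hypothetical paraphrase of the proposal in (1); all names are invented.
enum CompileMode { Tier0WithPatchpoints, FullOpts }

static class TieringPolicySketch
{
    // qjfl            : value of the QuickJitForLoops-style knob (0, 1, or proposed 2).
    // osrEnabled      : whether OSR is enabled.
    // canEscapeViaOsr : whether the JIT can guarantee the method and each of its loops
    //                   can transition out of the Tier0 code via a patchpoint.
    public static CompileMode Choose(int qjfl, bool osrEnabled, bool canEscapeViaOsr)
    {
        if (qjfl == 0)
            return CompileMode.FullOpts;              // today's behavior: loops force full opts

        if (qjfl == 2)
            return CompileMode.Tier0WithPatchpoints;  // proposed "always jit at Tier0" escape hatch

        // qjfl == 1: jit at Tier0 only when OSR can guarantee an escape; otherwise optimize.
        return (osrEnabled && canEscapeViaOsr)
            ? CompileMode.Tier0WithPatchpoints
            : CompileMode.FullOpts;
    }
}
```

The awkward part called out above is that the per-loop half of canEscapeViaOsr isn't known until mid-import, which is where the minopts-style "fail and re-request without OSR" fallback would come in.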
(2) For OSR + {stack,loc}alloc, we'd need to have the jit and gc info support the idea of having 3 pointers into the frame. One would point at the original method's frame base (to access the OSR state), one at the OSR method's frame base (to access any state it introduces, e.g. spilled temps), and one to refer to the outgoing args area. The separation between these 3 would vary depending on how much allocation was done by the base method before triggering OSR, and how much is done by the OSR method. Unfortunately the idea that there are at most two frame pointers (FP and SP) is baked in everywhere, including the GC encoding. Fixing this is straightforward in principle, but also a fair amount of work. And while there are potentially other uses for three frame pointers (e.g. using the 3rd to reduce the need for large frame-relative offsets), at least initially this support would be exclusive to OSR, and within OSR exclusive to the subset of methods using localloc. So it's not faring well in terms of cost/benefit. I'm thinking now that the right move is to initially exclude these methods from OSR by detecting this as one of the cases we can't handle, and bailing to optimized code (like we do currently for QJFL=0).
(3) Reverse PInvoke methods currently don't seem to be eligible for tiering. We'd need to update the patchpoint state to track the pinvoke frame var and give its OSR version a suitable location. Moving it below the cut line for now.
(4) Looked at PowerShell startup; the impact from OSR would be minimal, as almost no jitted code is run.
(5) For explicit tail calls. I had thought this was working but didn't realize there was a secondary block. There are likely two things we need to do here:
Note that enabling OSR for methods with explicit tail calls may not work out very well. Imagine a simple set of mutually tail-calling methods -- eventually each call/jump to a method will force an OSR transition. These must be going via call counting stubs, so perhaps instead we should just rely on normal tiering here. Tail recursion is not converted to internal branching, so it should also be handled by tiering. So it seems like we should suppress OSR in methods with explicit tail calls, but not force them to switch to optimized.
(6) OSR and altjit (specifically cross-altjits). Two issues:
For cross-altjit, will there ever be correct target-specific patchpoint info? We never ran the cross-altjit-generated code.
Yes, the right patchpoint info gets produced by the Tier0 altjit, but then it's thrown away because the altjit reports failure. So presumably we could hold onto this info somehow and supply it to the OSR altjit request, but it's not clear how to do that. I guess there's another implicit assumption here: that when an altjit is specified it will process both the Tier0 method and any OSR method. Currently that's the case. Another option is to recognize in the jit that we have an altjit and have the "wrong" patchpoint info, and just make up plausible values for anything host- or abi-specific. Currently not sure what information we might need.
So, specifically, I guess the value you're trying to get here is, for example, testing both patchpoint info creation and OSR method consumption with an x64-hosted, arm64-targeting altjit, instead of requiring all OSR testing to be done on arm64 hardware itself. And presumably you have a way to force OSR method creation for all (or some random collection of) patchpoints as a stress mode, so for cross-altjit it wouldn't depend on running (e.g., if we could force these compilations via PMI, that would be good). That does seem like a worthwhile goal.
Right, I'm currently bringing up OSR for Arm64 via cross-altjit and would like to get as much mileage as possible out of this mode. But I'm reluctant to plumb knowledge of the altjit deeper into the runtime, so currently I'm not sure how the altjit-produced patchpoint info can be stored and retrieved. There would also be challenges describing the layout of the cross-altjit patchpoint blob for the natively hosted runtime. Currently the data in the patchpoint info is opaque to the runtime -- all it cares about is the size of the data. So only the altjit would need to know the exact format.
(7) Initial notes on Arm64 support. Assuming we don't want to constrain the Tier0 frame shape, the biggest challenge is likely to get the OSR method epilog correct -- this requires knowledge of the Tier0 frame shape or possibly related data. Not clear yet what all we will need. Recall that in OSR the OSR frame sits on top of the Tier0 frame. All callee saves done by Tier0 are "undone" by the transition to the OSR method. When the OSR method is entered, SP points below the Tier0 frame. The OSR method re-saves the callee saves it uses. So when we go to exit the OSR frame, we first do the normal epilog for the OSR method, and then we need to pop off the Tier0 frame. Essentially, the OSR epilog needs to end by popping the Tier0 frame, much as the Tier0 epilog itself would. For example, on x64, if we have a Tier0 method with this epilog:
;; Tier0 method
G_M8788_IG09:
4883C440 add rsp, 64
5D pop rbp
C3 ret
and the OSR method also needs to set up a frame, its epilog will be:
;; OSR method
G_M8788_IG07:
;; pop off the OSR frame
4883C420 add rsp, 32
5E pop rsi
;; pop off the Tier0 frame
4883C448 add rsp, 72
5D pop rbp
C3 ret
The arm64 versions of this will be something like:
;; Tier0 method
G_M8788_IG09:
A8C37BFD ldp fp, lr, [sp],#48
D65F03C0 ret lr
;; OSR method (naive, wrong)
G_M8788_IG07:
F9400FF3 ldr x19, [sp,#24]
A8C27BFD ldp fp, lr, [sp],#32
D65F03C0 ret lr
;; OSR method (correct)
G_M8788_IG07:
;; pop off the OSR frame
F9400FF3 ldr x19, [sp,#24]
add sp, sp, #32
;; pop off the Tier0 frame
A8C37BFD ldp fp, lr, [sp],#48
D65F03C0 ret lr
For now, I am going to do the following:
That way I can presumably iterate through the frame type options with cross-altjit and verify each would produce plausible-looking code.
Turns out I don't need the frame type for arm64 after all. The only dependence is the total Tier0 frame size and that the patchpoint-recorded offsets are virtual offsets (which they are). So an arm64 altjit will produce plausible (if incorrect) code when run on an x64 method context.
Based on #65675 there are two Benchmarks Game tests in the performance repo that would regress with the default strategy:
BenchmarkDotNet=v0.13.1.1694-nightly, OS=Windows 11 (10.0.22000.493/21H2)
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-alpha.1.21568.2
[Host] : .NET 6.0.2 (6.0.222.6406), X64 RyuJIT
Job-AARFFF : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
Job-GTOQLX : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
Job-KHECGD : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
Job-CHOGBR : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
IterationTime=250.0000 ms MinIterationCount=15 WarmupCount=1
Here's that same set of benchmarks, this time with a different configuration:
BenchmarkDotNet=v0.13.1.1694-nightly, OS=Windows 11 (10.0.22000.493/21H2)
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog IterationTime=250.0000 ms
Actually, some care is needed in interpreting these results, as the Mar1 "baseline" is NoPGO, not default, and in some cases PGO data is causing regressions. There's too much here to sort through by hand. @DrewScoggins, looks like I could use an updated comparison report...
This test regresses with OSR / Tier1 because we don't align the loop:
;; w/o OSR
7E22 jle SHORT G_M30591_IG04
0F1F80000000000F1F840000000000 align [15 bytes for IG03]
;; bbWeight=1 PerfScore 7.75
G_M30591_IG03: ;; offset=0040H
448BC2 mov r8d, edx
F34E0FB844C010 popcnt r8, qword ptr [rax+8*r8+16]
4103F0 add esi, r8d
FFC2 inc edx
3BCA cmp ecx, edx
7FED jg SHORT G_M30591_IG03
;; w/ OSR
7E14 jle SHORT G_M30591_IG04
G_M30591_IG03: ;; offset=001AH
448BC9 mov r9d, ecx
F34E0FB84CCA10 popcnt r9, qword ptr [rdx+8*r9+16]
4103C1 add eax, r9d
FFC1 inc ecx
443BC1 cmp r8d, ecx
7FEC jg SHORT G_M30591_IG03
We don't align because with OSR or at Tier1 the loop body comes from an inlinee with PGO data, inlined into a caller without PGO data. We scale the profile during inlining, then downscale it during a later phase.
Without OSR, we initially try to jit the method at Tier0, see there's a loop, and switch to FullOpts. Because of this we don't have BBOPT set. This is unfortunate because we're not pulling in any profile data for these eagerly optimized methods. It may also partially explain why we've seen less impact from default PGO than we might expect -- the methods where we spend the most time are not using any PGO data, by default. There are two avenues to pursue here (perhaps together):
I wonder if the loop body of inverted loops always gets downscaled because the zero trip test is "given" half the flow. Seems bad.
Did you look at why that is?
Yeah, this seems a bit odd. Weights for loop blocks in inverted loops go from 1 to 0.5 to 4: the body starts at 1, the zero-trip test is "given" half the flow so the body drops to 0.5, and the loop weight multiplier then scales that up to 4. Generally we should assume such loops rarely zero-trip.
We don't upscale blocks with profile data. All this is a consequence of our rather simplistic weighting/reweighting approach. These problems get worse with PGO, as we do a poor job of handling mixtures of profiled and unprofiled code (say inlining profiled callees into an unprofiled caller, or vice versa). OSR is highlighting some of this because (due to the inadvertent lack of BBOPT in the switch-to-optimized case) we are using PGO data "more often". I have ambitions to redo all this but thought we could get by with the current scheme for a while longer.
Seems to be related to dotnet/BenchmarkDotNet#1780, where if the initial single invocation takes a long enough time then BDN has various threshold tests to decide whether it should skip the pilot stages and just run one iteration per workload interval. In my case the run of
I guess it's the second bit of BDN logic. So OSR is triggering this behavior by having less initial overhead than MAIN, but not so much less that BDN decides it should run pilot stages to assess how many iterations are needed.
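Roughly, and only as a hedged paraphrase of the behavior described above (this is not BenchmarkDotNet's actual code; the threshold and names are assumptions, with the 250 ms target taken from the IterationTime in the job config):

```csharp
using System;

// Sketch of the pilot-stage decision being discussed; illustrative only.
static class PilotStageSketch
{
    // firstInvokeMs    : duration of the very first (jitting) workload invocation.
    // iterationTargetMs: desired workload interval length (250 ms in these jobs).
    public static int InvocationsPerIteration(double firstInvokeMs, double iterationTargetMs = 250.0)
    {
        if (firstInvokeMs >= iterationTargetMs)
        {
            // First invocation already fills the interval: skip the pilot stages and
            // run one invocation per workload iteration from then on.
            return 1;
        }

        // Otherwise size the iteration so it roughly fills the target interval.
        // (BDN's real pilot/estimation logic is more involved; a simple ratio stands in here.)
        return Math.Max(1, (int)Math.Round(iterationTargetMs / firstInvokeMs));
    }
}
```

Either way, the concern above is that the invocation count gets fixed while the benchmark is still running pre-Tier1 code, so later, faster iterations shift the per-op results.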
Have an interim comparison report showing OSR vs MAIN. The top dozen or so regressions are the ... Overall this report shows 395 regressions and 591 improvements; the vast majority of these are noise. Drew is going to get me an updated comparison report that takes history into account; when that shows up I'll post an update. In the meantime, looking through what's there for actual regressions, I actually get somewhat inconsistent results locally, and it's not clear yet why we can't hoist/CSE this load. Need to look more closely at what's going on.
;; FullOpt/Tier1 inner loop
align [0 bytes for IG04]
G_M46661_IG04: ;; offset=0030H
C5CB59D6 vmulsd xmm2, xmm6, xmm6
C5FB101DE4000000 vmovsd xmm3, qword ptr [reloc @RWD00]
C5E35CD2 vsubsd xmm2, xmm3, xmm2
C5EB59D6 vmulsd xmm2, xmm2, xmm6
C5FB5CD2 vsubsd xmm2, xmm0, xmm2
C5E8541DF0000000 vandps xmm3, xmm2, qword ptr [reloc @RWD32]
C5F92ECB vucomisd xmm1, xmm3
7746 ja SHORT G_M46661_IG06
C5CB591DF2000000 vmulsd xmm3, xmm6, qword ptr [reloc @RWD48]
C5E359DE vmulsd xmm3, xmm3, xmm6
C5E35C1DB6000000 vsubsd xmm3, xmm3, qword ptr [reloc @RWD00]
C5D857E4 vxorps xmm4, xmm4
C5F92EDC vucomisd xmm3, xmm4
7A02 jp SHORT G_M46661_IG05
7421 je SHORT G_M46661_IG06
;; bbWeight=16 PerfScore 597.33
G_M46661_IG05: ;; offset=0076H
C5EB5ED3 vdivsd xmm2, xmm2, xmm3
C5CB5CF2 vsubsd xmm6, xmm6, xmm2
C5E85415BA000000 vandps xmm2, xmm2, qword ptr [reloc @RWD32]
C5F92ECA vucomisd xmm1, xmm2
7707 ja SHORT G_M46661_IG06
FFC0 inc eax
83F80A cmp eax, 10
7E9D jle SHORT G_M46661_IG04
;; OSR inner loop
align [0 bytes for IG04]
G_M46661_IG04: ;; offset=0066H
C5FB109424A0000000 vmovsd xmm2, qword ptr [rsp+A0H] // extra
C5EB59DA vmulsd xmm3, xmm2, xmm2
C5FB1025C5010000 vmovsd xmm4, qword ptr [reloc @RWD16]
C5DB5CDB vsubsd xmm3, xmm4, xmm3
C5E359D2 vmulsd xmm2, xmm3, xmm2
C5FB5CD2 vsubsd xmm2, xmm0, xmm2
C5E8541DC1010000 vandps xmm3, xmm2, qword ptr [reloc @RWD32]
C5F92ECB vucomisd xmm1, xmm3
7768 ja SHORT G_M46661_IG06
C5FB109C24A0000000 vmovsd xmm3, qword ptr [rsp+A0H] // extra
C5E35925BA010000 vmulsd xmm4, xmm3, qword ptr [reloc @RWD48]
C5DB59DB vmulsd xmm3, xmm4, xmm3
C5E35C1D8E010000 vsubsd xmm3, xmm3, qword ptr [reloc @RWD16]
C5D857E4 vxorps xmm4, xmm4
C5F92EDC vucomisd xmm3, xmm4
7A02 jp SHORT G_M46661_IG05
7439 je SHORT G_M46661_IG06
;; bbWeight=16 PerfScore 693.33
G_M46661_IG05: ;; offset=00BEH
C5EB5ED3 vdivsd xmm2, xmm2, xmm3
C5FB109C24A0000000 vmovsd xmm3, qword ptr [rsp+A0H] // extra
C5E35CDA vsubsd xmm3, xmm3, xmm2
C5FB119C24A0000000 vmovsd qword ptr [rsp+A0H], xmm3
C5E8541570010000 vandps xmm2, xmm2, qword ptr [reloc @RWD32]
C5F92ECA vucomisd xmm1, xmm2
770B ja SHORT G_M46661_IG06
FFC3 inc ebx
83FB0A cmp ebx, 10
0F8E75FFFFFF jle G_M46661_IG04
;; OSR iterations
// AfterActualRun
WorkloadResult 1: 4 op, 369514200.00 ns, 92.3786 ms/op
WorkloadResult 2: 4 op, 361280500.00 ns, 90.3201 ms/op
WorkloadResult 3: 4 op, 357723200.00 ns, 89.4308 ms/op
WorkloadResult 4: 4 op, 361092200.00 ns, 90.2730 ms/op
WorkloadResult 5: 4 op, 355732600.00 ns, 88.9331 ms/op
WorkloadResult 6: 4 op, 352204300.00 ns, 88.0511 ms/op
WorkloadResult 7: 4 op, 299682100.00 ns, 74.9205 ms/op
WorkloadResult 8: 4 op, 282106800.00 ns, 70.5267 ms/op
WorkloadResult 9: 4 op, 283662600.00 ns, 70.9156 ms/op
WorkloadResult 10: 4 op, 283777300.00 ns, 70.9443 ms/op
WorkloadResult 11: 4 op, 284411100.00 ns, 71.1028 ms/op
WorkloadResult 12: 4 op, 288849500.00 ns, 72.2124 ms/op
WorkloadResult 13: 4 op, 281122300.00 ns, 70.2806 ms/op
WorkloadResult 14: 4 op, 281138100.00 ns, 70.2845 ms/op
WorkloadResult 15: 4 op, 282122600.00 ns, 70.5306 ms/op
WorkloadResult 16: 4 op, 286633300.00 ns, 71.6583 ms/op
WorkloadResult 17: 4 op, 286545500.00 ns, 71.6364 ms/op
WorkloadResult 18: 4 op, 282719600.00 ns, 70.6799 ms/op
WorkloadResult 19: 4 op, 284713200.00 ns, 71.1783 ms/op
WorkloadResult 20: 4 op, 285504900.00 ns, 71.3762 ms/op
In the Tier0 method, ... The exact impact of this depends on when the Tier1 method shows up -- in the results above you can see it kicked in around iteration 7.
C5FB119C24A0000000 vmovsd qword ptr [rsp+A0H], xmm3
CSE does not consider these trees despite knowing that the two loads (at least) will return the same value.
Presumably this runs into runtime/src/coreclr/jit/optcse.cpp, lines 3576 to 3577 at 798d52b, which seems a bit short-sighted. We know (as in this case) we have some invariant/expensive local var loads, so why not try CSEing them if we have the room? Will experiment.
Seems viable, though simply turning it on (even requiring ...). E.g. for benchmarks (via SPMI):
I recall that a while back I was touting something like this as an alternative to EH Write Through, but never pushed on it. Anyway, it's probably something too radical to land in a timely manner.
Seeing a new issue now in general OSR testing, presumably from the interaction of OSR and #66257.
Suspect the fix is simple? Here we have BB01->bbNext == BB04, so we decide not to create a new block (runtime/src/coreclr/jit/loopcloning.cpp, lines 1903 to 1928 at ea4ebaa), but BB01 does not transfer control to BB04.
Doesn't BB01->bbNext == BB02? What is the actual lexical block order? If you set
Yeah, sorry, I was off base. As you noticed in #67067, we get into this code but weren't setting up the right ...
Constrained view of the flowgraph:
There's one more libraries issue I want to investigate... on ubuntu x64
#83910 improved a couple of the microbenchmarks, and might also fix some of the reported regressions -- taking a look.
Seeing some benchmark cases where there are methods with stackalloc + loop that bypass tiering (#84264 (comment)) and hence also bypass PGO. In particular: runtime/src/libraries/System.Text.Json/src/System/Text/Json/Document/JsonDocument.TryGetProperty.cs, lines 135 to 150 at f2a55e2.
Not sure how common this is, but it's something to keep an eye on. Supporting stackalloc in its full generality with OSR would be hard, because we'd potentially need to track 3 addressable segments of the stack, but it's not impossible. It might be easier to revise the BCL so this doesn't happen in places where we care about perf. The proposed mitigation would be to split the method into a caller that stackallocs and a callee that loops. These parts can be reunited (if deemed profitable) via normal inlining, or the callee can be marked accordingly. FYI @stephentoub -- a possible pattern to avoid, since it creates methods that can't benefit from Dynamic PGO. Forked this off as #85548.
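A minimal sketch of that split, with invented names and an arbitrary 64-element buffer (the real case is the JsonDocument code linked above): the caller keeps the stackalloc, the callee keeps the loop, so the callee has no localloc and can go through Tier0, OSR, and PGO as usual.

```csharp
using System;

static class StackallocSplitSketch
{
    // Original shape: stackalloc and the hot loop live in one method. The localloc
    // keeps the method out of OSR, so it bypasses tiering and Dynamic PGO.
    public static int SumOriginal(ReadOnlySpan<int> source)
    {
        int n = Math.Min(source.Length, 64);
        Span<int> scratch = stackalloc int[64];
        source.Slice(0, n).CopyTo(scratch);

        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += scratch[i];
        return sum;
    }

    // Proposed split: the caller does the stackalloc, the callee owns the loop.
    public static int SumSplit(ReadOnlySpan<int> source)
    {
        int n = Math.Min(source.Length, 64);
        Span<int> scratch = stackalloc int[64];
        source.Slice(0, n).CopyTo(scratch);
        return SumCore(scratch.Slice(0, n));
    }

    // No localloc here, so this method can be jitted at Tier0 with a patchpoint in
    // the loop, escape via OSR, and later be rejitted at Tier1 with profile data.
    // The two halves can be reunited by normal inlining if that proves profitable.
    private static int SumCore(ReadOnlySpan<int> scratch)
    {
        int sum = 0;
        for (int i = 0; i < scratch.Length; i++)
            sum += scratch[i];
        return sum;
    }
}
```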
I think it is common, because many developers (who care about allocations and performance) are writing code like the below nowadays.
const int StackAllocSize = 128;
Span<T> buffer = length < StackAllocSize ? stackalloc T[length] : new T[length];
Possible next steps now that #32969 is merged, in rough order of priority.
- Assert failure (PID 7028 [0x00001b74], Thread: 7084 [0x1bac]): ppInfo->m_osrMethodCode == NULL -- likely the logic guarding against threads racing to build the patchpoint method needs adjusting (likely fixed by "A couple of small OSR fixes" #38165)
- Look at how debuggers handle OSR frames; if the double-RBP restore is too confusing, think about relying on the original method's RBP (will still need split save areas). On further thought, it seems like (for x64) we can pass the Tier0 method caller's RBP to the OSR method and just have one unwind restore. This is what I'm doing for arm64 and it seems to be working out ok. (New plan is to revise arm64 to conform with how x64 will work; see below.)
- Enable QJFL and OSR by default for x64 #61934
- Enable QJFL and OSR by default for x64 and arm64 #63642
- Enable QJFL and OSR by default for x64 and arm64 #65675
- Issues and fixes after OSR was enabled
Performance Regressions
Other ideas: enhancements or optimizations
cc @dotnet/jit-contrib
category:cq
theme:osr
skill-level:expert
cost:extra-large