On Stack Replacement Next Steps #33658
And if you want to try this out (x64 only):
By default a transition requires ~10,000 executions of a patchpoint. Unlike tiered compilation, transitions are currently synchronous, so there's no lag/delay once this threshold is reached. To enable "aggressive OSR" -- where an OSR method is created the first time a patchpoint is encountered -- also set
To see OSR log messages from the runtime, also add
Design sketch for better supporting patchpoints within try regions:
Started working on the try region fix noted just above. It seems like we ought to wait until after the importer has done its "transitive closure" to add the OSR-inspired try region "step blocks" -- if we add them before importation then we will make all the code within tries reachable, and this may be hard to undo. If, after importation, an enclosing try's entry block is unreachable, then the step block can unconditionally transfer to the enclosed try entry and/or the actual entry point. Still checking whether it's ok for a try entry to have multiple preds or whether we actually expect there to be a unique fall-through pred.
Here's an example from the tailrecursetry test case with my current changes main..FixMidTryPatchpoint. This is adding the new flow pre-importation; as noted above that might not be the best approach overall. Seems like perhaps we should defer this until
This change adds control flow to ensure that an OSR method for a patchpoint nested in try regions enters those try regions from the first block of each try rather than mid-try. This lets these OSR methods conform to the data flow expectation that the only way control flow can enter a try is via its first block. See #33658 for more details on the approach taken here. Fixes #35687.
If a method has multiple patchpoints and multiple active threads, we can trigger an "OSR storm" of sorts.
This seems kind of unfortunate, but I'm not sure there's any easy fix. One could imagine, say, deferring OSR generation for a method if there's already an OSR method being built at any patchpoint in the method, in the hope that other threads eventually hit and can use the one OSR method being created. But there's no guarantee that any of these threads will ever hit other patchpoints.
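As a purely illustrative sketch of that deferral idea (not something implemented; the runtime's real patchpoint bookkeeping is native code, and every name below is invented): track whether any patchpoint in a Tier0 method already has an OSR compile in flight, and if so let other threads keep running the Tier0 code instead of kicking off more OSR compiles for sibling patchpoints.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical sketch only: "defer OSR generation for a method if there's already
// an OSR method being built at any patchpoint in the method".
static class OsrCompileThrottle
{
    // Keyed by the Tier0 method; an entry exists while an OSR compile is in flight.
    private static readonly ConcurrentDictionary<IntPtr, byte> s_inFlight = new();

    // True  => this thread should build the OSR method for the patchpoint it hit.
    // False => some other patchpoint in the same method is already being compiled,
    //          so keep running Tier0 code and hope to reuse that OSR method later.
    public static bool TryBeginOsrCompile(IntPtr tier0Method) =>
        s_inFlight.TryAdd(tier0Method, 0);

    public static void EndOsrCompile(IntPtr tier0Method) =>
        s_inFlight.TryRemove(tier0Method, out _);
}
```

As noted above, the catch is that a deferred thread may never reach another patchpoint, so it could stay in Tier0 code until the in-flight OSR method (or a Tier1 recompile) shows up.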
(1) Been looking into what would be required to enable QJFL=1 via OSR. There are still some methods that OSR can't handle, and some loops within methods that OSR can't handle. So the proposal is that with QJFL=1 and OSR=1 (which would become the default on x64, say), and provided the JIT could guarantee escape from the Tier0 method via OSR, we'd jit at Tier0. If the JIT can't guarantee this then we instead optimize. So a much larger set of methods would jit at Tier0. We could reserve QJFL=2, say, to mean "always jit at Tier0" regardless, if we wanted such capability. Bailing out of Tier0+OSR because of the "method" checks is easy enough, as it can use the same pattern we already use for QJFL=0. But for the per-loop failures it's not as clear how to handle the fallback, since we need to import to detect the problem. If these cases are rare enough, it could work like we do for the minopts fallback: fail the Tier0 jit request with OSR set, and then the driver could request a new jit with OSR not set. Or we could try to cleanly architect a jit reset that bails from mid-import back to pre-import somehow, but there's a lot of ambient jit state around....
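For concreteness, here is a hedged paraphrase of that policy as a single decision function. This is not the actual runtime/JIT interface; the enum, names, and parameters are invented, and it only models methods that contain loops (the case QJFL governs).

```csharp
// Hypothetical paraphrase of the proposal in (1); all names are invented.
enum CompileMode { Tier0WithPatchpoints, FullOpts }

static class TieringPolicySketch
{
    // qjfl            : value of the QuickJitForLoops-style knob (0, 1, or proposed 2).
    // osrEnabled      : whether OSR is enabled.
    // canEscapeViaOsr : whether the JIT can guarantee the method and each of its loops
    //                   can transition out of the Tier0 code via a patchpoint.
    public static CompileMode Choose(int qjfl, bool osrEnabled, bool canEscapeViaOsr)
    {
        if (qjfl == 0)
            return CompileMode.FullOpts;              // today's behavior: loops force full opts

        if (qjfl == 2)
            return CompileMode.Tier0WithPatchpoints;  // proposed "always jit at Tier0" escape hatch

        // qjfl == 1: jit at Tier0 only when OSR can guarantee an escape; otherwise optimize.
        return (osrEnabled && canEscapeViaOsr)
            ? CompileMode.Tier0WithPatchpoints
            : CompileMode.FullOpts;
    }
}
```

The awkward part called out above is that the per-loop half of canEscapeViaOsr isn't known until mid-import, which is where the minopts-style "fail and re-request without OSR" fallback would come in.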
(2) For OSR + {stack,loc}alloc, we'd need to have the jit and gc info support the idea of having 3 pointers into the frame. One would point at the original method's frame base (to access the OSR state), one at the OSR method's frame base (to access any state it introduces, e.g. spilled temps), and one to refer to the outgoing args area. The separation between these 3 would vary depending on how much allocation was done by the base method before triggering OSR, and how much is done by the OSR method. Unfortunately the idea that there are at most two frame pointers (FP and SP) is baked in everywhere, including the GC encoding. Fixing this is straightforward in principle, but also a fair amount of work. And while there are potentially other uses for three frame pointers (e.g. using the 3rd to reduce the need for large frame-relative offsets), at least initially this support would be exclusive to OSR, and within OSR exclusive to the subset of methods using localloc. So it's not faring well in terms of cost/benefit. I'm thinking now that the right move is to initially exclude these methods from OSR by detecting this as one of the cases we can't handle, and bailing to optimized code (like we do currently for QJFL=0).
(3) Reverse PInvoke methods currently don't seem to be eligible for tiering. We'd need to update the patchpoint state to track the pinvoke frame var and give its OSR version a suitable location. Moving it below the cut line for now.
(4) Looked at PowerShell startup; the impact from OSR would be minimal, as almost no jitted code is run.
(5) For explicit tail calls. I had thought this was working but didn't realize there was a secondary block. There are likely two things we need to do here:
Note that enabling OSR for methods with explicit tail calls may not work out very well. Imagine a simple set of mutually tail-calling methods -- eventually each call/jump to a method will force an OSR transition. These must be going via call counting stubs, so perhaps instead we should just rely on normal tiering here. Tail recursion is not converted to internal branching, so it should also be handled by tiering. So it seems like we should suppress OSR in methods with explicit tail calls, but not force them to switch to optimized.
(6) OSR and altjit (specifically cross-altjits). Two issues:
For cross-altjit, will there ever be correct target-specific patchpoint info? We never ran the cross-altjit-generated code.
Yes, the right patchpoint info gets produced by the Tier0 altjit, but then it's thrown away because the altjit reports failure. So presumably we could hold onto this info somehow and supply it to the OSR altjit request, but it's not clear how to do that. I guess there's another implicit assumption here: that when an altjit is specified it will process both the Tier0 method and any OSR method. Currently that's the case. Another option is to recognize in the jit that we have an altjit and have the "wrong" patchpoint info, and just make up plausible values for anything host- or abi-specific. Currently not sure what information we might need.
So, specifically, I guess the value you're trying to get here is, for example, testing both patchpoint info creation and OSR method consumption with an x64-hosted, arm64-targeting altjit, instead of requiring all OSR testing to be done on arm64 hardware itself. And presumably you have a way to force OSR method creation for all (or some random collection of) patchpoints as a stress mode, so for cross-altjit it wouldn't depend on running (e.g., if we could force these compilations via PMI, that would be good). That does seem like a worthwhile goal.
Right, I'm currently bringing up OSR for Arm64 via cross-altjit and would like to get as much mileage as possible out of this mode. But I'm reluctant to plumb knowledge of the altjit deeper into the runtime, so currently I'm not sure how the altjit-produced patchpoint info can be stored and retrieved. There would also be challenges describing the layout of the cross-altjit patchpoint blob for the natively hosted runtime. Currently the data in the patchpoint info is opaque to the runtime -- all it cares about is the size of the data. So only the altjit would need to know the exact format.
(7) Initial notes on Arm64 support. Assuming we don't want to constrain the Tier0 frame shape, the biggest challenge is likely to get the OSR method epilog correct -- this requires knowledge of the Tier0 frame shape or possibly related data. Not clear yet what all we will need. Recall that in OSR the OSR frame sits on top of the Tier0 frame. All callee saves done by Tier0 are "undone" by the transition to the OSR method. When the OSR method is entered, SP points below the Tier0 frame. The OSR method re-saves the callee saves it uses. So when we go to exit the OSR frame, we first do the normal epilog for the OSR method, and then we need to pop off the Tier0 frame. Essentially, the OSR epilog needs to end by popping the Tier0 frame, much as the Tier0 epilog itself would. For example, on x64, if we have a Tier0 method with this epilog:
;; Tier0 method
G_M8788_IG09:
4883C440 add rsp, 64
5D pop rbp
C3 ret
and the OSR method also needs to set up a frame, its epilog will be:
;; OSR method
G_M8788_IG07:
;; pop off the OSR frame
4883C420 add rsp, 32
5E pop rsi
;; pop off the Tier0 frame
4883C448 add rsp, 72
5D pop rbp
C3 ret
The arm64 versions of this will be something like:
;; Tier0 method
G_M8788_IG09:
A8C37BFD ldp fp, lr, [sp],#48
D65F03C0 ret lr
;; OSR method (naive, wrong)
G_M8788_IG07:
F9400FF3 ldr x19, [sp,#24]
A8C27BFD ldp fp, lr, [sp],#32
D65F03C0 ret lr
;; OSR method (correct)
G_M8788_IG07:
;; pop off the OSR frame
F9400FF3 ldr x19, [sp,#24]
add sp, sp, #32
;; pop off the Tier0 frame
A8C37BFD ldp fp, lr, [sp],#48
D65F03C0 ret lr
For now, I am going to do the following:
That way I can presumably iterate through the frame type options with cross-altjit and verify each would produce plausible-looking code.
Turns out I don't need the frame type for arm64 after all. The only dependence is the total Tier0 frame size and that the patchpoint-recorded offsets are virtual offsets (which they are). So an arm64 altjit will produce plausible (if incorrect) code when run on an x64 method context.
Based on #65675 there are two Benchmarks Game tests in the performance repo that would regress with the default strategy:
BenchmarkDotNet=v0.13.1.1694-nightly, OS=Windows 11 (10.0.22000.493/21H2)
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-alpha.1.21568.2
[Host] : .NET 6.0.2 (6.0.222.6406), X64 RyuJIT
Job-AARFFF : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
Job-GTOQLX : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
Job-KHECGD : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
Job-CHOGBR : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
IterationTime=250.0000 ms MinIterationCount=15 WarmupCount=1
Here's that same set of benchmarks, this time with a different configuration:
BenchmarkDotNet=v0.13.1.1694-nightly, OS=Windows 11 (10.0.22000.493/21H2)
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog IterationTime=250.0000 ms
Actually, some care is needed in interpreting these results, as the Mar1 "baseline" is NoPGO, not default, and in some cases PGO data is causing regressions. There's too much here to sort through by hand. @DrewScoggins, looks like I could use an updated comparison report...
This test regresses with OSR / Tier1 because we don't align the loop:
;; w/o OSR
7E22 jle SHORT G_M30591_IG04
0F1F80000000000F1F840000000000 align [15 bytes for IG03]
;; bbWeight=1 PerfScore 7.75
G_M30591_IG03: ;; offset=0040H
448BC2 mov r8d, edx
F34E0FB844C010 popcnt r8, qword ptr [rax+8*r8+16]
4103F0 add esi, r8d
FFC2 inc edx
3BCA cmp ecx, edx
7FED jg SHORT G_M30591_IG03
;; w/ OSR
7E14 jle SHORT G_M30591_IG04
G_M30591_IG03: ;; offset=001AH
448BC9 mov r9d, ecx
F34E0FB84CCA10 popcnt r9, qword ptr [rdx+8*r9+16]
4103C1 add eax, r9d
FFC1 inc ecx
443BC1 cmp r8d, ecx
7FEC jg SHORT G_M30591_IG03
We don't align because with OSR or at Tier1 the loop body comes from an inlinee with PGO data, inlined into a caller without PGO data. We scale the profile during inlining, then downscale it during a later phase.
Without OSR, we initially try to jit the method at Tier0, see there's a loop, and switch to FullOpts. Because of this we don't have BBOPT set. This is unfortunate because we're not pulling in any profile data for these eagerly optimized methods. It may also partially explain why we've seen less impact from default PGO than we might expect -- the methods where we spend the most time are not using any PGO data, by default. There are two avenues to pursue here (perhaps together):
I wonder if the loop body of inverted loops always gets downscaled because the zero trip test is "given" half the flow. Seems bad.
Did you look at why that is?
Yeah, this seems a bit odd. Weights for loop blocks in inverted loops go from 1 to 0.5 to 4: the body starts at 1, the zero-trip test is "given" half the flow so the body drops to 0.5, and the loop weight multiplier then scales that up to 4. Generally we should assume such loops rarely zero-trip.
We don't upscale blocks with profile data. All this is a consequence of our rather simplistic weighting/reweighting approach. These problems get worse with PGO, as we do a poor job of handling mixtures of profiled and unprofiled code (say inlining profiled callees into an unprofiled caller, or vice versa). OSR is highlighting some of this because (due to the inadvertent lack of BBOPT in the switch-to-optimized case) we are using PGO data "more often". I have ambitions to redo all this but thought we could get by with the current scheme for a while longer.
Seems to be related to dotnet/BenchmarkDotNet#1780, where if the initial single invocation takes a long enough time then BDN has various threshold tests to decide whether it should skip the pilot stages and just run one iteration per workload interval. In my case the run of
I guess it's the second bit of BDN logic. So OSR is triggering this behavior by having less initial overhead than MAIN, but not so much less that BDN decides it should run pilot stages to assess how many iterations are needed.
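Roughly, and only as a hedged paraphrase of the behavior described above (this is not BenchmarkDotNet's actual code; the threshold and names are assumptions, with the 250 ms target taken from the IterationTime in the job config):

```csharp
using System;

// Sketch of the pilot-stage decision being discussed; illustrative only.
static class PilotStageSketch
{
    // firstInvokeMs    : duration of the very first (jitting) workload invocation.
    // iterationTargetMs: desired workload interval length (250 ms in these jobs).
    public static int InvocationsPerIteration(double firstInvokeMs, double iterationTargetMs = 250.0)
    {
        if (firstInvokeMs >= iterationTargetMs)
        {
            // First invocation already fills the interval: skip the pilot stages and
            // run one invocation per workload iteration from then on.
            return 1;
        }

        // Otherwise size the iteration so it roughly fills the target interval.
        // (BDN's real pilot/estimation logic is more involved; a simple ratio stands in here.)
        return Math.Max(1, (int)Math.Round(iterationTargetMs / firstInvokeMs));
    }
}
```

Either way, the concern above is that the invocation count gets fixed while the benchmark is still running pre-Tier1 code, so later, faster iterations shift the per-op results.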
Have an interim comparison report showing OSR vs MAIN. The top dozen or so regressions are the ... Overall this report shows 395 regressions and 591 improvements; the vast majority of these are noise. Drew is going to get me an updated comparison report that takes history into account; when that shows up I'll post an update. In the meantime, looking through what's there for actual regressions, I actually get somewhat inconsistent results locally, and it's not clear yet why we can't hoist/CSE this load. Need to look more closely at what's going on.
;; FullOpt/Tier1 inner loop
align [0 bytes for IG04]
G_M46661_IG04: ;; offset=0030H
C5CB59D6 vmulsd xmm2, xmm6, xmm6
C5FB101DE4000000 vmovsd xmm3, qword ptr [reloc @RWD00]
C5E35CD2 vsubsd xmm2, xmm3, xmm2
C5EB59D6 vmulsd xmm2, xmm2, xmm6
C5FB5CD2 vsubsd xmm2, xmm0, xmm2
C5E8541DF0000000 vandps xmm3, xmm2, qword ptr [reloc @RWD32]
C5F92ECB vucomisd xmm1, xmm3
7746 ja SHORT G_M46661_IG06
C5CB591DF2000000 vmulsd xmm3, xmm6, qword ptr [reloc @RWD48]
C5E359DE vmulsd xmm3, xmm3, xmm6
C5E35C1DB6000000 vsubsd xmm3, xmm3, qword ptr [reloc @RWD00]
C5D857E4 vxorps xmm4, xmm4
C5F92EDC vucomisd xmm3, xmm4
7A02 jp SHORT G_M46661_IG05
7421 je SHORT G_M46661_IG06
;; bbWeight=16 PerfScore 597.33
G_M46661_IG05: ;; offset=0076H
C5EB5ED3 vdivsd xmm2, xmm2, xmm3
C5CB5CF2 vsubsd xmm6, xmm6, xmm2
C5E85415BA000000 vandps xmm2, xmm2, qword ptr [reloc @RWD32]
C5F92ECA vucomisd xmm1, xmm2
7707 ja SHORT G_M46661_IG06
FFC0 inc eax
83F80A cmp eax, 10
7E9D jle SHORT G_M46661_IG04
;; OSR inner loop
align [0 bytes for IG04]
G_M46661_IG04: ;; offset=0066H
C5FB109424A0000000 vmovsd xmm2, qword ptr [rsp+A0H] // extra
C5EB59DA vmulsd xmm3, xmm2, xmm2
C5FB1025C5010000 vmovsd xmm4, qword ptr [reloc @RWD16]
C5DB5CDB vsubsd xmm3, xmm4, xmm3
C5E359D2 vmulsd xmm2, xmm3, xmm2
C5FB5CD2 vsubsd xmm2, xmm0, xmm2
C5E8541DC1010000 vandps xmm3, xmm2, qword ptr [reloc @RWD32]
C5F92ECB vucomisd xmm1, xmm3
7768 ja SHORT G_M46661_IG06
C5FB109C24A0000000 vmovsd xmm3, qword ptr [rsp+A0H] // extra
C5E35925BA010000 vmulsd xmm4, xmm3, qword ptr [reloc @RWD48]
C5DB59DB vmulsd xmm3, xmm4, xmm3
C5E35C1D8E010000 vsubsd xmm3, xmm3, qword ptr [reloc @RWD16]
C5D857E4 vxorps xmm4, xmm4
C5F92EDC vucomisd xmm3, xmm4
7A02 jp SHORT G_M46661_IG05
7439 je SHORT G_M46661_IG06
;; bbWeight=16 PerfScore 693.33
G_M46661_IG05: ;; offset=00BEH
C5EB5ED3 vdivsd xmm2, xmm2, xmm3
C5FB109C24A0000000 vmovsd xmm3, qword ptr [rsp+A0H] // extra
C5E35CDA vsubsd xmm3, xmm3, xmm2
C5FB119C24A0000000 vmovsd qword ptr [rsp+A0H], xmm3
C5E8541570010000 vandps xmm2, xmm2, qword ptr [reloc @RWD32]
C5F92ECA vucomisd xmm1, xmm2
770B ja SHORT G_M46661_IG06
FFC3 inc ebx
83FB0A cmp ebx, 10
0F8E75FFFFFF jle G_M46661_IG04
;; OSR iterations
// AfterActualRun
WorkloadResult 1: 4 op, 369514200.00 ns, 92.3786 ms/op
WorkloadResult 2: 4 op, 361280500.00 ns, 90.3201 ms/op
WorkloadResult 3: 4 op, 357723200.00 ns, 89.4308 ms/op
WorkloadResult 4: 4 op, 361092200.00 ns, 90.2730 ms/op
WorkloadResult 5: 4 op, 355732600.00 ns, 88.9331 ms/op
WorkloadResult 6: 4 op, 352204300.00 ns, 88.0511 ms/op
WorkloadResult 7: 4 op, 299682100.00 ns, 74.9205 ms/op
WorkloadResult 8: 4 op, 282106800.00 ns, 70.5267 ms/op
WorkloadResult 9: 4 op, 283662600.00 ns, 70.9156 ms/op
WorkloadResult 10: 4 op, 283777300.00 ns, 70.9443 ms/op
WorkloadResult 11: 4 op, 284411100.00 ns, 71.1028 ms/op
WorkloadResult 12: 4 op, 288849500.00 ns, 72.2124 ms/op
WorkloadResult 13: 4 op, 281122300.00 ns, 70.2806 ms/op
WorkloadResult 14: 4 op, 281138100.00 ns, 70.2845 ms/op
WorkloadResult 15: 4 op, 282122600.00 ns, 70.5306 ms/op
WorkloadResult 16: 4 op, 286633300.00 ns, 71.6583 ms/op
WorkloadResult 17: 4 op, 286545500.00 ns, 71.6364 ms/op
WorkloadResult 18: 4 op, 282719600.00 ns, 70.6799 ms/op
WorkloadResult 19: 4 op, 284713200.00 ns, 71.1783 ms/op
WorkloadResult 20: 4 op, 285504900.00 ns, 71.3762 ms/op
In the Tier0 method, ... The exact impact of this depends on when the Tier1 method shows up -- in the results above you can see it kicked in around iteration 7.
C5FB119C24A0000000 vmovsd qword ptr [rsp+A0H], xmm3
CSE does not consider these trees despite knowing that the two loads (at least) will return the same value.
Presumably this runs into runtime/src/coreclr/jit/optcse.cpp, lines 3576 to 3577 at 798d52b, which seems a bit short-sighted. We know (as in this case) we have some invariant/expensive local var loads, so why not try CSEing them if we have the room? Will experiment.
Seems viable, though simply turning it on (even requiring ...). E.g. for benchmarks (via SPMI):
I recall that a while back I was touting something like this as an alternative to EH Write Through, but never pushed on it. Anyway, it's probably something too radical to land in a timely manner.
Seeing a new issue now in general OSR testing, presumably from the interaction of OSR and #66257.
Suspect the fix is simple? Here we have BB01->bbNext == BB04, so we decide not to create a new block (runtime/src/coreclr/jit/loopcloning.cpp, lines 1903 to 1928 at ea4ebaa), but BB01 does not transfer control to BB04.
Doesn't BB01->bbNext == BB02? What is the actual lexical block order? If you set
Yeah, sorry, I was off base. As you noticed in #67067, we get into this code but weren't setting up the right ...
Constrained view of the flowgraph:
There's one more libraries issue I want to investigate... on ubuntu x64
#83910 improved a couple of the microbenchmarks, and might also fix some of the reported regressions -- taking a look.
Seeing some benchmark cases where there are methods with stackalloc + loop that bypass tiering (#84264 (comment)) and hence also bypass PGO. In particular: runtime/src/libraries/System.Text.Json/src/System/Text/Json/Document/JsonDocument.TryGetProperty.cs, lines 135 to 150 at f2a55e2.
Not sure how common this is, but it's something to keep an eye on. Supporting stackalloc in its full generality with OSR would be hard, because we'd potentially need to track 3 addressable segments of the stack, but it's not impossible. It might be easier to revise the BCL so this doesn't happen in places where we care about perf. The proposed mitigation would be to split the method into a caller that stackallocs and a callee that loops. These parts can be reunited (if deemed profitable) via normal inlining, or the callee can be marked accordingly. FYI @stephentoub -- a possible pattern to avoid, since it creates methods that can't benefit from Dynamic PGO. Forked this off as #85548.
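A minimal sketch of that split, with invented names and an arbitrary 64-element buffer (the real case is the JsonDocument code linked above): the caller keeps the stackalloc, the callee keeps the loop, so the callee has no localloc and can go through Tier0, OSR, and PGO as usual.

```csharp
using System;

static class StackallocSplitSketch
{
    // Original shape: stackalloc and the hot loop live in one method. The localloc
    // keeps the method out of OSR, so it bypasses tiering and Dynamic PGO.
    public static int SumOriginal(ReadOnlySpan<int> source)
    {
        int n = Math.Min(source.Length, 64);
        Span<int> scratch = stackalloc int[64];
        source.Slice(0, n).CopyTo(scratch);

        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += scratch[i];
        return sum;
    }

    // Proposed split: the caller does the stackalloc, the callee owns the loop.
    public static int SumSplit(ReadOnlySpan<int> source)
    {
        int n = Math.Min(source.Length, 64);
        Span<int> scratch = stackalloc int[64];
        source.Slice(0, n).CopyTo(scratch);
        return SumCore(scratch.Slice(0, n));
    }

    // No localloc here, so this method can be jitted at Tier0 with a patchpoint in
    // the loop, escape via OSR, and later be rejitted at Tier1 with profile data.
    // The two halves can be reunited by normal inlining if that proves profitable.
    private static int SumCore(ReadOnlySpan<int> scratch)
    {
        int sum = 0;
        for (int i = 0; i < scratch.Length; i++)
            sum += scratch[i];
        return sum;
    }
}
```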
I think it is common, because many developers (who care about allocations and performance) are writing code like the below nowadays.
const int StackAllocSize = 128;
Span<T> buffer = length < StackAllocSize ? stackalloc T[length] : new T[length];
Possible next steps now that #32969 is merged, in rough order of priority.
- Assert failure (PID 7028 [0x00001b74], Thread: 7084 [0x1bac]): ppInfo->m_osrMethodCode == NULL -- likely the logic guarding against threads racing to build the patchpoint method needs adjusting (likely fixed by "A couple of small OSR fixes" #38165)
- Look at how debuggers handle OSR frames; if the double-RBP restore is too confusing, think about relying on the original method's RBP (will still need split save areas). On further thought, it seems like (for x64) we can pass the Tier0 method caller's RBP to the OSR method and just have one unwind restore. This is what I'm doing for arm64 and it seems to be working out ok. (New plan is to revise arm64 to conform with how x64 will work; see below.)
- Enable QJFL and OSR by default for x64 #61934
- Enable QJFL and OSR by default for x64 and arm64 #63642
- Enable QJFL and OSR by default for x64 and arm64 #65675
- Issues and fixes after OSR was enabled
Performance Regressions
Other ideas: enhancements or optimizations
cc @dotnet/jit-contrib
category:cq
theme:osr
skill-level:expert
cost:extra-large