Rename tiers and config options, add config option JitTier0ForLoops #23597
Conversation
I would split this into two PRs:
QuickJit is disabled by default, but if someone wanted to enable it in the startup tier for perf, I would still like the default behavior to be to disable QuickJit for methods that contain loops, to avoid exposing people to that issue. Enabling QuickJit for loops would be a more advanced/risky thing to do. So I'm thinking about including the loop detection and the switch from tier 0 to tier 1 in preview 4 in some fashion.
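To make the policy described above concrete, here is a minimal sketch of the intended decision. The helper names (`IsQuickJitEnabled`, `QuickJitForLoopsEnabled`, `MethodHasLoops`) are hypothetical stand-ins for the actual runtime config and loop detection, not the real CoreCLR implementation:

```cpp
#include <cstdio>

// Hypothetical config switches standing in for the runtime's tiering options.
bool IsQuickJitEnabled()       { return true;  } // e.g. opted in for startup perf
bool QuickJitForLoopsEnabled() { return false; } // off by default, per the discussion above
bool MethodHasLoops(const char* /*method*/) { return true; } // placeholder loop detection

// Returns true if the method should be quick-jitted (startup tier), false if it
// should go straight to the optimizing JIT.
bool ShouldQuickJit(const char* method)
{
    if (!IsQuickJitEnabled())
        return false;                                        // quick JIT disabled entirely
    if (MethodHasLoops(method) && !QuickJitForLoopsEnabled())
        return false;                                        // avoid cold-method-with-hot-loop stalls
    return true;
}

int main()
{
    printf("Quick-jit 'HotLoopMethod'? %s\n",
           ShouldQuickJit("HotLoopMethod") ? "yes" : "no");
}
```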
I agree that the loop detection is useful. I do not think that the implementation in this PR is good because it depends on exception handling for control flow. My suggestion to split this was to reduce risk for P4: getting the right defaults in place is high priority; improving the heuristics for optional QuickJit is lower priority.
Ok sounds good
I have separated the minimal changes into #23599 for preview 4, and rebased the changes here on top of that commit, so ignore the first commit in this PR.
Here are some perf results for startup time. I have excluded steady-state perf results because they don't have any significant diffs, and I have excluded cases with the cold-methods-with-hot-loops issue that are fully resolved by this change and #23599. Perf results:

- JitBench startup
- ASP.NET startup
- JitBench startup - R2R disabled
- ASP.NET startup - R2R disabled
@dotnet-bot test Ubuntu x64 Checked CoreFX Tests
Aside from the preview 4 / subsequent split, I would like to see the changes here that are not part of #23599 split into a pure renaming part and then the policy/logic changes. In the meantime:
Adding a set of tables that compare perf after this change directly against before this change, and against tiering disabled. Also added some steady-state perf numbers. Lower is better for all diffs.

- JitBench startup perf - time (ms)
- ASP.NET startup perf - time (ms), server start + first request
- JitBench startup perf - R2R disabled
- ASP.NET startup perf - R2R disabled - time (ms), server start + first request
- JitBench steady-state perf - time (ms)
- ASP.NET steady-state perf - time (ns) per request
Done, the second commit now has only the renames, and the commit titled "Functional changes" has the first set of functional changes.
The call to

Ends up here and call counting for the startup tier is disabled for the method:

So the method will not move up a tier. For the default
I think I will move the

Yes, in the future the JIT may decide to turn on opts at that point, maybe if the code would be much larger or if missed optimizations would be too expensive. If the tier 0 code is good enough and it would not be beneficial to rejit, then it can inform the VM in the same way and the VM will treat it as tier 1 code.
I was thinking that if we add a new tier between tier 0 and tier 1 in the future, that would be a mess. Giving each tier a name that reflects what it does would, I think, be good for code readability as well. I skipped renaming the JIT-side flags for now, as I'm not sure whether it would be preferred to call them tier 0 and tier 1. The VM-side tiers don't necessarily have to match JIT-side tiers (or flags). For example, there could be a "CallCountingTier" in the future where the entry point would be a stub that counts calls before going to the code entry point (just one possible implementation; it could be done in other ways).
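As a rough illustration of keeping VM-side tier names decoupled from JIT-side flags, here is a sketch; the tier names (including the speculative CallCountingTier mentioned above) and the flag values are hypothetical, not the actual CoreCLR definitions:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical VM-side tiers, named for what they do rather than numbered,
// so a future tier (e.g. a call-counting stub tier) can be added without renumbering.
enum class NativeCodeTier { StartupTier, CallCountingTier, OptimizedTier };

// Hypothetical JIT-side flags; the VM-to-JIT mapping lives in one place.
enum JitFlags : uint32_t { JIT_FLAG_MIN_OPT = 0x1, JIT_FLAG_FULL_OPT = 0x2 };

uint32_t GetJitFlagsForTier(NativeCodeTier tier)
{
    switch (tier)
    {
    case NativeCodeTier::StartupTier:      return JIT_FLAG_MIN_OPT;
    case NativeCodeTier::CallCountingTier: return JIT_FLAG_MIN_OPT; // counting done by a stub, not by the JIT
    case NativeCodeTier::OptimizedTier:    return JIT_FLAG_FULL_OPT;
    }
    return JIT_FLAG_MIN_OPT;
}

int main()
{
    printf("OptimizedTier -> jit flags 0x%x\n",
           (unsigned)GetJitFlagsForTier(NativeCodeTier::OptimizedTier));
}
```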
That function is also called for recursive functions, so it's saying that a recursive function has a loop. I was intending to exclude those because they will build up call counts during the recursion to be promoted to the optimized tier.
I had similar thoughts -- at least for me the right analog in naming (at least internally) is the optimization level settings used by C/C++ compilers, generally a numeric scale where larger numbers imply more optimization & slower compilation. Instrumentation (call or block counts) I think would be a somewhat separate concern. Generally systems tend to instrument less optimized code, but it is not a hard and fast rule.
Perhaps -- but we might also blow the stack in the meantime. So I would be in favor of considering explicitly tail recursive methods as being methods with loops. And similar concerns for methods with explicit tail call sites. I think one of the things we're seeing with the issues coming up with Tier0 -- at least from the jit standpoint -- is that we don't have as good a handle as we'd like on what codegen features end up being "requirements" in release, and when we vary from what Tier1 does, we end up with surprises. If we're lucky most of the "required" things will actually be fairly cheap to implement; certainly the box-related opts and devirt fall into that category, as well as tail call opts.
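A minimal sketch of the suggestion above, treating explicit tail recursion and explicit tail call sites as loops for tiering purposes; the `ILMethod` shape and the helper are invented for illustration and are not the JIT's actual IL scanner:

```cpp
#include <cstdio>

// Hypothetical, simplified view of a method's IL, for illustration only.
struct ILMethod
{
    bool hasBackwardBranch;       // a branch to an earlier point in the IL stream
    bool hasExplicitTailCallSite; // a "tail."-prefixed call
    bool tailCallsSelf;           // explicit tail recursion
};

// Treat explicit tail recursion (and, conservatively, any explicit tail call
// site) the same as a loop, so such methods are not left as unoptimized code.
bool ShouldTreatAsHavingLoops(const ILMethod& m)
{
    return m.hasBackwardBranch || m.tailCallsSelf || m.hasExplicitTailCallSite;
}

int main()
{
    ILMethod tailRecursive{};
    tailRecursive.hasExplicitTailCallSite = true;
    tailRecursive.tailCallsSelf = true;
    printf("treat as having loops: %d\n", (int)ShouldTreatAsHavingLoops(tailRecursive));
}
```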
Yea good point, will fix. Should
Rebased on top of #23599, this PR begins from the commit titled "Rename tiers and config options, add config option QuickJitForLoops"
I would like to separate the concept of "optimization level" from "tier". Perhaps achieving that would involve some more renames, something along the lines of:
I think that will give some flexibility. For example, if we were to introduce instrumentation, that could be a separate tier. What do you think?
The jit needs to call

This recursive call detection in the importer comes too late to switch opt levels, as aspects of importer behavior depend on minopts (eg use of the box temp, possibly more). So if you want to detect this early you will need something customized.
I think instrumentation should be a flag (as it is now for IBC). My preference is generally for the runtime to specify intent and the jit to map that to behavior, so that on the jit side we have some level of control over the combinatorics possible in codegen. Whether that is called a tier or an opt level is less critical, but we should probably at the same time revamp the other "intent" style jit flags like
Ok, I feel that your opinion does not necessarily disagree with my proposal. I agree that there needs to be some sort of mapping between VM concepts and JIT concepts that makes sense. My motivation was mostly to separate those two things such that the VM can conceptualize things in a way that makes sense there, independently of JIT concepts. A component may do the mapping from VM concepts to JIT concepts, and the mapping from a VM concept could simply be an intent that the JIT would remap to specific behaviors. Does that align with your thinking?
I think so...? Once we have a concrete proposal we can see if we actually agree or not.
I'll take a stab at it, others feel free to weigh in
I'm fine with the rename, but if the goal is future-proofing, I'm both accepting and expecting that future changes to the functionality will probably result in more name changes. For example, if we introduced an interpreter in the future, the names would probably churn again, because in some places StartupTier refers to 'the 1st tier' and in other places it refers to 'the tier that has low-opt jitted code'.
I'm not sure what tiers would be referring to if they didn't refer to optimization levels? I do think there is value in separating instrumentation from optimization, but I would just label code with 2 variables: JitCompile(JITFLAG_INSTRUMENTATION_X | JITFLAG_OPTIMIZATION_TIER_Y).
I think as tiering matures we are likely to want a variety of instrumented options. For example in my tier2 experiment I had three that I treated distinctly (T0+low instrumentation, T1+low instrumentation, T1+high instrumentation). We could define a combo of instrumentation+optimization level as a tier and give each one a unique name, but personally I found it easier to program against the underlying two variables without trying to abstract it further.
👍
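For comparison, here is a sketch of the two modeling options discussed above: passing instrumentation and optimization level as two independent flag variables, versus naming each combination as a tier. All flag and tier names are illustrative and are not actual JIT interface flags:

```cpp
#include <cstdint>
#include <cstdio>

// Option A: two orthogonal variables combined into one flag word, along the
// lines of JitCompile(JITFLAG_INSTRUMENTATION_X | JITFLAG_OPTIMIZATION_TIER_Y).
enum : uint32_t
{
    JITFLAG_OPT_TIER0  = 0x01,
    JITFLAG_OPT_TIER1  = 0x02,
    JITFLAG_INSTR_LOW  = 0x10,
    JITFLAG_INSTR_HIGH = 0x20,
};

void JitCompile(uint32_t flags)
{
    printf("compiling with flags 0x%x\n", (unsigned)flags);
}

// Option B: give each instrumentation+optimization combination its own named tier
// (mirroring the three distinct combinations mentioned in the tier2 experiment).
enum class Tier { Tier0LowInstr, Tier1LowInstr, Tier1HighInstr };

uint32_t FlagsForTier(Tier t)
{
    switch (t)
    {
    case Tier::Tier0LowInstr:  return JITFLAG_OPT_TIER0 | JITFLAG_INSTR_LOW;
    case Tier::Tier1LowInstr:  return JITFLAG_OPT_TIER1 | JITFLAG_INSTR_LOW;
    case Tier::Tier1HighInstr: return JITFLAG_OPT_TIER1 | JITFLAG_INSTR_HIGH;
    }
    return JITFLAG_OPT_TIER0;
}

int main()
{
    JitCompile(JITFLAG_OPT_TIER1 | JITFLAG_INSTR_LOW); // option A: caller combines the two variables
    JitCompile(FlagsForTier(Tier::Tier1LowInstr));     // option B: the named tier implies the combination
}
```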
src/inc/corinfo.h (outdated)

```diff
@@ -832,7 +832,7 @@ enum CorInfoFlag
     CORINFO_FLG_INTRINSIC = 0x00400000, // This method MAY have an intrinsic ID
     CORINFO_FLG_CONSTRUCTOR = 0x00800000, // This method is an instance or type initializer
     CORINFO_FLG_AGGRESSIVE_OPT = 0x01000000, // The method may contain hot code and should be aggressively optimized if possible
-    // CORINFO_FLG_UNUSED = 0x02000000,
+    CORINFO_FLG_ALLOW_TIER0_TO_TIER1 = 0x02000000, // Indicates that for a tier 0 compilation request, the JIT may choose to switch to tier 1 if appropriate
```
I think of this slightly differently - that the runtime made a request for an indeterminate optimization level and the JIT resolved it to a specific optimization level based on some criteria. Although arguably this is just semantics, my concern about describing it this way is that new developers are likely to take flags/constants that say Tier0 at face value without necessarily noticing that there is a special case that converts 0 into 1. Using alternate terminology such as TIER0_OR_1 or UNRESOLVED_TIER (and not specifying TIER0) feels much harder to misunderstand.
The requested intent from the VM seems to be "JIT quickly". The JIT may choose to do so or not for various reasons. For example if the method only has a small loop and a large amount of code outside it may choose to optimize the loop and to not optimize code outside. In that case the VM would still have to consider that as unoptimized code. Or if the JIT finds that the tier 0 code would be too slow because of too many high-impact optimizations being missed, it may choose to generate fully optimized code, in which case the VM would consider that as optimized code. Maybe something like "PREFER_TIER0" would be clearer intent, or something more generic like "JIT_QUICKLY".
I think I'll leave the JIT config flags as they are for now. For this and the preview 4 PR I would like to start with some decent names for the config options. It may not be a big deal though if we have to rename them again.
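To illustrate the "intent" reading discussed in this sub-thread, here is a hypothetical sketch of a JIT resolving a "prefer tier 0, upgrade allowed" request to a concrete optimization level; the flag and function names are made up for illustration and do not reflect the actual corinfo interface:

```cpp
#include <cstdio>

// Hypothetical request flags from the runtime; names are illustrative only.
enum RequestFlags : unsigned { REQUEST_PREFER_TIER0 = 0x1, REQUEST_ALLOW_UPGRADE_TO_TIER1 = 0x2 };

// The concrete level the JIT ends up producing, reported back so the VM knows
// whether to treat the code as optimized or unoptimized.
enum class ResolvedTier { Tier0, Tier1 };

bool MethodHasLoops() { return true; } // placeholder for real analysis

ResolvedTier ResolveTier(unsigned requestFlags)
{
    if ((requestFlags & REQUEST_PREFER_TIER0) == 0)
        return ResolvedTier::Tier1;                  // full optimization was requested outright

    if ((requestFlags & REQUEST_ALLOW_UPGRADE_TO_TIER1) && MethodHasLoops())
        return ResolvedTier::Tier1;                  // upgrade, e.g. to avoid slow unoptimized loops

    return ResolvedTier::Tier0;
}

int main()
{
    ResolvedTier t = ResolveTier(REQUEST_PREFER_TIER0 | REQUEST_ALLOW_UPGRADE_TO_TIER1);
    printf("resolved to tier %d\n", t == ResolvedTier::Tier0 ? 0 : 1);
}
```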
Yea it could be done in different ways:
(1) may be easier to work with, but it kind of makes the tier irrelevant, and configuring the whole combination would be a bit challenging; for example, the call counting threshold for the interpreter and for quick JIT to transition to the next tier probably would not be the same. (2) makes it easy to tie config options to the tier, such as the call counting threshold and options for configuring modes of the interpreter, quick JIT, or instrumentation in that tier. Tiers would also map to code versions, so it would be easy to describe a code version with the tier name. I'm leaning towards (2) at the moment; with that, perhaps I would rename StartupTier to QuickJittedTier, and that would keep those config options more stable in the future. Thoughts?
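A small sketch of option (2), where each named tier carries its own configuration such as a call counting threshold; the tier names and values here are illustrative only, not the runtime's actual settings:

```cpp
#include <cstdio>

// Hypothetical per-tier configuration: each named tier owns its own settings,
// e.g. the call counting threshold for promotion to the next tier.
struct TierConfig
{
    const char* name;
    int         callCountThreshold; // calls before promotion; 0 = terminal tier
    bool        instrumented;
};

static const TierConfig g_tiers[] =
{
    { "InterpretedTier", 10, false }, // hypothetical future tier
    { "QuickJittedTier", 30, false },
    { "OptimizedTier",    0, false },
};

int main()
{
    for (const TierConfig& t : g_tiers)
        printf("%-16s threshold=%d instrumented=%d\n",
               t.name, t.callCountThreshold, (int)t.instrumented);
}
```

With this shape, adding an interpreter or an instrumented tier later would be a data change rather than a rename of the existing tiers.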
…T for loops

- Renamed tier 0 / tier 1 to StartupTier, OptimizedTier
- Added config option JitTier0ForLoops, which determines whether quick JIT, when enabled, may be used for methods that contain loops. Off by default, so after this change, QuickJit=1 would still not use quick JIT for methods that contain loops by default.
It looks like a tier 0 compilation is respecting explicit tail calls at least in some cases. I tried the following with QuickJit enabled.

IL:

```
.method private hidebysig static int32
CalculateSumWithTailCall(int32 n,
[opt] int32 sum) cil managed
{
.param [2] = int32(0x00000000)
// Code size 27 (0x1b)
.maxstack 8
IL_0000: ldarg.0
IL_0001: ldc.i4.0
IL_0002: bgt.s IL_0006
IL_0004: ldc.i4.0
IL_0005: ret
IL_0006: ldarg.1
IL_0007: ldarg.0
IL_0008: add
IL_0009: starg.s sum
IL_000b: ldarg.0
IL_000c: ldc.i4.1
IL_000d: bne.un.s IL_0011
IL_000f: ldarg.1
IL_0010: ret
IL_0011: ldarg.0
IL_0012: ldc.i4.1
IL_0013: sub
IL_0014: ldarg.1
IL_0015: tail. call int32 ExplicitTailCallNoSO::CalculateSumWithTailCall(int32,
int32)
IL_001a: ret
} // end of method ExplicitTailCallNoSO::CalculateSumWithTailCall
```

Effectively equivalent to the following, except with an explicit tail call:

```csharp
private static int CalculateSumWithTailCall(int n, int sum = 0)
{
if (n <= 0)
return 0;
sum += n;
if (n == 1)
return sum;
return CalculateSumWithTailCall(n - 1, sum);
}
```

```
; Assembly listing for method ExplicitTailCallNoSO:CalculateSumWithTailCall(int,int):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-0 compilation
; compiler->opts.MinOpts() is true
; rbp based frame
; fully interruptible
; Final local variable assignments
;
; V00 arg0 [V00 ] ( 1, 1 ) int -> [rbp+0x10]
; V01 arg1 [V01 ] ( 1, 1 ) int -> [rbp+0x18]
;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00] "OutgoingArgSpace"
; V03 tmp1 [V03 ] ( 1, 1 ) int -> [rbp-0x04] "arg temp"
;
; Lcl frame size = 16
G_M30597_IG01:
55 push rbp
4883EC10 sub rsp, 16
488D6C2410 lea rbp, [rsp+10H]
894D10 mov dword ptr [rbp+10H], ecx
895518 mov dword ptr [rbp+18H], edx
G_M30597_IG02:
837D1000 cmp dword ptr [rbp+10H], 0
7F08 jg SHORT G_M30597_IG04
33C0 xor eax, eax
G_M30597_IG03:
488D6500 lea rsp, [rbp]
5D pop rbp
C3 ret
G_M30597_IG04:
8B4518 mov eax, dword ptr [rbp+18H]
034510 add eax, dword ptr [rbp+10H]
894518 mov dword ptr [rbp+18H], eax
837D1001 cmp dword ptr [rbp+10H], 1
7509 jne SHORT G_M30597_IG06
8B4518 mov eax, dword ptr [rbp+18H]
G_M30597_IG05:
488D6500 lea rsp, [rbp]
5D pop rbp
C3 ret
G_M30597_IG06:
8B4510 mov eax, dword ptr [rbp+10H]
FFC8 dec eax
8945FC mov dword ptr [rbp-04H], eax
8B45FC mov eax, dword ptr [rbp-04H]
894510 mov dword ptr [rbp+10H], eax
EBCA jmp SHORT G_M30597_IG02
```

Are there explicit tail call cases where a tier 0 compilation would overflow the stack and a tier 1 compilation would not?
Although methods like the above should be treated as though they contain a loop.
Fine by me. To whatever limited degree I can predict the future, I agree this tier name sounds less likely than others to need a future change.
Re QuickJittedTier: using that would probably also mean adding a separate tier, PrecompiledTier, such that precompiled code would not be in the QuickJittedTier. I think it would be useful anyway in PerfView to be able to distinguish time spent in precompiled vs quick-jitted code, although there may be other ways to get that info. There may be other benefits to distinguishing those. I figure if for a method the precompiled code is good enough to be considered fully optimized, it would instead be in the OptimizedTier.
Closing for now
Depends on #23599