JIT: efficient profiling schemes #46882
### Some notes on optimal profiling

Following ideas in Ball's paper Optimally Profiling and Tracing Programs, we wish to minimize the overall size and time cost of block profiling. The paper shows efficient profiling can greatly reduce instrumentation overhead, and we should see similar benefits. However, there are several novel factors to take into consideration.
The approach outlined below incorporates ideas from the weighted spanning tree approach but is fairly simplistic.

### Background

Ball's approach is to instrument edges, not blocks; he shows that edge-based instrumentation is in general more efficient. The approach first forms a maximum weight spanning tree on a slightly modified CFG. The weights reflect the cost of instrumenting an edge; roughly speaking this is the normalized execution frequency of the edge, so all weights are non-negative. The flow graph is modified to add synthetic edges from each exit point back to the method entry. The maximum weight spanning tree is then found using a standard algorithm; note that finding maximum weight spanning trees in directed graphs is somewhat more costly than for undirected graphs. The synthetic edges are handled specially and must be non-tree edges.

Once the spanning tree is found, instrumentation is added to the non-tree edges. Critical edges that need instrumentation will require splitting; otherwise instrumentation can be added either to the source or the target of the edge.

To reconstruct the full set of block and edge counts, we can use a worklist algorithm. Edges that were instrumented either at source or target immediately supply counts for the associated blocks. We then iterate on blocks looking for cases where the block count is known and just one edge count is unknown, and deduce the missing edge counts; this process converges. Note that because the number of instrumentation probes is minimal, any set of values for the counters represents a unique and consistent set of counts; thus we do not expect count inconsistencies to arise during reconstruction (though the reconstructed counts may not match the actual execution counts). More on this later.

### Observations

We first note that Ball's approach generally will instrument returns instead of method entries; the method entry count can be deduced by summing up all the return counts. While this may seem odd it actually provides some nice advantages; in particular it "solves" a case that is problematic currently -- the case where the IL branches back to offset 0.

We also note that for an SCC, enough edges must be instrumented to disconnect the SCC into a dag. In particular, in simple loops the loop back edge will be instrumented.

### Proposal

Given the various constraints and the observations above, I propose we implement the following.

We do the instrumentation planning quite early, basically as the very first phase or as part of the pre-phase. At this point we have empty basic blocks with correctly set bbJumpKinds, and so can enumerate block successors. We do not have any representation for edges, so will need to create something.

We evolve a spanning tree via a greedy DFS. This preferentially visits critical edges and preferentially avoids edges from non-rare blocks to rare blocks.
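To make the shape of that DFS concrete, here is a minimal standalone sketch (hypothetical `Block`/`Edge` types and helper names, not the JIT's actual data structures): critical edges are claimed for the tree first so they won't need probes or splits, and edges into rare blocks are claimed last so probes tend to land on cold paths.

```cpp
// Sketch of a greedy DFS spanning-tree builder for sparse edge profiling.
// Standalone, hypothetical types -- the real JIT works on its BasicBlock /
// flow edge structures and runs this planning before importation.
#include <algorithm>
#include <vector>

struct Edge
{
    int  from;
    int  to;
    bool critical; // neither endpoint can host the probe; instrumenting means splitting
    bool tree = false;
};

struct Block
{
    bool             rare = false; // e.g. a throw block or other cold code
    std::vector<int> succEdges;    // indices into the edge table
};

// Grow a spanning tree from the entry block. Tree edges need no probe;
// every non-tree edge (including the synthetic EXIT->ENTRY edge) gets one.
void buildSpanningTree(std::vector<Block>& blocks, std::vector<Edge>& edges, int entry)
{
    std::vector<bool> visited(blocks.size(), false);
    std::vector<int>  stack{entry};
    visited[entry] = true;

    while (!stack.empty())
    {
        int blockId = stack.back();
        stack.pop_back();

        // Order successors so critical edges are claimed for the tree first,
        // and edges into rare blocks are claimed last.
        std::vector<int> order = blocks[blockId].succEdges;
        std::sort(order.begin(), order.end(), [&](int a, int b) {
            auto rank = [&](const Edge& e) {
                return (e.critical ? 0 : 1) + (blocks[e.to].rare ? 2 : 0);
            };
            return rank(edges[a]) < rank(edges[b]);
        });

        for (int edgeId : order)
        {
            Edge& e = edges[edgeId];
            if (!visited[e.to])
            {
                e.tree        = true; // tree edge: no probe needed
                visited[e.to] = true;
                stack.push_back(e.to);
            }
            // else: non-tree edge; it will receive an instrumentation probe
        }
    }
}
```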
Still to be determined is whether we create schema records on the fly as we do the above, or simply do bookkeeping and defer schema creation until we add instrumentation. Schema records for counts will contain both the source and target IL offsets. If we do create schema entries on the fly during the DFS we may want to sort them after the fact (say by ascending source IL offset) so that subsequent lookups can binary search.

We run reconstruction at basically the same point, following Ball's approach. To make this efficient we may need to build some sort of priority queue, but initially we can just keep a simple worklist of unresolved blocks and iterate through it; there should always be at least one block on the list whose counts can be resolved. Note this means we also have edge weights available "early" -- so we should also consider setting those values and/or starting the work on converting them to successor likelihoods.

For partially imported methods we'll still reconstruct counts across the entire flow graph; at this early stage we haven't yet discarded any blocks. Once importation happens the counts may well end up inconsistent, but that reflects a lack of context sensitivity, not some defect in the instrumentation.
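A rough sketch of that reconstruction pass, using the same kind of hypothetical standalone types as above: probed edges start out known, and we repeatedly either sum a block's fully-known edges to get its count, or subtract the known edges from a known block count to recover the single remaining unknown edge.

```cpp
// Sketch of iterative count reconstruction for sparse edge profiling.
// Hypothetical standalone types; the real JIT walks its flow graph directly.
#include <cstdint>
#include <vector>

struct EdgeCount
{
    bool     known = false; // true for probed edges before reconstruction starts
    uint64_t count = 0;
};

struct BlockCount
{
    bool             known = false;
    uint64_t         count = 0;
    std::vector<int> inEdges;  // indices into the edge table
    std::vector<int> outEdges; // (the synthetic EXIT->ENTRY edge appears here too)
};

// Iterate until no more counts can be deduced. With a spanning-tree based
// probe placement this determines every block and edge count.
bool reconstruct(std::vector<BlockCount>& blocks, std::vector<EdgeCount>& edges)
{
    bool progress = true;
    while (progress)
    {
        progress = false;
        for (BlockCount& b : blocks)
        {
            for (std::vector<int>* side : {&b.inEdges, &b.outEdges})
            {
                uint64_t sum          = 0;
                int      unknownEdge  = -1;
                int      unknownCount = 0;
                for (int e : *side)
                {
                    if (edges[e].known)
                    {
                        sum += edges[e].count;
                    }
                    else
                    {
                        unknownEdge = e;
                        unknownCount++;
                    }
                }
                if (!b.known && (unknownCount == 0))
                {
                    // All edges on this side are known: the block count is their sum.
                    b.known  = true;
                    b.count  = sum;
                    progress = true;
                }
                else if (b.known && (unknownCount == 1))
                {
                    // Exactly one unknown edge: solve for it by subtraction.
                    edges[unknownEdge].known = true;
                    edges[unknownEdge].count = b.count - sum;
                    progress                 = true;
                }
            }
        }
    }

    for (const BlockCount& b : blocks)
    {
        if (!b.known) return false;
    }
    for (const EdgeCount& e : edges)
    {
        if (!e.known) return false;
    }
    return true;
}
```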
### An Example

Let's see how the ideas above hold up on a realistic example: runtime/src/libraries/System.Private.CoreLib/src/System/Array.cs, lines 2063 to 2085 at commit acd4855.
The early flow graph looks like the following, with each block annotated with its start and end IL offsets in hex:

Here critical edges are red and the synthetic edge from EXIT to ENTER is dashed. The greedy DFS produces the following spanning tree, with tree edges in bold and non-tree edges dashed:

There are a total of 5 probes needed, and two of them require edge splits:
Thus we'd need a total of 5 probes instead of the 9 we'd currently use with block profiling. (Dynamic probe cost is assumed to be dominated by the inner loop.)

Now imagine we end up with the following count values:
To reconstruct, we find nodes where all incoming edge counts are known, and then solve for any outgoing edge counts that are now determined.
Note that no matter the values of A, B, C, D, E we have a consistent solution, and if all counts are non-negative then all block weights are non-negative. However, we have two edges whose weights are determined via subtraction.
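As a hypothetical illustration of what that means: if a block's count is known to be 100 and one of its two outgoing edges carries a probed count of 60, the other edge is assigned 100 - 60 = 40. If the raw counter values are mutually inconsistent (say because counter updates are not synchronized), such a derived count could even come out negative, whereas directly probed counts never are.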
FYI @dotnet/jit-contrib @davidwrighton -- this should cut down the number of count probes by a factor of two or so, and reduce runtime overhead somewhat. But it won't easily be compatible with IBC, so if/when I start implementing this, it will be opt-in behavior.
Prototyping the above on the EH example:
Data from the prototype, crossgenning SPC.

All methods:
Ignoring methods with just 1 block, which we could trivially skip with a block instrumentation scheme too:
So, as expected, we're eliminating over half the probes with this approach.
One unresolved issue from the prototype is the handling of BBJ_THROW blocks. My initial thought was that these should be treated similarly to BBJ_RETURN (and so will end up getting instrumented); Ball suggests instead that the pseudo-edge target EXIT and only accumulate a count if the throw actually happens. I think this amounts to the same thing, and simply instrumenting throws as if they were returns will work out, save perhaps in the rare case where a method throws an exception and then catches it.
Given all the above, the first bits of prospecting for efficient edge instrumentation look promising, so I'll start working on actually doing the instrumentation and producing the necessary schema, and from there work on count reconstruction.
### Some thoughts on instrumenting partially imported methods

If we ever have ambitions of instrumenting optimized code, or of enabling some of the importer optimizations in Tier0, we may find cases where we only import parts of a method that we also wish to instrument. There is no point in emitting instrumentation code for the parts of the method that aren't imported, but it's less clear whether the instrumentation schema we produce should reflect the entire method or just the part of the method that was imported.

The concern is that if the schema just reflects the imported subset, we could end up with a number of different, hard-to-reconcile schemas for a method (say, for instance, we run the instrumented code on different OSes or ISAs and it imports differently based on that). If so, it may not be possible to merge the schemas (and any associated data) into one master schema. If we always produce a schema that reflects the entire method's IL then we avoid this problem; all schemas will agree.

But it also seems we could initially build the full schema, and then prune out any schema entry that would be associated with a suppressed probe (an edge probe that would be placed in an un-imported block, or a critical edge probe between two un-imported blocks). Then the diverse schemas are mutually compatible and, in principle, simply mergeable.

In the near term we don't need to solve this problem -- though we might choose to anyway, for methods with clearly unreachable code.
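A small sketch of how that build-the-full-schema-then-prune idea could look (hypothetical types and an assumed imported-range check; not the actual JIT-EE schema representation):

```cpp
// Sketch: build the schema for the whole method, then prune entries whose
// probe would sit entirely in code the importer never reached.
#include <algorithm>
#include <vector>

struct ILRange
{
    unsigned start; // inclusive
    unsigned end;   // exclusive
};

struct EdgeSchemaEntry
{
    unsigned sourceILOffset;
    unsigned targetILOffset;
    bool     isCriticalEdgeProbe; // probe requires splitting the edge
    unsigned probeILOffset;       // block hosting the probe when the edge isn't split
};

static bool isImported(const std::vector<ILRange>& imported, unsigned ilOffset)
{
    for (const ILRange& r : imported)
    {
        if ((ilOffset >= r.start) && (ilOffset < r.end))
        {
            return true;
        }
    }
    return false;
}

// Surviving entries keep their full-method IL offsets, so schemas produced by
// different runs (which may import different subsets) remain mergeable.
void pruneUnimportedProbes(std::vector<EdgeSchemaEntry>& schema,
                           const std::vector<ILRange>&   imported)
{
    auto suppressed = [&](const EdgeSchemaEntry& e) {
        if (e.isCriticalEdgeProbe)
        {
            // A critical-edge probe is suppressed only when both endpoints are un-imported.
            return !isImported(imported, e.sourceILOffset) &&
                   !isImported(imported, e.targetILOffset);
        }
        // Otherwise the probe lives in a single block; suppress it if that block
        // wasn't imported.
        return !isImported(imported, e.probeILOffset);
    };

    schema.erase(std::remove_if(schema.begin(), schema.end(), suppressed), schema.end());
}
```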
It's looking inevitable that edge profiling will need to handle edges involving BBF_INTERNAL blocks. Among other things the entry block can be marked this way, if there's a try region starting at IL offset 0:
and those blocks have plausible-looking IL offsets. My thought is to flag those blocks by using, say, a special key value. This means edge-based instrumentation (unlike block-based) needs to look at BBF_INTERNAL blocks during schema building and instrumentation. So I will do one more refactoring of the base support to enable this variant behavior.
Trying to build on #47646 and #47476, there's a snag -- we incorporate before import, and instrument after. This violates my rule of thumb that instrumentation and incorporation must happen at the same point. Since incorporation has to happen before importation, this implies some aspect of instrumentation has to happen then too. We can't fully instrument before importation since we don't want to instrument blocks that end up not getting imported. But I think we can have the sparse instrumenter plan its work before importation and then build the schema and instrument after importation. Doing so will avoid the need to key any internal blocks, which is a bonus. But it's possible the FG shape is a bit odd before we get through the "eh canon" stuff. Will have to see.
Working on the before/after import split approach:
A few new wrinkles:
Reconstruction finally showing signs of life.
Running through Pri1 tests -- have 30-odd asserting cases. Once those are good I want to do some more spot checking, or a run with more aggressive assertions turned on (e.g. we assert if we fail to solve), and look into how many solver passes are needed in the worst case. Looks like OSR poses a new set of complications.
For now I'm thinking we can just drop back to block instrumentation if OSR is enabled and we have patchpoints in the method.
Note even with block profiles the OSR profile data is a bit funky:
The original method never made it out of the BB02/BB03 loop, so the return counts are all zero (and hence edge instrumentation would impute a method entry count of zero). The new OSR entry profile count is not set right. Opened #47942 to sort all this out.
Pri1 tests are mostly running without issue, but I hit one oddity. We are contaminating the inlinee compiler with details of the root method EH. It's not clear why. This makes walking the inlinee flow graph tricky, as part of the walk needs to consider EH entries, and so we wander off into the root method blocks. Hopefully all this is unnecessary and will just cause confusion. If we ever decide to support inlining methods with EH we might need to reconsider, but my guess is we would not just concatenate inlinee entries; they'd need to be more carefully merged into the root table. (See runtime/src/coreclr/jit/fgbasic.cpp, lines 2368 to 2373 at commit 0429344.)
Add a new instrumentation mode that only instruments a subset of the edges in the control flow graph. This reduces the total number of counters and so has less compile time and runtime overhead than instrumenting every block. Add a matching count reconstruction algorithm that recovers the missing edge counts and all block counts. See #46882 for more details on this approach. Also in runtime pgo support, add the header offset to the copy source. This fixes #47930.
Implemented via #47959.
The jit has traditionally done simplistic basic block profiling: each block gets instrumented with a counter probe.
More space- and time-efficient methods are well known, see for example Ball and Larus's Optimally Profiling and Tracing Programs. Generally these techniques profile edges, not blocks, and use a maximum-weight spanning tree to identify a minimal-cost set of edges that require probes.
With the advent of more flexible instrumentation schema in #46638 we can now consider changing the probing scheme used by the jit. Uniquely identifying edge probes requires two IL offsets; the new schema can support this.
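Conceptually, each edge count record then needs to carry two IL offsets rather than one; a minimal sketch of such a record (hypothetical field names, not the actual schema type) might be:

```cpp
// Sketch: conceptual shape of an edge-count instrumentation record.
// Hypothetical names; the actual schema layout comes from the work in #46638.
#include <cstdint>

struct EdgeCountRecord
{
    uint32_t sourceILOffset; // IL offset of the block the edge leaves
    uint32_t targetILOffset; // IL offset of the block the edge enters
    uint64_t count;          // value accumulated by the probe at runtime
};
```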
This issue is to explore adopting these techniques to reduce both the performance impact of instrumentation and the size required to hold the instrumentation-produced data.
category:cq
theme:profile-feedback
skill-level:expert
cost:medium