JIT: efficient profiling schemes #46882

Closed · AndyAyersMS opened this issue Jan 12, 2021 · 18 comments

Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)
Milestone: 6.0.0
Assignee: AndyAyersMS

AndyAyersMS (Member) commented Jan 12, 2021

The jit has traditionally done simplistic basic block profiling: each block gets instrumented with a counter probe.

More space- and time-efficient methods are well known, see for example Ball and Larus's Optimally Profiling and Tracing Programs. Generally these techniques profile edges, not blocks, and use a maximum-weight spanning tree to identify a minimal-cost set of edges that require probes.

With the advent of more flexible instrumentation schema in #46638 we can now consider changing the probing scheme used by the jit. Uniquely identifying edge probes requires two IL offsets; the new schema can support this.

This issue is to explore adopting these techniques to reduce both the performance impact of instrumentation and the size required to hold the resulting profile data.

category:cq
theme:profile-feedback
skill-level:expert
cost:medium

AndyAyersMS mentioned this issue Jan 12, 2021 (54 tasks)
AndyAyersMS added this to the 6.0.0 milestone Jan 12, 2021
AndyAyersMS added the area-CodeGen-coreclr label Jan 12, 2021
AndyAyersMS self-assigned this Jan 14, 2021
AndyAyersMS (Member Author) commented:

Some notes on optimal profiling

Following ideas in Ball's paper Optimally Profiling and Tracing Programs, we wish to minimize the overall size and time cost of block profiling. The paper shows efficient profiling can greatly reduce instrumentation overhead, and we should see similar benefits.

However, there are several novel factors to take into consideration:

  • we're unlikely to have meaningful edge frequency data, except in some special cases. In particular we won't have IR, so any sort of IR-based heuristic for guessing at branch likelihoods is not possible. So we can form only an approximate notion of a maximum weight spanning tree.
  • the flow graph may be disconnected because of EH.
  • we generally don't need to track flow across EH edges, save perhaps for the flow from a call finally to its target and back.
  • we need to make sure we can recover counts in cases where methods are partially imported.
  • we need to make sure we can handle cases where methods branch to IL offset 0.
  • we won't have, and likely can't afford to run, loop or SCC detection up front.

The approach outlined below incorporates ideas from the weighted spanning tree approach but is fairly simplistic.

Background

Ball's approach is to instrument edges, not blocks; he shows that edge-based instrumentation is in general more efficient.

The approach first forms a maximum weight spanning tree on a slightly modified CFG. The weights reflect the cost of instrumenting an edge; roughly speaking this is the normalized execution frequency of the edge, so all weights are non-negative. The flow graph is modified to add synthetic edges from each exit point back to the method entry.

The maximum weight spanning tree is then found using a standard algorithm; note that finding maximum weight spanning trees in directed graphs is somewhat more costly than for an undirected graph. The synthetic edges are handled specially and must be non-tree edges.

Once the spanning tree is found, instrumentation is then added to non-tree edges. Critical edges that need instrumentation will require splitting; otherwise instrumentation can be added either to the source or the target of the edge.
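
To make this concrete, below is a minimal sketch in C# of the probe-placement step, using Kruskal's algorithm with union-find over the undirected view of the CFG. The types, fields, and weights here are illustrative assumptions, not the JIT's actual representation:

    using System.Collections.Generic;
    using System.Linq;

    class Edge
    {
        public int Source, Target;   // block ids
        public double Weight;        // estimated execution frequency (heuristic)
        public bool IsSynthetic;     // exit -> entry pseudo-edge; must be a non-tree edge
        public bool NeedsProbe;
    }

    static class SpanningTree
    {
        // Union-find over block ids, with path compression.
        static int Find(int[] parent, int x)
            => parent[x] == x ? x : (parent[x] = Find(parent, parent[x]));

        // Heavier edges are considered first so they land in the tree and need
        // no counter; everything left over (including every synthetic edge)
        // gets a probe.
        public static List<Edge> PlaceProbes(int blockCount, List<Edge> edges)
        {
            int[] parent = Enumerable.Range(0, blockCount).ToArray();
            var probes = new List<Edge>();
            foreach (Edge e in edges.OrderByDescending(e => e.Weight))
            {
                int a = Find(parent, e.Source), b = Find(parent, e.Target);
                if (!e.IsSynthetic && a != b)
                {
                    parent[a] = b;       // tree edge: its count is recoverable later
                }
                else
                {
                    e.NeedsProbe = true; // non-tree edge: instrument it
                    probes.Add(e);
                }
            }
            return probes;
        }
    }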

To reconstruct the full set of block and edge counts, we can use a worklist algorithm. Edges that were instrumented either at source or target immediately supply counts for the associated blocks. We then iterate on blocks looking for cases where the block count is known and just one incident edge count is unknown, and deduce the missing edge count; this process converges.

Note that because the number of instrumentation probes is minimal any set of values for the counters represents a unique and consistent set of counts; thus we do not expect count inconsistencies to arise during reconstruction (though the reconstructed counts may not match the actual execution counts). More on this later.
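
A minimal sketch of that worklist reconstruction, again with illustrative types (flow conservation at each block holds because of the synthetic exit-to-entry edges):

    using System.Collections.Generic;
    using System.Linq;

    class CountedEdge { public double? Count; }

    class CountedBlock
    {
        public List<CountedEdge> In = new(), Out = new();
        public double? Count;
    }

    static class Reconstruction
    {
        public static void Solve(List<CountedBlock> blocks)
        {
            bool progress = true;
            while (progress)   // converges because the probe placement was minimal
            {
                progress = false;
                foreach (CountedBlock b in blocks)
                {
                    progress |= TrySolve(b, b.In);
                    progress |= TrySolve(b, b.Out);
                }
            }
        }

        // If every edge on one side of a block is known, the block count is
        // their sum; if the block count is known and exactly one edge on a
        // side is unknown, deduce that edge by subtraction.
        static bool TrySolve(CountedBlock b, List<CountedEdge> side)
        {
            List<CountedEdge> unknown = side.Where(e => e.Count == null).ToList();
            double knownSum = side.Sum(e => e.Count ?? 0);

            if (b.Count == null && unknown.Count == 0)
            {
                b.Count = knownSum;
                return true;
            }
            if (b.Count != null && unknown.Count == 1)
            {
                unknown[0].Count = b.Count.Value - knownSum;
                return true;
            }
            return false;
        }
    }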

Observations

We first note that Ball's approach generally will instrument returns instead of method entries; the method entry count can be deduced by summing all the return counts. While this may seem odd it actually provides some nice advantages; in particular it "solves" a case that is problematic currently -- the case where the IL branches back to offset 0.

We also note that for an SCC, enough edges must be instrumented to disconnect the SCC into a DAG. In particular, in simple loops, loop back edges will be instrumented.

Proposal

Given the various constraints and the observations above, I propose we implement the following.

We do the instrumentation planning quite early, basically as the very first phase or as part of the pre-phase. At this point we have empty basic blocks with correctly set bbJumpKinds, and so can enumerate block successors. We do not have any representation for edges, so we will need to create something.

We evolve a spanning tree via a greedy DFS that preferentially visits critical edges and preferentially avoids edges from non-rare blocks to rare blocks (see the sketch after the list below).

  • when we find a back or non-tree edge, we decide whether we should instrument the source, target, or need to split the edge. For blocks we might convey this information by setting a block flag.
  • when we reach a return we mark its "edge" to the entry block as needing instrumentation; since the source block has no control flow this means we commit to instrumenting each return block.
  • EH is handled by initiating a new DFS from each unvisited handler entry; this effectively treats each EH handler region as a separate flow graph. EH handler exits are treated analogously to method returns; the upshot is that we will also instrument EH "returns".
  • This information guides schema creation and insertion of counter probes.
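
A rough sketch of that greedy DFS, with illustrative types (the real planner also has to cope with BBF_INTERNAL blocks and incomplete flow, discussed later in this thread):

    using System.Collections.Generic;
    using System.Linq;

    class FlowEdge
    {
        public FlowBlock Target;
        public bool IsCritical, NeedsProbe;
    }

    class FlowBlock
    {
        public List<FlowEdge> Successors = new();
        public bool IsReturn, IsRare, NeedsProbe, Visited;
    }

    static class GreedyDfsPlanner
    {
        public static void Plan(FlowBlock entry, IEnumerable<FlowBlock> handlerEntries)
        {
            Visit(entry);
            // EH: each unvisited handler entry roots its own DFS, so we end up
            // with a spanning forest rather than a single tree.
            foreach (FlowBlock h in handlerEntries)
                if (!h.Visited)
                    Visit(h);
        }

        static void Visit(FlowBlock b)
        {
            b.Visited = true;
            if (b.IsReturn)
            {
                // The pseudo-edge back to the entry is always a non-tree edge,
                // so every return block gets a probe.
                b.NeedsProbe = true;
                return;
            }
            // Prefer critical edges (so they become probe-free tree edges and
            // never need splitting), and avoid descending into rare blocks.
            foreach (FlowEdge e in b.Successors
                         .OrderByDescending(e => e.IsCritical)
                         .ThenBy(e => e.Target.IsRare))
            {
                if (!e.Target.Visited)
                    Visit(e.Target);      // tree edge: no probe
                else
                    e.NeedsProbe = true;  // back/cross edge: probe at source,
                                          // target, or a split block
            }
        }
    }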

Still to be determined is whether we create schema records on the fly as we do the above, or simply do bookkeeping and defer schema creation until we add instrumentation.

Schema records for counts will contain both the source and target IL offsets. If we do create schema entries on the fly during the DFS, we may want to sort them after the fact (say, by ascending source IL offset) so that subsequent lookups can binary search.
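
As a sketch of what such a sorted lookup could look like (the record shape here is a guess at the schema contents, not the actual format):

    using System;

    // Hypothetical edge-count schema entry keyed by two IL offsets.
    readonly record struct EdgeCountRecord(uint SourceIlOffset, uint TargetIlOffset, int CounterIndex);

    static class SchemaLookup
    {
        // Assumes the schema is sorted by ascending SourceIlOffset; returns the
        // index of the first record with the given source offset, or -1.
        public static int FindFirstWithSource(EdgeCountRecord[] schema, uint source)
        {
            int lo = 0, hi = schema.Length - 1, result = -1;
            while (lo <= hi)
            {
                int mid = lo + (hi - lo) / 2;
                if (schema[mid].SourceIlOffset >= source) { result = mid; hi = mid - 1; }
                else lo = mid + 1;
            }
            // The caller scans forward from here over records sharing this
            // source offset to match the target offset.
            return (result >= 0 && schema[result].SourceIlOffset == source) ? result : -1;
        }
    }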

We run reconstruction at basically the same point, following Ball's approach. To make this efficient we may need to build some sort of priority queue, but initially we can just keep a simple worklist of unresolved blocks and iterate through that; there should always be at least one block on the list whose counts can be resolved.

Note this means we also have edge weights available "early" -- so we should also consider setting those values and/or starting the work on converting them to successor likelihoods.

For partially imported methods we'll still reconstruct counts across the entire flow graph; at this early stage we haven't yet discarded any blocks. Once the importation happens the counts may well end up inconsistent, but that reflects lack of context sensitivity, not some defect in the instrumentation.

@AndyAyersMS
Copy link
Member Author

An Example

Let's see how the ideas above hold up on a realistic example, InsertionSort on arrays.

    private void InsertionSort(int lo, int hi)
    {
        int i, j;
        object t;
        object? ti;
        for (i = lo; i < hi; i++)
        {
            j = i;
            t = keys[i + 1];
            ti = items?[i + 1];
            while (j >= lo && comparer.Compare(t, keys[j]) < 0)
            {
                keys[j + 1] = keys[j];
                if (items != null)
                    items[j + 1] = items[j];
                j--;
            }
            keys[j + 1] = t;
            if (items != null)
                items[j + 1] = ti;
        }
    }

The early flow graph looks like the following, with each block annotated with start and end IL offsets in hex:
[figure: early flow graph, blocks annotated with start and end IL offsets in hex]

Here critical edges are red and the synthetic edge from EXIT to ENTER is dashed.

The greedy DFS produces the following spanning tree, with tree edges in bold, and non-tree edges dashed:
[figure: greedy DFS spanning tree; tree edges bold, non-tree edges dashed]

There are a total of 5 probes needed, and two of them require edge splits:

  • BB03->BB04 edge can be handled by instrumenting BB03
  • BB07->BB08 edge can be handled by instrumenting BB07
  • BB09->BB01 edge can be handled by instrumenting BB09
  • BB06->BB03 edge requires splitting
  • BB06->BB07 edge requires splitting

Thus we'd need a total of 5 probes instead of the 9 we'd currently use with block profiling. Assuming dynamic probe cost is dominated by the inner loop BB04, BB05, BB06, BB03 we go from having 4 probes in the loop down to 3.

Now imagine we end up with the following count values:

  • BB03->BB04 A
  • BB07->BB08 B
  • BB09->BB01 C
  • BB06->BB03 D
  • BB06->BB07 E

To reconstruct, we find nodes where all incoming edge counts are known, and then solve for any outgoing edge counts that are thereby determined:

  • BB03's count is A
  • BB07's count is B
  • BB09's count is C
  • BB01's count is C
  • BB08's count is B + C
  • BB02's count is B
  • BB04's count is A + B
  • BB06's count is D + E
  • BB05's count is A + E

[figure: flow graph annotated with the reconstructed counts]

Note that no matter the values of A, B, C, D, and E we have a consistent solution, and if all counts are non-negative then all block weights are non-negative.

However, we have two edges whose weights are determined via subtraction: BB05->BB03's weight is A - D, and BB04->BB07's weight is B - E. So it's not the case that an arbitrary set of values will lead to a fully non-negative solution; we'll have to watch out when doing subtractions.
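
A tiny worked check with hypothetical counter values makes the hazard concrete (the clamp below is my own defensive suggestion, not necessarily what the JIT does):

    using System;

    // With A=100, B=40, D=25, E=35 the subtraction-derived edges are
    //   BB05->BB03 = A - D = 75 and BB04->BB07 = B - E = 5.
    // But racy, unsynchronized counters could make E exceed B, driving B - E
    // negative; clamping keeps the reconstructed weights non-negative.
    double A = 100, B = 40, D = 25, E = 35;
    double bb05_to_bb03 = Math.Max(0, A - D);   // 75
    double bb04_to_bb07 = Math.Max(0, B - E);   // 5
    Console.WriteLine($"{bb05_to_bb03}, {bb04_to_bb07}");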

AndyAyersMS (Member Author) commented:

FYI @dotnet/jit-contrib @davidwrighton

This should cut down the number of count probes by a factor of two or so, and reduce runtime overhead somewhat. But it won't easily be compatible with IBC, so if/when I start implementing this, it will be opt-in behavior.

AndyAyersMS (Member Author) commented Jan 15, 2021

EH Example

We now consider a simple method with some EH constructs:

    public static int Main(string[] args)
    {
        int result = 0;
        try
        {
            using (StreamReader sr = new StreamReader(args[0]))
            {
                string line;
                while ((line = sr.ReadLine()) != null)
                {
                    Console.WriteLine(line);
                }
            }

            result = 100;
        }
        catch (Exception e)
        {
            Console.WriteLine(e.Message);
        }

        return result;
    }

The flow graph for this is shown below, with boxes used to show which nodes belong to the various EH regions.

Note that each EH handler region has a dotted edge from region exit back to entry, as does the "main" method region outside of any EH. There is no explicit flow to the starts of the handlers. There's a dotted line from the catch return back to the main method; this represents non-splittable EH flow, which we will subsequently ignore. As before, the red edges represent critical edges.

[figure: flow graph with EH regions boxed; critical edges red, synthetic edges dashed]

The spanning tree for this (a spanning forest, actually) is shown below. There are 5 edges that require instrumentation (out of 12 nodes).
[figure: spanning forest for the EH example]

Here we would instrument as follows:

  • BB08->BB09 in BB08
  • BB09->BB07 in BB09
  • BB04->BB05 in BB04
  • BB11->BB11 in BB11
  • BB12->BB01 in BB12

So no edge splitting would be needed.

AndyAyersMS (Member Author) commented:

Prototyping the above, on the EH example:

In fgInstrumentSpanningTree()

New BlockSet epoch 1, # of blocks (including unused BB00): 13, bitset array size: 1 (short)
0: Instrument handler leave edge BB11->BB11
1: Instrument handler leave edge BB09->BB07
2: Instrument edge BB08->BB09
3: Instrument edge BB04->BB05
4: Instrument return edge BB12->BB01
12 blocks, 5 probes

AndyAyersMS (Member Author) commented:

Data from the prototype, crossgenning SPC (System.Private.CoreLib):

All methods

 29777 methods
116348 blocks
 51918 probes

Ignoring methods with just 1 block, which we could trivially skip with a block instrumentation scheme too:

 11443 methods
 98014 blocks
 51218 probes

So, as expected, we're eliminating roughly half of the probes with this approach.

AndyAyersMS (Member Author) commented:

One unresolved issue from the prototype is the handling of BBJ_THROW blocks. My initial thought was that these should be treated similarly to BBJ_RETURN (and so will end up getting instrumented); Ball suggests instead that the pseudo-edge should target EXIT and accumulate a count only if the throw actually happens.

I think this amounts to the same thing, and simply instrumenting throws as if they were returns will work out, save perhaps in the rare case where a method throws an exception and then catches it.

AndyAyersMS (Member Author) commented:

A few cases have as many edge probes as block probes, e.g. System.Delegate:RemoveAll(System.Delegate,System.Delegate):System.Delegate:
[figure: flow graph for System.Delegate:RemoveAll]

Critical edge probe density (out of all probes) is around 20%. About 450 methods (out of 11433) have more than 50% of their probes as critical probes. This measure is interesting because critical probes require splitting the edge and so incur an extra jump, making them more costly in terms of both code size and execution time.

A few methods have a lot of critical probes, e.g. System.Numerics.Matrix4x4:get_IsIdentity():bool:this has 17 blocks, 16 probes, and 14 of those probes require edge splitting:

[figure: flow graph for System.Numerics.Matrix4x4:get_IsIdentity]

AndyAyersMS (Member Author) commented:

Given all the above, the first bits of prospecting for efficient edge instrumentation look promising, so I'll start working on actually doing the instrumentation and producing the necessary schema, and from there work on count reconstruction.

AndyAyersMS (Member Author) commented:

Some thoughts on instrumenting partially imported methods

If we ever have ambitions of instrumenting optimized code, or of enabling some of the importer optimizations in Tier0, we may find cases where we only import parts of a method that we also wish to instrument.

There is no point in emitting instrumentation code for the parts of the method that aren't imported, but it's less clear whether the instrumentation schema we produce should reflect the entire method or just the part of the method that was imported.

The concern is that if the schema just reflects the imported subset, we could end up with a number of different, hard-to-reconcile schemas for a method (say, for instance, we run the instrumented code on different OSes or ISAs and it imports differently based on that). If so, it may not be possible to merge the schemas (and any associated data) into one master schema.

If we always produce a schema that reflects the entire method's IL then we avoid this problem; all schemas will agree.

But it also seems we could initially build the full schema, and then prune out any schema entry that would be associated with a suppressed probe (an edge probe that would be placed in an un-imported block, or a critical edge probe between two un-imported blocks). Then the diverse schemas are mutually compatible and, in principle, simply mergeable.
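
A sketch of that pruning step (the probe record and the imported-block query are illustrative assumptions, and the keep/drop rule is simplified; the real decision would depend on where each probe was actually placed):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    readonly record struct EdgeProbe(uint SourceIlOffset, uint TargetIlOffset, bool IsCritical);

    static class SchemaPruning
    {
        // Build the schema for the whole method, then drop entries whose probe
        // would sit entirely in un-imported code.
        public static List<EdgeProbe> Prune(List<EdgeProbe> fullSchema, Func<uint, bool> isImported)
            => fullSchema
                .Where(p => isImported(p.SourceIlOffset) || isImported(p.TargetIlOffset))
                .ToList();
    }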

In the near term we don't need to solve this problem -- though we might choose to anyway, for methods with clearly unreachable code.

AndyAyersMS (Member Author) commented:

It's looking inevitable that edge profiling will need to handle edges involving BBF_INTERNAL blocks. Among other things, the entry block can be marked this way if there's a try region starting at IL offset 0:

Basic block list for 'OVFTest:Test_byte(ubyte):ubyte'

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd                 weight    lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0004]  1  1                          1       [000..000)                 T1      try {       keep internal try label
BB02 [0000]  1  0                          1       [000..00C)-> BB05 (leave ) T0      try { }     keep try label
BB03 [0001]  1  1  0                       1       [00C..011)-> BB05 (leave ) T1 H0   catch { } } keep label target
BB04 [0002]  1     1                       1       [011..01A)        (finret)    H1   finally { } keep label target
BB05 [0003]  2                             1       [01A..01C)        (return)
-----------------------------------------------------------------------------------------------------------------------------------------

and those blocks have plausible-looking IL offsets.

My thought is to flag those blocks by using, say, bbID as the key, along with a high bit (since IL offsets in practice can't span the full 32-bit range).
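
For example, the key encoding could look something like this (a sketch of the idea, not the actual implementation):

    // BBF_INTERNAL blocks have no meaningful IL offset, so key them by bbID
    // with a high bit set; since real IL offsets in practice never use the
    // full 32-bit range, the two key spaces cannot collide.
    const uint InternalBlockKeyBit = 0x80000000;

    static uint MakeBlockKey(bool isInternal, uint ilOffset, uint bbID)
        => isInternal ? (bbID | InternalBlockKeyBit) : ilOffset;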

This means edge-based instrumentation (unlike block-based) needs to look at BBF_INTERNAL blocks during schema building and instrumentation. So I will do one more refactoring of the base support to enable this variant behavior.

AndyAyersMS (Member Author) commented:

Trying to build on #47646 and #47476, there's a snag -- we incorporate before import, and instrument after. That means

  • any block introduced by the importer will not be visible when we incorporate
  • bbNums may not be a reliable keying mechanism, as the importer can renumber BBs.

This violates my rule of thumb that instrumentation and incorporation must happen at the same point. Since incorporation has to happen before importation, this implies some aspect of instrumentation has to happen then too.

We can't fully instrument before importation since we don't want to instrument blocks that end up not getting imported.

But I think we can have the sparse instrumenter plan its work before importation and then build the schema and instrument after importation. Doing so will avoid the need to key any internal blocks, which is a bonus. But it's possible the FG shape is a bit odd before we get through the "eh canon" stuff. Will have to see.


AndyAyersMS (Member Author) commented:

Working on the before/after import split approach:

  • plan instrumentation before importing
  • instrument after importing

A few new wrinkles:

  • we still see internal blocks pre-import, from fgNormalizeEH.
  • in some cases we have incomplete flow before importing and don't report BADCODE. So the planning phase needs to be tolerant of such cases (basically, bbJumpTarget might be nullptr for things that jump).
  • we need to safeguard in case importing might introduce new internal blocks. I don't think this happens.

AndyAyersMS (Member Author) commented:

Reconstruction finally showing signs of life.

*************** Starting PHASE Profile incorporation
Have profile data: 3 schema records (schema at 000001E3512A3DB0, data at 000001E3512A3DF8)
Profile summary: 1 runs, 0 block probes, 3 edge probes, 0 class profiles, 0 other records

Reconstructing block counts from sparse edge instrumentation
... adding known edge BB02 -> BB03: weight 0.000000
... adding known edge BB04 -> BB05: weight 14418.000000
... adding known edge BB05 -> BB01: weight 43252.000000

New BlockSet epoch 1, # of blocks (including unused BB00): 6, bitset array size: 1 (short)
 ... unknown edge BB01 -> BB02
 ... unknown edge BB01 -> BB03
 ... unknown edge BB03 -> BB04
 ... unknown edge BB03 -> BB05

Solver: 5 blocks, 5 unknown; 7 edges, 4 unknown, 0 zero (and so ignored)

Pass [1]: 5 unknown blocks, 4 unknown edges
BB01: 0 incoming unknown, 2 outgoing unknown
BB01: all incoming edge weights known, summming...
  BB05 -> BB01 has weight 43252.000000
BB01: all incoming edge weights known, sum is 43252.000000
BB02: 1 incoming unknown, 0 outgoing unknown
BB02: all outgoing edge weights known, summming...
  BB02 -> BB03 has weight 0.000000
BB02: all outgoing edge weights known, sum is 0.000000
BB01 -> BB02: target block weight and all other incoming edge weights known, so weight is 0.000000
BB03: 1 incoming unknown, 2 outgoing unknown
BB04: 1 incoming unknown, 0 outgoing unknown
BB04: all outgoing edge weights known, summming...
  BB04 -> BB05 has weight 14418.000000
BB04: all outgoing edge weights known, sum is 14418.000000
BB03 -> BB04: target block weight and all other incoming edge weights known, so weight is 14418.000000
BB05: 1 incoming unknown, 0 outgoing unknown
BB05: all outgoing edge weights known, summming...
  BB05 -> BB01 has weight 43252.000000
BB05: all outgoing edge weights known, sum is 43252.000000
BB03 -> BB05: target block weight and all other incoming edge weights known, so weight is 28834.000000

Pass [2]: 1 unknown blocks, 1 unknown edges
BB01 -> BB03: source block weight and all other outgoing edge weights known, so weight is 43252.000000
BB03: 0 incoming unknown, 0 outgoing unknown
BB03: all incoming edge weights known, summming...
  BB02 -> BB03 has weight 0.000000
  BB01 -> BB03 has weight 43252.000000
BB03: all incoming edge weights known, sum is 43252.000000

Solver: converged in 2 passes

*************** Finishing PHASE Profile incorporation
Trees after Profile incorporation

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd                 weight     IBC  lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                           43252. 43252    [000..003)-> BB03 ( cond )                     IBC 
BB02 [0001]  1                             0        0    [003..00F)                                     IBC 
BB03 [0002]  2                           43252. 43252    [00F..014)-> BB05 ( cond )                     IBC 
BB04 [0003]  1                           14418. 14418    [014..020)                                     IBC 
BB05 [0004]  2                           43252. 43252    [020..021)        (return)                     IBC 
-----------------------------------------------------------------------------------------------------------------------------------------

AndyAyersMS (Member Author) commented Feb 6, 2021

Running through Pri1 tests -- have 30-odd asserting cases. Once those are good I want to do some more spot checking, or a run with more aggressive assertions turned on (e.g. we assert if we fail to solve), and look into how many solver passes are needed in the worst case.

Looks like OSR poses a new set of complications.

  • during instrumentation we might need to induce a pseudo-edge from each patchpoint to method entry (similar to what we do for return and throw)... not 100% sure yet.
  • during reconstruction we alter the flow graph early; we either need to defer this or compensate for it, as otherwise the base method and OSR method spanning trees will diverge.

For now I'm thinking we can just drop back to block instrumentation if OSR is enabled and we have patchpoints in the method.

AndyAyersMS (Member Author) commented:

Note that even with block profiles the OSR profile data is a bit funky:

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd                 weight     IBC  lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB07 [0006]  1                             0        0    [???..???)-> BB02 (always)                     i internal rare label target IBC 
BB01 [0000]  1                             1        1    [000..007)-> BB03 (always)                     IBC 
BB02 [0001]  2                            9999.  9999    [007..010)                                     bwd bwd-target IBC 
BB03 [0002]  2                           10000. 10000    [010..018)-> BB02 ( cond )                     bwd IBC 
BB04 [0003]  1                             0        0    [018..024)-> BB06 ( cond )                     rare IBC 
BB05 [0004]  1                             0        0    [024..026)        (return)                     rare IBC 
BB06 [0005]  1                             0        0    [026..029)        (return)                     rare IBC 
-----------------------------------------------------------------------------------------------------------------------------------------

The original method never made it out of the BB02, BB03 loop, so the return counts are all zero (and hence edge instrumentation would impute a method entry count of zero). The OSR entry profile count is also not set correctly.

Opened #47942 to sort all this out.

AndyAyersMS (Member Author) commented:

Pri1 tests mostly running without trouble, but hit one odd issue.

We are contaminating the inlinee compiler with details of the root method's EH. It's not clear why. This makes walking the inlinee flow graph tricky, as part of the walk needs to consider EH entries, and so we're wandering off into the root method's blocks.

Hopefully all this is unnecessary and merely a source of confusion. If we ever decide to support inlining methods with EH we might need to reconsider, but my guess is we would not just concatenate inlinee entries; they'd need to be more carefully merged into the root table.

    noway_assert(info.compXcptnsCount == 0);
    compHndBBtab           = impInlineInfo->InlinerCompiler->compHndBBtab;
    // we probably only use the table, not add to it.
    compHndBBtabAllocCount = impInlineInfo->InlinerCompiler->compHndBBtabAllocCount;
    compHndBBtabCount      = impInlineInfo->InlinerCompiler->compHndBBtabCount;
    info.compXcptnsCount   = impInlineInfo->InlinerCompiler->info.compXcptnsCount;

AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Feb 6, 2021
AndyAyersMS added a commit that referenced this issue Feb 9, 2021
Add a new instrumentation mode that only instruments a subset of the edges in
the control flow graph. This reduces the total number of counters and so has
less compile time and runtime overhead than instrumenting every block.

Add a matching count reconstruction algorithm that recovers the missing edge
counts and all block counts.

See #46882 for more details on this approach.

Also in runtime pgo support, add the header offset to the copy source.
This fixes #47930.
AndyAyersMS (Member Author) commented:

Implemented via #47959.

ghost locked this conversation as resolved and limited it to collaborators Mar 11, 2021