-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spanning tree instrumentation #47959
Spanning tree instrumentation #47959
Conversation
Add a new instrumentation mode that only instruments a subset of the edges in the control flow graph. This reduces the total number of counters and so has less compile time and runtime overhead than instrumenting every block. Add a matching count reconstruction algorithm that recovers the missing edge counts and all block counts. See dotnet#46882 for more details on this approach.
Wrote a simple checker to compare block probe based counts with sparse edge probe based reconstructed block counts. Looks like the algorithm is generally doing quite well in getting to the same normalized block count (absolute counts vary from run to run based on the vagaries of tier1 timing). Eg in the deserialization benchmark, only 4 blocks differ (to two decimal places) and only by 0.01 each time:
Spot checking, the counts also appear to be fairly consistent, so the copying the runtime is doing before handing data back to the jit seems to be fairly effective at preventing counter skew. However, I'm also consistently seeing a convergence failure in |
The flowgraph / spanning tree for I may also look into why we are creating linear flow here; seems like we should be smart enough not to start a new basic block for non-joins... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good
src/coreclr/jit/block.h
Outdated
BasicBlock* bbIDom; // Represent the closest dominator to this block (called the Immediate | ||
// Dominator) used to compute the dominance tree. | ||
void* bbProbeList; // Used early on by fgInstrument | ||
void* bbInfo; // Used early on by fgIncorporate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Could you use a more specific name than Info
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed.
@@ -450,7 +453,10 @@ void Compiler::fgReplaceSwitchJumpTarget(BasicBlock* blockSwitch, BasicBlock* ne | |||
{ | |||
// Remove the old edge [oldTarget from blockSwitch] | |||
// | |||
fgRemoveAllRefPreds(oldTarget, blockSwitch); | |||
if (fgComputePredsDone) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update the header comment that says "We also must update the predecessor lists for 'oldTarget' and 'newPred'."?
It seems like our basic manipulation routines need to document if they can run without preds, maybe with an appropriate assert if they can't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missed this one in my first review feedback commit.
src/coreclr/jit/fgprofile.cpp
Outdated
} | ||
case BBJ_EHFILTERRET: | ||
{ | ||
// Ignore filters; they are single block and only |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no limitation that a filter be a single basic block
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why I thought this (maybe a language restriction?) We can handle filter like other handler regions, then.
// Since our keying scheme is IL based and this | ||
// block has no IL offset, we'd need to invent | ||
// some new keying scheme. For now we just | ||
// ignore this (rare) case. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you saying it's a rare case that a finally has a throw
(for instance) and hence doesn't return? I suppose that's true, though certainly possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. We're not going to get an accurate picture of flow in methods that throw as the point at which control resumes is ambiguous. But we likely don't need an accurate picture; if a method throws with any frequency then it's unlikely to be performance sensitive.
// | ||
// Note if the throw is caught locally this will over-state the profile | ||
// count for method entry. But we likely don't care too much about | ||
// profiles for methods that throw lots of exceptions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you don't want to (or can't) add an edge to all possible in-function handlers as well as a pseudo-edge to the method entry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there are multiple out-edges we'll end up wanting to instrument some of them, which we can't do. In the paper, they suggest (in the context of longjmp
) fixing his at runtime by having the runtime increment the appropriate edge counter, once it resolves the stack target. Again not an option for us. So, as noted above, we'll just not worry about these cases as they only arise when methods throw.
{ | ||
continue; | ||
sourceOffset = block->bbNum | IL_OFFSETX_CALLINSTRUCTIONBIT; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like you need to define your own high bit here, and maybe even accessor/settors, for extreme clarity.
src/coreclr/jit/fgprofile.cpp
Outdated
// | ||
// Solving is done in four steps: | ||
// * Prepare | ||
// * walk the blocks settting up per block info, and a map |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: settting
src/coreclr/jit/fgprofile.cpp
Outdated
// add in an unknown count edge. | ||
// * Solve | ||
// * repeatedly walk blocks, looking for blocks where all | ||
// incoming our outgoing edges are known. This determines |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
our => or?
src/coreclr/jit/fgprofile.cpp
Outdated
private: | ||
Compiler* m_comp; | ||
CompAllocator m_allocator; | ||
uint32_t m_blocks; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you using uint32_t / int32_t here? Can't we use the normal C++ types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. There will be a bit left as some of it comes from the instrumentation schema declaration in corjit.h
src/coreclr/jit/block.h
Outdated
union { | ||
BasicBlock* bbIDom; // Represent the closest dominator to this block (called the Immediate | ||
// Dominator) used to compute the dominance tree. | ||
void* bbProbeList; // Used early on by fgInstrument |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have something that nulls these out before the phases that uses them? Or initializes them before they are read? I see somewhere that you assert bbInfo is != nullptr, for instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For instrumentation this happens in the VisitBlock
method during the tree walk. For reconstruction the value is set during Prepare
before it is read.
Realized that the ideal solver order is likely reverse post order over the depth first spanning tree. We could either materialize that order somehow or perhaps even integrate the solve and traversal passes together. For now, a simpler fix is to just solve for the blocks from last to first, instead of first to last, as most edges are lexically forward. With that change the example above (
|
One last note on that example. In my test run most of the blocks had zero count, because that method has an early return if all the strings it sees are ascii, and that's what happened in the test. So the solver is doing a fair amount of work to set most blocks to zero:
There is an optimization I contemplated that leaves zero weight edges out of the solve graph that might also reduce the amount of work needed to reach a solution. I left it out because if we subsequently want to deduce edge likelihoods, or perhaps also run synthesis, it seemed prudent to have edge objects for each successor edge. |
@BruceForstall think I covered most of your feedback. |
/azp run runtime-jit-experimental |
Azure Pipelines successfully started running 1 pipeline(s). |
Mono build failure:
|
Thanks. Presumably when you take this out of "Draft" mode you'll ask for "final" reviews. |
No longer a draft PR. |
OSX build also hitting nuget issues:
Am planning on ignoring this and the mono failure. Also expect jit-experimentail to fail overall with OSR/EhWriteThru issues. None of the PGO legs should fail. |
Windows jit-experimental had only the "expected failures" -- waiting on Linux. Also, the failures noted in #47930 are fixed. |
Ditto for Linux. |
@dotnet/jit-contrib think this is ready for some review.
See #46882 for notes on the overall approach.
This turned out to be a bit more involved than I'd thought, between work to uncover and cope with or remove complications I hit while running analyses this early in the jit's phase pipeline, and the need to navigate between several different representations for data.
A rough guide to reviewing the code:
fgSplitEdge
to be called before we have pred lists, defer computation of inlinee scale until profile incorporationFor sparse edge profiling both instrumentation and reconstruction rely on being able to create a spanning tree, so this part is factored out into a walker/visitor pattern. This tree is evolved by a DFS that is mostly normal but has a few odd aspects:
The reconstruction algorithm uses a fair amount of auxiliary data and maps. There may be less costly ways to encode some of this; suggestions welcome. It also repeatedly iterates over blocks when ideally it would be using some kind of priority queue or other more focused organization of the remaining work.
There are a number of tricky points:
BlockSet
in inlinees is tricky as basic blocks always live in the root compiler's "space"NumSucc
we are always using the root instance (mainly so any switch descriptor stuff happens up at the root) -- one could imagine hiding this detail in the implementation of this machinery insteadfgCalledCount
.