-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT: use reverse post-order (RPO) traversal for morph #93246
Comments
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch Issue DetailsMorph, like many JIT phases, visits all the blocks in a method by following the For "forward" phases like morph it is often preferable to visit the blocks in reverse post-order (RPO). An RPO ensures that for most blocks all the predecessors of the block have been visited before the block itself. Currently value numbering implements an RPO visit. It's also possible to create an RPO using The initial goal of this work is to modify morph to rely on RPO, and then to enable a simple form of global assertion prop for Morph (aka "cross-block local assertion prop") that can push facts forward across block boundaries. #86822 has a prototype implementation. The main remaining challenges there are to make the RPO efficient and to properly handle cases where control flow is altered.
There will be extra TP cost from the RPO and from the assertion changes; the hope is that these are paid back by improved optimizations and that this entire change can be zero-cost. Note for "backwards" phases post-order provides similar benefits (all successors likely visited before a block is visited). cc @dotnet/jit-contrib
|
Instead of merging returns to the common return block in morph, do all the merging in `fgAddInternal` (where we already did some merging). This removes a case where morph would add a control flow edge in a way that might disrupt an ongoing RPO. Earlier merging also opens up the possibility of tail merging some of the copies into the canonical return local, and possibly even some of the computations that feed the copies. Modify the flow alterations done by morph. Previously if a tail call was expressed via a call to a `CORINFO_TAILCALL_HELPER`, morph would change the block kind to `BBJ_RETURN` and then merge the return, changing the block kind to `BBJ_ALWAYS`. Since merging now happens before moprh, morph needs to leave the block kind alone. Generalize the post-tail-call sanity check in morph to recognize one new case that can come up. Contributes to dotnet#93246.
Adapt `fgValidateIRForTailCall` to use as a utility to verify that the JIT has not added IR that would make a tail call invalid. Currently all our cases pass this check so no tail calls are invalidated. Remove `fgCheckStmtAfterTailCall` as it did less thorough and less correct checking. Contributes to dotnet#93246.
Remove `fgCheckStmtAfterTailCall` as it did less thorough and less correct checking. Contributes to #93246.
Contributes to dotnet#93246.
For the "dynamic RPO" -- implementing this for morph seems problematic. Value Numbering (VN) has one, but it relies on the loop table, which (roughly speaking) encodes the results of a prior DFS traversal. VN keeps track of two sets of blocks, those not yet visited with all preds visited, and those not yet visited with some preds visited. As a visit finishes it walks the pred lists of all successors, looking to see if any of those are unvisited and have all preds visited. If so, the block is added to the first set; if not, it is added to the second. Blocks are preferentially processed from the "all preds visited" set, until this is empty. At that point, if the "some preds visited" set is not empty, VN needs to choose a block from this set, and to do so it relies on loop structure to find an outermost loop block, to maximize the odds of subsequent visits being all preds visited cases. Without this hint there's no guarantee that the visitation order for VN resembles an RPO. (Aside: it also seems like VN's approach could be streamlined somewhat by keeping track of the number of unvisited preds; when a visit finishes and we're enumerating the block successors we can just decrement their waiting numbers, moving them to the zero list where appropriate, rather than scanning all their preds. But I don't recall the value number state manipulation being especially costly, so perhaps not worth the trouble?) For morph there's no loop structure around to consult; we'd need a DFS or equivalent to establish one, at which point we may as well just use the DFS itself to establish an RPO. |
In the "dynamic" case in the absence of a loop table, you could rely on the evolving spanning tree to figure out which of the "some preds visited" set you should visit first; preferentially choose based on (a) fewer unvisited preds; then (b) closest to root visited pred (or maybe just closest to root for the pred that was visited first). Say in a doubly nested loop both loop entries have one unvisited pred: the outer loop entry would have had preds visited before any inner loop pred, so it should go first (the outer loop entry should be a spanning tree ancestor of the inner loop entry). This would require a bit of extra state or you could perhaps reconstruct the spanning tree from the existing data (or leverage the slots in the blocks with pre/post order numbers and do it that way, and/or manipulate state on edges or give edges an ID and track which ones are in the spanning tree via a bit vector or something). |
Thinking about it some more, it seems difficult to get the dynamic case to work properly without some kind of prior analysis of the graph structure. EG for Once A has been processed, a dynamic RPO needs to decide if B or C should come next. The right answer is B, but both B and C are at the same depth in the spanning tree, and have the same spanning tree parent, so there's no way to use the spanning tree to decide which node should come next. Here B has an obvious self-loop, and C has no successors, but straightforward elaborations of the graph above preserve the problem of choosing B or C without having those features. It seems plausible that replacing the "dynamic PGO" done in value numbering with an actual RPO might both improve throughput and give better results, since not all loops are in the loop table (eg cold clones). Worth trying someday. |
When optimizing, process blocks in RPO. Disallow creation of new blocks and new flow edges (the latter with certain preapproved exceptions). Morph does not yet take advantage of the RPO to enable more optimization. Contributes to dotnet#93246.
When optimizing, process blocks in RPO. Disallow creation of new blocks and new flow edges (the latter with certain preapproved exceptions). Morph does not yet take advantage of the RPO to enable more optimization. Contributes to #93246.
Challenges in modifying local assertion prop:
The plan for now:
|
…n tracking Track the set of active local assertions via a bit vector, rather than assuming all entries in the table are live. Doing so required a number of changes in assertion prop to ensure the vector is consulted before deciding an assertion is valid. This will (eventually) allow us to propagate assertions cross-block. For now we reset the bit vector and assertion table back to empty at the start of each block so nothing propagates past the end of a block. The table can fill and cause the JIT to miss assertions in very large blocks as morph will no longer remove assertions while processing a block. Previously this would happen if there were more than 64 live assertions in a block, and now it can happen if there are more than 64 assertions in block (so somewhat more frequently). Contributes to dotnet#93246.
…n tracking (#94322) Track the set of active local assertions via a bit vector, rather than assuming all entries in the table are live. Doing so required a number of changes in assertion prop to ensure the vector is consulted before deciding an assertion is valid. This will (eventually) allow us to propagate assertions cross-block. For now we reset the bit vector and assertion table back to empty at the start of each block so nothing propagates past the end of a block. The table can fill and cause the JIT to miss assertions in very large blocks as morph will no longer remove assertions while processing a block. Previously this would happen if there were more than 64 live assertions in a block, and now it can happen if there are more than 64 assertions in block (so somewhat more frequently). Contributes to #93246.
During global morph, allow assertions to propagate to a block from the block's predecessors. Handle special cases where we can't allow this to happen: * block has preds that have not yet been morphed * block has no preds * block is specially flagged as one that might gain new preds during morph * block is an EH handler entry Contributes to dotnet#93246.
Thoughts about how to get this to the point where it can be merged(copied from #94363 (comment)) ThroughputI think I can claw some of the TP back by (1) optimizing search loops in assertion prop. We should generally never search the entire table (save when adding an assertion) and instead walk the live assertion set, and for local prop we should intersect that set with the dep vector. The bit vector operations aren't free so it might make sense to do this only when there are sufficient numbers of assertions, though that makes the code uglier. Perhaps it can all be hidden behind a suitably smart enumerator. (2) stop morphing blocks that are unreachable or become unreachable because of local assertion prop. Aside from removing all the morph related overhead, I suspect sometimes this dead block morphing creates issues for live blocks, eg either useless assertions that cause the table to fill faster, or else global actions (say DNER) that are hard to undo. Currently the RPO strategy won't allow for removal of dead EH and may or may not make it easy to remove unreachable cyclic flow. Both of these can be spun off as preliminary checkins, though the full benefits may not be seen without this change, as main has small bit vectors and isn't able to propagate constants nearly as aggressively. At the end of the day though I expect there will still be a TP impact. Code Size / Code QualityWith the massive numbers of diffs any sort of manual assessment is going to be a random sampling at best. I will try and look at some of the bigger regressions, possibly suppressing cloning to help remove that as a factor. The more aggressive copy prop that this enables puts pressure on LSRA, as copies to temps provide natural split points. I am not sure we can find effective heuristics to counterbalance this. I think it makes sense to check all this in initially as off-by-default, and enable runs in the perf lab, and use that data to both identify CQ regressions and to decide if the CQ improvements justify the additional TP costs. |
One other note here for posterity -- I spent some time trying to see if there was a way to remove assertions to try and keep the index set more compact. For example, if an assertion born in some block (created for the first time) is no longer live at the end of the block, it can be erased with somewhat minimal effort: scrub it from any dep vector, remove it from the table, and adjust the bits in the I gathered some data on this and it did not look promising. Most times assertions don't get killed within blocks; they die at merge points, and most of the time when they do get killed they're not numbered higher than any new live one. So I did not pursue this. There is also one test case in It's still possible to fill the table with junk assertions; I will need to look at some data on how often we lose assertions with the new table resizing. But some of this is inevitable with a simplistic forward push of facts. One can imagine trying to prioritize locals (we have RCS_EARLY counts we could use say), but I don't know yet if something like that is worth the trouble. |
We don't need quite this broad of a start node set, as the DFS will find unreachable blocks on its own. Also lay the groundwork for removing unreachable blocks, by tracking the postorder number of the last reachable block. Contributes to dotnet#93246.
During global morph, allow assertions to propagate to a block from the block's predecessors. Handle special cases where we can't allow this to happen: * block has preds that have not yet been morphed * block has no preds * block is specially flagged as one that might gain new preds during morph * block is an EH handler entry Contributes to #93246. When enabled, size the assertion table based on the number of locals, up to the tracked limit. Disabled by default; use 'DOTNET_JitEnableCrossBlockLocalAssertionProp=1` to enable.
Remove the up-front computation of enter blocks and dom start nodes from the DFS computations used for RPO. Also lay the groundwork for removing unreachable blocks, by tracking the postorder number of the last reachable block. Contributes to #93246.
Leverage the "dep vectors" to avoid the search the assertion table during local assertion prop. Helps the current (small table) behavior some, helps the future cross-block (larger table) behavior more. Similar tricks may be possible for global AP, though the set of assertions there is more varied. Avoid merging assertions from statically unreachable preds. For OSR methods ensure the original method entry is considered reachable (as it may be). Contributes to #93246.
Now that we can propagate assertions across block boundaries we can generate assertions for true/false branch conditions and propagate them along the corresponding edges. Contributes to dotnet#93246.
Fix condition under which we can share a pred's assertion out vector. Add the ability to disable cross-block local assertion prop via range. Contributes to dotnet#93246.
Fix condition under which we can share a pred's assertion out vector. Add the ability to disable cross-block local assertion prop via range. Contributes to #93246.
Float relop equality does not imply bitwise equality. So skip making local jtrue assertions about floating types. Contributes to dotnet#93246.
Float relop equality does not imply bitwise equality. So skip making local jtrue assertions about floating types. Contributes to #93246.
Completed work items for .NET 9. |
Morph, like many JIT phases, visits all the blocks in a method by following the
bbNext
chain. There's a missed opportunity here to cheaply propagate some information from block to block.For "forward" phases like morph it is often preferable to visit the blocks in reverse post-order (RPO). An RPO ensures that for most blocks all the predecessors of the block have been visited before the block itself.
Currently value numbering implements an RPO visit. It's also possible to create an RPO using
fgDfsReversePostorder
.The initial goal of this work is to modify morph to rely on RPO, and then to enable a simple form of global assertion prop for Morph (aka "cross-block local assertion prop") that can push facts forward across block boundaries. #86822 has a prototype implementation. The main remaining challenges there are to make the RPO efficient and to properly handle cases where control flow is altered.
fgAddRefPred
. Removing preds should be OK as it won't invalidate the RPO, but adding preds is problematic. Trying this out, one violation (given the above PRs to stop adding new blocks) is to add edges togenReturnBB
. Along with those edges, IR is added to copy return values and that IR is morphed "out of sequence" -- meaning it would also be problematic for assertion prop. Seems like all that flow merging and copying could happen earlier, say infgAddInternal
; in fact there may be some benefit as perhaps some struct copies can also be merged.BBJ_RETURN
(plusBBF_HAS_JMP
) orBBJ_THROW
; a failed tail call will end up asBBJ_ALWAYS
, and a tail call helper dispatched call likewise. So we can't merge early and we can't rely on not changing flow later.genReturnBB
) and then during morph, if there are tail calls, treat thegenReturnBB
as a flagged block (similar to what we propose for the tail to loop below).BBJ_RETURN
blocks -- this would just fall out if we could merge all returns earlier, but since we can't, we need to do extra work here.Work in progress: JIT: move return merging earlier #93997no longer needed?bbAssertionIn
orbbAssertionGen
we could hijack one of those slots to hold the data. Then when looking at pred info block would have to know the pred has two sets and pick the right one. Seems quite doable.QMARK
expansion before morph, remove bespoke handling in morph forQMARK
assertions (see JIT: expand qmarks earlier #86778)There will be extra TP cost from the RPO and from the assertion changes; the hope is that these are paid back by improved optimizations and that this entire change can be zero-cost.
Note for "backwards" phases post-order provides similar benefits (all successors likely visited before a block is visited).
cc @dotnet/jit-contrib
The text was updated successfully, but these errors were encountered: