JIT: resolve issues with OSR and PGO #47942
Comments
There's a workaround in place (use block profiling with OSR) that's good enough for now. So, moving to Future.
The interaction of OSR and PGO is a bit more involved than I'd been thinking. There are a few sub-problems:
It seems plausible that we can fix 2, 3, and 4 by modelling OSR flow in the original method, deferring OSR flow updates until after the profile model has been built (which is happening as part of #59784), and then adding instrumentation to the root part of the OSR method. That last step is a bit novel but seems doable: we use the schema binding we get from reading the PGO data in the OSR method to emit "redundant" probes in the OSR method itself. The downside is that the OSR method has some extra overhead, but maybe that's acceptable.
Fixing 1 seems hard, unless we have the OSR method return to the original method once we've exited the part of the OSR method flowgraph that is post-dominated by the patchpoint. Note there could be many such points, and each one might need a custom "fixup" action to restore the IL state in the original method from the live state in the OSR method. We might also need lifetime extensions to ensure that we don't dead-code something in the OSR method that's still needed back in the original.
Rough plan of attack.
So OSR methods won't be competitive with their Tier1 counterparts; this should be OK. But they should be much faster than Tier0 and perhaps that's good enough.
Working on the above, I ran into a complication. It's not possible to get the OSR method to produce the exact same schema as the Tier0 method. They produce the same flow graph initially (since it's based on method IL), but they do different selective importation, and currently block and class instrumentation is gated on this:
runtime/src/coreclr/jit/fgprofile.cpp Lines 346 to 349 in 1d352fc
runtime/src/coreclr/jit/fgprofile.cpp Lines 1427 to 1430 in 1d352fc
We could fix this easily enough for counts by "over allocating" the schema for both Tier0 and OSR (basically pretending that we're going to import the entire method), because count probes don't depend on the IR in the block. The downside is using more schema/data memory than necessary. For class probes, however, the probe structure depends on the IR, so if we import differently we can't hope to match class probe schema entries. Some alternatives to explore:
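The "over allocating" idea for count probes can be illustrated with a small standalone sketch. This is not JIT code: `MethodCompile`, `gatedCountSchema`, `overAllocatedCountSchema`, and `covers` are hypothetical names, and a schema is modelled simply as a sorted list of IL offsets, one per block count probe. The sketch only shows why gating on importation lets the Tier0 and OSR schemas diverge, while basing the schema on the full method IL keeps them identical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical model of one compile of a method (not a JIT type).
struct MethodCompile {
    std::vector<uint32_t> allIlOffsets; // start offset of every block in the IL, sorted
    std::set<uint32_t>    imported;     // blocks this particular compile imported
};

// Schema gated on importation: one count entry per imported block.
std::vector<uint32_t> gatedCountSchema(const MethodCompile& c) {
    std::vector<uint32_t> schema;
    for (uint32_t off : c.allIlOffsets) {
        if (c.imported.count(off) != 0) {
            schema.push_back(off);
        }
    }
    return schema;
}

// "Over-allocated" schema: pretend the whole method is imported.
// Costs extra schema/data slots for blocks that never get a probe.
std::vector<uint32_t> overAllocatedCountSchema(const MethodCompile& c) {
    return c.allIlOffsets;
}

// True if every entry of `later` also appears in `first`.
// Both inputs are assumed to be sorted by IL offset.
bool covers(const std::vector<uint32_t>& first, const std::vector<uint32_t>& later) {
    assert(std::is_sorted(first.begin(), first.end()));
    assert(std::is_sorted(later.begin(), later.end()));
    return std::includes(first.begin(), first.end(), later.begin(), later.end());
}
```

With a Tier0 compile that imports blocks at offsets 0, 10, 20 and an OSR compile that imports 20, 30, 40, the gated Tier0 schema does not cover the OSR schema, whereas the over-allocated schemas are equal. Class probes are deliberately left out of the sketch, since their schema entries depend on the IR and cannot be matched this way.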
One other issue:
Enable edge based profiles for OSR, partial compilation, and optimized plus instrumented cases.

For OSR this requires deferring flow graph modifications until after we have built the initial probe list, so that the initial list reflects the entirety of the method. This set of candidate edge probes is thus the same no matter how the method is compiled. A given compile may schematize a subset of these probes and materialize a subset of what gets schematized; this is tolerated by the PGO mechanism provided that the initial instrumented jitting produces a schema which is a superset of the schema produced by any subsequent instrumented rejitting. This is normally the case.

Partial compilation may still need some work to ensure full schematization but it is currently off by default. Will address this subsequently.

For optimized compiles we give the EfficientEdgeCountInstrumentor the same kind of probe relocation abilities that we have in the BlockCountInstrumentor. In particular we need to move probes that might appear in return blocks that follow implicit tail call blocks, since those return blocks must remain empty.

The details on how we do this are a bit different but the idea is the same: we create duplicate copies of any probe that was going to appear in the return block and instead instrument each pred. If the pred reached the return via a critical edge, we split the edge and put the probe there. This analysis relies on cheap preds, so to ensure we can use them we move all the critical edge splitting so it happens before we need the cheap pred lists.

The ability to do block profiling is retained but will no longer be used without special config settings.

There were also a few bug fixes in the spanning tree visitor. It must visit a superset of the blocks we end up importing and was missing visits in some cases.

This should improve jit time and code quality for instrumented code.

Fixes dotnet#47942. Fixes dotnet#66101. Contributes to dotnet#74873.
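The probe relocation described above can be sketched on a toy flow graph. This is a minimal illustration, not the EfficientEdgeCountInstrumentor code: `Block`, `Graph`, `predsOf`, and `relocateProbesOutOf` are hypothetical names, probes are plain integer ids, and the graph is assumed to have no duplicate edges. As a simplifying assumption, the sketch splits any pred-to-return edge whose source has more than one successor, which covers the critical-edge case mentioned in the description.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical toy flow graph (not the JIT's BasicBlock).
struct Block {
    std::vector<int> succs;  // indices of successor blocks
    std::vector<int> probes; // ids of count probes placed in this block
};

using Graph = std::vector<Block>;

// Collect each predecessor of `target` once.
std::vector<int> predsOf(const Graph& g, int target) {
    std::vector<int> preds;
    for (std::size_t i = 0; i < g.size(); ++i) {
        for (int s : g[i].succs) {
            if (s == target) {
                preds.push_back(static_cast<int>(i));
                break;
            }
        }
    }
    return preds;
}

// Move every probe out of return block `ret`, duplicating it into each pred.
// If a pred has other successors, split the pred->ret edge and host the probe
// in the new block so that only flow reaching `ret` is counted.
void relocateProbesOutOf(Graph& g, int ret) {
    std::vector<int> moved;
    moved.swap(g[ret].probes); // the return block is now probe-free
    if (moved.empty()) {
        return;
    }
    for (int pred : predsOf(g, ret)) {
        int host = pred;
        if (g[pred].succs.size() > 1) {
            host = static_cast<int>(g.size());
            g.push_back(Block{{ret}, {}}); // new block on the split edge
            for (int& s : g[pred].succs) {
                if (s == ret) {
                    s = host;
                }
            }
        }
        g[host].probes.insert(g[host].probes.end(), moved.begin(), moved.end());
    }
}
```

For a graph where block 0 branches to blocks 1 and 2, block 1 falls into block 2, and block 2 is a return holding one probe, the call `relocateProbesOutOf(g, 2)` leaves block 2 empty, copies the probe into block 1, and adds a new block 3 on the split 0-to-2 edge that holds the second copy.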
OSR and PGO don't play nicely together.
See #46882 (comment) and follow on comments.