Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enable osr stress tests for libraries and add fixes #64116

Merged
merged 8 commits into from
Jan 28, 2022

Conversation

AndyAyersMS
Copy link
Member

@AndyAyersMS AndyAyersMS commented Jan 21, 2022

Enables OSR stress for libraries.

This uncovered a few issues that are fixed with subsequent commits:

  • generics context reporting for OSR methods was broken, when the Tier0 method did not need to report the context
  • tail recursion detection was broken when the call in tail position was devirtualized.
  • with OSR enabled, logic for when we decide to switch a method from Tier0 to optimized was switching too many methods

@ghost
Copy link

ghost commented Jan 21, 2022

Tagging subscribers to this area: @dotnet/area-infrastructure-libraries
See info in area-owners.md if you want to be subscribed.

Issue Details

null

Author: AndyAyersMS
Assignees: -
Labels:

area-Infrastructure-libraries

Milestone: -

@AndyAyersMS
Copy link
Member Author

Seeing issues with libraries + osr for arm64 over in #63720, want to get a broader view.

@AndyAyersMS
Copy link
Member Author

cc @dotnet/jit-contrib

Copy link
Member

@BruceForstall BruceForstall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine to me, but to be more consistent with the coreclr tests, maybe there should be a "jit-experimental" testGroup (and pipeline) for libraries.

Or maybe at this point the OSR stuff should "graduate" to its own "osr" testGroup in both places?

@AndyAyersMS
Copy link
Member Author

Or maybe at this point the OSR stuff should "graduate" to its own "osr" testGroup in both places?

Probably a good idea, but the combinatorics of all this start getting messy....

@AndyAyersMS
Copy link
Member Author

/azp run runtime-coreclr libraries-pgo

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@AndyAyersMS
Copy link
Member Author

/azp run runtime-coreclr libraries-pgo

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@AndyAyersMS
Copy link
Member Author

AndyAyersMS commented Jan 25, 2022

Generics context is still not right in case where the OSR method needs to report this but the Tier0 method didn't need to. While we think we're saving this in the OSR prolog, genReportGenericContextArg assumes the arg registers have their proper values, and that's not the case for an OSR method.

Some options:

  • fetch the Tier0 frame this and save it in the OSR context slot. This would work for x64 but perhaps not for arm64. We'd need to do a mem-mem move and on arm64 we only have one scratch reg and may need it to form the offset to the Tier0 slot. Now if we happen to have this enregistered on entry we could reorder the OSR arg fetches and context slot init and so get around the register limitation, but that's not guaranteed....
  • report the Tier0 this frame slot as the context slot.
  • force the Tier0 method to always keep alive and report the context, report the context slot as live.

The last one seems the simplest so will do that....

@AndyAyersMS
Copy link
Member Author

/azp run runtime-coreclr libraries-pgo

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@AndyAyersMS
Copy link
Member Author

Linux x64 debug failure:

  Starting:    System.Runtime.Experimental.Tests (parallel test collections = on, max threads = 2)
Process terminated. Assertion failed.
   at System.Number.FormatHalf(ValueStringBuilder& sb, Half value, ReadOnlySpan`1 format, NumberFormatInfo info) in /_/src/libraries/System.Private.CoreLib/src/System/Number.Formatting.cs:line 723

Similar to #61359 -- probably not related to this PR.

@AndyAyersMS
Copy link
Member Author

Still seeing some libraries test failures, looks like bad codegen. Let's see if we can catch it in coreclr tests.

@AndyAyersMS
Copy link
Member Author

/azp run runtime-jit-experimental

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

* Dump IR after patchpoints phase, if any were added
* Fix bug in recursive tail call detection -- if
we devirtualize we need to update the method handle we use.

This last change is the key bug fix -- OSR methods rely on
getting this detection right so they know if they need to import
the method entry block.
@AndyAyersMS
Copy link
Member Author

Last commit fixed a subtle bug.

When we devirtualize a tail call we might fail to recognize that the tail call is now a recursive tail call, because we are checking the wrong method handle. For normal methods this just means we may not mark the block range as a loop (which is bad but rarely leads to bugs). But for OSR methods this is fatal because they need to use this to trigger importation of the method entry.

One method where this happens is System.Xml.Serialization.StructMapping:System.Xml.Serialization.INameScope.get_Item(System.String,System.String):System.Object:this from the xml serialization libraries tests (under random OSR stress, we put a patchpoint at offset 0x1b).

@AndyAyersMS
Copy link
Member Author

Still think the enabling logic for OSR is off. Currently with QJFL=1, OSR=1 we will bail to opt for too many cases.

The logic with QJFL=1, OSR=1 should be

canGetTrappedAtTier0 = (has loop) || (has tail. prefix)
canBeBailedOutByOSR = !(hasLocalloc || isRPInvoke || has loop in handler)

optimize = canGetTrappedAtTier0 && !canBeBailedByOSR

but currently it's

!canBeBailedOutByOSR

So will fix this along with the other changes...

@AndyAyersMS
Copy link
Member Author

/azp run runtime-coreclr libraries-pgo

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@AndyAyersMS
Copy link
Member Author

x86 libraries test failing

  Starting:    System.Diagnostics.PerformanceCounter.Tests (parallel test collections = on, max threads = 2)
    System.Diagnostics.Tests.PerformanceCounterTests.PerformanceCounter_CreateCounter_MultiInstanceReadOnly [FAIL]
      System.AggregateException : One or more errors occurred. (Category 'PerformanceCounter_CreateCounter_MultiInstanceReadOnly_Category' does not exist.) (Category 'PerformanceCounter_CreateCounter_MultiInstanceReadOnly_Category' does not exist.) (Category 'PerformanceCounter_CreateCounter_MultiInstanceReadOnly_Category' does not exist.)

Almost certainly unrelated as this change should be no-diff for x86.

@AndyAyersMS AndyAyersMS changed the title add osr plus stress tests for libraries enable osr stress tests for libraries and add fixes Jan 26, 2022
@AndyAyersMS
Copy link
Member Author

/azp run runtime-jit-experimental

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@AndyAyersMS
Copy link
Member Author

@EgorBo can you make sure to look over the changes in this PR for the OSR enabling/fallback logic? Also perhaps remeasure some of the TE benchmarks (with QFJL=1 OSR=1, since this PR doesn't enable those by default)...?

I think we were needlessly optimizing too many methods when OSR was turned on....

@AndyAyersMS
Copy link
Member Author

Libraries stress test still finding issues.

@EgorBo
Copy link
Member

EgorBo commented Jan 26, 2022

@EgorBo can you make sure to look over the changes in this PR for the OSR enabling/fallback logic? Also perhaps remeasure some of the TE benchmarks (with QFJL=1 OSR=1, since this PR doesn't enable those by default)...?

I think we were needlessly optimizing too many methods when OSR was turned on....

Sure!
Will schedule a run for this PR overnight

@AndyAyersMS
Copy link
Member Author

AndyAyersMS commented Jan 27, 2022

Summary of the latest batch of issues:

  • x64 has a problem if we suspend a thread in the middle of an OSR epilog. Unwind gets confused because it is examining the actual epilog code and sees the second SP adjust and decides we must not be in an epilog. So it uses the full set of prolog unwind codes (in reverse) to restore context. But we are in the middle of the epilog so this gives corrupted context.
Assert failure(PID 8132 [0x00001fc4], Thread: 8112 [0x1fb0]): ((FARPROC) (TADDR)m_pvHJRetAddr) != NULL
Assert failure(PID 28 [0x0000001c], Thread: 48 [0x0030]): Assertion failed 'isValidGeneralDatasize(size)' 
  • arm64 has the known issue with a too-large prolog (GitHub_17777)
  • arm64 hits assert that I can't repro yet
Assert failure(PID 27 [0x0000001b], Thread: 33 [0x0021]): Assertion failed 'fgReachable(begBlk, endBlk)' in 'MS.Internal.Xml.XPath.FilterQuery:MatchNode(System.Xml.XPath.XPathNavigator):System.Xml.XPath.XPathNavigator:this' during 'Update flow graph opt pass' (IL size 444)

    File: /__w/1/s/src/coreclr/jit/optimizer.cpp Line: 167
    Image: /root/helix/work/correlation/dotnet
  • arm64 runtime failure (windows). Haven' tried this on windows but on Linux this does not hit any patchpoints or create any OSR methods, so seemingly unrelated?
JIT\Regression\JitBlue\GitHub_21990

@AndyAyersMS
Copy link
Member Author

Needed to merge up to resolve a conflict and pick up a fix.

I think we've probably gotten what we can out of this, the x64 unwind issue is the big remaining open. So after merging up I'll just let the default CI run...

@dotnet/jit-contrib PTAL

@@ -1986,13 +1986,18 @@ inline bool Compiler::lvaKeepAliveAndReportThis()
// the VM requires us to keep the generics context alive or it is used in a look-up.
// We keep it alive in the lookup scenario, even when the VM didn't ask us to,
// because collectible types need the generics context when gc-ing.
//
// Methoods that can inspire OSR methods must always report context as live
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: Methoods

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Will fix in subsequent PR.

@AndyAyersMS AndyAyersMS merged commit 479ee51 into dotnet:main Jan 28, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Feb 27, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants