gcstress=3 timeout failures #68529
Tagging subscribers to this area: @hoyosjs

Issue Details

Recent run of the `runtime-coreclr gcstress0x3-gcstress0xc` pipeline shows timeout failures after one hour in many different gcstress=3 tests: https://dev.azure.com/dnceng/public/_build/results?buildId=1735353&view=ms.vss-test-web.build-test-results-tab&runId=46978262&resultId=111084&paneView=debug

e.g., the `Methodical_*` tests in:

- coreclr Linux x64 Checked gcstress0x3 @ Ubuntu.1804.Amd64.Open
- coreclr Linux arm64 Checked gcstress0x3 @ (Ubuntu.1804.Arm64.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm64v8-20210531091519-97d8652
- coreclr Linux arm Checked gcstress0x3 @ (Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440
- coreclr OSX arm64 Checked gcstress0x3 @ OSX.1200.ARM64.Open

@trylek @jkoritzinsky did the recent test "batching" change effectively change the per-job timeouts? Does gcstress need to be handled differently?
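For anyone trying to reproduce this outside the lab: gcstress=3 sets the runtime's GC stress bitmask to 0x3 (modes 0x1 and 0x2 combined), typically supplied through the `DOTNET_GCStress` / `COMPlus_GCStress` environment variable on a Checked build. Below is a minimal, hypothetical local-repro sketch; the corerun path and wrapper name are placeholders, and the 60-minute limit simply mirrors the kind of per-item timeout discussed later in the thread.

```csharp
// Minimal local-repro sketch (hypothetical paths and wrapper name): run one merged
// test wrapper under GCStress=0x3 with a per-process time limit, roughly what a
// Helix work item does. Assumes a Checked CoreCLR build with corerun on disk.
using System;
using System.Diagnostics;

class GcStressRepro
{
    static int Main()
    {
        var psi = new ProcessStartInfo
        {
            FileName = "/path/to/artifacts/tests/coreclr/Linux.arm64.Checked/Tests/Core_Root/corerun",
            Arguments = "Methodical_d1.dll",   // hypothetical merged wrapper name
            UseShellExecute = false,
        };
        // GC stress mode 0x3 = 0x1 | 0x2; COMPlus_GCStress is the older spelling of the same knob.
        psi.Environment["DOTNET_GCStress"] = "0x3";

        using var proc = Process.Start(psi)!;
        // Mirror a per-wrapper timeout so a hang is visible locally, not just in Helix.
        if (!proc.WaitForExit((int)TimeSpan.FromMinutes(60).TotalMilliseconds))
        {
            Console.Error.WriteLine("Wrapper timed out; likely the same hang seen in the lab.");
            proc.Kill(entireProcessTree: true);
            return 1;
        }
        return proc.ExitCode;
    }
}
```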
I've just been discussing GCStress timeouts with JanV, who hit this locally, and I think I have a fix, or at least a mitigation, in the works. Please let me know if you're aware of yet another place where the timeouts are getting hard-coded; there are never enough duplicates.
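Purely to illustrate the "too many duplicated hard-coded timeouts" point, a single helper along these lines (names and factors hypothetical, not the actual test infrastructure) is the shape of fix being described: scale one baseline instead of patching each copy.

```csharp
// Illustrative sketch only (hypothetical names): compute the Helix work-item timeout
// in one place so GC-stress scaling cannot be missed by yet another hard-coded copy.
using System;

static class TestTimeouts
{
    // Baseline per-work-item timeout for a merged wrapper in a normal Pri1 run.
    static readonly TimeSpan Baseline = TimeSpan.FromMinutes(20);

    // GC stress makes everything dramatically slower, so scale rather than hard-code.
    public static TimeSpan ForWorkItem(bool gcStress, bool isArmOrArm64)
    {
        double multiplier = 1.0;
        if (gcStress) multiplier *= 6;        // assumed factor; tune from lab data
        if (isArmOrArm64) multiplier *= 1.5;  // slower hardware in the lab
        return TimeSpan.FromTicks((long)(Baseline.Ticks * multiplier));
    }
}
```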
A separate question, besides upping the timeouts, is whether the gcstress jobs should be split into more Helix jobs, to do the work with greater parallelism, or at least to not lose so much work past the timeout when a timeout is actually hit. Also, we never see Helix work progress until the job is finished, so it's advantageous not to have a Helix job run too long.
@BruceForstall - thanks for pointing that out. I think there are two different aspects to all of this. I looked at multiple jobs in the run, and the timeouts I saw suggest it might in fact be a test issue: in all of the failed tests I looked at, the failure happened in the same couple of places.

For the actual running time of the individual items, these failures tell us nothing. In JanV's local testing the Methodical merged wrappers timed out after the pre-existing 20-minute timeout in GCStress mode, while the TypeGeneratorTests just made it in about 10-15 minutes. In Pri1 mode each of the Methodical wrappers comprises about 200-300 tests; in other words, if they theoretically took about 45 minutes to run, further splitting would reduce the total job running time at the expense of spinning up more Helix VMs and more overhead w.r.t. repeated managed runtime initialization, which is much slower in GC stress mode. I'm trying to mine Kusto for data to let me better estimate the trade-off.
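To make the trade-off concrete, here is a tiny back-of-envelope sketch using the numbers above (about 250 tests per wrapper, ~45 minutes per wrapper under GC stress) plus an assumed per-work-item startup cost; none of this comes from Kusto, it just illustrates how wall-clock time and total machine time pull in opposite directions as a wrapper is split.

```csharp
// Back-of-envelope sketch of the split trade-off. All numbers are assumptions taken
// from the comment above (200-300 tests per wrapper, ~45 min per wrapper under GC
// stress); the startup cost per extra Helix work item is a guess.
using System;

class SplitTradeoff
{
    static void Main()
    {
        const double testsPerWrapper   = 250;   // middle of the 200-300 range
        const double minutesPerWrapper = 45;    // theoretical GC-stress running time
        const double minutesPerTest    = minutesPerWrapper / testsPerWrapper;
        const double startupMinutes    = 3;     // assumed runtime init + infra cost per work item

        for (int pieces = 1; pieces <= 8; pieces *= 2)
        {
            // Each piece runs testsPerWrapper / pieces tests plus its own startup cost;
            // with enough Helix machines the pieces run in parallel, so wall time is one piece.
            double wallMinutes  = startupMinutes + (testsPerWrapper / pieces) * minutesPerTest;
            double totalMinutes = pieces * wallMinutes;   // aggregate machine time paid for
            Console.WriteLine($"{pieces} piece(s): ~{wallMinutes:F0} min wall, ~{totalMinutes:F0} machine-minutes");
        }
    }
}
```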
Does this mean that maybe the way some tests are written makes them incompatible with a run enabling both GCStress and merged tests? If so, can we identify those incompatible patterns and disallow merging for those tests?
Yes, that more or less matches my understanding. In general the easiest way to single out these tests is by marking them to run standalone, outside the merged wrappers.
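As an illustration (not an actual test from the tree), the kind of pattern that cannot safely share a process with other merged tests looks roughly like this:

```csharp
// Hypothetical example of a test pattern that is unsafe to merge into a shared
// process: it mutates process-wide state and terminates the process itself.
// Tests like this are the ones that need to be singled out to keep running in
// their own process rather than inside a merged wrapper.
using System;

class ProcessWideTest
{
    static int Main()
    {
        // Process-wide knobs: every other test merged into the same process sees these.
        Environment.SetEnvironmentVariable("MY_FEATURE_SWITCH", "1");
        AppContext.SetSwitch("System.Some.Hypothetical.Switch", true);

        // Exiting the process directly would tear down all the other merged tests too.
        Environment.Exit(RunScenario() ? 100 : 101);
        return 0; // unreachable
    }

    static bool RunScenario() => true; // stand-in for the real test body
}
```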
@trylek A weekend run shows we're still hitting 4-hour timeouts with Linux-arm64 GCStress=3 in a couple of Methodical_* tests.
Hmm, sadly this is not a previously overlooked case - there are no explicit GC management calls in the test; I'm afraid I don't see any way forward without some form of diagnosability, either by enabling dumps (e.g. just for a few instrumented runs if it's problematic for all tests) or by reproing locally or on a lab VM.
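One possible shape for the "few instrumented runs" idea is a watchdog in the wrapper host that grabs a dump before the 4-hour Helix timeout kills everything. This is only a sketch; the createdump location and arguments here are assumptions, and `dotnet-dump collect -p <pid>` would be another option.

```csharp
// Sketch of the "enable dumps for a few instrumented runs" idea: when the test
// process exceeds a soft limit, try to capture a dump of it before the Helix-level
// timeout tears the machine down, then kill it so the job fails fast.
// The createdump path and arguments below are assumptions, not verified infra.
using System;
using System.Diagnostics;

class HangWatchdog
{
    static void Capture(Process testProcess, TimeSpan softLimit)
    {
        if (testProcess.WaitForExit((int)softLimit.TotalMilliseconds))
            return; // finished normally, nothing to do

        var dump = new ProcessStartInfo
        {
            FileName  = "/path/to/Core_Root/createdump",              // assumed location
            Arguments = $"--full --name /tmp/hang-%p.dmp {testProcess.Id}", // assumed arguments
            UseShellExecute = false,
        };
        using var dumper = Process.Start(dump);
        dumper?.WaitForExit();

        // Fail fast instead of burning the rest of the 4-hour job timeout.
        testProcess.Kill(entireProcessTree: true);
    }
}
```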