System.Threading.Tasks.Tests timed out on net5.0-Linux-Debug-arm64-Mono_release #42024
According to Kusto, there are only 13 records of this method taking longer than 1 second. Is there a good way to query how many hangs? |
taking a look. |
Given it's only timed out 5x in mono and 3x in not-mono for the history of everything in the DB (less than 0.1%), I'd go with "rare testcase issue that hasn't shown up in other configurations yet." |
Sounds like this isn't a case of getting unlucky at the tail end of a distribution of test durations, then. It sounds like if the tests/product work correctly, even on a heavily loaded machine the run will take no more than a second or two. In this case, it did not finish in several minutes. That suggests to me that either the tests, the library, or the runtime has a flaw that occasionally causes a genuine hang. And the key thing we need is a dump file. So +1 for migrating to vstest, and I guess we wait in the meantime. |
Moving to area-Infrastructure, since there is no System.Threading action at the moment. |
Tagging subscribers to this area: @ViktorHofer |
This bug is tracking a test failure in Threading not an infrastructure issue. Infrastructure can help debug but this isn’t tracking any infra work. |
Tagging subscribers to this area: @tarekgh |
Hit again in #42014 |
cc @SamMonoRT @lambdageek since this is mono. |
I ran a Kusto query for both mono and coreclr in the last week (substitute =~ "coreclr" in the RuntimeFlavor filter to get the coreclr numbers)
Mono:
CoreCLR:
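The query itself and its result tables were not captured above. Purely as an illustration, here is a minimal sketch of running such a query from Python with the azure-kusto-data client; the cluster URL, database name, WorkItems table, and column names (RuntimeFlavor, FriendlyName, Status, Queued) are assumptions for illustration, not a confirmed dnceng schema.

```python
# Sketch: count work-item outcomes for this test suite by runtime flavor.
# Assumptions (not verified): cluster URL, database name, table and column names.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

CLUSTER = "https://example-engineering.kusto.windows.net"  # assumed cluster
DATABASE = "engineeringdata"                               # assumed database

QUERY = """
WorkItems
| where Queued > ago(7d)
| where FriendlyName startswith "System.Threading.Tasks.Tests"
| where RuntimeFlavor =~ "mono"          // substitute =~ "coreclr" for CoreCLR
| summarize count() by Status            // e.g. Pass / Fail / BadExit / Timeout
"""

def main() -> None:
    # Uses Azure CLI credentials; any supported auth method would work.
    kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(CLUSTER)
    client = KustoClient(kcsb)
    response = client.execute(DATABASE, QUERY)
    for row in response.primary_results[0]:
        print(row["Status"], row["count_"])

if __name__ == "__main__":
    main()
```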
What does BadExit mean? And assuming the query is correct, there doesn't seem to be a significant number of timeouts. |
non-zero exit codes. Seg faults, script failures, that sort of thing. |
@tarekgh @stephentoub can you please take a look? Note that the timeout might be mono specific. |
Added a live table of failures to the issue. |
I think Mono's team can take a look as this is happening on Mono's runtime only. @marek-safar could you please get someone to look at this one? |
@steveisok - any bandwidth to take an initial look? |
Is it a Mono-only issue? @steveisok's query in #42024 (comment) indicates that CoreCLR is crashing even more often |
@marek-safar Mono is the one timing out, coreclr is not. |
Agree with @tarekgh about the CoreCLR version of the failure. Clicking through ~5 failures, they aren't timing out. Also, all the failures are on PRs, so it's not clear whether that is a real failure or a failure caused by the change being tested. CoreCLR-specific failures excluding PRs: note that this data set is empty over the last seven days. |
Feels like there's a bit to unravel here as the original issue was specific to net5.0-Linux-Debug-arm64-Mono_release. If you look at that queue alone, there does not seem to be that big of an issue. If I expand the Kusto query out to all queues, we have the following numbers (all runs in the last week): Mono:
CoreCLR:
A cursory glance into some of the builds that @jaredpar linked into the issue seems to indicate infrastructure-induced timeouts/cancellations like:
https://dev.azure.com/dnceng/public/_build/results?buildId=836799&view=results
https://dev.azure.com/dnceng/public/_build/results?buildId=836302&view=results
And a weird error of
More analysis is definitely necessary before drawing further conclusions. Should we change the title of the issue to be something more expansive? Or close it and link to a new one? Thoughts? |
https://github.com/dotnet/core-eng/issues/11021 This seems to mostly be due to the fact that all the agents in buildpool scale sets are in the same network topology behind a NAT, so our throttling limits may currently be too low. |
That API rate limit error happens a lot: 500+ occurrences in the last week on the runtime build definition. |
I've already merged a quota increase that would help here, and am discussing whether we can hotfix this (it's just changing #s in a JSON file, so maybe?) today. The tricky part about your runfo stuff is that many of those runs didn't change their fail-y-ness (e.g. https://dev.azure.com/dnceng/public/_build/results?buildId=836918&view=logs&j=d5c01a48-52b8-51d9-fe3a-6804ba4b63f8&t=215202f9-e149-511e-645c-558c2532aa74 ) because of throttling, so it's hard to say how many runs are actually broken by it, but I share your concerns and am trying to expedite it if possible. |
Not sure what you mean here. These are all timeline issue errors so that will default to failing the build. |
I mean runfo caught this but it failed exactly the same as it was going to fail without any 429s: https://dev.azure.com/dnceng/public/_build/results?buildId=836918&view=logs&j=d5c01a48-52b8-51d9-fe3a-6804ba4b63f8&t=215202f9-e149-511e-645c-558c2532aa74 |
So essentially you want to know when the only source of errors in a build were these 429s? |
If you can figure out how to do that, that'd be rad. I'm not sure how easy or worth it it is. A hotfix to more than double the per-minute quota is rolling out presently and should be live in the next 30 minutes. |
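For the idea being discussed here (flagging builds whose only errors are 429 throttling messages), a minimal sketch against the public Azure DevOps build timeline REST API might look like the following; the substring heuristic for spotting rate-limit errors is an assumption, not an established rule.

```python
# Sketch: return True when every error issue in a build's timeline looks like
# a 429/rate-limit message. The timeline endpoint is the public Azure DevOps
# REST API; the matching heuristic below is an assumption for illustration.
import requests

ORG, PROJECT = "dnceng", "public"

def only_429_errors(build_id: int) -> bool:
    url = (f"https://dev.azure.com/{ORG}/{PROJECT}/_apis/build/builds/"
           f"{build_id}/timeline?api-version=6.0")
    timeline = requests.get(url, timeout=30).json()
    errors = [
        issue["message"]
        for record in timeline.get("records", [])
        for issue in (record.get("issues") or [])
        if issue.get("type") == "error"
    ]
    # True only if the build had errors and all of them look like throttling.
    return bool(errors) and all(
        "429" in msg or "rate limit" in msg.lower() for msg in errors
    )
```

For example, only_429_errors(836918) on the build linked above would come back False if it also hit non-throttling errors, which is the distinction being asked about.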
A couple of days ago, I ran this test suite locally about 300 times in a loop and couldn't get it to hang on OSX. (runtime and libs built with
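For anyone else trying to reproduce a rare hang like this locally, a minimal sketch of the loop-until-hang approach is below; the test command is a placeholder, not the exact invocation used in that run.

```python
# Sketch: run a test suite repeatedly, treating a long-running iteration as a hang.
# The command is a placeholder -- substitute the repo's actual test invocation.
import subprocess
import sys

CMD = ["./dotnet.sh", "build", "/t:Test",
       "src/libraries/System.Threading.Tasks/tests"]  # placeholder command

for i in range(300):
    try:
        result = subprocess.run(CMD, timeout=15 * 60)  # >15 min counts as a hang
    except subprocess.TimeoutExpired:
        print(f"iteration {i}: hang (timed out)")
        sys.exit(1)
    if result.returncode != 0:
        print(f"iteration {i}: failed with exit code {result.returncode}")
        sys.exit(result.returncode)
print("no hang in 300 iterations")
```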
Looks like this test failed frequently for a few weeks and then hasn't failed in over six months. I'm going to close this for now. |
It's failed 21 times in the last three days. On non-pure PR runs it failed 4 times. At least this one failed with the same error message as the original bug.
Ok, my runfo search skills are apparently quite lacking. Not sure what I searched for that yielded an empty set. |
Actually, no, I was right the first time. None of those failures are this test. A couple of them are
so I don't know why it's showing up in test results from the last few days. And that one is tracked by a different issue, #2271. |
My bad. I was looking at the title of the issue which had the test group. |
@stephentoub the only instance of the |
Ah, that explains that then. |
net5.0-Linux-Debug-arm64-Mono_release-(Ubuntu.1804.ArmArch.Open)Ubuntu.1804.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-16.04-helix-arm64v8-bfcd90a-20200127194925
https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-42014-merge-47dea998f8584728a3/System.Threading.Tasks.Tests/console.181c257c.log?sv=2019-02-02&se=2020-09-29T13%3A09%3A28Z&sr=c&sp=rl&sig=lQ0DpS2rIbgdmMDvCYSnjAiZcVgh3Mgaf9fv5VNVCog%3D
https://dev.azure.com/dnceng/public/_build/results?buildId=807090&view=ms.vss-test-web.build-test-results-tab&runId=25594170&resultId=178179&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab
Can't say whether it's a Mono issue, or a rare testcase issue that hasn't shown up in other configurations yet.
Note there is no dump, but @ViktorHofer is working to migrate to vstest, at which point we will have dumps for hangs and timeouts like this.
Runfo Tracking Issue: System.Threading.Tasks.Tests timeout Mono
Displaying 100 of 129 results