Test Group Failure: System.Runtime.Tests outerloop #56567
Comments
Tagging subscribers to this area: @dotnet/area-system-runtime

Issue Details

Noticed these failures when I was investigating some disabled tracing tests in #56507. These failures are unrelated to the tests I turned back on in that PR, so I looked at the history.

net6.0-Linux-Debug-x64-CoreCLR_release-Ubuntu.1804.Amd64.Open
and
net6.0-Linux-Debug-x64-CoreCLR_release-SLES.15.Amd64.Open

Both appear to be the same failure, with little to no other diagnostic information. I see a few other failures in the AzDO history going back to at least June 24th, and I saw failures as far back as early May. The logs for those builds are gone, so I can't verify that they are the same failure. I stopped going back through the history at May, so I'm not sure how far back this failure goes.

Based on the history, this test looks potentially flaky: it routinely passes but occasionally fails, seemingly in pairs, e.g., if one test run fails, there is another failure within a run of the other queue. All records of the test in AzDO have the exact same duration, 00:01:00.00, regardless of pass or fail, so I'm not sure how much I trust these records.

I couldn't find an issue tracking this, but feel free to dup if there is already one.
Another hit on these failures: https://github.com/dotnet/runtime/pull/56654/checks?check_run_id=3207664110
Need to find out what was eating 1.2 GB of memory in the tests/product.
Interestingly, 100% of these SIGKILLs of this test library are on Ubuntu 1804 and SLES 15. Could they have less memory or a different configuration? Next step: either try to repro locally, or perhaps fix #55702 so that we get a dump.

Kusto query against https://engsrvprod.kusto.windows.net/engineeringdata (exit code 137 is 128 + 9, i.e., the process was killed with SIGKILL):

WorkItems
| where Started > now(-30d)
| where FriendlyName == "System.Runtime.Tests"
| where ExitCode == 137 //or ExitCode == 0
| join kind= inner (
Jobs | where Started > now(-30d) | project QueueName , JobId, Build, Type, Source,
Branch,
Pipeline = tostring(parse_json(Properties).DefinitionName),
Pipeline_Configuration = tostring(parse_json(Properties).configuration),
OS = QueueName,
Arch = tostring(parse_json(Properties).architecture)
) on JobId
| where Branch !startswith "refs/pull"
| summarize count() by ExitCode, QueueName, Branch, Pipeline, Pipeline_Configuration, OS, Arch
| order by count_ desc
Moving this to 7.0.0 as it isn't a ship blocker, but it's important that our tests don't crash, so we should investigate a little later.
Incidentally, dumping the FinishedDate column shows this failing 410 times in the last 30 days across the main/preview branches. That's not good; it's probably one badly behaving test. Presumably an outerloop test, per the table, and we haven't added one of those since April.

Not sure we can go back further in the test failure history.
I take that back -- it started on April 22!

WorkItems
| where FriendlyName == "System.Runtime.Tests"
| where ExitCode == 137 //or ExitCode == 0
| join kind= inner (
Jobs | project QueueName , JobId, Build, Type, Source,
Branch,
Pipeline = tostring(parse_json(Properties).DefinitionName),
Pipeline_Configuration = tostring(parse_json(Properties).configuration),
OS = QueueName,
Arch = tostring(parse_json(Properties).architecture)
) on JobId
| where Branch !startswith "refs/pull"
| summarize count() by ExitCode, QueueName, Branch, Pipeline, Pipeline_Configuration, OS, Arch, bin(Finished, 1d)
| order by Finished asc
| take 10
So this is very likely caused by https://github.com/dotnet/runtime/pull/51548/files. We can mark the tests to skip Ubuntu and SLES; they aren't likely to have OS-specific bugs, and an OOM-killer termination doesn't indicate a bug in the product.
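A minimal sketch of one way to express that skip, assuming the ConditionalFact and OuterLoop attributes from Microsoft.DotNet.XUnitExtensions that dotnet/runtime tests use; the IsNotUbuntuOrSles condition and the test body are hypothetical stand-ins, not the actual test from that PR.

```csharp
using System.IO;
using System.Runtime.InteropServices;
using Xunit; // ConditionalFact/OuterLoop assumed to come from Microsoft.DotNet.XUnitExtensions

public class LargeAllocationTests
{
    // Hypothetical condition: true everywhere except Ubuntu and SLES,
    // detected from /etc/os-release on Linux.
    public static bool IsNotUbuntuOrSles
    {
        get
        {
            if (!RuntimeInformation.IsOSPlatform(OSPlatform.Linux))
                return true;

            string osRelease = File.Exists("/etc/os-release")
                ? File.ReadAllText("/etc/os-release")
                : string.Empty;
            return !osRelease.Contains("ID=ubuntu") && !osRelease.Contains("ID=sles");
        }
    }

    [ConditionalFact(nameof(IsNotUbuntuOrSles))]
    [OuterLoop]
    public void AllocateVeryLargeArray_Succeeds()
    {
        // Placeholder for the ~1 GB allocation the real outerloop test performs.
        byte[] buffer = new byte[1 << 30];
        Assert.Equal(1 << 30, buffer.Length);
    }
}
```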
Tagging @GrabYourPitchforks for visibility (I was just triaging the label).
I'll skip them on these OSes.
Interesting. We did add an outerloop test as part of that PR (see here), but it follows the same pattern that ...
Not that I see -- not an OOM, anyway. Could it be that occasionally the GC does not reclaim the 1 GB from the first test by the time the second one tries to allocate?
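A minimal local probe of that hypothesis, assuming nothing about the real tests beyond the ~1 GB array allocation described in this thread; the program and method names are hypothetical.

```csharp
using System;

class GcReclaimProbe
{
    static void AllocateRoughlyOneGigabyte()
    {
        // Allocate ~1 GB and let it become unreachable when the method returns,
        // mimicking the test pattern described in this issue.
        byte[] buffer = new byte[1 << 30];
        buffer[^1] = 1; // touch the array so the allocation is not elided
    }

    static void Main()
    {
        AllocateRoughlyOneGigabyte();

        // Without an induced collection, the first array may still be committed here.
        Console.WriteLine($"Managed heap before full GC: {GC.GetTotalMemory(forceFullCollection: false):N0} bytes");

        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        Console.WriteLine($"Managed heap after full GC:  {GC.GetTotalMemory(forceFullCollection: true):N0} bytes");

        // A second ~1 GB allocation; if the first block were still committed at this point,
        // the process would briefly have more than 2 GB committed.
        AllocateRoughlyOneGigabyte();
    }
}
```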
I wonder if it's a memory fragmentation issue. There's enough memory available, but not always as a contiguous block, so things fall over. And having the two tests run one after another exacerbates the fragmentation.
@maonis: here we have two tests, run immediately one after another, that each allocate a 1 GB array and then let it go out of scope. This is periodically failing on Linux, only on SLES and Ubuntu, where the OOM killer terminates the process (with a bit over 1 GB committed, per the message). This did not happen when there was one such test, only after Levi added a second such test that runs directly after it. There's no product bug here; I'm just curious whether you can shed light on why that might happen when the machine presumably has significantly more memory, and whether you are aware of OOM-killer behavior varying between distros.
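For reference, a hypothetical reduction of the pattern described above (two xUnit tests that run back to back, each allocating a ~1 GB array and letting it go out of scope); the test names and bodies are illustrative, not the actual tests from the PR.

```csharp
using Xunit;

public class BackToBackLargeAllocationTests
{
    [Fact]
    public void FirstTest_AllocatesOneGigabyte()
    {
        // ~1 GB allocation that becomes unreachable when the test returns.
        byte[] buffer = new byte[1 << 30];
        Assert.Equal(1 << 30, buffer.Length);
    }

    [Fact]
    public void SecondTest_AllocatesOneGigabyte()
    {
        // If the first test's array has not yet been collected (or its pages not yet
        // returned to the OS) when this runs, committed memory briefly exceeds 2 GB,
        // which may be enough to trigger the OOM killer on a small CI machine.
        byte[] buffer = new byte[1 << 30];
        Assert.Equal(1 << 30, buffer.Length);
    }
}
```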