Test Group Failure: System.Runtime.Tests outerloop #56567
Comments
Tagging subscribers to this area: @dotnet/area-system-runtime

Issue Details

Noticed these failures when I was investigating some disabled tracing tests in #56507. These failures are unrelated to the tests I turned back on in that PR, so I looked at the history.

net6.0-Linux-Debug-x64-CoreCLR_release-Ubuntu.1804.Amd64.Open
and
net6.0-Linux-Debug-x64-CoreCLR_release-SLES.15.Amd64.Open

Both appear to be the same failure, with little to no other diagnostic information. I see a few other failures in the AzDO history going back to at least June 24th, and I saw failures as far back as early May. The logs for those builds are gone, so I can't verify that they are the same failure. I stopped going back through the history at May, so I'm not sure how far back this failure goes.

Based on the history, this test looks potentially flaky: it routinely passes but occasionally fails, seemingly in pairs, e.g., if one test run fails, there is another failure within a run of the other queue. All records of the test in AzDO have the exact same duration, 00:01:00.00, regardless of pass or fail, so I'm not sure how much I trust these records.

I couldn't find an issue tracking this, but feel free to dup if there is already one.
Another hit on these failures: https://github.com/dotnet/runtime/pull/56654/checks?check_run_id=3207664110
Need to find out what was eating 1.2 GB of memory in the tests/product.
Interestingly, 100% of these SIGKILLs of this test library are on Ubuntu 1804 and SLES 15. Could they have less memory or a different configuration? Next step: either try to repro locally, or perhaps fix #55702 so that we get a dump.

Kusto query against https://engsrvprod.kusto.windows.net/engineeringdata (exit code 137 is 128 + 9, i.e., the process was killed with SIGKILL):

WorkItems
| where Started > now(-30d)
| where FriendlyName == "System.Runtime.Tests"
| where ExitCode == 137 //or ExitCode == 0
| join kind= inner (
Jobs | where Started > now(-30d) | project QueueName , JobId, Build, Type, Source,
Branch,
Pipeline = tostring(parse_json(Properties).DefinitionName),
Pipeline_Configuration = tostring(parse_json(Properties).configuration),
OS = QueueName,
Arch = tostring(parse_json(Properties).architecture)
) on JobId
| where Branch !startswith "refs/pull"
| summarize count() by ExitCode, QueueName, Branch, Pipeline, Pipeline_Configuration, OS, Arch
| order by count_ desc
Moving this to 7.0.0 as it isn't a ship blocker, but it's important that our tests don't crash, so we should investigate a little later.
Incidentally, dumping the FinishedDate column shows this failing 410 times in the last 30 days across the main/preview branches. That's not good; it's probably one badly behaving test. Presumably an outerloop test, per the table, and we haven't added one of those since April.

Not sure we can go back further in the test failure history.
I take that back -- it started on April 22!

WorkItems
| where FriendlyName == "System.Runtime.Tests"
| where ExitCode == 137 //or ExitCode == 0
| join kind= inner (
Jobs | project QueueName , JobId, Build, Type, Source,
Branch,
Pipeline = tostring(parse_json(Properties).DefinitionName),
Pipeline_Configuration = tostring(parse_json(Properties).configuration),
OS = QueueName,
Arch = tostring(parse_json(Properties).architecture)
) on JobId
| where Branch !startswith "refs/pull"
| summarize count() by ExitCode, QueueName, Branch, Pipeline, Pipeline_Configuration, OS, Arch, bin(Finished, 1d)
| order by Finished asc
| take 10
So this is very likely caused by https://github.com/dotnet/runtime/pull/51548/files. We can mark the tests to skip Ubuntu and SLES; they aren't likely to have OS-specific bugs, and an OOM-killer termination doesn't indicate a bug in the product.
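A minimal sketch of one way to express that skip, assuming the ConditionalFact and OuterLoop attributes from Microsoft.DotNet.XUnitExtensions that dotnet/runtime tests use; the IsNotUbuntuOrSles condition and the test body are hypothetical stand-ins, not the actual test from that PR.

```csharp
using System.IO;
using System.Runtime.InteropServices;
using Xunit; // ConditionalFact/OuterLoop assumed to come from Microsoft.DotNet.XUnitExtensions

public class LargeAllocationTests
{
    // Hypothetical condition: true everywhere except Ubuntu and SLES,
    // detected from /etc/os-release on Linux.
    public static bool IsNotUbuntuOrSles
    {
        get
        {
            if (!RuntimeInformation.IsOSPlatform(OSPlatform.Linux))
                return true;

            string osRelease = File.Exists("/etc/os-release")
                ? File.ReadAllText("/etc/os-release")
                : string.Empty;
            return !osRelease.Contains("ID=ubuntu") && !osRelease.Contains("ID=sles");
        }
    }

    [ConditionalFact(nameof(IsNotUbuntuOrSles))]
    [OuterLoop]
    public void AllocateVeryLargeArray_Succeeds()
    {
        // Placeholder for the ~1 GB allocation the real outerloop test performs.
        byte[] buffer = new byte[1 << 30];
        Assert.Equal(1 << 30, buffer.Length);
    }
}
```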
Tagging @GrabYourPitchforks for visibility (I was just triaging the label).
I'll skip them on these OSes.
Interesting. We did add an outerloop test as part of that PR (see here), but it follows the same pattern that ...
Not that I see -- not an OOM, anyway. Could it be that occasionally the GC does not reclaim the 1 GB from the first test by the time the second one tries to allocate?
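A minimal local probe of that hypothesis, assuming nothing about the real tests beyond the ~1 GB array allocation described in this thread; the program and method names are hypothetical.

```csharp
using System;

class GcReclaimProbe
{
    static void AllocateRoughlyOneGigabyte()
    {
        // Allocate ~1 GB and let it become unreachable when the method returns,
        // mimicking the test pattern described in this issue.
        byte[] buffer = new byte[1 << 30];
        buffer[^1] = 1; // touch the array so the allocation is not elided
    }

    static void Main()
    {
        AllocateRoughlyOneGigabyte();

        // Without an induced collection, the first array may still be committed here.
        Console.WriteLine($"Managed heap before full GC: {GC.GetTotalMemory(forceFullCollection: false):N0} bytes");

        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        Console.WriteLine($"Managed heap after full GC:  {GC.GetTotalMemory(forceFullCollection: true):N0} bytes");

        // A second ~1 GB allocation; if the first block were still committed at this point,
        // the process would briefly have more than 2 GB committed.
        AllocateRoughlyOneGigabyte();
    }
}
```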
I wonder if it's a memory fragmentation issue. There's enough memory available, but not always as a contiguous block, so things fall over. And having the two tests run one after another exacerbates the fragmentation.
@maonis: here we have two tests, run immediately one after another, that each allocate a 1 GB array and then let it go out of scope. This is periodically failing on Linux, only on SLES and Ubuntu, where the OOM killer terminates the process (with a bit over 1 GB committed, per the message). This did not happen when there was one such test, only after Levi added a second such test that runs directly after it. There's no product bug here; I'm just curious whether you can shed light on why that might happen when the machine presumably has significantly more memory, and whether you are aware of OOM-killer behavior varying between distros.
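For reference, a hypothetical reduction of the pattern described above (two xUnit tests that run back to back, each allocating a ~1 GB array and letting it go out of scope); the test names and bodies are illustrative, not the actual tests from the PR.

```csharp
using Xunit;

public class BackToBackLargeAllocationTests
{
    [Fact]
    public void FirstTest_AllocatesOneGigabyte()
    {
        // ~1 GB allocation that becomes unreachable when the test returns.
        byte[] buffer = new byte[1 << 30];
        Assert.Equal(1 << 30, buffer.Length);
    }

    [Fact]
    public void SecondTest_AllocatesOneGigabyte()
    {
        // If the first test's array has not yet been collected (or its pages not yet
        // returned to the OS) when this runs, committed memory briefly exceeds 2 GB,
        // which may be enough to trigger the OOM killer on a small CI machine.
        byte[] buffer = new byte[1 << 30];
        Assert.Equal(1 << 30, buffer.Length);
    }
}
```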