
scaling: stability: 1000 containers after apprx 3 hours shows issue #807

Closed

grahamwhaley opened this issue Oct 5, 2018 · 44 comments

@grahamwhaley
Contributor

Description of problem

Using the fast footprint test, I launched 1000 containers, and left them running overnight. By the morning, things had crashed.

Expected result

These are pretty benign containers (a busybox doing 'nothing'), and the system is pretty large and not resource constrained afaict (88 cores, 384 GB of RAM). I'd expect the containers to stay up, pretty much forever.

Actual result

Something has 'died', and it looks like the kata runtime has become non-functional.
The first time I ran this test, iirc, I ended up with 847 'live' containers in the morning. This time things crashed out. Logs below.

What did I run

For reference, I used this script to run the test and try to capture details upon death:

#!/bin/bash

# set our basic premise...
export NUM_CONTAINERS=1000
export MAX_MEMORY_CONSUMED="300*1024*1024*1024"

NAP_TIME=30
JOURNAL_ENTRIES=100
DMESG_ENTRIES=20

# Grab a system status snapshot - to stdout
snapshot() {
        local fname="${1:-}"
        local journal="$(sudo journalctl -n ${JOURNAL_ENTRIES} --no-pager)"
        local dm="$(dmesg -H | tail -${DMESG_ENTRIES})"
        local dkr="$(docker ps -a)"
        local kata="$(kata-runtime kata-env)"

        echo "Sanity log file [$fname]"
        echo "------------------------"
        echo "---- kata env"
        echo "${kata}"
        echo "---- dmesg tail"
        echo "${dm}"
        echo "---- Docker ps -a"
        echo "${dkr}"
        echo "---- journal tail"
        echo "${journal}"
        echo "------------------------"
}


echo "fast footprint sanity test"
echo " NUM_CONTAINERS=${NUM_CONTAINERS}"
echo " MAX_MEMORY_CONSUMED=${MAX_MEMORY_CONSUMED}"
echo " NAP_TIME=${NAP_TIME}"
echo " JOURNAL_ENTRIES=${JOURNAL_ENTRIES}"
echo " DMESG_ENTRIES=${DMESG_ENTRIES}"

echo "Launching the containers"
# launch the containers
bash ./fast_footprint.sh

#Take a status snapshot
snapshot sanity_launched.log > sanity_launched.log

echo "entering sanity check loop"
# enter the sanity poll loop
while :; do
        sleep ${NAP_TIME}
        ts=$(date -Iseconds)
        cnt=$(docker ps -q | wc -l)
        echo -n "[${ts}] of ${NUM_CONTAINERS} have ${cnt}"
        if [ "${cnt}" -eq "${NUM_CONTAINERS}" ]; then
                echo " OK"
        else
                echo " FAIL"
                snapshot sanity_fail.log > sanity_fail.log
                exit 1
        fi

done

What did I see

From the logs then...

Wating for KSM to settle...
............................................................................................................................................................................................................................................................................................................Timed out after 300s waiting for KSM to settle
and dodge death (>> I have no idea where this comes from btw - I'll check some time ;-) <<)
entering sanity check loop
[2018-10-04T08:51:50-07:00] of 1000 have 1000 OK

... about 3 hours later ...

[2018-10-04T12:04:23-07:00] of 1000 have 1000 OK
runtime/cgo: pthread_create failed: No space left on device
SIGABRT: abort
PC=0x7fe6f3fdb428 m=19 sigcode=18446744073709551610

goroutine 0 [idle]:
runtime: unknown pc 0x7fe6f3fdb428
stack: frame={sp:0x7fe6d67fba08, fp:0x0} stack=[0x7fe6d5ffc2f0,0x7fe6d67fbef0)
00007fe6d67fb908:  00007fe6f49be168  00007fe6d67fba68
00007fe6d67fb918:  00007fe6f47a1b1f  0000000000000003
00007fe6d67fb928:  00007fe6f49ae5f8  0000000000000005
00007fe6d67fb938:  000000000249b040  00007fe6bc0008c0
00007fe6d67fb948:  00000000000000f1  0000000000000011
00007fe6d67fb958:  0000000000000000  00000000019c71ff
00007fe6d67fb968:  00007fe6f47a6ac6  0000000000000005
00007fe6d67fb978:  0000000000000000  0000000100000000
00007fe6d67fb988:  00007fe6f3facde0  00007fe6d67fbb20
00007fe6d67fb998:  00007fe6f47ae923  00000000000000ff
00007fe6d67fb9a8:  0000000000000000  0000000000000000
00007fe6d67fb9b8:  0000000000000000  2525252525252525
00007fe6d67fb9c8:  2525252525252525  0000000000000000
00007fe6d67fb9d8:  00007fe6f436b700  00000000019c71ff
00007fe6d67fb9e8:  00007fe6bc0008c0  00000000000000f1
00007fe6d67fb9f8:  0000000000000011  0000000000000000
00007fe6d67fba08: <00007fe6f3fdd02a  0000000000000020
00007fe6d67fba18:  0000000000000000  0000000000000000
00007fe6d67fba28:  0000000000000000  0000000000000000
00007fe6d67fba38:  0000000000000000  0000000000000000
00007fe6d67fba48:  0000000000000000  0000000000000000
00007fe6d67fba58:  0000000000000000  0000000000000000
00007fe6d67fba68:  0000000000000000  0000000000000000
00007fe6d67fba78:  0000000000000000  0000000000000000
00007fe6d67fba88:  0000000000000000  0000000000000000
00007fe6d67fba98:  0000000000000000  0000000000000000
00007fe6d67fbaa8:  00007fe6f401ebff  00007fe6f436b540
00007fe6d67fbab8:  0000000000000001  00007fe6f436b5c3
00007fe6d67fbac8:  00000000000000f1  0000000000000011
00007fe6d67fbad8:  00007fe6f4020409  000000000000000a
00007fe6d67fbae8:  00007fe6f409d2dd  000000000000000a
00007fe6d67fbaf8:  00007fe6f436c770  0000000000000000
runtime: unknown pc 0x7fe6f3fdb428
stack: frame={sp:0x7fe6d67fba08, fp:0x0} stack=[0x7fe6d5ffc2f0,0x7fe6d67fbef0)
00007fe6d67fb908:  00007fe6f49be168  00007fe6d67fba68

... and more stack dumps....

I've attached the full log as 'death.log'.
death.log

Also, if I try to use the runtime to list how many containers are still running:

gwhaley@clrw02:~/go/src/github.com/kata-containers/tests/metrics/density$ sudo kata-runtime list
runtime/cgo: pthread_create failed: No space left on device
Aborted (core dumped)

I've uploaded the output from the kata collect as an attachment:
collect.log

@grahamwhaley
Contributor Author

@jodh-intel @sboeuf - input welcome here.
Right now I still have the machine in the 'dead state'. I'll leave it like that for now, and then try to re-run early next week.

@grahamwhaley grahamwhaley changed the title scaling: stability: 1000 containers after XX hours shows issue scaling: stability: 1000 containers after apprx 3 hours shows issue Oct 5, 2018
@jodh-intel
Contributor

Wow, that system looks sick (in the conventional, non-slang sense :-)).

Might be worth grabbing ps, ipcs and journalctl -t kernel from that system. Does ps show a glut of kata-runtime processes maybe?

Also, I'm guessing we're not doing this currently, but it might be worth running sar or similar on the box when you try to recreate to see if something is obviously leaking.

I've noticed something odd in the collect.log - that's the output of kata-collect-data.sh, which runs a bunch of commands. The log shows that it managed to run kata-runtime kata-env, cat, and systemctl just fine. It then tried to run kubectl twice and both instances crashed. It then tried to run systemctl (worked), then crio --version (crashed), then systemctl again (worked). So some golang apps are crashing but the runtime isn't. That sounds a bit odd (and unlikely). Could you extract full log details from the journal?
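One way to pull those out would be something along these lines (the syslog identifiers here are a guess, based on the log excerpts further down this thread):

sudo journalctl -t kernel --no-pager > kernel-journal.log
sudo journalctl -t kata-runtime --no-pager > kata-runtime-journal.log
sudo journalctl -t kata-proxy --no-pager > kata-proxy-journal.log
sudo journalctl -t kata-shim --no-pager > kata-shim-journal.log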

@grahamwhaley
Contributor Author

Yep, it occurred to me I can still use ps to look at how many containers still think they are running:

gwhaley@clrw02:~/go/src/github.com/kata-containers/tests/metrics/density$ ps -e | fgrep kata-shim | wc
    843    3372   28662
gwhaley@clrw02:~/go/src/github.com/kata-containers/tests/metrics/density$ ps -e | fgrep kata-shim | wc
    843    3372   28662
gwhaley@clrw02:~/go/src/github.com/kata-containers/tests/metrics/density$ ps -e | fgrep qemu-lite | wc
   1000    4000   40000

Oh dear. Well, that sort of aligns pretty well with my previous run - my notes for that say '847 left'.
I did a reverse kernel journalctl (I also captured the tail of the dmesg when the test failed). The first part is me prodding the system, I think - the test data starts around:

-- Logs begin at Wed 2018-09-26 13:50:38 PDT, end at Fri 2018-10-05 06:32:46 PDT. --
Oct 05 05:51:33 clrw02 kernel: docker0: port 425(veth5cfab96) entered disabled state
Oct 05 05:51:33 clrw02 kernel: device veth5cfab96 left promiscuous mode
Oct 05 05:51:33 clrw02 kernel: docker0: port 425(veth5cfab96) entered disabled state
Oct 05 05:51:33 clrw02 kernel: vethb3407f1: renamed from eth0
Oct 05 05:51:33 clrw02 kernel: docker0: port 425(veth5cfab96) entered disabled state
Oct 05 05:51:07 clrw02 kernel: Core dump to |/usr/share/apport/apport 79014 6 18446744073709551615 1 79
Oct 05 03:57:08 clrw02 kernel: EXT4-fs (loop0p1): mounted filesystem without journal. Opts: noload
Oct 05 03:57:07 clrw02 kernel:  loop0: p1
Oct 05 03:55:34 clrw02 kernel: Core dump to |/usr/share/apport/apport 7868 6 0 1 7868 pipe failed
Oct 05 03:35:43 clrw02 kernel: Core dump to |/usr/share/apport/apport 78945 6 0 1 78945 pipe failed
Oct 05 03:35:19 clrw02 kernel: Core dump to |/usr/share/apport/apport 29619 6 18446744073709551615 1 29
Oct 05 02:13:59 clrw02 kernel: docker0: port 140(veth715092e) entered disabled state
Oct 05 02:13:59 clrw02 kernel: device veth715092e left promiscuous mode
Oct 05 02:13:59 clrw02 kernel: docker0: port 140(veth715092e) entered disabled state
Oct 05 02:13:59 clrw02 kernel: vethe2b042b: renamed from eth0
Oct 05 02:13:59 clrw02 kernel: docker0: port 140(veth715092e) entered disabled state
Oct 05 02:13:33 clrw02 kernel: Core dump to |/usr/share/apport/apport 7507 6 18446744073709551615 1 750
Oct 05 01:34:39 clrw02 kernel: docker0: port 795(vethcfb78d4) entered disabled state
Oct 05 01:34:39 clrw02 kernel: device vethcfb78d4 left promiscuous mode
Oct 05 01:34:39 clrw02 kernel: docker0: port 795(vethcfb78d4) entered disabled state
Oct 05 01:34:39 clrw02 kernel: veth6bcb98f: renamed from eth0
Oct 05 01:34:39 clrw02 kernel: docker0: port 795(vethcfb78d4) entered disabled state
Oct 05 01:34:14 clrw02 kernel: Core dump to |/usr/share/apport/apport 16829 6 18446744073709551615 1 16
Oct 05 00:40:48 clrw02 kernel: perf: interrupt took too long (3983 > 3978), lowering kernel.perf_event_
Oct 05 00:15:44 clrw02 kernel: docker0: port 544(veth86fb710) entered disabled state
Oct 05 00:15:44 clrw02 kernel: device veth86fb710 left promiscuous mode
Oct 05 00:15:44 clrw02 kernel: docker0: port 544(veth86fb710) entered disabled state
Oct 05 00:15:44 clrw02 kernel: veth5e43303: renamed from eth0
Oct 05 00:15:44 clrw02 kernel: docker0: port 544(veth86fb710) entered disabled state
Oct 05 00:15:19 clrw02 kernel: Core dump to |/usr/share/apport/apport 24815 6 18446744073709551615 1 24
Oct 04 21:55:17 clrw02 kernel: docker0: port 430(vethf10e3c5) entered disabled state
Oct 04 21:55:17 clrw02 kernel: device vethf10e3c5 left promiscuous mode
Oct 04 21:55:17 clrw02 kernel: docker0: port 430(vethf10e3c5) entered disabled state
Oct 04 21:55:17 clrw02 kernel: veth9d5b686: renamed from eth0
Oct 04 21:55:17 clrw02 kernel: docker0: port 430(vethf10e3c5) entered disabled state
Oct 04 21:54:52 clrw02 kernel: Core dump to |/usr/share/apport/apport 79184 6 18446744073709551615 1 79
Oct 04 21:33:47 clrw02 kernel: docker0: port 310(vethdb96140) entered disabled state
Oct 04 21:33:47 clrw02 kernel: device vethdb96140 left promiscuous mode
Oct 04 21:33:47 clrw02 kernel: docker0: port 310(vethdb96140) entered disabled state
Oct 04 21:33:46 clrw02 kernel: veth65efb59: renamed from eth0
Oct 04 21:33:46 clrw02 kernel: docker0: port 310(vethdb96140) entered disabled state
Oct 04 21:33:31 clrw02 kernel: Core dump to |/usr/share/apport/apport 48201 6 18446744073709551615 1 48
Oct 04 21:13:26 clrw02 kernel: docker0: port 304(vethcb7478f) entered disabled state
Oct 04 21:13:26 clrw02 kernel: device vethcb7478f left promiscuous mode
Oct 04 21:13:26 clrw02 kernel: docker0: port 304(vethcb7478f) entered disabled state
Oct 04 21:13:26 clrw02 kernel: veth1bca11b: renamed from eth0
Oct 04 21:13:26 clrw02 kernel: docker0: port 304(vethcb7478f) entered disabled state
Oct 04 21:13:01 clrw02 kernel: Core dump to |/usr/share/apport/apport 48086 6 18446744073709551615 1 48
Oct 04 20:47:26 clrw02 kernel: docker0: port 680(vethcc81b50) entered disabled state
Oct 04 20:47:26 clrw02 kernel: device vethcc81b50 left promiscuous mode
Oct 04 20:47:26 clrw02 kernel: docker0: port 680(vethcc81b50) entered disabled state
Oct 04 20:47:26 clrw02 kernel: veth883c840: renamed from eth0
Oct 04 20:47:26 clrw02 kernel: docker0: port 680(vethcc81b50) entered disabled state
Oct 04 20:47:00 clrw02 kernel: Core dump to |/usr/share/apport/apport 63080 6 18446744073709551615 1 63
Oct 04 20:25:04 clrw02 kernel: docker0: port 773(vethab548f8) entered disabled state
Oct 04 20:25:04 clrw02 kernel: device vethab548f8 left promiscuous mode
Oct 04 20:25:04 clrw02 kernel: docker0: port 773(vethab548f8) entered disabled state
Oct 04 20:25:04 clrw02 kernel: veth9ae23d7: renamed from eth0
Oct 04 20:25:04 clrw02 kernel: docker0: port 773(vethab548f8) entered disabled state
Oct 04 20:24:49 clrw02 kernel: Core dump to |/usr/share/apport/apport 9366 6 18446744073709551615 1 936
Oct 04 19:50:38 clrw02 kernel: docker0: port 176(vetha9d7fca) entered disabled state
Oct 04 19:50:38 clrw02 kernel: device vetha9d7fca left promiscuous mode
Oct 04 19:50:38 clrw02 kernel: docker0: port 176(vetha9d7fca) entered disabled state
Oct 04 19:50:38 clrw02 kernel: vethc646869: renamed from eth0
Oct 04 19:50:38 clrw02 kernel: docker0: port 176(vetha9d7fca) entered disabled state

... and repeat, lots... until where I think the test started up.....

Oct 04 12:05:51 clrw02 kernel: docker0: port 150(veth3447844) entered disabled state
Oct 04 12:05:50 clrw02 kernel: veth3ac4943: renamed from eth0
Oct 04 12:05:50 clrw02 kernel: docker0: port 150(veth3447844) entered disabled state
Oct 04 12:05:23 clrw02 kernel: Core dump to |/usr/share/apport/apport 9911 6 18446744073709551615 1 9911 pip
Oct 04 11:32:54 clrw02 kernel: perf: interrupt took too long (2542 > 2500), lowering kernel.perf_event_max_s
Oct 04 08:50:39 clrw02 kernel: bash (8700): drop_caches: 3
Oct 04 08:45:21 clrw02 kernel: device eth0 entered promiscuous mode
Oct 04 08:45:21 clrw02 kernel: device eth0 entered promiscuous mode
Oct 04 08:45:21 clrw02 kernel: device eth0 entered promiscuous mode
Oct 04 08:45:21 clrw02 kernel: device eth0 entered promiscuous mode
Oct 04 08:45:21 clrw02 kernel: device eth0 entered promiscuous mode
Oct 04 08:45:20 clrw02 kernel: device eth0 entered promiscuous mode
Oct 04 08:45:20 clrw02 kernel: docker0: port 998(veth96c7f67) entered forwarding state
Oct 04 08:45:20 clrw02 kernel: docker0: port 998(veth96c7f67) entered blocking state

I tried to do a kata log grab/parse, but it seems my golang environment is so stuffed by this that I can't go get and build it... it probably could have done with a clean set of logs anyway, so I'll reserve that for the next clean reboot/run/test.

I don't think ipcs shows anything relevant, but here it is:

gwhaley@clrw02:~/go/src/github.com/kata-containers/tests/metrics/density/crash1/components$ ipcs

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root       644        80         2
0x00000000 32769      root       644        16384      2
0x00000000 65538      root       644        280        2

------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x000000a7 0          root       600        1

and some system resource things:

gwhaley@clrw02:~/go/src/github.com/kata-containers/tests/metrics/density/crash1/components$ free -h
              total        used        free      shared  buff/cache   available
Mem:           377G        181G        180G        1.6G         15G        181G
Swap:           29G          0B         29G
gwhaley@clrw02:~/go/src/github.com/kata-containers/tests/metrics/density/crash1/components$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            189G     0  189G   0% /dev
tmpfs            38G  1.7G   37G   5% /run
/dev/sda2        19G   15G  3.1G  83% /
tmpfs           189G     0  189G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           189G     0  189G   0% /sys/fs/cgroup
/dev/sda4       3.1T   24G  3.0T   1% /srv
cgmfs           100K     0  100K   0% /run/cgmanager/fs
tmpfs            38G     0   38G   0% /run/user/2000
gwhaley@clrw02:~/go/src/github.com/kata-containers/tests/metrics/density/crash1/components$ ps -e | wc
  13500   54000  492503

Let me check something with our OPS folks - when I first ran this test their nagios got somewhat upset as I'd launched 13k processes :-) They bumped the limit, but that 13500 process count looks like a horribly round number, and close to the limit I think they set. I'm hoping they don't have some auto-killer set up.... hmm... even if they do, we should probably handle it better ;-)
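For the record, the relevant ceilings are easy to eyeball - a quick sketch of the standard knobs to check (nothing kata-specific here):

# current process and thread counts
ps -e --no-headers | wc -l
ps -eL --no-headers | wc -l

# system-wide ceilings
cat /proc/sys/kernel/pid_max
cat /proc/sys/kernel/threads-max

# per-user process limit for the current shell
ulimit -u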

@jodh-intel
Contributor

That journal snippet shows that almost straight into the test a bunch of pids are dying with SIGABRT. OOI, can you run go version or does that crash too [1]?


[1] - I also wonder if we should be capturing golang details in the collect script.

@grahamwhaley
Contributor Author

go version 1.10

@jodh-intel
Contributor

But you can run that command on that box without it crashing?

@grahamwhaley
Contributor Author

yep, go version worked. I dunno. oddness it is indeed.
Aha, a search turns up:
moby/moby#20096

So, we should probably stare fairly hard at that, and then I'll update my golang and try again?

@grahamwhaley
Contributor Author

OK, that link may not be directly related, but surfing a bit more around the phrase runtime/cgo: pthread_create failed: No space left on device shows it might be a thread-max or process-leak related issue.
Early next week, unless somebody wants any more data off the currently dead system, I'll restart the test with clean system logs, and between each sleep/check I'll also snapshot ps -e | wc and some free information, so we can try to monitor whether we are growing/leaking over time. All the containers should be static, so we should not be. Maybe we have a slow puncture....

@amshinde
Member

amshinde commented Oct 8, 2018

@grahamwhaley Might be worth running this with runc as well to make sure this is a Kata issue. Have you tried that?

@grahamwhaley
Contributor Author

That's not a bad idea @amshinde :-).
I ran again overnight, and captured some crash details - and at that point realised that this machine has kata v1.2.1 installed on it! I'll update to 1.3.0, turn on debug this time, and try again.

For the record, it looked like after about 1h30 the proxy maybe died (apport seems to have picked up something), and that upset the shim etc.
OK, time to try again on 1.3.0 then...

@grahamwhaley
Contributor Author

I may have a new clue. If I do a ps -eL to count the threads rather than the processes, then it looks like we have probably hit the system thread count limit (limit is about 90k on this box, and we had 89k threads....).
I'll modify my test/loop to look at the thread count, and re-run. My best guess is that on 1.2.1 we were gently leaking go threads over time. I'll check what 1.3.0 looks like.

@grahamwhaley
Contributor Author

I re-ran with v1.3.0. I monitored the threads, and I see 'thread growth'. Looking at the typical thread counts of the components, the proxy normally sits around 10 threads, but in this scenario it seems to grow/leak threads - monitoring one proxy, I saw it grow to 41 and then 48 threads before I stopped the test.

For reference then, I did a count up of threads associated with each container/component and came up with roughly (for an idle busybox container):

shim: 10
proxy: 10
qemu: 3
vhost: 1
kvm-pit: 1
docker-containerd: 10

For a total of ~35 threads per kata instance. Thus, I should have had ~35k threads for my 1000 containers, but my system was showing more like 70k (the system caps at 90k threads, which is when I believe it starts killing things off, or rather, things fail to launch a new thread and abort).

With 1000 containers the system was gaining about 50 threads/minute.

My belief is that the proxy leaks threads under this test setup. I suspect things respond slowly, and maybe it is hitting some timeouts and spawning new threads without cleaning up old ones. I'll enable proxy debug, run the test again, and see if that gives us any more info.

@grahamwhaley
Contributor Author

Enabled proxy debug, ran 1k containers, and then picked on a proxy PID to examine (as the thread count rose...) and did a bit of diagnostic work:

$ ps -e | fgrep kata-proxy | tail -1
88954 ?        00:00:04 kata-proxy

# magically dump the stacks to the journal
$ kill -SIGUSR1 88954
$ sudo journalctl _PID=88954 --no-pager > 88954_proxy.log
$ grep goroutine 88954_proxy.log

Oct 09 11:05:46 clrw02 kata-proxy[88954]: time="2018-10-09T11:05:46.549687853-07:00" level=error msg="goroutine 7 [running]:" name=kata-proxy pid=88954 sandbox=dd1ab4303ec71e49c04e88eeab282cf9356be2eb3a0275098527ebb6698b70a1 source=proxy
Oct 09 11:05:46 clrw02 kata-proxy[88954]: time="2018-10-09T11:05:46.550166481-07:00" level=error msg="goroutine 1 [chan receive, 80 minutes]:" name=kata-proxy pid=88954 sandbox=dd1ab4303ec71e49c04e88eeab282cf9356be2eb3a0275098527ebb6698b70a1 source=proxy
Oct 09 11:05:46 clrw02 kata-proxy[88954]: time="2018-10-09T11:05:46.550410725-07:00" level=error msg="goroutine 5 [syscall]:" name=kata-proxy pid=88954 sandbox=dd1ab4303ec71e49c04e88eeab282cf9356be2eb3a0275098527ebb6698b70a1 source=proxy
Oct 09 11:05:46 clrw02 kata-proxy[88954]: time="2018-10-09T11:05:46.550659135-07:00" level=error msg="goroutine 6 [select, 80 minutes, locked to thread]:" name=kata-proxy pid=88954 sandbox=dd1ab4303ec71e49c04e88eeab282cf9356be2eb3a0275098527ebb6698b70a1 source=proxy
Oct 09 11:05:46 clrw02 kata-proxy[88954]: time="2018-10-09T11:05:46.550938473-07:00" level=error msg="goroutine 8 [IO wait, 80 minutes]:" name=kata-proxy pid=88954 sandbox=dd1ab4303ec71e49c04e88eeab282cf9356be2eb3a0275098527ebb6698b70a1 source=proxy
Oct 09 11:05:46 clrw02 kata-proxy[88954]: time="2018-10-09T11:05:46.551525196-07:00" level=error msg="goroutine 9 [IO wait]:" name=kata-proxy pid=88954 sandbox=dd1ab4303ec71e49c04e88eeab282cf9356be2eb3a0275098527ebb6698b70a1 source=proxy
Oct 09 11:05:46 clrw02 kata-proxy[88954]: time="2018-10-09T11:05:46.552373309-07:00" level=error msg="goroutine 10 [select]:" name=kata-proxy pid=88954 sandbox=dd1ab4303ec71e49c04e88eeab282cf9356be2eb3a0275098527ebb6698b70a1 source=proxy
Oct 09 11:05:46 clrw02 kata-proxy[88954]: time="2018-10-09T11:05:46.552553201-07:00" level=error msg="goroutine 11 [sleep]:" name=kata-proxy pid=88954 sandbox=dd1ab4303ec71e49c04e88eeab282cf9356be2eb3a0275098527ebb6698b70a1 source=proxy
Oct 09 11:05:46 clrw02 kata-proxy[88954]: time="2018-10-09T11:05:46.552787405-07:00" level=error msg="goroutine 12 [IO wait, 80 minutes]:" name=kata-proxy pid=88954 sandbox=dd1ab4303ec71e49c04e88eeab282cf9356be2eb3a0275098527ebb6698b70a1 source=proxy
Oct 09 11:05:46 clrw02 kata-proxy[88954]: time="2018-10-09T11:05:46.55339924-07:00" level=error msg="goroutine 13 [chan receive, 80 minutes]:" name=kata-proxy pid=88954 sandbox=dd1ab4303ec71e49c04e88eeab282cf9356be2eb3a0275098527ebb6698b70a1 source=proxy
Oct 09 11:05:46 clrw02 kata-proxy[88954]: time="2018-10-09T11:05:46.553582598-07:00" level=error msg="goroutine 20 [select, 80 minutes]:" name=kata-proxy pid=88954 sandbox=dd1ab4303ec71e49c04e88eeab282cf9356be2eb3a0275098527ebb6698b70a1 source=proxy
Oct 09 11:05:46 clrw02 kata-proxy[88954]: time="2018-10-09T11:05:46.553966851-07:00" level=error msg="goroutine 21 [IO wait, 80 minutes]:" name=kata-proxy pid=88954 sandbox=dd1ab4303ec71e49c04e88eeab282cf9356be2eb3a0275098527ebb6698b70a1 source=proxy

$ grep goroutine 88954_proxy.log  | wc
     12     191    2968

$ ps -L 88954

  PID   LWP TTY      STAT   TIME COMMAND
88954 88954 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 88957 ?        Sl     0:01 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 88958 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 88959 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 88960 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 88961 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 88963 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 88965 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 88966 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 89178 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 89179 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954  5901 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954  5904 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954  5905 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954  5907 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954  7411 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954  7412 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954  9708 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954  9709 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 11829 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 11830 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 14921 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 14923 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 18112 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 18113 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 21643 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 21644 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 25712 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 25714 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 29350 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 29351 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 33153 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 33154 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 36749 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 36755 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 40632 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 40633 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 44061 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 44062 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 53254 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 57804 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 64891 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 69407 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 74352 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 75414 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 77034 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 55391 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954  2636 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954  3692 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 26355 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 59354 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 74836 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 79670 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 81267 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 82842 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 86503 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 88631 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 89814 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954  2459 ?        Sl     0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/

$ ps --no-headers -L 88954 | wc
     59     944   27966

$ grep error 88954_proxy.log | head -1
Oct 09 09:05:46 clrw02 kata-proxy[88954]: time="2018-10-09T09:05:46.644453689-07:00" level=debug msg="Copy stream error" error="write unix /run/vc/sbs/dd1ab4303ec71e49c04e88eeab282cf9356be2eb3a0275098527ebb6698b70a1/proxy.sock->@: use of closed network connection" name=kata-proxy pid=88954 sandbox=dd1ab4303ec71e49c04e88eeab282cf9356be2eb3a0275098527ebb6698b70a1 source=proxy

Two things then...

  • I don't understand why the stackdump goroutine count (12) is so different from the ps pthread count (59). My best guess is that we have dead golang threads that are not being reaped for some reason.
  • My suspicion is that the Copy stream error, coming from this proxy code, is maybe closing one of the channels but leaving the other open and spawning another one or pair - but I only see one Copy error in the log, and my Go foo is weak.... So, some input from maybe @bergwolf @amshinde @sboeuf and @jodh-intel here most welcome :-)

If there are other things you think I can look at, just let me know - I'll leave this test running overnight.
I've attached the log file with the stack dumps in (but, tbh, I think those are only stacks of 'good' threads - I think we want to know where all the other threads have come from and gone to).

88954_proxy.log

@jodh-intel
Contributor

@grahamwhaley - I've done some testing but cannot make the proxy misbehave yet. That might be because I'm not scaling up to the levels you are but fwics the connections are always closed correctly. I'll keep trying...

On the topic of goroutines - there isn't a 1:1 relationship between O/S threads and goroutines so I think we'd expect to see goroutines > threads. However, the fact that you see goroutines < threads seems odd.

Any thoughts @bergwolf, @sboeuf?

@grahamwhaley
Contributor Author

More info then...
Reading around, it seems (yeah, maybe the rest of you know this already ;-) ) that the count of system threads is not directly related to the goroutine count. There are golang rules for when a new system thread will be launched (basically, when a goroutine needs to run but all the other threads are blocked, say in syscalls).
There is an interesting open issue around this on golang: golang/go#14592, wrt what to do with subsequent idle system threads... at present, it seems, nothing - they are not culled afaict.

This then also hints at the tempting sounding SetMaxThreads, but from reading
https://golang.org/pkg/runtime/debug/#SetMaxThreads
it would seem that if you set a limit, and then bump into it, your process will get killed. Maybe not what we want, yet.
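For reference, wiring it in would only take a couple of lines - a minimal sketch (the 64 is an arbitrary illustration, not a tuned value):

package main

import (
        "fmt"
        "runtime/debug"
)

func main() {
        // Cap the number of OS threads this process may use. Note that if
        // the process later tries to exceed the cap, the Go runtime does
        // not block or return an error - it kills the whole process, which
        // is exactly the 'maybe not what we want' part.
        prev := debug.SetMaxThreads(64)
        fmt.Printf("previous thread limit: %d\n", prev)
}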

I ran up dlv and attached it to an errant proxy. dlv is nice - it can show you both the threads and the goroutines. It shows >70 threads, and 117 goroutines.

Here is a thread snippet:

Thread 40633 at 0x459729 /usr/local/go/src/runtime/sys_linux_amd64.s:664 runtime.epollwait
Thread 44061 at 0x459583 /usr/local/go/src/runtime/sys_linux_amd64.s:527 runtime.futex

Only one thread is at epollwait - all others are in the runtime.futex. Their stacks look like:

(dlv) thread 89814
Switched from 2459 to 89814
(dlv) stack
0  0x0000000000459583 in runtime.futex
   at /usr/local/go/src/runtime/sys_linux_amd64.s:527
1  0x0000000000428ffb in runtime.futexsleep
   at /usr/local/go/src/runtime/os_linux.go:45
2  0x000000000040ffdb in runtime.notesleep
   at /usr/local/go/src/runtime/lock_futex.go:151
3  0x0000000000431205 in runtime.stopm
   at /usr/local/go/src/runtime/proc.go:1947
4  0x00000000004323cc in runtime.findrunnable
   at /usr/local/go/src/runtime/proc.go:2410
5  0x0000000000432edb in runtime.schedule
   at /usr/local/go/src/runtime/proc.go:2536
6  0x0000000000433206 in runtime.park_m
   at /usr/local/go/src/runtime/proc.go:2599
7  0x00000000004554ab in runtime.mcall
   at /usr/local/go/src/runtime/asm_amd64.s:351

(dlv) thread 40633
Switched from 89814 to 40633
(dlv) stack
0  0x0000000000459729 in runtime.epollwait
   at /usr/local/go/src/runtime/sys_linux_amd64.s:664
1  0x0000000000428eac in runtime.netpoll
   at /usr/local/go/src/runtime/netpoll_epoll.go:71
2  0x000000000043251b in runtime.findrunnable
   at /usr/local/go/src/runtime/proc.go:2392
3  0x0000000000432edb in runtime.schedule
   at /usr/local/go/src/runtime/proc.go:2536
4  0x0000000000433206 in runtime.park_m
   at /usr/local/go/src/runtime/proc.go:2599
5  0x00000000004554ab in runtime.mcall
   at /usr/local/go/src/runtime/asm_amd64.s:351

The goroutines look like:

[117 goroutines]
  Goroutine 1 - User: /home/gwhaley/go/src/github.com/kata-containers/proxy/proxy.go:313 main.handleExitSignal (0x51ae6b)
  Goroutine 2 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x42d2aa)
  Goroutine 3 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x42d2aa)
  Goroutine 4 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x42d2aa)
  Goroutine 5 - User: /usr/local/go/src/runtime/sigqueue.go:139 os/signal.signal_recv (0x440df6)
  Goroutine 6 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x42d2aa)
  Goroutine 7 - User: /home/gwhaley/go/src/github.com/kata-containers/proxy/proxy.go:279 main.setupNotifier.func1 (0x51cfe8)
  Goroutine 8 - User: /usr/local/go/src/runtime/netpoll.go:173 internal/poll.runtime_pollWait (0x427c47)
  Goroutine 9 - User: /usr/local/go/src/runtime/netpoll.go:173 internal/poll.runtime_pollWait (0x427c47)
  Goroutine 10 - User: /home/gwhaley/go/src/github.com/kata-containers/proxy/vendor/github.com/hashicorp/yamux/session.go:368 github.com/kata-containers/proxy/vendor/github.com/hashicorp/yamux.(*Session).send (0x4fa270)
  Goroutine 11 - User: /usr/local/go/src/runtime/time.go:102 time.Sleep (0x4486a6)
  Goroutine 12 - User: /usr/local/go/src/runtime/netpoll.go:173 internal/poll.runtime_pollWait (0x427c47)
  Goroutine 13 - User: /home/gwhaley/go/src/github.com/kata-containers/proxy/proxy.go:436 main.realMain.func3 (0x51d431)
  Goroutine 14 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x42d2aa)
  Goroutine 22 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x42d2aa)
  Goroutine 23 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x42d2aa)
  Goroutine 24 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x42d2aa)
  Goroutine 25 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x42d2aa)
  Goroutine 34 - User: /usr/local/go/src/runtime/lock_futex.go:227 runtime.notetsleepg (0x4102c2)
  Goroutine 50 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x42d2aa)
... many more in gopark...

The gopark stacks look like:

(dlv) goroutine 306
Switched from 0 to 306 (thread 11829)
(dlv) stack -full
0  0x000000000042d2aa in runtime.gopark
   at /usr/local/go/src/runtime/proc.go:292
       lock = unsafe.Pointer(0x64f1e0)
       reason = "timer goroutine (idle)"
       traceEv = 20
       traceskip = 1
       unlockf = runtime.parkunlock_c

1  0x000000000042d35e in runtime.goparkunlock
   at /usr/local/go/src/runtime/proc.go:297
       lock = (*runtime.mutex)(0x64f1e0)
       reason = "timer goroutine (idle)"
       traceEv = 20
       traceskip = 1

2  0x0000000000448f3d in runtime.timerproc
   at /usr/local/go/src/runtime/time.go:253
       tb = (*runtime.timersBucket)(0x64f1e0)
       arg.data = (*uint8)(0xc4201c0180)
       arg.type = (*uint8)(0x533420)
       delta = 999998420
       f = runtime.goroutineReady
       now = 1153558407007228
       seq = 0
       t = (*runtime.timer)(0xc4202101c0)

3  0x0000000000457fe1 in runtime.goexit
   at /usr/local/go/src/runtime/asm_amd64.s:2361

I need to do more digging into the code to see why the goroutines might end up at that point. Still open for ideas... ;-)

@sboeuf

sboeuf commented Oct 11, 2018

@jodh-intel no sorry, I don't have anything that comes to mind here. I'll try to look into the proxy code later.

@jodh-intel
Contributor

Looking at those goroutines, nothing jumps out - we can see:

  • the goroutine waiting for SIGTERM to tell the proxy to die (main.handleExitSignal)
  • the goroutine that handles general signals, waiting patiently for signals to arrive (main.setupNotifier.func1)
  • a bunch of goroutines sitting waiting for socket traffic.

And in fact when I connect to an idle proxy I see the same set of goroutines.

@bergwolf
Member

man 2 clone:

       ENOSPC (since Linux 4.9)
              One  of  the  values  in  flags  specified  the creation of a new user namespace, but doing so would have caused the limit defined by the corresponding file in
              /proc/sys/user to be exceeded.  For further details, see namespaces(7).
[macbeth@user]$ls /proc/sys/user/*
/proc/sys/user/max_cgroup_namespaces  /proc/sys/user/max_ipc_namespaces  /proc/sys/user/max_pid_namespaces
/proc/sys/user/max_inotify_instances  /proc/sys/user/max_mnt_namespaces  /proc/sys/user/max_user_namespaces
/proc/sys/user/max_inotify_watches    /proc/sys/user/max_net_namespaces  /proc/sys/user/max_uts_namespaces

It looks like you have reached one of these user namespaces limits.

@grahamwhaley
Contributor Author

:-) You threw me for a minute there @bergwolf - but, you are probably right in a way - I think we have hit the pid_max limit - which probably also applies to the pid namespace.
On this machine (88 cores), the pid_max is set to cores*1024 - so 90112. What I witness is that as our golang threads grow in the proxys, and if we monitor the system using ps -eL --no-headers | wc, when we start to approach ~90k pids, things start to get killed (or fail). I'm pretty sure it is the proxy trying to spawn yet another thread, and failing, and then dying.

@jodh-intel and I are staring at the for loop in the proxy at https://github.com/kata-containers/proxy/blob/master/proxy.go#L123-L138

	go func() {
		var err error

		wg := &sync.WaitGroup{}
		defer func() {
			// close session before waiting to make sure all streams are closed,
			// so that the copy goroutines can quit.
			session.Close()
			wg.Wait()
			l.Close()
			results <- err
		}()

		for {
			var conn, stream net.Conn
			conn, err = l.Accept()
			if err != nil {
				return
			}

			stream, err = session.Open()
			if err != nil {
				return
			}

			// add 2 for the two copy goroutines
			wg.Add(2)
			go proxyConn(conn, stream, wg)
		}
	}()

I put a log message just above the Accept(), and I see that coming out multiple times for a single proxy instance. Then:

  • I'm not sure what causes multiple connect/Accepts
  • We suspect the cleanup in this loop and the sub-call to proxyConn may be leaking goroutines and thus threads. @jodh-intel noted that there is a probable bug that if the session.Open() fails, then the conn is not closed - but that is probably not the error we are seeing (I will put some logs in those error returns though...). I'll note that if we get multiple Accepts(), then I think we call multiple session.Open()s, but I think only one Session.Close() on the defer. I'm not sure if that is right.

Eyes on that code welcome. And any input on why we get multiple Accepts() (and if that is expected or not) also welcome. I can add more log/debug if we have ideas to test or prove.
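To make the session.Open() point concrete, here is a rough sketch of that accept loop with the conn closed on the error path (illustrative only, not a tested patch - the imports, yamux session and proxyConn are assumed from the surrounding proxy code):

// serveSketch mirrors the shape of the accept loop above; the only change
// is the conn.Close() when session.Open() fails, so a failed stream open
// does not leak the accepted connection.
func serveSketch(l net.Listener, session *yamux.Session, wg *sync.WaitGroup) error {
        for {
                conn, err := l.Accept()
                if err != nil {
                        return err
                }

                stream, err := session.Open()
                if err != nil {
                        conn.Close() // don't leak the accepted connection
                        return err
                }

                // add 2 for the two copy goroutines
                wg.Add(2)
                go proxyConn(conn, stream, wg)
        }
}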

@bergwolf
Member

@grahamwhaley what host kernel version are you using?

@jodh-intel
Contributor

@grahamwhaley - To help answer @bergwolf's question, please can you paste in the output of kata-runtime kata-env which contains this info.

@grahamwhaley
Contributor Author

@bergwolf @jodh-intel - the output of a kata-collect is attached to the first entry. Here is the host data extracted from that file:

[Host]
  Kernel = "4.15.0-34-generic"
  Architecture = "amd64"
  VMContainerCapable = true
  SupportVSocks = false
  [Host.Distro]
    Name = "Ubuntu"
    Version = "16.04"
  [Host.CPU]
    Vendor = "GenuineIntel"
    Model = "Intel(R) Xeon(R) CPU E5-2699A v4 @ 2.40GHz"

Let me do a post in a minute giving the core details of the issue (the proxy grows idle timer threads, which basically runs the machine out of PID space).

@grahamwhaley
Contributor Author

OK, let me see if I can explain what I think I know/have found. I'll explain it in a sort of reverse-timeline from the point where containers die....

1) Containers die

The root issue is that the system runs out of PID space. That is, there is a limited number of PIDs each system supports. On this system, if we cat /proc/sys/kernel/pid_max on my 88-core machine, we get 90112. That number is determined in the kernel from this code, where it sets pid_max to 1024*ncpu.
The reason the containers get killed is that they fail to spawn a new go thread on a new processor as there is no PID available, which is a fatal error.
We can confirm this hypothesis somewhat by looking at the number of threads present on an 'errant system':

$ ps --no-headers -eL | wc -l
88424

That is a system where I had already killed a few proxys off so I could actually try to debug it.

2) The kata-proxy is eating all the threads

Poking around the system, it is pretty obvious that the proxy is eating most of the threads:

root@clrw02:/proc/sys/kernel# pgrep proxy | tail -1
89992
root@clrw02:/proc/sys/kernel# ps --no-headers -L 89992 | wc -l
60

I have a little script that gets the thread count and avg. for a few components:

gwhaley@clrw02:~/go/src/github.com/kata-containers/tests/metrics/density$ ./stats.sh
984 proxys, 59176 threads - avg 60.13
984 shims, 10413 threads - avg 10.58
999 qemus,  threads - avg 3.00
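The stats script isn't attached anywhere, so for reference this is roughly what it does (a reconstruction, not the actual script):

#!/bin/bash
# Count processes and total threads per kata component, and report the
# average thread count per process.
for comp in kata-proxy kata-shim qemu; do
        pids=$(pgrep -f "${comp}")
        nprocs=$(echo "${pids}" | grep -c .)
        [ "${nprocs}" -eq 0 ] && continue
        nthreads=0
        for p in ${pids}; do
                t=$(ps --no-headers -L -p "${p}" | wc -l)
                nthreads=$((nthreads + t))
        done
        avg=$(echo "scale=2; ${nthreads}/${nprocs}" | bc)
        echo "${nprocs} ${comp}s, ${nthreads} threads - avg ${avg}"
done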

3) The proxy has many idle gothreads that are processor specific

If we run up dlv to have a look at that proxy, we find that it has 114 goroutines. If we grab their goroutine numbers and then do some paste trickery to extract their stacks, and then some grep trickery to work out where they are sat, we find:

what              No.
gcBgMarkWorker    88
timerproc         11
!goexit@2         15

The gcBgMarkWorker is the golang garbage collector. It has spawned one goroutine per processor (we'll get to that code in a minute).

The timerproc is reasonably obvious - that is the per-processor goroutine that processes the timer queues. We don't have one of those for every processor though.

The other 15 entries are the 'normal' goroutines doing the work for the proxy, and look like:

$ grep "2  " stacks.txt | grep -v goexit | grep -v timerproc grep -v "12  "
2  0x0000000000405d32 in runtime.chanrecv
2  0x000000000042d0ec in runtime.forcegchelper
2  0x000000000041f27c in runtime.bgsweep
2  0x00000000004164fd in runtime.runfinq
2  0x00000000004f5bd2 in os/signal.loop
2  0x0000000000454be4 in runtime.ensureSigM.func1
2  0x0000000000405d32 in runtime.chanrecv
 2  0x0000000000427c47 in internal/poll.runtime_pollWait
 2  0x0000000000427c47 in internal/poll.runtime_pollWait
2  0x00000000004fa190 in github.com/kata-containers/proxy/vendor/github.com/hashicorp/yamux.(*Session).send
2  0x00000000004486a6 in time.Sleep
 2  0x0000000000427c47 in internal/poll.runtime_pollWait
2  0x0000000000405d32 in runtime.chanrecv
2  0x00000000004fbfa1 in github.com/kata-containers/proxy/vendor/github.com/hashicorp/yamux.(*Stream).Read
 2  0x0000000000427c47 in internal/poll.runtime_pollWait

So, the garbage collector looks to be the main culprit for spawning threads. Let's have a very quick peek at that code. Here is a stack:

(dlv) stack
0  0x000000000042d2aa in runtime.gopark
   at /usr/local/go/src/runtime/proc.go:292
1  0x0000000000419e90 in runtime.gcBgMarkWorker
   at /usr/local/go/src/runtime/mgc.go:1775
2  0x0000000000457fe1 in runtime.goexit
   at /usr/local/go/src/runtime/asm_amd64.s:2361

Which leads us to the golang gc code at https://github.com/golang/go/blob/release-branch.go1.10/src/runtime/mgc.go#L1714-L1728

but... why do we end up with so many?

4) Because the proxy is migrating around the processors?

The only reasons I could think of for ending up with so many per-cpu goroutines were either that we were hitting some lock/race/stuck condition that forced the scheduler to spawn more threads on other processors, or that the proxy was migrating across processors and the golang gc was kicking off a new thread/goroutine on the next gc iteration on the new processor.

I'm not yet convinced, but I thought I'd check the proxy for migrations. Remember, this is a wholly 'static' set of containers - 999 containers (one failed to launch ;-) ) all running busybox tailing /dev/null - so, no real activity:

gwhaley@clrw02:~/go/src/github.com/kata-containers/tests/metrics/density/crash5$ sudo perf stat -p 89992 sleep 600

 Performance counter stats for process id '89992':

         26.596074      task-clock (msec)         #    0.000 CPUs utilized
               629      context-switches          #    0.024 M/sec
                23      cpu-migrations            #    0.865 K/sec
                 0      page-faults               #    0.000 K/sec
        29,650,314      cycles                    #    1.115 GHz
        26,635,571      instructions              #    0.90  insn per cycle
         6,137,454      branches                  #  230.765 M/sec
           175,515      branch-misses             #    2.86% of all branches

     600.012957407 seconds time elapsed

Well, that is pretty interesting. In the space of 600 seconds we got 629 context switches (sounds like a 1s timer...), and 23 cpu migrations??

5) Hypothesis then

My basic hypothesis is that, when we have more processes than CPUs, the proxy gets migrated around, and that generates the gc goroutines which occupy threads. This does not quite hold water though, as then we'd expect to have 88 threads (one for each CPU) - and why does this not also happen for the shim? Well, OK, maybe the shim really is asleep all the time, and thus not scheduled, and thus not migrated (should be easy to prove with perf). But the proxy has the yamux heartbeat timer, so it is constantly waking up and doing a little bit of work. If you have 1000 of them, all waking up at 1s intervals, it is not inconceivable that at some point many of them wake up on the same processor while other processors are idle, and the kernel scheduler decides somebody needs migrating. And maybe that is just a little storm that continues over time.

6) What does that mean for current kata scaling limits

Right now, it would seem that kata scaling is somewhat tied to the number of CPUs on the system, due to the proliferation of threads spawned in the proxy.

Given the processes/threads we have accounted for per kata container:

what          how many procs/threads
docker        1
qemu          3
kvm-pit       1
kvm-vhost     1
shim          10
proxy         15 + ncpu

and we know that the kernel, by default, allows 1024 PIDs per cpu, we can probably formulate a guess at the max number of containers as limited by the nCPU right now:

containers <= (ncpu * 1024) / (31 + ncpu)
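As a worked example (assuming every proxy eventually grows to its full 15 + ncpu threads), this 88-core box gives (88 * 1024) / (31 + 88) = 90112 / 119, i.e. roughly 757 containers - below the 1000 this test launches, which would be consistent with the test surviving only until enough of the proxies have grown their thread counts.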

7) Next steps

We still need to prove exactly what is happening and why. The 60-threads-vs-88-cores discrepancy does not quite make sense.
There are a number of golang trace facilities and settings we can try in order to pin this down. GOMAXPROCS will be an interesting one to start with (although I'm not sure if it will constrain the gc goroutines, as they are system goroutines, not user ones). There is Go tracing for the scheduler as well. The hard bit will be getting those set up on the proxy, maybe.
I'm discussing the golang scheduler with @markdryan, to check whether my thoughts so far make sense. @markdryan suspects we may have goroutines hung up on syscalls somewhere - I have provided him with a full set of stack traces for the 114 goroutines in case something leaps out.
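For the scheduler side, the usual route would be something like the below, though the trace goes to stderr, which is the awkward bit for a proxy launched by the runtime (illustrative command line only):

sudo GODEBUG=schedtrace=1000,scheddetail=1 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/...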

The search continues. Things I might try:

  • see if I can get GOMAXPROCS into the docker env and thus passed on to the proxy (or see if I can wire it into the proxy startup, which I think might be an option), to see if that reduces the max threads and goroutines the proxy spawns.
  • Possibly run fewer containers (say n < ncpus), and see if that affects the cpu migration and thread/goroutine count. Maybe we will find it only appears once we reach ncpus...

@amshinde
Member

amshinde commented Oct 15, 2018

@grahamwhaley I would be interested in seeing the stack traces of the proxy goroutines.

I took an initial look at the proxy code; I see that we manually send out the heartbeat every second now. We have also set the session ConnectionWriteTimeout to 1 sec:
https://github.com/kata-containers/proxy/blob/master/proxy.go#L94
Can you see if increasing this timeout to, say, 5 sec or so helps?

@grahamwhaley
Contributor Author

Hi @amshinde . Sure, np. Attaching a set of stacks grabbed from dlv as stacks.txt.

stacks.txt

Heh, yeah, I'd spotted those 1s timeouts on Friday. I tweaked them (to 30s for heartbeat and 10s for write timeout) for my 'possible fix over the weekend' - but, it didn't work :-(
Having said that, I ran a test overnight with:

@@ -352,6 +358,9 @@ func realMain() {
        var channel, proxyAddr, agentLogsSocket, logLevel string
        var showVersion bool

+       // Reduce the number of threads we make
+       runtime.GOMAXPROCS(10)
+

applied to the proxy, and this morning my stats look like:

$ ./stats.sh
999 proxys, 11571 threads - avg 11.58
999 shims, 10583 threads - avg 10.59
999 qemus,  threads - avg 3.00

$ ps -eL | wc -l
41221

The heavy implication being that GOMAXPROCS can alleviate the spawning of so many idle goroutines/threads.

This will need more investigation and thought, I think - but it is a good clue. Maybe this is a core golang gc/scheduler 'feature' that the proxy is provoking.

@grahamwhaley
Contributor Author

For ref, whilst I have it on my screen, I attached to one of the proxies with dlv and had a snoop:

 ./dlv attach 89837
Type 'help' for list of commands.
(dlv) threads
* Thread 840 at 0x459583 /usr/local/go/src/runtime/sys_linux_amd64.s:527 runtime.futex
  Thread 850 at 0x459583 /usr/local/go/src/runtime/sys_linux_amd64.s:527 runtime.futex
  Thread 89837 at 0x459583 /usr/local/go/src/runtime/sys_linux_amd64.s:527 runtime.futex
  Thread 89838 at 0x459583 /usr/local/go/src/runtime/sys_linux_amd64.s:527 runtime.futex
  Thread 89839 at 0x459583 /usr/local/go/src/runtime/sys_linux_amd64.s:527 runtime.futex
  Thread 89840 at 0x459729 /usr/local/go/src/runtime/sys_linux_amd64.s:664 runtime.epollwait
  Thread 89841 at 0x459583 /usr/local/go/src/runtime/sys_linux_amd64.s:527 runtime.futex
  Thread 89842 at 0x459583 /usr/local/go/src/runtime/sys_linux_amd64.s:527 runtime.futex
  Thread 89843 at 0x459583 /usr/local/go/src/runtime/sys_linux_amd64.s:527 runtime.futex
  Thread 89844 at 0x459583 /usr/local/go/src/runtime/sys_linux_amd64.s:527 runtime.futex
  Thread 89845 at 0x459583 /usr/local/go/src/runtime/sys_linux_amd64.s:527 runtime.futex
  Thread 89846 at 0x459583 /usr/local/go/src/runtime/sys_linux_amd64.s:527 runtime.futex
(dlv) goroutines
[19 goroutines]
  Goroutine 1 - User: /home/gwhaley/go/src/github.com/kata-containers/proxy/proxy.go:319 main.handleExitSignal (0x51afbb)
  Goroutine 2 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x42d2aa)
  Goroutine 3 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x42d2aa)
  Goroutine 4 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x42d2aa)
  Goroutine 5 - User: /usr/local/go/src/runtime/sigqueue.go:139 os/signal.signal_recv (0x440df6)
  Goroutine 6 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x42d2aa)
  Goroutine 7 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x42d2aa)
  Goroutine 18 - User: /home/gwhaley/go/src/github.com/kata-containers/proxy/proxy.go:285 main.setupNotifier.func1 (0x51d358)
  Goroutine 19 - User: /usr/local/go/src/runtime/netpoll.go:173 internal/poll.runtime_pollWait (0x427c47)
  Goroutine 20 - User: /usr/local/go/src/runtime/netpoll.go:173 internal/poll.runtime_pollWait (0x427c47)
  Goroutine 21 - User: /home/gwhaley/go/src/github.com/kata-containers/proxy/vendor/github.com/hashicorp/yamux/session.go:399 github.com/kata-containers/proxy/vendor/github.com/hashicorp/yamux.(*Session).send (0x4fa190)
  Goroutine 22 - User: /usr/local/go/src/runtime/time.go:102 time.Sleep (0x4486a6)
  Goroutine 23 - User: /usr/local/go/src/runtime/netpoll.go:173 internal/poll.runtime_pollWait (0x427c47)
  Goroutine 24 - User: /home/gwhaley/go/src/github.com/kata-containers/proxy/proxy.go:445 main.realMain.func3 (0x51d7a1)
  Goroutine 25 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x42d2aa)
  Goroutine 30 - User: /home/gwhaley/go/src/github.com/kata-containers/proxy/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/kata-containers/proxy/vendor/github.com/hashicorp/yamux.(*Stream).Read (0x4fbfa1)
  Goroutine 31 - User: /usr/local/go/src/runtime/netpoll.go:173 internal/poll.runtime_pollWait (0x427c47)
  Goroutine 34 - User: /usr/local/go/src/runtime/proc.go:292 runtime.gopark (0x42d2aa)
* Goroutine 50 - User: /usr/local/go/src/runtime/lock_futex.go:227 runtime.notetsleepg (0x4102c2)

@bergwolf
Member

Nice findings @grahamwhaley! By default GOMAXPROCS is the number of available cores on the machine. So the kata-proxy processes will each spawn 32 threads on your test machine, and that looks to drain the available pids on the host. For a simple usage like kata-proxy, we should limit the max number of procs on our own - maybe equal to the available cores, and at most 4 procs? The Go runtime does not dynamically scale the limit up/down, so we have to tune it ourselves.

The same applies to kata-shim since it does not set GOMAXPROCS either.
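A minimal sketch of that kind of cap (illustrative only - the 4 here is just the number floated above, not a tuned value):

package main

import (
        "fmt"
        "runtime"
)

func main() {
        // Cap GOMAXPROCS at the core count, but never more than 4, since a
        // simple proxy-like workload gains little from more parallelism.
        procs := runtime.NumCPU()
        if procs > 4 {
                procs = 4
        }
        prev := runtime.GOMAXPROCS(procs)
        fmt.Printf("GOMAXPROCS: %d -> %d\n", prev, procs)
}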

And IMO we should dig a bit more to see why kata-proxy spawns threads up to the limit of max processes. One possible reason is that docker/k8s keeps calling kata-runtime state to query container states, and each call asks kata-proxy to spawn a pair of goroutines to relay the request. The frequency is usually 1 per second, so I'm not sure if it can really blow up the Go runtime's internal P (os threads) count. While this one is expected, we should see whether there are other possible causes and try to optimize a bit if possible.

Another thing to look at is the following piece of code. Maybe we should drop wg.Done() so that a copyStream goroutine can quit even if its paired one is still running.

func proxyConn(conn1 net.Conn, conn2 net.Conn, wg *sync.WaitGroup) {
        once := &sync.Once{}
        cleanup := func() {
                conn1.Close()
                conn2.Close()
        }
        copyStream := func(dst io.Writer, src io.Reader) {
                _, err := io.Copy(dst, src)
                if err != nil {
                        logger().WithError(err).Debug("Copy stream error")
                }

                once.Do(cleanup)
                wg.Done()
        }

        go copyStream(conn1, conn2)
        go copyStream(conn2, conn1)
}
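
For context, the wg here is presumably paired with something like the following at the call site (purely illustrative; clientConn and backendConn are made-up names), so the caller only unblocks once both copy directions have returned:

var wg sync.WaitGroup
wg.Add(2) // one per copyStream goroutine
proxyConn(clientConn, backendConn, &wg)
wg.Wait() // returns only after both io.Copy calls have finished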

@markdryan
Copy link
Contributor

A couple of comments.

At the bottom of

https://github.com/golang/go/wiki/Performance#garbage-collector-trace

we see a comment stating that the maximum number of threads the GC can use is 8. So I think we can rule the GC out.

Next, the fact that GOMAXPROCS changes things is interesting.

"The GOMAXPROCS variable limits the number of operating system threads that can execute user-level Go code simultaneously. There is no limit to the number of threads that can be blocked in system calls on behalf of Go code; those do not count against the GOMAXPROCS limit."

If by setting this to 10 we see a reduction in the number of threads used by the process, this implies that the issue is not with blocking system calls, as I had first thought. If it were, setting GOMAXPROCS wouldn't improve things. The fact that it does also implies that the issue is not with system runtime threads either.

So perhaps what we're seeing is that a burst in activity leads to the creation of lots of goroutines and the allocation of lots of system threads to service them in parallel, and those threads are never released back to the system. Certainly in the small test program I ran quickly on my 8-core machine, I don't ever see any threads getting released back to the system once the goroutines they were allocated to service have quit. Assuming this is the case, setting GOMAXPROCS is the correct fix.
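
Not the actual test program referenced above, but a minimal Linux-only sketch that shows the same effect; the counts and timings are illustrative, and the live thread count is read from /proc/self/status:

package main

import (
	"bufio"
	"fmt"
	"os"
	"runtime"
	"strings"
	"sync"
	"time"
)

// liveThreads returns the kernel's view of how many OS threads this process has.
func liveThreads() string {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		return "?"
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		if strings.HasPrefix(s.Text(), "Threads:") {
			return strings.TrimSpace(strings.TrimPrefix(s.Text(), "Threads:"))
		}
	}
	return "?"
}

func main() {
	fmt.Println("threads at start:", liveThreads())

	var wg sync.WaitGroup
	for i := 0; i < 64; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Pin this goroutine to its own OS thread and park it, forcing
			// the runtime to create extra threads to run the others.
			runtime.LockOSThread()
			time.Sleep(2 * time.Second)
			runtime.UnlockOSThread() // the thread returns to the pool, parked
		}()
	}
	wg.Wait()
	fmt.Println("threads after burst:", liveThreads())

	// Watch whether the count ever drops once the goroutines are gone.
	for i := 0; i < 5; i++ {
		time.Sleep(time.Second)
		fmt.Println("threads while idle:", liveThreads())
	}
}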

@grahamwhaley
Copy link
Contributor Author

Hi @markdryan

we see a comment stating that the maximum number of threads the GC can use is 8. So I think we can rule the GC out.

Well, I see it on the wiki, but I don't see it in the code. What I do see in the code is, I think, a limit of 25% of CPU resources:
https://github.com/golang/go/blob/release-branch.go1.10/src/runtime/mgc.go#L898-L913
I had a dig to see if there was also an 8-thread cap, but could not find one. That is, I think, the maximum number of active GC threads that will run in parallel. But I think the GC actually creates one GC goroutine per processor:
https://github.com/golang/go/blob/release-branch.go1.10/src/runtime/mgc.go#L1714-L1728

That is for the background mark code, I think, but those would then be the idle threads I saw in the stack dumps. I don't know if the rest of the GC is parallelized (but there are some really interesting presentations out there on the Go GC - this one covers the history and design: https://blog.golang.org/ismmkeynote).

Anyhow, unless we invoke the tracers and try to deep-diagnose this, I think we may not understand what is going on with the GC.

As per @bergwolf though - we probably do want to understand. I really don't like closing bugs with a workaround without knowing what it's really doing (that is ... proof....) :-)

I'll see if I can get set up to hand-run one container/proxy after I have the other 999 or so running, and see if we can get any trace. It looks like the trace comes out on stdout though, so that won't help.

@bergwolf wrt that loop and the wg() - let me pull in @amshinde and @jodh-intel on that. @jodh-intel was looking at that code last week and I think saw some potential bugs already on closes etc.
Oh, and thanks for the info on the 1s status poll from docker - that helps explain the context switch rate I was seeing in my perf measurements!

If anybody has something specific they want me to try, just ask. Or, if you want to try to reproduce this on a local machine (which, probably due to the nature of the problem, may be easier to see if the machine has many cores), also just ask.

@markdryan
Copy link
Contributor

@grahamwhaley It looks like you're correct. That comment about the GC thread limit is out of date. It looks like they increased the value to 32 and then got rid of it completely. In any case, it doesn't really make any difference. If I understand correctly, GOMAXPROCS affects the number of threads the runtime can use for the GC as well as for the user goroutines. By not setting the value you're giving each Go process the right to legitimately use at least 88 threads on your machine, and the runtime is doing just that.

@grahamwhaley
Copy link
Contributor Author

@markdryan Basically, yes. And there is no current mechanism for the Go runtime to reap/prune/purge the idle threads, I think. There is a merge in 1.11.1 that looks like the prep work to enable this, and there is an open issue relating to quitting idle OS threads.

For now, setting GOMAXPROCS seems the likely right fix - but we still don't understand why we end up with so many GC goroutines, or whether it is those that are causing the OS thread creation. Nor do we know why this only happens for the proxy and not the shim (although my understanding is that the shim is generally more quiescent, which I suspect is the main difference).

I'll see if I can work out how to do some Go tracing on the scheduler and/or GC whilst the proxy is under docker/runtime control...

In summary, I think:

  • We understand what causes the death
  • We know how to limit that death with GOMAXPROCS
  • I'm happy that is a valid 'fix'.... but
  • I think we don't really understand why this happens specifically for the proxy. I think we suspect this is not a problem with the proxy, but we should expend a little more effort trying to verify that.

@markdryan
Copy link
Contributor

@grahamwhaley My guess would be that the proxy is running more simultaneous goroutines than the other processes, or possibly generating more garbage.

@grahamwhaley
Copy link
Contributor Author

I thought I'd go see if docker-containerd had anything in place already ;-)
from here

	if os.Getenv("GOMAXPROCS") == "" {
		// If GOMAXPROCS hasn't been set, we default to a value of 2 to reduce
		// the number of Go stacks present in the shim.
		runtime.GOMAXPROCS(2)
	}

containerd/containerd@5764bf1

Seems we have some precedent.

@grahamwhaley
Copy link
Contributor Author

Update. Using this horrible hack (which I slapped in a gist for good measure ;-)

In the proxy:

diff --git a/proxy.go b/proxy.go
index 2a51f16..4a054bb 100644
--- a/proxy.go
+++ b/proxy.go
@@ -18,6 +18,8 @@ import (
        "net/url"
        "os"
        "os/signal"
+//     "runtime"
+       "runtime/trace"
        "sync"
        "syscall"
        "time"
@@ -450,5 +452,18 @@ func realMain() {

 func main() {
        defer handlePanic()
+
+       if _, err := os.Stat("/tmp/trace_proxy"); !os.IsNotExist(err) {
+               fname := fmt.Sprintf("/tmp/proxy.%d.log", os.Getpid())
+               logFile, _ := os.OpenFile(fname, os.O_WRONLY | os.O_CREATE | os.O_SYNC, 0755)
+               defer logFile.Close()
+               if false {
+                       trace.Start(logFile)
+                       defer trace.Stop()
+               } else {
+                       syscall.Dup2(int(logFile.Fd()), 1)
+                       syscall.Dup2(int(logFile.Fd()), 2)
+               }
+       }
        realMain()
 }

And in the runtime:

diff --git a/virtcontainers/kata_proxy.go b/virtcontainers/kata_proxy.go
index 1ff9d01..6fbc0b7 100644
--- a/virtcontainers/kata_proxy.go
+++ b/virtcontainers/kata_proxy.go
@@ -6,6 +6,7 @@
 package virtcontainers

 import (
+       "os"
        "os/exec"
        "syscall"
 )
@@ -47,6 +48,13 @@ func (p *kataProxy) start(params proxyParams) (int, string, error) {
        }

        cmd := exec.Command(args[0], args[1:]...)
+
+       if _, err := os.Stat("/tmp/trace_proxy"); !os.IsNotExist(err) {
+               env := os.Environ()
+               env = append(env, "GODEBUG=gctrace=2,schedtrace=60000")
+               cmd.Env = env
+       }
+
        if err := cmd.Start(); err != nil {
                return -1, "", err
        }

Then, running up the 1000-container test (with the GOMAXPROCS fix removed), creating the /tmp/trace_proxy file, and launching container 1001 by hand to be traced... I am seeing this type of data in the trace log for that proxy. Let's see what that looks like in the morning ;-)

SCHED 60004ms: gomaxprocs=88 idleprocs=88 threads=11 spinningthreads=0 idlethreads=4 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
SCHED 120009ms: gomaxprocs=88 idleprocs=88 threads=11 spinningthreads=0 idlethreads=4 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
scvg0: inuse: 2, idle: 0, sys: 2, released: 0, consumed: 2 (MB)
SCHED 180014ms: gomaxprocs=88 idleprocs=88 threads=11 spinningthreads=0 idlethreads=4 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
SCHED 240018ms: gomaxprocs=88 idleprocs=88 threads=11 spinningthreads=0 idlethreads=4 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
scvg1: 0 MB released
scvg1: inuse: 2, idle: 0, sys: 2, released: 0, consumed: 2 (MB)
SCHED 300168ms: gomaxprocs=88 idleprocs=88 threads=11 spinningthreads=0 idlethreads=4 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
SCHED 360172ms: gomaxprocs=88 idleprocs=88 threads=11 spinningthreads=0 idlethreads=4 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

@amshinde
Copy link
Member

@grahamwhaley Nice findings about GOMAXPROCS. Agreed, we need to see why the proxy is generating so many goroutines in the first place. Regarding your comment about timeouts, I think we need to set our heartbeat interval to be less than the write timeout, to make sure we do not time out because of missed heartbeats. wdyt?
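
On the heartbeat point, a rough sketch of the kind of tuning being discussed, assuming the proxy builds its session from yamux.DefaultConfig() - the field names are from hashicorp/yamux, but the values here are illustrative, not a proposal:

cfg := yamux.DefaultConfig()
cfg.EnableKeepAlive = true
cfg.ConnectionWriteTimeout = 10 * time.Second
// Keep heartbeats comfortably inside the write timeout so one missed
// beat does not tear the session down.
cfg.KeepAliveInterval = 5 * time.Second

session, err := yamux.Client(conn, cfg)
if err != nil {
	return err
}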

@grahamwhaley
Copy link
Contributor Author

PRs opened to set GOMAXPROCS for the shim and proxy:
kata-containers/shim#115
kata-containers/proxy#118

I did a fuller run and captured the GODEBUG trace for when a proxy 'goes bad'. As we thought, it happens at the point where the GC kicks in. The proxy starts with 11 threads reported, and then at the point where the first forced GC cycle kicks in ('GC forced'), at about the 52-minute mark from the test start, we see the thread count grow. The odd thing is that it then seems to grow an extra 2 threads per GC cycle, until it settled down at 40 threads in this instance, which is when I got bored waiting ;-)

SCHED 3000605ms: gomaxprocs=88 idleprocs=88 threads=11 spinningthreads=0 idlethreads=4 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
gc 1 @3029.205s 0%: 0.13+0.65+0.70 ms clock, 11+0.16/0.74/1.0+62 ms cpu, 4->4->0 MB, 5 MB goal, 88 P
SCHED 3060607ms: gomaxprocs=88 idleprocs=88 threads=16 spinningthreads=0 idlethreads=8 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
SCHED 3120612ms: gomaxprocs=88 idleprocs=88 threads=16 spinningthreads=0 idlethreads=8 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
GC forced
gc 2 @3149.215s 0%: 0.12+0.61+0.51 ms clock, 10+0/0.88/0.45+45 ms cpu, 1->1->1 MB, 4 MB goal, 88 P
scvg20: 0 MB released
scvg20: inuse: 1, idle: 2, sys: 4, released: 0, consumed: 4 (MB)
SCHED 3180612ms: gomaxprocs=88 idleprocs=88 threads=18 spinningthreads=0 idlethreads=10 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
SCHED 3240616ms: gomaxprocs=88 idleprocs=88 threads=18 spinningthreads=0 idlethreads=10 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
GC forced
gc 3 @3269.228s 0%: 0.040+0.53+0.55 ms clock, 3.5+0/1.1/0.54+49 ms cpu, 1->1->1 MB, 4 MB goal, 88 P
scvg21: inuse: 1, idle: 2, sys: 4, released: 0, consumed: 4 (MB)
SCHED 3300625ms: gomaxprocs=88 idleprocs=88 threads=20 spinningthreads=0 idlethreads=12 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
SCHED 3360632ms: gomaxprocs=88 idleprocs=88 threads=20 spinningthreads=0 idlethreads=12 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
GC forced
gc 4 @3389.236s 0%: 0.013+0.38+0.66 ms clock, 1.2+0/0.79/0.55+58 ms cpu, 1->1->1 MB, 4 MB goal, 88 P
SCHED 3420633ms: gomaxprocs=88 idleprocs=88 threads=22 spinningthreads=0 idlethreads=14 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
scvg22: 2 MB released
scvg22: inuse: 1, idle: 2, sys: 3, released: 2, consumed: 1 (MB)
SCHED 3480641ms: gomaxprocs=88 idleprocs=88 threads=22 spinningthreads=0 idlethreads=14 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
GC forced
gc 5 @3509.243s 0%: 0.033+0.42+0.78 ms clock, 2.9+0/0.66/0.47+69 ms cpu, 1->1->1 MB, 4 MB goal, 88 P
SCHED 3540649ms: gomaxprocs=88 idleprocs=88 threads=24 spinningthreads=0 idlethreads=16 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

.... 

scvg29: 0 MB released
scvg29: inuse: 1, idle: 1, sys: 3, released: 1, consumed: 1 (MB)
SCHED 4500738ms: gomaxprocs=88 idleprocs=88 threads=40 spinningthreads=0 idlethreads=32 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
SCHED 4560745ms: gomaxprocs=88 idleprocs=88 threads=40 spinningthreads=0 idlethreads=32 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
GC forced
gc 14 @4589.303s 0%: 0.031+0.43+1.0 ms clock, 2.7+0/0.83/0.51+90 ms cpu, 1->1->1 MB, 4 MB goal, 88 P
SCHED 4620750ms: gomaxprocs=88 idleprocs=88 threads=40 spinningthreads=0 idlethreads=32 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
scvg30: inuse: 1, idle: 1, sys: 3, released: 1, consumed: 2 (MB)
SCHED 4680755ms: gomaxprocs=88 idleprocs=88 threads=40 spinningthreads=0 idlethreads=32 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

I'll have a little read around Go GC optimisation in case there is something we are doing in the proxy that makes this worse, but if we really want to get to the bottom of what is going on, I think it could take some digging. I might reach out to the golang folks to ask if they know why (is it maybe due to processor migrations forcing the GC to kick off more threads on the new cores it lands on, etc.).

@88plug
Copy link

88plug commented Oct 19, 2018

I have the same issue with ~40 crypto wallets... will report back soon.
Server Version: 18.06.1-ce

@martinlevesque
Copy link

martinlevesque commented May 10, 2019

docker -v
Docker version 18.09.2, build 6247962

Had to do a service restart: service docker restart

Then the issue was fixed.

A trick is to have this docker daemon config:

{
"live-restore": true
}

So when you restart the service, the existing containers stay up.

@grahamwhaley
Copy link
Contributor Author

Thanks for the info @martinlevesque. The more details you can provide the better. For instance:

  • which version of Kata were you running?
  • what containers were you running?
  • how many?
  • attach the output of kata-collect-data.sh to give us an idea of the setup

@jodh-intel
Copy link
Contributor

@martinlevesque - Could you clarify if you were using live restore when you restarted the docker service? If so, doesn't that imply a docker bug?

@martinlevesque
Copy link

Hey,

I have no idea what Kata Containers are, in fact - my bad. I am using the latest Debian docker.io package.

@grahamwhaley
Copy link
Contributor Author

Hi @martinlevesque - just to check then... do you mean either:

  • You are using Kata Containers, but you don't know what version
    or
  • You are not using Kata Containers at all, and you are reporting a generic Docker issue?

If the former, please run kata-runtime kata-env or the aforementioned kata-collect-data.sh and attach the output here...
If the latter, then this is not the place to report generic Docker issues - my guess is you should be reporting that maybe over in the https://github.com/moby code base, or directly with Docker themselves?

Also, if the latter, we can close the issue here :-)

@martinlevesque
Copy link

@grahamwhaley I'm not using kata containers - yes this is purely related to docker.

Thanks

@grahamwhaley
Copy link
Contributor Author

OK, closing this issue here then as 'not applicable'. Good luck

egernst pushed a commit to egernst/runtime that referenced this issue Feb 9, 2021
…cgroup

device: Ease device access for rootfs device to allow node creation