scaling: stability: 1000 containers after apprx 3 hours shows issue #807
@jodh-intel @sboeuf - input welcome here.
Wow, that system looks sick (in the conventional, non-slang sense :)

Might be worth grabbing

Also, I'm guessing we're not doing this currently, but it might be worth running

I've noticed something odd in the
Yep, it occurred to me I can still use `ps`:

gwhaley@clrw02:~/go/src/github.com/kata-containers/tests/metrics/density$ ps -e | fgrep kata-shim | wc
    843    3372   28662
gwhaley@clrw02:~/go/src/github.com/kata-containers/tests/metrics/density$ ps -e | fgrep kata-shim | wc
    843    3372   28662
gwhaley@clrw02:~/go/src/github.com/kata-containers/tests/metrics/density$ ps -e | fgrep qemu-lite | wc
   1000    4000   40000

Oh dear. Well, that sort of aligns pretty well with my previous run - my notes for that say '847 left'.
I tried to do a kata log grab/parse, but it seems my golang runtime is so stuffed by this that I can't `go get` and build it... it probably could have done with a clean set of logs anyway - so, I'll reserve that for the next clean reboot/run/test.

I don't think

and some system resource things:

Let me check something with our OPS folks - when I first ran this test their nagios got somewhat upset as I'd launched 13k processes :-) They bumped the limit, but that 13500 processes looks like a horribly round number, and near the limit I think they set. I'm hoping they don't have some auto-killer set up... hmm... even if they do, we should probably handle it better ;-)
That journal snippet shows that almost straight into the test a bunch of pids are dying with [1] - I also wonder if we should be capturing golang details in the collect script.
go version 1.10
But you can run that command on that box without it crashing?
Yep. So, we should probably stare fairly hard at that, and then I update my golang and try again?
OK, that link may not be directly related, but surfing a bit more around the phrase
@grahamwhaley Might be worth running this with runc as well to make sure this is a Kata issue. Have you tried that?
That's not a bad idea @amshinde :-). For the record, it looked like after 1h30, maybe the proxy died (apport seems to have picked up something), and that upset the shim etc.
I may have a new clue. If I do a
I re-ran with v1.3.0. I monitored the threads, and I see 'thread growth'. Looking at the threads normally on the components, it seems the proxy sits around 10 threads, but in this scenario it seems to grow/leak threads - monitoring one proxy I saw it grow to 41 and then 48 threads before I stopped the test.

For reference, I did a count of the threads associated with each container/component and came up with roughly (for an idle busybox container):

shim: 10

for a total of ~35 threads per kata instance. Thus, I should have had ~35k threads for my 1000 containers, but my system was showing more like 70k (the system caps at 90k threads, which is when I believe it starts killing things off - or rather, things fail to launch a new thread and abort). With 1000 containers the system was gaining about 50 threads/minute.

My belief is that the proxy leaks threads under this test setup. I suspect things may be responding slowly and it is hitting timeouts and spawning new threads without cleaning up old ones. I'll enable proxy debug, run the test again, and see if that gives us any more info.
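For anyone wanting to reproduce that per-component accounting, something along these lines works (a Linux-only sketch that matches on the process comm name; the numbers above were gathered with `ps`, not with this code):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// countProcsAndThreads walks /proc and, for every process whose Name:
// (comm) contains name, counts the process and adds up its Threads: field.
func countProcsAndThreads(name string) (procs, threads int) {
	entries, _ := filepath.Glob("/proc/[0-9]*/status")
	for _, status := range entries {
		data, err := os.ReadFile(status)
		if err != nil {
			continue // process exited while we were looking
		}
		var matches bool
		var nthreads int
		for _, line := range strings.Split(string(data), "\n") {
			if strings.HasPrefix(line, "Name:") && strings.Contains(line, name) {
				matches = true
			}
			if strings.HasPrefix(line, "Threads:") {
				nthreads, _ = strconv.Atoi(strings.TrimSpace(strings.TrimPrefix(line, "Threads:")))
			}
		}
		if matches {
			procs++
			threads += nthreads
		}
	}
	return procs, threads
}

func main() {
	for _, name := range []string{"kata-proxy", "kata-shim", "qemu"} {
		p, t := countProcsAndThreads(name)
		fmt.Printf("%-12s: %d processes, %d threads\n", name, p, t)
	}
}
```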
Enabled proxy debug, ran 1k containers, and then picked on a proxy PID to examine (as the thread count rose...) and did a bit of diagnostic work:

$ ps -e | fgrep kata-proxy | tail -1
88954 ? 00:00:04 kata-proxy
# magically dump the stacks to the journal
$ kill -SIGUSR1 88954
$ sudo journalctl _PID=88954 --no-pager > 88954_proxy.log
$ grep goroutine 88954_proxy.log
Oct 09 11:05:46 clrw02 kata-proxy[88954]: time="2018-10-09T11:05:46.549687853-07:00" level=error msg="goroutine 7 [running]:" name=kata-proxy pid=88954 sandbox=dd1ab4303ec71e49c04e88eeab282cf9356be2eb3a0275098527ebb6698b70a1 source=proxy
Oct 09 11:05:46 clrw02 kata-proxy[88954]: time="2018-10-09T11:05:46.550166481-07:00" level=error msg="goroutine 1 [chan receive, 80 minutes]:" name=kata-proxy pid=88954 sandbox=dd1ab4303ec71e49c04e88eeab282cf9356be2eb3a0275098527ebb6698b70a1 source=proxy
Oct 09 11:05:46 clrw02 kata-proxy[88954]: time="2018-10-09T11:05:46.550410725-07:00" level=error msg="goroutine 5 [syscall]:" name=kata-proxy pid=88954 sandbox=dd1ab4303ec71e49c04e88eeab282cf9356be2eb3a0275098527ebb6698b70a1 source=proxy
Oct 09 11:05:46 clrw02 kata-proxy[88954]: time="2018-10-09T11:05:46.550659135-07:00" level=error msg="goroutine 6 [select, 80 minutes, locked to thread]:" name=kata-proxy pid=88954 sandbox=dd1ab4303ec71e49c04e88eeab282cf9356be2eb3a0275098527ebb6698b70a1 source=proxy
Oct 09 11:05:46 clrw02 kata-proxy[88954]: time="2018-10-09T11:05:46.550938473-07:00" level=error msg="goroutine 8 [IO wait, 80 minutes]:" name=kata-proxy pid=88954 sandbox=dd1ab4303ec71e49c04e88eeab282cf9356be2eb3a0275098527ebb6698b70a1 source=proxy
Oct 09 11:05:46 clrw02 kata-proxy[88954]: time="2018-10-09T11:05:46.551525196-07:00" level=error msg="goroutine 9 [IO wait]:" name=kata-proxy pid=88954 sandbox=dd1ab4303ec71e49c04e88eeab282cf9356be2eb3a0275098527ebb6698b70a1 source=proxy
Oct 09 11:05:46 clrw02 kata-proxy[88954]: time="2018-10-09T11:05:46.552373309-07:00" level=error msg="goroutine 10 [select]:" name=kata-proxy pid=88954 sandbox=dd1ab4303ec71e49c04e88eeab282cf9356be2eb3a0275098527ebb6698b70a1 source=proxy
Oct 09 11:05:46 clrw02 kata-proxy[88954]: time="2018-10-09T11:05:46.552553201-07:00" level=error msg="goroutine 11 [sleep]:" name=kata-proxy pid=88954 sandbox=dd1ab4303ec71e49c04e88eeab282cf9356be2eb3a0275098527ebb6698b70a1 source=proxy
Oct 09 11:05:46 clrw02 kata-proxy[88954]: time="2018-10-09T11:05:46.552787405-07:00" level=error msg="goroutine 12 [IO wait, 80 minutes]:" name=kata-proxy pid=88954 sandbox=dd1ab4303ec71e49c04e88eeab282cf9356be2eb3a0275098527ebb6698b70a1 source=proxy
Oct 09 11:05:46 clrw02 kata-proxy[88954]: time="2018-10-09T11:05:46.55339924-07:00" level=error msg="goroutine 13 [chan receive, 80 minutes]:" name=kata-proxy pid=88954 sandbox=dd1ab4303ec71e49c04e88eeab282cf9356be2eb3a0275098527ebb6698b70a1 source=proxy
Oct 09 11:05:46 clrw02 kata-proxy[88954]: time="2018-10-09T11:05:46.553582598-07:00" level=error msg="goroutine 20 [select, 80 minutes]:" name=kata-proxy pid=88954 sandbox=dd1ab4303ec71e49c04e88eeab282cf9356be2eb3a0275098527ebb6698b70a1 source=proxy
Oct 09 11:05:46 clrw02 kata-proxy[88954]: time="2018-10-09T11:05:46.553966851-07:00" level=error msg="goroutine 21 [IO wait, 80 minutes]:" name=kata-proxy pid=88954 sandbox=dd1ab4303ec71e49c04e88eeab282cf9356be2eb3a0275098527ebb6698b70a1 source=proxy
$ grep goroutine 88954_proxy.log | wc
12 191 2968
$ ps -L 88954
PID LWP TTY STAT TIME COMMAND
88954 88954 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 88957 ? Sl 0:01 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 88958 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 88959 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 88960 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 88961 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 88963 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 88965 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 88966 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 89178 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 89179 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 5901 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 5904 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 5905 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 5907 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 7411 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 7412 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 9708 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 9709 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 11829 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 11830 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 14921 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 14923 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 18112 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 18113 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 21643 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 21644 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 25712 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 25714 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 29350 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 29351 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 33153 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 33154 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 36749 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 36755 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 40632 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 40633 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 44061 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 44062 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 53254 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 57804 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 64891 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 69407 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 74352 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 75414 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 77034 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 55391 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 2636 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 3692 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 26355 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 59354 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 74836 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 79670 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 81267 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 82842 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 86503 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 88631 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 89814 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
88954 2459 ? Sl 0:00 /usr/libexec/kata-containers/kata-proxy -listen-socket unix:///run/vc/
$ ps --no-headers -L 88954 | wc
59 944 27966
$ grep error 88954_proxy.log | head -1
Oct 09 09:05:46 clrw02 kata-proxy[88954]: time="2018-10-09T09:05:46.644453689-07:00" level=debug msg="Copy stream error" error="write unix /run/vc/sbs/dd1ab4303ec71e49c04e88eeab282cf9356be2eb3a0275098527ebb6698b70a1/proxy.sock->@: use of closed network connection" name=kata-proxy pid=88954 sandbox=dd1ab4303ec71e49c04e88eeab282cf9356be2eb3a0275098527ebb6698b70a1 source=proxy
Two things then...
If there are other things you think I can look at, just let me know - I'll leave this test running overnight.
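As an aside on the 'magic' SIGUSR1 stack dump used above: the general mechanism is a signal handler that calls `runtime.Stack` for all goroutines. A minimal sketch, similar in spirit to (but not) the actual proxy code:

```go
package main

import (
	"log"
	"os"
	"os/signal"
	"runtime"
	"syscall"
)

// dumpStacksOnUSR1 installs a SIGUSR1 handler that logs every goroutine's
// stack - the same general mechanism used to get the dumps shown above.
func dumpStacksOnUSR1() {
	c := make(chan os.Signal, 1)
	signal.Notify(c, syscall.SIGUSR1)
	go func() {
		for range c {
			buf := make([]byte, 1<<20) // 1 MiB is plenty for all stacks here
			n := runtime.Stack(buf, true)
			log.Printf("goroutine stacks:\n%s", buf[:n])
		}
	}()
}

func main() {
	dumpStacksOnUSR1()
	// Block until interrupted; meanwhile `kill -USR1 <pid>` dumps stacks.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, os.Interrupt, syscall.SIGTERM)
	<-stop
}
```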
@grahamwhaley - I've done some testing but cannot make the proxy misbehave yet. That might be because I'm not scaling up to the levels you are, but fwics the connections are always closed correctly. I'll keep trying...

On the topic of goroutines - there isn't a 1:1 relationship between O/S threads and goroutines, so I think we'd expect to see
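To make that goroutine/thread distinction concrete, a Go process can report both counts about itself; the numbers need not match, since goroutines are multiplexed onto a pool of OS threads. A minimal sketch:

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
)

func main() {
	// Live goroutines right now.
	fmt.Println("goroutines:", runtime.NumGoroutine())
	// OS threads the runtime has created so far (a creation count, so it
	// never goes down, but it is a handy upper bound on thread usage).
	fmt.Println("threads created:", pprof.Lookup("threadcreate").Count())
}
```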
More info then...

This then also hints at the tempting-sounding SetMaxThreads, but from reading

I ran up

Here is a thread snippet:
Only one thread is at epollwait - all others are in the runtime.futex. Their stacks look like:
The goroutines look like:
The gopark stacks look like:
I need to do more digging into the code to see why the goroutines might end up at that point. Still open for ideas... ;-)
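On the SetMaxThreads mention above: `runtime/debug.SetMaxThreads` does cap the thread count, but the whole process dies if it ever needs more, which is presumably why it is only 'tempting'. A small sketch demonstrating the behaviour (nothing to do with the proxy code itself):

```go
package main

import (
	"fmt"
	"runtime/debug"
	"sync"
	"syscall"
)

func main() {
	// Hard cap at 20 OS threads. This is a crash-on-exceed limit, not a
	// soft throttle.
	prev := debug.SetMaxThreads(20)
	fmt.Println("previous thread limit:", prev)

	// Block enough goroutines in real syscalls that the runtime would need
	// more than 20 threads - at which point it aborts the whole process.
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			ts := syscall.Timespec{Sec: 1}
			syscall.Nanosleep(&ts, nil)
		}()
	}
	wg.Wait() // never reached: the runtime fatal-errors once the limit is hit
}
```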
@jodh-intel no sorry, I don't have anything that comes to mind here. I'll try to look into the proxy code later.
Looking at those goroutines, nothing jumps out - we can see:
And in fact when I connect to an idle proxy I see the same set of goroutines.
man 2 clone:
It looks like you have reached one of these user namespaces limits.
:-) You threw me for a minute there @bergwolf - but, you are probably right in a way - I think we have hit the pid_max limit - which probably also applies to the pid namespace. @jodh-intel and I are staring at the for loop in the proxy at https://github.com/kata-containers/proxy/blob/master/proxy.go#L123-L138
I put a log message just above the
Eyes on that code welcome. And any input on why we get multiple Accepts() (and if that is expected or not) also welcome. I can add more log/debug if we have ideas to test or prove.
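For readers without the linked source to hand, the loop under discussion is roughly of this shape (a simplified sketch, not the actual proxy code): every `Accept()` hands the new connection to goroutines that shuttle data to and from the VM-side session, so repeated Accepts mean more goroutines - and, when they block, potentially more OS threads - per proxy.

```go
package main

import (
	"io"
	"log"
	"net"
)

// serve sketches the accept-loop shape being discussed: each accepted
// client connection gets goroutines copying data between it and the
// VM-side session.
func serve(l net.Listener, session io.ReadWriter) {
	for {
		conn, err := l.Accept()
		if err != nil {
			return
		}
		go func(c net.Conn) {
			defer c.Close()
			go io.Copy(session, c) // client -> session
			io.Copy(c, session)    // session -> client
		}(conn)
	}
}

func main() {
	// Hypothetical listener just to make the sketch runnable; the real
	// proxy listens on a unix socket under /run/vc/.
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	clientSide, vmSide := net.Pipe() // stand-in for the yamux session
	go io.Copy(io.Discard, vmSide)   // pretend the "VM" end consumes data
	serve(l, clientSide)
}
```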
@grahamwhaley what host kernel version are you using?
@grahamwhaley - To help answer @bergwolf's question, please can you paste in the output of
@bergwolf @jodh-intel - the output of a
Let me do a post in a minute giving the core details of the issue (the proxy grows idle timer threads, which runs the machine out of PID space basically).
OK, let me see if I can explain what I think I know/have found. I'll explain it in a sort of reverse-timeline from the point where containers die....

1) Containers die

The root issue is that the system runs out of PID space. That is, there is a limited number of PIDs each system supports. On this system, if we go cat
That is a system where I had already killed a few

2) the
| what | No. |
|---|---|
| gcBgMarkWorker | 88 |
| timerproc | 11 |
| !goexit@2 | 15 |
The `gcBgMarkWorker` is the golang garbage collector. It has spawned one goroutine per processor (we'll get to that code in a minute).

The `timerproc` is reasonably obvious - that is the per-processor goroutine that processes the timer queues. We don't have one of those for every processor though.

The other 15 entries are the 'normal' goroutines doing the work for the proxy, and look like:
$ grep "2 " stacks.txt | grep -v goexit | grep -v timerproc grep -v "12 "
2 0x0000000000405d32 in runtime.chanrecv
2 0x000000000042d0ec in runtime.forcegchelper
2 0x000000000041f27c in runtime.bgsweep
2 0x00000000004164fd in runtime.runfinq
2 0x00000000004f5bd2 in os/signal.loop
2 0x0000000000454be4 in runtime.ensureSigM.func1
2 0x0000000000405d32 in runtime.chanrecv
2 0x0000000000427c47 in internal/poll.runtime_pollWait
2 0x0000000000427c47 in internal/poll.runtime_pollWait
2 0x00000000004fa190 in github.com/kata-containers/proxy/vendor/github.com/hashicorp/yamux.(*Session).send
2 0x00000000004486a6 in time.Sleep
2 0x0000000000427c47 in internal/poll.runtime_pollWait
2 0x0000000000405d32 in runtime.chanrecv
2 0x00000000004fbfa1 in github.com/kata-containers/proxy/vendor/github.com/hashicorp/yamux.(*Stream).Read
2 0x0000000000427c47 in internal/poll.runtime_pollWait
So, the garbage collector looks to be the main culprit on spawning threads. Let's have a very quick peek at that code. Here is a stack:
(dlv) stack
0 0x000000000042d2aa in runtime.gopark
at /usr/local/go/src/runtime/proc.go:292
1 0x0000000000419e90 in runtime.gcBgMarkWorker
at /usr/local/go/src/runtime/mgc.go:1775
2 0x0000000000457fe1 in runtime.goexit
at /usr/local/go/src/runtime/asm_amd64.s:2361
Which leads us to the golang gc code at https://github.com/golang/go/blob/release-branch.go1.10/src/runtime/mgc.go#L1714-L1728
but... why do we end up with so many?
3) Because the proxy is migrating around the processors?
The only reasons I could think of why we would end up with so many per-cpu goroutines was either we were getting some lock/race/stuck condition that then forced the scheduler to spawn more threads on other processors, or the proxy was migrating across processors, and the golang gc was kicking off a new thread/goroutine on the next iteration of the gc on the new processor.
I'm not yet convinced, but I thought I'd check the proxy for migrations. Remember, this is a wholly 'static' set of containers - 999 containers (one failed to launch ;-) ) all running busybox tailing /dev/null - so, no real activity:
gwhaley@clrw02:~/go/src/github.com/kata-containers/tests/metrics/density/crash5$ sudo perf stat -p 89992 sleep 600
Performance counter stats for process id '89992':
26.596074 task-clock (msec) # 0.000 CPUs utilized
629 context-switches # 0.024 M/sec
23 cpu-migrations # 0.865 K/sec
0 page-faults # 0.000 K/sec
29,650,314 cycles # 1.115 GHz
26,635,571 instructions # 0.90 insn per cycle
6,137,454 branches # 230.765 M/sec
175,515 branch-misses # 2.86% of all branches
600.012957407 seconds time elapsed
Well, that is pretty interesting. In the space of 600 seconds we got 629 context switches (sounds like a 1s timer...), and 23 cpu migrations??
4) Hypothesis then
My basic hypothesis is that, when we have more processes than CPUs, the proxy gets migrated around, and that generates the gc goroutines which occupy threads. This does not quite hold water though, as then we'd expect to have 88 threads (one for each CPU) - and why does this not also happen for the shim? Well, OK, maybe the shim really is asleep all the time, and thus not scheduled, and thus not migrated (should be easy to prove with perf). But the proxy has the yamux heartbeat timer - so, it is constantly waking up and doing a little bit of work. If you have 1000 of them, and they are all waking up at 1s intervals, in my mind it is not inconceivable that at some point many of them wake up on the same processor, where other processors have been idle, and the kernel scheduler decides somebody needs migrating. And maybe that is just a little storm that continues over time.
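To illustrate the wake-up pattern in that hypothesis: each proxy effectively runs something like the loop below (a stand-in for the yamux keepalive, not the real code). One instance is negligible; 1000 of them means a constant trickle of short wake-ups for the scheduler to place.

```go
package main

import (
	"fmt"
	"time"
)

// keepalive wakes once a second, does a tiny amount of work and goes back
// to sleep - the pattern suspected of provoking the cpu migrations seen in
// the perf output above.
func keepalive(done <-chan struct{}) {
	t := time.NewTicker(1 * time.Second)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			// the real proxy would send a yamux ping here
			fmt.Println("ping at", time.Now().Format(time.RFC3339))
		case <-done:
			return
		}
	}
}

func main() {
	done := make(chan struct{})
	go keepalive(done)
	time.Sleep(5 * time.Second)
	close(done)
}
```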
5) What does that mean for current kata scaling limits
Right now, it would seem that kata scaling is somewhat tied to the number of CPUs on the system, due to the proliferation of threads spawned in the proxy.
Given the processes/threads we have accounted for per kata container:
what | how many procs/threads |
---|---|
docker | 1 |
qemu | 3 |
kvm-pit | 1 |
kvm-vhost | 1 |
shim | 10 |
proxy | 15 +ncpu |
and since the kernel, by default, allows 1024 PIDs per cpu, we can probably formulate a guess at the max number of containers as limited by the nCPU right now:
containers <= (ncpu * 1024) / (31 + ncpu)
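Taking that guess at face value for this 88-core box: (88 * 1024) / (31 + 88) = 90112 / 119 ≈ 757 containers - below the 1000 launched in this test, which fits with the system falling over rather than staying up.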
6) Next steps
We still need to prove exactly what is happening and why. The 60-threads-vs-88-cores discrepancy does not quite make sense.
There are a number of golang trace facilities and settings we can try in order to correlate this situation. GOMAXPROCS will be an interesting one to start with (although I'm not sure if it will constrain the gc goroutines, as they are system goroutines, not user ones etc.). There is go tracing for the scheduler as well. The hard bit will be getting those set up on the proxy maybe.
I'm discussing with @markdryan about the golang scheduler, and correlating if my thoughts so far make sense. @markdryan suspects we may have goroutines hung up on syscalls somewhere - I have provided him with a full set of stack traces for the 114 goroutines in case something leaps out.
The search continues. Things I might try:
- see if I can get `GOMAXPROCS` into the docker env and thus passed on to the proxy (or see if I can wire it into the proxy startup, which I think might be an option), to see if that reduces the max threads and goroutines the proxy spawns.
- possibly run fewer containers (say n < ncpus), and see if that affects the cpu migration and thread/goroutine count. Maybe we will find that with n > ncpus it appears...
@grahamwhaley I would be interested in seeing the stack traces of the proxy goroutines. I took an initial look at the proxy code; I see that we manually send out the heartbeat every second now. We have also set the session
Hi @amshinde. Sure, np. Attaching a set of stacks grabbed from

Heh, yeah, I'd spotted those 1s timeouts on Friday. I tweaked them (to 30s for heartbeat and 10s for write timeout) for my 'possible fix over the weekend' - but it didn't work :-(

@@ -352,6 +358,9 @@ func realMain() {
var channel, proxyAddr, agentLogsSocket, logLevel string
var showVersion bool
+ // Reduce the number of threads we make
+ runtime.GOMAXPROCS(10)
+

applied to the proxy, and this morning my stats look like:

$ ./stats.sh
999 proxys, 11571 threads - avg 11.58
999 shims, 10583 threads - avg 10.59
999 qemus, threads - avg 3.00
$ ps -eL | wc -l
41221
The heavy implication being that

This will need more investigation and thought I think - but it is a good clue. Maybe this is a core golang gc/scheduler 'feature' the proxy is provoking etc.
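As a sketch of how that workaround could be made tunable rather than hard-coded (note: `KATA_PROXY_GOMAXPROCS` is an invented name for illustration, not an existing knob):

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
)

// setMaxProcs caps the number of OS threads running Go code, using an
// environment override if one is set, otherwise a supplied default.
func setMaxProcs(def int) int {
	if v := os.Getenv("KATA_PROXY_GOMAXPROCS"); v != "" { // hypothetical knob
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return runtime.GOMAXPROCS(n)
		}
	}
	return runtime.GOMAXPROCS(def)
}

func main() {
	prev := setMaxProcs(2)
	// GOMAXPROCS(0) queries the current value without changing it.
	fmt.Printf("GOMAXPROCS: was %d, now %d\n", prev, runtime.GOMAXPROCS(0))
}
```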
For ref, whilst I have it on my screen, I attached to one of the proxies with
Nice findings @grahamwhaley! By default

The same applies to kata-shim since it does not set

And IMO we should dig a bit more to see why kata-proxy spawns up to the limit of max processes. It seems one possible reason is docker/k8s keeps calling

Another thing to look at is the following piece of code. Maybe we should drop
A couple of comments. At the bottom of https://github.com/golang/go/wiki/Performance#garbage-collector-trace we see a comment stating that the maximum number of threads the GC can use is 8. So I think we can rule the GC out.

Next, the fact that GOMAXPROCS changes things is interesting. "The GOMAXPROCS variable limits the number of operating system threads that can execute user-level Go code simultaneously. There is no limit to the number of threads that can be blocked in system calls on behalf of Go code; those do not count against the GOMAXPROCS limit." If by setting this to 10 we see a reduction in the number of threads used by the process, this implies that the issue is not with blocking system calls as I had first thought. If it were, setting GOMAXPROCS wouldn't improve things. The fact that it does also implies that the issue is not with system runtime threads either.

So perhaps what we're seeing is that a burst in activity leads to the creation of lots of go routines and the allocation of lots of system threads to service them in parallel. These threads are never released back to the system. Certainly in the small test program I ran quickly on my 8 core machine, I don't ever see any threads getting released back to the system once the go-routines they were allocated to service have quit. Assuming this is the case, setting GOMAXPROCS is the correct fix.
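For reference, a test program along these lines (a sketch of the kind of experiment described, not the exact program used) shows the effect: a burst of goroutines blocked in syscalls forces extra OS threads into existence, and the live thread count stays high after the goroutines have gone.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"sync"
	"syscall"
	"time"
)

// liveThreads reads the Threads: field from /proc/self/status (Linux only)
// to report how many OS threads the process currently has.
func liveThreads() int {
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		return -1
	}
	for _, line := range strings.Split(string(data), "\n") {
		if strings.HasPrefix(line, "Threads:") {
			n, _ := strconv.Atoi(strings.TrimSpace(strings.TrimPrefix(line, "Threads:")))
			return n
		}
	}
	return -1
}

func main() {
	fmt.Println("threads before burst:", liveThreads())

	// A burst of goroutines blocked in a real syscall: the runtime has to
	// give each one its own OS thread while it is blocked.
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			ts := syscall.Timespec{Sec: 2}
			syscall.Nanosleep(&ts, nil)
		}()
	}
	time.Sleep(500 * time.Millisecond)
	fmt.Println("threads during burst:", liveThreads())

	wg.Wait()
	time.Sleep(2 * time.Second)
	// The goroutines have exited, but the threads created for them are
	// typically still attached to the process.
	fmt.Println("threads after burst:", liveThreads())
}
```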
Hi @markdryan
Well, I see it on the wiki, but I don't see it in the code. What I do see in the code is, I think, a limit of 25% of CPU resources:

That is for the background mark code I think, but then those are the idle threads I saw in the stack dumps. I don't know if the rest of the gc is parallelized (but there are some really interesting presentations out there on the golang gc - this one covers the history and design: https://blog.golang.org/ismmkeynote). Anyhow, I think unless we invoke the tracers and try to deep-diagnose this, we may not understand what is going on there with the gc.

As per @bergwolf though - we probably do want to understand. I really don't like closing bugs with a workaround without knowing what it's really doing (that is ... proof....) :-)

I'll see if I can get set up to hand-run one container/proxy after I have the other 999 running or so, and see if we can get any trace. Looks like the trace comes out on stdout though, so that won't help.

@bergwolf wrt that loop and the

If anybody has something specific they want me to try, just ask. Or, if you want to try and reproduce this on a local machine (which, probably due to the nature of the problem, may be easier to see if the machine has many cores), also just ask.
@grahamwhaley It looks like you're correct. That comment about the GC thread limit is out of date. It looks like they increased the value to 32 and then got rid of it completely. In any case, it doesn't really make any difference. If I understand correctly, GOMAXPROCS affects the number of threads the runtime can use for the GC as well as for the user Go routines. By not setting the value you're giving each Go process the right to legitimately use at least 88 threads on your machine, and the runtime is doing just this.
@markdryan Basically, yes. And there is no current mechanism for golang to reap/prune/purge the idle threads, I think. There is a merge in 1.11.1 that looks like the prep work to enable this. And there is an open Issue relating to quitting idle OS threads. For now, setting

I'll see if I can work out how to do some golang trace on the scheduler and/or gc whilst having the proxy under docker/runtime control...

In summary, I think:
@grahamwhaley My guess would be that the proxy is running more simultaneous go-routines than the other processes, or possibly generating more garbage.
I thought I'd go see if

Seems we have some precedent.
Update. Using this horrible hack (which I slapped in a gist for good measure ;-)

In the proxy:

diff --git a/proxy.go b/proxy.go
index 2a51f16..4a054bb 100644
--- a/proxy.go
+++ b/proxy.go
@@ -18,6 +18,8 @@ import (
"net/url"
"os"
"os/signal"
+// "runtime"
+ "runtime/trace"
"sync"
"syscall"
"time"
@@ -450,5 +452,18 @@ func realMain() {
func main() {
defer handlePanic()
+
+ if _, err := os.Stat("/tmp/trace_proxy"); !os.IsNotExist(err) {
+ fname := fmt.Sprintf("/tmp/proxy.%d.log", os.Getpid())
+ logFile, _ := os.OpenFile(fname, os.O_WRONLY | os.O_CREATE | os.O_SYNC, 0755)
+ defer logFile.Close()
+ if false {
+ trace.Start(logFile)
+ defer trace.Stop()
+ } else {
+ syscall.Dup2(int(logFile.Fd()), 1)
+ syscall.Dup2(int(logFile.Fd()), 2)
+ }
+ }
realMain()
}

And in the runtime:

diff --git a/virtcontainers/kata_proxy.go b/virtcontainers/kata_proxy.go
index 1ff9d01..6fbc0b7 100644
--- a/virtcontainers/kata_proxy.go
+++ b/virtcontainers/kata_proxy.go
@@ -6,6 +6,7 @@
package virtcontainers
import (
+ "os"
"os/exec"
"syscall"
)
@@ -47,6 +48,13 @@ func (p *kataProxy) start(params proxyParams) (int, string, error) {
}
cmd := exec.Command(args[0], args[1:]...)
+
+ if _, err := os.Stat("/tmp/trace_proxy"); !os.IsNotExist(err) {
+ env := os.Environ()
+ env = append(env, "GODEBUG=gctrace=2,schedtrace=60000")
+ cmd.Env = env
+ }
+
if err := cmd.Start(); err != nil {
return -1, "", err
}

Then running up the 1000 container test (with the
@grahamwhaley Nice findings about the
PRs opened to set

I did a fuller run and captured the GODEBUG trace for when a 'proxy goes bad'... as we thought, it happens at the point where the GC kicks in. The proxy starts with 11 threads reported, and then at the point where the first GC cycle kicks in ('GC forced'), at about the 52 minute point from the test start, we see the thread count grow. The odd thing is it then seems to grow an extra 2 threads per GC cycle, until it settled down at 40 threads in this instance, when I got bored waiting ;-)

I'll have a little read around golang gc optimisation in case there is something we are doing in the proxy that makes this worse, but I think if we really want to get to the bottom of what is going on, it could take some digging. I might reach out to the golang folks to ask if they know why (is it maybe due to processor migrations forcing the gc to kick off more threads on the new cores it has landed on etc.).
I have the same issue with ~40 crypto wallets... will report back soon.
docker -v

Had to do a service restart:

service docker restart

Then the issue was fixed. A trick is to have that docker daemon config:

{

So when you restart the service the existing containers stay up.
Thanks for the info @martinlevesque. The more details you can provide the better. For instance:
@martinlevesque - Could you clarify if you were using live restore when you restarted the docker service? If so, doesn't that imply a docker bug?
Hey, I have no idea what kata containers are in fact, my bad. I am using the latest debian docker.io package.
Hi @martinlevesque - just to check then... do you mean either:

If the former, please run

Also, if the latter, we can close the issue here :-)
@grahamwhaley I'm not using kata containers - yes, this is purely related to docker. Thanks
OK, closing this issue here then as 'not applicable'. Good luck
Description of problem
Using the fast footprint test, I launched 1000 containers, and left them running overnight. By the morning, things had crashed.
Expected result
These are pretty benign containers (a busybox doing 'nothing'), and the system is pretty large and not resource constrained afaict (88 cores, 384GB of RAM). I'd expect the containers to stay up, pretty much forever.
Actual result
Something has 'died', and it looks like the kata runtime has become non-functional.
The first time I ran this test, iirc, I ended up with 847 'live' containers in the morning. This time things crashed out. Logs below.
What did I run
For reference, I used this script to run the test and try to capture details upon death:
What did I see
From the logs then...
I've attached the full log as 'death.log'.
death.log
Also, if I try to use the runtime to list how many containers are still running:
I've uploaded the output from the kata collect as an attachment:
collect.log