CPU/MEMORY leak in agones controller container #414

Closed
Oleksii-Terekhov opened this issue Nov 12, 2018 · 31 comments
Labels: area/performance, kind/bug

Comments

@Oleksii-Terekhov

Oleksii-Terekhov commented Nov 12, 2018

This is only a suspicion, but after restarting the agones-controller pod, CPU usage constantly increases. This behavior is independent of the current count of fleets/gs/autoscalers.

It looks like any fleet/gs keeps eating CPU even after being deleted (is a garbage collector needed in the data structs?).

(attached screenshots: CPU usage graphs)

Reproduced with agones v0.4.0 and v0.5.0

@markmandel added the area/performance and kind/bug labels Nov 12, 2018
@markmandel
Collaborator

Thanks for sharing this, and taking the time to graph it out. Do you have any specific steps we could use to reproduce this? Looks like it happens over a few days?

@Oleksii-Terekhov
Author

Nothing special - start/stop fleets.
Up to 40 fleets with 5 gs on v0.4.0.
Up to 40 fleets with an autoscaler (min: 0, buffer: 2, max: 10) on v0.5.0.
Average deploy rate for each fleet: ~4-5 per day.

The main concern is heavy CPU usage after removing all fleets/gs/autoscalers.

@Oleksii-Terekhov
Author

Oleksii-Terekhov commented Nov 13, 2018

STR:

  1. empty k8s cluster (dashboard, agones, nginx-ingress, istio), 3 master nodes, 3 worker nodes, 8 cores, 16G RAM
  2. Create a simple start-wait-stop shell script:
$ cat loop.sh
while :
do

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/agones/master/examples/simple-udp/fleet.yaml
    sleep 5
    kubectl get fleet
    sleep 5
    kubectl get gs -o=custom-columns=NAME:.metadata.name,STATUS:.status.state,CONTAINERPORT:.spec.ports[0].containerPort
    sleep 15
    kubectl delete fleet simple-udp
    sleep 5
done
  3. restart the controller pod
  4. wait for 2 hours
  5. run loop.sh for 2 hours
  6. stop loop.sh and delete all fleets
  7. wait for 1 hour

Results:

Wall clock: 10:00 CPU: 0.008m, RAM: 14M - restart controller
Wall clock: 11:00 CPU: 0.008m, RAM: 14M - after 1h idle
Wall clock: 12:00 CPU: 0.009m, RAM: 15M - start loop.sh
Wall clock: 13:00 CPU: 0.220m, RAM: 21M - after 1h burn
Wall clock: 13:30 CPU: 0.369m, RAM: 24M - after 1.5h burn
Wall clock: 14:00 CPU: 0.447m, RAM: 26M - stop all fleets after 2h of burn

Wall clock: 15:00 CPU: 0.464m, RAM: 23M - after 1h idle

@markmandel
Collaborator

Ah perfect, thanks!

Can you have a look at (and possibly share?) the controller logs at the end there, if you still have them? I'm wondering if something gets stuck in a queue and just tries to sync forever. It may show up something obvious.

@Oleksii-Terekhov
Author

The pod is still alive. The last log lines are from fleet shutdown time:

2018-11-13T11:57:50.051309488Z {"key":"default/simple-udp-scvnl-pmttc","msg":"Enqueuing key","queue":"stable.agones.dev.GameServerController","severity":"info","source":"*gameservers.Controller","time":"2018-11-13T11:57:50Z"}
2018-11-13T11:57:50.051337574Z {"key":"default/simple-udp-scvnl-ckkzk","msg":"Enqueuing key","queue":"stable.agones.dev.GameServerController","severity":"info","source":"*gameservers.Controller","time":"2018-11-13T11:57:50Z"} 

@markmandel
Collaborator

Hmnn. It does resync every 30 seconds, but if there aren't any pods left, that is odd. Thanks for the info!

@Oleksii-Terekhov
Author

Small update: with an empty fleet (replicas: 0) there is no CPU leakage:

apiVersion: "stable.agones.dev/v1alpha1"
kind: Fleet
metadata:
  name: simple-udp
spec:
  replicas: 0
  template:
    spec:
      ports:
      - name: default
        containerPort: 7654
      template:
        spec:
          containers:
          - name: simple-udp
            image: gcr.io/agones-images/udp-server:0.4

@markmandel
Collaborator

Looking to see if other scenarios cause the same effect.

Leaving a Fleet of 50 GameServers running over time, and then doing a restart in the middle - that seems to be fine.

(attached screenshot: resource usage graph over time)

(Line is the restart)

@markmandel
Collaborator

Would you feel comfortable rebuilding with https://golang.org/pkg/runtime/pprof/ and doing a test with that, and seeing where the CPU cycles are going?

Totally fine if you aren't - maybe we can do a special build.
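
For anyone following along, here is a minimal sketch of what such a special build could look like - this is illustrative only, not the actual Agones controller code - it just serves the standard net/http/pprof endpoints on a side port:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve the pprof endpoints on a dedicated port, alongside the
	// controller's normal work (omitted here).
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the controller's real work would run here ...
	select {}
}

With that running, go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30 collects a 30-second CPU profile.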

markmandel added a commit to markmandel/agones that referenced this issue Nov 15, 2018
Tools for building a controller with pprof enabled, and Make targets
for accessing the http pprof endpoint while it is running on a Kubernetes
cluster, with accompanying documentation.

This will help in diagnosing googleforgames#414
@Oleksii-Terekhov
Author

We use pprof, but I'm not so familiar with gotpl...
Let's have pprof always built into the controller and SDK images, but disabled - waiting on a special env var or user signal to start/stop capturing...
I can test a special image with pprof and send the pprof logs on Monday.

@markmandel
Collaborator

I made this PR to have a compile flag to enable pprof on some builds:
#422

Just waiting on review.
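
As a sketch of how a compile flag can gate this (illustrative only, not the code from #422, and using the modern //go:build syntax): a single file guarded by a Go build tag starts the pprof server, and normal builds simply exclude it.

//go:build profile

// This file is only compiled when building with: go build -tags profile
package main

import (
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
)

func init() {
	// Expose the pprof endpoints on a side port in profiling builds only.
	go http.ListenAndServe("localhost:6060", nil)
}

A plain go build leaves this file out entirely, so the shipped controller image pays no cost.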

markmandel added a commit that referenced this issue Nov 16, 2018
Tools for building a controller with pprof enabled, and Make targets
for accessing the http pprof endpoint while it is running on a Kubernetes
cluster, with accompanying documentation.

This will help in diagnosing #414
@Oleksii-Terekhov
Author

After a build-run-burn cycle with the pprof controller:
pprof_agones.zip

Notes:

  • the controller ate 0.7 CPU during the profiling window
  • I ran the command manually in the build container - go tool pprof -svg http://<k8s>:<controller_node_port>/debug/pprof/profile?seconds=300 - instead of pprof-web, due to ssh + headless VM
  • the bug is still present in 0.6.0 - after 17h on a prod cluster the Agones controller eats 2500+ millicores

@cyriltovena
Collaborator

Only the "fast user space mutex" (futex) stands out here; it might be because there is only one worker listening to the channel and a lot of events to process.

@markmandel
Collaborator

What may work best (not sure if you still have the CPU pprof files - it would be good to keep them, so we can share them around and do various diagramming on them):

  • Get one at the beginning when the system is idle (i.e. no changes are being made to fleets or gameservers), for a baseline.
  • Halfway through the test, grab another when the system is idle.
  • And then at the end, when things have really spiked, grab another one when the system is idle.

Then if we flame graph each stage, we should (hopefully) be able to compare each of them and see visually where the growth is in the flame graphs.

Does that make sense? I personally find it really hard to parse the CPU graph trees, I don't know about you all - also, with no baseline comparison, it's hard to see where the growth is. I think a flame graph will be easier to visualise.

And thanks for putting all the time in on this!
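
As a side note on tooling (an assumption about the local pprof version, not something from this thread): recent versions of go tool pprof can render the flame graph themselves - for example, go tool pprof -http=:8080 <saved-cpu-profile> opens a web UI that includes a Flame Graph view, which avoids hand-reading the call-graph SVGs.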

@Oleksii-Terekhov
Author

I finished a new measurement:
pprof2.zip
When there is no API activity (create/delete/...), the controller log is empty.

The root cause looks like a simple memory leak: I see a strong CPU-memory correlation:
"Burn" cluster: (attached screenshot: CPU/memory graph)
"Prod" cluster: (attached screenshot: CPU/memory graph)

CPU and memory usage are proportional to the gs/gss/fleet count since the controller restart. Even deleted objects keep consuming resources.

I did not test FleetAllocation and FleetAutoscaler.

@markmandel
Collaborator

If we think this is a memory leak - can we grab some pprof memory profiles, to see if we can find where it's happening?

I still think flame graphs would make it much easier to see where the CPU time is going - I find the CPU graphs hard to follow personally.
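
For reference, the standard net/http/pprof endpoints for this are /debug/pprof/heap (in-use memory) and /debug/pprof/allocs (cumulative allocations). For example, go tool pprof http://<controller>:<port>/debug/pprof/heap, followed by top or list <function> at the interactive prompt, shows where the in-use bytes are sitting (host and port here are placeholders).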

@Oleksii-Terekhov
Author

New data from pprof allocs+heap

pprof3.zip

@Oleksii-Terekhov changed the title from "CPU leak in agones controller container" to "CPU/MEMORY leak in agones controller container" Dec 4, 2018
@markmandel
Collaborator

@Kuqd and I were digging into this some more - the memory leak looks to be in the k8s event recorder. Which leads us to think that k8s events aren't being properly processed, which causes the memory leak, and because the event queue goes into exponential backoff, we get the CPU leak as well.

So a few more fun questions:

  • Are you running this on a GKE cluster? (Basically, I'm wondering if this is a custom cluster, and if there is a setup issue that is causing the events not to get processed.)
  • Have we tried this without anything OTHER than Agones installed on the cluster? (I noticed that Istio is installed. I'm wondering if there is a mutating webhook making changes that are causing the event system to fail.)

We will find a solution for this! 👍

@markmandel
Collaborator

Oh, I had one more question - are there any errors in the Kubernetes logs themselves, basically anything regarding rejecting or not being able to process event streams?

(if you kubectl describe a GameServer resource - do you see events at the bottom of the resource description?)

@Oleksii-Terekhov
Author

  • Are you running this on a GKE cluster? - No, several bare-metal clusters (kubespray 2.8.0 + 2.6.0)
  • Reproduced without istio (only dashboard + heapster + nginx_ingress)
  • kubectl describe gs:
Events:
  Type    Reason          Age   From                   Message
  ----    ------          ----  ----                   -------
  Normal  PortAllocation  8s    gameserver-controller  Port allocated
  Normal  Creating        8s    gameserver-controller  Pod simple-udp-kkl8c-rz9fn-rfdr2 created
  Normal  Scheduled       8s    gameserver-controller  Address and port populated
  Normal  Ready           6s    gameserver-controller  SDK.Ready() executed
  • No errors in k8s logs on master node and on node with controller pod
  • Memory top after fast burn:
go tool pprof http://10.10.20.13:59060/debug/pprof/heap

(pprof) top
Showing nodes accounting for 6256.90kB, 100% of 6256.90kB total
Showing top 10 nodes out of 97
      flat  flat%   sum%        cum   cum%
 1629.58kB 26.04% 26.04%  1629.58kB 26.04%  agones.dev/agones/vendor/github.com/golang/groupcache/lru.(*Cache).Add
  528.17kB  8.44% 34.49%   528.17kB  8.44%  agones.dev/agones/vendor/k8s.io/apimachinery/pkg/watch.(*Broadcaster).Watch.func1
     514kB  8.21% 42.70%      514kB  8.21%  agones.dev/agones/vendor/k8s.io/apimachinery/pkg/conversion.ConversionFuncs.Add
  512.62kB  8.19% 50.89%   512.62kB  8.19%  reflect.unsafe_NewArray
  512.19kB  8.19% 59.08%   512.19kB  8.19%  regexp.onePassCopy
  512.14kB  8.19% 67.27%  1024.33kB 16.37%  agones.dev/agones/vendor/k8s.io/api/core/v1.init
  512.07kB  8.18% 75.45%   512.07kB  8.18%  strings.Join
  512.05kB  8.18% 83.63%   512.05kB  8.18%  agones.dev/agones/vendor/github.com/json-iterator/go.describeStruct
  512.05kB  8.18% 91.82%   512.05kB  8.18%  agones.dev/agones/vendor/golang.org/x/net/http2.(*ClientConn).RoundTrip
  512.02kB  8.18%   100%   512.02kB  8.18%  agones.dev/agones/vendor/github.com/json-iterator/go.(*Iterator).ReadString
  • CPU top after fast burn:
go tool pprof http://10.10.20.13:59060/debug/pprof

(pprof) top
Showing nodes accounting for 280ms, 66.67% of 420ms total
Showing top 10 nodes out of 148
      flat  flat%   sum%        cum   cum%
     140ms 33.33% 33.33%      140ms 33.33%  runtime.futex
      50ms 11.90% 45.24%       50ms 11.90%  syscall.Syscall
      20ms  4.76% 50.00%       20ms  4.76%  runtime.usleep
      10ms  2.38% 52.38%       10ms  2.38%  context.Background
      10ms  2.38% 54.76%       10ms  2.38%  encoding/json.indirect
      10ms  2.38% 57.14%       10ms  2.38%  encoding/json.stateEndValue
      10ms  2.38% 59.52%       10ms  2.38%  encoding/json.stateInString
      10ms  2.38% 61.90%       10ms  2.38%  memeqbody
      10ms  2.38% 64.29%       10ms  2.38%  runtime.(*waitq).dequeueSudoG
      10ms  2.38% 66.67%       20ms  4.76%  runtime.chansend
(pprof) list runtime.futex
Total: 420ms
ROUTINE ======================== runtime.futex in /usr/local/go/src/runtime/sys_linux_amd64.s
     140ms      140ms (flat, cum) 33.33% of Total
 Error: Could not find file src/runtime/sys_linux_amd64.s on path /go


markmandel added a commit to markmandel/agones that referenced this issue Dec 10, 2018
This should have a check on it anyway, but I went hunting
as I was wondering if this might be the cause for googleforgames#414
EricFortin pushed a commit that referenced this issue Dec 11, 2018
This should have a check on it anyway, but I went hunting
as I was wondering if this might be the cause for #414
@aLekSer
Collaborator

aLekSer commented Dec 11, 2018

When running this loop.sh script for about 30 minutes, the following errors were found in the container logs, with a count of over 32:

{"msg":"error updating status on GameServerSet simple-udp-v6qn4: Operation cannot be fulfilled on gameserversets.stable.agones.dev \"simple-udp-v6qn4\": the object has been modified; please apply your changes to the latest version and try again","severity":"error","stack":["agones.dev/agones/pkg/gameserversets.(*Controller).syncGameServerSetState\n\t/go/src/agones.dev/agones/pkg/gameserversets/controller.go:373","agones.dev/agones/pkg/gameserversets.(*Controller).syncGameServerSet\n\t/go/src/agones.dev/agones/pkg/gameserversets/controller.go:253","agones.dev/agones/pkg/gameserversets.(*Controller).syncGameServerSet-fm\n\t/go/src/agones.dev/agones/pkg/gameserversets/controller.go:96","agones.dev/agones/pkg/util/workerqueue.(*WorkerQueue).processNextWorkItem\n\t/go/src/agones.dev/agones/pkg/util/workerqueue/workerqueue.go:107","agones.dev/agones/pkg/util/workerqueue.(*WorkerQueue).runWorker\n\t/go/src/agones.dev/agones/pkg/util/workerqueue/workerqueue.go:83","agones.dev/agones/pkg/util/workerqueue.(*WorkerQueue).runWorker-fm\n\t/go/src/agones.dev/agones/pkg/util/workerqueue/workerqueue.go:135","agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133","agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134","agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait.Until/go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88","agones.dev/agones/pkg/util/workerqueue.(*WorkerQueue).run\n\t/go/src/agones.dev/agones/pkg/util/workerqueue/workerqueue.go:135","runtime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1333"],"time":"2018-12-11T10:30:10Z"}

{"error":"error updating status on GameServerSet simple-udp-v6qn4: Operation cannot be fulfilled on gameserversets.stable.agones.dev \"simple-udp-v6qn4\": the object has been modified; please apply your changes to the latest version and try again","msg":"","obj":"default/simple-udp-v6qn4","queue":"stable.agones.dev.GameServerSetController","severity":"error","source":"*gameserversets.Controller","time":"2018-12-11T10:30:13Z"}

Diff in status before the error in syncGameServerSetState()

Status: {    Replicas: 2 ReadyReplicas: 0 AllocatedReplicas: 0 }
Status: {    Replicas: 2 ReadyReplicas: 1 AllocatedReplicas: 0 }

@markmandel
Collaborator

"Object has been modified" errors are benign and are expected - details in this issue: kubernetes/kubernetes#28149

@aLekSer
Collaborator

aLekSer commented Dec 12, 2018

The bug might be fixed along with this pull request: kubernetes/kubernetes#70277 (Fix goroutine leak of wait.poller).

@aLekSer
Collaborator

aLekSer commented Dec 12, 2018

The CPU leak part of the bug might be fixed along with this pull request: kubernetes/kubernetes#70277 (Fix goroutine leak of wait.poller).

goroutine profile: total 110
12 @ 0x457d7b 0x457e23 0x42ebbb 0x42e99b 0x1653eb6 0x4875f1
#	0x1653eb5	agones.dev/agones/vendor/k8s.io/client-go/tools/record.(*eventBroadcasterImpl).StartEventWatcher.func1+0xa5	/go/src/agones.dev/agones/vendor/k8s.io/client-go/tools/record/event.go:229

9 @ 0x457d7b 0x457e23 0x42ebbb 0x42e99b 0x14e8cff 0xa88701 0x14e913a 0xa89162 0xa8840f 0xa8831b 0x14e2a1b 0x14e9952 0xa890bd 0x4875f1
#	0x14e8cfe	agones.dev/agones/vendor/k8s.io/client-go/tools/cache.(*processorListener).run.func1.1+0x9e	/go/src/agones.dev/agones/vendor/k8s.io/client-go/tools/cache/shared_informer.go:549

Before applying changes of 70277 it was:

goroutine profile: total 740
632 @ 0x457d7b 0x469087 0xa1306f 0x4875f1
#    0xa1306e    agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait.poller.func1.1+0x19e    /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:354
12 @ 0x457d7b 0x457e23 0x42ebbb 0x42e99b 0x15901d6 0x4875f1
#    0x15901d5    agones.dev/agones/vendor/k8s.io/client-go/tools/record.(*eventBroadcasterImpl).StartEventWatcher.func1+0xa5    /go/src/agones.dev/agones/vendor/k8s.io/client-go/tools/record/event.go:229

@markmandel
Collaborator

This makes lots of sense, from the bug details:

This PR fix a bug of wait.poller(). wait.poller() returns a function with type WaitFunc. the function creates a goroutine and the goroutine only quits when after or done closed.

In cache.WaitForCacheSync, after is nil and done is never closed. Then the goroutine never stops.

Given that we call cache.WaitForCacheSync on every allocation, and when we scale down GameServerSets, this explains why this hits us so badly.
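
To make the quoted description concrete, here is a simplified sketch of that pattern (not the actual apimachinery source, just the shape of the bug): the spawned goroutine can only return via the after or done cases, so a caller that passes a zero timeout and never closes done leaks one ticking goroutine per call.

// Simplified sketch of the wait.poller pattern described above; the real
// implementation lives in k8s.io/apimachinery/pkg/util/wait.
package wait

import "time"

func poller(interval, timeout time.Duration) func(done <-chan struct{}) <-chan struct{} {
	return func(done <-chan struct{}) <-chan struct{} {
		ch := make(chan struct{})
		go func() {
			defer close(ch)
			tick := time.NewTicker(interval)
			defer tick.Stop()

			var after <-chan time.Time
			if timeout != 0 {
				timer := time.NewTimer(timeout)
				after = timer.C
				defer timer.Stop()
			}

			for {
				select {
				case <-tick.C:
					// Non-blocking send so a slow reader doesn't wedge the ticker.
					select {
					case ch <- struct{}{}:
					default:
					}
				case <-after: // nil when timeout == 0, so this case never fires
					return
				case <-done: // never fires if the caller never closes done
					return
				}
			}
		}()
		return ch
	}
}

That lines up with the goroutine profile a few comments up - 632 goroutines parked in wait.poller.func1.1 - and since each leaked goroutine pins memory and keeps waking up on its ticker, it would account for both the memory and the CPU growth.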

@aLekSer
Collaborator

aLekSer commented Dec 17, 2018

I think there is no memory leak. After analyzing the inuse_space memory profiles, Mark found that the place using the most memory is the EventCorrelator, which has a max size parameter of
maxLruCacheEntries = 4096 https://github.com/GoogleCloudPlatform/agones/blob/8f86eee485dd5d5c1f7fa18449208e02aa73ea46/vendor/k8s.io/client-go/tools/record/events_cache.go#L434
If we change this const to 256 items, we can see that memory consumption is now under 20 MB for the controller.
To conclude, the only thing needed to fix this bug is to wait for https://github.com/kubernetes/kubernetes/pull/70277/files to be merged, or make a branch with this fix applied.

@markmandel pinned this issue Dec 20, 2018
@markmandel
Collaborator

markmandel commented Dec 20, 2018

Since we are cutting the RC on the 1st/2nd of Jan (depending on how awake I am 😄 ) - should we apply the patch to our system now, and confirm that it solves our issue so that it is prepared for our next release?

I would suggest we then keep this issue open to track the open PR (kubernetes/kubernetes#70277) on K8s, and then merge that in when it's ready (which is on a timeline we can't control).

What do we think?

@cyriltovena
Collaborator

The fix is small, I'm all in for a patch !

aLekSer added a commit to aLekSer/agones that referenced this issue Dec 26, 2018
Fixes googleforgames#414 CPU leak.
The fix is proposed by the following pull request, which has not been merged yet:
https://github.com/kubernetes/kubernetes/pull/70277/files
markmandel pushed a commit that referenced this issue Dec 26, 2018
Fixes #414 CPU leak.
The fix is proposed by the following pull request, which has not been merged yet:
https://github.com/kubernetes/kubernetes/pull/70277/files
@markmandel
Collaborator

I'm going to reopen this issue so we can track the merged PR, until we can guarantee it has been merged into the k8s apimachinery and we don't have to keep an eye out for it every time we do a dep ensure.

@heartrobotninja
Contributor

k8s.io/apimachinery kubernetes-1.14.9 appears to have it.

@aLekSer
Collaborator

aLekSer commented May 5, 2020

I assume this issue could be closed now.

@aLekSer closed this as completed May 5, 2020
@markmandel added this to the 0.9.0 milestone Jan 26, 2021