CPU/MEMORY leak in agones controller container #414
Comments
Thanks for sharing this, and taking the time to graph it out. Do you have any specific steps we could use to reproduce this? Looks like it happens over a few days?
Nothing special - start/stop fleets. The main concern is heavy CPU usage after removing all fleets/gs/autoscalers.
STR:
Results:
Ah perfect, thanks! Can you have a look at (and possibly share?) the controller logs at the end there, if you still have them? I'm wondering if something gets stuck in a queue, and just tries to sync forever. It may show off something obvious.
The pod is still alive. The last log lines are from the fleet shutdown time:
Hmnn. It does resync every 30 seconds, but if there aren't any pods left, that is odd. Thanks for the info!
Small update: with an empty fleet (replicas: 0) there is no CPU leakage.
Would you feel comfortable rebuilding with https://golang.org/pkg/runtime/pprof/ and doing a test with that, and seeing where the CPU cycles are going? Totally fine if you aren't - maybe we can do a special build.
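For reference, a minimal sketch of what such a rebuild could look like, assuming we simply write a CPU profile to a file from inside the controller process (the file name, duration, and placement here are illustrative, not what Agones actually ships):

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
	"time"
)

func main() {
	// Write a CPU profile for a fixed window while the controller is busy.
	f, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	// Profile for five minutes, then stop; in a real build this would wrap
	// the controller's run loop rather than a sleep.
	time.Sleep(5 * time.Minute)
	pprof.StopCPUProfile()
}
```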
We use pprof, but I'm not so familiar with gotpl...
I made this PR to have a compile flag to enable pprof on some builds (just waiting on review):
Tools for building a controller with pprof enabled, and Make targets for accessing the HTTP pprof endpoint while it is running on a Kubernetes cluster, with accompanying documentation. This will help in diagnosing #414
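A rough illustration of the compile-flag idea (a sketch only; the actual PR's build tags, file layout, and port may differ): an extra file in the controller's main package, guarded by a build tag, that exposes the standard net/http/pprof endpoints on a side port.

```go
//go:build profile

// Hypothetical extra file in the controller's main package; it is only
// compiled when the "profile" build tag is set, so normal builds carry no
// pprof endpoint.
package main

import (
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof handlers on http.DefaultServeMux
)

func init() {
	go func() {
		// Serve pprof on a side port, leaving the controller's normal
		// HTTP endpoints untouched.
		_ = http.ListenAndServe("localhost:6060", nil)
	}()
}
```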
After a build/run/burn-in cycle with the pprof controller: Notes:
only the "fast user space mutex" stand out here, might be because there is only one worker listening to the channel and a lot of events to process. |
What may work best (not sure if you still have the CPU pprof files - it would be good to keep them, so we can share them around and do various diagramming on them): get one at the beginning when the system is idle (i.e. no changes are being made to fleets or gameservers), for a baseline. Then if we flame graph each stage, we should (hopefully) be able to compare each of them, and see visually where the growth is in the flame graphs. Does that make sense? I personally find it really hard to parse the CPU graph trees, I don't know about you all - also with no baseline comparison, it's hard to see where the growth is - I think a flame graph will be easier to visualise. And thanks for putting all the time in on this!
I finished a new measurement. The root cause looks like a simple memory leak: I see a strong CPU-memory correlation, and the CPU/memory usage is proportional to the gs/gss/flt count since the controller restart. Even deleted things consume resources. I did not test FleetAllocation and FleetAutoscaler.
If we think this is a memory leak - can we grab some pprof memory profiles, to see if we can see where it's happening? I still think flame graphs would be much easier for parsing where the CPU time is going - I find the CPU graphs hard to follow personally.
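For reference, dumping the heap and allocs profiles from inside the process can look roughly like this when the HTTP pprof endpoint isn't available (a sketch; the file names and call site are illustrative):

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

func dumpMemProfiles() {
	runtime.GC() // get up-to-date in-use statistics before the heap profile

	heap, err := os.Create("heap.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer heap.Close()
	if err := pprof.Lookup("heap").WriteTo(heap, 0); err != nil {
		log.Fatal(err)
	}

	allocs, err := os.Create("allocs.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer allocs.Close()
	if err := pprof.Lookup("allocs").WriteTo(allocs, 0); err != nil {
		log.Fatal(err)
	}
}

func main() { dumpMemProfiles() }
```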
New data from pprof allocs+heap
@Kuqd and I were digging into this some more - the memory leak looks to be in the k8s event recorder. Which leads us to think k8s events aren't being processed properly, which produces the memory leak, and because the event queue goes into exponential backoff, we also get the CPU leak as well. So a few more fun questions:
We will find a solution for this! 👍
Oh I had one more question - are there any errors in the Kubernetes logs themselves, basically anything in regard to rejecting or not being able to process event streams? (if you kubectl describe a
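For background on the event recorder mentioned above, typical client-go wiring looks roughly like the sketch below (an assumption about the general pattern, not the exact Agones code): events flow through a broadcaster to an API-server sink, and failed writes are retried with backoff, which is where CPU and memory could accumulate.

```go
package eventsketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	typedcorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
	"k8s.io/client-go/tools/record"
)

// newEventRecorder wires a recorder whose events are written to the API
// server via the broadcaster's sink; if the sink keeps failing, the
// broadcaster retries those writes, and the retries are where the suspected
// leak would show up.
func newEventRecorder(client kubernetes.Interface) record.EventRecorder {
	broadcaster := record.NewBroadcaster()
	broadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{
		Interface: client.CoreV1().Events(""),
	})
	return broadcaster.NewRecorder(scheme.Scheme, corev1.EventSource{Component: "agones-controller"})
}
```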
This should have a check on it anyway, but I went hunting as I was wondering if this might be the cause for #414
When running this loop.sh script for about 30 minutes, the following errors were found in the container logs, with a count of over 32:
Diff in status before the error in syncGameServerSetState()
"Object has been modified" errors are benign and are expected - details in this issue: kubernetes/kubernetes#28149 |
The bug might be fixed along with this pull request: kubernetes/kubernetes#70277
The CPU leak part of the bug might be fixed by this pull request: kubernetes/kubernetes#70277 "Fix goroutine leak of wait.poller"
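To illustrate the kind of leak that PR addresses (this is the general shape only, not the literal apimachinery code): a poller goroutine that is never told to stop keeps ticking for the remaining lifetime of the process, so every call leaks one goroutine and a little CPU.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// leakyPoll starts a ticker goroutine with no stop signal, so it never exits.
func leakyPoll(interval time.Duration) <-chan struct{} {
	ch := make(chan struct{})
	go func() {
		t := time.NewTicker(interval)
		defer t.Stop() // never reached: nothing ever ends the loop below
		for range t.C {
			select {
			case ch <- struct{}{}:
			default: // drop the tick if nobody is listening
			}
		}
	}()
	return ch
}

func main() {
	for i := 0; i < 1000; i++ {
		leakyPoll(10 * time.Millisecond) // each call leaks one goroutine
	}
	time.Sleep(100 * time.Millisecond)
	fmt.Println("goroutines:", runtime.NumGoroutine()) // stays high and never shrinks
}
```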
Before applying the changes from 70277, it was:
This makes lots of sense, from the bug details:
Given that we
I think there is no memory leak. After analyzing the inuse_space memory logs, Mark found the place which used the most memory; this is
Since we are cutting the RC on the 1st/2nd of Jan (depending on how awake I am 😄) - should we apply the patch to our system now, and confirm that it solves our issue so that it is prepared for our next release? I would suggest we then keep this issue open to track the open PR (kubernetes/kubernetes#70277) on K8s, and then merge that in when it's ready (which is on a timeline we can't control). What do we think?
The fix is small, I'm all in for a patch!
Fixes #414 CPU leak. The fix is proposed by the following pull request, which has not been merged yet: https://github.com/kubernetes/kubernetes/pull/70277/files
I'm going to reopen this issue, so we can track the merged PR, until we can guarantee it has been merged into the k8s apimachinery, and we don't have to keep an eye out for it every time we do a
k8s.io/apimachinery kubernetes-1.14.9 appears to have it.
I assume this issue could be closed now.
This is only a suspicion, but after restarting the agones-controller pod, CPU usage constantly increases. This behavior is independent of the current count of fleets/gs/autoscalers.
It looks like any fleet/gs, even after deletion, eats CPU (is a garbage collector needed in the data structs???).
Reproduced with agones v0.4.0 and v0.5.0