OOMKilled - gatekeeper:v3.4.0 #1279
Comments
@abhinav454 Thanks for reporting the issue. A few questions to help us understand your environment:
@ritazh Thank you for taking a look at this.
We might be using the default settings and the configuration used is as below:
Oh, this seems to be a great feature. We don't have this enabled yet; we will go ahead and enable it and get back to you once it is enabled. I do hope that this will be an effective use of resources, as we are only using it for a single
The problem is that the pod is crashlooping because it is getting OOM killed. We have bumped the memory a couple of times, but it still seems to be running out of memory. Since our cluster is a multi-tenancy Kubernetes cluster, I am afraid that as we add new services it will keep asking for more and more memory. Hope this provides the information you were looking for; otherwise I will be more than happy to provide any other information that can assist with troubleshooting this further. I do have a hunch that setting the below flag will help
@abhinav454 Depending on the number of resources you are trying to audit in your cluster (number of pods, configmaps, etc.), you can set
How many constraint templates and constraints per template do you have deployed in your cluster? And what is the number of resources you are trying to audit (like the number of pods)?
Ohhh, that would also be helpful 🙇‍♂️
We only have a single template
@ritazh @sozercan I strongly believe that setting
I was about to open a PR to add support for that, but later realized that the helm charts are built automatically using helmify, which already has the change: Does that mean that the next version will allow us to pass this argument? If so, a timeline for this would be much appreciated 🙏 🙇
@abhinav454 That's right, it looks like it got added in #1245 and it's in the staging chart. It'll be available in the Helm repo when we cut the next minor version release.
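For anyone following along: once that chart change is released, the flag should be settable through Helm values rather than by hand-editing the Deployment. A minimal sketch, assuming the released chart exposes a value named auditMatchKindOnly (the exact key is generated by helmify, so verify it against the values.yaml of the chart version you install):

```yaml
# values.yaml override for the gatekeeper chart.
# Key name assumed from the staging chart; check your chart version.
auditMatchKindOnly: true
```

This could then be applied with the usual `helm upgrade ... --set auditMatchKindOnly=true` style override against whatever release name and repo you already use.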
Thank you for your quick response. We will add this flag as soon as it is available. That should address the issues we are facing 👍
Should we reconsider setting this value to
Have the arguments changed? If this flag works for them now, it may stop working if they add a constraint that doesn't match against kind, or if they add another constraint against a different kind. They did mention they wanted to add more constraints. Users should opt in to such a limitation. Also, we are not sure yet whether
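To make that caveat concrete, here is a hedged sketch of two K8sRequiredLabels constraints: the first scopes itself to specific kinds via spec.match.kinds, the second omits match.kinds entirely. Names and labels are made up for illustration; the point is only that --audit-match-kind-only depends on constraints declaring kinds the way the first one does.

```yaml
# Constraint that declares which kinds it applies to -- this is what
# "matching against kind" refers to. Names/labels here are hypothetical.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: ns-must-have-owner
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Namespace"]
  parameters:
    labels: ["owner"]
---
# Constraint with no match.kinds -- it applies broadly, so an audit
# restricted to matched kinds only can no longer rely on a kind list.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: everything-must-have-team
spec:
  parameters:
    labels: ["team"]
```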
@abhinav454 Can I ask:
Also, if you could copy/paste your test constraint, I think that'd be interesting to look at.
@abhinav454 Also, taking a look at your graph, I notice memory use seems to spike every few days even though your audit cycle is every 60 seconds. Is the graph accurate? How often are you OOMing? If the OOMing is sporadic (i.e. the pod can run without crashing for, say, > 10 minutes), then the OOMing couldn't be explained by normal audit behavior.
Mostly, those major spikes in the graph are what interest me; growth over time could be explained by an increasing number of resources on the cluster.
@maxsmythe Thank you for looking into this. I hope the below answers some of the questions you might have: We bumped the gatekeeper-audit limits even further using the VPA recommendations, and the pod hasn't restarted or crashlooped for the last 4 days. The first time we used VPA to get recommendations and bumped the limits accordingly, the pod was still getting OOM killed; since the second time we did this exercise, it has been more stable. We will be implementing both
Currently, we are not using gatekeeper that much and have just implemented it as a POC, so we have a single
We haven't decided on any other rules yet. We just wanted to run a single rule such as the one above and try it out. The cluster is a shared multi-tenancy cluster; it is a large cluster that runs multiple tenants. As we onboard more and more tenants, more API resources will be added to the cluster.
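For context on the VPA-based approach mentioned above, a VerticalPodAutoscaler can be run in recommendation-only mode so it never evicts or resizes the pod itself. A minimal sketch, assuming the VPA CRDs are installed and the audit Deployment is named gatekeeper-audit in the gatekeeper-system namespace:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: gatekeeper-audit
  namespace: gatekeeper-system
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gatekeeper-audit
  updatePolicy:
    updateMode: "Off"   # only produce recommendations; do not evict or resize pods
```

The recommendations then appear in the object's status (for example via `kubectl describe vpa gatekeeper-audit -n gatekeeper-system`) and can be copied into the Deployment's resource limits by hand.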
Thanks for the extra info! Can you give us:
Those data will make it easier to distinguish between the following scenarios:
Which would be super helpful, as it would let us know if there is a performance bug we should be targeting to make ourselves more memory efficient.
I ran into this same issue and added the additional arg --audit-match-kind-only=true. That didn't stop the OOMKilled pods. I then added --audit-chunk-size=500 and that did the trick.
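For anyone wanting to replicate this, both flags are container args on the audit Deployment. A minimal sketch of the relevant excerpt, assuming the stock layout (Deployment gatekeeper-audit, container manager, namespace gatekeeper-system); the exact arg list in your install will differ:

```yaml
# Excerpt of the gatekeeper-audit Deployment spec -- only the args are shown.
spec:
  template:
    spec:
      containers:
        - name: manager
          args:
            - --operation=audit
            - --operation=status
            - --logtostderr
            - --audit-match-kind-only=true   # only audit kinds referenced by constraints
            - --audit-chunk-size=500         # list resources in chunks to cap memory use
```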
@cnwaldron glad chunk size worked for you! just curious, did you have any constraints that didn't have match kind defined?
@sozercan All constraints had match kind defined
@maxsmythe After setting the
Hi, we are also facing the same issue: every 5-10 minutes the gatekeeper pod is getting OOMKilled and going into CrashLoopBackOff. Configuration: there are a total of ~800 pods running in the cluster. I have verified node capacity; it has a good amount of memory where pods can scale within their limits. To be clear, the pod is not crashing because of limit overcommit. @sozercan, @maxsmythe, @ritazh, @abhinav454: please suggest the most suitable requests/limits values for the gatekeeper workload.
@abhinav454 Thank you for the data! @naveen-mindtree Unfortunately it's hard to come up with scaling recommendations for audit memory usage, as it depends on:
A quick way to figure out memory usage experimentally is to keep doubling the memory limit until the pod becomes stable; then you can scale back to only the memory required by the pod (with some overhead for growth).
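In practice that experiment is just editing the audit Deployment's resources and watching for OOMKills. A hedged sketch of the kind of block to adjust (the numbers are placeholders, not recommendations):

```yaml
# Resources on the gatekeeper-audit container: double the memory limit while
# the pod keeps getting OOMKilled, then trim back once it has been stable.
resources:
  requests:
    cpu: 100m
    memory: 512Mi
  limits:
    memory: 1Gi   # try 2Gi, 4Gi, ... until stable, then scale back with headroom
```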
Thank you so much @maxsmythe. So it will be a trial-and-error method. Thanks again.
gatekeeper-audit seems to be consuming a lot of memory. Initially, we observed that the pod was crashlooping, as it was being OOMKilled. We have bumped the limits a couple of times now, but it still ends up using whatever limits we set.
We have used a VerticalPodAutoscaler on the gatekeeper-audit deployment to get insight into the memory consumption and what the target memory should be. We have tried adjusting the resources a few times now, but the memory consumption keeps growing. As of now, it looks something like:
I am a bit curious to know how this actually works. We are deploying this on a shared multi-tenancy cluster, so many more API resources will be added as we add new tenants. As of now, we just have a single basic rule, K8sRequiredLabels, as a POC. It seems like the way gatekeeper-audit works is that it pretty much loads all resources into memory and performs an audit using the rules defined.
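For reference, a K8sRequiredLabels rule like the one described is typically installed as a ConstraintTemplate plus a Constraint. A sketch along the lines of the upstream gatekeeper-library example (the actual template in the cluster may differ):

```yaml
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        violation[{"msg": msg, "details": {"missing_labels": missing}}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("you must provide labels: %v", [missing])
        }
```

During an audit pass, gatekeeper lists the resources in scope and evaluates the rule against them, which is why memory grows with the number of resources in the cluster and why flags like --audit-chunk-size and --audit-match-kind-only help bound it.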
Are there any recommendations on what we should be doing on our end to improve this memory utilization? I have also reviewed the related issues below and followed the recommendations, but no luck:
Kubernetes version: