
OOMKilled - gatekeeper:v3.4.0 #1279

Closed
abhi-kapoor opened this issue Apr 30, 2021 · 22 comments · Fixed by #1634

Comments

@abhi-kapoor

gatekeeper-audit seems to be consuming a lot of memory. Initially, we observed that the pod was crashlooping because it was being OOMKilled. We have bumped the limits a couple of times now, but it still ends up using whatever limit we set.

We have used a VerticalPodAutoscaler on the gatekeeper-audit deployment to get insights into its memory consumption and what the target memory should be. We have tried adjusting the resources a few times now, but the memory consumption keeps growing. As of now, it looks something like this:

    resources:
      limits:
        cpu: "1"
        memory: 850Mi
      requests:
        cpu: 100m
        memory: 850Mi

[screenshot: gatekeeper-audit memory usage graph]

I am a bit curious to know how this actually works. We are deploying this on a shared multi-tenancy cluster, so many more API resources will be added as we onboard new tenants. As of now, we just have a single basic K8sRequiredLabels rule as a POC.

It seems like gatekeeper-audit pretty much loads all resources into memory and audits them against the defined rules.

Are there any recommendations on what we should be doing on our end to improve this memory utilization? I have also reviewed the related issues below and followed the recommendations, but with no luck:

  1. OOMKilled - gatekeeper:v3.1.0-beta.0 #339
  2. OOMKilled - v3.1.0-beta.12 #780

Kubernetes version:

kubectl version --short=true
Client Version: v1.15.0
Server Version: v1.17.17-gke.3000
abhi-kapoor added the bug (Something isn't working) label on Apr 30, 2021
@ritazh
Member

ritazh commented Apr 30, 2021

@abhinav454 Thanks for reporting the issue. A few questions to help us understand your environment:

@abhi-kapoor
Author

@ritazh Thank you for taking a look at this.

We might be using the default settings; the configuration we have is below:

--constraint-violations-limit=20
--audit-from-cache=false
--audit-chunk-size=0
--audit-interval=60

Oh, this seems to be a great feature. We don't have it enabled yet; I will go ahead and enable it and get back to you once it is on. I do hope this will make more effective use of resources, as we are only using it for a single kind.

  • can you share your constraint(s)

The constraint seems to be that the pod is crashlooping as it is getting OOM killed. We have bumped the memory a couple of times, but it still seems to run out of memory. Since our cluster is a multi-tenancy K8s cluster, I am afraid that as we add new services, it will keep asking for more and more memory.

Hope this provides the information you are looking for; otherwise, I will be more than happy to provide any other information that can assist with troubleshooting this further. I do have a hunch that setting the --audit-match-kind-only=true flag will help for now.
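
For reference, these audit flags are plain container arguments on the gatekeeper-audit Deployment. A minimal sketch of what the relevant part of the audit container spec might look like with the flag added (the surrounding layout here is illustrative, not our exact manifest):

containers:
  - name: manager
    image: openpolicyagent/gatekeeper:v3.4.0
    args:
      - --operation=audit
      - --operation=status
      - --audit-interval=60
      - --constraint-violations-limit=20
      - --audit-from-cache=false
      - --audit-chunk-size=0
      - --audit-match-kind-only=true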

@sozercan
Member

sozercan commented Apr 30, 2021

@abhinav454 Depending on the number of resources you are trying to audit in your cluster (# of pods, configmaps, etc.), you can also set --audit-chunk-size. This will process the audit in smaller chunks instead of one big chunk, which might reduce memory consumption.

How many constraint templates and constraints per template do you have deployed in your cluster? And how many resources are you trying to audit (e.g. # of pods)?

@abhi-kapoor
Author

@abhinav454 Depending on the number of resources you are trying to audit in your cluster (# of pods, configmaps, etc.), you can also set --audit-chunk-size. This will process the audit in smaller chunks instead of one big chunk, which might reduce memory consumption.

Ohhh, that would also be helpful 🙇‍♂️

How many constraint templates and constraints per template do you have deployed in your cluster?

We only have a single ConstraintTemplate with a single constraint as of now. This was really just a POC and we were planning to add more, but ran into these issues. I will set this flag as well and report back on how the pod is doing.

@abhi-kapoor
Author

@ritazh @sozercan I highly believe that setting the --audit-match-kind-only=true flag will help resolve some of the issues we are facing. However, we use the Helm chart to deploy this, and it seems that passing this argument is not yet supported in the template:

- --audit-from-cache={{ .Values.auditFromCache }}

I was about to open a PR to add support for that, but then realized that the Helm charts are built automatically using helmify, which already has the change:
https://github.com/abhinav454/gatekeeper/blob/aa20de6acc0f26943305483271051e9317c2c6ec/cmd/build/helmify/kustomize-for-helm.yaml#L107

Does that mean the next version will allow us to pass this argument? If so, any timeline would be much appreciated 🙏 🙇

@sozercan
Member

sozercan commented Apr 30, 2021

@abhinav454 That's right; it looks like it was added in #1245 and is in the staging chart. It'll be available in the Helm repo when we cut the next minor release.
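
For anyone who wants to prepare ahead of that release, a sketch of what the chart override could look like once the flag is exposed (the value names below are taken from the staging chart at the time of writing, so please double-check them against the released values.yaml):

# values override for the gatekeeper chart
auditInterval: 60
constraintViolationsLimit: 20
auditFromCache: false
auditChunkSize: 500
auditMatchKindOnly: true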

@abhi-kapoor
Author

@abhinav454 That's right; it looks like it was added in #1245 and is in the staging chart. It'll be available in the Helm repo when we cut the next minor release.

Thank you for your quick response. Will add this flag as soon as it is available. That should address the issues we are facing 👍

@ritazh
Member

ritazh commented Apr 30, 2021

highly believe that setting --audit-match-kind-only=true flag will help resolve some of the issues we are facing.

Should we reconsider setting this value to true? @maxsmythe I know you had objections to this, but given that this is a recurring issue and this flag clearly helps, let's revisit it.

@maxsmythe
Contributor

maxsmythe commented May 1, 2021

Have the arguments changed?

If this flag works for them now, it may stop working if they add a constraint that doesn't match against kind, or if they add another constraint against a different kind. They did mention they wanted to add more constraints. Users should opt in to such a limitation.

Also, we are not sure yet whether --audit-match-kind-only will address the issue. Hopefully it will, but it fails in the case where clusters have large numbers of a given kind (e.g. lots of Pods). If that is the case, then a solution like chunking will be needed. Chunking should also be resilient to any set of inbound constraints, regardless of their contents. Looking at the code, memory usage (with a single constraint) should be proportional to the number of resources associated with the most populous kind.

@maxsmythe
Contributor

maxsmythe commented May 1, 2021

@abhinav454 Can I ask:

  • What kinds of constraints are you interested in implementing?
  • Against which resources would these constraints apply?
  • Would your policy be centrally managed by a single individual/team, or would there be shared responsibility?
  • Is the team that runs Gatekeeper the same team that is writing the constraints/templates?
  • How large is your cluster?
  • What are your most populous kinds? How many member objects do they have?

@maxsmythe
Contributor

Also, if you could copy/paste your test constraint, I think that'd be interesting to look at.

@maxsmythe
Contributor

@abhinav454 Also, taking a look at your graph, I notice memory use seems to spike every few days even though your audit cycle is every 60 seconds.

Is the graph accurate? How often are you OOMing? If the OOMing is sporadic (i.e. the pod can run without crashing for, say, > 10 minutes), then the OOMing wouldn't be able to be explained by normal audit behavior.

@maxsmythe
Contributor

maxsmythe commented May 1, 2021

Mostly, those major spikes in the graph are what interest me; growth over time could be explained by an increasing number of resources on the cluster.

@abhi-kapoor
Author

@maxsmythe Thank you for looking into this. I hope the below answers some of the questions you might have:

We bumped the gatekeeper-audit limits even further using the VPA recommendations, and the pod hasn't restarted or crashlooped for the last 4 days. The first time we used the VPA recommendations and bumped the limits accordingly, the pod was still getting OOM killed; since the second time we did this exercise, it has been more stable.

We will be implementing both auditChunkSize and --audit-match-kind-only=true.

Currently, we are not using Gatekeeper much and have just implemented it as a POC, so we have a single ConstraintTemplate and only one rule, which is of kind K8sRequiredLabels and applies to Namespaces. For your reference, you can see the rule below:

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: ns-must-have-name-label
spec:
  # operate in audit mode. ie do not enforce admission
  enforcementAction: dryrun
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Namespace"]
  parameters:
    labels: ["name"]

We haven't decided on any other rules yet; we just wanted to run a single rule like the one above and try it out. The cluster is a large, shared multi-tenancy cluster running multiple tenants. As we onboard more and more tenants, more API resources will be added to the cluster.
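
For completeness, the ConstraintTemplate behind this constraint is essentially the stock K8sRequiredLabels example; a sketch along the lines of the gatekeeper-library version (ours may differ slightly):

apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        violation[{"msg": msg, "details": {"missing_labels": missing}}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("you must provide labels: %v", [missing])
        }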

@maxsmythe
Contributor

Thanks for the extra info!

Can you give us:

  • Stable memory usage
  • What is the most populous kind on your cluster? How many resources belong to that kind?
  • Frequency of crashes before you found a stable memory setting (was it about once every day like the graph suggests or was it more frequent?)

Those data will make it easier to distinguish between the following scenarios:

  • Audit memory usage is working as expected, and peak usage is governed by the most populous kind
  • We are somehow still caching all resources and therefore not scaling as well as we should be (e.g. maybe there is a hidden client cache we're invoking)
  • There is something that sporadically interferes with garbage collection (or some other issue that causes transient memory usage spikes)

Which would be super helpful, as it would let us know if there is a performance bug we should be targeting to make ourselves more memory efficient.

@cnwaldron

I ran into this same issue and implemented the additional arg --audit-match-kind-only=true. That didn't stop the OOMKilled pods. I added --audit-chunk-size=500 and that did the trick.
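
In case it helps anyone else, a quick way to confirm the flags actually landed on the running audit Deployment (assuming the default gatekeeper-system namespace and deployment name):

kubectl -n gatekeeper-system get deployment gatekeeper-audit \
  -o jsonpath='{.spec.template.spec.containers[0].args}'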

@sozercan
Member

sozercan commented May 3, 2021

@cnwaldron glad chunk size worked for you! just curious, did you have any constraints that didn't have match kind defined?

@cnwaldron

@sozercan All constraints had match kind defined

@abhi-kapoor
Author

abhi-kapoor commented May 4, 2021

@maxsmythe After setting the --audit-chunk-size=500 flag, it has been working fine. To assist with troubleshooting any cache/buffer issues:

  • The pod has been stable since we updated the memory to 850Mi.
  • Our rules are only tied to Namespaces, but the most populous kind on our cluster would be Pods; including all the system pods, there are a total of ~750 pods running in the cluster.
  • The crashes were more frequent prior to updating the memory limits, roughly every 10-20 minutes.

@naveen210121

naveen210121 commented May 5, 2021

Hi,

We are also facing the same issue: every 5-10 minutes the gatekeeper pod gets OOMKilled and goes into CrashLoopBackOff.

Configurations:
GKE version: 1.18 (GCP GKE Kubernetes cluster)
Gatekeeper version: gatekeeper:v3.1.0-beta.2

    resources:
      limits:
        cpu: 1500m
        memory: 1500Mi
      requests:
        cpu: 800m
        memory: 800Mi

There are a total of ~800 pods running in the cluster.

I have verified the node capacity; it has a good amount of free memory, so pods can scale up to their limits. To be clear, the pod is not crashing because of limit overcommit.

@sozercan, @maxsmythe, @ritazh, @abhinav454: Please suggest suitable requests/limits values for the gatekeeper workload.

@maxsmythe
Contributor

@abhinav454 Thank you for the data!

@naveen-mindtree Unfortunately it's hard to come up with scaling recommendations for audit memory usage, as it depends on:

  • Most populous kind in the cluster
  • Number of constraints
  • Number of violations a given resource has
  • Number of constraint templates
  • Whatever memory usage is required to run the constraint templates
  • Maybe more?

A quick way to figure out memory usage experimentally is to keep doubling the memory limit until the pod becomes stable; then you can scale back down to only the memory the pod actually requires (plus some overhead for growth).
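
If it helps, one way to iterate on that experiment without re-rendering manifests is something like the following (assuming the default namespace and deployment name):

kubectl -n gatekeeper-system set resources deployment gatekeeper-audit \
  --requests=memory=1Gi --limits=memory=2Gi

Then watch whether the audit pod stays up across several audit cycles and scale the numbers back down once it is stable.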

@naveen210121

Thank you so much @maxsmythe

So it will be a trial-and-error approach. Thanks again.
