Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gatekeeper should fail safe during audit operation #163

Closed
RamyasreeChakka opened this issue Jul 1, 2019 · 9 comments · Fixed by #1634
Closed

Gatekeeper should fail safe during audit operation #163

RamyasreeChakka opened this issue Jul 1, 2019 · 9 comments · Fixed by #1634
Labels
bug Something isn't working

Comments

@RamyasreeChakka
Copy link

On a Kubernetes cluster, I had a bad pod https://raw.githubusercontent.com/RamyasreeChakka/RegoPolicy/master/GateKeeperV3/container-resource-limits/ContainerResourceLimits-bad2.yaml which has container CPU resource limit incorrectly specified.

I installed below templates and was trying to test audit feature - didn't observe any audit violations reported for a long time. Later @ritazh helped me debug and from the Gatekeeper logs, looks like audit function was bailing out with an exception due to a bad pod.

Gatekeeper should fail safe - should ignore bad policy, bad Kubernetes objects.

https://github.com/open-policy-agent/gatekeeper/blob/master/demo/agilebank/templates/k8scontainterlimits_template.yaml
https://github.com/open-policy-agent/gatekeeper/blob/master/demo/agilebank/templates/k8srequiredlabels_template.yaml
https://github.com/open-policy-agent/gatekeeper/blob/master/demo/agilebank/templates/k8sallowedrepos_template.yaml

@RamyasreeChakka
Copy link
Author

Fail safe should applicable in green field (create) operation as well. I haven't tested create scenario.

@maxsmythe
Copy link
Contributor

What is the significance of the links at the bottom of this bug?

Is it possible to post the logs? Without more context, I'm not sure exactly what a "bad pod" is or how it's causing a failure.

@maxsmythe
Copy link
Contributor

Also, please define "fail safe". What is an example desired behavior?

@RamyasreeChakka
Copy link
Author

@maxsmythe , Regd. fail safe,
The current behavior I noticed when a bad pod exists in the system is - audit violations are not reported at all for any policy. I had 3 constraint templates and 3 constraints on the system and none of them showed audit violations. Out of the 3 templates, one policy was evaluating containers, one policy was evaluating namespaces. Because of bad pod issue, namespaces policy also didn't show the audit violations. What I am trying to say here is:

  1. For namespaces policy show the audit violations
  2. For containers policy, its perfectly fine not to show any violations.

Follow below steps to repro this issue:

  1. kubectl create -f https://raw.githubusercontent.com/RamyasreeChakka/RegoPolicy/master/GateKeeperV3/container-resource-limits/ContainerResourceLimits-bad2.yaml
    This is bad pod. If you look at the yaml, CPU resource limit is defined as 100Mi instead of 100m.
  2. Install Gatekeeper
  3. Install below templates
    https://github.com/open-policy-agent/gatekeeper/blob/master/demo/agilebank/templates/k8scontainterlimits_template.yaml
    https://github.com/open-policy-agent/gatekeeper/blob/master/demo/agilebank/templates/k8srequiredlabels_template.yaml
  4. Install below constraints
    https://github.com/open-policy-agent/gatekeeper/blob/master/demo/agilebank/constraints/containers_must_be_limited.yaml
    https://github.com/open-policy-agent/gatekeeper/blob/master/demo/agilebank/constraints/owner_must_be_provided.yaml
  5. Observe violations are not reported on both constraints.

@maxsmythe
Copy link
Contributor

maxsmythe commented Jul 2, 2019

Thanks for the clarifications.

Is this the error in the logs?

admission webhook "validation.gatekeeper.sh" denied the request: admission.k8s.gatekeeper.sh: templates["admission.k8s.gatekeeper.sh"]["K8sContainerLimits"]:7: eval_internal_error: to_number: strconv.ParseFloat: parsing "100Mi": invalid syntax

@RamyasreeChakka
Copy link
Author

@maxsmythe Yes, I saw similar log for audit operation.

@maxsmythe
Copy link
Contributor

maxsmythe commented Jul 2, 2019

The failure is coming from the CPU spec. 100Mi is not a valid value for CPU, which means the constraint fails to canonify the CPU before casting it as a number.

@tsandall this is a good example of an appropriate use of fail-fast/fail-loudly from my view. Unfortunately for the audit use case, we'd probably want some way to recover from the error and report that it happened, so that other audit responses could be returned.

Had this failed silently, the CPU constraint would simply never apply in these cases, and we may never have caught that this edge case was occurring.

@RamyasreeChakka I'm not sure Rego has a device to handle exceptions such as these.

@maxsmythe
Copy link
Contributor

@tsandall another thing I noticed while digging into the to_number source code is that it casts to float. Are we worried about a loss of accuracy ever? The values of CPU/memory are meant to be integers to avoid rounding errors.

@RamyasreeChakka in the interim, #167 should fix this specific error, but I think a more general construct is necessary.

@ritazh
Copy link
Member

ritazh commented Jul 2, 2019

for the audit use case, we'd probably want some way to recover from the error and report that it happened, so that other audit responses could be returned

I will look into this soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants