Gatekeeper validatingwebhook stops workloads scheduling #102

david-curran-90 · 2022-01-20T14:11:52Z

david-curran-90
Jan 20, 2022

I have deployed Gatekeeper in my cluster, but now I can't deploy anything unless I delete the ValidatingWebhookConfiguration. Not even the ReplicaSets are created.

# no pods being created for deployments
kubectl -n kasten-io get deploy
NAME                 READY   UP-TO-DATE   AVAILABLE   AGE
aggregatedapis-svc   0/1     0            0           25m
auth-svc             0/1     0            0           26m

# delete webhook config
kubectl delete validatingwebhookconfigurations gatekeeper-validating-webhook-configuration
validatingwebhookconfiguration.admissionregistration.k8s.io "gatekeeper-validating-webhook-configuration" deleted

# now pods
kubectl -n kasten-io get po
NAME                                  READY   STATUS             RESTARTS      AGE
aggregatedapis-svc-7d578bc79d-657ll   1/1     Running            0             5m18s
auth-svc-7987b7d5c8-5rmnk             0/1     CrashLoopBackOff   4 (12s ago)   6m4s

I have set validatingWebhookCheckIgnoreFailurePolicy: "Ignore" in my Helm values and verified that it was applied.
I have two policies, both set to dry-run: one checks for labels and the other checks the image repo.
Any ideas what would stop any workloads from scheduling?

Might this be the cause?

error: unable to retrieve the complete list of server APIs: actions.kio.kasten.io/v1alpha1: the server is currently unable to handle the request, apps.kio.kasten.io/v1alpha1: the server is currently unable to handle the request, metrics.k8s.io/v1beta1: the server is currently unable to handle the request, tap.linkerd.io/v1alpha1: the server is currently unable to handle the request, vault.kio.kasten.io/v1alpha1: the server is currently unable to handle the request

I am unable to query some of the API resources.

Thanks

maxsmythe · 2022-01-21T00:20:29Z

maxsmythe
Jan 21, 2022
Maintainer

Do you have any logs from the K8s API server? If the cause was rejection/failure from webhook, it should show up there.

What is the timeout set to for the ValidatingWebhookConfiguration? If you set it to 1 second, does the webhook config's existence still interfere?

Can you copy/paste the contents of your constraints (and constraint templates, if they are not library templates)?
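
For anyone following along, one way to answer the timeout question is to read the settings straight off the webhook configuration and then, as an experiment, patch the timeout down to 1 second. This is a minimal sketch, not an official Gatekeeper procedure. `show_webhook_settings` and `timeout_patch` are hypothetical helper names, the webhook index `0` is an assumption about the order of entries in the configuration, and a Helm-managed release may overwrite a manual patch on the next upgrade.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; the resource name used in the
# examples is the one deleted earlier in this thread.

# Print name, failurePolicy and timeoutSeconds for every webhook entry.
# $1: resource kind (e.g. validatingwebhookconfiguration)
# $2: resource name
show_webhook_settings() {
  kubectl get "$1" "$2" \
    -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.failurePolicy}{"\t"}{.timeoutSeconds}{"\n"}{end}'
}

# Build a JSON patch that sets timeoutSeconds to $2 on the webhook at index $1.
# The index is an assumption: check the output of show_webhook_settings first.
timeout_patch() {
  printf '[{"op":"replace","path":"/webhooks/%d/timeoutSeconds","value":%d}]\n' "$1" "$2"
}

# Example usage (needs cluster access, so it is left commented out):
#   show_webhook_settings validatingwebhookconfiguration gatekeeper-validating-webhook-configuration
#   kubectl patch validatingwebhookconfiguration gatekeeper-validating-webhook-configuration \
#     --type=json -p "$(timeout_patch 0 1)"
```

Keeping the patch builder separate from the `kubectl patch` call makes it easy to review the exact JSON before anything is applied.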

1 reply

david-curran-90 Jan 21, 2022
Author

W0121 09:39:00.708408       1 dispatcher.go:176] Failed calling webhook, failing open mutation.gatekeeper.sh: failed calling webhook "mutation.gatekeeper.sh": failed to call webhook: Post "https://gatekeeper-webhook-service.security.svc:443/v1/mutate?timeout=3s": dial tcp 10.99.155.63:443: i/o timeout
E0121 09:39:00.708441       1 dispatcher.go:180] failed calling webhook "mutation.gatekeeper.sh": failed to call webhook: Post "https://gatekeeper-webhook-service.security.svc:443/v1/mutate?timeout=3s": dial tcp 10.99.155.63:443: i/o timeout
W0121 09:39:00.937378       1 dispatcher.go:176] Failed calling webhook, failing open mutation.gatekeeper.sh: failed calling webhook "mutation.gatekeeper.sh": failed to call webhook: Post "https://gatekeeper-webhook-service.security.svc:443/v1/mutate?timeout=3s": context deadline exceeded
E0121 09:39:00.937397       1 dispatcher.go:180] failed calling webhook "mutation.gatekeeper.sh": failed to call webhook: Post "https://gatekeeper-webhook-service.security.svc:443/v1/mutate?timeout=3s": context deadline exceeded

The logs suggest the webhook is failing open, but resources are still not being created, which suggests the constraints aren't even being evaluated.
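
A quick way to see which webhook is failing open, and how often, is to tally the warning lines from the API server log. The sketch below assumes the log format shown in the excerpt above (one entry per line, with the text `Failed calling webhook, failing open <name>:`). `count_fail_open` is a hypothetical helper name, and how the log is obtained depends on how the API server runs in a given cluster.

```shell
#!/bin/sh
# Sketch only: tallies "failing open" warnings per webhook name.
# Reads kube-apiserver log text on stdin and prints "<webhook> <count>".
count_fail_open() {
  # Extract the webhook name that follows "failing open" on warning lines,
  # then count how many times each name appears.
  sed -n 's/.*Failed calling webhook, failing open \([^:]*\):.*/\1/p' \
    | sort | uniq -c | awk '{print $2, $1}'
}

# Example usage (the log file path is a placeholder):
#   count_fail_open < apiserver.log
```

On the excerpt in this comment, every fail-open warning names `mutation.gatekeeper.sh`, so the timeouts shown are on the mutating webhook rather than the validating one.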

Templates

apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8sapprovedimageregistries
spec:
  crd:
    spec:
      names:
        kind: K8sApprovedImageRegistries
      validation:
        # Schema for the `parameters` field
        openAPIV3Schema:
          type: object
          properties:
            registries:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sapprovedimageregistries

        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          satisfied := [good | registry = input.parameters.registries[_] ; good = contains(container.image, registry)]
          not any(satisfied)
          msg := sprintf("container <%v> has an invalid image registry <%v>, allowed registries are %v", [container.name, container.image, input.parameters.registries])
        }

        violation[{"msg": msg}] {
          container := input.review.object.spec.initContainers[_]
          satisfied := [good | registry = input.parameters.registries[_] ; good = contains(container.image, registry)]
          not any(satisfied)
          msg := sprintf("container <%v> has an invalid image registry <%v>, allowed registries are %v", [container.name, container.image, input.parameters.registries])
        }

apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        # Schema for the `parameters` field
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        violation[{"msg": msg, "details": {"missing_labels": missing}}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("you must provide labels: %v", [missing])
        }

Constraints

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sApprovedImageRegistries
metadata:
  name: approved-image-registries
spec:
  enforcementAction: dryrun
  match:
    excludedNamespaces: [kube-*]
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    registries:
    - long list of registries

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: required-labels-all
spec:
  enforcementAction: dryrun
  match:
    excludedNamespaces: [kube-*, "default", "calico-system"]
    kinds:
      - apiGroups: [""]
        kinds: ["Namespace", "Pod"]
  parameters:
    labels: ["mylabels"]

david-curran-90 · 2022-01-21T11:28:41Z

david-curran-90
Jan 21, 2022
Author

The issue was with my CNI, Calico wasn't starting correctly on the master thus interrupting network traffic to the gatekeeper-webhook service. Fixed the issue with Calico (IP autodetection picking the "wrong" interface on a different subnet for those that are interested).

It also fixed an issue I'd been ignoring with Linkerd service-mesh.

It still seems that the failure wasn't being ignored though?
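
The exact Calico fix is not shown in this thread. As a hedged illustration only, an operator-managed install (the `calico-system` namespace in the constraint above hints at one, but that is an assumption) can be told which subnet to pick the node address from through the `Installation` resource. `autodetect_cidr_patch` is a hypothetical helper and `10.0.0.0/24` is a placeholder CIDR, not a value from this cluster.

```shell
#!/bin/sh
# Sketch only: builds a merge patch for the Calico operator's Installation
# resource that switches IPv4 autodetection from "first found" to a CIDR match.
# $1: CIDR of the subnet the node addresses should come from (placeholder value below)
autodetect_cidr_patch() {
  # "firstFound": null clears the default method, since only one method
  # can be set at a time.
  printf '{"spec":{"calicoNetwork":{"nodeAddressAutodetectionV4":{"firstFound":null,"cidrs":["%s"]}}}}\n' "$1"
}

# Example usage (needs cluster access, so it is left commented out):
#   kubectl patch installation default --type=merge -p "$(autodetect_cidr_patch 10.0.0.0/24)"
```

For manifest-based installs, the equivalent setting is the `IP_AUTODETECTION_METHOD` environment variable on the `calico-node` DaemonSet. Which of the two applies depends on how Calico was installed.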

1 reply

maxsmythe Jan 22, 2022
Maintainer

Yeah, it looks like you were hitting timeouts on the mutation webhook.

3 seconds may be too long a timeout, since mutation is potentially called multiple times in series. Let's assume it's called twice and there is a validating webhook failure:

3s + 3s + 3s = 9s. K8s has a default request timeout of 10s, so if you have any other webhooks, that's only 1s of buffer for request latency and their timeouts.

What happens if you reduce the mutation timeout to 1s?

@ritazh @sozercan FYI as this is a concrete example of latency sensitivity for mutation.
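
The budget arithmetic in this reply can be written out as a small sketch. The numbers are the ones quoted above: a 3s timeout per webhook call, mutation called twice (the reply's own assumption), one validating call, and a 10s overall deadline. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: worst-case time a single request can spend waiting on webhooks.
# $1: mutation timeout in seconds
# $2: number of times mutation is called in series
# $3: validation timeout in seconds
worst_case_wait() {
  echo $(( $1 * $2 + $3 ))
}

# Seconds left for everything else.
# $1: overall request deadline in seconds
# $2: worst-case webhook wait in seconds
remaining_budget() {
  echo $(( $1 - $2 ))
}

# Example usage:
#   remaining_budget 10 "$(worst_case_wait 3 2 3)"   # the scenario in this reply
#   remaining_budget 10 "$(worst_case_wait 1 2 3)"   # with the mutation timeout cut to 1s
```

With the mutation timeout at 3s the worst case leaves 1s of headroom. Dropping it to 1s leaves 5s under the same assumptions.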

maxsmythe · 2022-01-22T01:56:00Z

maxsmythe
Jan 22, 2022
Maintainer

To be clear, I can't say for sure that the problem was cumulative webhook timeouts from the snippet of API server logs provided, but it's the most likely explanation.

0 replies
