Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Webhook requests fail on gatekeeper pod startup/shutdown #3776

Open
l0wl3vel opened this issue Jan 13, 2025 · 4 comments · May be fixed by #3780
Open

Webhook requests fail on gatekeeper pod startup/shutdown #3776

l0wl3vel opened this issue Jan 13, 2025 · 4 comments · May be fixed by #3780
Labels
bug Something isn't working

Comments

@l0wl3vel
Copy link

l0wl3vel commented Jan 13, 2025

Co-Authored by @nilsfed

What steps did you take and what happened:

On our clusters we use gatekeeper with an AssignImage mutation to rewrite Pod
images to use our private mirror. On about 2% of pods this rewrite does not
happen when rolling over our node pools.

We run Gatekeeper in a failing-open configuration with three
gatekeeper-controller-manager replicas.

We investigated this by installing gatekeeper in a controlled environment
(minikube) and used curl to query the webhook endpoint in a loop as fast as
possible and recording failures. Our test setup is outlined below. On scaling
events we observed failing requests. We root caused this to the following two
problems in gatekeeper:

  • On gatekeeper pod startup the readiness probe indicates a pod is ready but
    unable to serve webhook traffic
  • On gatekeeper pod termination the service still points to the terminating pod
    for a brief moment due to Services updating asynchronously. Gatekeeper does
    not have a grace period on server shutdown, leading to refused connections.

What did you expect to happen:

Gatekeeper pods can receive requests when they are registered endpoints at the
service.


Mitigations:

We found that the health- and readinessprobes are misconfigured. They indicate a
ready-state as soon as the manager is started, even though the webhook is not
responding to requests yet.

While we can reconfigure the health probe to validate that the webhook server is
able to serve requests by passing --enable-tls-healthcheck=true to the
gatekeeper-controller-manager, this is not possible yet for the readiness probe.

If webhooks are enabled the readiness probe should check actual service health.
So we propose to implement the behavior of --enable-tls-healthcheck=true for
the readiness probe as well and enable it by default for the readiness probe
only.

We also found that adding a preStopHook to the gatekeeper-controller-manager
further prevents failing requests due to the webhook server terminating before
the endpoint gets removed from the K8s service.

Both mitigations yield zero failed requests over a test time frame of 30 minutes
with the test setup outlined below. Without the mitigations we saw requests
failing after less than a minute.


Anything else you would like to add:

Our test setup:

kubectl apply -f https://raw.githubusercontent.com/open-policy-agent/gatekeeper/v3.17.1/deploy/gatekeeper.yaml

kubectl apply -f deployment.yaml

kubectl exec -it deployment/test -- bash

# In test container

request_count=0
fail_count=0
while :
do
    if ! curl --retry 0 --connect-timeout 1 --insecure https://gatekeeper-webhook-service.gatekeeper-system:443/v1/mutate; then
        echo fail
        fail_count=$((fail_count+1))
    fi
    request_count=$((request_count+1)) 
    echo failed: $fail_count
    echo request_count: $request_count
done

# In seperate terminal on host:

while :
do
kubectl rollout restart -n gatekeeper-system deployment gatekeeper-controller-manager
sleep 20
done

deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
    spec:
      containers:
      - name: test
        image: fedora:latest
        command:
          - sleep
          - infinity
        resources:
          limits:
            memory: "128Mi"

We have attached the kustomization we use to work around this problem, until the
final fix is upstream, here. It is a hack, which enables tls health checks and
uses the /healthz endpoint for readiness probes and adds the termination
preStopHook to buy the service time to disconnect the service and finish
in-flight requests.

Kustomization

kustomization.yaml

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
 
namespace: gatekeeper-system
 
resources:
  - https://raw.githubusercontent.com/open-policy-agent/gatekeeper/v3.17.1/deploy/gatekeeper.yaml

patches:
- path: ./hotfix.yaml 
  target:
    kind: Deployment

hotfix.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gatekeeper-controller-manager
  namespace: gatekeeper-system
spec:
  template:
    spec:
      containers:
        - name: manager
          args:
            - --port=8443
            - --logtostderr
            - --exempt-namespace=gatekeeper-system
            - --operation=webhook
            - --operation=mutation-webhook
            - --disable-opa-builtin={http.send}
            - --enable-tls-healthcheck=true
          readinessProbe:
            httpGet:
              path: /healthz
          lifecycle:
            preStop:
              sleep: 10


Environment:

  • Gatekeeper version: 1.17.1
  • Kubernetes version: (use kubectl version): 1.31.0
@l0wl3vel l0wl3vel added the bug Something isn't working label Jan 13, 2025
@JaydipGabani
Copy link
Contributor

@l0wl3vel Thank you for sharing this information. I am looking into this.

@JaydipGabani
Copy link
Contributor

@l0wl3vel we check if the webhook is able to serve requests or not as a part of enable-tls-handshake - https://github.com/open-policy-agent/gatekeeper/blob/master/pkg/webhook/health_check.go#L29, so the answer just might be adding the flag as is to controller-manager pod. And we can modify existing charts to add preStop hook.

@l0wl3vel
Copy link
Author

The first thing is that the health check is not of interest to us. Rather, the readiness probe causes our problems, since that controls if traffic is routed to a pod. And with the current implementation the readiness probe does not indicate if the webhook is able to serve traffic. I already created a pull request for this issue.

The second issue is that preStop with the sleep type is only supported on 1.30+ without modifying feature gates. Which means gatekeeper would have to bump the last supported K8s version to 1.30. 1.29 goes EOL in the end of February 2025, so just 1.5 months to go according to the version skew policy of gatekeeper.
Alternatively we can wait when receiving a sigterm before cancelling all contexts by inserting a waiting period into the signal handler.

This is one of the few times a "sleep for x" is the proper solution, I think. Unless someone has an idea how to wait until an endpoint has been removed from a service and the changes are propagated through the routing tables I do not see any alternatives.

@l0wl3vel
Copy link
Author

@JaydipGabani I have created a PR containing the fix for this issue here: #3780

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants