Webhooks server memory usage depends on number of pods in cluster. #888

Closed
amisevsk opened this issue Jul 5, 2022 · 0 comments · Fixed by #889
amisevsk commented Jul 5, 2022

Description

The webhooks server is caching all pods on the cluster in memory, regardless of whether they are DevWorkspace pods or not. This can result in the webhook server reaching its default memory limit (300Mi) and being killed by the cluster, causing all DWO webhooks to stop working. Further, since it's not possible to filter pods/exec requests in webhooks (see kubernetes/kubernetes#91732), all kubectl exec commands in the cluster will then be blocked.
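
To illustrate the mechanism, here is a minimal Go sketch (the function, annotation key, and client wiring are hypothetical, not the actual DWO code) of a pod-metadata read going through the manager's cache-backed client:

```go
package webhooks

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// isRestrictedAccessPod checks pod metadata via the manager's client. With the
// default cache-backed client, the first Get for a Pod starts an informer that
// lists and watches *all* Pods in the cluster, so memory grows with the total
// pod count rather than with the number of DevWorkspace pods.
func isRestrictedAccessPod(ctx context.Context, c client.Client, namespace, name string) (bool, error) {
	pod := &corev1.Pod{}
	if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: name}, pod); err != nil {
		return false, fmt.Errorf("failed to get pod %s/%s: %w", namespace, name, err)
	}
	// Only the metadata is needed; the annotation key below is illustrative.
	return pod.Annotations["controller.devfile.io/restricted-access"] == "true", nil
}
```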

This issue hasn't been seen until now because webhook server memory usage only becomes a problem when there is a large number of pods (>6000) on a cluster, which normally does not occur due to CPU/memory constraints in the cluster. However, some cluster tasks (e.g. s2i builds on OpenShift) can leave many completed or errored pods behind on the cluster.

How To Reproduce

Note: This is hard to reproduce without also killing a small test cluster

  1. Create test pods on a cluster -- note these pods have ~100KB of annotations (to hopefully use more space in memory per pod) and complete immediately after starting:
    for i in {0001..0500}; do
      curl https://gist.githubusercontent.com/amisevsk/2ca0a75f2bfcf785e597df37d8a22221/raw/007c6b8ff7780ec3aacfd7881f4ed9e22f809470/big-pod.yaml \
        | yq -y --arg name "test-pod-$i" '.metadata.name = $name' \
        | oc apply -f -
      sleep 1s
    done
  2. (In another terminal) try to kubectl exec into a pod to trigger the webhook server caching pods internally
  3. Observe webhooks server memory usage (e.g. on OpenShift: oc adm top pod <webhooks-server-pod-name>)

In my testing, once I get to around 500 such pods, the webhooks server requires ~230MiB of memory (up from ~30MiB at idle).

Expected behavior

DWO webhooks server memory usage should not depend on the number of non-DWO-related objects on the cluster.

Additional context

The webhooks server needs to read pods from the cluster in order to validate pods/exec requests for restricted-access workspaces. Since this is a read-only operation (we just need to check pod metadata), this is done via the controller-runtime manager, which implements efficient read operations by asynchronously watching all objects of interest in the cluster.
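
One way controller-runtime allows bounding this is to restrict what the manager's cache lists and watches, e.g. with a per-object label selector. Below is a minimal sketch, assuming a controller-runtime v0.12-era API and an illustrative label key; it is not necessarily how the actual fix (#889) is implemented:

```go
package setup

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

// newManager builds a manager whose cache only lists/watches Pods carrying a
// DevWorkspace label, so webhook memory no longer scales with total pod count.
func newManager() (ctrl.Manager, error) {
	// Existence selector: any pod with the (illustrative) DevWorkspace ID label.
	podSelector, err := labels.Parse("controller.devfile.io/devworkspace_id")
	if err != nil {
		return nil, err
	}
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		NewCache: cache.BuilderWithOptions(cache.Options{
			SelectorsByObject: cache.SelectorsByObject{
				&corev1.Pod{}: {Label: podSelector},
			},
		}),
	})
}
```

Reads for unlabeled pods then fall through to whatever non-cached path the code chooses, while DevWorkspace pods stay cheap to look up.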

The controller itself has had a similar problem in the past. See: #652

@amisevsk amisevsk self-assigned this Jul 5, 2022
@amisevsk amisevsk added this to the v0.15.x milestone Jul 25, 2022