Katib webhooks blocking pod creation in kubeflow namespace #1261
Comments
@jlewi Was it blocked by the validating webhook or by the mutating webhook? We have 3 webhooks in Katib.
I can see only one case where a user can invoke the Katib webhook on pods that are not Katib's:
I was able to reproduce this case, but it is very uncommon, because the Trial name is generated as the Experiment name plus 8 random symbols on each Katib Experiment run: https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/suggestion/suggestionclient/suggestionclient.go#L98-L100. Can it be that case, @jlewi?
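To make that naming convention concrete, here is a minimal sketch (the resource names and the random suffix below are invented for illustration, not taken from any real cluster):

```yaml
# Sketch only: illustrates "Trial name = Experiment name + 8 random symbols".
# All names are hypothetical.
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-example
  namespace: mynamespace
---
apiVersion: kubeflow.org/v1beta1
kind: Trial
metadata:
  name: random-example-a1b2c3d4   # "random-example" + 8 random characters
  namespace: mynamespace
```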
I don't think so. The user never finished deploying Kubeflow because of the issue, so they never got to running an experiment.
From the design of the metrics collector, we have to watch all the pods to determine whether a pod is owned by a Katib experiment, and then we decide whether we should inject. I think what we can do now is to set the webhook
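For reference, here is a minimal sketch of that broad registration, i.e. a mutating webhook that is called for every pod CREATE and leaves the "is this a Katib pod?" decision to the webhook code. Field values (webhook name, service, path) are placeholders; the attached katib-mutating-webhook-config.yaml.txt is the authoritative configuration.

```yaml
# Sketch, not the actual Katib manifest: the webhook matches all pod creations,
# so scoping has to happen inside the webhook implementation.
apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
metadata:
  name: katib-mutating-webhook-config
webhooks:
  - name: mutation.pod.katib.kubeflow.org   # placeholder name
    clientConfig:
      service:
        name: katib-controller              # placeholder service/path
        namespace: kubeflow
        path: /inject-metrics-collector
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
```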
Right now, yes, but is that correct @gaocegege @johnugeorge? What is the reason to keep it that way? The webhook looks for a label on the namespace where the Experiment is submitted (https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/webhook.go#L106). Regarding this issue, @jlewi can you ask the user which webhook was blocking the installation?
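In other words, injection is opt-in per namespace via a label on the Namespace object. A minimal sketch of that, assuming a label key along these lines (the exact key is whatever webhook.go checks; treat the one below as a placeholder):

```yaml
# Sketch: a user namespace opted in to Katib metrics-collector injection.
# The label key is an assumption for illustration, not confirmed from webhook.go.
apiVersion: v1
kind: Namespace
metadata:
  name: mynamespace
  labels:
    katib-metricscollector-injection: enabled
```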
Hi @andreyvelich, we believe it was the katib-mutating-webhook-config in combination with the admission-webhook that was preventing deployments from triggering a scale-up in the kubeflow namespace. These webhooks were preventing new pods from being created in the kubeflow namespace. Our node pools are set to autoscale based on load, and when a scale-up was triggered on the node-pool side, the master failed to reschedule the kubeflow pods, which brought down our entire Kubeflow environment. When we checked the logs from the master, they kept pointing to the cert-manager admission webhook. Only after deleting the katib-mutating-webhook-config did the master resume scheduling pods, at which point our environment came back up. I hope this helps clarify the bug that we're facing.
Hi @RoyerRamirez, thank you for this information. How did you deploy Kubeflow? Are you using a private GKE cluster? If yes, it could be a problem with the firewall; check this issue (kubernetes/kubernetes#79739). Please try to remove the metrics-collector injection label from the kubeflow namespace. As I mentioned before, it is not necessary to have this label on the kubeflow namespace, since you can't submit Katib Experiments in that namespace.
@andreyvelich it's possible kubernetes/kubernetes#79739 was blocking the proper functioning of the Katib collector. But it's still a bug IMO if the Katib admission hook is improperly affecting pods unrelated to Katib. So there are two additional issues here.
@Tomcli added a unit test to validate webhooks: https://github.com/Tomcli/manifests/blob/b358d8898859a1aa2b6caab33c87142739a61645/tests/validate_resources_test.go, but since Katib creates its webhooks dynamically, the test doesn't get applied to them.
I can see your comment here: kserve/kserve#568 (comment) and this issue: kubeflow/kubeflow#4730; users should not submit Katib Experiments in the kubeflow namespace.
Found this comment: kubeflow/kubeflow#4231 (comment). I believe if we introduce a pod-level label, we could have performance issues on huge clusters?
It looks like we have the Katib metrics collector enabled on the kubeflow namespace here. That file was introduced after 1.0, so I don't think that's what the user was using. Should we update it to disable the metrics collector in the kubeflow namespace?
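If we go that route, a rough sketch of the manifests change could be a kustomize patch that drops the injection label from the kubeflow namespace. The label key and the patch layout below are assumptions for illustration; the actual fix may look different:

```yaml
# kustomization.yaml fragment, sketch only: remove the (assumed) injection label
# from the kubeflow namespace so the Katib webhook ignores pods created there.
patchesJson6902:
  - target:
      version: v1
      kind: Namespace
      name: kubeflow
    patch: |-
      - op: remove
        path: /metadata/labels/katib-metricscollector-injection
```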
SGTM
Fixed by kubeflow/manifests#1387 and kubeflow/manifests#1388.
/kind bug
A GCP customer reported a problem where the katib webhooks were blocking pod creation in the kubeflow namespace. Deleting the webhooks unblocked pod creation.
The root cause of the webhook failures is still unknown.
Is it expected that the katib webhooks are
Do we already have an issue tracking ensuring the webhooks are properly scoped to just Katib pods when using a sufficiently new version of Kubernetes? (A sketch of what that scoping could look like follows the attachments below.)
Attached are the webhook configuration objects:
katib-mutating-webhook-config.yaml.txt
katib-validating-webhook-config.yaml.txt
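Regarding the scoping question above: on a sufficiently new Kubernetes (namespaceSelector has been available for some time; objectSelector requires 1.15+), the webhook registration itself can be restricted so it never touches unrelated pods. A sketch under those assumptions; the label keys, webhook name, service, and path below are placeholders, not necessarily what Katib uses:

```yaml
# Sketch only: declaratively scope the Katib mutating webhook instead of
# filtering inside the webhook code. All selector keys are illustrative.
apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
metadata:
  name: katib-mutating-webhook-config
webhooks:
  - name: mutation.pod.katib.kubeflow.org     # placeholder name
    namespaceSelector:                        # only namespaces opted in to injection
      matchLabels:
        katib-metricscollector-injection: enabled
    objectSelector:                           # only pods carrying a Katib-specific label (1.15+)
      matchExpressions:
        - key: katib.kubeflow.org/experiment  # placeholder label key
          operator: Exists
    clientConfig:
      service:
        name: katib-controller                # placeholder service/path
        namespace: kubeflow
        path: /inject-metrics-collector
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
```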