Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Katib webhooks blocking pod creation in kubeflow namespace #1261

Closed
jlewi opened this issue Jul 9, 2020 · 15 comments
Closed

Katib webhooks blocking pod creation in kubeflow namespace #1261

jlewi opened this issue Jul 9, 2020 · 15 comments

Comments

@jlewi
Copy link
Contributor

jlewi commented Jul 9, 2020

/kind bug

A GCP customer reported a problem where the katib webhooks were blocking pod creation in the kubeflow namespace. Deleting the webhooks unblocked pod creation.

The root cause of why the webhooks were failing is still unknown.

Is it expected that the katib webhooks are

  1. Applying to the kubeflow namespace
  2. Applying to all pods

Do we already have an issue tracking ensuring the webhooks are properly scoped to just katib pods when you using a sufficiently new version of Kubernetes

Attached are the webhook configuration object
katib-mutating-webhook-config.yaml.txt
katib-validating-webhook-config.yaml.txt

@issue-label-bot
Copy link

Issue-Label Bot is automatically applying the labels:

Label Probability
area/katib 0.94

Please mark this comment with 👍 or 👎 to give our bot feedback!
Links: app homepage, dashboard and code for this bot.

@issue-label-bot
Copy link

Issue Label Bot is not confident enough to auto-label this issue.
See dashboard for more details.

@issue-label-bot
Copy link

Issue-Label Bot is automatically applying the labels:

Label Probability
area/front-end 0.68

Please mark this comment with 👍 or 👎 to give our bot feedback!
Links: app homepage, dashboard and code for this bot.

@andreyvelich
Copy link
Member

@jlewi Was it block by validation webhook or by mutating webhook?

We have 3 webooks in Katib.

  1. Validation webhook - https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/webhook.go#L80.
    Which is applying only for Katib Experiment CR. It is used for Experiment validation.

  2. Experiment Mutating webook - https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/webhook.go#L92. Which is apply only for Katib Experiment CR. It is used for Experiment default values.

  3. Pod Mutating webhook - https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/webhook.go#L110. Which is apply for Pods in namespace with katib-metricscollector-injection: enabled. Is it used for metrics-collector.
    The latest webhook checks if Pod Owner Name (e.g TF Job name) is equal to any Trial in the Pod's namespace: https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/pod/inject_webhook.go#L121-L123).

I can see only one case when user can invoke Katib webhook on not Katib's pods:

  1. Run Katib Experiment.
  2. Get one of the Trial name
  3. Create TF Job with that name.

I was able to reproduce this case, but this is very uncommon, because Trial name is generated as Experiment name + 8 random symbols in each Katib Experiment run: https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/suggestion/suggestionclient/suggestionclient.go#L98-L100.

Can it be that case @jlewi ?

@jlewi
Copy link
Contributor Author

jlewi commented Jul 10, 2020

I don't think so. The user never finished deploying Kubeflow because of the issue so they never got to running an experiment.
It was other kubeflow applications that ended up getting blocked e.g. central dashboard.

Is katib-metricscollector-injection: enabled set on the kubeflow namespace?

@gaocegege
Copy link
Member

From the design of the metrics collector, we have to watch all the pods to determine if the pod is owned by a Katib experiment, then we decide if we should inject. I think what we can do now is to set the webhook ignore

@andreyvelich
Copy link
Member

andreyvelich commented Jul 10, 2020

Is katib-metricscollector-injection: enabled set on the kubeflow namespace?

Right now, yes, but is that correct @gaocegege @johnugeorge ?
If we take a look at this issue: kubeflow/kubeflow#4730, users can't submit Experiment on Kubeflow namespace like @jlewi commented.

What is the reason to keep katib-metricscollector-injection: enabled label on Kubeflow namespace in not Katib standalone infrastructure, if we need this label only on user's Profile namespace, which is currently Profile controller is doing.

Webhook is looking for namespace label where Experiment is submitted (https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/webhook.go#L106).

Regarding to this issue, @jlewi can you ask user which webhook was blocking installation ?

@RoyerRamirez
Copy link

Hi @andreyvelich, we believe it was the katib-mutating-webhook-config in combination with the admission-webhook that was preventing deployments from triggering a scale up in the kubeflow namespace.

These webhooks were preventing new pods from being created in the kubeflow namespace. Our node-pools are set to autoscale based upon load, and when a scale-up was triggered on the node-pool side, the master failed to reschedule the kubeflow pods, which brought down our entire kubeflow environment.

When we checked the logs from the master, it kept pointing to the certmanager admission webhook. Only after deleting the katib-mutating-webhook-config, was when the master resumed scheduling pods again. At that point our environment came back up.

I hope this helps clarify the bug that we're facing.

@andreyvelich
Copy link
Member

Hi @RoyerRamirez, thank you for this information.

How did you deploy Kubeflow? Are you using private GKE cluster? If yes, it can be the problem with Firewall, check this issue (kubernetes/kubernetes#79739).

Please try to remove katib-metricscollector-injection=enabled label from Kubeflow namespace.

As I mentioned before, it is not necessary to have this label on Kubeflow namespace, since you can't submit Katib Experiments in that namespace.

@jlewi
Copy link
Contributor Author

jlewi commented Jul 13, 2020

@andreyvelich its possible kubernetes/kubernetes#79739 was blocking the proper functioning of the katib collector. But its still a bug IMO if the katib admission hook is improperly affecting pods unrelated to Katib.

So there's two additional issues here

  1. If katib shouldn't be applied to the kubeflow namespace can we disable the admission hook in that namespace
  2. Starting in Kubernetes 1.X it should be possible to match pod labels to webhooks so that webhooks only apply to a subset of pods in a namespace. Do we have an issue tracking that?

@jlewi
Copy link
Contributor Author

jlewi commented Jul 13, 2020

@Tomcli added a unittest to validate webhooks: https://github.com/Tomcli/manifests/blob/b358d8898859a1aa2b6caab33c87142739a61645/tests/validate_resources_test.go but since Katib is dynamically creating its webhooks the test isn't getting applied.

@andreyvelich
Copy link
Member

So there's two additional issues here

  1. If katib shouldn't be applied to the kubeflow namespace can we disable the admission hook in that namespace

I can see your comment here: kserve/kserve#568 (comment) and this issue: kubeflow/kubeflow#4730, users should not submit Katib Experiments in Kubeflow namespace.
I can't see any reason to have katib-metricscollector-injection: enabled on Kubeflow namespace.
WDYT @johnugeorge @gaocegege ?

  1. Starting in Kubernetes 1.X it should be possible to match pod labels to webhooks so that webhooks only apply to a subset of pods in a namespace. Do we have an issue tracking that?

Found this comment: kubeflow/kubeflow#4231 (comment). I believe if we introduce pod-level label, we can have performance issue on huge clusters?

@jlewi
Copy link
Contributor Author

jlewi commented Jul 14, 2020

It looks like we have katib collector enabled here on kubeflow namespace
https://github.com/kubeflow/manifests/blob/5cc6fd3ca0eac7c46aad116c04713d6944f249fd/namespaces/base/namespaces.yaml#L8

This file was introduced after 1.0 so I don't think that's what the user was using. Should we update that to disable metrics collector in Kubeflow namespace?

@gaocegege
Copy link
Member

I can't see any reason to have katib-metricscollector-injection: enabled on Kubeflow namespace.
WDYT @johnugeorge @gaocegege ?

SGTM

@andreyvelich
Copy link
Member

Fixed by kubeflow/manifests#1387 and kubeflow/manifests#1388.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants