-
Notifications
You must be signed in to change notification settings - Fork 442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change failurePolicy to Fail for Katib Webhooks #2018
Change failurePolicy to Fail for Katib Webhooks #2018
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andreyvelich The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich Thanks!
/lgtm
Maybe, we need to fix E2E... Name: katib-controller-5474c9d7dc-hlj74
Namespace: kubeflow
Priority: 0
Service Account: katib-controller
Node: fv-az202-219/10.1.0.54
Start Time: Thu, 17 Nov 2022 18:44:07 +0000
Labels: katib.kubeflow.org/component=controller
pod-template-hash=5474c9d7dc
Annotations: prometheus.io/port: 8080
prometheus.io/scrape: true
sidecar.istio.io/inject: false
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/katib-controller-5474c9d7dc
Containers:
katib-controller:
Container ID:
Image: docker.io/kubeflowkatib/katib-controller:e2e-test
Image ID:
Ports: 8443/TCP, 8080/TCP, 18080/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP
Command:
./katib-controller
Args:
--webhook-port=8443
--trial-resources=Job.v1.batch
--trial-resources=TFJob.v1.kubeflow.org
--trial-resources=PyTorchJob.v1.kubeflow.org
--trial-resources=MPIJob.v1.kubeflow.org
--trial-resources=XGBoostJob.v1.kubeflow.org
--trial-resources=MXJob.v1.kubeflow.org
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Liveness: http-get http://:healthz/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:healthz/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
KATIB_CORE_NAMESPACE: kubeflow (v1:metadata.namespace)
Mounts:
/tmp/cert from cert (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vqd8c (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
cert:
Type: Secret (a volume populated by a Secret)
SecretName: katib-webhook-cert
Optional: false
kube-api-access-vqd8c:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m1s default-scheduler Successfully assigned kubeflow/katib-controller-5474c9d7dc-hlj74 to fv-az202-219
Warning FailedMount 57s (x8 over 2m) kubelet MountVolume.SetUp failed for volume "cert" : secret "katib-webhook-cert" not found https://github.com/kubeflow/katib/actions/runs/3490817753/jobs/5842728114#step:4:10217 |
/lgtm cancel |
556027c
to
f4e7bad
Compare
@tenzen-y @johnugeorge I know why e2e are failing for this change. When Katib Control Plane is deployed, Webhooks don't have appropriate certs and the Cert Generator Job's Pod creation failed with the following error:
Any ideas how we can avoid it ? namespaceSelector:
matchLabels:
katib.kubeflow.org/metrics-collector-injection: enabled
objectSelector:
matchExpressions:
- key: katib.kubeflow.org/metrics-collector-injection
operator: NotIn
values:
- disabled Similar to Istio Sidecar injector. |
@andreyvelich In the long term, I would like to perform processes of katib-cert-generator in katib-controller as a sub-process, then we can remove the object selector from the MutatingWebhookConfiguration. This mean, we integrate the katib-cert-generator into the katib-controller, then we will remove the katib-cert-generator. |
@tenzen-y For example, using init container ? |
How is this selector specific to cert-generator? |
No, I think that we can create and inject cert for webhook to the |
@johnugeorge We can update our Cert Generator Job YAML with the following label, so mutation webhook won't be invoked: template:
metadata:
annotations:
sidecar.istio.io/inject: "false"
labels:
katib.kubeflow.org/metrics-collector-injection: disabled |
Make sense, so we can do it before we Register Controller Components. |
The confusing part is the |
@johnugeorge Would it be useful to have this feature for our users who don't want to get Katib Mutation Webhook executed on their Pods ? Currently, if namespace has |
↑ ping @johnugeorge |
Yes. We can add a specific label to skip mutation web hook. |
Let's discuss this again on December 14th (next community meeting) @johnugeorge @tenzen-y . If we are going to integrate |
Let's merge this PR after next release, so we can discuss should we create label to disable Webhook. |
f4e7bad
to
f6f63e3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @andreyvelich !
/lgtm
Tests are complete 🎉 |
Fixes: #1946.
I removed
failurePolicy: Ignore
from our Admission Webhooks, so the default policy will be used (which isFail
)./assign @kubeflow/wg-automl-leads @tenzen-y @ca-scribner