certmanager install has race condition - try to create KF certmanager resources before cert manager is available #1121

Closed
jbinoos opened this issue Apr 8, 2020 · 14 comments

jbinoos commented Apr 8, 2020

/kind bug

Hello,
I ran into a small problem with cert-manager during the Kubeflow install.

What steps did you take and what happened:
Running kfctl apply -f ${CONFIG} ends with:
failed to apply: (kubeflow.error): Code 500 with message: kfApp Apply failed for kustomize: (kubeflow.error): Code 500 with message: Apply.Run Error error when creating "/tmp/kout285717195": Internal error occurred: failed calling webhook "webhook.cert-manager.io": the server is currently unable to handle the request

The install exits after a loop of error messages:
validatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook configured
WARN[0106] Encountered error applying application cert-manager: (kubeflow.error): Code 500 with message: Apply.Run Error error when creating "/tmp/kout520652292": Internal error occurred: failed calling webhook "webhook.cert-manager.io": the server is currently unable to handle the request filename="kustomize/kustomize.go:202"
WARN[0106] Will retry in 29 seconds. filename="kustomize/kustomize.go:203"
namespace/cert-manager unchanged

What did you expect to happen:

Anything else you would like to add:
I followed the install page:
https://www.kubeflow.org/docs/started/k8s/kfctl-k8s-istio/

Environment:

  • Kubeflow version: (version number can be found at the bottom left corner of the Kubeflow dashboard):
    kfctl_k8s_istio.v1.0.1.yaml

  • kfctl version: (use kfctl version):
    kfctl v1.0.1-0-gf3edb9b

  • Kubernetes platform: (e.g. minikube)
    one fresh node install with kubeadm

  • Kubernetes version: (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:40:16Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.11", GitCommit:"d94a81c724ea8e1ccc9002d89b7fe81d58f89ede", GitTreeState:"clean", BuildDate:"2020-03-12T21:00:06Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}

  • OS (e.g. from /etc/os-release):
    Ubuntu 16.04.6 LTS

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/bug 0.96

Please mark this comment with 👍 or 👎 to give our bot feedback!
Links: app homepage, dashboard and code for this bot.

jbinoos (Author) commented Apr 8, 2020

The problem disappears if I run kfctl apply -f ${CONFIG} a second time right afterwards.
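
A minimal sketch of automating that second run, assuming bash. The `retry` helper and its arguments are hypothetical names for illustration and are not part of kfctl. The logs above show that kfctl already retries internally ("Will retry in 29 seconds"), so this only wraps the whole command in an outer loop.

```shell
#!/usr/bin/env bash
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# Returns 0 on the first success, otherwise the exit status of the last attempt.
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1 status=0
  while true; do
    "$@" && return 0
    status=$?   # exit status of the failed command
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempt(s): $*" >&2
      return "$status"
    fi
    echo "attempt ${attempt} failed (exit ${status}); retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}

# Example (not executed here): rerun the install up to 3 times, 30s apart.
#   retry 3 30 kfctl apply -f "${CONFIG}"
```

This only papers over the ordering problem; it does not remove it, and it will not help when the webhook is unreachable for other reasons.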

@jtfogarty

/platform minikube
/area install
/priority p2

@k8s-ci-robot (Contributor)

@jtfogarty: The label(s) area/install cannot be applied, because the repository doesn't have them

In response to this:

/platform minikube
/area install
/priority p2

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

jlewi (Contributor) commented Apr 20, 2020

I suspect there is a race condition.
We are creating a ClusterIssuer for self-signed certificates here:
https://github.com/kubeflow/manifests/blob/master/cert-manager/cert-manager/overlays/self-signed/cluster-issuer.yaml

This will fail if the cert-manager custom resources have not yet been deployed successfully.

One way to handle this would be to refactor the cert-manager application to split deploying the cert-manager custom resources and webhooks from creating instances of the cert-manager resources (e.g. a self-signed issuer).

This way we could wait for cert-manager to be deployed before attempting to create cert-manager resources.
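
As a manual approximation of that ordering (not the refactor itself), the sketch below blocks until the cert-manager deployments have rolled out before anything that depends on them is applied. The function name `wait_for_cert_manager` is hypothetical, and the deployment names are an assumption inferred from the pod names reported later in this thread. A finished rollout is a necessary condition for the webhook to answer, but it may not be sufficient on every cluster.

```shell
#!/usr/bin/env bash
# wait_for_cert_manager [TIMEOUT]
# Hypothetical helper: waits for each cert-manager deployment to finish rolling out.
# Deployment names are assumed from the pod names seen in this thread.
wait_for_cert_manager() {
  local timeout="${1:-180s}"
  local deployment
  for deployment in cert-manager cert-manager-cainjector cert-manager-webhook; do
    kubectl -n cert-manager rollout status "deployment/${deployment}" \
      --timeout="${timeout}" || return 1
  done
}

# Example (not executed here): create the self-signed issuer only once
# cert-manager is up.
#   wait_for_cert_manager 300s && kubectl apply -f cluster-issuer.yaml
```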

jlewi changed the title from "pb with certmanager during kubeflow install" to "certmanager install has race condition - try to create KF certmanager resources before cert manager is available" on Apr 20, 2020
jlewi transferred this issue from kubeflow/kubeflow on Apr 20, 2020
@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/bug 0.99


snowlover173 commented May 6, 2020

I experienced the same problem on our private GKE cluster with versions 1.0.0, 1.0.1, and 1.0.2.

I run kfctl build, make changes to add the network and mark the cluster as private, and then run kfctl apply {CONFIG_FILE}.
The cluster comes up, and then after a few attempts the call to the cert-manager webhook fails:
validatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook configured
ERRO[0656] Permanently failed applying application cert-manager; error: (kubeflow.error): Code 500 with message: Apply.Run Error error when creating "/tmp/kout196638265": Internal error occurred: failed calling webhook "webhook.cert-manager.io": the server is currently unable to handle the request filename="kustomize/kustomize.go:206"
Error: failed to apply: (kubeflow.error): Code 500 with message: kfApp Apply failed for kustomize: (kubeflow.error): Code 500 with message: Apply.Run Error error when creating "/tmp/kout196638265": Internal error occurred: failed calling webhook "webhook.cert-manager.io": the server is currently unable to handle the request

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
area/kfctl 0.97


@snowlover173

Retrying did not resolve the problem for me. Is there anything I need to change before running "kfctl apply"?
As I mentioned, I already made some changes to launch it in a private subnet.

jlewi (Contributor) commented May 9, 2020

@snowlover173 can you please check if the cert-manager webhook pods are running and if not see if you can find any information about why not?

kubectl -n cert-manager get pods
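
If the pods look healthy, a few more read-only commands can help narrow down why the webhook call fails. This is a sketch, not an official checklist: the function name is hypothetical, the deployment name is assumed from the pod names in this thread, and the webhook configuration name is taken from the log output above.

```shell
#!/usr/bin/env bash
# diagnose_cert_manager_webhook
# Hypothetical helper: prints read-only diagnostics for the cert-manager webhook.
# Each command runs regardless of whether the previous one succeeded.
diagnose_cert_manager_webhook() {
  echo "== pods =="
  kubectl -n cert-manager get pods -o wide
  echo "== webhook deployment =="
  kubectl -n cert-manager describe deployment cert-manager-webhook
  echo "== webhook logs (last 50 lines) =="
  kubectl -n cert-manager logs deployment/cert-manager-webhook --tail=50
  echo "== recent events =="
  kubectl -n cert-manager get events --sort-by=.lastTimestamp
  echo "== webhook registration =="
  kubectl get validatingwebhookconfiguration cert-manager-webhook -o yaml
}

# Example (not executed here):
#   diagnose_cert_manager_webhook > cert-manager-debug.txt 2>&1
```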

snowlover173 commented May 10, 2020

@jlewi it seems they are running:
NAME                                      READY   STATUS    RESTARTS   AGE
cert-manager-564b4bffd7-hwgbh             1/1     Running   1
cert-manager-cainjector-596986f94-h8t4m   1/1     Running   6
cert-manager-webhook-755d75845c-jrkbk     1/1     Running   0

I have since deleted the cluster.

@joepeskett

@jlewi @snowlover173 I think this is related to a comment I left on #4932.

On private GKE we had to set up a firewall rule, as mentioned in the cert-manager docs here: https://docs.cert-manager.io/en/release-0.8/getting-started/webhook.html#running-on-private-gke-clusters

Link for firewall rule: https://www.revsys.com/tidbits/jetstackcert-manager-gke-private-clusters/
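
For orientation, a rule of that kind could be assembled roughly as below. This is a sketch under assumptions: the helper name is hypothetical, every argument (rule name, network, master CIDR, node tag, port) is a placeholder you must look up for your own cluster, and the correct port depends on the cert-manager version, so check the linked docs. The helper only prints the `gcloud` command so it can be reviewed before running.

```shell
#!/usr/bin/env bash
# print_webhook_firewall_cmd RULE_NAME NETWORK MASTER_CIDR NODE_TAG PORT
# Hypothetical helper: prints (does not run) a gcloud command that would allow
# the GKE master range to reach the webhook port on the tagged nodes.
print_webhook_firewall_cmd() {
  if [ "$#" -ne 5 ]; then
    echo "usage: print_webhook_firewall_cmd RULE_NAME NETWORK MASTER_CIDR NODE_TAG PORT" >&2
    return 2
  fi
  local rule_name="$1" network="$2" master_cidr="$3" node_tag="$4" port="$5"
  printf 'gcloud compute firewall-rules create %s --network=%s --direction=INGRESS --allow=tcp:%s --source-ranges=%s --target-tags=%s\n' \
    "$rule_name" "$network" "$port" "$master_cidr" "$node_tag"
}

# Example (not executed here; all values are placeholders):
#   print_webhook_firewall_cmd allow-cert-manager-webhook <network> <master-cidr> <node-tag> <webhook-port>
```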

Apologies if I'm misunderstanding and adding to the confusion!

snowlover173 commented May 10, 2020

@joepeskett Thanks for your comment.
I have seen that document, but did you install on an existing GKE cluster or with the kfctl YAML?
kfctl already creates firewall rules if you use the kfctl YAML.
We might have something blocking traffic upstream in our network.

jlewi (Contributor) commented May 18, 2020

Closing this issue because the race condition should be fixed by #1143.

@joepeskett @snowlover173 the firewall issue seems like a different issue. Please open up a new issue if you are still having trouble.

jlewi closed this as completed on May 18, 2020