Manual installation instructions and Kustomize 4 #2360

Closed
kimwnasptd opened this issue Jan 25, 2023 · 10 comments

@kimwnasptd
Member

For KF 1.7 we were targeting support for Kustomize 4 (#1797 (comment)). Specifically, we were aiming to use a kustomize version that would include the fix in kubernetes-sigs/kustomize#4019.

But since no such version exists at this point, we'll have to compromise.

While the one-liner installation that uses a single meta-package won't work due to the ordering issue, we can still aim to ensure that the manual instructions work with Kustomize 4. I'll use this issue to track the progress of this effort.

cc @DomFleischmann, and also @surajkota since IIRC he had some experience with this effort

@kimwnasptd
Member Author

The first hiccup I bumped into was with the Cert Manager instructions we have
https://github.com/kubeflow/manifests#cert-manager

kustomize build common/cert-manager/cert-manager/base | kubectl apply -f -
kustomize build common/cert-manager/kubeflow-issuer/base | kubectl apply -f -

When running the second command I got back

Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": dial tcp 10.96.202.64:443: connect: connection refused

NOTE: I'm using a KinD cluster with K8s 1.25, to ensure I'm as close as possible to the testing environment https://github.com/kubeflow/manifests/tree/master/tests/gh-actions

@kimwnasptd
Member Author

Looks like the Cert Manager folks have a whole section on the connection refused error https://cert-manager.io/docs/troubleshooting/webhook/#error-connect-connection-refused

They propose the following checks:

  1. Verify there are Endpoint objects for the Service ✅
  2. Verify the Pod is Ready ✅
  3. Hit the /healthz of the webhook and get 200 ✅

In my case all of the above were true, yet I still got that error. After "some time", though, the second command succeeded. Since /healthz requests return 200 yet I still get the connection refused error, I'll treat this as an issue with K8s (KinD?): the Pod is Ready, yet it can't receive requests.

So, I'll add a manual wait for the webhook Pod to be ready and add some pointers in the README to this troubleshooting.
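
A minimal sketch of such a wait, assuming a plain retry loop is acceptable: since the Pod can report Ready while the webhook still refuses connections, the issuer apply is simply retried until it succeeds. The `retry` and `apply_kubeflow_issuer` helpers are hypothetical names, and the label selector in the usage comment is an assumption about how the webhook Pod is labelled.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
  retry_max="$1"
  retry_delay="$2"
  shift 2
  retry_attempt=1
  until "$@"; do
    if [ "$retry_attempt" -ge "$retry_max" ]; then
      echo "giving up after ${retry_attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${retry_attempt} failed; retrying in ${retry_delay}s" >&2
    retry_attempt=$((retry_attempt + 1))
    sleep "$retry_delay"
  done
}

# The second README command quoted above, wrapped so it can be retried.
apply_kubeflow_issuer() {
  kustomize build common/cert-manager/kubeflow-issuer/base | kubectl apply -f -
}

# Usage (commented out so that sourcing this snippet has no side effects).
# The label selector below is an assumption, not taken from the README:
#   kubectl wait --for=condition=Ready pod -l app.kubernetes.io/name=webhook \
#     -n cert-manager --timeout=180s
#   retry 10 10 apply_kubeflow_issuer
```

The attempt count and delay are arbitrary; the point is only that a failed apply is harmless to repeat.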

@kimwnasptd
Member Author

Next up, the Istio, Dex and OIDC AuthService sections worked as expected.

The next problematic section is Knative. Specifically, I sometimes bumped into the following error when applying the Knative manifests:

namespace/knative-serving created
...
customresourcedefinition.apiextensions.k8s.io/images.caching.internal.knative.dev created
...
error: resource mapping not found for name: "queue-proxy" namespace: "knative-serving" from "STDIN": no matches for kind "Image" in version "caching.internal.knative.dev/v1alpha1"
ensure CRDs are installed first

This error happens when kubectl tries to apply the Image CR queue-proxy

apiVersion: caching.internal.knative.dev/v1alpha1
kind: Image
metadata:
  name: queue-proxy
  namespace: knative-serving
  labels:
    app.kubernetes.io/component: queue-proxy
    app.kubernetes.io/name: knative-serving
    app.kubernetes.io/version: "1.8.0"
spec:
  image: gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:505179c0c4892ea4a70e78bc52ac21b03cd7f1a763d2ecc78e7bbaa1ae59c86c

@kimwnasptd
Member Author

kimwnasptd commented Jan 25, 2023

After some debugging with @thomaspant we saw that the CRDs have an Established Condition in their .status
https://github.com/kubernetes/apiextensions-apiserver/blob/a7ee7f91a2d0805f729998b85680a20cfba208d2/pkg/apis/apiextensions/types.go#L276-L279

What seems to happen is the following

  1. kubectl first applies the Image CRD as expected
  2. The CRD takes some seconds to become Established
  3. kubectl "quickly" tries to apply the queue-proxy Image CR, but the CRD is still not Established
  4. We see the above error
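
The condition in step 2 can be inspected directly. The following is a small sketch, assuming `kubectl` is pointed at the cluster in question; `crd_established` is a hypothetical helper name, and the CRD name in the usage comment is the one from the log output earlier in this thread.

```shell
# crd_established CRD_NAME
# Succeeds only if the CRD reports Established=True in .status.conditions.
crd_established() {
  crd_status="$(kubectl get crd "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Established")].status}' \
    2>/dev/null)" || crd_status=""
  [ "$crd_status" = "True" ]
}

# Usage (commented out so that sourcing this snippet has no side effects):
#   crd_established images.caching.internal.knative.dev \
#     && echo "Image CRD is Established"
```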

This issue was discussed both in kubernetes/kubectl#1117 and in helm/helm#4925.

Re-applying the manifests a little later, once the CRDs are Established, resolves the issue. I'll add a troubleshooting note in that section regarding this error. We can just tell users to re-apply the Knative manifests.
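
For anyone scripting the installation, the re-apply can be sketched roughly as follows. This is an illustration under assumptions rather than what the README will contain: `apply_with_crd_wait` is a hypothetical helper, the kustomization directory is passed in by the caller, and waiting on every CRD in the cluster is broader than strictly needed.

```shell
# apply_with_crd_wait KUSTOMIZE_DIR
# Applies the manifests once; if that fails (for example because a CR was
# submitted before its CRD was Established), waits for the CRDs and re-applies.
apply_with_crd_wait() {
  kustomize_dir="$1"
  if kustomize build "$kustomize_dir" | kubectl apply -f -; then
    return 0
  fi
  echo "first apply failed; waiting for CRDs to become Established" >&2
  kubectl wait --for=condition=Established crd --all --timeout=60s || return 1
  kustomize build "$kustomize_dir" | kubectl apply -f -
}

# Usage (commented out; replace the placeholder with the directory used in
# the Knative section of the manual instructions):
#   apply_with_crd_wait <knative-kustomization-dir>
```

The second apply is safe because `kubectl apply` leaves already-created resources unchanged.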

@kimwnasptd
Member Author

The next component that threw an error was KFP, for the same CRD readiness reason as above:

error: resource mapping not found for name: "kubeflow-pipelines-profile-controller" namespace: "kubeflow" from "STDIN": no matches for kind "CompositeController" in version "metacontroller.k8s.io/v1alpha1"
ensure CRDs are installed first

Re-applying the manifests solved the issue here as well.

@kimwnasptd
Member Author

Lastly, I bumped into an issue with the Profiles manifests

❯ ( git co master && cd components/profile-controller/config/manager && ./kustomize-4.5.5 -o master.template.yaml build )
Already on 'master'
Your branch is up to date with 'origin/master'.
Error: map[string]interface {}(nil): yaml: unmarshal errors:
  line 37: mapping key "livenessProbe" already defined at line 27
  line 49: mapping key "ports" already defined at line 33

But this was fixed by @arkaitzj in kubeflow/kubeflow#6604, which will be included in the KF 1.7 RC0.

@kimwnasptd
Member Author

kimwnasptd commented Jan 25, 2023

So the current summary is:

  1. kubectl is slightly broken, since it applies CRs without checking whether the CRD is Established and retrying. See "Wait for a CRD type to deploy before deploying resources that use the type" (kubernetes/kubectl#1117) and "crd-install hook possible race condition" (helm/helm#4925).
  2. Cert Manager seems to need some time to become functional, but we don't have a good way of waiting https://cert-manager.io/docs/troubleshooting/webhook/#error-connect-connection-refused

Aside from these, the manual instructions work as expected with
{Version:kustomize/v4.5.7 GitCommit:56d82a8378dfc8dc3b3b1085e5a6e67b82966bd7 BuildDate:2022-08-02T16:35:54Z GoOs:linux GoArch:amd64}

I'll prepare a small PR to document this in the manual installation steps and to note that they are compatible with Kustomize 4.5.7.
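
A script could guard against older kustomize binaries with something like the sketch below. It assumes the version string has the `Version:kustomize/vX.Y.Z` shape pasted above; other kustomize releases may print a different format, and both helper names are hypothetical.

```shell
# kustomize_version_from TEXT
# Prints the X.Y.Z part of a "Version:kustomize/vX.Y.Z" string, if present.
kustomize_version_from() {
  printf '%s\n' "$1" | sed -n 's/.*Version:kustomize\/v\([0-9][0-9.]*\).*/\1/p'
}

# version_at_least ACTUAL MINIMUM
# Succeeds if ACTUAL >= MINIMUM, comparing the three dotted fields numerically.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" \
    | sort -t. -k1,1n -k2,2n -k3,3n | head -n 1)" = "$2" ]
}

# Usage (commented out so that sourcing this snippet has no side effects):
#   actual="$(kustomize_version_from "$(kustomize version)")"
#   version_at_least "$actual" 4.5.7 || echo "kustomize >= 4.5.7 expected" >&2
```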

cc @DomFleischmann @jbottum

@juliusvonkohout
Member

@kimwnasptd can we close this, since we are now on kustomize 5?

@juliusvonkohout
Member

/close

@juliusvonkohout: Closing this issue.
