
Rework deploy_kubeflow.sh for Kubeflow 1.4 #1104

Merged
12 commits merged into NVIDIA:master on Mar 9, 2022

Conversation

@ajdecon (Collaborator) commented Feb 8, 2022

From the Kubeflow README:

Starting Kubeflow 1.3, all components should be deployable using kustomize only. Any automation tooling for deployment on top of the manifests should be maintained externally by distribution owners.

This changes a large part of the deployment process for Kubeflow, such that our deploy_kubeflow.sh script will need major changes for future versions.
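For reference, the kustomize-only install described in the upstream kubeflow/manifests README boils down to something like this (exact paths may differ from what our script ends up doing):

while ! kustomize build example | kubectl apply -f - ; do
    echo "Retrying to apply resources"
    sleep 10
done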

This PR is currently a work in progress, starting with a fresh script deploy_kubeflow_new.sh. As the PR evolves, I'll actually go back and rework the original script to use the new kustomize-based install method, rather than kfctl. But starting with a fresh script simplified this process greatly during initial development.

The current script does successfully install Kubeflow! Though it doesn't do much else yet. 😉 Backporting this into the older script should allow us to get back login customizations, etc.

$ kubectl get pods -n kubeflow
NAME                                                        READY   STATUS    RESTARTS   AGE
admission-webhook-deployment-667bd68d94-7fsgq               1/1     Running   0          25m
cache-deployer-deployment-79fdf9c5c9-d9d2q                  2/2     Running   2          25m
cache-server-5bdf4f4457-wfmtw                               2/2     Running   0          25m
centraldashboard-8fc7d8cc-d7d6z                             1/1     Running   0          25m
jupyter-web-app-deployment-6f744fbc54-vnv5z                 1/1     Running   0          25m
katib-controller-68c47fbf8b-v54sz                           1/1     Running   0          25m
katib-db-manager-6c948b6b76-dd2cm                           1/1     Running   7          25m
katib-mysql-7894994f88-5vjsl                                1/1     Running   1          25m
katib-ui-64bb96d5bf-s62lr                                   1/1     Running   0          25m
kfserving-controller-manager-0                              2/2     Running   0          15m
kfserving-models-web-app-7884f597cf-2d6j4                   2/2     Running   0          25m
kubeflow-pipelines-profile-controller-7b947f4748-6lgfh      1/1     Running   0          25m
metacontroller-0                                            1/1     Running   0          15m
metadata-envoy-deployment-5b4856dd5-2dqnx                   1/1     Running   0          25m
metadata-grpc-deployment-6b5685488-ssl6m                    2/2     Running   7          25m
metadata-writer-548bd879bb-6q6gc                            2/2     Running   4          25m
minio-5b65df66c9-trdrb                                      2/2     Running   0          25m
ml-pipeline-8c4b99589-m2q92                                 2/2     Running   9          25m
ml-pipeline-persistenceagent-d6bdc77bd-5m4fk                2/2     Running   1          25m
ml-pipeline-scheduledworkflow-5db54d75c5-2r24c              2/2     Running   0          25m
ml-pipeline-ui-5bd8d6dc84-zn5kt                             2/2     Running   0          24m
ml-pipeline-viewer-crd-68fb5f4d58-m4kj2                     2/2     Running   1          24m
ml-pipeline-visualizationserver-8476b5c645-24ksb            2/2     Running   0          24m
mpi-operator-5c55d6cb8f-bqzwf                               1/1     Running   0          24m
mysql-f7b9b7dd4-jpzd8                                       2/2     Running   0          24m
notebook-controller-deployment-75b4f7b578-9lnkt             1/1     Running   0          24m
profiles-deployment-89f7d88b-mmll6                          2/2     Running   0          24m
tensorboard-controller-controller-manager-954b7c544-gmnr4   3/3     Running   1          24m
tensorboards-web-app-deployment-6ff79b7f44-886j8            1/1     Running   0          24m
training-operator-7d98f9dd88-phrb2                          1/1     Running   0          24m
volumes-web-app-deployment-8589d664cc-5jvsw                 1/1     Running   0          24m
workflow-controller-5cbbb49bd8-sqvxm                        2/2     Running   1          24m
$ kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80


@ajdecon ajdecon force-pushed the kubeflow-1.4 branch 2 times, most recently from af5b54e to d848a67 on February 10, 2022 20:03
@ajdecon ajdecon marked this pull request as ready for review February 10, 2022 20:16
@ajdecon ajdecon changed the title WIP: Rework deploy_kubeflow.sh for Kubeflow 1.4 Rework deploy_kubeflow.sh for Kubeflow 1.4 Feb 10, 2022

-# Speificy how long to poll for Kubeflow to start
+# Specify how long to poll for Kubeflow to start
export KUBEFLOW_TIMEOUT="${KUBEFLOW_TIMEOUT:-600}"
Collaborator

Do we need 2 timeout variables? Do we use this in any of the test harness?
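If it's only used for the readiness poll, one variable bounding it in a single place would be enough, e.g. something roughly like:

kubectl wait --for=condition=Ready pods --all -n kubeflow --timeout="${KUBEFLOW_TIMEOUT}s"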

# Define Kubeflow manifests location
export KUBEFLOW_MANIFESTS_DEST="${KUBEFLOW_MANIFESTS_DEST:-${CONFIG_DIR}/kubeflow-install/manifests}"
export KUBEFLOW_MANIFESTS_URL="${KUBEFLOW_MANIFESTS_URL:-https://github.com/kubeflow/manifests}"
export KUBEFLOW_MANIFESTS_VERSION="${KUBEFLOW_MANIFESTS_VERSION:-v1.4.1}"
Collaborator

I had been using hashes instead of branch names because they tend to change the files in the release branch. But that kept breaking anyways, so I'm good with this.
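Pinning to a release tag like v1.4.1 keeps the checkout simple, roughly (clone destination taken from the variables above):

git clone "${KUBEFLOW_MANIFESTS_URL}" "${KUBEFLOW_MANIFESTS_DEST}"
cd "${KUBEFLOW_MANIFESTS_DEST}"
git checkout "${KUBEFLOW_MANIFESTS_VERSION}"    # a tag such as v1.4.1 rather than a commit hash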


# Define Kustomize location
export KUSTOMIZE_URL="${KUSTOMIZE_URL:-https://github.com/kubernetes-sigs/kustomize/releases/download/v3.2.0/kustomize_3.2.0_linux_amd64}"
Collaborator

The Jenkins servers we use had a custom download of this kustomize script at one point. We were having an issue where the test server was getting blocked for downloading too frequently, so I needed a workaround. We just need to make sure that we aren't running an older version in test.

Collaborator Author

Ah, that's good to know! I'll comment that out in Jenkins.

Annoyingly, Kubeflow 1.4 is locked to this specific Kustomize version: https://github.com/kubeflow/manifests/blob/master/README.md?plain=1#L84
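A rough sketch of pinning and sanity-checking that exact version (the script may fetch it slightly differently):

curl -fsSL -o "${KUSTOMIZE}" "${KUSTOMIZE_URL}"
chmod +x "${KUSTOMIZE}"
"${KUSTOMIZE}" version    # expect v3.2.0, which guards against a stale copy on the Jenkins hosts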

echo "To provision Ceph storage, run: ./scripts/k8s/deploy_rook.sh"
exit 1
fi
# kubectl get storageclass 2>&1 | grep "(default)" >/dev/null 2>&1
Collaborator

Is this no longer required?

Collaborator Author

I commented this out during debugging, will test again with it restored.
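Restored, the guard would look roughly like this (message wording may differ):

if ! kubectl get storageclass 2>&1 | grep -q "(default)"; then
    echo "No default StorageClass found; Kubeflow components that request PVCs will not start."
    echo "To provision Ceph storage, run: ./scripts/k8s/deploy_rook.sh"
    exit 1
fi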

tar xzf ./kustomize_v*_linux_*.tar.gz
mv kustomize ${KUSTOMIZE}

mkdir -p ${KUBEFLOW_MPI_DIR}
Collaborator

Does the current install include the MPI Operator?

Collaborator

It's worth noting that the latest version of MPI Operator changed the control mechanism and multi-node workloads will no longer run without ssh capabilities in the container.

Collaborator Author

The current install includes MPI Operator, yes. I haven't tested it yet though, will do so.
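Quick sanity checks would be something along these lines:

$ kubectl get deployment -n kubeflow mpi-operator
$ kubectl get crd mpijobs.kubeflow.org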

@@ -27,7 +27,7 @@ Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
config.vm.define "virtual-gpu01" do |gpu|
gpu.vm.provider "libvirt" do |v|
v.memory = 16384
v.cpus = 2
Collaborator

We need to verify this doesn't break the manually kicked off multinode test.

@@ -1,20 +0,0 @@
# See GitHub for more details: https://github.com/kubeflow/kubeflow/pull/3856
# Automatically shutdown Jupyter Notebook containers if idle
Collaborator

Sad to see this go. Any idea if the ability to automatically cull idle gpu jobs still exists?

kind: Kustomization

# TODO: Remove this when the bug is fixed in v1.3
# BUG: https://github.com/kubeflow/manifests/pull/1686/files
images:
Collaborator

Did you notice any easy way to default NGC containers and start Jupyter by default in the new version?
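One place to look (not verified against the 1.4 layout) might be the jupyter-web-app spawner config, e.g.:

grep -n "value:" ${KUBEFLOW_MANIFESTS_DEST}/apps/jupyter/jupyter-web-app/upstream/base/configs/spawner_ui_config.yaml    # locate the default notebook image
# ...then point it at an NGC container such as nvcr.io/nvidia/pytorch before running kustomize build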

supertetelman previously approved these changes Feb 12, 2022

@supertetelman (Collaborator) left a comment

LGTM as is, thanks for getting this in! Let's see it pass the testing.

@ajdecon ajdecon marked this pull request as draft February 15, 2022 21:41
@ajdecon (Collaborator Author) commented Feb 15, 2022

@supertetelman : CI tests not passing for reasons I don't understand yet, and I'd like to address at least some of your comments. Converted back to draft for now.

@ajdecon (Collaborator Author) commented Feb 17, 2022

@supertetelman : Looking for help running an issue down on this...

When I test this PR on a local VM cluster built using the DeepOps virtual/ scripts, Kubeflow deploys successfully.

But when it runs in CI, I see errors like this:

Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "config.webhook.istio.networking.internal.knative.dev": Post "https://istio-webhook.knative-serving.svc:443/config-validation?timeout=10s": dial tcp 10.233.25.125:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "config.webhook.serving.knative.dev": Post "https://webhook.knative-serving.svc:443/config-validation?timeout=10s": dial tcp 10.233.2.217:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "config.webhook.serving.knative.dev": Post "https://webhook.knative-serving.svc:443/config-validation?timeout=10s": dial tcp 10.233.2.217:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "config.webhook.serving.knative.dev": Post "https://webhook.knative-serving.svc:443/config-validation?timeout=10s": dial tcp 10.233.2.217:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "config.webhook.istio.networking.internal.knative.dev": Post "https://istio-webhook.knative-serving.svc:443/config-validation?timeout=10s": dial tcp 10.233.25.125:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "config.webhook.istio.networking.internal.knative.dev": Post "https://istio-webhook.knative-serving.svc:443/config-validation?timeout=10s": dial tcp 10.233.25.125:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "config.webhook.serving.knative.dev": Post "https://webhook.knative-serving.svc:443/config-validation?timeout=10s": dial tcp 10.233.2.217:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "config.webhook.serving.knative.dev": Post "https://webhook.knative-serving.svc:443/config-validation?timeout=10s": dial tcp 10.233.2.217:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "config.webhook.istio.networking.internal.knative.dev": Post "https://istio-webhook.knative-serving.svc:443/config-validation?timeout=10s": dial tcp 10.233.25.125:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "config.webhook.serving.knative.dev": Post "https://webhook.knative-serving.svc:443/config-validation?timeout=10s": dial tcp 10.233.2.217:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "config.webhook.istio.networking.internal.knative.dev": Post "https://istio-webhook.knative-serving.svc:443/config-validation?timeout=10s": dial tcp 10.233.25.125:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "config.webhook.serving.knative.dev": Post "https://webhook.knative-serving.svc:443/config-validation?timeout=10s": dial tcp 10.233.2.217:443: connect: connection refused

(Full log from most recent CI run)

I'm not sure what could cause the delta, as the specs of the VMs in the CI test should be the same as the specs in my local test.

Any thoughts?
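The connection-refused errors suggest the knative-serving webhook pods may not have been ready yet when the dependent resources were applied; one possible mitigation (deployment names guessed from the service names in the errors above) would be to wait for them, or simply retry the apply in a loop:

kubectl wait --for=condition=Available deployment/webhook -n knative-serving --timeout=300s
kubectl wait --for=condition=Available deployment/istio-webhook -n knative-serving --timeout=300s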

web:
http: 0.0.0.0:5556
logger:
level: "debug"
Collaborator

We may want to default this to info, will test this on my cluster.

@supertetelman (Collaborator) commented

A few minor updates are needed to stay in line with current behavior:

Installation:

  • Should fail fast on the git clone if there is already a config present (so we don't accidentally install an old version)

Deletion:

  • Should clear out and/or backup the kubeflow config directory
  • Delete leaves behind two namespaces. We should either (A) print a message telling the user to manually delete the cert-manager and istio-system namespaces, or (B) delete them ourselves. I thought I previously had some dynamic detection of whether either of these existed before the Kubeflow install; to err on the side of caution, I'd say we go with (A) and just print a warning that those two namespaces need to be deleted manually for a full teardown (see the sketch below).

Otherwise
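For (A), something along these lines would be fine (exact wording and manifests path TBD):

kustomize build "${KUBEFLOW_MANIFESTS_DEST}/example" | kubectl delete -f - || true
echo "NOTE: the cert-manager and istio-system namespaces were left in place because they may predate this install."
echo "For a full teardown, delete them manually: kubectl delete namespace cert-manager istio-system"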

@supertetelman (Collaborator) commented Mar 9, 2022

Just opened a PR with a last set of bugfixes needed, mostly around the kubeflow testing. Merge that and we can merge this through.

@supertetelman supertetelman marked this pull request as ready for review March 9, 2022 19:01
supertetelman previously approved these changes Mar 9, 2022

@supertetelman (Collaborator) left a comment

LGTM, just merge in the testing changes in ajdecon#11 so that this passes in Jenkins.

@supertetelman supertetelman merged commit 223b214 into NVIDIA:master Mar 9, 2022
@ajdecon ajdecon mentioned this pull request Apr 26, 2022