Rework deploy_kubeflow.sh for Kubeflow 1.4 #1104
Conversation
Force-pushed from af5b54e to d848a67.
# Specify how long to poll for Kubeflow to start
export KUBEFLOW_TIMEOUT="${KUBEFLOW_TIMEOUT:-600}"
Do we need two timeout variables? Do we use this anywhere in the test harness?
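For context, a minimal sketch of how a single timeout variable could be consumed; this assumes the script only needs to wait on pods in the kubeflow namespace, which may not match the actual wait logic:

# Sketch (assumed usage, not the script's actual implementation): wait up to
# KUBEFLOW_TIMEOUT seconds for every pod in the kubeflow namespace to be Ready.
kubectl wait --for=condition=Ready pods --all \
    --namespace kubeflow --timeout="${KUBEFLOW_TIMEOUT}s"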
# Define Kubeflow manifests location
export KUBEFLOW_MANIFESTS_DEST="${KUBEFLOW_MANIFESTS_DEST:-${CONFIG_DIR}/kubeflow-install/manifests}"
export KUBEFLOW_MANIFESTS_URL="${KUBEFLOW_MANIFESTS_URL:-https://github.com/kubeflow/manifests}"
export KUBEFLOW_MANIFESTS_VERSION="${KUBEFLOW_MANIFESTS_VERSION:-v1.4.1}"
I had been using commit hashes instead of branch names because the files in the release branch tend to change. But that kept breaking anyway, so I'm good with this.
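For reference, a minimal sketch of how these variables might be consumed (an assumed flow, not necessarily what the script does):

# Sketch (assumption): clone the manifests repo once, then pin the requested version.
if [ ! -d "${KUBEFLOW_MANIFESTS_DEST}" ]; then
    git clone "${KUBEFLOW_MANIFESTS_URL}" "${KUBEFLOW_MANIFESTS_DEST}"
fi
git -C "${KUBEFLOW_MANIFESTS_DEST}" fetch --tags
git -C "${KUBEFLOW_MANIFESTS_DEST}" checkout "${KUBEFLOW_MANIFESTS_VERSION}"

Since git checkout accepts either a tag or a commit hash, KUBEFLOW_MANIFESTS_VERSION could still be pinned to a hash if the tag ever proves unstable.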
# Define Kustomize location
export KUSTOMIZE_URL="${KUSTOMIZE_URL:-https://github.com/kubernetes-sigs/kustomize/releases/download/v3.2.0/kustomize_3.2.0_linux_amd64}"
The Jenkins servers we use had a custom download of this kustomize binary at one point. We were having an issue where the test server was getting blocked for downloading too frequently, so I needed a workaround. We just need to make sure that we aren't running an older version in test.
Ah, that's good to know! I'll comment that out in Jenkins.
Annoyingly, Kubeflow 1.4 is locked to this specific Kustomize version: https://github.com/kubeflow/manifests/blob/master/README.md?plain=1#L84
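For what it's worth, a version check before downloading would cover both concerns, the rate limiting and the stale-binary risk. A sketch, assuming ${KUSTOMIZE} points at the same install path the script uses later:

# Sketch (assumption): only fetch the pinned 3.2.0 binary if the existing one
# does not already report that version. The v3.2.0 release asset is a bare
# binary, so no tar extraction is needed for this particular URL.
if ! "${KUSTOMIZE}" version 2>/dev/null | grep -q "3.2.0"; then
    curl -fsSL -o "${KUSTOMIZE}" "${KUSTOMIZE_URL}"
    chmod +x "${KUSTOMIZE}"
fi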
scripts/k8s/deploy_kubeflow.sh (Outdated)
echo "To provision Ceph storage, run: ./scripts/k8s/deploy_rook.sh" | ||
exit 1 | ||
fi | ||
# kubectl get storageclass 2>&1 | grep "(default)" >/dev/null 2>&1 |
Is this no longer required?
I commented this out during debugging, will test again with it restored.
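For reference, the restored check could look like the following; this is a sketch built from the surrounding lines, with the grep on "(default)" mirroring the commented-out command:

# Sketch: fail early if no default StorageClass exists, since Kubeflow's PVCs
# will otherwise sit unbound.
if ! kubectl get storageclass 2>/dev/null | grep -q "(default)"; then
    echo "No default StorageClass detected"
    echo "To provision Ceph storage, run: ./scripts/k8s/deploy_rook.sh"
    exit 1
fi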
tar xzf ./kustomize_v*_linux_*.tar.gz
mv kustomize ${KUSTOMIZE}

mkdir -p ${KUBEFLOW_MPI_DIR}
Does the current install include the MPI Operator?
It's worth noting that the latest version of MPI Operator changed the control mechanism and multi-node workloads will no longer run without ssh capabilities in the container.
The current install includes MPI Operator, yes. I haven't tested it yet though, will do so.
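For reference, one way the MPI Operator component could be applied straight from the manifests checkout; the overlay path below is an assumption and should be verified against the pinned manifests version:

# Sketch (assumed path): build and apply the MPI Operator overlay shipped with
# the Kubeflow manifests rather than a separately downloaded copy.
"${KUSTOMIZE}" build "${KUBEFLOW_MANIFESTS_DEST}/apps/mpi-job/upstream/overlays/kubeflow" \
    | kubectl apply -f -

If the install ever moves to the newer operator, the note above about SSH applies: worker images would need an sshd for multi-node jobs to run.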
@@ -27,7 +27,7 @@ Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
  config.vm.define "virtual-gpu01" do |gpu|
    gpu.vm.provider "libvirt" do |v|
      v.memory = 16384
      v.cpus = 2
We need to verify this doesn't break the manually kicked off multinode test.
@@ -1,20 +0,0 @@
# See GitHub for more details: https://github.com/kubeflow/kubeflow/pull/3856
# Automatically shutdown Jupyter Notebook containers if idle
Sad to see this go. Any idea if the ability to automatically cull idle GPU jobs still exists?
kind: Kustomization

# TODO: Remove this when the bug is fixed in v1.3 # BUG: https://github.com/kubeflow/manifests/pull/1686/files
images:
Did you notice any easy way to make NGC containers the default and start Jupyter by default in the new version?
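One possible approach, sketched with hypothetical image names (not something this PR implements): have the deploy script write a kustomize images override so the default notebook image points at an NGC container.

# Sketch with hypothetical image names: override the default notebook image via
# a kustomize "images" entry generated by the deploy script.
cat > "${KUBEFLOW_MANIFESTS_DEST}/kustomization.yaml" <<'EOF'
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - example
images:
  - name: jupyter-default            # hypothetical placeholder for the upstream notebook image name
    newName: nvcr.io/nvidia/pytorch  # NGC container used as the default
    newTag: "21.10-py3"
EOF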
LGTM as is, thanks for getting this in! Let's see it pass the testing.
@supertetelman : CI tests are not passing for reasons I don't understand yet, and I'd like to address at least some of your comments. Converted back to draft for now.
@supertetelman : Looking for help running an issue down on this. When I test this PR on a local VM cluster built using the DeepOps virtual setup, it works. But when it runs in CI, I see errors like this:
(Full log from most recent CI run) I'm not sure what could cause the delta, as the specs of the VMs in the CI test should be the same as the specs in my local test. Any thoughts?
web:
  http: 0.0.0.0:5556
logger:
  level: "debug"
We may want to default this to info; I'll test this on my cluster.
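A sketch of one way to do that while keeping it overridable; the config-map path is an assumption about where the Dex config lives in the checked-out manifests:

# Sketch (path assumed): default Dex logging to "info", overridable via env var.
export KUBEFLOW_DEX_LOG_LEVEL="${KUBEFLOW_DEX_LOG_LEVEL:-info}"
sed -i "s/level: \"debug\"/level: \"${KUBEFLOW_DEX_LOG_LEVEL}\"/" \
    "${KUBEFLOW_MANIFESTS_DEST}/common/dex/base/config-map.yaml"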
A few minor updates needed to stay in line with current behavior.

Installation:

Deletion:

Otherwise
Just opened a PR with the last set of bugfixes needed, mostly around the Kubeflow testing. Merge that and we can merge this through.
LGTM, just merge in the testing changes in ajdecon#11 so that this passes in Jenkins.
Kubeflow 1.4 minor bugfixes
From the Kubeflow README:
This changes a large part of the deployment process for Kubeflow, such that our deploy_kubeflow.sh script will need major changes for future versions.

This PR is currently a work in progress, starting with a fresh script, deploy_kubeflow_new.sh. As the PR evolves, I'll actually go back and rework the original script to use the new kustomize-based install method rather than kfctl. But starting with a fresh script simplified this process greatly during initial development.

The current script does successfully install Kubeflow! Though it doesn't do much else yet. 😉 Backporting this into the older script should allow us to get back login customizations, etc.
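For reference, the kustomize-based method the new script builds on is the single-command install documented in the upstream kubeflow/manifests README, which (approximately) retries the apply until CRDs and webhooks settle:

# From the kubeflow/manifests README (approximate); run from the root of the
# manifests checkout, retrying until all resources apply cleanly.
while ! kustomize build example | kubectl apply -f -; do
    echo "Retrying to apply resources"
    sleep 10
done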