gpu example: wording cleanup #3682

Merged
merged 1 commit on May 5, 2020
6 changes: 3 additions & 3 deletions samples/tutorials/gpu/README.md
@@ -1,6 +1,6 @@
# GPU

This folder contains a GPU sample.
- Demo how to setup one GPU node pool with low cost via autoscaling.
- Demo how to setup more than one GPU node pools in one cluster.
- Demo how to use Kubeflow Pipeline SDK to consume GPU.
- Demo how to set up one GPU node pool with low cost via autoscaling.
- Demo how to set up more than one GPU node pool in one cluster.
- Demo how to use the Kubeflow Pipeline SDK to consume GPUs.
71 changes: 41 additions & 30 deletions samples/tutorials/gpu/gpu.ipynb
@@ -4,11 +4,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook walks you through how to use accelerators for Kubeflow Pipelines steps.\n",
"\n",
"# Preparation\n",
"\n",
"If you installed Kubeflow via [kfctl](https://www.kubeflow.org/docs/gke/customizing-gke/#common-customizations), you may already prepared GPU enviroment and can skip this section.\n",
"If you installed Kubeflow via [kfctl](https://www.kubeflow.org/docs/gke/customizing-gke/#common-customizations), these steps will have already been done, and you can skip this section.\n",
"\n",
"If you installed Kubeflow Pipelines via [Google Cloud AI Platform Pipelines UI](https://console.cloud.google.com/ai-platform/pipelines/) or [Standalone manifest](https://github.com/kubeflow/pipelines/tree/master/manifests/kustomize), please follow following steps to setup GPU enviroment.\n",
"If you installed Kubeflow Pipelines via [Google Cloud AI Platform Pipelines UI](https://console.cloud.google.com/ai-platform/pipelines/) or [Standalone manifest](https://github.com/kubeflow/pipelines/tree/master/manifests/kustomize), you willl need to follow these steps to set up your GPU enviroment.\n",
"\n",
"## Add GPU nodes to your cluster\n",
"\n",
@@ -20,7 +22,7 @@
"\n",
"You may also check or edit the GCP's **GPU Quota** to make sure you still have GPU quota in the region.\n",
"\n",
"To well saving the costs, it's possible you create a zero-sized node pool for GPU and enable the autoscaling.\n",
"To reduce costs, you may want to create a zero-sized node pool for GPU and enable autoscaling.\n",
"\n",
"Here is an example to create a P100 GPU node pool for a cluster.\n",
"\n",
@@ -34,31 +36,32 @@
"export MACHINE_TYPE=n1-highmem-16\n",
"\n",
"\n",
"# It may takes several minutes.\n",
"# Node pool creation may take several minutes.\n",
"gcloud container node-pools create ${GPU_POOL_NAME} \\\n",
" --accelerator type=${GPU_TYPE},count=${GPU_COUNT} \\\n",
" --zone ${CLUSTER_ZONE} --cluster ${CLUSTER_NAME} \\\n",
" --num-nodes=0 --machine-type=${MACHINE_TYPE} --min-nodes=0 --max-nodes=5 --enable-autoscaling \\\n",
" --scopes=cloud-platform\n",
"```\n",
"\n",
"Here in this sample, we specified **--scopes=cloud-platform**. More info is [here](https://cloud.google.com/sdk/gcloud/reference/container/node-pools/create#--scopes). It will allow the job in the node pool to use GCE Default Service Account to access GCP APIs (e.x. GCS etc.). You also use [Workload Identity](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity) or [Application Default Credential](https://cloud.google.com/docs/authentication/production) to replace **--scopes=cloud-platform**.\n",
"Here in this sample, we specified **--scopes=cloud-platform**. More info is [here](https://cloud.google.com/sdk/gcloud/reference/container/node-pools/create#--scopes). This scope will allow node pool jobs to use the GCE Default Service Account to access GCP APIs (like GCS, etc.). You can also use [Workload Identity](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity) or [Application Default Credentials](https://cloud.google.com/docs/authentication/production) to replace **--scopes=cloud-platform**.\n",
"\n",
"## Install device driver to the cluster\n",
"## Install NVIDIA device driver to the cluster\n",
"\n",
"After adding GPU nodes to your cluster, you need to install NVIDIA’s device drivers to the nodes. Google provides a DaemonSet that automatically installs the drivers for you.\n",
"After adding GPU nodes to your cluster, you need to install NVIDIA’s device drivers to the nodes. Google provides a GKE `DaemonSet` that automatically installs the drivers for you.\n",
"\n",
"To deploy the installation DaemonSet, run the following command. It's an one-off work.\n",
"To deploy the installation DaemonSet, run the following command. You can run this command any time (even before you create your node pool), and you only need to do this once per cluster.\n",
"\n",
"```shell\n",
"kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml\n",
"```\n",
"\n",
"# Consume GPU via Kubeflow Pipelines SDK\n",
"\n",
"Here is a [document](https://www.kubeflow.org/docs/gke/pipelines/enable-gpu-and-tpu/).\n",
"Once your cluster is set up to support GPUs, the next step is to indicate which steps in your pipelines should use accelerators, and what type they should use. \n",
"Here is a [document](https://www.kubeflow.org/docs/gke/pipelines/enable-gpu-and-tpu/) that describes the options.\n",
"\n",
"Following is a sample quick smoking test.\n"
"The following is an example 'smoke test' pipeline, to see if your cluster setup is working properly.\n"
]
},
{
@@ -79,8 +82,8 @@
" ).set_gpu_limit(1)\n",
"\n",
"@dsl.pipeline(\n",
" name='GPU smoking check',\n",
" description='Smoking check whether GPU env is ready.'\n",
" name='GPU smoke check',\n",
" description='smoke check as to whether GPU env is ready.'\n",
")\n",
"def gpu_pipeline():\n",
" gpu_smoking_check = gpu_smoking_check_op()\n",
@@ -93,11 +96,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"You may see warning message from Kubeflow Pipeline logs saying \"Insufficient nvidia.com/gpu\". Please wait for few minutes.\n",
"You may see a warning message from Kubeflow Pipeline logs saying \"Insufficient nvidia.com/gpu\". If so, this probably means that your GPU-enabled node is still spinning up; please wait for few minutes. You can check the current nodes in your cluster like this:\n",
"\n",
"```\n",
"kubectl get nodes -o wide\n",
"```\n",
"\n",
"If everything runs well, it's expected to see the results of \"nvidia-smi\" mentions the CUDA version, GPU type and usage etc.\n",
"If everything runs as expected, the `nvidia-smi` command should list the CUDA version, GPU type, usage, etc. (See the logs panel in the pipeline UI to view output).\n",
"\n",
"> You may also notice that after the pod got finished, the new GPU node is still there. GKE autoscale algorithm will free that node if no usage for certain time. More info is [here](https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler)."
"> You may also notice that after the pipeline step's GKE pod has finished, the new GPU cluster node is still there. GKE autoscale algorithm will free that node if no usage for certain time. More info is [here](https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler)."
]
},
{
@@ -106,17 +113,17 @@
"source": [
"# Multiple GPUs pool in one cluster\n",
"\n",
"It's possible you want more then 1 type of GPU to be supported in one cluster.\n",
"It's possible you want more than one type of GPU to be supported in one cluster.\n",
"\n",
"- There are several types of GPUs.\n",
"- Certain regions normally just support part of the GPUs ([document](https://cloud.google.com/compute/docs/gpus#gpus-list)).\n",
"- Certain regions often support a only subset of the GPUs ([document](https://cloud.google.com/compute/docs/gpus#gpus-list)).\n",
"\n",
"Since we can set \"--num-nodes=0\" for certain GPU node pool to save costs if no workload, we can create multiple node pools for different types of GPUs.\n",
"Since we can set `--num-nodes=0` for certain GPU node pool to save costs if no workload, we can create multiple node pools for different types of GPUs.\n",
"\n",
"## Add additional GPU nodes to your cluster\n",
"\n",
"\n",
"In upper section, we added a node pool for P100. Here we add another pool for V100.\n",
"In a previous section, we added a node pool for P100s. Here we add another pool for V100s.\n",
"\n",
"```shell\n",
"# You may customize these parameters.\n",
@@ -128,7 +135,7 @@
"export MACHINE_TYPE=n1-highmem-8\n",
"\n",
"\n",
"# It may takes several minutes.\n",
"# Node pool creation may take several minutes.\n",
"gcloud container node-pools create ${GPU_POOL_NAME} \\\n",
" --accelerator type=${GPU_TYPE},count=${GPU_COUNT} \\\n",
" --zone ${CLUSTER_ZONE} --cluster ${CLUSTER_NAME} \\\n",
@@ -137,12 +144,13 @@
"\n",
"## Consume certain GPU via Kubeflow Pipelines SDK\n",
"\n",
"Please reference following sample which explictlly request to use certain GPU."
"If your cluster has multiple GPU node pools, you can explicitly specify that a given pipeline step should use a particular type of accelerator.\n",
"This example shows how to use P100s for one pipeline step, and V100s for another."
]
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
@@ -166,8 +174,8 @@
" ).set_gpu_limit(1).add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-v100')\n",
"\n",
"@dsl.pipeline(\n",
" name='GPU smoking check',\n",
" description='Smoking check whether GPU env is ready.'\n",
" name='GPU smoke check',\n",
" description='Smoke check as to whether GPU env is ready.'\n",
")\n",
"def gpu_pipeline():\n",
" gpu_p100 = gpu_p100_op()\n",
@@ -181,17 +189,18 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"It's expected it runs well and you will see different \"nvidia-smi\" logs from the two pipeline steps."
"You should see different \"nvidia-smi\" logs from the two pipeline steps."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preemptible GPU\n",
"Preemptible GPU resource is more cheaper but it also means your task requires retries.\n",
"## Using Preemptible GPUs\n",
"\n",
"A [Preemptible GPU resource](https://cloud.google.com/compute/docs/instances/preemptible#preemptible_with_gpu) is cheaper, but use of these instances means that a pipeline step has the potential to be aborted and then retried. This means that pipeline steps used with preemptible instances must be idempotent (the step gives the same results if run again), or creates some kind of checkpoint so that it can pick up where it left off. To use preemptible GPUs, create a node pool as follows. Then when specifying a pipeline, you can indicate use of a preemptible node pool for a step. \n",
"\n",
"Please notice the following only difference is that it added **--preemptible** and **--node-taints=preemptible=true:NoSchedule** parameters.\n",
"The only difference in the following node-pool creation example is that the **--preemptible** and **--node-taints=preemptible=true:NoSchedule** parameters have been added.\n",
"\n",
"```\n",
"export GPU_POOL_NAME=v100pool-preemptible\n",
@@ -207,7 +216,9 @@
" --preemptible \\\n",
" --node-taints=preemptible=true:NoSchedule \\\n",
" --num-nodes=0 --machine-type=${MACHINE_TYPE} --min-nodes=0 --max-nodes=5 --enable-autoscaling\n",
"```"
"```\n",
"\n",
"Then, you can define a pipeline as follows (note the use of `use_preemptible_nodepool()`)."
]
},
{
@@ -265,7 +276,7 @@
"metadata": {},
"source": [
"# TPU\n",
"Google's TPU is awesome. It's faster and lower TOC. To consume TPU, no need to create node-pool, just call KFP SDK to use it. Here is a [doc](https://www.kubeflow.org/docs/gke/pipelines/enable-gpu-and-tpu/#configure-containerop-to-consume-tpus). Please notice that not all regions has TPU yet.\n",
"Google's TPU is awesome. It's faster and lower TOC. To consume TPUs, there is no need to create a node-pool; just call KFP SDK to use it. Here is a [doc](https://www.kubeflow.org/docs/gke/pipelines/enable-gpu-and-tpu/#configure-containerop-to-consume-tpus). Note that not all regions have TPU yet.\n",
"\n"
]
},