Rewrote doc on GPU support in Kubernetes. #6736
Conversation
Deploy preview ready! Built with commit 3e8d132 https://deploy-preview-6736--kubernetes-io-master-staging.netlify.com
Force-pushed from 2e4032c to 9dc9983.
Force-pushed from f73dfc0 to 28bb551.
@mindprince How do you want to solve the conflicts with #6722?
### Deploying NVIDIA GPU device plugin

There are currently two device plugin implementations for NVIDIA GPUs:
"two device plugin implementations" → "two sample device plugin implementations"?
To me, "sample" implies that they are not ready to be used as is and require some modification. However, both these implementations are ready to be used as-is.
multiple backwards incompatible iterations. This page describes how users can
consume GPUs across different Kubernetes versions and the current limitations.

## v1.6 and v1.7
Can we move this section to the end and have the current version be the first one?
We could add a summary :)
It flows better this way. I made the recommendation to use device plugins bold.
Mostly lgtm, will close my PR.
Three points though:
- Can we have the device plugin documentation first + summary?
- I'd rather we use an official NVIDIA supported image
- Please add a disclaimer that NVIDIA does not support installing the drivers from a container on Ubuntu and that any questions or bugs should be handled by the GKE team.
@@ -4,154 +4,203 @@ approvers:
title: Schedule GPUs
---
Can we mark this as alpha?
i.e.: +{% include feature-state-alpha.md %}
This is describing a task ("how to use GPUs") and not a particular feature. The device plugin page linked from here already includes this. On this page, we have mentioned multiple times that this is alpha.
Sounds good
Nvidia GPUs can be consumed via container level resource requirements using the resource name `alpha.kubernetes.io/nvidia-gpu`.
Then you have to install NVIDIA drivers on the nodes and run a NVIDIA GPU device
Nit: an NVIDIA?
Done.
* You can specify GPU `limits` without specifying `requests` because
Kubernetes will use the limit as the request value by default.
* You can specify GPU in both `limits` and `requests` but these two values
must equal.
Nit: be equal
Done
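For readers skimming the thread, here is a minimal sketch of the rule being discussed: a GPU specified under `limits` only, with the request defaulting to that value. The `nvidia.com/gpu` resource name and the container/image names below are assumptions for illustration, not the doc's exact example.

```yaml
# Sketch only: container-level fragment showing the limits/requests behaviour.
containers:
  - name: gpu-container          # illustrative name
    image: "k8s.gcr.io/pause"    # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1        # the request defaults to this limit
      # requests:
      #   nvidia.com/gpu: 1      # optional; if set explicitly, it must equal the limit
```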
As part of your Node bootstrapping, identify the GPU hardware type on your nodes and expose it as a node label.
Unlike with `alpha.kubernetes.io/nvidia-gpu`, when using `nvidia.com/gpu` as
the resource, you don't have to mount any special directories in your pod
Nit: a resource?
containers:
  -
    name: gpu-container-1
  - name: cuda-vector-add
Can we use an officially supported NVIDIA image?
DIGITS, CUDA, Caffe, or DL images such as TF-GPU?
This is the image which is used in Kubernetes e2e tests to test GPU support.
Here's the Dockerfile for it: https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
It's using nvidia/cuda:8.0-devel-ubuntu16.04 as the base image and other NVIDIA packages.
The reason for using this instead of the base image is that this is a good test of whether things are working or not. If they are not, the pod crash loops.
Added link to Dockerfile from the pod spec.
If there's a better image to test GPU support, I am happy to switch to that, both here and in the e2e tests.
Quoting you :)
This is describing a task ("how to use GPUs") and not a particular feature.
How about using a GPU TensorFlow image in that case?
I don't think that people who want GPU support want to read about the cuda-vector-add image and what it does.
Enabling GPUs is mostly about enabling users to do DL, so let's have a DL example.
Isn't there a limitation with NVIDIA official images that nvidia-docker is mandatory to use and volume injection will not be allowed on NVIDIA images?
Isn't there a limitation with NVIDIA official images that nvidia-docker is mandatory to use and volume injection will not be allowed on NVIDIA images?
No, we don't support running NVIDIA images without nvidia-docker but we don't actively prevent it.
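To make the image discussion above concrete, here is a sketch of the kind of pod spec being reviewed, using the e2e test image and the `nvidia.com/gpu` resource; treat it as illustrative rather than the exact text of the doc.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # Built from the e2e Dockerfile linked above; it crash-loops if GPU access is broken.
      image: "gcr.io/google_containers/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1   # one GPU, exposed by a device plugin
```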
This will ensure that the pod will be scheduled to a node that has a `Tesla K80` or a `Tesla P100` Nvidia GPU.
# Install NVIDIA drivers on Ubuntu:
kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/k8s-1.9/nvidia-driver-installer/ubuntu/daemonset.yaml
Please add that NVIDIA does not officially support installing drivers from a container on Ubuntu and that any support will be handled by the GKE team
This is OSS documentation. I don't think there's any expectation of support without buying a support package from some vendor.
In any case, I added experimental at the top.
OSS doesn't mean people won't open issues on the nvidia-docker repository or the nvidia k8s-device plugin.
If you are going to do and advertise something that is not supported by NVIDIA (because there are known bugs and this is going to break in a lot of edge cases due to driver issues), at the very least you could say that it's not supported by NVIDIA.
Added directions to report issues with each device plugin.
path: /usr/lib/nvidia-375
name: lib
- name: cuda-vector-add
image: "gcr.io/google_containers/cuda-vector-add:v0.1"
Same request as above for the image
See reply above.
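The fragment quoted above is from the v1.6/v1.7-style example, where driver libraries are mounted from the host. A rough sketch of its overall shape is below; the host path and mount point vary by driver installation and are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "gcr.io/google_containers/cuda-vector-add:v0.1"
      resources:
        limits:
          alpha.kubernetes.io/nvidia-gpu: 1   # pre-1.8 alpha resource name
      volumeMounts:
        - name: lib
          mountPath: /usr/local/nvidia/lib64  # placeholder path inside the container
  volumes:
    - name: lib
      hostPath:
        path: /usr/lib/nvidia-375             # driver library location on the host (varies)
```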
Force-pushed from 28bb551 to 7b7fa3a.
Thanks for the review @RenaudWasTaken. Replied to / addressed your comments.
Further, the Kubernetes nodes have to be pre-installed with NVIDIA drivers.
Kubelet will not detect NVIDIA GPUs otherwise.

When you start Kubernetes components after all the above conditions are true,
s/components/components,
any differences?
containers:
  -
    name: gpu-container-1
  - name: cuda-vector-add
Isn't there a limitation with NVIDIA official images that nvidia-docker is mandatory to use and volume injection will not be allowed on NVIDIA images?
for docker instead of runc.
- NVIDIA drivers ~= 361.93

To deploy the NVIDIA device plugin once your cluster is running and above
s/plugin/plugin,
- Support for hardware accelerators is in its early stages in Kubernetes.
- GPUs and other accelerators will soon be a native compute resource across the system.
## Future
- Support for hardware accelerators is Kubernetes is still in alpha.
s/is Kubernetes/in kubernetes
Done.
Force-pushed from 538fedd to d91b39e.
/lgtm
@mindprince Thanks for this doc update, generally LGTM.
Further, the Kubernetes nodes have to be pre-installed with NVIDIA drivers.
Kubelet will not detect NVIDIA GPUs otherwise.

When you start Kubernetes components after all the above conditions are true,
any differences?
{% endcapture %}
**From 1.8 onwards, the recommended way to consume GPUs is to use [device
plugins](/docs/concepts/cluster-administration/device-plugins).**
nit: this link is 404
It is working correctly.
Report issues with this device plugin and installation method to [GoogleCloudPlatform/container-engine-accelerators](https://github.com/GoogleCloudPlatform/container-engine-accelerators)

## Clusters containging different types of NVIDIA GPUs
nit: s/containging/containing
Fixed.
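For context on the heterogeneous-cluster section being fixed above: the approach is to label nodes with their GPU type and select on that label. Here is a sketch, assuming the `accelerator` label convention used in the doc, after labeling nodes first (e.g. `kubectl label nodes <node-name> accelerator=nvidia-tesla-k80`).

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "gcr.io/google_containers/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-k80   # or nvidia-tesla-p100, matching the node label
```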
Force-pushed from eace0b5 to 76332ba.
resource.

You can consume these GPUs from your containers by requesting for
`alpha.kubernetes.io/nvidia-gpu` just like you request `cpu` or `memory`.
requesting for -> requesting
Done
2. A special **alpha** feature gate `Accelerators` has to be set to true across the system: `--feature-gates="Accelerators=true"`.
3. Nodes must be using `docker engine` as the container runtime.
The `Accelerators` feature gate and `alpha.kubernetes.io/nvidia-gpu` resource
works on 1.8 and 1.9 as well. It will be deprecated from 1.10 and removed in
deprecated from -> deprecated in
Done
You can consume these GPUs from your containers by requesting for
`nvidia.com/gpu` just like you request `cpu` or `memory`.
requesting for -> requesting
Done
for docker instead of runc.
- NVIDIA drivers ~= 361.93

To deploy the NVIDIA device plugin once your cluster is running and above
and above -> and the above
Done
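Once those prerequisites are satisfied, deployment is a single DaemonSet in `kube-system`. The skeleton below only illustrates its general shape; the image tag, labels, and exact fields are assumptions, so use the manifest published in the NVIDIA/k8s-device-plugin repository rather than this sketch.

```yaml
apiVersion: extensions/v1beta1              # API group used by 1.9-era add-on manifests
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset      # illustrative name
  namespace: kube-system
spec:
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      containers:
        - name: nvidia-device-plugin-ctr
          image: nvidia/k8s-device-plugin:1.9    # assumed tag; pick the release matching your cluster
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins   # kubelet's device plugin socket directory
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
```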
The [NVIDIA GPU device plugin used by GKE/GCE](https://github.com/GoogleCloudPlatform/container-engine-accelerators/tree/master/cmd/nvidia_gpu)
doesn't require using nvidia-docker and should work with any CRI compatible container
runtime. It's tested on [COS](https://cloud.google.com/container-optimized-os/)
and has experimental code for Ubuntu from 1.9 onwards.
Suggestion: Spell out CRI and rephrase. Maybe "work with any container runtime that is compatible with the Kubernetes Container Runtime Interface."
We should change COS to Container-Optimized OS. Google's branding guidelines say not to use acronyms except GCP and Cloud ML.
Done
### Warning
Report issues with this device plugin to [NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin)
Period at end of this sentence
Done
has the following requirements:
- Kubernetes nodes have to be pre-installed with NVIDIA drivers.
- Kubernetes nodes have to be pre-installed with [nvidia-docker 2.0](https://github.com/NVIDIA/nvidia-docker)
- nvidia-container-runtime configured as the [default runtime](https://github.com/NVIDIA/nvidia-docker/wiki/Advanced-topics#default-runtime)
configured -> must be configured
Done
- Kubernetes nodes have to be pre-installed with NVIDIA drivers.
- Kubernetes nodes have to be pre-installed with [nvidia-docker 2.0](https://github.com/NVIDIA/nvidia-docker)
- nvidia-container-runtime configured as the [default runtime](https://github.com/NVIDIA/nvidia-docker/wiki/Advanced-topics#default-runtime)
for docker instead of runc.
docker -> Docker
As of now, CUDA libraries are expected to be pre-installed on the nodes.
On your 1.9 cluster, you can use the following commands to install NVIDIA drivers and device plugin:
install NVIDIA -> install the NVIDIA
Done
kubectl create -f https://raw.githubusercontent.com/kubernetes/kubernetes/release-1.9/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml
```

Report issues with this device plugin and installation method to [GoogleCloudPlatform/container-engine-accelerators](https://github.com/GoogleCloudPlatform/container-engine-accelerators)
Period at end of this sentence.
Done.
@mindprince This looks good. I put a few comments in the text. Otherwise docs lgtm.
Force-pushed from 76332ba to 3e8d132.
Thanks for the review @steveperry-53! Updated the PR with the fixes.
@mindprince Are you ready for me to merge this?
Yes.
Current doc: https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/
Updated doc: https://deploy-preview-6736--kubernetes-io-master-staging.netlify.com/docs/tasks/manage-gpus/scheduling-gpus/