Rewrote doc on GPU support in Kubernetes. #6736
Conversation
Deploy preview ready! Built with commit 3e8d132 https://deploy-preview-6736--kubernetes-io-master-staging.netlify.com
Force-pushed from 2e4032c to 9dc9983.
Force-pushed from f73dfc0 to 28bb551.
@mindprince How do you want to solve the conflicts with #6722?
### Deploying NVIDIA GPU device plugin

There are currently two device plugin implementations for NVIDIA GPUs:
"two device plugin implementations" → "two sample device plugin implementations"?
To me, "sample" implies that they are not ready to be used as is and require some modification. However, both these implementations are ready to be used as-is.
multiple backwards incompatible iterations. This page describes how users can
consume GPUs across different Kubernetes versions and the current limitations.

## v1.6 and v1.7
Can we move this section to the end and have the current version be the first one?
We could add a summary :)
It flows better this way. I made the recommendation to use device plugins bold.
Mostly lgtm, will close my PR.
Three points though:
- Can we have the device plugin documentation first + summary?
- I'd rather we use an official NVIDIA supported image
- Please add a disclaimer that NVIDIA does not support installing the drivers from a container on Ubuntu and that any questions or bugs should be handled by the GKE team.
@@ -4,154 +4,203 @@ approvers:
title: Schedule GPUs
---
Can we mark this as alpha?
i.e.: +{% include feature-state-alpha.md %}
This is describing a task ("how to use GPUs") and not a particular feature. The device plugin page linked from here already includes this. On this page, we have mentioned multiple times that this is alpha.
Sounds good
Nvidia GPUs can be consumed via container level resource requirements using the resource name `alpha.kubernetes.io/nvidia-gpu`.
Then you have to install NVIDIA drivers on the nodes and run a NVIDIA GPU device
Nit: an NVIDIA?
Done.
* You can specify GPU `limits` without specifying `requests` because
Kubernetes will use the limit as the request value by default.
* You can specify GPU in both `limits` and `requests` but these two values
must equal.
Nit: be equal
Done
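For readers skimming the thread, here is a minimal sketch of the rule being discussed: a GPU specified under `limits` only, with the request defaulting to that value. The `nvidia.com/gpu` resource name and the container/image names below are assumptions for illustration, not the doc's exact example.

```yaml
# Sketch only: container-level fragment showing the limits/requests behaviour.
containers:
  - name: gpu-container          # illustrative name
    image: "k8s.gcr.io/pause"    # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1        # the request defaults to this limit
      # requests:
      #   nvidia.com/gpu: 1      # optional; if set explicitly, it must equal the limit
```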
As part of your Node bootstrapping, identify the GPU hardware type on your nodes and expose it as a node label.
Unlike with `alpha.kubernetes.io/nvidia-gpu`, when using `nvidia.com/gpu` as
the resource, you don't have to mount any special directories in your pod
Nit: a resource?
containers:
  -
    name: gpu-container-1
  - name: cuda-vector-add
Can we use an officially supported NVIDIA image?
DIGITS, CUDA, Caffe, or DL images such as TF-GPU?
This is the image which is used in Kubernetes e2e tests to test GPU support.
Here's the Dockerfile for it: https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
It's using nvidia/cuda:8.0-devel-ubuntu16.04 as the base image and other NVIDIA packages.
The reason for using this instead of the base image is that this is a good test of whether things are working or not. If they are not, the pod crash loops.
Added link to Dockerfile from the pod spec.
If there's a better image to test GPU support, I am happy to switch to that, both here and in the e2e tests.
Quoting you :)
This is describing a task ("how to use GPUs") and not a particular feature.
How about using a GPU TensorFlow image in that case?
I don't think that people who want GPU support want to read about the cuda-vector-add image and what it does.
Enabling GPUs is mostly about enabling users to do DL, so let's have a DL example.
Isn't there a limitation with NVIDIA official images that nvidia-docker is mandatory to use and volume injection will not be allowed on NVIDIA images?
Isn't there a limitation with NVIDIA official images that nvidia-docker is mandatory to use and volume injection will not be allowed on NVIDIA images?
No, we don't support running NVIDIA images without nvidia-docker but we don't actively prevent it.
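To make the image discussion above concrete, here is a sketch of the kind of pod spec being reviewed, using the e2e test image and the `nvidia.com/gpu` resource; treat it as illustrative rather than the exact text of the doc.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # Built from the e2e Dockerfile linked above; it crash-loops if GPU access is broken.
      image: "gcr.io/google_containers/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1   # one GPU, exposed by a device plugin
```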
This will ensure that the pod will be scheduled to a node that has a `Tesla K80` or a `Tesla P100` Nvidia GPU.
# Install NVIDIA drivers on Ubuntu:
kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/k8s-1.9/nvidia-driver-installer/ubuntu/daemonset.yaml
Please add that NVIDIA does not officially support installing drivers from a container on Ubuntu and that any support will be handled by the GKE team
This is OSS documentation. I don't think there's any expectation of support without buying a support package from some vendor.
In any case, I added experimental at the top.
OSS doesn't mean people won't open issues on the nvidia-docker repository or the nvidia k8s-device plugin.
If you are going to do and advertise something that is not supported by NVIDIA (because there are known bugs and this is going to break in a lot of edge cases due to driver issues), at the very least you could say that it's not supported by NVIDIA.
Added directions to report issues with each device plugin.
path: /usr/lib/nvidia-375
name: lib
- name: cuda-vector-add
image: "gcr.io/google_containers/cuda-vector-add:v0.1"
Same request as above for the image
See reply above.
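The fragment quoted above is from the v1.6/v1.7-style example, where driver libraries are mounted from the host. A rough sketch of its overall shape is below; the host path and mount point vary by driver installation and are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "gcr.io/google_containers/cuda-vector-add:v0.1"
      resources:
        limits:
          alpha.kubernetes.io/nvidia-gpu: 1   # pre-1.8 alpha resource name
      volumeMounts:
        - name: lib
          mountPath: /usr/local/nvidia/lib64  # placeholder path inside the container
  volumes:
    - name: lib
      hostPath:
        path: /usr/lib/nvidia-375             # driver library location on the host (varies)
```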
Force-pushed from 28bb551 to 7b7fa3a.
Thanks for the review @RenaudWasTaken. Replied to / addressed your comments.
Further, the Kubernetes nodes have to be pre-installed with NVIDIA drivers.
Kubelet will not detect NVIDIA GPUs otherwise.

When you start Kubernetes components after all the above conditions are true,
s/components/components,
any differences?
containers:
  -
    name: gpu-container-1
  - name: cuda-vector-add
Isn't there a limitation with NVIDIA official images that nvidia-docker is mandatory to use and volume injection will not be allowed on NVIDIA images?
for docker instead of runc.
- NVIDIA drivers ~= 361.93

To deploy the NVIDIA device plugin once your cluster is running and above
s/plugin/plugin,
- Support for hardware accelerators is in its early stages in Kubernetes.
- GPUs and other accelerators will soon be a native compute resource across the system.
## Future
- Support for hardware accelerators is Kubernetes is still in alpha.
s/is Kubernetes/in kubernetes
Done.
Force-pushed from 538fedd to d91b39e.
/lgtm
@mindprince Thanks for this doc update, generally LGTM.
Further, the Kubernetes nodes have to be pre-installed with NVIDIA drivers.
Kubelet will not detect NVIDIA GPUs otherwise.

When you start Kubernetes components after all the above conditions are true,
any differences?
{% endcapture %}
**From 1.8 onwards, the recommended way to consume GPUs is to use [device
plugins](/docs/concepts/cluster-administration/device-plugins).**
nit: this link is 404
It is working correctly.
Report issues with this device plugin and installation method to [GoogleCloudPlatform/container-engine-accelerators](https://github.com/GoogleCloudPlatform/container-engine-accelerators)

## Clusters containging different types of NVIDIA GPUs
nit: s/containging/containing
Fixed.
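For context on the heterogeneous-cluster section being fixed above: the approach is to label nodes with their GPU type and select on that label. Here is a sketch, assuming the `accelerator` label convention used in the doc, after labeling nodes first (e.g. `kubectl label nodes <node-name> accelerator=nvidia-tesla-k80`).

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "gcr.io/google_containers/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-k80   # or nvidia-tesla-p100, matching the node label
```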
Force-pushed from eace0b5 to 76332ba.
resource.

You can consume these GPUs from your containers by requesting for
`alpha.kubernetes.io/nvidia-gpu` just like you request `cpu` or `memory`.
requesting for -> requesting
Done
2. A special **alpha** feature gate `Accelerators` has to be set to true across the system: `--feature-gates="Accelerators=true"`.
3. Nodes must be using `docker engine` as the container runtime.
The `Accelerators` feature gate and `alpha.kubernetes.io/nvidia-gpu` resource
works on 1.8 and 1.9 as well. It will be deprecated from 1.10 and removed in
deprecated from -> deprecated in
Done
You can consume these GPUs from your containers by requesting for
`nvidia.com/gpu` just like you request `cpu` or `memory`.
requesting for -> requesting
Done
for docker instead of runc.
- NVIDIA drivers ~= 361.93

To deploy the NVIDIA device plugin once your cluster is running and above
and above -> and the above
Done
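Once those prerequisites are satisfied, deployment is a single DaemonSet in `kube-system`. The skeleton below only illustrates its general shape; the image tag, labels, and exact fields are assumptions, so use the manifest published in the NVIDIA/k8s-device-plugin repository rather than this sketch.

```yaml
apiVersion: extensions/v1beta1              # API group used by 1.9-era add-on manifests
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset      # illustrative name
  namespace: kube-system
spec:
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      containers:
        - name: nvidia-device-plugin-ctr
          image: nvidia/k8s-device-plugin:1.9    # assumed tag; pick the release matching your cluster
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins   # kubelet's device plugin socket directory
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
```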
The [NVIDIA GPU device plugin used by GKE/GCE](https://github.com/GoogleCloudPlatform/container-engine-accelerators/tree/master/cmd/nvidia_gpu)
doesn't require using nvidia-docker and should work with any CRI compatible container
runtime. It's tested on [COS](https://cloud.google.com/container-optimized-os/)
and has experimental code for Ubuntu from 1.9 onwards.
Suggestion: Spell out CRI and rephrase. Maybe "work with any container runtime that is compatible with the Kubernetes Container Runtime Interface."
We should change COS to Container-Optimized OS. Google's branding guidelines say not to use acronyms except GCP and Cloud ML.
Done
### Warning
Report issues with this device plugin to [NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin)
Period at end of this sentence
Done
has the following requirements:
- Kubernetes nodes have to be pre-installed with NVIDIA drivers.
- Kubernetes nodes have to be pre-installed with [nvidia-docker 2.0](https://github.com/NVIDIA/nvidia-docker)
- nvidia-container-runtime configured as the [default runtime](https://github.com/NVIDIA/nvidia-docker/wiki/Advanced-topics#default-runtime)
configured -> must be configured
Done
- Kubernetes nodes have to be pre-installed with NVIDIA drivers.
- Kubernetes nodes have to be pre-installed with [nvidia-docker 2.0](https://github.com/NVIDIA/nvidia-docker)
- nvidia-container-runtime configured as the [default runtime](https://github.com/NVIDIA/nvidia-docker/wiki/Advanced-topics#default-runtime)
for docker instead of runc.
docker -> Docker
As of now, CUDA libraries are expected to be pre-installed on the nodes.
On your 1.9 cluster, you can use the following commands to install NVIDIA drivers and device plugin:
install NVIDIA -> install the NVIDIA
Done
kubectl create -f https://raw.githubusercontent.com/kubernetes/kubernetes/release-1.9/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml
```

Report issues with this device plugin and installation method to [GoogleCloudPlatform/container-engine-accelerators](https://github.com/GoogleCloudPlatform/container-engine-accelerators)
Period at end of this sentence.
Done.
@mindprince This looks good. I put a few comments in the text. Otherwise docs lgtm.
Force-pushed from 76332ba to 3e8d132.
Thanks for the review @steveperry-53! Updated the PR with the fixes.
@mindprince Are you ready for me to merge this?
Yes.
Current doc: https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/
Updated doc: https://deploy-preview-6736--kubernetes-io-master-staging.netlify.com/docs/tasks/manage-gpus/scheduling-gpus/