
Rewrote doc on GPU support in Kubernetes. #6736

Merged: 1 commit into kubernetes:master on Dec 28, 2017

Conversation

@rohitagarwal003 (Member) commented Dec 22, 2017

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 22, 2017
@k8sio-netlify-preview-bot (Collaborator) commented Dec 22, 2017

Deploy preview ready!

Built with commit 3e8d132

https://deploy-preview-6736--kubernetes-io-master-staging.netlify.com

@rohitagarwal003 rohitagarwal003 force-pushed the update-accelerator branch 2 times, most recently from 2e4032c to 9dc9983 Compare December 22, 2017 02:22
@rohitagarwal003 (Member, Author):
/assign @jiayingz @vishh

@tengqm (Contributor) left a comment:

@mindprince How do you want to solve the conflicts with #6722?


### Deploying NVIDIA GPU device plugin

There are currently two device plugin implementations for NVIDIA GPUs:
Contributor:

two device plugin implementations

two sample device plugin implementations ?

@rohitagarwal003 (Member, Author):

To me, "sample" implies that they are not ready to be used as is and require some modification. However, both these implementations are ready to be used as-is.

multiple backwards incompatible iterations. This page describes how users can
consume GPUs across different Kubernetes versions and the current limitations.

## v1.6 and v1.7


Can we move this section at the end and have the current version be the first one ?
We could add a summary :)

@rohitagarwal003 (Member, Author):

It flows better this way. I made the recommendation to use device plugins bold.

@RenaudWasTaken left a comment:

Mostly LGTM; will close my PR.
Three points though:

  • Can we have the device plugin documentation first + summary?
  • I'd rather we use an official NVIDIA supported image
  • Please add a disclaimer that NVIDIA does not support installing the drivers from a container on Ubuntu and that any questions or bugs should be handled by the GKE team

@@ -4,154 +4,203 @@ approvers:
title: Schedule GPUs
---


Can we mark this as alpha ?
i.e: +{% include feature-state-alpha.md %}

@rohitagarwal003 (Member, Author):

This is describing a task ("how to use GPUs") and not a particular feature. The device plugin page linked from includes this already. In this page, we have mentioned multiple times that this is alpha etc.


Sounds good


Nvidia GPUs can be consumed via container level resource requirements using the resource name `alpha.kubernetes.io/nvidia-gpu`.
Then you have to install NVIDIA drivers on the nodes and run a NVIDIA GPU device


Nit: an NVIDIA ?

@rohitagarwal003 (Member, Author):

Done.

* You can specify GPU `limits` without specifying `requests` because
Kubernetes will use the limit as the request value by default.
* You can specify GPU in both `limits` and `requests` but these two values
must equal.


Nit: be equal

@rohitagarwal003 (Member, Author):

Done
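
(For reference, a minimal sketch of the two rules above, showing only a container's `resources` stanza with the alpha resource name used in this section; the quantities are illustrative:)

```yaml
# Valid: limits only; Kubernetes uses the limit as the request.
resources:
  limits:
    alpha.kubernetes.io/nvidia-gpu: 2
---
# Also valid: both set, but the two values must be equal.
resources:
  requests:
    alpha.kubernetes.io/nvidia-gpu: 2
  limits:
    alpha.kubernetes.io/nvidia-gpu: 2
```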


As part of your Node bootstrapping, identify the GPU hardware type on your nodes and expose it as a node label.
Unlike with `alpha.kubernetes.io/nvidia-gpu`, when using `nvidia.com/gpu` as
the resource, you don't have to mount any special directories in your pod


Nit: a resource ?
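
(A hedged sketch of that bootstrapping step; the node name is hypothetical, and the `accelerator` label key is the one the Tesla K80/P100 scheduling example on this page keys off:)

```shell
# Label a node (here the hypothetical node-1) with the type of GPU it carries.
kubectl label nodes node-1 accelerator=nvidia-tesla-k80
```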

containers:
-
name: gpu-container-1
- name: cuda-vector-add


Can we use an officially supported nvidia image ?
Digits, CUDA, CAFFE or DL images such as TF-GPU ?

@rohitagarwal003 (Member, Author):

This is the image which is used in Kubernetes e2e tests to test GPU support.

Here's the Dockerfile for it: https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile

It uses nvidia/cuda:8.0-devel-ubuntu16.04 as the base image, plus other NVIDIA packages.

The reason for using this instead of the base image is that this is a good test of whether things are working or not. If they are not, the pod crash loops.

Added link to Dockerfile from the pod spec.

@rohitagarwal003 (Member, Author):

If there's a better image to test GPU support, I am happy to switch to that both here and in the e2e tests.


Quoting you :)

This is describing a task ("how to use GPUs") and not a particular feature.

How about giving access to a GPU tensorflow image in that case.
I don't think that people who want GPU support want to read about the cuda-vector-add image and what it does.

Enabling GPUs is mostly about enabling users to do DL, so let's have a DL example.

Contributor:

Isn't there a limitation with NVIDIA official images, in that nvidia-docker is mandatory to use and volume injection will not be allowed on NVIDIA images?

kubernetes/kubernetes#54011 (comment)


Isn't there a limitation with NVIDIA official images, in that nvidia-docker is mandatory to use and volume injection will not be allowed on NVIDIA images?

No, we don't support running NVIDIA images without nvidia-docker, but we don't actively prevent it.


This will ensure that the pod will be scheduled to a node that has a `Tesla K80` or a `Tesla P100` Nvidia GPU.
# Install NVIDIA drivers on Ubuntu:
kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/k8s-1.9/nvidia-driver-installer/ubuntu/daemonset.yaml


Please add that NVIDIA does not officially support installing drivers from a container on Ubuntu and that any support will be handled by the GKE team

@rohitagarwal003 (Member, Author):

This is OSS documentation. I don't think there's any expectation of support without buying a support package from some vendor.

In any case, I added experimental at the top.


OSS doesn't mean people won't open issues on the nvidia-docker repository or the nvidia k8s-device plugin.

If you are going to do and advertise something that is not supported by NVIDIA (because there are known bugs and this is going to break in a lot of edge cases due to driver issues), at the very least you could say that it's not supported by NVIDIA.

@rohitagarwal003 (Member, Author):

Added directions to report issues with each device plugin.
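
(To make the sentence at the top of this hunk concrete, a minimal sketch of a pod pinned to a GPU type via `nodeSelector`, assuming the nodes carry the `accelerator` label from the bootstrapping step:)

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "gcr.io/google_containers/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-p100   # or nvidia-tesla-k80, matching the node label
```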

path: /usr/lib/nvidia-375
name: lib
- name: cuda-vector-add
image: "gcr.io/google_containers/cuda-vector-add:v0.1"


Same request as above for the image

@rohitagarwal003 (Member, Author):

See reply above.
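
(Pulling the surrounding hunks together, a hedged sketch of the full v1.6/v1.7-style pod: the image and the `/usr/lib/nvidia-375` host path appear in the diff above, while the in-container mount path is an illustrative assumption:)

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # Dockerfile: https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "gcr.io/google_containers/cuda-vector-add:v0.1"
      resources:
        limits:
          alpha.kubernetes.io/nvidia-gpu: 1
      volumeMounts:
        - name: lib
          mountPath: /usr/local/nvidia/lib64   # assumed mount point for the host's driver libraries
  volumes:
    - name: lib
      hostPath:
        path: /usr/lib/nvidia-375   # driver libraries pre-installed on the node
```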

@rohitagarwal003 (Member, Author):

Thanks for the review @RenaudWasTaken. Replied/Addressed your comments.

Further, the Kubernetes nodes have to be pre-installed with NVIDIA drivers.
Kubelet will not detect NVIDIA GPUs otherwise.

When you start Kubernetes components after all the above conditions are true,
Contributor:

s/components/components,


any differences?
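
(A quick, hedged way to verify that step, assuming kubectl access to the cluster:)

```shell
# After the components restart, nodes should advertise GPUs under
# Capacity/Allocatable: alpha.kubernetes.io/nvidia-gpu on 1.6/1.7,
# nvidia.com/gpu on 1.8+ with a device plugin.
kubectl describe nodes | grep -i nvidia
```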

containers:
-
name: gpu-container-1
- name: cuda-vector-add
Contributor:

Isn't there a limitation with NVIDIA official images, in that nvidia-docker is mandatory to use and volume injection will not be allowed on NVIDIA images?

kubernetes/kubernetes#54011 (comment)

for docker instead of runc.
- NVIDIA drivers ~= 361.93

To deploy the NVIDIA device plugin once your cluster is running and above
Contributor:

s/plugin/plugin,
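
(For completeness, the deploy step introduced here is a single manifest. The URL below is the v1.9-era path suggested by the NVIDIA/k8s-device-plugin repository; treat it as an assumption and check that repository's README for the current path:)

```shell
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.9/nvidia-device-plugin.yml
```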

- Support for hardware accelerators is in its early stages in Kubernetes.
- GPUs and other accelerators will soon be a native compute resource across the system.
## Future
- Support for hardware accelerators is Kubernetes is still in alpha.
Contributor:

s/is Kubernetes/in kubernetes

@rohitagarwal003 (Member, Author):

Done.

@jiayingz (Contributor):

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 27, 2017
@ScorpioCPH left a comment:

@mindprince Thanks for this doc update, generally LGTM.

Further, the Kubernetes nodes have to be pre-installed with NVIDIA drivers.
Kubelet will not detect NVIDIA GPUs otherwise.

When you start Kubernetes components after all the above conditions are true,


any differences?


{% endcapture %}
**From 1.8 onwards, the recommended way to consume GPUs is to use [device
plugins](/docs/concepts/cluster-administration/device-plugins).**


nit: this link is 404

@rohitagarwal003 (Member, Author):

It is working correctly.


Report issues with this device plugin and installation method to [GoogleCloudPlatform/container-engine-accelerators](https://github.com/GoogleCloudPlatform/container-engine-accelerators)

## Clusters containging different types of NVIDIA GPUs


nit: s/containging/containing

@rohitagarwal003 (Member, Author):

Fixed.

resource.

You can consume these GPUs from your containers by requesting for
`alpha.kubernetes.io/nvidia-gpu` just like you request `cpu` or `memory`.
Contributor:

requesting for -> requesting

@rohitagarwal003 (Member, Author):

Done

2. A special **alpha** feature gate `Accelerators` has to be set to true across the system: `--feature-gates="Accelerators=true"`.
3. Nodes must be using `docker engine` as the container runtime.
The `Accelerators` feature gate and `alpha.kubernetes.io/nvidia-gpu` resource
works on 1.8 and 1.9 as well. It will be deprecated from 1.10 and removed in
Contributor:

deprecated from -> deprecated in

@rohitagarwal003 (Member, Author):

Done
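
(A sketch of what "across the system" means in practice: the same gate goes on each component's command line, shown here for the kubelet:)

```shell
# Repeat for kube-apiserver, kube-controller-manager and kube-scheduler.
kubelet --feature-gates="Accelerators=true"
```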


You can consume these GPUs from your containers by requesting for
`nvidia.com/gpu` just like you request `cpu` or `memory`.
Contributor:

requesting for -> requesting

@rohitagarwal003 (Member, Author):

Done
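
(A fragment-level sketch mirroring that sentence: the container's `resources` stanza with `nvidia.com/gpu` next to `cpu` and `memory`; the cpu/memory quantities are illustrative:)

```yaml
resources:
  limits:
    cpu: "1"
    memory: 512Mi
    nvidia.com/gpu: 1   # consumed just like cpu or memory
```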

for docker instead of runc.
- NVIDIA drivers ~= 361.93

To deploy the NVIDIA device plugin once your cluster is running and above
Contributor:

and above -> and the above

@rohitagarwal003 (Member, Author):

Done

The [NVIDIA GPU device plugin used by GKE/GCE](https://github.com/GoogleCloudPlatform/container-engine-accelerators/tree/master/cmd/nvidia_gpu)
doesn't require using nvidia-docker and should work with any CRI compatible container
runtime. It's tested on [COS](https://cloud.google.com/container-optimized-os/)
and has experimental code for Ubuntu from 1.9 onwards.

Contributor:

Suggestion: Spell out CRI and rephrase. Maybe "work with any container runtime that is compatible with the Kubernetes Container Runtime Interface."

We should change COS to Container-Optimized OS. Google's branding guidelines say not to use acronyms except GCP and Cloud ML.

@rohitagarwal003 (Member, Author):

Done


### Warning
Report issues with this device plugin to [NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin)
Contributor:

Period at end of this sentence

@rohitagarwal003 (Member, Author):

Done

has the following requirements:
- Kubernetes nodes have to be pre-installed with NVIDIA drivers.
- Kubernetes nodes have to be pre-installed with [nvidia-docker 2.0](https://github.com/NVIDIA/nvidia-docker)
- nvidia-container-runtime configured as the [default runtime](https://github.com/NVIDIA/nvidia-docker/wiki/Advanced-topics#default-runtime)
Contributor:

configured -> must be configured

@rohitagarwal003 (Member, Author):

Done

- Kubernetes nodes have to be pre-installed with NVIDIA drivers.
- Kubernetes nodes have to be pre-installed with [nvidia-docker 2.0](https://github.com/NVIDIA/nvidia-docker)
- nvidia-container-runtime configured as the [default runtime](https://github.com/NVIDIA/nvidia-docker/wiki/Advanced-topics#default-runtime)
for docker instead of runc.
Contributor:

docker -> Docker
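
(For readers following the "default runtime" link: a sketch of the /etc/docker/daemon.json that the nvidia-docker 2.0 wiki of that era describes; verify the exact keys against the linked page:)

```json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```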


As of now, CUDA libraries are expected to be pre-installed on the nodes.
On your 1.9 cluster, you can use the following commands to install NVIDIA drivers and device plugin:
Contributor:

install NVIDIA -> install the NVIDIA

@rohitagarwal003 (Member, Author):

Done

kubectl create -f https://raw.githubusercontent.com/kubernetes/kubernetes/release-1.9/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml
```

Report issues with this device plugin and installation method to [GoogleCloudPlatform/container-engine-accelerators](https://github.com/GoogleCloudPlatform/container-engine-accelerators)
Contributor:

Period at end of this sentence.

@rohitagarwal003 (Member, Author):

Done.

@steveperry-53 (Contributor):

@mindprince This looks good. I put a few comments in the text. Otherwise docs lgtm.

@rohitagarwal003 (Member, Author):

Thanks for the review @steveperry-53! Updated the PR with the fixes.

@steveperry-53 (Contributor):

@mindprince Are you ready for me to merge this?

@steveperry-53 steveperry-53 merged commit 9b99280 into kubernetes:master Dec 28, 2017
@rohitagarwal003 (Member, Author):

Yes.
