Add new Dynamic Resource Allocation examples #49079
base: main
Changes from 1 commit

@@ -0,0 +1,70 @@
---
title: Assign Resources to Containers and Pods with Dynamic Resource Allocation
content_type: task
weight: 270
---

Review comment on the title (suggested change):
-title: Assign Resources to Containers and Pods with Dynamic Resource Allocation
+title: Learn About Dynamic Resource Allocation ?

<!-- overview -->

{{< feature-state feature_gate_name="DynamicResourceAllocation" >}}

This page shows how to assign resources defined with the Dynamic Resource
Allocation (DRA) APIs to containers.

## {{% heading "prerequisites" %}}

- `kind`
- `kubectl`
- `helm`

<!-- steps -->

## Deploy an example DRA driver

- Reproduce the steps from https://github.com/kubernetes-sigs/dra-example-driver?tab=readme-ov-file#demo to create a cluster and install the driver

Review comment: early feedback: we avoid sending people to GitHub repos to discover parts of the documentation.

Reply: When I fill out this section, I was intending to essentially copy-paste some of the steps from the linked doc into this one, so here the steps would be things like "run this script" instead of "follow the steps in this linked document." Is that in line with your suggestion here?

- Show DeviceClass (see the sketch after this list)
- Show ResourceSlice for a Node
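
For illustration, the DeviceClass installed by the example driver might look roughly like the sketch below. This is a sketch only: the class name, driver name, and selector expression are assumptions and may differ between driver versions.

```yaml
# Hypothetical sketch of a DeviceClass; the names and the CEL expression are
# assumptions based on the example driver and may not match your installation.
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: gpu.example.com
spec:
  selectors:
  - cel:
      expression: 'device.driver == "gpu.example.com"'
```

ResourceSlices are published by the driver for each Node it manages and list the devices that can be allocated; you inspect them rather than create them.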

## Allocate one device for a container

- Create a ResourceClaim requesting one device
- Create a Pod with one container referencing the ResourceClaim
- Show that the ResourceClaim is allocated (a sketch of the manifests follows this list)
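
A minimal sketch of the manifests this step could use, assuming the example driver's DeviceClass; the object names here are hypothetical.

```yaml
# Hypothetical manifests: the names and deviceClassName are assumptions.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: single-gpu-claim
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.example.com
---
apiVersion: v1
kind: Pod
metadata:
  name: single-gpu-pod
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c", "sleep 9999"]
    resources:
      claims:
      - name: gpu            # refers to an entry in spec.resourceClaims below
  resourceClaims:
  - name: gpu
    resourceClaimName: single-gpu-claim
```

Once the Pod is scheduled, the ResourceClaim should show as allocated and reserved for that Pod.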

## Allocate one device to be shared among multiple Pods

- Same as first example, with multiple Pods referencing the same
  ResourceClaim (sketched below)
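
A sketch of the shared case, reusing the hypothetical `single-gpu-claim` from the previous example: both Pods reference the same ResourceClaim by name, so they share the single allocated device and are scheduled onto the Node where it is available.

```yaml
# Hypothetical Pods sharing one ResourceClaim; names are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-pod-0
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c", "sleep 9999"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimName: single-gpu-claim   # same claim as the previous example
---
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-pod-1
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c", "sleep 9999"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimName: single-gpu-claim   # shared with shared-gpu-pod-0
```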

## Allocate one device per replica of a Deployment

- Same as first example, using a Deployment with several replicas whose Pod
  template references a ResourceClaimTemplate (see the sketch below)
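
A sketch of this setup, again assuming the example driver's DeviceClass; the names are hypothetical. Because the Pod template references a ResourceClaimTemplate rather than a ResourceClaim, each replica gets its own generated ResourceClaim and therefore its own device.

```yaml
# Hypothetical manifests; names and deviceClassName are assumptions.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: gpu-claim-template
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.example.com
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-deploy
spec:
  replicas: 3
  selector:
    matchLabels:
      app: gpu-deploy
  template:
    metadata:
      labels:
        app: gpu-deploy
    spec:
      containers:
      - name: ctr
        image: ubuntu:22.04
        command: ["bash", "-c", "sleep 9999"]
        resources:
          claims:
          - name: gpu
      resourceClaims:
      - name: gpu
        resourceClaimTemplateName: gpu-claim-template
```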

- Show how several ResourceClaims are generated based on the one
  ResourceClaimTemplate

- Scale the Deployment beyond the number of available devices
- Show the unallocatable ResourceClaims

## Clean up

- Delete the kind cluster

## {{% heading "whatsnext" %}}

### For workload administrators

* [Schedule GPUs](/docs/tasks/manage-gpus/scheduling-gpus/)

### For device driver authors

* [Example Resource Driver for Dynamic Resource Allocation](https://github.com/kubernetes-sigs/dra-example-driver)

@@ -0,0 +1,5 @@
---
title: "Dynamic Resource Allocation"
weight: 80
---

@@ -0,0 +1,163 @@
---
title: Comparing Dynamic Resource Allocation to Device Plugins
content_type: tutorial
weight: 10
---

<!-- overview -->

Both Dynamic Resource Allocation (DRA) and device plugins enable Kubernetes
workloads to use specialized hardware from various vendors. This tutorial
shows how to configure the same GPU-enabled workload with DRA and with device
plugins to illustrate the differences between the two sets of APIs.

## {{% heading "objectives" %}}

* Learn when to prefer device plugins or DRA for configuring containers'
  requests for devices.

## {{% heading "prerequisites" %}}

* An NVIDIA GPU-enabled cluster with GPU Operator installed

Review comment: Something vendor-specific like this might not fit well in these docs, but NVIDIA GPUs are probably the most common DRA use case right now and NVIDIA's device plugin and DRA driver make for the most apples-to-apples comparison of these APIs at the moment I think.

Review comment: Can we use tabs to let people take part even with different vendors?

Reply: I think that would work well if we can map the same use cases 1:1 across vendors, like if AMD eventually exposes a similar GPU and "security/network isolation" device like NVIDIA's IMEX channels. Different use cases might be better expressed as separate sections though, but we can definitely play around with how those look when we can produce more examples here.

* `kubectl`
* `helm`

<!-- lessoncontent -->

## Deploy a workload using GPUs configured via a device plugin

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: device-plugin-deploy
  labels:
    app: device-plugin
spec:
  replicas: 1
  selector:
    matchLabels:
      app: device-plugin
  template:
    metadata:
      labels:
        app: device-plugin
    spec:
      containers:
      - name: ctr
        image: ubuntu:22.04
        command: ["bash", "-c"]
        args: ["export; trap 'exit 0' TERM; sleep 9999 & wait"]
        resources:
          limits:
            nvidia.com/gpu: 1
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - device-plugin
            topologyKey: nvidia.com/gpu.imex-domain
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: NotIn
                values:
                - device-plugin
            topologyKey: nvidia.com/gpu.imex-domain
```

- GPU resources are specified in the container's `resources.limits` and
  `resources.requests`
- `podAffinity` keeps this Deployment's Pods together on Nodes within the same
  IMEX domain
- `podAntiAffinity` ensures this Deployment's Pods will all run in an IMEX
  domain that is not shared with Pods from other applications

## Deploy a workload using GPUs configured via DRA

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: test-gpu-claim
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: test-imex-claim
spec:
  devices:
    requests:
    - name: imex
      deviceClassName: imex.nvidia.com
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dra-deploy
  labels:
    app: dra
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dra
  template:
    metadata:
      labels:
        app: dra
    spec:
      containers:
      - name: ctr
        image: ubuntu:22.04
        command: ["bash", "-c"]
        args: ["export; trap 'exit 0' TERM; sleep 9999 & wait"]
        resources:
          claims:
          - name: imex
          - name: gpu
      resourceClaims:
      - name: imex
        resourceClaimName: test-imex-claim
      - name: gpu
        resourceClaimTemplateName: test-gpu-claim
```

- GPU resources are specified in the container's `resources.claims`, which in
  this example maps to a ResourceClaimTemplate
- All of the Deployment's Pods share a single ResourceClaim for one distinct
  NVIDIA IMEX channel. This ensures that all of these Pods run within the same
  IMEX domain and that other Pods will not run in that IMEX domain without also
  referring to the same ResourceClaim.

## Clean up

## Conclusion

### Reasons to prefer device plugins

### Reasons to prefer DRA

## {{% heading "whatsnext" %}}

* Learn more about [Device Plugins](/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/)
* Learn more about [Dynamic Resource Allocation](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/)
* See more examples of how to [Assign Resources to Containers and Pods with Dynamic Resource Allocation](/docs/tasks/configure-pod-container/assign-dra-resource/)
Review comment: Early feedback: this sounds like it belongs in the Tutorials section, not Tasks.
Why:

Review comment: There may be specific other tasks to cover, though.
For example:
All of those are questions we could cover with a task page (typically a task page per question, though sometimes we combine them).

Reply: Taking another look at the example driver, I'm thinking it's likely we can describe workable examples here given "a cluster with DRA enabled" without requiring the exact kind cluster the driver's docs describe. I forgot that the example driver doesn't publish the Helm chart anywhere though, so it needs to be built locally. If we can publish that chart somewhere publicly like GitHub and we find that any DRA-enabled cluster works, do you think that would simplify the setup enough to justify keeping this as a Task?

+1 to those other topics, I think those would be great to include. I'll add placeholders for those.

Review comment: Nope. I will never meet a cluster admin who wants guidance on setting up the example driver in their existing production cluster.
If you'd never do it outside of learning context, it's unlikely to be a task.

Reply: Sounds good, I've split out the examples requiring the example driver into a new tutorial.