Add new Dynamic Resource Allocation examples #49079
base: main
## {{% heading "prerequisites" %}}

* An NVIDIA GPU-enabled cluster with GPU Operator installed
Something vendor-specific like this might not fit well in these docs, but NVIDIA GPUs are probably the most common DRA use case right now, and I think NVIDIA's device plugin and DRA driver currently make for the most apples-to-apples comparison of these APIs.
Can we use tabs to let people take part even with different vendors?
I think that would work well if we can map the same use cases 1:1 across vendors, for example if AMD eventually exposes a similar GPU and a "security/network isolation" device like NVIDIA's IMEX channels. Different use cases might be better expressed as separate sections, but we can definitely play around with how those look once we can produce more examples here.
Thanks! A tutorial will really help people learn.
Early feedback: this sounds like it belongs in the Tutorials section, not Tasks.
Why:
- this suggests deploying a whole new cluster
- if we had cluster admins saying "look, I'm in a hurry, just show me how to deploy the Example Hardware driver, where are the docs?" then a task would be the right fit; this isn't like that though
There may be specific other tasks to cover, though.
For example:
- how do I troubleshoot resource allocation?
- how do I check what devices are allocatable?
- how do I find out about the utilization of my dynamically-allocated devices?
All of those are questions we could cover with a task page (typically a task page per question, though sometimes we combine them).
Taking another look at the example driver, I'm thinking it's likely we can describe workable examples here given "a cluster with DRA enabled" without requiring the exact kind cluster the driver's docs describe. I forgot that the example driver doesn't publish the Helm chart anywhere though, so it needs to be built locally. If we can publish that chart somewhere publicly like GitHub and we find that any DRA-enabled cluster works, do you think that would simplify the setup enough to justify keeping this as a Task?
+1 to those other topics, I think those would be great to include. I'll add placeholders for those.
Nope. I will never meet a cluster admin who wants guidance on setting up the example driver in their existing production cluster.
If you'd never do it outside of a learning context, it's unlikely to be a task.
Sounds good, I've split out the examples requiring the example driver into a new tutorial.
@@ -0,0 +1,70 @@
---
title: Assign Resources to Containers and Pods with Dynamic Resource Allocation
-title: Assign Resources to Containers and Pods with Dynamic Resource Allocation
+title: Learn About Dynamic Resource Allocation
?
## Deploy an example DRA driver

- Reproduce the steps from https://github.com/kubernetes-sigs/dra-example-driver?tab=readme-ov-file#demo to create a cluster and install the driver
Early feedback: we avoid sending people to GitHub repos to discover parts of the documentation.
When I fill out this section, I intend to essentially copy-paste some of the steps from the linked doc into this one, so the steps here would be things like "run this script" instead of "follow the steps in this linked document." Is that in line with your suggestion here?
Something to watch out for: a minority of readers may arrive at these pages looking for in-place changes to the resource allocations for existing Pods. Now, we (Kubernetes) don't call that DRA, but the reader may not know this. So a clear page introduction for each guide will help readers spot if they are looking in the wrong place.
Description
This PR adds new Dynamic Resource Allocation examples to complement the existing Concept page. It adds one new Task document showing various use cases and one new Tutorial document comparing workloads configured via DRA and via device plugins.
This PR is currently a work-in-progress as we iterate on the higher-level details of the new docs.
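For readers of this thread, here is a minimal sketch of the kind of comparison the Tutorial is meant to draw. It is not taken from the PR's files. The object names, the container image, the `resource.k8s.io/v1beta1` API version, and the `gpu.nvidia.com` DeviceClass name are all assumptions for illustration; check the API version served by your cluster and the DeviceClass installed by your driver before using any of it.

```yaml
# Device plugin style: the GPU is requested as a counted extended resource.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-via-device-plugin        # hypothetical name
spec:
  containers:
  - name: ctr
    image: registry.example/workload:latest   # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: 1
---
# DRA style: a ResourceClaimTemplate describes the device request...
apiVersion: resource.k8s.io/v1beta1  # assumed API version; newer clusters may serve a different one
kind: ResourceClaimTemplate
metadata:
  name: single-gpu                   # hypothetical name
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com   # assumed DeviceClass name
---
# ...and the Pod references the template, then binds the claim to a container.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-via-dra                  # hypothetical name
spec:
  containers:
  - name: ctr
    image: registry.example/workload:latest   # hypothetical image
    resources:
      claims:
      - name: gpu                    # must match an entry in spec.resourceClaims
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
```

In the device plugin form the request is only a count, whereas the DRA form names a device class and keeps the request in a separate object that can later carry selectors or be shared.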
Issue
Closes: #