Add new Dynamic Resource Allocation examples #49079

Draft
wants to merge 2 commits into
base: main

nojnhuh
Contributor

@nojnhuh nojnhuh commented Dec 13, 2024

Description

This PR adds new Dynamic Resource Allocation examples to complement the existing Concept page. It adds one new Task document showing various use cases and one new Tutorial document comparing workloads configured via DRA and via device plugins.

This PR is currently a work-in-progress as we iterate on the higher-level details of the new docs.

Issue

Closes: #

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 13, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign reylejano for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added language/en Issues or PRs related to English language size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 13, 2024

netlify bot commented Dec 13, 2024

Pull request preview available for checking

Built without sensitive environment variables

| Name | Link |
|---|---|
| 🔨 Latest commit | ff618c0 |
| 🔍 Latest deploy log | https://app.netlify.com/sites/kubernetes-io-main-staging/deploys/675cb4337f213d000897d41c |
| 😎 Deploy Preview | https://deploy-preview-49079--kubernetes-io-main-staging.netlify.app |


## {{% heading "prerequisites" %}}

* An NVIDIA GPU-enabled cluster with GPU Operator installed

Contributor Author

Something vendor-specific like this might not fit well in these docs, but NVIDIA GPUs are probably the most common DRA use case right now and NVIDIA's device plugin and DRA driver make for the most apples-to-apples comparison of these APIs at the moment I think.
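
For readers following the thread, here is a minimal sketch of the comparison under discussion. It assumes the `resource.k8s.io/v1beta1` API (Kubernetes 1.32) and a device class named `gpu.nvidia.com`; the object names and the container image are hypothetical placeholders, not content from this PR.

```yaml
# Device plugin style: the GPU is requested as an opaque extended resource.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-device-plugin          # hypothetical name
spec:
  containers:
  - name: ctr
    image: registry.k8s.io/pause:3.10   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1
---
# DRA style: a claim template describes the device request...
apiVersion: resource.k8s.io/v1beta1   # assumed API version
kind: ResourceClaimTemplate
metadata:
  name: single-gpu                 # hypothetical name
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com   # assumed device class name
---
# ...and the Pod references the template, then binds the claim to a container.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-dra                    # hypothetical name
spec:
  containers:
  - name: ctr
    image: registry.k8s.io/pause:3.10   # placeholder image
    resources:
      claims:
      - name: gpu                  # must match an entry in spec.resourceClaims
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
```

The device plugin form can only express a count, while the DRA form names a device class and leaves room for selectors and per-claim configuration, which is what makes the side-by-side useful.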

Contributor

Can we use tabs to let people take part even with different vendors?

Contributor Author

I think that would work well if we can map the same use cases 1:1 across vendors, for example if AMD eventually exposes a similar GPU and a "security/network isolation" device like NVIDIA's IMEX channels. Different use cases might be better expressed as separate sections, but we can definitely play around with how those look once we can produce more examples here.

Contributor

@sftim sftim left a comment

Thanks! A tutorial will really help people learn.

Contributor

Early feedback: this sounds like it belongs in the Tutorials section, not Tasks.

Why:

  • this suggests deploying a whole new cluster
  • if we had cluster admins saying "look, I'm in a hurry, just show me how to deploy the Example Hardware driver, where are the docs?" then a task would be the right fit; this isn't like that though

Contributor

There may be specific other tasks to cover, though.

For example:

  • how do I troubleshoot resource allocation?
  • how do I check what devices are allocatable?
  • how do I find out about the utilization of my dynamically-allocated devices?

All of those are questions we could cover with a task page (typically a task page per question, though sometimes we combine them).
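
As an illustration of the second question: DRA drivers advertise allocatable devices by publishing ResourceSlice objects, so a task page on this would likely center on listing and reading them. Below is a rough sketch of what one such object might look like, assuming the `resource.k8s.io/v1beta1` API; the driver name, node name, device name, and attribute values are all hypothetical.

```yaml
apiVersion: resource.k8s.io/v1beta1   # assumed API version
kind: ResourceSlice
metadata:
  name: node-1-gpu.example.com-abcde  # hypothetical generated name
spec:
  driver: gpu.example.com             # hypothetical driver name
  nodeName: node-1                    # hypothetical node
  pool:
    name: node-1
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: gpu-0                       # one entry per allocatable device
    basic:
      attributes:
        model:
          string: EXAMPLE-GPU-MODEL   # hypothetical attribute value
      capacity:
        memory:
          value: 80Gi                 # hypothetical capacity
```

Each entry under `spec.devices` is a device the scheduler can consider when satisfying a ResourceClaim, and the attributes and capacity are what a claim's selectors would match against.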

Contributor Author

Taking another look at the example driver, I'm thinking it's likely we can describe workable examples here given "a cluster with DRA enabled" without requiring the exact kind cluster the driver's docs describe. I forgot that the example driver doesn't publish the Helm chart anywhere though, so it needs to be built locally. If we can publish that chart somewhere publicly like GitHub and we find that any DRA-enabled cluster works, do you think that would simplify the setup enough to justify keeping this as a Task?

+1 to those other topics, I think those would be great to include. I'll add placeholders for those.

Contributor

Nope. I will never meet a cluster admin who wants guidance on setting up the example driver in their existing production cluster.

If you'd never do it outside of a learning context, it's unlikely to be a task.

Contributor Author

Sounds good, I've split out the examples requiring the example driver into a new tutorial.

@@ -0,0 +1,70 @@
---
title: Assign Resources to Containers and Pods with Dynamic Resource Allocation

Contributor

-title: Assign Resources to Containers and Pods with Dynamic Resource Allocation
+title: Learn About Dynamic Resource Allocation

?


## Deploy an example DRA driver

- Reproduce the steps from https://github.com/kubernetes-sigs/dra-example-driver?tab=readme-ov-file#demo to create a cluster and install the driver

Contributor

early feedback: we avoid sending people to GitHub repos to discover parts of the documentation

Contributor Author

When I fill out this section, I intend to essentially copy-paste some of the steps from the linked doc into this one, so the steps here would be things like "run this script" instead of "follow the steps in this linked document." Is that in line with your suggestion here?

Contributor

sftim commented Dec 13, 2024

Something to watch out for: a minority of readers may arrive at these pages looking for in-place changes to the resource allocations for existing Pods.

Now, we (Kubernetes) don't call that DRA, but the reader may not know this. So a clear page introduction for each guide will help readers spot if they are looking in the wrong place.

- split examples requiring driver into new tutorial
- add troubleshooting section to task
Labels
- `cncf-cla: yes`: Indicates the PR's author has signed the CNCF CLA.
- `do-not-merge/work-in-progress`: Indicates that a PR should not merge because it is a work in progress.
- `language/en`: Issues or PRs related to English language
- `size/L`: Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Status: 🏗 In progress
3 participants