@Jpsassine
Contributor

Description

This commit adds documentation for the Dynamic Resource Allocation (DRA) portion of KEP-4680, which was implemented in kubernetes/kubernetes#130606.

The following changes are included:

  • A new "Device Health Monitoring" section is added to the main DRA concepts page. This section explains how the ResourceHealthStatus feature gate and the new DRAResourceHealth gRPC service enable device health reporting in the pod.status.
  • The documentation for the ResourceHealthStatus feature gate is updated to clarify that it applies to both Device Plugins and Dynamic Resource Allocation.

This aligns with the documentation requirements for graduating the DRA implementation of KEP-4680 to Alpha.

KEP: kubernetes/enhancements#4680

Issue

kubernetes/enhancements#4680 (comment)

Closes: #

@k8s-ci-robot k8s-ci-robot added this to the 1.34 milestone Jul 29, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 29, 2025
@k8s-ci-robot k8s-ci-robot requested a review from klueska July 29, 2025 20:46
@netlify

netlify bot commented Jul 29, 2025

👷 Deploy Preview for kubernetes-io-vnext-staging processing.

Name Link
🔨 Latest commit 6218cef
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-io-vnext-staging/deploys/68893bf99323760008974330

@k8s-ci-robot k8s-ci-robot requested a review from salaxander July 29, 2025 20:46
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. language/en Issues or PRs related to English language size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jul 29, 2025
@Jpsassine
Contributor Author

cc @SergeyKanzhelev

@Jpsassine Jpsassine marked this pull request as ready for review July 29, 2025 20:56
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 29, 2025
@netlify

netlify bot commented Jul 29, 2025

Pull request preview available for checking

Built without sensitive environment variables

Name Link
🔨 Latest commit 6218cef
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-io-main-staging/deploys/68893bf9443ed4000883a6a6
😎 Deploy Preview https://deploy-preview-51731--kubernetes-io-main-staging.netlify.app

@Jpsassine Jpsassine requested a review from lmktfy July 29, 2025 22:13
@michellengnx
Contributor

Hello @Jpsassine 👋! I'm reaching out from the Docs team.

Just checking in as we approach Docs Freeze on Wednesday August 6, 2025 18:00 PDT. This documentation appears to still be under review. To meet the Docs Freeze, this PR must have a technical review as well as lgtm and approve labels applied, without any unaddressed comments or concerns from SIG Docs. Thank you!

@Jpsassine
Contributor Author

@lmktfy could you give this another pass, please? I believe I addressed both of your comments. Thanks.

@lmktfy
Member

lmktfy commented Aug 4, 2025

This looks great for alpha.

Looking ahead to beta, we'd hope to update the troubleshooting docs as well.

/lgtm
for docs
Can I get a tech review from WG device management as well?

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 4, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 7e42d53bd1fb5e4e91844b73ffddab7d2f5817d8

@johnbelamaric
Member

/approve
for WG Dev Mgmt

@lmktfy
Member

lmktfy commented Aug 4, 2025

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnbelamaric, lmktfy


@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 4, 2025
@k8s-ci-robot k8s-ci-robot merged commit 2c0d450 into kubernetes:dev-1.34 Aug 4, 2025
6 checks passed
with the health information for each device assigned to the Pod.
See [Device plugin and unhealthy devices](/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-and-unhealthy-devices) for more details.

This feature applies to devices managed by both [Device Plugins](/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-and-unhealthy-devices) and [Dynamic Resource Allocation](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-health-monitoring). See [Device plugin and unhealthy devices](/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-and-unhealthy-devices) for more details.
Member

The sentence "See [Device plugin and unhealthy devices](/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-and-unhealthy-devices) for more details." is redundant.


To enable this functionality, the `ResourceHealthStatus` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/resource-health-status/) must be enabled, and the DRA driver must implement the `DRAResourceHealth` gRPC service.

When a DRA driver detects that an allocated device has become unhealthy, it reports this status back to the kubelet. This health information is then exposed directly in the Pod's status. The kubelet populates the `allocatedResourcesStatus` field in the status of each container, detailing the health of each device assigned to that container.
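
To make the paragraph above concrete, here is a minimal sketch of how a client might read the per-container `allocatedResourcesStatus` data once it appears in a Pod's status. The field names (`name`, `resources`, `resourceID`, `health`) follow my reading of the core/v1 `ContainerStatus` API and should be checked against the API reference. The sample status, including the container, claim, and device names, is invented for illustration, and treating a missing `health` value as `Unknown` is an assumption of this sketch.

```python
import json

# Hypothetical Pod status excerpt. The container, claim, and device names
# are invented; only the field layout is meant to mirror the real API.
POD_STATUS_JSON = """
{
  "containerStatuses": [
    {
      "name": "trainer",
      "allocatedResourcesStatus": [
        {
          "name": "claim:gpu-claim",
          "resources": [
            {"resourceID": "gpu.example.com/pool-a/gpu-0", "health": "Healthy"},
            {"resourceID": "gpu.example.com/pool-a/gpu-1", "health": "Unhealthy"}
          ]
        }
      ]
    },
    {"name": "sidecar"}
  ]
}
"""


def unhealthy_devices(pod_status):
    """Return (container, resource name, resource ID, health) for each device
    whose reported health is anything other than "Healthy"."""
    findings = []
    for container in pod_status.get("containerStatuses", []):
        # Containers without assigned devices may omit the field entirely.
        for resource in container.get("allocatedResourcesStatus", []):
            for device in resource.get("resources", []):
                # Assumption: a missing health value is treated as Unknown.
                health = device.get("health", "Unknown")
                if health != "Healthy":
                    findings.append(
                        (container["name"], resource["name"],
                         device["resourceID"], health)
                    )
    return findings


if __name__ == "__main__":
    for row in unhealthy_devices(json.loads(POD_STATUS_JSON)):
        print(row)
```

In practice the input would be the `.status` object from `kubectl get pod <name> -o json` rather than an inline string. The sketch only covers container-level status; it does not address devices that belong to the Pod as a whole.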
Member

Do we populate devices that belong only to a Pod, not to a specific container? I do not remember if we ended up implementing this.

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. language/en Issues or PRs related to English language lgtm "Looks good to me", indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

6 participants