docs(DRA): Document device health monitoring in PodStatus #51731

Jpsassine · 2025-07-29T20:46:47Z

Description

This commit adds documentation for the Dynamic Resource Allocation (DRA) portion of KEP-4680, which was implemented in kubernetes/kubernetes#130606.

The following changes are included:

- A new "Device Health Monitoring" section is added to the main DRA concepts page. This section explains how the `ResourceHealthStatus` feature gate and the new `DRAResourceHealth` gRPC service enable device health reporting in `pod.status`.
- The documentation for the `ResourceHealthStatus` feature gate is updated to clarify that it applies to both Device Plugins and Dynamic Resource Allocation.

This aligns with the documentation requirements for graduating the DRA implementation of KEP-4680 to Alpha.

KEP: kubernetes/enhancements#4680

Issue

kubernetes/enhancements#4680 (comment)

Closes: #

netlify · 2025-07-29T20:46:53Z

👷 Deploy Preview for kubernetes-io-vnext-staging processing.

Name	Link
🔨 Latest commit	`6218cef`
🔍 Latest deploy log	https://app.netlify.com/projects/kubernetes-io-vnext-staging/deploys/68893bf99323760008974330

Jpsassine · 2025-07-29T20:47:06Z

cc @SergeyKanzhelev

netlify · 2025-07-29T20:57:16Z

✅ Pull request preview available for checking

Built without sensitive environment variables

Name	Link
🔨 Latest commit	`6218cef`
🔍 Latest deploy log	https://app.netlify.com/projects/kubernetes-io-main-staging/deploys/68893bf9443ed4000883a6a6
😎 Deploy Preview	https://deploy-preview-51731--kubernetes-io-main-staging.netlify.app

To edit notification comments on pull requests, go to your Netlify project configuration.


michellengnx · 2025-08-01T15:31:19Z

Hello @Jpsassine 👋! I'm reaching out from the Docs team.

Just checking in as we approach Docs Freeze on Wednesday, August 6, 2025, at 18:00 PDT. This documentation appears to still be under review. To meet the Docs Freeze, this PR must have a technical review as well as the lgtm and approve labels applied, without any unaddressed comments or concerns from SIG Docs. Thank you!

Jpsassine · 2025-08-04T17:01:45Z

@lmktfy could you give this another pass, please? I believe I addressed both of your comments. Thanks

lmktfy · 2025-08-04T17:23:26Z

This looks great for alpha.

Looking ahead to beta, we'd hope to update the troubleshooting docs as well.

/lgtm
for docs
Can I get a tech review from WG device management as well?

k8s-ci-robot · 2025-08-04T17:23:33Z

LGTM label has been added.

Git tree hash: 7e42d53bd1fb5e4e91844b73ffddab7d2f5817d8

johnbelamaric · 2025-08-04T20:30:25Z

/approve
for WG Dev Mgmt

lmktfy · 2025-08-04T22:16:38Z

/approve

k8s-ci-robot · 2025-08-04T22:16:53Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnbelamaric, lmktfy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~content/en/docs/OWNERS~~ [lmktfy]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

SergeyKanzhelev · 2025-08-04T23:15:21Z

content/en/docs/reference/command-line-tools-reference/feature-gates/ResourceHealthStatus.md

 with the health information for each device assigned to the Pod.
-See [Device plugin and unhealthy devices](/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-and-unhealthy-devices) for more details.
+
+This feature applies to devices managed by both [Device Plugins](/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-and-unhealthy-devices) and [Dynamic Resource Allocation](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-health-monitoring). See [Device plugin and unhealthy devices](/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-and-unhealthy-devices) for more details.


The sentence "See [Device plugin and unhealthy devices](/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-and-unhealthy-devices) for more details." is redundant.

SergeyKanzhelev · 2025-08-04T23:16:08Z

content/en/docs/concepts/scheduling-eviction/dynamic-resource-allocation.md

+
+To enable this functionality, the `ResourceHealthStatus` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/resource-health-status/) must be enabled, and the DRA driver must implement the `DRAResourceHealth` gRPC service.
+
+When a DRA driver detects that an allocated device has become unhealthy, it reports this status back to the kubelet. This health information is then exposed directly in the Pod's status. The kubelet populates the `allocatedResourcesStatus` field in the status of each container, detailing the health of each device assigned to that container.


Do we populate devices that belong only to a Pod, not to a specific container? I do not remember if we ended up implementing this.
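One way to see what the kubelet actually reports is to read `allocatedResourcesStatus` from each container status in a Pod object. The snippet below is a minimal sketch, not part of this PR: the sample Pod, the container name, the claim name, and the device IDs are all made up for illustration, and the JSON field names (`containerStatuses`, `allocatedResourcesStatus`, `resources`, `resourceID`, `health`) reflect my understanding of the v1 Pod API and should be checked against the API reference for the cluster version in use.

```python
import json

# Hypothetical Pod status fragment. Every value here is invented for
# illustration; the shape follows the v1 Pod API as understood above.
SAMPLE_POD = json.loads("""
{
  "status": {
    "containerStatuses": [
      {
        "name": "trainer",
        "allocatedResourcesStatus": [
          {
            "name": "claim:gpu-claim/gpu",
            "resources": [
              {"resourceID": "gpu-0", "health": "Healthy"},
              {"resourceID": "gpu-1", "health": "Unhealthy"}
            ]
          }
        ]
      },
      {"name": "sidecar"}
    ]
  }
}
""")


def unhealthy_devices(pod: dict) -> list:
    """Return (container, resource name, resource ID, health) for every
    device whose reported health is anything other than "Healthy"."""
    findings = []
    status = pod.get("status", {})
    for container in status.get("containerStatuses", []):
        # Containers with no reported devices simply lack the field.
        for resource in container.get("allocatedResourcesStatus", []):
            for device in resource.get("resources", []):
                # Treat a missing health value as "Unknown".
                health = device.get("health", "Unknown")
                if health != "Healthy":
                    findings.append((
                        container.get("name", ""),
                        resource.get("name", ""),
                        device.get("resourceID", ""),
                        health,
                    ))
    return findings


if __name__ == "__main__":
    for container, resource, device_id, health in unhealthy_devices(SAMPLE_POD):
        print(f"{container}: {resource} device {device_id} is {health}")
```

The function only walks per-container statuses, so a device that is attached to the Pod but not referenced by any container would not appear in its output. Feeding it the output of `kubectl get pod <name> -o json` on a test cluster is one way to check how such devices are reported.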

k8s-ci-robot added this to the 1.34 milestone Jul 29, 2025

k8s-ci-robot added the do-not-merge/work-in-progress label Jul 29, 2025

k8s-ci-robot requested a review from klueska July 29, 2025 20:46

k8s-ci-robot requested a review from salaxander July 29, 2025 20:46

k8s-ci-robot added the cncf-cla: yes, language/en, and size/S labels Jul 29, 2025

Jpsassine marked this pull request as ready for review July 29, 2025 20:56

k8s-ci-robot removed the do-not-merge/work-in-progress label Jul 29, 2025

k8s-ci-robot requested a review from divya-mohan0209 July 29, 2025 20:56

lmktfy reviewed Jul 29, 2025


Jpsassine force-pushed the dra-health-docs branch from 1b095e3 to 99fff25 Compare July 29, 2025 21:23

Jpsassine force-pushed the dra-health-docs branch from 99fff25 to 6218cef Compare July 29, 2025 21:24

Jpsassine requested a review from lmktfy July 29, 2025 22:13

k8s-ci-robot assigned lmktfy Aug 4, 2025

k8s-ci-robot added the lgtm label Aug 4, 2025

k8s-ci-robot added the approved label Aug 4, 2025

k8s-ci-robot merged commit 2c0d450 into kubernetes:dev-1.34 Aug 4, 2025
6 checks passed

SergeyKanzhelev reviewed Aug 4, 2025


SergeyKanzhelev mentioned this pull request Jul 25, 2025

Add Resource Health Status to the Pod Status for Device Plugin and DRA kubernetes/enhancements#4680

Open

18 tasks

