docs(DRA): Document device health monitoring in PodStatus #51731
1b095e3 to 99fff25
This commit adds documentation for the Dynamic Resource Allocation (DRA) portion of KEP-4680, which was implemented in kubernetes/kubernetes#130606.

The following changes are included:
- A new "Device Health Monitoring" section is added to the main DRA concepts page. This section explains how the `ResourceHealthStatus` feature gate and the new `DRAResourceHealth` gRPC service enable device health reporting in the `pod.status`.
- The documentation for the `ResourceHealthStatus` feature gate is updated to clarify that it applies to both Device Plugins and Dynamic Resource Allocation.

This aligns with the documentation requirements for graduating the DRA implementation of KEP-4680 to Alpha.

KEP: kubernetes/enhancements#4680
99fff25 to 6218cef
Hello @Jpsassine 👋! I'm reaching out from the Docs team. Just checking in as we approach Docs Freeze on Wednesday August 6, 2025 18:00 PDT. This documentation appears to still be under review. To meet the Docs Freeze, this PR must have a technical review as well as lgtm and approve labels applied, without any unaddressed comments or concerns from SIG Docs. Thank you!
@lmktfy could you give this another pass, please? I believe I addressed both of your comments. Thanks
This looks great for alpha. Looking ahead to beta, we'd hope to update the troubleshooting docs as well. /lgtm
LGTM label has been added. Git tree hash: 7e42d53bd1fb5e4e91844b73ffddab7d2f5817d8
/approve
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: johnbelamaric, lmktfy
with the health information for each device assigned to the Pod.
See [Device plugin and unhealthy devices](/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-and-unhealthy-devices) for more details.

This feature applies to devices managed by both [Device Plugins](/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-and-unhealthy-devices) and [Dynamic Resource Allocation](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-health-monitoring). See [Device plugin and unhealthy devices](/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-and-unhealthy-devices) for more details.
The sentence "See [Device plugin and unhealthy devices](/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-and-unhealthy-devices) for more details." is redundant.
To enable this functionality, the `ResourceHealthStatus` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/resource-health-status/) must be enabled, and the DRA driver must implement the `DRAResourceHealth` gRPC service.

When a DRA driver detects that an allocated device has become unhealthy, it reports this status back to the kubelet. This health information is then exposed directly in the Pod's status. The kubelet populates the `allocatedResourcesStatus` field in the status of each container, detailing the health of each device assigned to that container.
Do we populate devices that belong only to a Pod, not to a specific container? I do not remember whether we ended up implementing this.
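For readers who want to see what consuming this field might look like, here is a minimal sketch in Python. It assumes the Pod has been fetched as JSON (for example with `kubectl get pod <name> -o json`) and that each `allocatedResourcesStatus` entry carries a `name` plus a `resources` list of `resourceID` and `health` pairs. The helper name `unhealthy_devices`, the sample Pod, and the claim, driver, pool, and device names are all hypothetical and chosen only for illustration. The sketch reads per-container statuses only, so it does not answer the question above about devices that belong to the Pod as a whole.

```python
import json


def unhealthy_devices(pod: dict) -> list:
    """Return (container, resource name, resource ID, health) for each
    device whose reported health is anything other than "Healthy"."""
    findings = []
    status = pod.get("status") or {}
    for container in status.get("containerStatuses") or []:
        for resource in container.get("allocatedResourcesStatus") or []:
            for device in resource.get("resources") or []:
                # Treat a missing health value as "Unknown" (an assumption
                # of this sketch, not documented kubelet behavior).
                health = device.get("health", "Unknown")
                if health != "Healthy":
                    findings.append((
                        container.get("name", ""),
                        resource.get("name", ""),
                        device.get("resourceID", ""),
                        health,
                    ))
    return findings


# Hypothetical Pod status fragment; every name and ID here is made up.
SAMPLE_POD = json.loads("""
{
  "status": {
    "containerStatuses": [
      {
        "name": "trainer",
        "allocatedResourcesStatus": [
          {
            "name": "claim:gpu-claim/gpu",
            "resources": [
              {"resourceID": "gpu.example.com/pool-a/gpu-0", "health": "Healthy"},
              {"resourceID": "gpu.example.com/pool-a/gpu-1", "health": "Unhealthy"}
            ]
          }
        ]
      }
    ]
  }
}
""")

if __name__ == "__main__":
    for container, resource, device_id, health in unhealthy_devices(SAMPLE_POD):
        print(f"{container}: {resource} {device_id} is {health}")
```

Returning plain tuples keeps the sketch free of any client library dependency; a real tool would more likely use a Kubernetes client and its typed objects.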
Description
This commit adds documentation for the Dynamic Resource Allocation (DRA) portion of KEP-4680, which was implemented in kubernetes/kubernetes#130606.
The following changes are included:
- A new "Device Health Monitoring" section is added to the main DRA concepts page. This section explains how the `ResourceHealthStatus` feature gate and the new `DRAResourceHealth` gRPC service enable device health reporting in the `pod.status`.
- The documentation for the `ResourceHealthStatus` feature gate is updated to clarify that it applies to both Device Plugins and Dynamic Resource Allocation.

This aligns with the documentation requirements for graduating the DRA implementation of KEP-4680 to Alpha.
KEP: kubernetes/enhancements#4680
Issue
kubernetes/enhancements#4680 (comment)
Closes: #