- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
This document presents the kubelet endpoint which allows third party consumers to inspect the mapping between devices and pods.
Device Monitoring in Kubernetes is expected to be implemented out of the kubernetes tree.
For the metrics to be relevant to cluster administrators or Pod owners, they need to be able to be matched to specific containers and pods (e.g., GPU utilization for pod X). As such, external monitoring agents need to be able to determine the set of devices in use by containers and attach pod and container metadata to the metrics.
- Deprecate and remove current device-specific knowledge from the kubelet, such as accelerator metrics
- Enable external device monitoring agents to provide metrics relevant to Kubernetes
- Enable cluster components to consume the API. The API is node-local only.
As a Cluster Administrator, I provide a set of devices from various vendors in my cluster. Each vendor independently maintains their own agent, so I run monitoring agents only for devices I provide. Each agent adheres to the node monitoring guidelines, so I can use a compatible monitoring pipeline to collect and analyze metrics from a variety of agents, even though they are maintained by different vendors.
As a Device Vendor, I manufacture devices and I have deep domain expertise in how to run and monitor them. Because I maintain my own Device Plugin implementation, as well as Device Monitoring Agent, I can provide consumers of my devices an easy way to consume and monitor my devices without requiring open-source contributions. The Device Monitoring Agent doesn't have any dependencies on the Device Plugin, so I can decouple monitoring from device lifecycle management. My Device Monitoring Agent works by periodically querying the `/devices/<ResourceName>` endpoint to discover which devices are being used, and to get the container/pod metadata associated with the metrics.
This API is read-only, which removes a large class of risks. The aspects that we consider below are as follows:
- What are the risks associated with the API service itself?
- What are the risks associated with the data itself?
| Risk | Impact | Mitigation |
|---|---|---|
| Too many requests risk impacting kubelet performance | High | Implement rate limiting and/or passive caching, and follow best practices for gRPC resource management (see the sketch following this table). |
| Improper access to the data | Low | The server listens on a root-owned unix socket. This can be limited with proper pod security policies. |
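The rate-limiting mitigation can take the shape of a standard gRPC interceptor placed in front of the service. The snippet below is only a sketch of that idea, not the kubelet's actual implementation; the limiter values and the use of `golang.org/x/time/rate` are assumptions chosen for illustration.

```go
package podresourceslimit

import (
	"context"

	"golang.org/x/time/rate"
	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// rateLimitInterceptor rejects calls once the shared token bucket is empty,
// shielding the kubelet from clients that poll the endpoint too aggressively.
func rateLimitInterceptor(limiter *rate.Limiter) grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
		if !limiter.Allow() {
			return nil, status.Error(codes.ResourceExhausted, "pod resources endpoint: rate limit exceeded")
		}
		return handler(ctx, req)
	}
}

// newRateLimitedServer wires the interceptor into a gRPC server.
// The limit of 5 req/s with a burst of 10 is purely illustrative.
func newRateLimitedServer() *grpc.Server {
	limiter := rate.NewLimiter(rate.Limit(5), 10)
	return grpc.NewServer(grpc.UnaryInterceptor(rateLimitInterceptor(limiter)))
}
```

Because the limiter is shared across all clients of the socket, a single misbehaving agent cannot starve the kubelet by hammering the endpoint.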
We propose to add a new gRPC service to the Kubelet. This gRPC service would listen on a unix socket at `/var/lib/kubelet/pod-resources/kubelet.sock` and return information about the kubelet's assignment of devices to containers. This information is obtained from the internal state of the kubelet's Device Manager. The gRPC service has a single function named `List`; the v1 API is shown below, followed by a sketch of a client consuming it:
```protobuf
// PodResourcesLister is a service provided by the kubelet that provides information about the
// node resources consumed by pods and containers on the node
service PodResourcesLister {
    rpc List(ListPodResourcesRequest) returns (ListPodResourcesResponse) {}
}

// ListPodResourcesRequest is the request made to the PodResourcesLister service
message ListPodResourcesRequest {}

// ListPodResourcesResponse is the response returned by List function
message ListPodResourcesResponse {
    repeated PodResources pod_resources = 1;
}

// PodResources contains information about the node resources assigned to a pod
message PodResources {
    string name = 1;
    string namespace = 2;
    repeated ContainerResources containers = 3;
}

// ContainerResources contains information about the resources assigned to a container
message ContainerResources {
    string name = 1;
    repeated ContainerDevices devices = 2;
}

// ContainerDevices contains information about the devices assigned to a container
message ContainerDevices {
    string resource_name = 1;
    repeated string device_ids = 2;
}
```
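To make the consumption model concrete, here is a minimal sketch of how a node-local monitoring agent could call `List` over the unix socket. It is not the reference client: the `k8s.io/kubelet/pkg/apis/podresources/v1` import path and the dial options are assumptions based on the generated v1 bindings.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

func main() {
	// The kubelet serves the PodResourcesLister service on a node-local unix socket.
	conn, err := grpc.Dial(
		"unix:///var/lib/kubelet/pod-resources/kubelet.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	client := podresourcesv1.NewPodResourcesListerClient(conn)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	resp, err := client.List(ctx, &podresourcesv1.ListPodResourcesRequest{})
	if err != nil {
		panic(err)
	}

	// Print the device IDs assigned to each container, keyed by pod and resource name.
	for _, pod := range resp.GetPodResources() {
		for _, container := range pod.GetContainers() {
			for _, dev := range container.GetDevices() {
				fmt.Printf("%s/%s %s: %s -> %v\n",
					pod.GetNamespace(), pod.GetName(), container.GetName(),
					dev.GetResourceName(), dev.GetDeviceIds())
			}
		}
	}
}
```

A monitoring agent would typically run this kind of query periodically and join the returned device IDs with the metrics it scrapes from the devices themselves.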
[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Given that the API allows observing which devices have been associated with which container, we need to test different configurations, such as:
- Pods without devices assigned to any containers.
- Pods with devices assigned to some but not all containers.
- Pods with devices assigned to init containers.
- ...
We have identified two main ways of testing this API:
- Unit Tests which won't rely on gRPC. They will test different configurations of pods and devices.
- Node E2E tests which will allow us to test the service itself.
E2E tests are explicitly not written because they would require us to generate and deploy a custom container. The infrastructure required is expensive, and it is not clear what additional testing (and hence risk reduction) this would provide compared to node e2e tests.
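As an illustration of the kind of assertions these tests make, the sketch below checks the shape of `List` responses for two of the configurations listed above. It is a hypothetical, table-driven example that assumes the generated Go bindings in `k8s.io/kubelet/pkg/apis/podresources/v1`; the real unit and node e2e tests live in the kubelet tree and are structured differently.

```go
package podresources_test

import (
	"testing"

	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

// devicesByContainer flattens a List response into a map keyed by
// "namespace/pod/container" with the number of assigned device IDs.
func devicesByContainer(resp *podresourcesv1.ListPodResourcesResponse) map[string]int {
	out := map[string]int{}
	for _, pod := range resp.GetPodResources() {
		for _, c := range pod.GetContainers() {
			n := 0
			for _, d := range c.GetDevices() {
				n += len(d.GetDeviceIds())
			}
			out[pod.GetNamespace()+"/"+pod.GetName()+"/"+c.GetName()] = n
		}
	}
	return out
}

func TestListResponseShapes(t *testing.T) {
	cases := []struct {
		name string
		resp *podresourcesv1.ListPodResourcesResponse
		want map[string]int
	}{
		{
			name: "pod without devices",
			resp: &podresourcesv1.ListPodResourcesResponse{
				PodResources: []*podresourcesv1.PodResources{{
					Name: "web", Namespace: "default",
					Containers: []*podresourcesv1.ContainerResources{{Name: "app"}},
				}},
			},
			want: map[string]int{"default/web/app": 0},
		},
		{
			name: "devices on one of two containers",
			resp: &podresourcesv1.ListPodResourcesResponse{
				PodResources: []*podresourcesv1.PodResources{{
					Name: "train", Namespace: "ml",
					Containers: []*podresourcesv1.ContainerResources{
						{Name: "sidecar"},
						{Name: "worker", Devices: []*podresourcesv1.ContainerDevices{{
							ResourceName: "vendor.example/gpu",
							DeviceIds:    []string{"gpu-0", "gpu-1"},
						}}},
					},
				}},
			},
			want: map[string]int{"ml/train/sidecar": 0, "ml/train/worker": 2},
		},
	}
	for _, tc := range cases {
		got := devicesByContainer(tc.resp)
		for k, v := range tc.want {
			if got[k] != v {
				t.Errorf("%s: container %s: got %d devices, want %d", tc.name, k, got[k], v)
			}
		}
	}
}
```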
- Unit tests: `k8s.io/kubernetes/pkg/kubelet/apis/podresources`: `20230127` - `61.5%`; the remainder is covered by e2e tests.
- Node e2e tests: `k8s.io/kubernetes/test/e2e_node/podresources_test.go`: https://storage.googleapis.com/k8s-triage/index.html?test=POD%20Resources
- Implement the new service API.
- Ensure proper e2e node tests are in place.
- Demonstrate that the endpoint can be used to replace in-tree GPU device metrics in production environments (NVIDIA, sig-node April 30, 2019).
- Multiple real world examples (Multus CNI).
- Allowing time for feedback (2 years).
- Start Deprecation of Accelerator metrics in kubelet.
- The API endpoint should be available on all platforms where the kubelet runs and supports device plugins (Linux, Windows, ...).
- Rate limiting mechanisms are implemented in the server to prevent excessive load from malfunctioning/rogue clients.
- Risks have been addressed.
With gRPC, the version is part of the service name. Old and new versions should always be served by the kubelet (see the registration sketch below).
For a cluster admin, upgrading to the newest API version means upgrading Kubernetes to a newer version as well as upgrading the monitoring component.
For a vendor, changes in the API should always be backwards compatible.
Downgrades here are related to downgrading the plugin.
The kubelet will always be backwards compatible, so going forward existing plugins are not expected to break.
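As a sketch of what serving both versions looks like, the function below registers the old and the new service on the same gRPC server (and therefore the same unix socket). The function name and wiring are illustrative; the kubelet's actual setup lives in its pod-resources server package.

```go
package podresourcesserver

import (
	"google.golang.org/grpc"

	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
	podresourcesv1alpha1 "k8s.io/kubelet/pkg/apis/podresources/v1alpha1"
)

// registerAllVersions wires both the old and the new service versions onto the
// same gRPC server. Because the version is part of the gRPC service name
// (e.g. v1alpha1.PodResourcesLister vs. v1.PodResourcesLister), existing
// clients keep working while new clients opt into the newer version.
func registerAllVersions(s *grpc.Server,
	v1Impl podresourcesv1.PodResourcesListerServer,
	v1alpha1Impl podresourcesv1alpha1.PodResourcesListerServer) {
	podresourcesv1.RegisterPodResourcesListerServer(s, v1Impl)
	podresourcesv1alpha1.RegisterPodResourcesListerServer(s, v1alpha1Impl)
}
```

Because each version has a distinct fully qualified service name, a v1alpha1 client and a v1 client can connect to the same socket without interfering with each other.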
- Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: `KubeletPodResources`
  - Components depending on the feature gate: N/A
No.
Yes, through feature gates.
The service recovers state from kubelet.
No, however no data is created or deleted.
Kubelet would fail to start. Errors would be caught in the CI.
Not Applicable, metrics wouldn't be available.
Not Applicable.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
- Look at the `pod_resources_endpoint_requests_total` metric exposed by the kubelet.
- Look at hostPath mounts of privileged containers.
- Events
  - Event Reason:
- API .status
  - Condition name:
  - Other field:
- Other (treat as last resort)
  - Details: check the kubelet `pod_resources_endpoint_*` metrics
N/A or refer to Kubelet SLIs.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name: `pod_resources_endpoint_requests_total`
  - Components exposing the metric: kubelet
Are there any missing metrics that would be useful to have to improve observability of this feature?
No.
Not applicable.
No.
No.
No.
No.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No. The feature is outside of any existing paths in the kubelet.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
In 1.18, DoS-ing the API can lead to resource exhaustion; this is planned to be addressed as part of GA. The API is exposed only through a unix-domain socket local to the node, so malicious agents can only be among pods running on the same node (i.e. with no network access) which have been granted permission to access the unix domain socket via volume mounts and filesystem permissions. The feature only collects data when requests come in; the data is then garbage collected. The data collected is proportional to the number of pods on the node.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No. Clients consume the API through the gRPC interface exposed on the unix domain socket. Only a single socket is created and managed by the kubelet, shared among all the clients (typically one). No resources are reserved when a client connects, and the API is stateless (no state is preserved across calls, and there is no concept of a session). All the data needed to serve the calls is fetched from already existing data structures internal to the resource managers.
No effect.
No known failure modes
N/A
- 2018-09-11: Final version of KEP (proposing pod-resources endpoint) published and presented to sig-node. Slides
- 2018-10-30: Demo with example gpu monitoring daemonset
- 2018-11-10: KEP lgtm'd and approved
- 2018-11-15: Implementation and e2e test merged before 1.13 release: kubernetes/kubernetes#70508
- 2019-04-30: Demo of production GPU monitoring by NVIDIA
- 2019-04-30: Agreement in sig-node to move feature to beta in 1.15
- 2020-06-17: Agreement in sig-node to move feature to G.A in 1.19 or 1.20
- 2023-01-27: KEP translated to the most recent template
- 2023-05-23: KEP GA moved to 1.28
N/A
Add a v1alpha1 Kubelet gRPC service, at `/var/lib/kubelet/pod-resources/kubelet.sock`, which returns a list of CreateContainerRequests used to create containers.
- Pros:
  - Reuse an existing API for describing containers rather than inventing a new one
- Cons:
  - It ties the endpoint to the CreateContainerRequest, and may prevent us from adding other information we want in the future
  - It does not contain any additional information that will be useful to monitoring agents other than devices, and contains lots of irrelevant information for this use case.
- Notes:
  - Does not include any reference to resource names. Monitoring agents must identify devices by the device or environment variables passed to the pod or container.
- Pros:
  - Allows for observation of container to device bindings local to the node through the `/pods` endpoint
- Cons:
  - Only consumed locally, which doesn't justify an API change
  - Device Bindings are immutable after allocation, and are debatably observable (they can be "observed" from the local checkpoint file). Device bindings are generally a poor fit for status.
- Pros:
  - Allows for observability of device to container bindings through what exists in the checkpoint file
- Cons:
  - Requires adding additional metadata to the checkpoint file as required by the monitoring agent
  - Requires implementing versioning for the checkpoint file, and handling version skew between readers and the kubelet
  - Future modifications to the checkpoint file are more difficult.
- A new object `ComputeDevice` will be defined, and a new variable `ComputeDevices` will be added to the `Container` (Spec) object, which will represent a list of `ComputeDevice` objects.
```go
// ComputeDevice describes the devices assigned to this container for a given ResourceName
type ComputeDevice struct {
	// DeviceIDs is the list of devices assigned to this container
	DeviceIDs []string
	// ResourceName is the name of the compute resource
	ResourceName string
}

// Container represents a single container that is expected to be run on the host.
type Container struct {
	...
	// ComputeDevices contains the devices assigned to this container
	// This field is alpha-level and is only honored by servers that enable the ComputeDevices feature.
	// +optional
	ComputeDevices []ComputeDevice
	...
}
```
- During Kubelet pod admission, if `ComputeDevices` is found non-empty, the specified devices will be allocated; otherwise behaviour will remain the same as it is today.
- Before starting the pod, the kubelet writes the assigned `ComputeDevices` back to the pod spec.
  - Note: Writing to the API Server and waiting to observe the updated pod spec in the kubelet's pod watch may add significant latency to pod startup.
- Allows devices to potentially be assigned by a custom scheduler.
- Serves as a permanent record of device assignments for the kubelet, and eliminates the need for the kubelet to maintain this state locally.