- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
This document presents the kubelet endpoint which allows third party consumers to inspect the mapping between devices and pods.
Device Monitoring in Kubernetes is expected to be implemented out of the kubernetes tree.
For the metrics to be relevant to cluster administrators or Pod owners, they need to be able to be matched to specific containers and pods (e.g., GPU utilization for pod X). As such, external monitoring agents need to be able to determine the set of devices in use by containers and attach pod and container metadata to the metrics.
- Deprecate and remove current device-specific knowledge from the kubelet, such as accelerator metrics
- Enable external device monitoring agents to provide metrics relevant to Kubernetes
- Enable cluster components to consume the API. The API is node-local only.
As a Cluster Administrator, I provide a set of devices from various vendors in my cluster. Each vendor independently maintains their own agent, so I run monitoring agents only for devices I provide. Each agent adheres to the node monitoring guidelines, so I can use a compatible monitoring pipeline to collect and analyze metrics from a variety of agents, even though they are maintained by different vendors.
As a Device Vendor, I manufacture devices and I have deep domain expertise in how to run and monitor them. Because I maintain my own Device Plugin implementation, as well as Device Monitoring Agent, I can provide consumers of my devices an easy way to consume and monitor my devices without requiring open-source contributions. The Device Monitoring Agent doesn't have any dependencies on the Device Plugin, so I can decouple monitoring from device lifecycle management. My Device Monitoring Agent works by periodically querying the `/devices/<ResourceName>` endpoint to discover which devices are being used, and to get the container/pod metadata associated with the metrics.
This API is read-only, which removes a large class of risks. The aspects that we consider below are as follows:
- What are the risks associated with the API service itself?
- What are the risks associated with the data itself?
| Risk | Impact | Mitigation |
|---|---|---|
| Too many requests risk impacting kubelet performance | High | Implement rate limiting and/or passive caching, and follow best practices for gRPC resource management (see the sketch following this table). |
| Improper access to the data | Low | The server listens on a root-owned unix socket. This can be limited with proper pod security policies. |
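The rate-limiting mitigation can take the shape of a standard gRPC interceptor placed in front of the service. The snippet below is only a sketch of that idea, not the kubelet's actual implementation; the limiter values and the use of `golang.org/x/time/rate` are assumptions chosen for illustration.

```go
package podresourceslimit

import (
	"context"

	"golang.org/x/time/rate"
	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// rateLimitInterceptor rejects calls once the shared token bucket is empty,
// shielding the kubelet from clients that poll the endpoint too aggressively.
func rateLimitInterceptor(limiter *rate.Limiter) grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
		if !limiter.Allow() {
			return nil, status.Error(codes.ResourceExhausted, "pod resources endpoint: rate limit exceeded")
		}
		return handler(ctx, req)
	}
}

// newRateLimitedServer wires the interceptor into a gRPC server.
// The limit of 5 req/s with a burst of 10 is purely illustrative.
func newRateLimitedServer() *grpc.Server {
	limiter := rate.NewLimiter(rate.Limit(5), 10)
	return grpc.NewServer(grpc.UnaryInterceptor(rateLimitInterceptor(limiter)))
}
```

Because the limiter is shared across all clients of the socket, a single misbehaving agent cannot starve the kubelet by hammering the endpoint.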
We propose to add a new gRPC service to the Kubelet. This gRPC service would listen on a unix socket at `/var/lib/kubelet/pod-resources/kubelet.sock` and return information about the kubelet's assignment of devices to containers. This information is obtained from the internal state of the kubelet's Device Manager. The gRPC service has a single function named `List`; the v1 API is shown below, followed by a sketch of a client consuming it:
```protobuf
// PodResourcesLister is a service provided by the kubelet that provides information about the
// node resources consumed by pods and containers on the node
service PodResourcesLister {
    rpc List(ListPodResourcesRequest) returns (ListPodResourcesResponse) {}
}

// ListPodResourcesRequest is the request made to the PodResourcesLister service
message ListPodResourcesRequest {}

// ListPodResourcesResponse is the response returned by List function
message ListPodResourcesResponse {
    repeated PodResources pod_resources = 1;
}

// PodResources contains information about the node resources assigned to a pod
message PodResources {
    string name = 1;
    string namespace = 2;
    repeated ContainerResources containers = 3;
}

// ContainerResources contains information about the resources assigned to a container
message ContainerResources {
    string name = 1;
    repeated ContainerDevices devices = 2;
}

// ContainerDevices contains information about the devices assigned to a container
message ContainerDevices {
    string resource_name = 1;
    repeated string device_ids = 2;
}
```
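To make the consumption model concrete, here is a minimal sketch of how a node-local monitoring agent could call `List` over the unix socket. It is not the reference client: the `k8s.io/kubelet/pkg/apis/podresources/v1` import path and the dial options are assumptions based on the generated v1 bindings.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

func main() {
	// The kubelet serves the PodResourcesLister service on a node-local unix socket.
	conn, err := grpc.Dial(
		"unix:///var/lib/kubelet/pod-resources/kubelet.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	client := podresourcesv1.NewPodResourcesListerClient(conn)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	resp, err := client.List(ctx, &podresourcesv1.ListPodResourcesRequest{})
	if err != nil {
		panic(err)
	}

	// Print the device IDs assigned to each container, keyed by pod and resource name.
	for _, pod := range resp.GetPodResources() {
		for _, container := range pod.GetContainers() {
			for _, dev := range container.GetDevices() {
				fmt.Printf("%s/%s %s: %s -> %v\n",
					pod.GetNamespace(), pod.GetName(), container.GetName(),
					dev.GetResourceName(), dev.GetDeviceIds())
			}
		}
	}
}
```

A monitoring agent would typically run this kind of query periodically and join the returned device IDs with the metrics it scrapes from the devices themselves.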
[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Given that the API allows observing which devices have been associated with which container, we need to test different configurations, such as:
- Pods without devices assigned to any containers.
- Pods with devices assigned to some but not all containers.
- Pods with devices assigned to init containers.
- ...
We have identified two main ways of testing this API:
- Unit Tests which won't rely on gRPC. They will test different configurations of pods and devices.
- Node E2E tests which will allow us to test the service itself.
E2E tests are explicitly not written because they would require us to generate and deploy a custom container. The infrastructure required is expensive, and it is not clear what additional testing (and hence risk reduction) this would provide compared to node e2e tests.
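As an illustration of the kind of assertions these tests make, the sketch below checks the shape of `List` responses for two of the configurations listed above. It is a hypothetical, table-driven example that assumes the generated Go bindings in `k8s.io/kubelet/pkg/apis/podresources/v1`; the real unit and node e2e tests live in the kubelet tree and are structured differently.

```go
package podresources_test

import (
	"testing"

	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

// devicesByContainer flattens a List response into a map keyed by
// "namespace/pod/container" with the number of assigned device IDs.
func devicesByContainer(resp *podresourcesv1.ListPodResourcesResponse) map[string]int {
	out := map[string]int{}
	for _, pod := range resp.GetPodResources() {
		for _, c := range pod.GetContainers() {
			n := 0
			for _, d := range c.GetDevices() {
				n += len(d.GetDeviceIds())
			}
			out[pod.GetNamespace()+"/"+pod.GetName()+"/"+c.GetName()] = n
		}
	}
	return out
}

func TestListResponseShapes(t *testing.T) {
	cases := []struct {
		name string
		resp *podresourcesv1.ListPodResourcesResponse
		want map[string]int
	}{
		{
			name: "pod without devices",
			resp: &podresourcesv1.ListPodResourcesResponse{
				PodResources: []*podresourcesv1.PodResources{{
					Name: "web", Namespace: "default",
					Containers: []*podresourcesv1.ContainerResources{{Name: "app"}},
				}},
			},
			want: map[string]int{"default/web/app": 0},
		},
		{
			name: "devices on one of two containers",
			resp: &podresourcesv1.ListPodResourcesResponse{
				PodResources: []*podresourcesv1.PodResources{{
					Name: "train", Namespace: "ml",
					Containers: []*podresourcesv1.ContainerResources{
						{Name: "sidecar"},
						{Name: "worker", Devices: []*podresourcesv1.ContainerDevices{{
							ResourceName: "vendor.example/gpu",
							DeviceIds:    []string{"gpu-0", "gpu-1"},
						}}},
					},
				}},
			},
			want: map[string]int{"ml/train/sidecar": 0, "ml/train/worker": 2},
		},
	}
	for _, tc := range cases {
		got := devicesByContainer(tc.resp)
		for k, v := range tc.want {
			if got[k] != v {
				t.Errorf("%s: container %s: got %d devices, want %d", tc.name, k, got[k], v)
			}
		}
	}
}
```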
- Unit tests: `k8s.io/kubernetes/pkg/kubelet/apis/podresources`: `20230127` - `61.5%`; the remainder is covered by e2e tests.
- Node e2e tests: `k8s.io/kubernetes/test/e2e_node/podresources_test.go`: https://storage.googleapis.com/k8s-triage/index.html?test=POD%20Resources
- Implement the new service API.
- Ensure proper e2e node tests are in place.
- Demonstrate that the endpoint can be used to replace in-tree GPU device metrics in production environments (NVIDIA, sig-node April 30, 2019).
- Multiple real world examples (Multus CNI).
- Allowing time for feedback (2 years).
- Start Deprecation of Accelerator metrics in kubelet.
- The API endpoint should be available on all platforms where the kubelet runs and supports device plugins (Linux, Windows, ...).
- Rate limiting mechanisms are implemented in the server to prevent excessive load from malfunctioning/rogue clients.
- Risks have been addressed.
With gRPC, the version is part of the service name. Old and new versions should always be served by the kubelet (see the registration sketch below).
For a cluster admin, upgrading to the newest API version means upgrading Kubernetes to a newer version as well as upgrading the monitoring component.
For a vendor, changes in the API should always be backwards compatible.
Downgrades here are related to downgrading the plugin.
The kubelet will always be backwards compatible, so going forward existing plugins are not expected to break.
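As a sketch of what serving both versions looks like, the function below registers the old and the new service on the same gRPC server (and therefore the same unix socket). The function name and wiring are illustrative; the kubelet's actual setup lives in its pod-resources server package.

```go
package podresourcesserver

import (
	"google.golang.org/grpc"

	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
	podresourcesv1alpha1 "k8s.io/kubelet/pkg/apis/podresources/v1alpha1"
)

// registerAllVersions wires both the old and the new service versions onto the
// same gRPC server. Because the version is part of the gRPC service name
// (e.g. v1alpha1.PodResourcesLister vs. v1.PodResourcesLister), existing
// clients keep working while new clients opt into the newer version.
func registerAllVersions(s *grpc.Server,
	v1Impl podresourcesv1.PodResourcesListerServer,
	v1alpha1Impl podresourcesv1alpha1.PodResourcesListerServer) {
	podresourcesv1.RegisterPodResourcesListerServer(s, v1Impl)
	podresourcesv1alpha1.RegisterPodResourcesListerServer(s, v1alpha1Impl)
}
```

Because each version has a distinct fully qualified service name, a v1alpha1 client and a v1 client can connect to the same socket without interfering with each other.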
- Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: `KubeletPodResources`
  - Components depending on the feature gate: N/A
No.
Yes, through feature gates.
The service recovers state from kubelet.
No, however no data is created or deleted.
Kubelet would fail to start. Errors would be caught in the CI.
Not Applicable, metrics wouldn't be available.
Not Applicable.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
- Look at the `pod_resources_endpoint_requests_total` metric exposed by the kubelet.
- Look at hostPath mounts of privileged containers.
- Events
  - Event Reason:
- API .status
  - Condition name:
  - Other field:
- Other (treat as last resort)
  - Details: check the kubelet `pod_resources_endpoint_*` metrics
N/A or refer to Kubelet SLIs.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name: `pod_resources_endpoint_requests_total`
  - Components exposing the metric: kubelet
Are there any missing metrics that would be useful to have to improve observability of this feature?
No.
Not applicable.
No.
No.
No.
No.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No. The feature is outside of any existing paths in the kubelet.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
In 1.18, DoS-ing the API can lead to resource exhaustion; this is planned to be addressed as part of GA. The API is exposed only through a unix-domain socket local to the node, so malicious agents can only be among pods running on the same node (i.e. with no network access) which have been granted permission to access the unix domain socket via volume mounts and filesystem permissions. The feature only collects data when requests come in; the data is then garbage collected. The data collected is proportional to the number of pods on the node.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No. Clients consume the API through the gRPC interface exposed on the unix domain socket. Only a single socket is created and managed by the kubelet, shared among all the clients (typically one). No resources are reserved when a client connects, and the API is stateless (no state is preserved across calls, and there is no concept of a session). All the data needed to serve the calls is fetched from already existing data structures internal to the resource managers.
No effect.
No known failure modes
N/A
- 2018-09-11: Final version of KEP (proposing pod-resources endpoint) published and presented to sig-node. Slides
- 2018-10-30: Demo with example gpu monitoring daemonset
- 2018-11-10: KEP lgtm'd and approved
- 2018-11-15: Implementation and e2e test merged before 1.13 release: kubernetes/kubernetes#70508
- 2019-04-30: Demo of production GPU monitoring by NVIDIA
- 2019-04-30: Agreement in sig-node to move feature to beta in 1.15
- 2020-06-17: Agreement in sig-node to move feature to G.A in 1.19 or 1.20
- 2023-01-27: KEP translated to the most recent template
- 2023-05-23: KEP GA moved to 1.28
N/A
Add a v1alpha1 Kubelet gRPC service, at `/var/lib/kubelet/pod-resources/kubelet.sock`, which returns a list of CreateContainerRequests used to create containers.
- Pros:
  - Reuse an existing API for describing containers rather than inventing a new one
- Cons:
  - It ties the endpoint to the CreateContainerRequest, and may prevent us from adding other information we want in the future
  - It does not contain any additional information that will be useful to monitoring agents other than devices, and contains lots of irrelevant information for this use case.
- Notes:
  - Does not include any reference to resource names. Monitoring agents must identify devices by the device or environment variables passed to the pod or container.
- Pros:
  - Allows for observation of container to device bindings local to the node through the `/pods` endpoint
- Cons:
  - Only consumed locally, which doesn't justify an API change
  - Device Bindings are immutable after allocation, and are debatably observable (they can be "observed" from the local checkpoint file). Device bindings are generally a poor fit for status.
- Pros:
  - Allows for observability of device to container bindings through what exists in the checkpoint file
- Cons:
  - Requires adding additional metadata to the checkpoint file as required by the monitoring agent
  - Requires implementing versioning for the checkpoint file, and handling version skew between readers and the kubelet
  - Future modifications to the checkpoint file are more difficult.
- A new object `ComputeDevice` will be defined, and a new variable `ComputeDevices` will be added to the `Container` (Spec) object, which will represent a list of `ComputeDevice` objects.
```go
// ComputeDevice describes the devices assigned to this container for a given ResourceName
type ComputeDevice struct {
	// DeviceIDs is the list of devices assigned to this container
	DeviceIDs []string
	// ResourceName is the name of the compute resource
	ResourceName string
}

// Container represents a single container that is expected to be run on the host.
type Container struct {
	...
	// ComputeDevices contains the devices assigned to this container
	// This field is alpha-level and is only honored by servers that enable the ComputeDevices feature.
	// +optional
	ComputeDevices []ComputeDevice
	...
}
```
- During Kubelet pod admission, if `ComputeDevices` is found non-empty, the specified devices will be allocated; otherwise behaviour will remain the same as it is today.
- Before starting the pod, the kubelet writes the assigned `ComputeDevices` back to the pod spec.
  - Note: Writing to the API Server and waiting to observe the updated pod spec in the kubelet's pod watch may add significant latency to pod startup.
- Allows devices to potentially be assigned by a custom scheduler.
- Serves as a permanent record of device assignments for the kubelet, and eliminates the need for the kubelet to maintain this state locally.