From 0c6948a180ef7a37208fc5218c0f6892bb53ff94 Mon Sep 17 00:00:00 2001 From: Francesco Romani Date: Thu, 6 Aug 2020 13:19:57 +0200 Subject: [PATCH 1/5] podresources: add ListAndWatch function Extend the protocol with a simple implementation of ListAndWatch to enable monitoring agents to be notified of resource allocation changes. Signed-off-by: Francesco Romani --- keps/sig-node/compute-device-assignment.md | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/keps/sig-node/compute-device-assignment.md b/keps/sig-node/compute-device-assignment.md index 36228a6dff2..60025640568 100644 --- a/keps/sig-node/compute-device-assignment.md +++ b/keps/sig-node/compute-device-assignment.md @@ -60,18 +60,24 @@ In this document we will discuss the motivation and code changes required for in ## Changes -Add a v1alpha1 Kubelet GRPC service, at `/var/lib/kubelet/pod-resources/kubelet.sock`, which returns information about the kubelet's assignment of devices to containers. It obtains this information from the internal state of the kubelet's Device Manager. The GRPC Service returns a single PodResourcesResponse, which is shown in proto below: +Add a v1alpha1 Kubelet GRPC service, at `/var/lib/kubelet/pod-resources/kubelet.sock`, which returns information about the kubelet's assignment of devices to containers. It obtains this information from the internal state of the kubelet's Device Manager. +The GRPC Service can return: +- a single PodResourcesResponse, enabling monitor applications to poll for resources allocated to pods and containers on the node. +- a stream of PodResourcesResponse, enabling monitor applications to be notified of new resource allocation, release or resource allocation updates. 
+ +This is shown in proto below: ```protobuf // PodResources is a service provided by the kubelet that provides information about the // node resources consumed by pods and containers on the node service PodResources { rpc List(ListPodResourcesRequest) returns (ListPodResourcesResponse) {} + rpc ListAndWatch(ListPodResourcesRequest) returns (stream ListPodResourcesResponse) {} } // ListPodResourcesRequest is the request made to the PodResources service message ListPodResourcesRequest {} -// ListPodResourcesResponse is the response returned by List function +// ListPodResourcesResponse is the response returned by List and ListAndWatch functions message ListPodResourcesResponse { repeated PodResources pod_resources = 1; } @@ -98,7 +104,6 @@ message ContainerDevices { ### Potential Future Improvements -* Add `ListAndWatch()` function to the GRPC endpoint so monitoring agents don't need to poll. * Add identifiers for other resources used by pods to the `PodResources` message. * For example, persistent volume location on disk @@ -164,6 +169,7 @@ Beta: ## Implementation History +- 2020-08-XX: KEP extended with ListAndWatch function - 2018-09-11: Final version of KEP (proposing pod-resources endpoint) published and presented to sig-node. [Slides](https://docs.google.com/presentation/u/1/d/1xz-iHs8Ec6PqtZGzsmG1e68aLGCX576j_WRptd2114g/edit?usp=sharing) - 2018-10-30: Demo with example gpu monitoring daemonset - 2018-11-10: KEP lgtm'd and approved From 86596df22b4f154f9df1617f19875a6cbefd21eb Mon Sep 17 00:00:00 2001 From: Francesco Romani Date: Tue, 11 Aug 2020 10:22:34 +0200 Subject: [PATCH 2/5] podresources: watch: split endpoint, add action Address reviewers comment: 1. Add explicit Watch endpoint so APIs are composable (not bundled in ListAndWatch) 2. 
Add explicit action field in the Watch() endpoint response Signed-off-by: Francesco Romani --- keps/sig-node/compute-device-assignment.md | 25 +++++++++++++++++----- 1 file changed, 20 insertions(+), 5 deletions(-) diff --git a/keps/sig-node/compute-device-assignment.md b/keps/sig-node/compute-device-assignment.md index 60025640568..8c9e3696727 100644 --- a/keps/sig-node/compute-device-assignment.md +++ b/keps/sig-node/compute-device-assignment.md @@ -61,9 +61,9 @@ In this document we will discuss the motivation and code changes required for in ## Changes Add a v1alpha1 Kubelet GRPC service, at `/var/lib/kubelet/pod-resources/kubelet.sock`, which returns information about the kubelet's assignment of devices to containers. It obtains this information from the internal state of the kubelet's Device Manager. -The GRPC Service can return: -- a single PodResourcesResponse, enabling monitor applications to poll for resources allocated to pods and containers on the node. -- a stream of PodResourcesResponse, enabling monitor applications to be notified of new resource allocation, release or resource allocation updates. +The GRPC Service exposes two endpoints: +- `List`, which returns a single PodResourcesResponse, enabling monitor applications to poll for resources allocated to pods and containers on the node. +- `Watch`, which returns a stream of PodResourcesResponse, enabling monitor applications to be notified of new resource allocation, release or resource allocation updates, using the `action` field in the response. 
This is shown in proto below: ```protobuf @@ -71,17 +71,32 @@ This is shown in proto below: // node resources consumed by pods and containers on the node service PodResources { rpc List(ListPodResourcesRequest) returns (ListPodResourcesResponse) {} - rpc ListAndWatch(ListPodResourcesRequest) returns (stream ListPodResourcesResponse) {} + rpc Watch(WatchPodResourcesRequest) returns (stream WatchPodResourcesResponse) {} } // ListPodResourcesRequest is the request made to the PodResources service message ListPodResourcesRequest {} -// ListPodResourcesResponse is the response returned by List and ListAndWatch functions +// ListPodResourcesResponse is the response returned by List function message ListPodResourcesResponse { repeated PodResources pod_resources = 1; } +// WatchPodResourcesRequest is the request made to the Watch PodResourcesLister service +message WatchPodResourcesRequest {} + +enum WatchPodAction { + UPDATED = 0; + DELETED = 1; + ADDED = 2; +} + +// WatchPodResourcesResponse is the response returned by Watch function +message WatchPodResourcesResponse { + WatchPodAction action = 1; + repeated PodResources pod_resources = 2; +} + // PodResources contains information about the node resources assigned to a pod message PodResources { string name = 1; From c4b38e3c792bdf32b6e8a394d63d33949e5795f9 Mon Sep 17 00:00:00 2001 From: Francesco Romani Date: Tue, 11 Aug 2020 16:46:22 +0200 Subject: [PATCH 3/5] misc minor fixes: * Missed reference to "ListAndWatch", now replaced by "Watch" * renamed UPDATED->MODIFIED To be more compliant with kube naming standards (https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes) Signed-off-by: Francesco Romani --- keps/sig-node/compute-device-assignment.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/keps/sig-node/compute-device-assignment.md b/keps/sig-node/compute-device-assignment.md index 8c9e3696727..b4b924beed4 100644 --- 
a/keps/sig-node/compute-device-assignment.md +++ b/keps/sig-node/compute-device-assignment.md @@ -86,9 +86,9 @@ message ListPodResourcesResponse { message WatchPodResourcesRequest {} enum WatchPodAction { - UPDATED = 0; - DELETED = 1; - ADDED = 2; + ADDED = 0; + MODIFIED = 1; + DELETED = 2; } // WatchPodResourcesResponse is the response returned by Watch function @@ -184,7 +184,7 @@ Beta: ## Implementation History -- 2020-08-XX: KEP extended with ListAndWatch function +- 2020-08-XX: KEP extended with Watch function - 2018-09-11: Final version of KEP (proposing pod-resources endpoint) published and presented to sig-node. [Slides](https://docs.google.com/presentation/u/1/d/1xz-iHs8Ec6PqtZGzsmG1e68aLGCX576j_WRptd2114g/edit?usp=sharing) - 2018-10-30: Demo with example gpu monitoring daemonset - 2018-11-10: KEP lgtm'd and approved From 45dd5d9cbdd20572b5e6dd24af67710324bb1ede Mon Sep 17 00:00:00 2001 From: Francesco Romani Date: Thu, 20 Aug 2020 14:30:15 +0200 Subject: [PATCH 4/5] podresources: watch: expose resourceVersion Add a new field to the API response objects to allow client applications to consume both the `List` and `Watch` endpoints. The goal is to let client applications use both APIs without losing any updates. The straightforward option is to follow the generic k8s approach (see link below) and let the kubelet keep a historical window of the most recent changes, so client applications can issue `List` and, shortly after, `Watch`, starting from the resourceVersion returned by `List`. The underlying assumption is that `Watch` happens "shortly" after `List`; otherwise the system cannot guarantee the lack of gaps. However, implementing this support requires keeping the aforementioned sliding window of changes, which in turn requires careful implementation to address scalability and safety guarantees.
However, the `podresources` API is a specific API, so, while it is good to follow the generic API concepts as much as possible, it also allows some small differences which can help keep the implementation simple and safe. This patch proposes the simplest possible approach to reconcile the `List` and `Watch` responses, providing the `resource_version` field and suggesting a small change in the programming model of client applications. Inspired by the concepts found on https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes Signed-off-by: Francesco Romani --- keps/sig-node/compute-device-assignment.md | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/keps/sig-node/compute-device-assignment.md b/keps/sig-node/compute-device-assignment.md index b4b924beed4..5d0aa7baeea 100644 --- a/keps/sig-node/compute-device-assignment.md +++ b/keps/sig-node/compute-device-assignment.md @@ -102,6 +102,7 @@ message PodResources { string name = 1; string namespace = 2; repeated ContainerResources containers = 3; + int64 resource_version = 4; } // ContainerResources contains information about the resources assigned to a container @@ -117,10 +118,24 @@ message ContainerDevices { } ``` +### Consuming the Watch endpoint in client applications + +Using the `Watch` endpoint, client applications can be notified of the pod resource allocation changes as soon as possible. +However, the state of a pod will not be sent up until the first resource allocation change, which is the pod deletion in the worst case. +Client applications who need to have the complete resource allocation picture thus need to consume both `List` and `Watch` endpoints. +The `resourceVersion` found in the responses of both APIs allows client applications to identify the most recent information. +To keep the implementation simple as possible, the kubelet does *not* store any historical list of changes.
+ +In order to make sure not to miss any updates, client application can: +1. call the `Watch` endpoint to get a stream of changes. +2. call the `List` endpoint to get the state of all the pods in the node. +3. reconcile updates using the `resourceVersion`. + ### Potential Future Improvements * Add identifiers for other resources used by pods to the `PodResources` message. * For example, persistent volume location on disk +* Implement historical list of changes, allowing client applications to call `List` and `Watch` endpoints in a more natural order. ## Alternatives Considered From 1346c2ac10576ec24bd941feb8797745a8d33ff3 Mon Sep 17 00:00:00 2001 From: Francesco Romani Date: Thu, 1 Oct 2020 13:19:39 +0200 Subject: [PATCH 5/5] podresources: watch: address review comments Signed-off-by: Francesco Romani --- keps/sig-node/compute-device-assignment.md | 18 ++++++++++++++---- 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/keps/sig-node/compute-device-assignment.md b/keps/sig-node/compute-device-assignment.md index 5d0aa7baeea..cfb01d78cb9 100644 --- a/keps/sig-node/compute-device-assignment.md +++ b/keps/sig-node/compute-device-assignment.md @@ -87,14 +87,14 @@ message WatchPodResourcesRequest {} enum WatchPodAction { ADDED = 0; - MODIFIED = 1; - DELETED = 2; + DELETED = 1; } // WatchPodResourcesResponse is the response returned by Watch function message WatchPodResourcesResponse { WatchPodAction action = 1; - repeated PodResources pod_resources = 2; + string uid = 2; + repeated PodResources pod_resources = 3; } // PodResources contains information about the node resources assigned to a pod @@ -123,7 +123,10 @@ message ContainerDevices { Using the `Watch` endpoint, client applications can be notified of the pod resource allocation changes as soon as possible. However, the state of a pod will not be sent up until the first resource allocation change, which is the pod deletion in the worst case. 
Client applications who need to have the complete resource allocation picture thus need to consume both `List` and `Watch` endpoints. + The `resourceVersion` found in the responses of both APIs allows client applications to identify the most recent information. +The `resourceVersion` value is updated following the same semantics as the pod `resourceVersion` value, and the implementation +may use the same value from the corresponding pods. To keep the implementation simple as possible, the kubelet does *not* store any historical list of changes. In order to make sure not to miss any updates, client application can: @@ -131,6 +134,13 @@ In order to make sure not to miss any updates, client application can: 2. call the `List` endpoint to get the state of all the pods in the node. 3. reconcile updates using the `resourceVersion`. +To make resource accounting on the client side as safe and easy as possible, the `Watch` implementation +will guarantee that events are delivered in an order that always preserves the capacity invariants, so the client's accounting +is consistent after each received event, not only at steady state. +Consider the following scenario with 10 devices, all allocated: pod A, which holds device D1, gets deleted, then +pod B starts and gets device D1. In this case `Watch` will guarantee that the `DELETED` event for pod A is delivered +before the `ADDED` event for pod B. + ### Potential Future Improvements * Add identifiers for other resources used by pods to the `PodResources` message. @@ -199,7 +209,7 @@ Beta: ## Implementation History -- 2020-08-XX: KEP extended with Watch function +- 2020-10-01: KEP extended with Watch API - 2018-09-11: Final version of KEP (proposing pod-resources endpoint) published and presented to sig-node. [Slides](https://docs.google.com/presentation/u/1/d/1xz-iHs8Ec6PqtZGzsmG1e68aLGCX576j_WRptd2114g/edit?usp=sharing) - 2018-10-30: Demo with example gpu monitoring daemonset - 2018-11-10: KEP lgtm'd and approved
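
The client-side flow proposed in PATCH 4/5 (call `Watch`, then `List`, then reconcile using the resource version) and the ordering guarantee added in PATCH 5/5 can be illustrated with a minimal Python sketch. It is not part of the KEP and does not use real gRPC stubs: the `PodResources` and `WatchEvent` dataclasses are hypothetical stand-ins for the generated messages, devices are flattened into a tuple of IDs, pods are keyed by `(namespace, name)`, and `DELETED` events are assumed to carry a `resource_version`. All of these are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set, Tuple

# Mirrors the WatchPodAction enum as of PATCH 5/5.
ADDED, DELETED = 0, 1


@dataclass(frozen=True)
class PodResources:
    """Hypothetical stand-in for the PodResources proto message."""
    name: str
    namespace: str
    devices: Tuple[str, ...]  # assumption: device IDs flattened across containers
    resource_version: int


@dataclass(frozen=True)
class WatchEvent:
    """Hypothetical stand-in for WatchPodResourcesResponse."""
    action: int
    uid: str
    pod_resources: Tuple[PodResources, ...]


PodKey = Tuple[str, str]


def devices_in_use(state: Dict[PodKey, PodResources]) -> Set[str]:
    """Return the allocated device IDs, failing if one is held by two pods."""
    used = [dev for pod in state.values() for dev in pod.devices]
    if len(used) != len(set(used)):
        raise ValueError("capacity invariant violated: device held by two pods")
    return set(used)


def reconcile(snapshot: Iterable[PodResources],
              events: Iterable[WatchEvent]) -> Dict[PodKey, PodResources]:
    """Apply buffered Watch events on top of a List snapshot.

    Events older than what the snapshot already knows are skipped, and the
    capacity invariant is checked after every event, not only at the end.
    """
    state = {(p.namespace, p.name): p for p in snapshot}
    for ev in events:
        for pod in ev.pod_resources:
            key = (pod.namespace, pod.name)
            known = state.get(key)
            if known is not None and known.resource_version > pod.resource_version:
                continue  # the snapshot is newer than this event
            if ev.action == ADDED:
                state[key] = pod
            elif ev.action == DELETED:
                state.pop(key, None)
        devices_in_use(state)
    return state


# Scenario from the KEP text: pod A releases D1, then pod B gets D1.
pod_a = PodResources("pod-a", "default", ("D1",), resource_version=3)
pod_a_gone = PodResources("pod-a", "default", ("D1",), resource_version=4)
pod_b = PodResources("pod-b", "default", ("D1",), resource_version=5)
deleted_a = WatchEvent(DELETED, "uid-a", (pod_a_gone,))
added_b = WatchEvent(ADDED, "uid-b", (pod_b,))

final_state = reconcile([pod_a], [deleted_a, added_b])
print(sorted(final_state))
```

In this sketch, delivering `added_b` before `deleted_a` would trip the invariant check, which is the situation the ordering guarantee is meant to rule out. A real client would buffer the events received from the `Watch` stream while the `List` call is in flight and then pass both to a function like `reconcile`.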