podresources: add Watch endpoint #1926

Status: Closed. Wants to merge 5 commits.
50 changes: 48 additions & 2 deletions keps/sig-node/compute-device-assignment.md
@@ -60,12 +60,18 @@ In this document we will discuss the motivation and code changes required for in

## Changes

Add a v1alpha1 Kubelet GRPC service, at `/var/lib/kubelet/pod-resources/kubelet.sock`, which returns information about the kubelet's assignment of devices to containers. It obtains this information from the internal state of the kubelet's Device Manager.
The GRPC Service exposes two endpoints:
- `List`, which returns a single PodResourcesResponse, enabling monitoring applications to poll for the resources allocated to pods and containers on the node.
- `Watch`, which returns a stream of PodResourcesResponse, enabling monitoring applications to be notified of resource allocations, releases, and updates, using the `action` field in the response.

This is shown in proto below:
```protobuf
// PodResources is a service provided by the kubelet that provides information about the
// node resources consumed by pods and containers on the node
service PodResources {
  rpc List(ListPodResourcesRequest) returns (ListPodResourcesResponse) {}
  rpc Watch(WatchPodResourcesRequest) returns (stream WatchPodResourcesResponse) {}
}

// ListPodResourcesRequest is the request made to the PodResources service
@@ -76,11 +82,27 @@ message ListPodResourcesResponse {
  repeated PodResources pod_resources = 1;
}

// WatchPodResourcesRequest is the request made to the Watch function of the PodResources service
message WatchPodResourcesRequest {}

```

> **Review comment (Member):** when is each action emitted? can you clarify when modified would be used in the life of a pod?
>
> **@ffromani (Contributor, Author), Sep 30, 2020:** Actions should be emitted:
> - `ADDED`: when resources are assigned to the pod (I'm thinking about HintProvider's `Allocate()`)
> - `DELETED`: when resources are claimed back (I'm thinking about `UpdateAllocatedDevices()`)
>
> I'll document this better in the KEP text. In hindsight, we most likely don't need `MODIFIED`; I will just remove it.

```protobuf
enum WatchPodAction {
  ADDED = 0;
  DELETED = 1;
}

// WatchPodResourcesResponse is the response returned by Watch function
message WatchPodResourcesResponse {
WatchPodAction action = 1;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if just exposing the pod resourceVersion here is a good way forward

string uid = 2;
repeated PodResources pod_resources = 3;
}

// PodResources contains information about the node resources assigned to a pod
message PodResources {
  string name = 1;
  string namespace = 2;
  repeated ContainerResources containers = 3;
  int64 resource_version = 4;
}
```

> **Review comment (Member):** is this the pod resource version as stored in etcd or something else?
>
> **@ffromani (Contributor, Author):** yes, that's the intention, in order to enable client code to reconcile data from `Watch()` with the data they get from `List()`.

```protobuf

// ContainerResources contains information about the resources assigned to a container
@@ -96,11 +118,34 @@ message ContainerDevices {
}
```

### Consuming the Watch endpoint in client applications

Using the `Watch` endpoint, client applications can be notified of pod resource allocation changes as soon as they happen.
However, the state of a pod is not sent until its first resource allocation change after the stream is opened, which in the worst case is the pod's deletion.
Client applications that need the complete resource allocation picture must therefore consume both the `List` and `Watch` endpoints.

The `resourceVersion` found in the responses of both APIs allows client applications to identify the most recent information.
The `resourceVersion` value is updated following the same semantics as the pod `resourceVersion` value, and the implementation
may use the same value from the corresponding pods.
To keep the implementation as simple as possible, the kubelet does *not* store any historical list of changes.

In order to make sure not to miss any updates, a client application can:
1. Call the `Watch` endpoint to get a stream of changes.
2. Call the `List` endpoint to get the state of all the pods on the node.
3. Reconcile updates using the `resourceVersion`.

> **Review comment (Member):** what triggers watch events from getting emitted in kubelet code flows?
>
> **@ffromani (Contributor, Author):** After some experiments, I think `ADDED` should be triggered after successful allocation from the topology manager (https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/topologymanager/topology_manager.go#L232), while `DELETED` should be triggered once devices are claimed back (https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/container_manager_linux.go#L1034).
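The reconcile step can be sketched in Go. This is a minimal, illustrative sketch only: the `Event` and `PodEntry` types and the `Reconcile` helper are hypothetical stand-ins for the generated gRPC types, loosely following the proto fields above (`action`, `uid`, `resource_version`); the real generated Go stubs would differ.

```go
package main

import "fmt"

// WatchPodAction mirrors the proto enum; the Go names are illustrative.
type WatchPodAction int

const (
	Added WatchPodAction = iota
	Deleted
)

// PodEntry is a hypothetical, simplified view of a PodResources message.
type PodEntry struct {
	Name            string
	ResourceVersion int64
	Devices         []string
}

// Event is a hypothetical stand-in for WatchPodResourcesResponse.
type Event struct {
	Action          WatchPodAction
	UID             string
	ResourceVersion int64
	Pod             PodEntry
}

// Reconcile applies a Watch event on top of a snapshot obtained from List,
// using resourceVersion to discard events already reflected in the snapshot.
func Reconcile(state map[string]PodEntry, ev Event) {
	if cur, ok := state[ev.UID]; ok && ev.ResourceVersion <= cur.ResourceVersion {
		return // stale event: the List snapshot already reflects this change
	}
	switch ev.Action {
	case Added:
		state[ev.UID] = ev.Pod
	case Deleted:
		delete(state, ev.UID)
	}
}

func main() {
	// Snapshot from List: pod A holds device D1 at resourceVersion 5.
	state := map[string]PodEntry{
		"uid-a": {Name: "pod-a", ResourceVersion: 5, Devices: []string{"D1"}},
	}
	// A stale ADDED for pod A (already listed) is discarded.
	Reconcile(state, Event{Action: Added, UID: "uid-a", ResourceVersion: 4})
	// Pod A is deleted, then pod B is assigned D1.
	Reconcile(state, Event{Action: Deleted, UID: "uid-a", ResourceVersion: 6})
	Reconcile(state, Event{Action: Added, UID: "uid-b", ResourceVersion: 7,
		Pod: PodEntry{Name: "pod-b", ResourceVersion: 7, Devices: []string{"D1"}}})
	fmt.Println(len(state), state["uid-b"].Name) // 1 pod-b
}
```

The key design point is that `List` and `Watch` share one version sequence, so a single comparison decides whether an event is newer than the snapshot.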

In order to make resource accounting on the client side as safe and easy as possible, the `Watch` implementation
will guarantee that events are delivered in an order that always preserves the capacity invariants, so the accounted
values are consistent after each received event, not only at steady state.
Consider the following scenario with 10 devices, all allocated: pod A, with device D1 allocated, is deleted, and
pod B then starts and is assigned device D1. In this case `Watch` guarantees that the `DELETED` and `ADDED` events
are delivered in that order.
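The ordering guarantee matters for any client that keys its accounting by device ID, as in the following sketch. The `applyEvent` helper and the plain-string actions are illustrative only and not part of the proposed API:

```go
package main

import "fmt"

// applyEvent updates a device-to-pod ownership map for ADDED/DELETED events.
// Real clients would consume WatchPodResourcesResponse messages instead of
// these plain strings.
func applyEvent(owners map[string]string, action, pod string, devices []string) {
	for _, d := range devices {
		switch action {
		case "ADDED":
			owners[d] = pod
		case "DELETED":
			// Only release the device if this pod still owns it; a defensive
			// check in case a DELETED for a previous owner arrives late.
			if owners[d] == pod {
				delete(owners, d)
			}
		}
	}
}

func main() {
	owners := map[string]string{"D1": "pod-a"}

	// Guaranteed order: pod A releases D1 before pod B acquires it, so the
	// count of allocated devices never exceeds capacity between events.
	applyEvent(owners, "DELETED", "pod-a", []string{"D1"})
	applyEvent(owners, "ADDED", "pod-b", []string{"D1"})
	fmt.Println(owners["D1"]) // pod-b
}
```

If `ADDED` for pod B could arrive before `DELETED` for pod A, the client would transiently see D1 double-booked, which is exactly what the delivery-order guarantee rules out.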

### Potential Future Improvements

* Add `ListAndWatch()` function to the GRPC endpoint so monitoring agents don't need to poll.
* Add identifiers for other resources used by pods to the `PodResources` message.
* For example, persistent volume location on disk
* Implement a historical list of changes, allowing client applications to call the `List` and `Watch` endpoints in a more natural order.

## Alternatives Considered

@@ -164,6 +209,7 @@ Beta:

## Implementation History

- 2020-10-01: KEP extended with Watch API
- 2018-09-11: Final version of KEP (proposing pod-resources endpoint) published and presented to sig-node. [Slides](https://docs.google.com/presentation/u/1/d/1xz-iHs8Ec6PqtZGzsmG1e68aLGCX576j_WRptd2114g/edit?usp=sharing)
- 2018-10-30: Demo with example gpu monitoring daemonset
- 2018-11-10: KEP lgtm'd and approved